Trying to capture how long the journal client stays blocked waiting for
Kafka messages, how long it takes to actually retrieve the messages, and
how long it takes to handle them.
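Roughly, the three durations map onto a consume loop like this. A minimal sketch, assuming swh.core.statsd's datadog-style client and a confluent-kafka consumer; the metric names and the caller-supplied `deserialize` helper are made up for illustration, not the patch's actual code:

```python
import time

from swh.core.statsd import statsd  # assumed: swh's datadog-style statsd client


def process_once(consumer, worker_fn, deserialize):
    """Toy version of one iteration of the consume loop, instrumented
    as the commit message describes (metric names are hypothetical)."""
    t0 = time.monotonic()
    messages = consumer.consume(num_messages=20, timeout=1.0)  # blocked on kafka
    t1 = time.monotonic()
    batch = [deserialize(m.value()) for m in messages]         # retrieving/decoding
    t2 = time.monotonic()
    worker_fn(batch)                                           # handling
    t3 = time.monotonic()

    # statsd timings are expressed in milliseconds
    statsd.timing("journal_client_wait_ms", 1000 * (t1 - t0))
    statsd.timing("journal_client_fetch_ms", 1000 * (t2 - t1))
    statsd.timing("journal_client_handle_ms", 1000 * (t3 - t2))
```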
Details
Reviewers: None
Group Reviewers: None

Diff Detail
Repository: rDJNL Journal infrastructure
Branch: master
Lint: Lint Skipped
Unit: Unit Tests Skipped
Build Status: Buildable 25800 · Build 40318: Phabricator diff pipeline on jenkins (Jenkins console) · Build 40317: arc lint + arc unit

Event Timeline
Build has FAILED
Patch application report for D6875 (id=24926)
Rebasing onto 4e5e00988a...
Current branch diff-target is up to date.
Changes applied before test
commit 9d78e7857866872ec260af37ea9a4350644facc7
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Nov 19 17:16:06 2021 +0100

    Add statsd metrics in JournalClient.process

    trying to capture how long the journal client stays blocked waiting for
    kafka message, how long it takes to actually retrieve the messages, and
    how long it takes to handle them.
Link to build: https://jenkins.softwareheritage.org/job/DJNL/job/tests-on-diff/186/
See console output for more information: https://jenkins.softwareheritage.org/job/DJNL/job/tests-on-diff/186/console
Build is green
Patch application report for D6875 (id=24929)
Rebasing onto 4e5e00988a...
Current branch diff-target is up to date.
Changes applied before test
commit de55fb56ab1441a0074d66b43443bce692fa755b
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Nov 19 17:16:06 2021 +0100

    Add statsd metrics in JournalClient.process

    trying to capture how long the journal client stays blocked waiting for
    kafka message, how long it takes to actually retrieve the messages, and
    how long it takes to handle them.
See https://jenkins.softwareheritage.org/job/DJNL/job/tests-on-diff/187/ for more details.
statsd.timing/statsd.timed produce full histograms. Do we really want to keep bucketed counts for all of these values, or just a running total?
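For context, the two options being weighed look like this with a datadog-style statsd client; the metric names are hypothetical:

```python
from swh.core.statsd import statsd  # assumed datadog-style client

handle_ms = 12.3  # example duration

# What statsd.timing/statsd.timed do: the statsd server keeps a full
# distribution (bucketed/percentile data) per flush interval.
statsd.timing("journal_client_handle_ms", handle_ms)

# Running total only: a plain counter incremented by the duration,
# much cheaper to store and query.
statsd.increment("journal_client_handle_ms_total", handle_ms)
```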
I have a hard time following the timing logic with all the subtractions. I think it would be easier to follow by:
- keeping explicit timestamps using bare calls to time.monotonic() at specific events
- calling statsd.timing(METRIC_FOR_OPERATION_BETWEEN_A_AND_B, ts_pointb - ts_pointa) explicitly for each duration we want to track, as sketched below
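A minimal sketch of that shape, with made-up event and metric names (assuming the same datadog-style statsd client):

```python
import time

from swh.core.statsd import statsd  # assumed datadog-style client


def consume_one_batch(consumer, worker_fn):
    # one bare timestamp per event of interest
    ts_poll_start = time.monotonic()
    messages = consumer.consume(num_messages=20, timeout=1.0)
    ts_poll_end = time.monotonic()
    worker_fn(messages)
    ts_handle_end = time.monotonic()

    # one explicit timing call per duration we want to track (milliseconds)
    statsd.timing("journal_client_poll_ms", 1000 * (ts_poll_end - ts_poll_start))
    statsd.timing("journal_client_handle_ms", 1000 * (ts_handle_end - ts_poll_end))
```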
swh/journal/client.py
Line 35: Should be plural and end with _total rather than "number" (ref. https://prometheus.io/docs/practices/naming/)
Line 304: Shouldn't this just be time.monotonic()?
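On the naming point, a hypothetical before/after per the linked Prometheus guideline (neither name is the one actually in the diff):

```python
from swh.core.statsd import statsd  # assumed datadog-style client

# Before: singular, using "number" (hypothetical rendering of the flagged name)
statsd.increment("journal_client_number_of_object_processed")

# After: plural, _total suffix, per https://prometheus.io/docs/practices/naming/
statsd.increment("journal_client_processed_objects_total")
```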
Good question... I need to think a bit about this.
> I have a hard time following the timing logic with all the subtractions. I think it would be easier to follow by:
> - keeping explicit timestamps using bare calls to time.monotonic() at specific events
> - calling statsd.timing(METRIC_FOR_OPERATION_BETWEEN_A_AND_B, ts_pointb - ts_pointa) explicitly for each duration we want to track
I agree
swh/journal/client.py
Line 286: I don't understand how this one works. On the first loop run, last_object_time is 0 and elapsed is the time it took to get from line 250 to line 258, which is essentially this code:

```python
start_time = time.monotonic()
last_object_time = 0
total_objects_processed = 0
# while True:
# timeout for message poll
timeout = 1.0
elapsed = time.monotonic() - start_time
```

On other loop runs, elapsed - last_object_time is the time it takes from line 304 to line 258, which would be this code:

```python
last_object_time = time.monotonic() - start_time
# while True:
# timeout for message poll
timeout = 1.0
elapsed = time.monotonic() - start_time
```
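A runnable toy restating that observation: with the subtraction scheme, elapsed - last_object_time spans from the end of handling in one iteration (line 304) to the top of the next (line 258), i.e. the inter-batch gap rather than the handling time:

```python
import time

start_time = time.monotonic()
last_object_time = 0.0

for _ in range(3):
    # corresponds to line 258, top of the loop
    elapsed = time.monotonic() - start_time

    # on the first run this is the (tiny) setup time since start_time;
    # afterwards it is the gap since the previous iteration's line 304
    gap = elapsed - last_object_time
    print(f"inter-batch gap: {gap:.4f}s")

    time.sleep(0.01)  # stand-in for poll + handle work

    # corresponds to line 304, after handling
    last_object_time = time.monotonic() - start_time
```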
swh/journal/client.py
Line 286: I made a better version of this part, but in the end it won't be merged; these are not really the metrics we want, and the produced histograms are pretty much useless.