Trying to capture how long the journal client stays blocked waiting for
Kafka messages, how long it takes to actually retrieve the messages, and
how long it takes to handle them.
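Roughly, the three durations map onto a consume loop like this. A minimal sketch, assuming swh.core.statsd's datadog-style client and a confluent-kafka consumer; the metric names and the caller-supplied `deserialize` helper are made up for illustration, not the patch's actual code:

```python
import time

from swh.core.statsd import statsd  # assumed: swh's datadog-style statsd client


def process_once(consumer, worker_fn, deserialize):
    """Toy version of one iteration of the consume loop, instrumented
    as the commit message describes (metric names are hypothetical)."""
    t0 = time.monotonic()
    messages = consumer.consume(num_messages=20, timeout=1.0)  # blocked on kafka
    t1 = time.monotonic()
    batch = [deserialize(m.value()) for m in messages]         # retrieving/decoding
    t2 = time.monotonic()
    worker_fn(batch)                                           # handling
    t3 = time.monotonic()

    # statsd timings are expressed in milliseconds
    statsd.timing("journal_client_wait_ms", 1000 * (t1 - t0))
    statsd.timing("journal_client_fetch_ms", 1000 * (t2 - t1))
    statsd.timing("journal_client_handle_ms", 1000 * (t3 - t2))
```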
Details
Reviewers: None
Group Reviewers: None

Diff Detail
Repository: rDJNL Journal infrastructure
Branch: master
Lint: Lint Skipped
Unit: Unit Tests Skipped
Build Status: Buildable 25800 · Build 40318: Phabricator diff pipeline on jenkins (Jenkins console) · Build 40317: arc lint + arc unit

Event Timeline
Build has FAILED
Patch application report for D6875 (id=24926)
Rebasing onto 4e5e00988a...
Current branch diff-target is up to date.
Changes applied before test
commit 9d78e7857866872ec260af37ea9a4350644facc7
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Nov 19 17:16:06 2021 +0100

    Add statsd metrics in JournalClient.process

    trying to capture how long the journal client stays blocked waiting for
    kafka message, how long it takes to actually retrieve the messages, and
    how long it takes to handle them.
Link to build: https://jenkins.softwareheritage.org/job/DJNL/job/tests-on-diff/186/
See console output for more information: https://jenkins.softwareheritage.org/job/DJNL/job/tests-on-diff/186/console
Build is green
Patch application report for D6875 (id=24929)
Rebasing onto 4e5e00988a...
Current branch diff-target is up to date.
Changes applied before test
commit de55fb56ab1441a0074d66b43443bce692fa755b
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Nov 19 17:16:06 2021 +0100

    Add statsd metrics in JournalClient.process

    trying to capture how long the journal client stays blocked waiting for
    kafka message, how long it takes to actually retrieve the messages, and
    how long it takes to handle them.
See https://jenkins.softwareheritage.org/job/DJNL/job/tests-on-diff/187/ for more details.
statsd.timing/statsd.timed produce full histograms. Do we really want to keep bucketed counts for all of these values, or just a running total?
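For context, the two options being weighed look like this with a datadog-style statsd client; the metric names are hypothetical:

```python
from swh.core.statsd import statsd  # assumed datadog-style client

handle_ms = 12.3  # example duration

# What statsd.timing/statsd.timed do: the statsd server keeps a full
# distribution (bucketed/percentile data) per flush interval.
statsd.timing("journal_client_handle_ms", handle_ms)

# Running total only: a plain counter incremented by the duration,
# much cheaper to store and query.
statsd.increment("journal_client_handle_ms_total", handle_ms)
```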
I have a hard time following the timing logic with all the subtractions. I think it would be easier to follow by:
- keeping explicit timestamps using bare calls to time.monotonic() at specific events
- calling statsd.timing(METRIC_FOR_OPERATION_BETWEEN_A_AND_B, ts_pointb - ts_pointa) explicitly for each duration we want to track, as sketched below
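A minimal sketch of that shape, with made-up event and metric names (assuming the same datadog-style statsd client):

```python
import time

from swh.core.statsd import statsd  # assumed datadog-style client


def consume_one_batch(consumer, worker_fn):
    # one bare timestamp per event of interest
    ts_poll_start = time.monotonic()
    messages = consumer.consume(num_messages=20, timeout=1.0)
    ts_poll_end = time.monotonic()
    worker_fn(messages)
    ts_handle_end = time.monotonic()

    # one explicit timing call per duration we want to track (milliseconds)
    statsd.timing("journal_client_poll_ms", 1000 * (ts_poll_end - ts_poll_start))
    statsd.timing("journal_client_handle_ms", 1000 * (ts_handle_end - ts_poll_end))
```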
swh/journal/client.py
Line 35: Should be plural and end with _total rather than "number" (ref. https://prometheus.io/docs/practices/naming/)
Line 304: Shouldn't this just be time.monotonic()?
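On the naming point, a hypothetical before/after per the linked Prometheus guideline (neither name is the one actually in the diff):

```python
from swh.core.statsd import statsd  # assumed datadog-style client

# Before: singular, using "number" (hypothetical rendering of the flagged name)
statsd.increment("journal_client_number_of_object_processed")

# After: plural, _total suffix, per https://prometheus.io/docs/practices/naming/
statsd.increment("journal_client_processed_objects_total")
```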
Good question... I need to think a bit about this.
> I have a hard time following the timing logic with all the subtractions. I think it would be easier to follow by:
> - keeping explicit timestamps using bare calls to time.monotonic() at specific events
> - calling statsd.timing(METRIC_FOR_OPERATION_BETWEEN_A_AND_B, ts_pointb - ts_pointa) explicitly for each duration we want to track
I agree
swh/journal/client.py
Line 286: I don't understand how this one works. On the first loop run, last_object_time is 0 and elapsed is the time it took to get from line 250 to line 258, which is essentially this code:

```python
start_time = time.monotonic()
last_object_time = 0
total_objects_processed = 0
# while True:
# timeout for message poll
timeout = 1.0
elapsed = time.monotonic() - start_time
```

On other loop runs, elapsed - last_object_time is the time it takes from line 304 to line 258, which would be this code:

```python
last_object_time = time.monotonic() - start_time
# while True:
# timeout for message poll
timeout = 1.0
elapsed = time.monotonic() - start_time
```
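A runnable toy restating that observation: with the subtraction scheme, elapsed - last_object_time spans from the end of handling in one iteration (line 304) to the top of the next (line 258), i.e. the inter-batch gap rather than the handling time:

```python
import time

start_time = time.monotonic()
last_object_time = 0.0

for _ in range(3):
    # corresponds to line 258, top of the loop
    elapsed = time.monotonic() - start_time

    # on the first run this is the (tiny) setup time since start_time;
    # afterwards it is the gap since the previous iteration's line 304
    gap = elapsed - last_object_time
    print(f"inter-batch gap: {gap:.4f}s")

    time.sleep(0.01)  # stand-in for poll + handle work

    # corresponds to line 304, after handling
    last_object_time = time.monotonic() - start_time
```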
swh/journal/client.py
Line 286: I made a better version of this part, but in the end it won't be merged; these are not really the metrics we want, and the produced histograms are pretty much useless.