Page MenuHomeSoftware Heritage

Add statsd metrics in JournalClient.process
AbandonedPublic

Authored by douardda on Jan 4 2022, 2:43 PM.

Details

Reviewers
None
Group Reviewers
Reviewers
Summary

trying to capture how long the journal client stays blocked waiting for
kafka message, how long it takes to actually retrieve the messages, and
how long it takes to handle them.

Event Timeline

Build has FAILED

Patch application report for D6875 (id=24926)

Rebasing onto 4e5e00988a...

Current branch diff-target is up to date.
Changes applied before test
commit 9d78e7857866872ec260af37ea9a4350644facc7
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Nov 19 17:16:06 2021 +0100

    Add statsd metrics in JournalClient.process
    
    trying to capture how long the journal client stays blocked waiting for
    kafka message, how long it takes to actually retrieve the messages, and
    how long it takes to handle them.

Link to build: https://jenkins.softwareheritage.org/job/DJNL/job/tests-on-diff/186/
See console output for more information: https://jenkins.softwareheritage.org/job/DJNL/job/tests-on-diff/186/console

Harbormaster returned this revision to the author for changes because remote builds failed.Jan 4 2022, 2:44 PM
Harbormaster failed remote builds in B25797: Diff 24926!

Build is green

Patch application report for D6875 (id=24929)

Rebasing onto 4e5e00988a...

Current branch diff-target is up to date.
Changes applied before test
commit de55fb56ab1441a0074d66b43443bce692fa755b
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Nov 19 17:16:06 2021 +0100

    Add statsd metrics in JournalClient.process
    
    trying to capture how long the journal client stays blocked waiting for
    kafka message, how long it takes to actually retrieve the messages, and
    how long it takes to handle them.

See https://jenkins.softwareheritage.org/job/DJNL/job/tests-on-diff/187/ for more details.

statsd.timing/statsd.timed do full histograms. Do we really want to keep bucketed counts for all of these values, or just a running total?

I have a hard time following the timing logic with all the subtractions. I think it would be easier to follow by:

  • keeping explicit timestamps using bare calls to time.monotonic() at specific events
  • calling statsd.timing(METRIC_FOR_OPERATION_BETWEEN_A_AND_B, ts_pointb - ts_pointa) explicitly for each duration we want to track
swh/journal/client.py
35

Should be plural and end with _total rather than "number" (ref. https://prometheus.io/docs/practices/naming/)

304

shouldn't this just be time.monotonic()?

swh/journal/client.py
35

ah yes, i wanted to go & check this but hen I forgot!

304

would probably make things a bit easier to follow yes... will try

In D6875#178755, @olasd wrote:

statsd.timing/statsd.timed do full histograms. Do we really want to keep bucketed counts for all of these values, or just a running total?

good question... need to think a bit about this.

I have a hard time following the timing logic with all the subtractions. I think it would be easier to follow by:

  • keeping explicit timestamps using bare calls to time.monotonic() at specific events
  • calling statsd.timing(METRIC_FOR_OPERATION_BETWEEN_A_AND_B, ts_pointb - ts_pointa) explicitly for each duration we want to track

I agree

vlorentz added inline comments.
swh/journal/client.py
286

I don't understand how this one works.
If I'm not mistaken, this only times negligible pure-Python computation:

On the first loop run, last_object_time is 0 and elapsed is the time it took to get from line 250 to line 258, which is essentially this code:

start_time = time.monotonic()
last_object_time = 0
total_objects_processed = 0

# while True:
    # timeout for message poll
    timeout = 1.0

    elapsed = time.monotonic() - start_time

On other loop runs, elapsed - last_object_time is the time it takes from line 304 to line 258, which would be this code:

    last_object_time = time.monotonic() - start_time

# while True:
    # timeout for message poll
    timeout = 1.0

    elapsed = time.monotonic() - start_time
douardda added inline comments.
swh/journal/client.py
286

I made a better version of this part, but in the end, it won't be merged, this is not really the metrics we want and the produced histogram are pretty much useless.