Page MenuHomeSoftware Heritage

Add metric to monitor "save code now" efficiency
ClosedPublic

Authored by ardumont on Apr 8 2021, 4:33 PM.

Details

Summary

This only deals with the finalized save code now statuses (succeeded, failed) as it's
the only information referenced in the model.

Related to T1481

direct metrics:

which allows to have a rate tendency graph:

avg over time:

Test Plan

docker, save code now and checking metrics out of the endpoint and prometheus graph as in the screenshots ^

Diff Detail

Repository
rDWAPPS Web applications
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

ardumont edited the test plan for this revision. (Show Details)

Build has FAILED

Patch application report for D5463 (id=19512)

Rebasing onto cf69ace028...

Current branch diff-target is up to date.
Changes applied before test
commit 2436835d247243de4d482485e9ff0f85c01eeca7
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Thu Apr 8 16:23:10 2021 +0200

    Add metric to monitor "save code now" efficiency
    
    This only deals with the finalized save code now statuses (succeeded, failed) as it's
    the only information referenced in the model.
    
    Related to T1481

Link to build: https://jenkins.softwareheritage.org/job/DWAPPS/job/tests-on-diff/663/
See console output for more information: https://jenkins.softwareheritage.org/job/DWAPPS/job/tests-on-diff/663/console

Harbormaster returned this revision to the author for changes because remote builds failed.Apr 8 2021, 4:40 PM
Harbormaster failed remote builds in B20549: Diff 19512!

I do not recall exactly what is is possible to do with Prometheus queries but one interesting metric to compute would be the average save code now request delay.

For the tests, I think you can inspire from https://forge.softwareheritage.org/source/swh-web/browse/master/swh/web/tests/misc/test_metrics.py

swh/web/common/origin_save.py
625

Looking at the value computed below, this does not correspond to a save request duration but rather to the delay between the request submission and the swh visit so ACCEPTED_SAVE_REQUESTS_DELAY_METRIC feels like a better name.

627

you need to change that identifier as it is the same as the gauge above.

Thanks for the suggestions, will adapt.

swh/web/common/origin_save.py
625

indeed.

627

ugh right, how did i miss that?!

692

Do you think it's sensible enough? Or should i just add that check within the if conditional just above?

swh/web/common/origin_save.py
627

ah i know, i renamed it from the original name with counter to gauge.
And i missed the clash.

swh/web/common/origin_save.py
692

Oh right, you should not assert on visit_date as it can be None before a visit is found.

Build is green

Patch application report for D5463 (id=19533)

Rebasing onto 6e8bffe487...

First, rewinding head to replay your work on top of it...
Applying: Add metric to monitor "save code now" efficiency
Changes applied before test
commit f5208cd70c40985778868d4c466c39aebff0b1eb
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Thu Apr 8 16:23:10 2021 +0200

    Add metric to monitor "save code now" efficiency
    
    This only deals with the finalized save code now statuses (succeeded, failed) as it's
    the only information referenced in the model.
    
    Related to T1481

See https://jenkins.softwareheritage.org/job/DWAPPS/job/tests-on-diff/664/ for more details.

Build is green

Patch application report for D5463 (id=19538)

Rebasing onto 6e8bffe487...

First, rewinding head to replay your work on top of it...
Applying: Add metric to monitor "save code now" efficiency
Changes applied before test
commit 20b25b2ca74b27110705c087962736ba88cfbfa8
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Thu Apr 8 16:23:10 2021 +0200

    Add metric to monitor "save code now" efficiency
    
    This only deals with the finalized save code now statuses (succeeded, failed) as it's
    the only information referenced in the model.
    
    Related to T1481

See https://jenkins.softwareheritage.org/job/DWAPPS/job/tests-on-diff/665/ for more details.

Rename metric to use the delay term

Build is green

Patch application report for D5463 (id=19546)

Rebasing onto 6e8bffe487...

Current branch diff-target is up to date.
Changes applied before test
commit cf2d177fe47c5c798aa30f76b248831d59f80ec7
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Thu Apr 8 16:23:10 2021 +0200

    Add metric to monitor "save code now" efficiency
    
    This only deals with the finalized save code now statuses (succeeded, failed) as it's
    the only information referenced in the model.
    
    Related to T1481

See https://jenkins.softwareheritage.org/job/DWAPPS/job/tests-on-diff/666/ for more details.

Build is green

Patch application report for D5463 (id=19552)

Rebasing onto 6e8bffe487...

Current branch diff-target is up to date.
Changes applied before test
commit 5d88f646a3bff0321a82e4de18e0d59a18c9564c
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Thu Apr 8 16:23:10 2021 +0200

    Add metric to monitor "save code now" efficiency
    
    This only deals with the finalized save code now statuses (succeeded, failed) as it's
    the only information referenced in the model.
    
    Related to T1481

See https://jenkins.softwareheritage.org/job/DWAPPS/job/tests-on-diff/667/ for more details.

Build is green

Patch application report for D5463 (id=19553)

Rebasing onto 6e8bffe487...

Current branch diff-target is up to date.
Changes applied before test
commit 83c18187b72036a36a010ee487977cafde22539c
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Thu Apr 8 16:23:10 2021 +0200

    Add metric to monitor "save code now" efficiency
    
    This only deals with the finalized save code now statuses (succeeded, failed) as it's
    the only information referenced in the model.
    
    Related to T1481

See https://jenkins.softwareheritage.org/job/DWAPPS/job/tests-on-diff/668/ for more details.

swh/web/tests/misc/test_metrics.py
120

right, i'll need to amend data to actually pass through here then...

swh/web/common/origin_save.py
642

actual time of visit by Software Heritage

visits are created at the beginning of the loading so final status should not be mentioned here.

695–698

I am wondering if computing the average delay time would not be a better choice here.

If I understand correctly, you send to prometheus the cumulative delay times for all requests with a swh visit.
Sending an average delay value will enable to see on a graph the evolution of requests processing delay over
time which I think is what we want to monitor here from my point of view.

Any thoughts on this ?

swh/web/common/origin_save.py
695–698

Yes, but i'll let prometheus/grafana do the computations, the rate (as screenshot above) and the avg for example.

swh/web/common/origin_save.py
695–698

Ack, I guess you can get the number of requests by querying the other metrics (number of requests per visit type and statuses), right ?

swh/web/common/origin_save.py
695–698

I don't think we need the number of requests, we want a tendency which we can have using
the aggregation over time functions [1].

I've updated the diff description with screenshots to demo it as well.

But yeah, i think we can compose other metrics if need be.

[1] https://prometheus.io/docs/prometheus/latest/querying/functions/#aggregation_over_timedescription

Adapt data so the metric actually display non null values.

Build is green

Patch application report for D5463 (id=19564)

Rebasing onto 6e8bffe487...

Current branch diff-target is up to date.
Changes applied before test
commit d6237c835fce4832065f5cb9e7cf87f38e51e442
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Thu Apr 8 16:23:10 2021 +0200

    Add metric to monitor "save code now" efficiency
    
    This only deals with the finalized save code now statuses (succeeded, failed) as it's
    the only information referenced in the model.
    
    Related to T1481

See https://jenkins.softwareheritage.org/job/DWAPPS/job/tests-on-diff/669/ for more details.

anlambert added inline comments.
swh/web/common/origin_save.py
642

or simply "actual time of ingestion"

695–698

I don't think we need the number of requests, we want a tendency which we can have using
the aggregation over time functions [1].

If we want an accurate average delay time, yes we do.

But this can be computed from Prometheus so current metric looks good to me.

This revision is now accepted and ready to land.Apr 9 2021, 1:36 PM
  • Rebase
  • adapt comment according to suggestion (i had forgotten that one)

Build is green

Patch application report for D5463 (id=19568)

Rebasing onto 0eb0c33a65...

Current branch diff-target is up to date.
Changes applied before test
commit 7131f6cd4993cea8802770c369b04a3f008fc779
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Thu Apr 8 16:23:10 2021 +0200

    Add metric to monitor "save code now" efficiency
    
    This only deals with the finalized save code now statuses (succeeded, failed) as it's
    the only information referenced in the model.
    
    Related to T1481

See https://jenkins.softwareheritage.org/job/DWAPPS/job/tests-on-diff/671/ for more details.