Page MenuHomeSoftware Heritage

Add prometheus metrics for origin save requests
ClosedPublic

Authored by anlambert on Dec 11 2019, 6:05 PM.

Details

Summary

Use a different approach compared to my first take on this in D2419 (based on statsd).

  • rely on prometheus Python client to generate and export some metrics related to origin save requests.
  • add a new /metrics endpoint to the webapp in order for Prometheus to scrape metrics data.

A metric is generated for each of the available save request statuses: accepted, rejected, pending.

Related to T2124

Diff Detail

Repository
rDWAPPS Web applications
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

anlambert created this revision.Dec 11 2019, 6:05 PM
vlorentz requested changes to this revision.Dec 11 2019, 6:17 PM
vlorentz added a subscriber: vlorentz.

Is this new endpoint rate-limited?

swh/web/common/origin_save.py
536–550

In your other diff, you used counters, with a status tag. Why did you change that?

swh/web/misc/urls.py
51

I think it should be metrics/prometheus/, in case we want to add other kinds of metrics in the future

This revision now requires changes to proceed.Dec 11 2019, 6:17 PM

Is this new endpoint rate-limited?

Nope, only api endpoints are rate limited.

swh/web/common/origin_save.py
536–550

To avoid having too much labels combination and useless metrics. For instance a rejected request will never have a loading task created so the load_task_status label value will always be "not created" here. So I think it is better to have a separate metric for each save request status.

swh/web/misc/urls.py
51

ack

anlambert updated this revision to Diff 8608.Dec 12 2019, 11:46 AM

Update according to reviews

vlorentz added inline comments.Dec 12 2019, 1:30 PM
swh/web/common/origin_save.py
536–550

Will Prometheus create a time-series for these extra combinations if we never report a value for them, though?

The problem with using different metric names is that, afaik, they can't be aggregated. And we probably want to aggregate accepted and rejected requests, to get the total number of handled requests. (and maybe pending requests too, to have the total number of requests received)

swh/web/misc/urls.py
51

I think it's missing a trailing slash now

Is this new endpoint rate-limited?

Nope, only api endpoints are rate limited.

Why?

anlambert added inline comments.Dec 12 2019, 2:23 PM
swh/web/common/origin_save.py
536–550

How about dropping the REJECTED_SAVE_REQUESTS_METRIC and PENDING_SAVE_REQUESTS_METRIC and merge them in the following gauge ?

Gauge(
    name=SAVE_REQUESTS_METRIC,
    documentation='Number of origin save requests',
    labelnames=['status', 'visit_type'],
    registry=SWH_WEB_METRICS_REGISTRY)

I think keeping the metric about accepted save requests is interesting as for instance it could be used to calculate the ratio of successful requests.

swh/web/misc/urls.py
51

right

vlorentz added inline comments.Dec 12 2019, 4:00 PM
swh/web/common/origin_save.py
536–550

Can you calculate such a ratio in prometheus between different metrics?

Is this new endpoint rate-limited?

Nope, only api endpoints are rate limited.

Why?

Because there is no interest to rate limit the browse part of the webapp as the user experience will be awful otherwise.

swh/web/common/origin_save.py
536–550

Sure, you can write this query in PromQL:

sum(swh_web_accepted_save_requests{load_task_status="succeed"}) / sum(swh_web_submitted_save_requests)
vlorentz accepted this revision.Dec 12 2019, 5:06 PM

Because there is no interest to rate limit the browse part of the webapp as the user experience will be awful otherwise.

Not if the rate-limit is high enough.

There is little point in having a rate-limit at all if half the requests aren't rate-limited.

swh/web/common/origin_save.py
536–550

ok

This revision is now accepted and ready to land.Dec 12 2019, 5:06 PM
anlambert updated this revision to Diff 8639.Dec 12 2019, 5:09 PM

Rebase and replace swh_web_rejected_save_requests / swh_web_pending_save_requests metrics
by a more generic one swh_web_submitted_save_requests.