Page MenuHomeSoftware Heritage

Add prometheus metrics for origin save requests
ClosedPublic

Authored by anlambert on Dec 11 2019, 6:05 PM.

Details

Summary

Use a different approach compared to my first take on this in D2419 (based on statsd).

  • rely on prometheus Python client to generate and export some metrics related to origin save requests.
  • add a new /metrics endpoint to the webapp in order for Prometheus to scrape metrics data.

A metric is generated for each of the available save request statuses: accepted, rejected, pending.

Related to T2124

Diff Detail

Repository
rDWAPPS Web applications
Branch
add-save-requests-metrics
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 9682
Build 14271: Cypress tests for swh-web diffsJenkins
Build 14270: tox-on-jenkinsJenkins
Build 14269: arc lint + arc unit

Event Timeline

vlorentz added a subscriber: vlorentz.

Is this new endpoint rate-limited?

swh/web/common/origin_save.py
536–550

In your other diff, you used counters, with a status tag. Why did you change that?

swh/web/misc/urls.py
51

I think it should be metrics/prometheus/, in case we want to add other kinds of metrics in the future

This revision now requires changes to proceed.Dec 11 2019, 6:17 PM

Is this new endpoint rate-limited?

Nope, only api endpoints are rate limited.

swh/web/common/origin_save.py
536–550

To avoid having too much labels combination and useless metrics. For instance a rejected request will never have a loading task created so the load_task_status label value will always be "not created" here. So I think it is better to have a separate metric for each save request status.

swh/web/misc/urls.py
51

ack

Update according to reviews

swh/web/common/origin_save.py
536–550

Will Prometheus create a time-series for these extra combinations if we never report a value for them, though?

The problem with using different metric names is that, afaik, they can't be aggregated. And we probably want to aggregate accepted and rejected requests, to get the total number of handled requests. (and maybe pending requests too, to have the total number of requests received)

swh/web/misc/urls.py
51

I think it's missing a trailing slash now

Is this new endpoint rate-limited?

Nope, only api endpoints are rate limited.

Why?

swh/web/common/origin_save.py
536–550

How about dropping the REJECTED_SAVE_REQUESTS_METRIC and PENDING_SAVE_REQUESTS_METRIC and merge them in the following gauge ?

Gauge(
    name=SAVE_REQUESTS_METRIC,
    documentation='Number of origin save requests',
    labelnames=['status', 'visit_type'],
    registry=SWH_WEB_METRICS_REGISTRY)

I think keeping the metric about accepted save requests is interesting as for instance it could be used to calculate the ratio of successful requests.

swh/web/misc/urls.py
51

right

swh/web/common/origin_save.py
536–550

Can you calculate such a ratio in prometheus between different metrics?

Is this new endpoint rate-limited?

Nope, only api endpoints are rate limited.

Why?

Because there is no interest to rate limit the browse part of the webapp as the user experience will be awful otherwise.

swh/web/common/origin_save.py
536–550

Sure, you can write this query in PromQL:

sum(swh_web_accepted_save_requests{load_task_status="succeed"}) / sum(swh_web_submitted_save_requests)

Because there is no interest to rate limit the browse part of the webapp as the user experience will be awful otherwise.

Not if the rate-limit is high enough.

There is little point in having a rate-limit at all if half the requests aren't rate-limited.

swh/web/common/origin_save.py
536–550

ok

This revision is now accepted and ready to land.Dec 12 2019, 5:06 PM

Rebase and replace swh_web_rejected_save_requests / swh_web_pending_save_requests metrics
by a more generic one swh_web_submitted_save_requests.