Changeset View
Standalone View
swh/web/common/origin_save.py
Show First 20 Lines • Show All 616 Lines • ▼ Show 20 Lines | |||||
_accepted_save_requests_gauge = Gauge( | _accepted_save_requests_gauge = Gauge( | ||||
name=ACCEPTED_SAVE_REQUESTS_METRIC, | name=ACCEPTED_SAVE_REQUESTS_METRIC, | ||||
documentation="Number of accepted origin save requests", | documentation="Number of accepted origin save requests", | ||||
labelnames=["load_task_status", "visit_type"], | labelnames=["load_task_status", "visit_type"], | ||||
registry=SWH_WEB_METRICS_REGISTRY, | registry=SWH_WEB_METRICS_REGISTRY, | ||||
) | ) | ||||
# Cumulative metric on the duration of save code now request per status and visit_type | |||||
anlambert: Looking at the value computed below, this does not correspond to a save request duration but… | |||||
Done Inline Actionsindeed. ardumont: indeed. | |||||
ACCEPTED_SAVE_REQUESTS_DELAY_METRIC = "swh_web_save_requests_duration_seconds" | |||||
Not Done Inline Actionsyou need to change that identifier as it is the same as the gauge above. anlambert: you need to change that identifier as it is the same as the gauge above. | |||||
Done Inline Actionsugh right, how did i miss that?! ardumont: ugh right, how did i miss that?! | |||||
Done Inline Actionsah i know, i renamed it from the original name with counter to gauge. ardumont: ah i know, i renamed it from the original name with counter to gauge.
And i missed the clash.
| |||||
_accepted_save_requests_delay_gauge = Gauge( | |||||
name=ACCEPTED_SAVE_REQUESTS_DELAY_METRIC, | |||||
documentation="Save Requests Duration", | |||||
labelnames=["load_task_status", "visit_type"], | |||||
registry=SWH_WEB_METRICS_REGISTRY, | |||||
) | |||||
def compute_save_requests_metrics() -> None: | def compute_save_requests_metrics() -> None: | ||||
"""Compute a couple of Prometheus metrics related to | """Compute Prometheus metrics related to origin save requests: | ||||
origin save requests""" | |||||
- Number of submitted origin save requests | |||||
- Number of accepted origin save requests | |||||
- Save Code Now requests duration | |||||
Not Done Inline Actionsactual time of visit by Software Heritage visits are created at the beginning of the loading so final status should not be mentioned here. anlambert: actual time of visit by Software Heritage
visits are created at the beginning of the loading… | |||||
Not Done Inline Actionsor simply "actual time of ingestion" anlambert: or simply "actual time of ingestion" | |||||
""" | |||||
request_statuses = ( | request_statuses = ( | ||||
SAVE_REQUEST_ACCEPTED, | SAVE_REQUEST_ACCEPTED, | ||||
SAVE_REQUEST_REJECTED, | SAVE_REQUEST_REJECTED, | ||||
SAVE_REQUEST_PENDING, | SAVE_REQUEST_PENDING, | ||||
) | ) | ||||
load_task_statuses = ( | load_task_statuses = ( | ||||
Show All 12 Lines | def compute_save_requests_metrics() -> None: | ||||
for labels in labels_set: | for labels in labels_set: | ||||
_submitted_save_requests_gauge.labels(*labels).set(0) | _submitted_save_requests_gauge.labels(*labels).set(0) | ||||
labels_set = product(load_task_statuses, visit_types) | labels_set = product(load_task_statuses, visit_types) | ||||
for labels in labels_set: | for labels in labels_set: | ||||
_accepted_save_requests_gauge.labels(*labels).set(0) | _accepted_save_requests_gauge.labels(*labels).set(0) | ||||
duration_load_task_statuses = ( | |||||
SAVE_TASK_FAILED, | |||||
SAVE_TASK_SUCCEEDED, | |||||
) | |||||
for labels in product(duration_load_task_statuses, visit_types): | |||||
_accepted_save_requests_delay_gauge.labels(*labels).set(0) | |||||
for sor in SaveOriginRequest.objects.all(): | for sor in SaveOriginRequest.objects.all(): | ||||
if sor.status == SAVE_REQUEST_ACCEPTED: | if sor.status == SAVE_REQUEST_ACCEPTED: | ||||
_accepted_save_requests_gauge.labels( | _accepted_save_requests_gauge.labels( | ||||
load_task_status=sor.loading_task_status, visit_type=sor.visit_type | load_task_status=sor.loading_task_status, visit_type=sor.visit_type | ||||
).inc() | ).inc() | ||||
_submitted_save_requests_gauge.labels( | _submitted_save_requests_gauge.labels( | ||||
status=sor.status, visit_type=sor.visit_type | status=sor.status, visit_type=sor.visit_type | ||||
).inc() | ).inc() | ||||
if ( | |||||
sor.loading_task_status in (SAVE_TASK_SUCCEEDED, SAVE_TASK_FAILED) | |||||
Done Inline ActionsDo you think it's sensible enough? Or should i just add that check within the if conditional just above? ardumont: Do you think it's sensible enough? Or should i just add that check within the if conditional… | |||||
Not Done Inline ActionsOh right, you should not assert on visit_date as it can be None before a visit is found. anlambert: Oh right, you should not assert on visit_date as it can be `None` before a visit is found. | |||||
and sor.visit_date is not None | |||||
and sor.request_date is not None | |||||
): | |||||
duration = sor.visit_date.timestamp() - sor.request_date.timestamp() | |||||
_accepted_save_requests_delay_gauge.labels( | |||||
load_task_status=sor.loading_task_status, visit_type=sor.visit_type | |||||
Not Done Inline ActionsI am wondering if computing the average delay time would not be a better choice here. If I understand correctly, you send to prometheus the cumulative delay times for all requests with a swh visit. Any thoughts on this ? anlambert: I am wondering if computing the average delay time would not be a better choice here.
If I… | |||||
Done Inline ActionsYes, but i'll let prometheus/grafana do the computations, the rate (as screenshot above) and the avg for example. ardumont: Yes, but i'll let prometheus/grafana do the computations, the rate (as screenshot above) and… | |||||
Not Done Inline ActionsAck, I guess you can get the number of requests by querying the other metrics (number of requests per visit type and statuses), right ? anlambert: Ack, I guess you can get the number of requests by querying the other metrics (number of… | |||||
Done Inline ActionsI don't think we need the number of requests, we want a tendency which we can have using I've updated the diff description with screenshots to demo it as well. But yeah, i think we can compose other metrics if need be. [1] https://prometheus.io/docs/prometheus/latest/querying/functions/#aggregation_over_timedescription ardumont: I don't think we need the number of requests, we want a tendency which we can have using
the… | |||||
Not Done Inline Actions
If we want an accurate average delay time, yes we do. But this can be computed from Prometheus so current metric looks good to me. anlambert: > I don't think we need the number of requests, we want a tendency which we can have using
>… | |||||
).inc(duration) |
Looking at the value computed below, this does not correspond to a save request duration but rather to the delay between the request submission and the swh visit so ACCEPTED_SAVE_REQUESTS_DELAY_METRIC feels like a better name.