As per the title, we'd like to know how long it takes (on average or otherwise) to completely process a "save code now" request (including "take snapshot now").
Description
Revisions and Commits
rDWAPPS Web applications
D5463 | rDWAPPS7131f6cd4993 Add metric to monitor "save code now" efficiency
Status | Assigned | Task
---|---|---
Open | None | T1408 More/better Metrics
Open | None | T3082 Improve Save Code Now handling
Open | ardumont | T1481 add metric to monitor "save code now" efficiency
Event Timeline
The archive computes its own prometheus metrics regarding save code now [1].
Also, the save code now model exposes a request_date and a visit_date [2].
So a first approximation would be to take the difference between those 2 fields and expose it as a new metric.
[1] https://forge.softwareheritage.org/source/swh-web/browse/master/swh/web/common/origin_save.py$0-625
[2]
swh-web=> \d save_origin_request
                                      Table "public.save_origin_request"
       Column        |           Type           | Collation | Nullable |                     Default
---------------------+--------------------------+-----------+----------+-------------------------------------------------
 id                  | bigint                   |           | not null | nextval('save_origin_request_id_seq'::regclass)
 request_date        | timestamp with time zone |           | not null |
 visit_type          | character varying(200)   |           | not null |
 origin_url          | character varying(200)   |           | not null |
 status              | text                     |           | not null |
 loading_task_id     | integer                  |           | not null |
 visit_date          | timestamp with time zone |           |          |
 loading_task_status | text                     |           | not null |
Indexes:
    "save_origin_request_pkey" PRIMARY KEY, btree (id)
    "save_origin_origin__b46350_idx" btree (origin_url, status)
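As a rough sketch of that first approximation (not the actual swh-web code), the duration of one request is simply visit_date minus request_date. The function name and the sample timestamps below are made up for illustration; only the two field names come from the table above. Since visit_date is nullable, a request without a visit yields no duration.

```python
from datetime import datetime, timezone
from typing import Optional


def save_request_duration_seconds(
    request_date: datetime, visit_date: Optional[datetime]
) -> Optional[float]:
    """Seconds between the request and the visit, or None if not visited yet."""
    if visit_date is None:
        # visit_date is nullable in save_origin_request: nothing to measure yet
        return None
    return (visit_date - request_date).total_seconds()


# Illustrative values only, not real requests
requested = datetime(2021, 3, 1, 10, 0, tzinfo=timezone.utc)
visited = datetime(2021, 3, 1, 13, 30, tzinfo=timezone.utc)

print(save_request_duration_seconds(requested, visited))  # 12600.0 (3.5 hours)
print(save_request_duration_seconds(requested, None))  # None
```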
> process a "save code now" request (including "take snapshot now")
I don't see anything in the current model that allows differentiating between scheduling
a new save code now origin and rescheduling an already ingested one (which I gather is
what the term "take snapshot now" refers to ;).
I'm not sure the difference between the 2 is worth spending too much effort on (well,
at least right now, in the context of T3084). They will be using exactly the same
mechanism whether that's a new origin or an already ingested one.
So I'll focus on the main part first, adding a metric for the "save code now" time.
As a heads up, we can already determine some basic metrics out of the postgres db.
Current status is roughly (over the course of all save code now requests [1]):
- ~6 hours on average for a successful ingestion (so the task T3084 was right in its description, a "few" hours).
- ~16 hours for a failed ingestion
Over the last 4 months, things have improved though:
- ~3.5 hours for a successful ingestion
- ~7 hours for a failed one
[1] P1001 for the details
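For illustration, here is roughly how such averages could be computed once the rows are out of the db. This is a sketch under assumptions, not the query from P1001: the function name, the row layout, the "succeeded"/"failed" labels and the sample rows are all placeholders, and the optional cutoff stands in for the "last 4 months" window.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone
from typing import Dict, Iterable, List, Optional, Tuple

# Assumed row layout: (loading_task_status, request_date, visit_date)
Row = Tuple[str, datetime, Optional[datetime]]


def average_hours_by_status(
    rows: Iterable[Row], since: Optional[datetime] = None
) -> Dict[str, float]:
    """Average (visit_date - request_date) in hours, grouped by status."""
    durations: Dict[str, List[float]] = defaultdict(list)
    for status, request_date, visit_date in rows:
        if visit_date is None:
            continue  # never visited, no duration to account for
        if since is not None and request_date < since:
            continue  # outside the requested time window
        hours = (visit_date - request_date).total_seconds() / 3600
        durations[status].append(hours)
    return {status: sum(v) / len(v) for status, v in durations.items()}


# Placeholder rows, not real data
base = datetime(2021, 1, 1, tzinfo=timezone.utc)
rows = [
    ("succeeded", base, base + timedelta(hours=2)),
    ("succeeded", base + timedelta(days=10), base + timedelta(days=10, hours=4)),
    ("failed", base, base + timedelta(hours=8)),
    ("succeeded", base + timedelta(days=20), None),
]

print(average_hours_by_status(rows))
print(average_hours_by_status(rows, since=base + timedelta(days=5)))
```

Passing a cutoff through `since` gives the "recent" figures, leaving it out gives the all-time ones.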
I've tentatively updated the save code now dashboard [1]
with that ^ new metric, now deployed in the staging and production instances.
I've added a rate panel and an avg_over_time panel there...
The panels don't look very readable to me right now, but I gather
the trend towards faster ingestion will eventually show up when
addressing [2].
[1] https://grafana.softwareheritage.org/goto/DT2H4qlGz
[2] T3084