
add metric to monitor "save code now" efficiency
Open, High, Public

Description

As per the title, we'd like to know how long it takes (on average or otherwise) to completely process a "save code now" request (including "take snapshot now").

Event Timeline

zack triaged this task as High priority.Jan 21 2019, 11:14 AM
zack created this task.

The archive computes its own Prometheus metrics regarding save code now [1].
Also, the save code now model exposes a request_date and a visit_date [2].
So a first approximation would be to use those 2 fields and expose a new metric derived from them.

[1] https://forge.softwareheritage.org/source/swh-web/browse/master/swh/web/common/origin_save.py$0-625

[2]

swh-web=> \d save_origin_request
                                           Table "public.save_origin_request"
       Column        |           Type           | Collation | Nullable |                     Default
---------------------+--------------------------+-----------+----------+-------------------------------------------------
 id                  | bigint                   |           | not null | nextval('save_origin_request_id_seq'::regclass)
 request_date        | timestamp with time zone |           | not null |
 visit_type          | character varying(200)   |           | not null |
 origin_url          | character varying(200)   |           | not null |
 status              | text                     |           | not null |
 loading_task_id     | integer                  |           | not null |
 visit_date          | timestamp with time zone |           |          |
 loading_task_status | text                     |           | not null |
Indexes:
    "save_origin_request_pkey" PRIMARY KEY, btree (id)
    "save_origin_origin__b46350_idx" btree (origin_url, status)
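Using those two timestamps, the elapsed time per request could be sketched as follows (a minimal illustration, not the actual swh-web code; rows whose visit has not happened yet have a NULL visit_date and are skipped):

```python
# Hypothetical sketch: deriving a "save code now" duration from the two
# timestamps exposed by the save_origin_request model.
from datetime import datetime, timedelta, timezone

def ingestion_delays(rows):
    """Yield visit_date - request_date for each completed request.

    rows: iterable of (request_date, visit_date) pairs; visit_date may be
    None when the visit is still pending.
    """
    for request_date, visit_date in rows:
        if visit_date is not None:
            yield visit_date - request_date

rows = [
    (datetime(2021, 3, 1, 10, 0, tzinfo=timezone.utc),
     datetime(2021, 3, 1, 13, 30, tzinfo=timezone.utc)),
    (datetime(2021, 3, 1, 11, 0, tzinfo=timezone.utc), None),  # still pending
]
delays = list(ingestion_delays(rows))
# delays == [timedelta(hours=3, minutes=30)]
```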

process a "save code now" request (including "take snapshot now")

I don't see anything in the current model that allows differentiating between scheduling
a save code now request for a new origin and rescheduling one for an already ingested
origin (which I gather is what's hiding behind the term take snapshot now ;).

I'm not sure the difference between the two is worth spending too much effort on (at
least not right now, in the context of T3084). Both cases use exactly the same
mechanism, whether the origin is new or already ingested.

So I'll focus on the main part first, adding a metric for the "save code now" time.
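Such a metric would end up in the Prometheus text exposition format. A stdlib-only sketch of what that rendering looks like (the metric name `swh_save_code_now_delay_seconds` is made up for illustration; swh-web defines its real metrics in origin_save.py):

```python
# Illustrative sketch: rendering a mean-delay-per-status gauge in the
# Prometheus text exposition format. Metric name is hypothetical.
def render_delay_metric(mean_delay_by_status):
    """mean_delay_by_status maps a request status to a mean delay in seconds."""
    name = "swh_save_code_now_delay_seconds"
    lines = [
        f"# HELP {name} Mean delay between a save code now request and its visit",
        f"# TYPE {name} gauge",
    ]
    for status, seconds in sorted(mean_delay_by_status.items()):
        lines.append(f'{name}{{status="{status}"}} {seconds}')
    return "\n".join(lines) + "\n"

text = render_delay_metric({"succeeded": 12600.0, "failed": 57600.0})
```

In practice swh-web would register such a gauge with its Prometheus client rather than render the text by hand; the sketch just shows the shape of what gets scraped.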

As a heads-up, we can already derive some basic metrics from the Postgres DB.

Current status is roughly (over the course of all save code now requests [1]):

  • ~6 hours on average for a successful ingestion (so the task T3084 was right in its description, a "few" hours).
  • ~16 hours for a failed ingestion

Over the last 4 months we have improved, though:

  • ~3.5 hours for a successful ingestion
  • ~7 hours for a failed one

[1] P1001 for the details
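The aggregation behind those numbers (the actual query is in the paste above) boils down to averaging the request-to-visit delay per loading task status. A small illustration of that grouping, with made-up sample rows:

```python
# Illustrative only: mean ingestion delay per status, the kind of
# aggregation the referenced paste performs directly in SQL.
from collections import defaultdict
from datetime import timedelta

def mean_delay_by_status(rows):
    """rows: iterable of (status, delay as timedelta); returns status -> mean."""
    totals = defaultdict(lambda: [timedelta(0), 0])
    for status, delay in rows:
        acc = totals[status]
        acc[0] += delay
        acc[1] += 1
    return {status: total / count for status, (total, count) in totals.items()}

rows = [
    ("succeeded", timedelta(hours=3)),
    ("succeeded", timedelta(hours=4)),
    ("failed", timedelta(hours=7)),
]
means = mean_delay_by_status(rows)
# means["succeeded"] == timedelta(hours=3, minutes=30)
```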

I've tentatively updated the save code now dashboard [1]
with that new metric, now deployed in the staging and production instances.

I've added a rate and an avg_over_time panel there...

The panels don't look very readable to me right now, but I gather the
trend towards faster ingestion will eventually show up once [2] is
addressed.
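For reference, panels of that shape are typically driven by PromQL expressions like the following (the metric names here are hypothetical placeholders, not the ones actually deployed):

```promql
# Requests processed per second, over a 5-minute window
rate(swh_save_code_now_request_total[5m])

# Mean ingestion delay over the last day
avg_over_time(swh_save_code_now_delay_seconds[1d])
```

`rate()` expects a counter and `avg_over_time()` a gauge, which is why the two panels need two differently-typed metrics.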

[1] https://grafana.softwareheritage.org/goto/DT2H4qlGz

[2] T3084