add metric to monitor "save code now" efficiency
As per the title, we'd like to know how long it takes (on average or otherwise) to completely process a "save code now" request (including "take snapshot now").

zack triaged this task as High priority. Jan 21 2019, 11:14 AM
zack created this task.

The archive already computes its own Prometheus metrics for save code now [1].
The save code now model also exposes a request_date and a visit_date [2].
So a first approximation would be to use those two fields to expose a new metric: the delay between them.



swh-web=> \d save_origin_request
                                           Table "public.save_origin_request"
       Column        |           Type           | Collation | Nullable |                     Default
---------------------+--------------------------+-----------+----------+-------------------------------------------------
 id                  | bigint                   |           | not null | nextval('save_origin_request_id_seq'::regclass)
 request_date        | timestamp with time zone |           | not null |
 visit_type          | character varying(200)   |           | not null |
 origin_url          | character varying(200)   |           | not null |
 status              | text                     |           | not null |
 loading_task_id     | integer                  |           | not null |
 visit_date          | timestamp with time zone |           |          |
 loading_task_status | text                     |           | not null |
Indexes:
    "save_origin_request_pkey" PRIMARY KEY, btree (id)
    "save_origin_origin__b46350_idx" btree (origin_url, status)

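As a minimal sketch of that first approximation, assuming rows shaped like the save_origin_request table above: the delay is visit_date minus request_date, and requests whose visit_date is still null are skipped. The metric name, the helper names and the loading_task_status values used here are hypothetical, chosen for illustration rather than taken from swh-web. The output follows the Prometheus text exposition format.

```python
from collections import defaultdict
from datetime import datetime, timezone

# Hypothetical metric name, for illustration only.
METRIC = "save_request_duration_seconds"


def duration_seconds(request_date, visit_date):
    """Seconds between request and visit, or None if not visited yet."""
    if visit_date is None:  # visit_date is nullable in the table
        return None
    return (visit_date - request_date).total_seconds()


def render_metric(rows):
    """Render accumulated delays per (loading_task_status, visit_type).

    rows: iterable of (request_date, visit_date, loading_task_status, visit_type).
    """
    totals = defaultdict(float)
    for request_date, visit_date, task_status, visit_type in rows:
        delay = duration_seconds(request_date, visit_date)
        if delay is not None:
            totals[(task_status, visit_type)] += delay
    lines = [f"# TYPE {METRIC} gauge"]
    for (task_status, visit_type), total in sorted(totals.items()):
        lines.append(
            f'{METRIC}{{loading_task_status="{task_status}",'
            f'visit_type="{visit_type}"}} {total}'
        )
    return "\n".join(lines)


utc = timezone.utc
# Made-up sample rows; the status values are assumptions.
rows = [
    (datetime(2021, 3, 1, 10, tzinfo=utc), datetime(2021, 3, 1, 12, tzinfo=utc), "succeeded", "git"),
    (datetime(2021, 3, 1, 11, tzinfo=utc), None, "running", "git"),
]
print(render_metric(rows))
```

With these two sample rows, only the first one contributes: a single sample of 7200.0 seconds labelled succeeded/git.
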
> process a "save code now" request (including "take snapshot now")

In the current model, I don't see anything that lets us differentiate between scheduling a
new save code now origin and rescheduling an already ingested one (which I gather is
what the term "take snapshot now" refers to ;).

I'm not sure the difference between the two is worth spending much effort on (well,
at least right now, in the context of T3084). Both use exactly the same
mechanism, whether that's a new origin or an already ingested one.

So I'll focus on the main part first, adding a metric for the "save code now" time.

As a heads up, we can already derive some basic metrics from the PostgreSQL database.

Current status is roughly (over the course of all save code now requests [1]):

  • ~6 hours on average for a successful ingestion (so the task T3084 was right in its description, a "few" hours).
  • ~16 hours for a failed ingestion

Over the last 4 months, we got better though:

  • ~3.5 hours for a successful ingestion
  • ~7 hours for a failed one
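
The averages above can be sketched as follows, assuming rows of (request_date, visit_date, loading_task_status) pulled from the table. The function name, the status values and the sample rows are made up for illustration; they are not the query or the data behind the figures above. The optional cutoff mirrors the "last 4 months" restriction.

```python
from datetime import datetime, timedelta, timezone


def average_hours(rows, since=None):
    """Average request-to-visit delay in hours, per loading task status.

    rows: iterable of (request_date, visit_date, loading_task_status).
    since: optional cutoff; requests made before it are ignored.
    """
    sums, counts = {}, {}
    for request_date, visit_date, task_status in rows:
        if visit_date is None:
            continue  # not visited yet, so there is no delay to measure
        if since is not None and request_date < since:
            continue
        hours = (visit_date - request_date) / timedelta(hours=1)
        sums[task_status] = sums.get(task_status, 0.0) + hours
        counts[task_status] = counts.get(task_status, 0) + 1
    return {status: sums[status] / counts[status] for status in sums}


utc = timezone.utc
# Made-up sample rows, not real archive data.
rows = [
    (datetime(2020, 1, 1, tzinfo=utc), datetime(2020, 1, 1, 8, tzinfo=utc), "succeeded"),
    (datetime(2021, 3, 1, tzinfo=utc), datetime(2021, 3, 1, 4, tzinfo=utc), "succeeded"),
    (datetime(2021, 3, 2, tzinfo=utc), datetime(2021, 3, 2, 7, tzinfo=utc), "failed"),
]
overall = average_hours(rows)
recent = average_hours(rows, since=datetime(2021, 1, 1, tzinfo=utc))
print(overall, recent)
```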

I've tentatively updated the save code now dashboard [1]
with the new metric above, which is now deployed in the staging and production instances.

I've added rate and avg_over_time panels there.
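
For reference, a rough sketch of what those two PromQL functions compute over a series of (timestamp, value) samples. This is a simplification and not Prometheus' actual implementation: the real rate also handles counter resets and extrapolates to the window boundaries, and the exact window boundary handling may differ. The helper names and sample values are illustrative.

```python
def window(samples, now, span):
    """Samples (timestamp, value) with timestamp in [now - span, now].

    Assumes samples are sorted by timestamp.
    """
    return [(ts, v) for ts, v in samples if now - span <= ts <= now]


def avg_over_time(samples, now, span):
    """Mean of the sample values in the window, or None if it is empty."""
    values = [v for _, v in window(samples, now, span)]
    return sum(values) / len(values) if values else None


def simple_rate(samples, now, span):
    """Per-second increase between the first and last sample in the window.

    Simplified: no counter-reset handling, no extrapolation.
    """
    points = window(samples, now, span)
    if len(points) < 2:
        return None
    (t0, v0), (t1, v1) = points[0], points[-1]
    return (v1 - v0) / (t1 - t0)


# Made-up samples: (unix timestamp in seconds, accumulated delay in seconds).
samples = [(0, 0.0), (60, 3600.0), (120, 10800.0), (180, 14400.0)]
print(avg_over_time(samples, now=180, span=120))
print(simple_rate(samples, now=180, span=120))
```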

The panels aren't very readable to me right now, but I gather
the trend towards faster ingestion will eventually show up once
[2] is addressed.


