add metric to monitor "save code now" efficiency
Closed, Migrated

Description

as per the title, we'd like to know how long it takes (on average or not) to completely process a "save code now" request (including "take snapshot now")

Event Timeline

zack triaged this task as High priority. Jan 21 2019, 11:14 AM
zack created this task.

The archive computes its own prometheus metrics regarding save code now [1].
Also, the save code now model exposes a request_date and a visit_date [2].
So a first approximation would be to use those two fields and expose a new metric derived from them.

[1] https://forge.softwareheritage.org/source/swh-web/browse/master/swh/web/common/origin_save.py$0-625

[2]

swh-web=> \d save_origin_request
                                           Table "public.save_origin_request"
       Column        |           Type           | Collation | Nullable |                     Default
---------------------+--------------------------+-----------+----------+-------------------------------------------------
 id                  | bigint                   |           | not null | nextval('save_origin_request_id_seq'::regclass)
 request_date        | timestamp with time zone |           | not null |
 visit_type          | character varying(200)   |           | not null |
 origin_url          | character varying(200)   |           | not null |
 status              | text                     |           | not null |
 loading_task_id     | integer                  |           | not null |
 visit_date          | timestamp with time zone |           |          |
 loading_task_status | text                     |           | not null |
Indexes:
    "save_origin_request_pkey" PRIMARY KEY, btree (id)
    "save_origin_origin__b46350_idx" btree (origin_url, status)

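A minimal sketch of what such a metric could look like, assuming prometheus_client and an iterable of rows mirroring the table above (names here are illustrative, not swh-web's actual API):

from prometheus_client import Gauge

# Hypothetical metric; swh-web's actual metrics live in
# swh/web/common/origin_save.py and may be named differently.
SAVE_CODE_NOW_DURATION = Gauge(
    "swh_web_save_code_now_duration_seconds",
    "Average time between a save code now request and its visit",
    ["status"],
)

def compute_save_code_now_durations(requests):
    """Expose the average request -> visit duration per status.

    `requests` is assumed to yield objects with request_date,
    visit_date and status attributes, mirroring the
    save_origin_request table above.
    """
    durations = {}
    for req in requests:
        if req.visit_date is None:  # visit not finished yet
            continue
        delta = (req.visit_date - req.request_date).total_seconds()
        durations.setdefault(req.status, []).append(delta)
    for status, values in durations.items():
        SAVE_CODE_NOW_DURATION.labels(status=status).set(sum(values) / len(values))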
process a "save code now" request (including "take snapshot now")

I don't see anything in the current model allowing us to differentiate between
scheduling a new save code now origin and rescheduling an already ingested one
(which I gather is what the term "take snapshot now" refers to ;).

I'm not sure the difference between the two is worth spending much effort on
(at least right now, in the context of T3084). Both cases use exactly the same
mechanism, whether the origin is new or already ingested.

So I'll focus on the main part first, adding a metric for the "save code now" time.

As a heads-up, we can already derive some basic metrics from the postgres db.

Current status is roughly (over all save code now requests [1]):

  • ~6 hours on average for a successful ingestion (so the task T3084 was right in its description, a "few" hours).
  • ~16 hours for a failed ingestion

Over the last 4 months, we have gotten better though:

  • ~3.5 hours for a successful ingestion
  • ~7 hours for a failed one

[1] P1001 for the details
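
For reference, a rough equivalent of that computation, assuming direct SQL access to the swh-web database (illustrative only, not necessarily the exact P1001 query; the connection string is made up):

import psycopg2

# Average request -> visit delay, per loading task status.
QUERY = """
    SELECT loading_task_status,
           avg(visit_date - request_date) AS avg_duration
    FROM save_origin_request
    WHERE visit_date IS NOT NULL
    GROUP BY loading_task_status
"""

with psycopg2.connect("service=swh-web") as db:  # made-up DSN
    with db.cursor() as cur:
        cur.execute(QUERY)
        for status, avg_duration in cur.fetchall():
            print(f"{status}: {avg_duration}")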

I've tentatively updated the save code now dashboard [1]
with that new metric, deployed in the staging and production instances.

I've added rate and avg_over_time panels there.

The panels don't look very readable to me right now, but I gather
the tendency towards faster ingestion will eventually show up when
addressing [2]

[1] https://grafana.softwareheritage.org/goto/DT2H4qlGz

[2] T3084

I think the "submitted requests per visit type / status" graph should be split in two parts. Both accepted and rejected are cumulative values that will grow indefinitely, while pending is a transient value meant to stay near zero, so it makes no sense to have them on the same graph.

Since there is already a graph dedicated to pending requests, pending requests should just be removed from the submitted requests graph.

Note that there is the same transient vs cumulative discrepancy on the "Accepted requests" graph.

Now what's missing here (not sure how hard it is to add) is the mean and max ingestion time of save code now requests (the time between a request being accepted and its loader task finishing)

I think the "submitted requests per visit type / status" graph should be split in two parts. Both accepted and rejected are cumulative values that will grow indefinitely, while pending is a transient value meant to stay near zero, so it makes no sense to have them on the same graph.

Since there is already a graph dedicated to pending requests, pending requests should just be removed from the submitted requests graph.

Thanks for the feedback.
Agreed, and done.

Note that I initially hard-coded values, going from [1] to [2]. Then I tried
[3], as I was unhappy with the hard-coding. It's even better: no hard-coding
(well, except for the exclusion, but meh) and it's still compatible with the
global board filtering, so \m/ (and TIL ;)

[1]

sum(swh_web_submitted_save_requests{environment="$environment",instance="$instance",status=~"$status"}) by (visit_type, status)

[2]

sum(swh_web_submitted_save_requests{environment="$environment",instance="$instance",status=~"accepted|rejected"}) by (visit_type, status)

[3]

sum(swh_web_submitted_save_requests{environment="$environment",instance="$instance",status=~"$status", status!="pending"}) by (visit_type, status)

Note that there is the same transient vs cumulative discrepancy on the "Accepted requests" graph.

Adjusted as well.

Now what's missing here (not sure how hard it is to add) is the mean and max
ingestion time of save code now requests (the time between a request being
accepted and its loader task finishing)

Well, we do not have the timestamp at which a save code now request got
accepted, so in the current state we cannot compute this easily.

I don't think that information is that relevant though. With the authorized
list mechanism we have, requests are mostly accepted automatically, so the
creation date is roughly good enough.

Only new origins might take more time (because of the time they spend in the
pending state, since they depend on a human accepting or rejecting them). Then
again, if we use percentiles to draw the information, those marginal few won't
be counted, so I don't think it's worth the effort...
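
To illustrate with made-up numbers: a percentile such as p95 barely moves when a handful of manually moderated requests wait much longer, while the mean gets dragged up:

import statistics

# Mostly auto-accepted requests (fast), plus a few manually
# moderated ones that sat in "pending" for a long time
# (made-up durations, in hours).
durations_hours = [3.5] * 97 + [48, 72, 120]

mean = statistics.mean(durations_hours)
# quantiles(n=100) returns 99 cut points; index 94 is p95.
p95 = statistics.quantiles(durations_hours, n=100)[94]

print(f"mean: {mean:.1f}h, p95: {p95:.1f}h")
# -> mean: 5.8h, p95: 3.5h: the marginal slow cases barely show.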


In the meantime, I'm trying to rework the webapp metrics we have to use
buckets of durations, so we can have more readable graphs (heatmaps) to
display task durations (as in [1], for example). This uses histogram metrics [2].

[1] https://grafana.softwareheritage.org/goto/bWXW3uqGz

[2] https://prometheus.io/docs/practices/histograms/
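
For reference, a minimal sketch of such a histogram metric with prometheus_client (the metric name and bucket boundaries are illustrative, not necessarily what ends up deployed):

from prometheus_client import Histogram

# Duration buckets in seconds, from 10 minutes up to a day
# (illustrative boundaries; prometheus adds +Inf automatically).
DURATION_BUCKETS = (600, 1800, 3600, 2 * 3600, 6 * 3600, 12 * 3600, 24 * 3600)

SAVE_CODE_NOW_DURATION_HISTOGRAM = Histogram(
    "swh_web_save_code_now_duration_seconds",  # hypothetical name
    "Duration of save code now requests",
    ["status"],
    buckets=DURATION_BUCKETS,
)

def observe_request(request):
    """Record one finished request into the histogram buckets."""
    duration = (request.visit_date - request.request_date).total_seconds()
    SAVE_CODE_NOW_DURATION_HISTOGRAM.labels(status=request.status).observe(duration)

Grafana can then render the resulting per-bucket series as a heatmap, as in [1].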

douardda claimed this task.

We can always improve it, but now we have a decent dashboard, so let's consider this done.