Software Heritage

add metric to monitor "save code now" efficiency
Open, High, Public

Description

As per the title, we'd like to know how long it takes (on average or otherwise) to completely process a "save code now" request (including "take snapshot now").

Event Timeline

zack triaged this task as High priority. Jan 21 2019, 11:14 AM
zack created this task.

The archive computes its own Prometheus metrics regarding save code now [1].
Also, the save code now model exposes a request_date and a visit_date [2].
So a first approximation would be to use those two fields and expose a new metric derived from them.

[1] https://forge.softwareheritage.org/source/swh-web/browse/master/swh/web/common/origin_save.py$0-625

[2]

swh-web=> \d save_origin_request
                                           Table "public.save_origin_request"
       Column        |           Type           | Collation | Nullable |                     Default
---------------------+--------------------------+-----------+----------+-------------------------------------------------
 id                  | bigint                   |           | not null | nextval('save_origin_request_id_seq'::regclass)
 request_date        | timestamp with time zone |           | not null |
 visit_type          | character varying(200)   |           | not null |
 origin_url          | character varying(200)   |           | not null |
 status              | text                     |           | not null |
 loading_task_id     | integer                  |           | not null |
 visit_date          | timestamp with time zone |           |          |
 loading_task_status | text                     |           | not null |
Indexes:
    "save_origin_request_pkey" PRIMARY KEY, btree (id)
    "save_origin_origin__b46350_idx" btree (origin_url, status)
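The approximation above (visit_date minus request_date) boils down to averaging the difference between the two timestamps while skipping still-pending rows whose visit_date is NULL. A minimal sketch in plain Python, assuming rows have already been fetched from save_origin_request (this is illustrative code, not the actual swh-web implementation):

```python
from datetime import datetime, timedelta, timezone

def mean_processing_time(rows):
    """Average (visit_date - request_date) over completed requests.

    `rows` are (request_date, visit_date) pairs as stored in
    save_origin_request; rows with a NULL (None) visit_date are
    still pending and are skipped.
    """
    deltas = [visit - request for request, visit in rows if visit is not None]
    if not deltas:
        return None
    return sum(deltas, timedelta()) / len(deltas)

t0 = datetime(2019, 1, 21, 11, 0, tzinfo=timezone.utc)
rows = [
    (t0, t0 + timedelta(hours=2)),
    (t0, t0 + timedelta(hours=4)),
    (t0, None),  # still pending: no visit_date yet, ignored
]
print(mean_processing_time(rows))  # 3:00:00
```

Exposing that value as a Prometheus gauge (or, better, per-duration buckets as discussed below) would then be a small step on top.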

process a "save code now" request (including "take snapshot now")

I don't see anything in the current model that allows differentiating between scheduling
a new save code now origin and rescheduling an already ingested one (which I gather is
what the term "take snapshot now" refers to ;).

I'm not sure the difference between the two is worth spending much effort on (at least
right now, in the context of T3084). Both use exactly the same mechanism, whether the
origin is new or already ingested.

So I'll focus on the main part first, adding a metric for the "save code now" time.

As a heads up, we can already determine some basic metrics out of the postgres db.

Current status is roughly (over the course of all save code now requests [1]):

  • ~6 hours on average for a successful ingestion (so the task T3084 was right in its description, a "few" hours).
  • ~16 hours for a failed ingestion

Over the last 4 months, though, things have improved:

  • ~3.5 hours for a successful ingestion
  • ~7 hours for a failed one

[1] P1001 for the details

I've tentatively updated the save code now dashboard [1]
with that new metric, now deployed in the staging and production instances.

I've added rate and avg_over_time panels there...

The panels don't look very readable to me right now, but I gather
the trend towards faster ingestion will eventually show up when
addressing [2].

[1] https://grafana.softwareheritage.org/goto/DT2H4qlGz

[2] T3084

I think the "submitted requests per visit type / status" graph should be split in 2 parts. Both accepted and rejected are cumulative values that will grow indefinitely, while pending is a transient value that should stay near zero, so it makes no sense to have them on the same graph.

Since there is already a graph dedicated to pending requests, pending requests should simply be removed from the submitted requests graph.

Note that there is the same transient vs cumulative discrepancy on the "Accepted requests" graph.

Now what's missing here (not sure how hard it is) is the mean and max ingestion time of save code now requests (the time between a request being accepted and the loader task completing).

I think the "submitted requests per visit type / status" graph should be split in 2 parts. Both accepted and rejected are cumulative values that will grow indefinitely, while pending is a transient value that should stay near zero, so it makes no sense to have them on the same graph.

Since there is already a graph dedicated to pending requests, pending requests should simply be removed from the submitted requests graph.

Thanks for the feedback.
Agreed, and done.

Note that I initially hard-coded values for this, going from [1] to [2]. Then, unhappy
with the hard-coding, I tried [3]. It's even better: no hard-coding (well, except for
the exclusion, but that's minor) and it's still compatible with the global dashboard
filtering, so \m/ (and TIL ;)

[1]

sum(swh_web_submitted_save_requests{environment="$environment",instance="$instance",status=~"$status"}) by (visit_type, status)

[2]

sum(swh_web_submitted_save_requests{environment="$environment",instance="$instance",status=~"accepted|rejected"}) by (visit_type, status)

[3]

sum(swh_web_submitted_save_requests{environment="$environment",instance="$instance",status=~"$status", status!="pending"}) by (visit_type, status)

Note that there is the same transient vs cumulative discrepancy on the "Accepted requests" graph.

Adjusted as well.

Now what's missing here (not sure how hard it is) is the mean and max ingestion time
of save code now requests (the time between a request being accepted and the loader
task completing).

Well, we do not record the timestamp at which a save code now request got accepted,
so in the current state we cannot compute this easily.

I don't think that information is that relevant though. With the authorized list
mechanism we have, requests are mostly accepted automatically, so the creation date is
a close enough approximation.

Only new origins might take longer (they stay in the pending state until a human
accepts or rejects them). Then again, if we use percentiles to plot the information,
those marginal few won't weigh much, so I don't think it's worth the effort...


In the meantime, I'm trying to rework the webapp metrics we have to use buckets of
durations, so we can have more readable graphs (heatmaps) displaying task duration (as
in, for example, [1]). This uses histogram metrics [2].

[1] https://grafana.softwareheritage.org/goto/bWXW3uqGz

[2] https://prometheus.io/docs/practices/histograms/
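The bucketing behind those histogram metrics [2] can be sketched in plain Python. The bucket bounds below are invented for illustration (the real swh-web metrics define their own); each observation increments every cumulative bucket whose upper bound it fits under, which is what lets Prometheus/Grafana render heatmaps and compute quantiles:

```python
import bisect

# Hypothetical duration buckets (upper bounds, in seconds). Prometheus
# histogram buckets are cumulative "less than or equal" (le) counters.
BUCKETS = [600, 3600, 4 * 3600, 12 * 3600, 24 * 3600, float("inf")]

def observe(counts, duration_seconds):
    """Record one task duration, like a Prometheus histogram observation:
    increment every bucket whose upper bound >= the observed value."""
    idx = bisect.bisect_left(BUCKETS, duration_seconds)
    for i in range(idx, len(BUCKETS)):
        counts[i] += 1

counts = [0] * len(BUCKETS)
for d in (300, 5400, 3 * 3600, 30 * 3600):  # sample task durations
    observe(counts, d)
print(counts)  # [1, 1, 3, 3, 3, 4]
```

In practice this logic lives inside the Prometheus client library; the point is only that bucketed counters, unlike a single average gauge, preserve the shape of the duration distribution.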