
Analyze and make the bitbucket ingestion faster
Closed, Migrated

Description

The loader is apparently [1] flushing new objects to the storage very frequently,
so a bump in the buffer configuration should help [2].

A lot of time is also spent in the to_model/from_dict [1] conversions.
As most of the CPUs are fully used [3], bumping the machine's CPU count
should help as well.

[1]

[2]

[3] https://grafana.softwareheritage.org/goto/AQRE2mInk?orgId=1
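
For illustration, here is the kind of change meant by "a bump in the buffer configuration": a minimal sketch assuming the swh.storage pipeline/buffer proxy and its min_batch_size knob; the thresholds and storage URL below are illustrative assumptions, not the values actually deployed.

# Sketch only: make the buffer proxy accumulate bigger batches so the loader
# flushes to the backend storage less often.
from swh.storage import get_storage

storage = get_storage(
    cls="pipeline",
    steps=[
        {
            "cls": "buffer",
            "min_batch_size": {  # illustrative thresholds, not production values
                "content": 10_000,
                "directory": 5_000,
                "revision": 5_000,
                "release": 1_000,
            },
        },
        {"cls": "remote", "url": "http://storage.internal:5002/"},  # placeholder URL
    ],
)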

Event Timeline

ardumont triaged this task as Normal priority. Sep 7 2021, 10:33 AM
ardumont created this task.
ardumont changed the task status from Open to Work in Progress. Sep 9 2021, 12:32 PM
ardumont moved this task from Backlog to in-progress on the System administration board.

By the way, I forgot:

10:05:31 softwareheritage@belvedere:5432=> select now(), count(distinct url) from origin o inner join origin_visit ov on o.id=ov.origin where o.url like 'https://bitbucket.org/%' and ov.type='hg';
+------------------------------+--------+
|             now              | count  |
+------------------------------+--------+
| 2021-09-09 08:05:31.90589+00 | 266803 |
+------------------------------+--------+
(1 row)

Time: 173117.724 ms (02:53.118)
16:45:59 softwareheritage@belvedere:5432=> select now(), count(distinct url) from origin o inner join origin_visit ov on o.id=ov.origin where o.url like 'https://bitbucket.org/%' and ov.type='hg';
+-------------------------------+--------+
|              now              | count  |
+-------------------------------+--------+
| 2021-09-14 14:51:18.684494+00 | 271282 |
+-------------------------------+--------+
(1 row)

Time: 175008.966 ms (02:55.009)
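
A quick back-of-envelope on those two datapoints gives the baseline ingestion rate before the optimizations landed (simple arithmetic over the numbers above, nothing measured independently):

from datetime import datetime, timezone

# Timestamps and counts copied from the two query results above.
t1 = datetime(2021, 9, 9, 8, 5, 31, tzinfo=timezone.utc)
t2 = datetime(2021, 9, 14, 14, 51, 18, tzinfo=timezone.utc)
n1, n2 = 266_803, 271_282

days = (t2 - t1).total_seconds() / 86_400
print(f"{(n2 - n1) / days:.0f} origins/day")  # roughly 850 origins/day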

A first round of checks validates the new diffs that make the current Mercurial loader faster.

  • worker0 (staging) is running within a venv with D6240 (more extid filtering) and D6268 (build snapshot)
  • worker17 (prod) is running with the latest packaged Mercurial loader (no optimizations)

The filtering is still happening client side, in the Mercurial loader (that is, without D6275).

tl;dr

|-----------------+------------+--------------------------------------------+-----------+--------------------------------------------|
| machine         | 1st run    | snapshot                                   | 2nd run   | snapshot                                   |
|-----------------+------------+--------------------------------------------+-----------+--------------------------------------------|
| worker17        | 12m11.504s | \xa2cd5ce3505f3b8bd7311455d0d200f7c5ad6294 | 0m34.106s | None                                       |
| worker0.staging | 8m47.374s  | \xa2cd5ce3505f3b8bd7311455d0d200f7c5ad6294 | 0m32.484s | \xa2cd5ce3505f3b8bd7311455d0d200f7c5ad6294 |
|-----------------+------------+--------------------------------------------+-----------+--------------------------------------------|

Details in [1]

[1] P1167.

Another run, on a large repository (which cannot finish; the error is independent
though) [1].

Without the patches now deployed in production, each attempt would have taken around
the same amount of time. From now on, only the first round takes a long time, whether
it finishes or not: as extids and revisions are stored along the way, the next
ingestion round is faster because it picks up the newly stored extid mappings.

|----------------------+--------------------|
| First round          | Second round       |
|----------------------+--------------------|
| real    1030m32.795s | real    12m35.640s |
|----------------------+--------------------|

[1] P1169
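
This works because the loader records the extid mapping as it stores revisions, so even an interrupted run leaves reusable state behind. A minimal sketch of that idea, assuming the ExtID model and the storage extid_add endpoint roughly as they exist in swh.model/swh.storage (import paths, field names and the helper itself are assumptions, not the loader's actual code):

from swh.model.model import ExtID
from swh.model.swhids import CoreSWHID


def record_hg_changeset(storage, hg_nodeid: bytes, revision_swhid: CoreSWHID) -> None:
    """Persist the hg node-id -> revision mapping alongside the revision itself.

    If the overall run later fails, the next attempt can look this mapping up
    and skip the changeset (see the filtering sketch further down).
    """
    storage.extid_add(
        [
            ExtID(
                extid_type="hg-nodeid",  # external id namespace used by the hg loader
                extid=hg_nodeid,         # raw mercurial node id
                target=revision_swhid,   # SWHID of the revision just stored
            )
        ]
    )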

The loader_oneshot workers have now been restarted and make use of the latest Mercurial loader v2.2.0.

Datapoint:

12:24:21 softwareheritage@belvedere:5432=> select now(), count(distinct url) from origin o inner join origin_visit ov on o.id=ov.origin where o.url like 'https://bitbucket.org/%' and ov.type='hg';
+-------------------------------+--------+
|              now              | count  |
+-------------------------------+--------+
| 2021-09-17 10:24:22.953431+00 | 273002 |
+-------------------------------+--------+
(1 row)

Time: 217785.472 ms (03:37.785)

Now most of the time is spent reading the extid -> hg node-id mapping [1] to filter out changesets we have already seen.
This does not change much for visits of origins that already ended up in a snapshot,
but it changes a lot for visits on forks, where we can bypass work already done on the original repositories.

Note that there are still some from_dict calls for the mappings; those come from reading back from the storage.
As the filtering is still happening client side, I gather that applying the last patch, which moves the
filtering server side, would drop some more spurious work.

[1]
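
For the record, this is roughly what the client-side filtering amounts to; a sketch only, assuming a storage endpoint like extid_get_from_extid(extid_type, ids) as exposed by swh.storage (the helper name and exact signature here are assumptions, not the code of the deployed diffs):

from typing import Iterable, List, Set


def filter_unknown_nodeids(storage, hg_nodeids: Iterable[bytes]) -> List[bytes]:
    """Return only the hg node ids with no extid mapping in the archive yet."""
    candidates = list(hg_nodeids)
    known: Set[bytes] = set()
    # Ask the archive which of these node ids already map to a revision.
    for extid in storage.extid_get_from_extid("hg-nodeid", candidates):
        known.add(extid.extid)
    return [nodeid for nodeid in candidates if nodeid not in known]

Moving this server side (the last patch mentioned above) essentially pushes the set intersection into the storage service, so the loader no longer pulls every matching ExtID row over the wire only to discard most of them.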

Deployed the Mercurial loader v2.3 (with server-side filtering).
As expected, less time is spent in the method fetching the new changesets.

datapoint:

10:02:40 softwareheritage@belvedere:5432=> select now(), count(distinct url) from origin o inner join origin_visit ov on o.id=ov.origin where o.url like 'https://bitbucket.org/%' and ov.type='hg';
+------------------------------+--------+
|             now              | count  |
+------------------------------+--------+
| 2021-09-20 10:04:30.89072+00 | 280995 |
+------------------------------+--------+
(1 row)

Time: 223465.755 ms (03:43.466)

FWIW, the RabbitMQ queue is at 237683 messages (it was around 280k on Friday, prior to v2.2, so the ingestion of those went much faster).
This might be a better progress metric than the origin count (as some hg origins already existed prior to the actual ingestion of the bitbucket backup...).
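
For completeness, the queue depth is easy to poll programmatically; a small sketch against the RabbitMQ management HTTP API (host, credentials, vhost and queue name are placeholders, not the production values):

import requests

# Placeholder endpoint: /api/queues/<vhost>/<queue> on the management plugin;
# "%2f" is the url-encoded default vhost "/".
QUEUE_URL = "http://rabbitmq.example.org:15672/api/queues/%2f/loader-oneshot"


def queue_depth() -> int:
    """Return the number of messages currently sitting in the queue."""
    response = requests.get(QUEUE_URL, auth=("guest", "guest"), timeout=10)
    response.raise_for_status()
    return response.json()["messages"]


print(queue_depth())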

I've patched the systemd swh-worker@loader_oneshot unit to use celery's --autoscale
10,20 cli option. It's holding up fine, and that, coupled with the server-side
filtering, makes for a huge bump in speed. The archive DB does not seem to mind at all.
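
For reference, the same autoscale setting expressed programmatically; a sketch with Celery 5.x, where the app name is a placeholder and the production change was made on the systemd unit's celery command line instead:

from celery import Celery

app = Celery("swh")  # placeholder app; the real workers use swh's celery setup

if __name__ == "__main__":
    # "--autoscale" value copied verbatim from the unit patch above
    # (celery documents the option as max,min concurrency).
    app.worker_main(argv=["worker", "--autoscale=10,20", "--loglevel=INFO"])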

18:02:11 softwareheritage@belvedere:5432=> select now(), count(distinct url) from origin o inner join origin_visit ov on o.id=ov.origin where o.url like 'https://bitbucket.org/%' and ov.type='hg';
+-------------------------------+--------+
|              now              | count  |
+-------------------------------+--------+
| 2021-09-20 16:02:13.517665+00 | 282491 |
+-------------------------------+--------+
(1 row)

Time: 176714.410 ms (02:56.714)

rabbit: 231587

> I've patched the systemd swh-worker@loader_oneshot unit to use celery's --autoscale 10,20 cli option

saam's NFS server had a bit of a hard time with so many reads (the backup is read over NFS) and writes (objstorage),
so I decreased the concurrency back to 10 as before.


Anyway, it's now as fast as it can be.
So closing this now.