
Analyze and make the bitbucket ingestion faster
Closed, Migrated

Description

The loader is apparently [1] flushing new objects to the storage very frequently,
so a bump in the buffer configuration should help [2].

A lot of time is also spent in the to_model/from_dict [1] conversions.
As most of the CPUs are fully used [3], bumping the machine's CPU count
should help as well.

[1]

[2]

[3] https://grafana.softwareheritage.org/goto/AQRE2mInk?orgId=1
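
For illustration, here is the kind of change meant by "a bump in the buffer configuration": a minimal sketch assuming the swh.storage pipeline/buffer proxy and its min_batch_size knob; the thresholds and storage URL below are illustrative assumptions, not the values actually deployed.

# Sketch only: make the buffer proxy accumulate bigger batches so the loader
# flushes to the backend storage less often.
from swh.storage import get_storage

storage = get_storage(
    cls="pipeline",
    steps=[
        {
            "cls": "buffer",
            "min_batch_size": {  # illustrative thresholds, not production values
                "content": 10_000,
                "directory": 5_000,
                "revision": 5_000,
                "release": 1_000,
            },
        },
        {"cls": "remote", "url": "http://storage.internal:5002/"},  # placeholder URL
    ],
)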

Event Timeline

ardumont triaged this task as Normal priority. Sep 7 2021, 10:33 AM
ardumont created this task.
ardumont changed the task status from Open to Work in Progress. Sep 9 2021, 12:32 PM
ardumont moved this task from Backlog to in-progress on the System administration board.

By the way, I forgot:

10:05:31 softwareheritage@belvedere:5432=> select now(), count(distinct url) from origin o inner join origin_visit ov on o.id=ov.origin where o.url like 'https://bitbucket.org/%' and ov.type='hg';
+------------------------------+--------+
|             now              | count  |
+------------------------------+--------+
| 2021-09-09 08:05:31.90589+00 | 266803 |
+------------------------------+--------+
(1 row)

Time: 173117.724 ms (02:53.118)
16:45:59 softwareheritage@belvedere:5432=> select now(), count(distinct url) from origin o inner join origin_visit ov on o.id=ov.origin where o.url like 'https://bitbucket.org/%' and ov.type='hg';
+-------------------------------+--------+
|              now              | count  |
+-------------------------------+--------+
| 2021-09-14 14:51:18.684494+00 | 271282 |
+-------------------------------+--------+
(1 row)

Time: 175008.966 ms (02:55.009)
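
A quick back-of-envelope on those two datapoints gives the baseline ingestion rate before the optimizations landed (simple arithmetic over the numbers above, nothing measured independently):

from datetime import datetime, timezone

# Timestamps and counts copied from the two query results above.
t1 = datetime(2021, 9, 9, 8, 5, 31, tzinfo=timezone.utc)
t2 = datetime(2021, 9, 14, 14, 51, 18, tzinfo=timezone.utc)
n1, n2 = 266_803, 271_282

days = (t2 - t1).total_seconds() / 86_400
print(f"{(n2 - n1) / days:.0f} origins/day")  # roughly 850 origins/day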

A first round of checks validates the new diffs that make the current Mercurial loader faster.

  • worker0 (staging) is running within a venv with D6240 (more extid filtering) and D6268 (build snapshot)
  • worker17 (prod) is running with the latest packaged Mercurial loader (no optimizations)

The filtering is still happening client side, in the Mercurial loader (that is, without D6275).

tl;dr

|-----------------+------------+--------------------------------------------+-----------+--------------------------------------------|
| machine         | 1st run    | snapshot                                   | 2nd run   | snapshot                                   |
|-----------------+------------+--------------------------------------------+-----------+--------------------------------------------|
| worker17        | 12m11.504s | \xa2cd5ce3505f3b8bd7311455d0d200f7c5ad6294 | 0m34.106s | None                                       |
| worker0.staging | 8m47.374s  | \xa2cd5ce3505f3b8bd7311455d0d200f7c5ad6294 | 0m32.484s | \xa2cd5ce3505f3b8bd7311455d0d200f7c5ad6294 |
|-----------------+------------+--------------------------------------------+-----------+--------------------------------------------|

Details in [1]

[1] P1167.

Another run, on a large repository (which cannot finish; the error is independent
though) [1].

Without the patches now deployed in production, each attempt would have taken around
the same amount of time. From now on, only the first round takes a long time, whether
it finishes or not: as extids and revisions are stored along the way, the next
ingestion round is faster because it picks up the newly stored extid mappings.

|----------------------+--------------------|
| First round          | Second round       |
|----------------------+--------------------|
| real    1030m32.795s | real    12m35.640s |
|----------------------+--------------------|

[1] P1169
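
This works because the loader records the extid mapping as it stores revisions, so even an interrupted run leaves reusable state behind. A minimal sketch of that idea, assuming the ExtID model and the storage extid_add endpoint roughly as they exist in swh.model/swh.storage (import paths, field names and the helper itself are assumptions, not the loader's actual code):

from swh.model.model import ExtID
from swh.model.swhids import CoreSWHID


def record_hg_changeset(storage, hg_nodeid: bytes, revision_swhid: CoreSWHID) -> None:
    """Persist the hg node-id -> revision mapping alongside the revision itself.

    If the overall run later fails, the next attempt can look this mapping up
    and skip the changeset (see the filtering sketch further down).
    """
    storage.extid_add(
        [
            ExtID(
                extid_type="hg-nodeid",  # external id namespace used by the hg loader
                extid=hg_nodeid,         # raw mercurial node id
                target=revision_swhid,   # SWHID of the revision just stored
            )
        ]
    )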

The loader_oneshot workers have now been restarted and make use of the latest Mercurial loader v2.2.0.

Datapoint:

12:24:21 softwareheritage@belvedere:5432=> select now(), count(distinct url) from origin o inner join origin_visit ov on o.id=ov.origin where o.url like 'https://bitbucket.org/%' and ov.type='hg';
+-------------------------------+--------+
|              now              | count  |
+-------------------------------+--------+
| 2021-09-17 10:24:22.953431+00 | 273002 |
+-------------------------------+--------+
(1 row)

Time: 217785.472 ms (03:37.785)

Now most of the time is spent reading the extid -> hg node-id mapping [1] to filter out changesets we have already seen.
This does not change much for visits of origins that already ended up in a snapshot,
but it changes a lot for visits on forks, where we can bypass work already done on the original repositories.

Note that there are still some from_dict calls for the mappings; those come from reading back from the storage.
As the filtering is still happening client side, I gather that applying the last patch, which moves the
filtering server side, would drop some more spurious work.

[1]
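
For the record, this is roughly what the client-side filtering amounts to; a sketch only, assuming a storage endpoint like extid_get_from_extid(extid_type, ids) as exposed by swh.storage (the helper name and exact signature here are assumptions, not the code of the deployed diffs):

from typing import Iterable, List, Set


def filter_unknown_nodeids(storage, hg_nodeids: Iterable[bytes]) -> List[bytes]:
    """Return only the hg node ids with no extid mapping in the archive yet."""
    candidates = list(hg_nodeids)
    known: Set[bytes] = set()
    # Ask the archive which of these node ids already map to a revision.
    for extid in storage.extid_get_from_extid("hg-nodeid", candidates):
        known.add(extid.extid)
    return [nodeid for nodeid in candidates if nodeid not in known]

Moving this server side (the last patch mentioned above) essentially pushes the set intersection into the storage service, so the loader no longer pulls every matching ExtID row over the wire only to discard most of them.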

Deployed the Mercurial loader v2.3 (with server-side filtering).
As expected, less time is spent in the method fetching the new changesets.

datapoint:

10:02:40 softwareheritage@belvedere:5432=> select now(), count(distinct url) from origin o inner join origin_visit ov on o.id=ov.origin where o.url like 'https://bitbucket.org/%' and ov.type='hg';
+------------------------------+--------+
|             now              | count  |
+------------------------------+--------+
| 2021-09-20 10:04:30.89072+00 | 280995 |
+------------------------------+--------+
(1 row)

Time: 223465.755 ms (03:43.466)

FWIW, the RabbitMQ queue is at 237683 messages (it was around 280k on Friday, prior to v2.2, so the ingestion of those went much faster).
This might be a better progress metric than the origin count (as some hg origins already existed prior to the actual ingestion of the bitbucket backup...).
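
For completeness, the queue depth is easy to poll programmatically; a small sketch against the RabbitMQ management HTTP API (host, credentials, vhost and queue name are placeholders, not the production values):

import requests

# Placeholder endpoint: /api/queues/<vhost>/<queue> on the management plugin;
# "%2f" is the url-encoded default vhost "/".
QUEUE_URL = "http://rabbitmq.example.org:15672/api/queues/%2f/loader-oneshot"


def queue_depth() -> int:
    """Return the number of messages currently sitting in the queue."""
    response = requests.get(QUEUE_URL, auth=("guest", "guest"), timeout=10)
    response.raise_for_status()
    return response.json()["messages"]


print(queue_depth())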

I've patched the systemd swh-worker@loader_oneshot unit to use celery's --autoscale
10,20 cli option. It's holding up fine, and that, coupled with the server-side
filtering, makes for a huge bump in speed. The archive DB does not seem to mind at all.
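
For reference, the same autoscale setting expressed programmatically; a sketch with Celery 5.x, where the app name is a placeholder and the production change was made on the systemd unit's celery command line instead:

from celery import Celery

app = Celery("swh")  # placeholder app; the real workers use swh's celery setup

if __name__ == "__main__":
    # "--autoscale" value copied verbatim from the unit patch above
    # (celery documents the option as max,min concurrency).
    app.worker_main(argv=["worker", "--autoscale=10,20", "--loglevel=INFO"])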

18:02:11 softwareheritage@belvedere:5432=> select now(), count(distinct url) from origin o inner join origin_visit ov on o.id=ov.origin where o.url like 'https://bitbucket.org/%' and ov.type='hg';
+-------------------------------+--------+
|              now              | count  |
+-------------------------------+--------+
| 2021-09-20 16:02:13.517665+00 | 282491 |
+-------------------------------+--------+
(1 row)

Time: 176714.410 ms (02:56.714)

rabbit: 231587

> I've patched the systemd swh-worker@loader_oneshot unit to use celery's --autoscale 10,20 cli option

saam's NFS server had a bit of a hard time with so many reads (the backup is read over NFS) and writes (objstorage),
so I decreased the concurrency back to 10 as before.


Anyway, it's now as fast as it can be.
So closing this now.