
Load the archived bitbucket mercurial repositories
Status: Work in Progress · Priority: High · Visibility: Public

Description

When swh.loader.mercurial 1.0 is deployed in production, schedule the loading of all the archived bitbucket repos from the mapping file provided by @Alphare; monitor the completion of the tasks.

It would probably make sense to set up a new worker instance for this to avoid interfering with the regular loading.
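The scheduling step could be sketched as below: read the mapping file, keep the origin URLs, and turn them into loader task descriptions. This is a hedged illustration only; the mapping-file line format ("<origin-url> <local-path>") and the task dict shape with type "load-hg" are assumptions, not taken from the actual file or scheduler.

```python
# Hypothetical sketch: build swh-scheduler-style task dicts from a
# bitbucket mapping file. Line format and task-type name are assumed.
def tasks_from_mapping(lines):
    """Turn mapping-file lines ("<origin-url> <local-path>") into
    task dicts, skipping blanks and comments."""
    tasks = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        url = line.split()[0]
        tasks.append({"type": "load-hg", "arguments": {"url": url}})
    return tasks

sample = [
    "https://bitbucket.org/user/repo-a /srv/boatbucket/a",
    "# comment",
    "https://bitbucket.org/user/repo-b /srv/boatbucket/b",
]
tasks = tasks_from_mapping(sample)
print(len(tasks))  # 2
```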

Event Timeline

The mapping file is located (on the boatbucket machine) at /srv/boatbucket/mapping-to-repos.txt. It does *not* contain the (very few) outright corrupted repositories, I might have to do some digging and even bother the BB team again to get the URL for those.

In the meantime, we agreed that the smoke-test on the staging environment should happen on:

  • https://mercurial-scm.org/repo/hg for the remote case of a non-trivial repo
  • refugees/data/54/54220cd1-b139-4188-9455-1e13e663f1ac/main-repo.mercurial/ (PyPy for a local large-ish repo)
  • any corrupted repository in /srv/boatbucket/extra-repos/corrupted/ just in case

For posterity, I have tested that all corrupted and "verify failed" repositories in the archive load correctly, as well as the humongous Mozilla-unified, PyPy and a few thousand other random ones from the archive. Aside from the incremental loading issues detailed in T3336 (which should be fixed in today's run), everything seems fine.

That's awesome news, thanks for the heads up \o/.

The run from this weekend, detailed in T3336, appears to have worked fine. (Just making sure it's obvious from this task as well.)

As mentioned in T3336, we've now passed 3000 repos loaded successfully in staging. We've had two failures due to attempting to add two identical objects concurrently, which is something my simple test script wouldn't catch, but would be handled properly by an actual worker process.

Once we go to production we should expect the throughput to be quite a bit faster.
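The "two identical objects added concurrently" failure mode described above can be made a non-issue by treating the insert as idempotent. A minimal sketch, using sqlite's INSERT OR IGNORE as a stand-in for PostgreSQL's INSERT ... ON CONFLICT DO NOTHING; the table and column names are made up for illustration:

```python
# Sketch: a worker treats "object already inserted by a concurrent
# writer" as success instead of an error. sqlite3 stands in for the
# real storage backend; schema is hypothetical.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE content (sha1 TEXT PRIMARY KEY, length INTEGER)")

def add_content(sha1, length):
    # A plain INSERT would raise IntegrityError if another worker got
    # there first; OR IGNORE turns the duplicate into a no-op.
    db.execute("INSERT OR IGNORE INTO content VALUES (?, ?)", (sha1, length))

# Two "workers" adding the same object: no error, exactly one row.
add_content("abc123", 10)
add_content("abc123", 10)
count = db.execute("SELECT count(*) FROM content").fetchone()[0]
print(count)  # 1
```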

Latest mercurial loader v2.1 deployed [1] [2]
We should be able to continue with this now.

[1] T3447

[2] T3448

It would probably make sense to set up a new worker instance for this to avoid interfering with the regular loading.

Yes, worker17 is currently only ingesting mercurial origins from sourceforge [1].
It has some margin to work some more so I'll reuse it.

[1] https://grafana.softwareheritage.org/goto/E-2av3Znk?orgId=1

ardumont changed the task status from Open to Work in Progress.Jul 30 2021, 1:04 PM
ardumont moved this task from Weekly backlog to in-progress on the System administration board.

Started in the same tmux session [2] as the sourceforge ingestion [1]

Grafana tag updated on the dashboard mentioned in the previous comment.

[1] As of this comment, number of origins matching this:

12:45:35 softwareheritage@belvedere:5432=> select count(distinct url) from origin o inner join origin_visit ov on o.id=ov.origin where o.url like 'https://bitbucket.org/%' and ov.type='hg';
+--------+
| count  |
+--------+
| 253841 |
+--------+
(1 row)

(We already have some, so the increase might not be that high; still, it gives a datapoint.)

[2] After the outage, though, the node restarted and I forgot to create the tmux session under root, so it's under my login ardumont... still, it's shareable.

(Claiming the task to find it back more easily through my activity view.)

Progressing:

13:40:06 softwareheritage@belvedere:5432=> select now(), count(distinct url) from origin o inner join origin_visit ov on o.id=ov.origin where o.url like 'https://bitbucket.org/%' and ov.type='hg';
+-------------------------------+--------+
|              now              | count  |
+-------------------------------+--------+
| 2021-07-30 11:39:37.122152+00 | 253848 |
+-------------------------------+--------+
(1 row)

Time: 159807.333 ms (02:39.807)
13:42:48 softwareheritage@belvedere:5432=> select now(), count(distinct url) from origin o inner join origin_visit ov on o.id=ov.origin where o.url like 'https://bitbucket.org/%' and ov.type='hg';
+-------------------------------+--------+
|              now              | count  |
+-------------------------------+--------+
| 2021-07-30 12:42:57.883811+00 | 253877 |
+-------------------------------+--------+
(1 row)

Time: 198293.332 ms (03:18.293)

Still progressing. It's not fast, though, since it's deployed rather simply: it processes
one origin at a time, so it can end up seemingly stuck behind a big repository (currently
[1]):

08:52:27 softwareheritage@belvedere:5432=> select now(), count(distinct url) from origin o inner join origin_visit ov on o.id=ov.origin where o.url like 'https://bitbucket.org/%' and ov.type='hg';
+-------------------------------+--------+
|              now              | count  |
+-------------------------------+--------+
| 2021-08-03 06:51:55.783815+00 | 254319 |
+-------------------------------+--------+
(1 row)

Time: 188942.400 ms (03:08.942)

I don't plan to work on making it go faster immediately though, so I'll create a task for it [2]

[1] https://bitbucket.org/rhelmer/mozilla-central (that's possibly a fork of a big repo imsmw)

[2] T3455
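The counts in the comments above can be turned into a rough throughput estimate for the single sequential worker. A small sketch using the two datapoints already reported (2021-07-30 at 253848 origins, 2021-08-03 at 254319):

```python
# Rough ingestion rate from two of the reported counts of distinct
# bitbucket hg origins with at least one visit.
from datetime import datetime

datapoints = [
    (datetime(2021, 7, 30, 11, 39, 37), 253848),
    (datetime(2021, 8, 3, 6, 51, 55), 254319),
]
(t0, c0), (t1, c1) = datapoints
days = (t1 - t0).total_seconds() / 86400
rate = (c1 - c0) / days
print(f"{rate:.0f} origins/day")  # → "124 origins/day"
```

About 124 origins/day, which is consistent with the "not fast" observation and motivates the concurrency work in the follow-up task.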

Concurrent deployment is ongoing ^, so this should go faster now; a datapoint for later.
(The mozilla-central fork from the previous comment is still ongoing...)

17:06:45 softwareheritage@belvedere:5432=> select now(), count(distinct url) from origin o inner join origin_visit ov on o.id=ov.origin where o.url like 'https://bitbucket.org/%' and ov.type='hg';
+-------------------------------+--------+
|              now              | count  |
+-------------------------------+--------+
| 2021-08-05 15:07:59.084584+00 | 254442 |
+-------------------------------+--------+
(1 row)

Time: 155600.816 ms (02:35.601)

So this ingestion got stopped or crashed, at some point.
Probably around the db outage from last week (which emptied the rabbitmq queue).

I've triggered it again (a simple shuffle over all the bitbucket origins; I don't have time to make it a delta).
It's currently running (data point ongoing...).

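The "simple shuffle" re-trigger could be sketched as follows: instead of computing which origins are still missing (a delta), shuffle the full origin list and re-schedule everything, relying on already-loaded origins turning into cheap incremental visits. The batching and URL shapes here are illustrative only.

```python
# Sketch: shuffle all origins and yield them in scheduling batches.
# Batch size and origin URLs are made up for illustration.
import random

def shuffled_batches(origins, batch_size=1000, seed=None):
    origins = list(origins)
    random.Random(seed).shuffle(origins)
    for i in range(0, len(origins), batch_size):
        yield origins[i:i + batch_size]

origins = [f"https://bitbucket.org/ns/repo-{n}" for n in range(2500)]
batches = list(shuffled_batches(origins, batch_size=1000, seed=42))
print([len(b) for b in batches])  # [1000, 1000, 500]
```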

17:17:52 softwareheritage@belvedere:5432=> select now(), count(distinct url) from origin o inner join origin_visit ov on o.id=ov.origin where o.url like 'https://bitbucket.org/%' and ov.type='hg';
+-------------------------------+--------+
|              now              | count  |
+-------------------------------+--------+
| 2021-08-27 15:17:13.217029+00 | 260369 |
+-------------------------------+--------+
(1 row)

Time: 251776.724 ms (04:11.777)

It's currently working through some large repositories (and some other large sourceforge svn repositories).

I've paused the sourceforge part for a bit to let the bitbucket ingestion progress on its own. I've
restarted some more processes so it goes a bit faster (hopefully) and will let the large ones finish
in due time.

15:02:23 softwareheritage@belvedere:5432=> select now(), count(distinct url) from origin o inner join origin_visit ov on o.id=ov.origin where o.url like 'https://bitbucket.org/%' and ov.type='hg';
+-------------------------------+--------+
|              now              | count  |
+-------------------------------+--------+
| 2021-08-30 13:01:40.926225+00 | 261213 |
+-------------------------------+--------+
(1 row)

Time: 216836.839 ms (03:36.837)

Ongoing ingestion is rather slow [1].
As with git origins, we can't know in advance which repositories are large.
So ingestion sometimes appears stuck because we are dealing with large repositories (more than 2 hours of loading [2]).

[1]

16:48:50 softwareheritage@belvedere:5432=> select now(), count(distinct url) from origin o inner join origin_visit ov on o.id=ov.origin where o.url like 'https://bitbucket.org/%' and ov.type='hg';
+-------------------------------+--------+
|              now              | count  |
+-------------------------------+--------+
| 2021-08-31 14:48:09.237384+00 | 261717 |
+-------------------------------+--------+
(1 row)

Time: 192951.994 ms (03:12.952)

[2] https://grafana.softwareheritage.org/goto/wvwKWJ47k?orgId=1

^ Temporarily disabled puppet agent and bumped the concurrency for that worker to 10 (around the time of the previous comment).

I'm gonna let this run for a while.
No guarantee that we won't hit the same large repository pattern again though.

09:11:37 softwareheritage@belvedere:5432=> select now(), count(distinct url) from origin o inner join origin_visit ov on o.id=ov.origin where o.url like 'https://bitbucket.org/%' and ov.type='hg';
+-------------------------------+--------+
|              now              | count  |
+-------------------------------+--------+
| 2021-09-01 07:10:55.473409+00 | 262109 |
+-------------------------------+--------+
(1 row)

Time: 168035.990 ms (02:48.036)

New day, new datapoint [1]

I've opened T3563 and adapted the worker a bit.
Things should go faster now; I'll double-check later that it's actually the case.

[1]

10:58:20 softwareheritage@belvedere:5432=> select now(), count(distinct url) from origin o inner join origin_visit ov on o.id=ov.origin where o.url like 'https://bitbucket.org/%' and ov.type='hg';
+-------------------------------+--------+
|              now              | count  |
+-------------------------------+--------+
| 2021-09-07 08:58:20.097432+00 | 264391 |
+-------------------------------+--------+
(1 row)

Time: 187294.303 ms (03:07.294)