
Load the archived bitbucket mercurial repositories
Status: Work in Progress · Priority: High · Visibility: Public

Description

When swh.loader.mercurial 1.0 is deployed in production, schedule the loading of all the archived bitbucket repos from the mapping file provided by @Alphare; monitor the completion of the tasks.

It would probably make sense to set up a new worker instance for this to avoid interfering with the regular loading.
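The scheduling step could be sketched as below: read the mapping file, keep the origin URLs, and turn them into loader task descriptions. This is a hedged illustration only; the mapping-file line format ("<origin-url> <local-path>") and the task dict shape with type "load-hg" are assumptions, not taken from the actual file or scheduler.

```python
# Hypothetical sketch: build swh-scheduler-style task dicts from a
# bitbucket mapping file. Line format and task-type name are assumed.
def tasks_from_mapping(lines):
    """Turn mapping-file lines ("<origin-url> <local-path>") into
    task dicts, skipping blanks and comments."""
    tasks = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        url = line.split()[0]
        tasks.append({"type": "load-hg", "arguments": {"url": url}})
    return tasks

sample = [
    "https://bitbucket.org/user/repo-a /srv/boatbucket/a",
    "# comment",
    "https://bitbucket.org/user/repo-b /srv/boatbucket/b",
]
tasks = tasks_from_mapping(sample)
print(len(tasks))  # 2
```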

Event Timeline

The mapping file is located (on the boatbucket machine) at /srv/boatbucket/mapping-to-repos.txt. It does *not* contain the (very few) outright corrupted repositories, I might have to do some digging and even bother the BB team again to get the URL for those.

In the meantime, we agreed that the smoke-test on the staging environment should happen on:

  • https://mercurial-scm.org/repo/hg for the remote case of a non-trivial repo
  • refugees/data/54/54220cd1-b139-4188-9455-1e13e663f1ac/main-repo.mercurial/ (PyPy for a local large-ish repo)
  • any corrupted repository in /srv/boatbucket/extra-repos/corrupted/ just in case

For posterity, I have tested that all corrupted and "verify failed" repositories in the archive load correctly, as well as the humongous Mozilla-unified, PyPy and a few thousand other random ones from the archive. Aside from the incremental loading issues detailed in T3336 (which should be fixed in today's run), everything seems fine.

That's awesome news, thanks for the heads up \o/.

The run from this weekend, detailed in T3336, appears to have worked fine. (Just making sure it's obvious from this task as well.)

As mentioned in T3336, we've now passed 3000 repos loaded successfully in staging. We've had two failures due to attempting to add two identical objects concurrently, which is something my simple test script wouldn't catch, but would be handled properly by an actual worker process.

Once we go to production we should expect the throughput to be quite a bit faster.
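The "two identical objects added concurrently" failure mode described above can be made a non-issue by treating the insert as idempotent. A minimal sketch, using sqlite's INSERT OR IGNORE as a stand-in for PostgreSQL's INSERT ... ON CONFLICT DO NOTHING; the table and column names are made up for illustration:

```python
# Sketch: a worker treats "object already inserted by a concurrent
# writer" as success instead of an error. sqlite3 stands in for the
# real storage backend; schema is hypothetical.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE content (sha1 TEXT PRIMARY KEY, length INTEGER)")

def add_content(sha1, length):
    # A plain INSERT would raise IntegrityError if another worker got
    # there first; OR IGNORE turns the duplicate into a no-op.
    db.execute("INSERT OR IGNORE INTO content VALUES (?, ?)", (sha1, length))

# Two "workers" adding the same object: no error, exactly one row.
add_content("abc123", 10)
add_content("abc123", 10)
count = db.execute("SELECT count(*) FROM content").fetchone()[0]
print(count)  # 1
```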

Latest mercurial loader v2.1 deployed [1] [2]
We should be able to continue with this now.

[1] T3447

[2] T3448

It would probably make sense to set up a new worker instance for this to avoid interfering with the regular loading.

Yes, worker17 is currently only ingesting mercurial origins from sourceforge [1].
It has some margin to work some more so I'll reuse it.

[1] https://grafana.softwareheritage.org/goto/E-2av3Znk?orgId=1

ardumont changed the task status from Open to Work in Progress.Jul 30 2021, 1:04 PM
ardumont moved this task from Weekly backlog to in-progress on the System administration board.

Started in the same tmux session [2] as the sourceforge ingestion [1]

Grafana tag updated on the dashboard mentioned in the previous comment.

[1] As of this comment, number of origins matching this:

12:45:35 softwareheritage@belvedere:5432=> select count(distinct url) from origin o inner join origin_visit ov on o.id=ov.origin where o.url like 'https://bitbucket.org/%' and ov.type='hg';
+--------+
| count  |
+--------+
| 253841 |
+--------+
(1 row)

(We already have some, so the increase might not be that high; still, it gives a datapoint.)

[2] After the outage, though, the node restarted and I forgot to create the tmux session under root, so it's under my login ardumont... still, it's shareable.

(Claiming the task to find it back more easily through my activity view.)

Progressing:

13:40:06 softwareheritage@belvedere:5432=> select now(), count(distinct url) from origin o inner join origin_visit ov on o.id=ov.origin where o.url like 'https://bitbucket.org/%' and ov.type='hg';
+-------------------------------+--------+
|              now              | count  |
+-------------------------------+--------+
| 2021-07-30 11:39:37.122152+00 | 253848 |
+-------------------------------+--------+
(1 row)

Time: 159807.333 ms (02:39.807)
13:42:48 softwareheritage@belvedere:5432=> select now(), count(distinct url) from origin o inner join origin_visit ov on o.id=ov.origin where o.url like 'https://bitbucket.org/%' and ov.type='hg';
+-------------------------------+--------+
|              now              | count  |
+-------------------------------+--------+
| 2021-07-30 12:42:57.883811+00 | 253877 |
+-------------------------------+--------+
(1 row)

Time: 198293.332 ms (03:18.293)

Still progressing. It's not fast, though, since it's deployed rather simply: it processes
one origin at a time, so it can end up seemingly stuck behind a big repository (currently
[1]):

08:52:27 softwareheritage@belvedere:5432=> select now(), count(distinct url) from origin o inner join origin_visit ov on o.id=ov.origin where o.url like 'https://bitbucket.org/%' and ov.type='hg';
+-------------------------------+--------+
|              now              | count  |
+-------------------------------+--------+
| 2021-08-03 06:51:55.783815+00 | 254319 |
+-------------------------------+--------+
(1 row)

Time: 188942.400 ms (03:08.942)

I don't plan to work on making it go faster immediately though, so I'll create a task for it [2]

[1] https://bitbucket.org/rhelmer/mozilla-central (that's possibly a fork of a big repo imsmw)

[2] T3455
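The counts in the comments above can be turned into a rough throughput estimate for the single sequential worker. A small sketch using the two datapoints already reported (2021-07-30 at 253848 origins, 2021-08-03 at 254319):

```python
# Rough ingestion rate from two of the reported counts of distinct
# bitbucket hg origins with at least one visit.
from datetime import datetime

datapoints = [
    (datetime(2021, 7, 30, 11, 39, 37), 253848),
    (datetime(2021, 8, 3, 6, 51, 55), 254319),
]
(t0, c0), (t1, c1) = datapoints
days = (t1 - t0).total_seconds() / 86400
rate = (c1 - c0) / days
print(f"{rate:.0f} origins/day")  # → "124 origins/day"
```

About 124 origins/day, which is consistent with the "not fast" observation and motivates the concurrency work in the follow-up task.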

Concurrent deployment is ongoing ^, so this should go faster now; a datapoint for later.
(The mozilla-central fork from the previous comment is still ongoing...)

17:06:45 softwareheritage@belvedere:5432=> select now(), count(distinct url) from origin o inner join origin_visit ov on o.id=ov.origin where o.url like 'https://bitbucket.org/%' and ov.type='hg';
+-------------------------------+--------+
|              now              | count  |
+-------------------------------+--------+
| 2021-08-05 15:07:59.084584+00 | 254442 |
+-------------------------------+--------+
(1 row)

Time: 155600.816 ms (02:35.601)

So this ingestion got stopped or crashed, at some point.
Probably around the db outage from last week (which emptied the rabbitmq queue).

I've triggered it again (a simple shuffle over all the bitbucket origins; I don't have time to make it a delta).
It's currently running (data point ongoing...).

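The "simple shuffle" re-trigger could be sketched as follows: instead of computing which origins are still missing (a delta), shuffle the full origin list and re-schedule everything, relying on already-loaded origins turning into cheap incremental visits. The batching and URL shapes here are illustrative only.

```python
# Sketch: shuffle all origins and yield them in scheduling batches.
# Batch size and origin URLs are made up for illustration.
import random

def shuffled_batches(origins, batch_size=1000, seed=None):
    origins = list(origins)
    random.Random(seed).shuffle(origins)
    for i in range(0, len(origins), batch_size):
        yield origins[i:i + batch_size]

origins = [f"https://bitbucket.org/ns/repo-{n}" for n in range(2500)]
batches = list(shuffled_batches(origins, batch_size=1000, seed=42))
print([len(b) for b in batches])  # [1000, 1000, 500]
```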

17:17:52 softwareheritage@belvedere:5432=> select now(), count(distinct url) from origin o inner join origin_visit ov on o.id=ov.origin where o.url like 'https://bitbucket.org/%' and ov.type='hg';
+-------------------------------+--------+
|              now              | count  |
+-------------------------------+--------+
| 2021-08-27 15:17:13.217029+00 | 260369 |
+-------------------------------+--------+
(1 row)

Time: 251776.724 ms (04:11.777)

It's currently working through some large repositories (and some other large sourceforge svn repositories).

I've paused the sourceforge part for a bit to let the bitbucket ingestion progress on its own. I've
restarted some more processes so it goes a bit faster (hopefully) and will let the large ones finish
in due time.

15:02:23 softwareheritage@belvedere:5432=> select now(), count(distinct url) from origin o inner join origin_visit ov on o.id=ov.origin where o.url like 'https://bitbucket.org/%' and ov.type='hg';
+-------------------------------+--------+
|              now              | count  |
+-------------------------------+--------+
| 2021-08-30 13:01:40.926225+00 | 261213 |
+-------------------------------+--------+
(1 row)

Time: 216836.839 ms (03:36.837)

Ongoing ingestion is rather slow [1].
As with git origins, we can't know in advance which repositories are large.
So ingestion sometimes appears stuck because we are dealing with large repositories (more than 2 hours of loading [2]).

[1]

16:48:50 softwareheritage@belvedere:5432=> select now(), count(distinct url) from origin o inner join origin_visit ov on o.id=ov.origin where o.url like 'https://bitbucket.org/%' and ov.type='hg';
+-------------------------------+--------+
|              now              | count  |
+-------------------------------+--------+
| 2021-08-31 14:48:09.237384+00 | 261717 |
+-------------------------------+--------+
(1 row)

Time: 192951.994 ms (03:12.952)

[2] https://grafana.softwareheritage.org/goto/wvwKWJ47k?orgId=1

^ Temporarily disabled puppet agent and bumped the concurrency for that worker to 10 (around the time of the previous comment).

I'm gonna let this run for a while.
No guarantee that we won't hit the same large repository pattern again though.

09:11:37 softwareheritage@belvedere:5432=> select now(), count(distinct url) from origin o inner join origin_visit ov on o.id=ov.origin where o.url like 'https://bitbucket.org/%' and ov.type='hg';
+-------------------------------+--------+
|              now              | count  |
+-------------------------------+--------+
| 2021-09-01 07:10:55.473409+00 | 262109 |
+-------------------------------+--------+
(1 row)

Time: 168035.990 ms (02:48.036)

New day, new datapoint [1]

I've opened T3563 and adapted the worker a bit.
Things should go faster now; I'll double-check later that it's actually the case.

[1]

10:58:20 softwareheritage@belvedere:5432=> select now(), count(distinct url) from origin o inner join origin_visit ov on o.id=ov.origin where o.url like 'https://bitbucket.org/%' and ov.type='hg';
+-------------------------------+--------+
|              now              | count  |
+-------------------------------+--------+
| 2021-09-07 08:58:20.097432+00 | 264391 |
+-------------------------------+--------+
(1 row)

Time: 187294.303 ms (03:07.294)