
Load the archived Bitbucket Mercurial repositories
Open, High, Public

Description

When swh.loader.mercurial 1.0 is deployed in production, schedule the loading of all the archived Bitbucket repos from the mapping file provided by @Alphare; monitor the completion of the tasks.

It would probably make sense to set up a new worker instance for this to avoid interfering with the regular loading.
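
For reference, the bulk scheduling could look roughly like the sketch below. This is only a minimal sketch, not the actual procedure: it assumes the mapping file contains one "<origin-url> <local-path>" pair per line, that load-hg is the task type registered by swh.loader.mercurial, and that url and directory are its parameters; all of that should be checked against the real file and the deployed scheduler.

    # Minimal sketch: schedule one loading task per archived Bitbucket repo.
    # Assumptions (to verify first): the mapping file has one
    # "<origin-url> <local-path>" pair per line, "load-hg" is the task type
    # registered by swh.loader.mercurial, and url/directory are its parameters.
    import subprocess

    MAPPING_FILE = "/srv/boatbucket/mapping-to-repos.txt"

    with open(MAPPING_FILE) as mapping:
        for line in mapping:
            parts = line.split()
            if not parts:
                continue  # skip blank lines
            origin_url, local_path = parts[0], parts[1]
            # Hand one oneshot task to the swh scheduler; the (dedicated)
            # worker instance picks it up from there.
            subprocess.run(
                ["swh", "scheduler", "task", "add", "--policy", "oneshot",
                 "load-hg", f"url={origin_url}", f"directory={local_path}"],
                check=True,
            )

In practice the tasks would probably be fed to the scheduler in bulk rather than shelled out one by one, but the overall data flow is the same.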

Event Timeline

The mapping file is located (on the boatbucket machine) at /srv/boatbucket/mapping-to-repos.txt. It does *not* contain the (very few) outright corrupted repositories; I might have to do some digging, and even bother the BB team again, to get the URLs for those.

In the meantime, we agreed that the smoke test on the staging environment should happen on the following (a loading sketch follows the list):

  • https://mercurial-scm.org/repo/hg for the remote case of a non-trivial repo
  • refugees/data/54/54220cd1-b139-4188-9455-1e13e663f1ac/main-repo.mercurial/ (PyPy, as a local, large-ish repo)
  • any corrupted repository in /srv/boatbucket/extra-repos/corrupted/ just in case
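
To kick one of these off by hand, something along the lines of the sketch below should do. It is only a sketch: it assumes the generic "swh loader run" CLI from swh.loader.core is available on the staging worker, that the Mercurial loader is registered there under the name "mercurial", and that the local directory can be passed as a key=value parameter; the placeholder origin URL and the exact parameter names need to be checked against the deployed version.

    # Minimal staging smoke-test sketch: one manual loader run per case above.
    # The registered loader name ("mercurial") and the directory= parameter
    # are assumptions to verify against the deployed swh.loader.mercurial.
    import subprocess

    SMOKE_TESTS = [
        # Remote case: a non-trivial repository.
        ["https://mercurial-scm.org/repo/hg"],
        # Local case: the archived, large-ish PyPy copy.
        ["https://origin.example/pypy",  # placeholder origin URL
         "directory=refugees/data/54/54220cd1-b139-4188-9455-1e13e663f1ac/main-repo.mercurial/"],
        # The corrupted repositories under /srv/boatbucket/extra-repos/corrupted/
        # can be exercised the same way, pointing directory= at one of them.
    ]

    for args in SMOKE_TESTS:
        subprocess.run(["swh", "loader", "run", "mercurial", *args], check=True)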

For posterity, I have tested that all corrupted and "verify failed" repositories in the archive load correctly, as well as the humongous Mozilla-unified, PyPy, and a few thousand other randomly chosen repositories from the archive. Aside from the incremental loading issues detailed in T3336 (which should be fixed in today's run), everything seems fine.

That's awesome news, thanks for the heads up \o/.

The run from this weekend, detailed in T3336, appears to have worked fine (just making sure that is obvious from this task as well).

As mentioned in T3336, we've now passed 3000 repos loaded successfully in staging. We've had two failures caused by two identical objects being added concurrently, which is something my simple test script wouldn't catch but would be handled properly by an actual worker process.

Once we go to production, we should expect the throughput to be quite a bit higher.