This is the minimal amount of code needed to switch from the old one to
the new one. If the new loader proves to be good enough, we may remove
the old one entirely.
Details
- Reviewers
ardumont vlorentz douardda - Group Reviewers
Reviewers - Maniphest Tasks
- T3335: Integrate the new Mercurial loader into the tasks system
- Commits
- rDLDHG92441e90110b: Replace old loader with the new one
Diff Detail
- Repository
- rDLDHG Mercurial loader
- Branch
- corruption
- Lint
No Linters Available - Unit
No Unit Test Coverage - Build Status
Buildable 21168 Build 32854: Phabricator diff pipeline on jenkins Jenkins console · Jenkins Build 32853: arc lint + arc unit
Event Timeline
Build has FAILED
Patch application report for D5628 (id=20068)
Could not rebase; Attempt merge onto f03f274065...
Updating f03f274..acd2f1d Fast-forward swh/loader/mercurial/from_disk.py | 27 +++++++++++++++++++++++---- swh/loader/mercurial/hgutil.py | 3 ++- swh/loader/mercurial/tasks.py | 5 +++-- swh/loader/mercurial/utils.py | 3 ++- 4 files changed, 30 insertions(+), 8 deletions(-)
Changes applied before test
commit acd2f1dc247fe4dc3fd2981d9eb7ccbaa64e120d Author: Raphaël Gomès <rgomes@octobus.net> Date: Tue Apr 27 10:55:16 2021 +0200 Replace old loader with the new one This is the minimal amount of code needed to switch from the old one to the new one. If the new loader proves to be good enough, we may remove the old one entirely. commit 5413622cd6cf1a584d1b9d300b3dd1f5a6e94ce6 Author: Raphaël Gomès <rgomes@octobus.net> Date: Mon Apr 26 23:33:20 2021 +0200 Handle more cases of corruption Some corrupted repos have missing files or broken logical links in the underlying Mercurial datastructure, which means that say sometimes fail for a given revision. This does not mean we should throw away the rest of the repository. (Tested on repos of various levels and flavors of corruption in the Boatbucket archive) commit c6c3b386ef246860e9292ab0331aaf32cf72d61b Author: Raphaël Gomès <rgomes@octobus.net> Date: Mon Apr 26 23:28:50 2021 +0200 Ignore the repository's config `HGRCPATH` only tells Mercurial to ignore the user's config files, but some repositories have a `.hg/hgrc` file (only in the case that you copy the files instead of cloning, if present) that is usually used for server-side configuration. We want to ignore this, since it might affect loading and ask for hooks that are not there or are otherwise annoying/dangerous, for example. commit 2ec0206482f46491086791b6b8718d5094cb4d77 Author: Raphaël Gomès <rgomes@octobus.net> Date: Mon Apr 26 23:26:09 2021 +0200 Also use minimal env in the new Mercurial loader The old loader (bundle2 loader) already received this treatment which ensures Mercurial doesn't pick up on any user customization, but I apparently forgot to apply the same changes to the new one. commit 1d8b26c042011f3271e451b11b670f37b44e8685 Author: Raphaël Gomès <rgomes@octobus.net> Date: Tue Apr 27 10:53:56 2021 +0200 Use billiard instead of stdlib multiprocessing This circumvents a few celery-related issues, and is consistent with what the rest of the codebase does.
Link to build: https://jenkins.softwareheritage.org/job/DLDHG/job/tests-on-diff/208/
See console output for more information: https://jenkins.softwareheritage.org/job/DLDHG/job/tests-on-diff/208/console
Build is green
Patch application report for D5628 (id=20081)
Could not rebase; Attempt merge onto f03f274065...
Updating f03f274..1337196 Fast-forward swh/loader/mercurial/from_disk.py | 27 +++++++++++++++++++++++---- swh/loader/mercurial/hgutil.py | 18 +++++++++++++----- swh/loader/mercurial/tasks.py | 10 +++++----- swh/loader/mercurial/tests/test_hgutil.py | 11 +++++++---- swh/loader/mercurial/tests/test_tasks.py | 4 ++-- swh/loader/mercurial/utils.py | 3 ++- 6 files changed, 52 insertions(+), 21 deletions(-)
Changes applied before test
commit 1337196d87dadebbe089dcc40f3f8a25dfb6c768 Author: Raphaël Gomès <rgomes@octobus.net> Date: Tue Apr 27 10:55:16 2021 +0200 Replace old loader with the new one This is the minimal amount of code needed to switch from the old one to the new one. If the new loader proves to be good enough, we may remove the old one entirely. commit d88ab535e1f998136c1b27bf5aca6c585d2440d6 Author: Raphaël Gomès <rgomes@octobus.net> Date: Mon Apr 26 23:33:20 2021 +0200 Handle more cases of corruption Some corrupted repos have missing files or broken logical links in the underlying Mercurial datastructure, which means that say sometimes fail for a given revision. This does not mean we should throw away the rest of the repository. (Tested on repos of various levels and flavors of corruption in the Boatbucket archive) commit 250edbb11b85a62498dde8def39e84367cd3cebb Author: Raphaël Gomès <rgomes@octobus.net> Date: Mon Apr 26 23:28:50 2021 +0200 Ignore the repository's config `HGRCPATH` only tells Mercurial to ignore the user's config files, but some repositories have a `.hg/hgrc` file (only in the case that you copy the files instead of cloning, if present) that is usually used for server-side configuration. We want to ignore this, since it might affect loading and ask for hooks that are not there or are otherwise annoying/dangerous, for example. commit 457fb88bf36d6c4eedee5e9423f9747e3ea4abf4 Author: Raphaël Gomès <rgomes@octobus.net> Date: Mon Apr 26 23:26:09 2021 +0200 Also use minimal env in the new Mercurial loader The old loader (bundle2 loader) already received this treatment which ensures Mercurial doesn't pick up on any user customization, but I apparently forgot to apply the same changes to the new one. commit a89783c52f2e3c44e08018e0c5fa99f54471f994 Author: Raphaël Gomès <rgomes@octobus.net> Date: Tue Apr 27 10:53:56 2021 +0200 Use billiard instead of stdlib multiprocessing This circumvents a few celery-related issues, and is consistent with what the rest of the codebase does.
See https://jenkins.softwareheritage.org/job/DLDHG/job/tests-on-diff/213/ for more details.
Sure!
Just to bring more info to the table: the loader is able to handle all corrupted repositories I could find and seems fast *enough* for most repositories. The Mozilla-Unified run was very slow because it's so much bigger (in every metric) than the vast majority of repositories out there. Not saying we can't do anything to improve the speed of the loader, but it might be good enough since repos this large are far and few.
Also... the old loader would just eat all of your RAM and die, so it's definitely an improvement! :D
LGTM, but let's wait for a second opinion
lgtm as well
Sure!
Just to bring more info to the table: the loader is able to handle all corrupted
repositories I could find and seems fast *enough* for most repositories. The
Mozilla-Unified run was very slow because it's so much bigger (in every metric) than
the vast majority of repositories out there. Not saying we can't do anything to
improve the speed of the loader, but it might be good enough since repos this large
are far and few.
Also... the old loader would just eat all of your RAM and die, so it's definitely an
improvement! :D
Good to know ;) Thanks for the heads up.
As existing tests continue on passing with both loader, i guess it's fine. I'll validate
as me, let's see what others may say ;)
even if you accept the diff only as yourself, it leaves everyone's review queue.
... T.T ...
(there resigned as reviewer then, cleaned up the validation and removed reviewer, which resets the status on the diff)
Build is green
Patch application report for D5628 (id=20179)
Rebasing onto 888471483a...
Current branch diff-target is up to date.
Changes applied before test
commit ab4859d201e62336bdd9ca5f1a7b0591ae6846ec Author: Raphaël Gomès <rgomes@octobus.net> Date: Tue Apr 27 10:55:16 2021 +0200 Replace old loader with the new one This is the minimal amount of code needed to switch from the old one to the new one. If the new loader proves to be good enough, we may remove the old one entirely.
See https://jenkins.softwareheritage.org/job/DLDHG/job/tests-on-diff/223/ for more details.
swh/loader/mercurial/tasks.py | ||
---|---|---|
12 | I would have replaced this by an absolute import (to match the previous import statement) but meh |
I'm not very fond of the name of all this (from_disk, HgLoaderFromDisk, etc.) since it makes people think it cannot deal with remote hg repos (which the HgLoaderFromDisk does, but this is not really advertized, as far as I can tell).
But let's kick this, and add tasks for better documentation/naming, some day...
Build is green
Patch application report for D5628 (id=20567)
Rebasing onto 4630de8a50...
Current branch diff-target is up to date.
Changes applied before test
commit 92441e90110b9e45dbf63903dfbf716f84728659 Author: Raphaël Gomès <rgomes@octobus.net> Date: Tue Apr 27 10:55:16 2021 +0200 Replace old loader with the new one This is the minimal amount of code needed to switch from the old one to the new one. If the new loader proves to be good enough, we may remove the old one entirely.
See https://jenkins.softwareheritage.org/job/DLDHG/job/tests-on-diff/238/ for more details.