Page MenuHomeSoftware Heritage

Replace old loader with the new one
AcceptedPublic

Authored by Alphare on Tue, Apr 27, 5:31 PM.

Details

Reviewers
ardumont
vlorentz
douardda
Group Reviewers
Reviewers
Summary

This is the minimal amount of code needed to switch from the old one to
the new one. If the new loader proves to be good enough, we may remove
the old one entirely.

Diff Detail

Repository
rDLDHG Mercurial loader
Branch
corruption
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 21168
Build 32854: Phabricator diff pipeline on jenkinsJenkins console · Jenkins
Build 32853: arc lint + arc unit

Event Timeline

Build has FAILED

Patch application report for D5628 (id=20068)

Could not rebase; Attempt merge onto f03f274065...

Updating f03f274..acd2f1d
Fast-forward
 swh/loader/mercurial/from_disk.py | 27 +++++++++++++++++++++++----
 swh/loader/mercurial/hgutil.py    |  3 ++-
 swh/loader/mercurial/tasks.py     |  5 +++--
 swh/loader/mercurial/utils.py     |  3 ++-
 4 files changed, 30 insertions(+), 8 deletions(-)
Changes applied before test
commit acd2f1dc247fe4dc3fd2981d9eb7ccbaa64e120d
Author: Raphaël Gomès <rgomes@octobus.net>
Date:   Tue Apr 27 10:55:16 2021 +0200

    Replace old loader with the new one
    
    This is the minimal amount of code needed to switch from the old one to
    the new one. If the new loader proves to be good enough, we may remove
    the old one entirely.

commit 5413622cd6cf1a584d1b9d300b3dd1f5a6e94ce6
Author: Raphaël Gomès <rgomes@octobus.net>
Date:   Mon Apr 26 23:33:20 2021 +0200

    Handle more cases of corruption
    
    Some corrupted repos have missing files or broken logical links in the
    underlying Mercurial datastructure, which means that say sometimes fail
    for a given revision. This does not mean we should throw away the rest
    of the repository. (Tested on repos of various levels and flavors of
    corruption in the Boatbucket archive)

commit c6c3b386ef246860e9292ab0331aaf32cf72d61b
Author: Raphaël Gomès <rgomes@octobus.net>
Date:   Mon Apr 26 23:28:50 2021 +0200

    Ignore the repository's config
    
    `HGRCPATH` only tells Mercurial to ignore the user's config files, but
    some repositories have a `.hg/hgrc` file (only in the case that you copy
    the files instead of cloning, if present) that is usually used for server-side
    configuration. We want to ignore this, since it might affect loading
    and ask for hooks that are not there or are otherwise annoying/dangerous,
    for example.

commit 2ec0206482f46491086791b6b8718d5094cb4d77
Author: Raphaël Gomès <rgomes@octobus.net>
Date:   Mon Apr 26 23:26:09 2021 +0200

    Also use minimal env in the new Mercurial loader
    
    The old loader (bundle2 loader) already received this treatment which
    ensures Mercurial doesn't pick up on any user customization, but I
    apparently forgot to apply the same changes to the new one.

commit 1d8b26c042011f3271e451b11b670f37b44e8685
Author: Raphaël Gomès <rgomes@octobus.net>
Date:   Tue Apr 27 10:53:56 2021 +0200

    Use billiard instead of stdlib multiprocessing
    
    This circumvents a few celery-related issues, and is consistent with
    what the rest of the codebase does.

Link to build: https://jenkins.softwareheritage.org/job/DLDHG/job/tests-on-diff/208/
See console output for more information: https://jenkins.softwareheritage.org/job/DLDHG/job/tests-on-diff/208/console

Harbormaster returned this revision to the author for changes because remote builds failed.Tue, Apr 27, 5:39 PM
Harbormaster failed remote builds in B21061: Diff 20068!

Rebase, fix imports and add the archive one

Build is green

Patch application report for D5628 (id=20081)

Could not rebase; Attempt merge onto f03f274065...

Updating f03f274..1337196
Fast-forward
 swh/loader/mercurial/from_disk.py         | 27 +++++++++++++++++++++++----
 swh/loader/mercurial/hgutil.py            | 18 +++++++++++++-----
 swh/loader/mercurial/tasks.py             | 10 +++++-----
 swh/loader/mercurial/tests/test_hgutil.py | 11 +++++++----
 swh/loader/mercurial/tests/test_tasks.py  |  4 ++--
 swh/loader/mercurial/utils.py             |  3 ++-
 6 files changed, 52 insertions(+), 21 deletions(-)
Changes applied before test
commit 1337196d87dadebbe089dcc40f3f8a25dfb6c768
Author: Raphaël Gomès <rgomes@octobus.net>
Date:   Tue Apr 27 10:55:16 2021 +0200

    Replace old loader with the new one
    
    This is the minimal amount of code needed to switch from the old one to
    the new one. If the new loader proves to be good enough, we may remove
    the old one entirely.

commit d88ab535e1f998136c1b27bf5aca6c585d2440d6
Author: Raphaël Gomès <rgomes@octobus.net>
Date:   Mon Apr 26 23:33:20 2021 +0200

    Handle more cases of corruption
    
    Some corrupted repos have missing files or broken logical links in the
    underlying Mercurial datastructure, which means that say sometimes fail
    for a given revision. This does not mean we should throw away the rest
    of the repository. (Tested on repos of various levels and flavors of
    corruption in the Boatbucket archive)

commit 250edbb11b85a62498dde8def39e84367cd3cebb
Author: Raphaël Gomès <rgomes@octobus.net>
Date:   Mon Apr 26 23:28:50 2021 +0200

    Ignore the repository's config
    
    `HGRCPATH` only tells Mercurial to ignore the user's config files, but
    some repositories have a `.hg/hgrc` file (only in the case that you copy
    the files instead of cloning, if present) that is usually used for server-side
    configuration. We want to ignore this, since it might affect loading
    and ask for hooks that are not there or are otherwise annoying/dangerous,
    for example.

commit 457fb88bf36d6c4eedee5e9423f9747e3ea4abf4
Author: Raphaël Gomès <rgomes@octobus.net>
Date:   Mon Apr 26 23:26:09 2021 +0200

    Also use minimal env in the new Mercurial loader
    
    The old loader (bundle2 loader) already received this treatment which
    ensures Mercurial doesn't pick up on any user customization, but I
    apparently forgot to apply the same changes to the new one.

commit a89783c52f2e3c44e08018e0c5fa99f54471f994
Author: Raphaël Gomès <rgomes@octobus.net>
Date:   Tue Apr 27 10:53:56 2021 +0200

    Use billiard instead of stdlib multiprocessing
    
    This circumvents a few celery-related issues, and is consistent with
    what the rest of the codebase does.

See https://jenkins.softwareheritage.org/job/DLDHG/job/tests-on-diff/213/ for more details.

LGTM, but let's wait for a second opinion

LGTM, but let's wait for a second opinion

Sure!
Just to bring more info to the table: the loader is able to handle all corrupted repositories I could find and seems fast *enough* for most repositories. The Mozilla-Unified run was very slow because it's so much bigger (in every metric) than the vast majority of repositories out there. Not saying we can't do anything to improve the speed of the loader, but it might be good enough since repos this large are far and few.

Also... the old loader would just eat all of your RAM and die, so it's definitely an improvement! :D

ardumont added a subscriber: ardumont.

LGTM, but let's wait for a second opinion

lgtm as well

Sure!
Just to bring more info to the table: the loader is able to handle all corrupted
repositories I could find and seems fast *enough* for most repositories. The
Mozilla-Unified run was very slow because it's so much bigger (in every metric) than
the vast majority of repositories out there. Not saying we can't do anything to
improve the speed of the loader, but it might be good enough since repos this large
are far and few.
Also... the old loader would just eat all of your RAM and die, so it's definitely an
improvement! :D

Good to know ;) Thanks for the heads up.

As existing tests continue on passing with both loader, i guess it's fine. I'll validate
as me, let's see what others may say ;)

This revision is now accepted and ready to land.Wed, Apr 28, 12:41 PM

As existing tests continue on passing with both loader, i guess it's fine. I'll validate
as me, let's see what others may say ;)

even if you accept the diff only as yourself, it leaves everyone's review queue.

This revision now requires review to proceed.Wed, Apr 28, 12:44 PM

even if you accept the diff only as yourself, it leaves everyone's review queue.

... T.T ...
(there resigned as reviewer then, cleaned up the validation and removed reviewer, which resets the status on the diff)

Alternatively, you could just mark "Reviewers" as blocking reviewer ;)

cool, til, then, accepted back as me. Thanks.

Build is green

Patch application report for D5628 (id=20179)

Rebasing onto 888471483a...

Current branch diff-target is up to date.
Changes applied before test
commit ab4859d201e62336bdd9ca5f1a7b0591ae6846ec
Author: Raphaël Gomès <rgomes@octobus.net>
Date:   Tue Apr 27 10:55:16 2021 +0200

    Replace old loader with the new one
    
    This is the minimal amount of code needed to switch from the old one to
    the new one. If the new loader proves to be good enough, we may remove
    the old one entirely.

See https://jenkins.softwareheritage.org/job/DLDHG/job/tests-on-diff/223/ for more details.

douardda added inline comments.
swh/loader/mercurial/tasks.py
12

I would have replaced this by an absolute import (to match the previous import statement) but meh

I'm not very fond of the name of all this (from_disk, HgLoaderFromDisk, etc.) since it makes people think it cannot deal with remote hg repos (which the HgLoaderFromDisk does, but this is not really advertized, as far as I can tell).

But let's kick this, and add tasks for better documentation/naming, some day...

This revision is now accepted and ready to land.Tue, May 4, 4:53 PM