Page MenuHomeSoftware Heritage

HgLoaderFromDisk: Only load new commits
AbandonedPublic

Authored by Alphare on Dec 2 2020, 2:59 PM.

Details

Summary

When a repository as new commits only load the new ones.

Diff Detail

Repository
rDLDHG Mercurial loader
Branch
D4649
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 17749
Build 27437: Phabricator diff pipeline on jenkinsJenkins console · Jenkins
Build 27436: arc lint + arc unit

Unit TestsFailed

TimeTest
0 msJenkins > .tox.py3.lib.python3.7.site-packages.swh.loader.mercurial.tests.test_tasks::test_archive_loader
swh_scheduler_celery_app = <Celery celery.tests at 0x7fe052605128> swh_scheduler_celery_includes = ['swh.scheduler.tests.tasks', 'swh.loader.mercurial.tasks', 'swh.loader.mercurial.tasks_from_disk', 'swh.loader.package.archive.tasks', 'swh.loader.package.cran.tasks', 'swh.loader.package.debian.tasks', ...]
11,819 msJenkins > .tox.py3.lib.python3.7.site-packages.swh.loader.mercurial.tests.test_tasks::test_loader
swh_scheduler_celery_app = <Celery celery.tests at 0x7fe052605128> swh_scheduler_celery_includes = ['swh.scheduler.tests.tasks', 'swh.loader.mercurial.tasks', 'swh.loader.mercurial.tasks_from_disk', 'swh.loader.package.archive.tasks', 'swh.loader.package.cran.tasks', 'swh.loader.package.debian.tasks', ...]
136 msJenkins > .tox.py3.lib.python3.7.site-packages.swh.loader.mercurial.tests.test_tasks_from_disk::test_archive_loader
swh_scheduler_celery_app = <Celery celery.tests at 0x7fe052605128> swh_scheduler_celery_includes = ['swh.scheduler.tests.tasks', 'swh.loader.mercurial.tasks', 'swh.loader.mercurial.tasks_from_disk', 'swh.loader.package.archive.tasks', 'swh.loader.package.cran.tasks', 'swh.loader.package.debian.tasks', ...]
0 msJenkins > .tox.py3.lib.python3.7.site-packages.swh.loader.mercurial.tests.test_tasks_from_disk::test_loader
swh_scheduler_celery_app = <Celery celery.tests at 0x7fe052605128> swh_scheduler_celery_includes = ['swh.scheduler.tests.tasks', 'swh.loader.mercurial.tasks', 'swh.loader.mercurial.tasks_from_disk', 'swh.loader.package.archive.tasks', 'swh.loader.package.cran.tasks', 'swh.loader.package.debian.tasks', ...]
2 msJenkins > .tox.py3.lib.python3.7.site-packages.swh.loader.mercurial.tests.test_converters.TestParseAuthorConverters::test_parse_author_2
View Full Test Results (4 Failed · 30 Passed)

Event Timeline

Build is green

Patch application report for D4649 (id=16488)

Could not rebase; Attempt merge onto 4bf91cff72...

Updating 4bf91cf..a110ee9
Fast-forward
 swh/loader/mercurial/from_disk.py            | 88 +++++++++++++++++++++++++---
 swh/loader/mercurial/tests/test_from_disk.py | 80 +++++++++++++++++++++++++
 2 files changed, 161 insertions(+), 7 deletions(-)
Changes applied before test
commit a110ee9e907e8356a5583b5a874d0285f6312539
Author: Antoine Cezar <antoine.cezar@octobus.net>
Date:   Wed Dec 2 14:47:18 2020 +0100

    HgLoaderFromDisk: Only load new commits
    
    When a repository as new commits only load the new ones.

commit c0ac9e2e4bbbd6bf0a6c09517a532708db824df1
Author: Antoine Cezar <antoine.cezar@octobus.net>
Date:   Mon Nov 30 14:29:47 2020 +0100

    Make loading an unchanged repository uneventful
    
    Summary:
    By looking at the previous snapshot heads, loading of an unchanged repository
    will be uneventful.
    
    Reviewers: #reviewers
    
    Differential Revision: https://forge.softwareheritage.org/D4643

See https://jenkins.softwareheritage.org/job/DLDHG/job/tests-on-diff/133/ for more details.

douardda requested changes to this revision.Dec 4 2020, 2:38 PM
douardda added a subscriber: douardda.
douardda added inline comments.
swh/loader/mercurial/from_disk.py
372

I don't really understand why this is needed. I mean isn't it just a matter of properly sort revisions to add to ensure parents revs are always handled before? ie make sure ingested revisions graph is traversed "properly".

I mean if this is needed, then there is recursion store_revisions -> get_revisions_parents -> store_revision [...] and there is no guarantee this recursion will stay low enough for python to handle it (-> RecursionError) .

So to prevent that, the revision graph must be traversed properly, then this special case (resulting in possible recursion) is not needed anymore.

This revision now requires changes to proceed.Dec 4 2020, 2:38 PM
acezar marked an inline comment as done.

Followup

swh/loader/mercurial/from_disk.py
372

It was to get missing parents until they can be fetched from the storage, but since I know some of them from the latest snapshot, I have pre-populated self._revision_nodeid_to_swhid instead.

Build was aborted

Patch application report for D4649 (id=16566)

Could not rebase; Attempt merge onto 4bf91cff72...

Updating 4bf91cf..c89f1b7
Fast-forward
 swh/loader/mercurial/from_disk.py            | 92 +++++++++++++++++++++++++---
 swh/loader/mercurial/hgutil.py               |  4 +-
 swh/loader/mercurial/tests/test_from_disk.py | 80 ++++++++++++++++++++++++
 3 files changed, 168 insertions(+), 8 deletions(-)
Changes applied before test
commit c89f1b729ea862082a51e73eefaa3023ef92cf62
Author: Antoine Cezar <antoine.cezar@octobus.net>
Date:   Wed Dec 2 14:47:18 2020 +0100

    HgLoaderFromDisk: Only load new commits
    
    When a repository as new commits only load the new ones.

commit 7af11ca8dcaaec1b65250bfbdb9a9d43d3c588cf
Author: Antoine Cezar <antoine.cezar@octobus.net>
Date:   Mon Nov 30 14:29:47 2020 +0100

    HgLoaderFromDisk: uneventful load when unchanged
    
    By looking at the previous snapshot heads, loading of an unchanged repository
    will be uneventful.

Link to build: https://jenkins.softwareheritage.org/job/DLDHG/job/tests-on-diff/135/
See console output for more information: https://jenkins.softwareheritage.org/job/DLDHG/job/tests-on-diff/135/console

douardda added inline comments.
swh/loader/mercurial/from_disk.py
372

this still looks wrong to me.

@marmoute can you have a look please?

In which situation can this sha1_git be None?

I think this would happen in the situation where:

  • last snapshot returned a set of hg revisions s1, so we know the sha1_git for every changeset c_i in this snapshot
  • the updated hg repo includes a changeset c_new which parent is known by the SWH Archive but is not in s1 (so it's an ancestor of at least one element of s1),
  • then the call to self.get_revision_id_from_hg_nodeid(parent_hg_nodeid) will fail and the whole updating will fail.

Am I right?

If so, we cannot implement this without a "working" get_revision_id_from_hg_nodeid().

acezar added inline comments.
swh/loader/mercurial/from_disk.py
372

You are right. It will fail in some situations.

I added link to T2849 to the diff related objects.

swh/loader/mercurial/from_disk.py
372

Example of get_revision_id_from_hg_nodeid returning None:

The repository is usable but has some corruption that prevent loading part of its revisions. This is the case for pypy for example.

Not an issue since the rest of the repository is ok to work with as long that you don't use the corrupted part.

But when a revision is corrupted all its descendants will fail to get their parent id has they cannot be loaded from the storage or the cache.

I'm preparing a diff that will avoid failing of the entire load when there are some corruption by not loading the corrupted revisions. And it will imply that get_revision_id_from_hg_nodeid can return None or raise a specific exception.

Alphare commandeered this revision.EditedWed, May 5, 10:44 PM
Alphare abandoned this revision.
Alphare added a reviewer: acezar.

Abandoned in favor of D5687