Page MenuHomeSoftware Heritage

Make the Mercurial loader incremental
ClosedPublic

Authored by Alphare on May 5 2021, 6:23 PM.

Details

Summary

Before this change, if a previous snapshot of a given origin existed
and new changes were detected, we would start from scratch.

This change leverages the recent new db mapping for external ids (like
Mercurial's node ids) to internal SWH ids to compute what has changed
from the latest snapshot, now that it is possible to find an SWH id from
a Mercurial node id.

For revisions, the logic is simple: look at the heads we've saved and
ask Mercurial for all the revisions that are not ancestors of these
heads (themselves excluded). This is not as "clever" as the full
Mercurial discovery algorithm, but is much simpler and good enough for
the kinds of scales we're operating at on a single repository.

For tags, the previous logic assumed that all possible target revisions
were done in the same run. Here, we look at the difference between the
tags Mercurial reports and the one form the previous snapshot; any new
tag will either have its corresponding release in cache (because it was
processed in the same run) or fetched from the database using the
aforementioned mapping.

Test Plan

Note that I saw that test_identify tests seem to fail when run with
the rest of the tests, I'm not sure why. So I'm trying it on the CI to
see if it behaves the same.

Diff Detail

Repository
rDLDHG Mercurial loader
Branch
incremental
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 21362
Build 33181: Phabricator diff pipeline on jenkinsJenkins console · Jenkins
Build 33180: arc lint + arc unit

Event Timeline

Build has FAILED

Patch application report for D5687 (id=20310)

Could not rebase; Attempt merge onto 888471483a...

Updating 8884714..7f041be
Fast-forward
 swh/loader/mercurial/from_disk.py            | 128 +++++++++++++++++----------
 swh/loader/mercurial/tasks.py                |  10 +--
 swh/loader/mercurial/tests/test_from_disk.py |  50 ++++++++++-
 swh/loader/mercurial/tests/test_tasks.py     |   4 +-
 4 files changed, 137 insertions(+), 55 deletions(-)
Changes applied before test
commit 7f041be86e44b2edb90c9008eb65abb87e4779be
Author: Raphaël Gomès <rgomes@octobus.net>
Date:   Wed May 5 17:23:47 2021 +0200

    Make the Mercurial loader incremental
    
    Before this change, if a previous snapshot of a given origin existed
    and new changes were detected, we would start from scratch.
    
    This change leverages the recent new db mapping for external ids (like
    Mercurial's node ids) to internal SWH ids to compute what has changed
    from the latest snapshot, now that it is possible to find an SWH id from
    a Mercurial node id.
    
    For revisions, the logic is simple: look at the heads we've saved and
    ask Mercurial for all the revisions that are not ancestors of these
    heads (themselves excluded). This is not as "clever" as the full
    Mercurial discovery algorithm, but is much simpler and good enough for
    the kinds of scales we're operating at on a single repository.
    
    For tags, the previous logic assumed that all possible target revisions
    were done in the same run. Here, we look at the difference between the
    tags Mercurial reports and the one form the previous snapshot; any new
    tag will either have its corresponding release in cache (because it was
    processed in the same run) or fetched from the database using the
    aforementioned mapping.

commit ab4859d201e62336bdd9ca5f1a7b0591ae6846ec
Author: Raphaël Gomès <rgomes@octobus.net>
Date:   Tue Apr 27 10:55:16 2021 +0200

    Replace old loader with the new one
    
    This is the minimal amount of code needed to switch from the old one to
    the new one. If the new loader proves to be good enough, we may remove
    the old one entirely.

Link to build: https://jenkins.softwareheritage.org/job/DLDHG/job/tests-on-diff/224/
See console output for more information: https://jenkins.softwareheritage.org/job/DLDHG/job/tests-on-diff/224/console

Harbormaster returned this revision to the author for changes because remote builds failed.May 5 2021, 6:25 PM
Harbormaster failed remote builds in B21292: Diff 20310!

Rebase directly on master, instead of the loader switch

Build has FAILED

Patch application report for D5687 (id=20311)

Rebasing onto 888471483a...

Current branch diff-target is up to date.
Changes applied before test
commit 3753ab5e944fa56a081c7392f405758ede1680f0
Author: Raphaël Gomès <rgomes@octobus.net>
Date:   Wed May 5 17:23:47 2021 +0200

    Make the Mercurial loader incremental
    
    Before this change, if a previous snapshot of a given origin existed
    and new changes were detected, we would start from scratch.
    
    This change leverages the recent new db mapping for external ids (like
    Mercurial's node ids) to internal SWH ids to compute what has changed
    from the latest snapshot, now that it is possible to find an SWH id from
    a Mercurial node id.
    
    For revisions, the logic is simple: look at the heads we've saved and
    ask Mercurial for all the revisions that are not ancestors of these
    heads (themselves excluded). This is not as "clever" as the full
    Mercurial discovery algorithm, but is much simpler and good enough for
    the kinds of scales we're operating at on a single repository.
    
    For tags, the previous logic assumed that all possible target revisions
    were done in the same run. Here, we look at the difference between the
    tags Mercurial reports and the one form the previous snapshot; any new
    tag will either have its corresponding release in cache (because it was
    processed in the same run) or fetched from the database using the
    aforementioned mapping.

Link to build: https://jenkins.softwareheritage.org/job/DLDHG/job/tests-on-diff/225/
See console output for more information: https://jenkins.softwareheritage.org/job/DLDHG/job/tests-on-diff/225/console

Harbormaster returned this revision to the author for changes because remote builds failed.May 5 2021, 6:33 PM
Harbormaster failed remote builds in B21293: Diff 20311!

Build has FAILED

Patch application report for D5687 (id=20318)

Rebasing onto 888471483a...

Current branch diff-target is up to date.
Changes applied before test
commit 3753ab5e944fa56a081c7392f405758ede1680f0
Author: Raphaël Gomès <rgomes@octobus.net>
Date:   Wed May 5 17:23:47 2021 +0200

    Make the Mercurial loader incremental
    
    Before this change, if a previous snapshot of a given origin existed
    and new changes were detected, we would start from scratch.
    
    This change leverages the recent new db mapping for external ids (like
    Mercurial's node ids) to internal SWH ids to compute what has changed
    from the latest snapshot, now that it is possible to find an SWH id from
    a Mercurial node id.
    
    For revisions, the logic is simple: look at the heads we've saved and
    ask Mercurial for all the revisions that are not ancestors of these
    heads (themselves excluded). This is not as "clever" as the full
    Mercurial discovery algorithm, but is much simpler and good enough for
    the kinds of scales we're operating at on a single repository.
    
    For tags, the previous logic assumed that all possible target revisions
    were done in the same run. Here, we look at the difference between the
    tags Mercurial reports and the one form the previous snapshot; any new
    tag will either have its corresponding release in cache (because it was
    processed in the same run) or fetched from the database using the
    aforementioned mapping.

Link to build: https://jenkins.softwareheritage.org/job/DLDHG/job/tests-on-diff/226/
See console output for more information: https://jenkins.softwareheritage.org/job/DLDHG/job/tests-on-diff/226/console

Harbormaster returned this revision to the author for changes because remote builds failed.May 5 2021, 9:35 PM
Harbormaster failed remote builds in B21300: Diff 20318!

Override broken dependency

Build has FAILED

Patch application report for D5687 (id=20319)

Rebasing onto 888471483a...

Current branch diff-target is up to date.
Changes applied before test
commit e54b73ebc0a10443450b878da8bbefb63c64b947
Author: Raphaël Gomès <rgomes@octobus.net>
Date:   Wed May 5 17:23:47 2021 +0200

    Make the Mercurial loader incremental
    
    Before this change, if a previous snapshot of a given origin existed
    and new changes were detected, we would start from scratch.
    
    This change leverages the recent new db mapping for external ids (like
    Mercurial's node ids) to internal SWH ids to compute what has changed
    from the latest snapshot, now that it is possible to find an SWH id from
    a Mercurial node id.
    
    For revisions, the logic is simple: look at the heads we've saved and
    ask Mercurial for all the revisions that are not ancestors of these
    heads (themselves excluded). This is not as "clever" as the full
    Mercurial discovery algorithm, but is much simpler and good enough for
    the kinds of scales we're operating at on a single repository.
    
    For tags, the previous logic assumed that all possible target revisions
    were done in the same run. Here, we look at the difference between the
    tags Mercurial reports and the one form the previous snapshot; any new
    tag will either have its corresponding release in cache (because it was
    processed in the same run) or fetched from the database using the
    aforementioned mapping.

Link to build: https://jenkins.softwareheritage.org/job/DLDHG/job/tests-on-diff/227/
See console output for more information: https://jenkins.softwareheritage.org/job/DLDHG/job/tests-on-diff/227/console

Harbormaster returned this revision to the author for changes because remote builds failed.May 5 2021, 9:47 PM
Harbormaster failed remote builds in B21301: Diff 20319!

Rebase on top of environment fix

Build is green

Patch application report for D5687 (id=20321)

Could not rebase; Attempt merge onto 888471483a...

Updating 8884714..5a700c3
Fast-forward
 requirements-test.txt                        |   1 +
 swh/loader/mercurial/from_disk.py            | 147 +++++++++++++++++----------
 swh/loader/mercurial/tests/conftest.py       |   1 +
 swh/loader/mercurial/tests/test_from_disk.py |  50 ++++++++-
 4 files changed, 146 insertions(+), 53 deletions(-)
Changes applied before test
commit 5a700c3a6ab5772170ba99aa6445cfc00ffe16bb
Author: Raphaël Gomès <rgomes@octobus.net>
Date:   Wed May 5 17:23:47 2021 +0200

    Make the Mercurial loader incremental
    
    Before this change, if a previous snapshot of a given origin existed
    and new changes were detected, we would start from scratch.
    
    This change leverages the recent new db mapping for external ids (like
    Mercurial's node ids) to internal SWH ids to compute what has changed
    from the latest snapshot, now that it is possible to find an SWH id from
    a Mercurial node id.
    
    For revisions, the logic is simple: look at the heads we've saved and
    ask Mercurial for all the revisions that are not ancestors of these
    heads (themselves excluded). This is not as "clever" as the full
    Mercurial discovery algorithm, but is much simpler and good enough for
    the kinds of scales we're operating at on a single repository.
    
    For tags, the previous logic assumed that all possible target revisions
    were done in the same run. Here, we look at the difference between the
    tags Mercurial reports and the one form the previous snapshot; any new
    tag will either have its corresponding release in cache (because it was
    processed in the same run) or fetched from the database using the
    aforementioned mapping.

commit ecb99e7ddbcb91b51f414c1cda3a36e1d71858d7
Author: Raphaël Gomès <rgomes@octobus.net>
Date:   Wed May 5 22:29:27 2021 +0200

    Remove `os.environ` manipulation in tests
    
    The explanation is inside the patch. We also add the `HGSKIPREPO`
    variable that I forgot to add in the test environment.

See https://jenkins.softwareheritage.org/job/DLDHG/job/tests-on-diff/229/ for more details.

Build is green

Patch application report for D5687 (id=20326)

Could not rebase; Attempt merge onto 888471483a...

Updating 8884714..c4239db
Fast-forward
 requirements-test.txt                        |   1 +
 swh/loader/mercurial/from_disk.py            | 152 ++++++++++++++++++---------
 swh/loader/mercurial/tests/conftest.py       |   1 +
 swh/loader/mercurial/tests/test_from_disk.py |  50 ++++++++-
 4 files changed, 151 insertions(+), 53 deletions(-)
Changes applied before test
commit c4239db559c1f2279ef35f4c66305725c26e9d85
Author: Raphaël Gomès <rgomes@octobus.net>
Date:   Wed May 5 17:23:47 2021 +0200

    Make the Mercurial loader incremental
    
    Before this change, if a previous snapshot of a given origin existed
    and new changes were detected, we would start from scratch.
    
    This change leverages the recent new db mapping for external ids (like
    Mercurial's node ids) to internal SWH ids to compute what has changed
    from the latest snapshot, now that it is possible to find an SWH id from
    a Mercurial node id.
    
    For revisions, the logic is simple: look at the heads we've saved and
    ask Mercurial for all the revisions that are not ancestors of these
    heads (themselves excluded). This is not as "clever" as the full
    Mercurial discovery algorithm, but is much simpler and good enough for
    the kinds of scales we're operating at on a single repository.
    
    For tags, the previous logic assumed that all possible target revisions
    were done in the same run. Here, we look at the difference between the
    tags Mercurial reports and the one form the previous snapshot; any new
    tag will either have its corresponding release in cache (because it was
    processed in the same run) or fetched from the database using the
    aforementioned mapping.

commit 24ab587b6440850951bdcfeeef8a40d7e3eaee9a
Author: Raphaël Gomès <rgomes@octobus.net>
Date:   Wed May 5 22:29:27 2021 +0200

    Remove `os.environ` manipulation in tests
    
    The explanation is inside the patch. We also add the `HGSKIPREPO`
    variable that I forgot to add in the test environment.

See https://jenkins.softwareheritage.org/job/DLDHG/job/tests-on-diff/232/ for more details.

vlorentz added inline comments.
swh/loader/mercurial/from_disk.py
226–230

What about two variables, instead? (and move the for loop's body in a function)

308–309

nitpick: self.log.info("New revisions found: %d", len(new_revs)), so it only formats the string if the log level is high enough for it to be printed

350

Would it make sense to add a test for this?

425–430

Hmm that's inconvenient. What happens after this exception is raised?

And what does the non-incremental loader do when a parent is missing?

swh/loader/mercurial/tests/test_from_disk.py
450

What does this do?

swh/loader/mercurial/from_disk.py
226–230

Sure!

350

This was me being (overly?) cautious, I don't think I've ever seen it in the wild. But it wouldn't be hard to test I think, so I'll give it a shot.

425–430

It's caught in get_revision_parents, and raises CorruptedRevision for the parent, which causes the whole descendants chain to be blacklisted from the run.

The non-incremental loader already handled PyPy fine now that I think about it. I was going off Antoine's comment, but maybe this can't actually happen since we're catching corruptions earlier. I'll try loading PyPy and see if we come through here.

swh/loader/mercurial/tests/test_from_disk.py
450

It appears I forgot to add a docstring! Will do.
(it straight-up removes the tip of the repository)

Rebase + remove useless loader call in test

Build is green

Patch application report for D5687 (id=20383)

Could not rebase; Attempt merge onto 888471483a...

Updating 8884714..8ea698f
Fast-forward
 requirements-test.txt                        |   1 +
 swh/loader/mercurial/from_disk.py            | 136 +++++++++++++++++----------
 swh/loader/mercurial/tests/conftest.py       |   1 +
 swh/loader/mercurial/tests/test_from_disk.py |  47 ++++++++-
 4 files changed, 133 insertions(+), 52 deletions(-)
Changes applied before test
commit 8ea698f9ba4a5d392a4d22571b8c3798280b7104
Author: Raphaël Gomès <rgomes@octobus.net>
Date:   Wed May 5 17:23:47 2021 +0200

    Make the Mercurial loader incremental
    
    Before this change, if a previous snapshot of a given origin existed
    and new changes were detected, we would start from scratch.
    
    This change leverages the recent new db mapping for external ids (like
    Mercurial's node ids) to internal SWH ids to compute what has changed
    from the latest snapshot, now that it is possible to find an SWH id from
    a Mercurial node id.
    
    For revisions, the logic is simple: look at the heads we've saved and
    ask Mercurial for all the revisions that are not ancestors of these
    heads (themselves excluded). This is not as "clever" as the full
    Mercurial discovery algorithm, but is much simpler and good enough for
    the kinds of scales we're operating at on a single repository.
    
    For tags, the previous logic assumed that all possible target revisions
    were done in the same run. Here, we look at the difference between the
    tags Mercurial reports and the one form the previous snapshot; any new
    tag will either have its corresponding release in cache (because it was
    processed in the same run) or fetched from the database using the
    aforementioned mapping.

commit 4a2541c1b7868574eb772bc372829904f03c37d7
Author: Raphaël Gomès <rgomes@octobus.net>
Date:   Wed May 5 22:29:27 2021 +0200

    Move `os.environ` manipulation to pre_cleanup
    
    Simply initializing a loader would empty the environment, which can
    cause seemingly unrelated things to break. Moving the environment
    handling to the `pre_cleanup` phase ensures that `cleanup` will also
    be called and the environment will not be left in a broken state.
    
    We also add the `HGRCSKIPREPO` variable that I forgot to add in the
    test environment. This is still needed because the tests invoke
    `hg` directly. We could potentially have a wrapper util that uses a
    context-manager to do the environment manipulation closer to the issue,
    but we'd have to make sure that no other bare `hg` invocations can
    happen, even in random subprocesses.

See https://jenkins.softwareheritage.org/job/DLDHG/job/tests-on-diff/234/ for more details.

Build is green

Patch application report for D5687 (id=20390)

Could not rebase; Attempt merge onto 888471483a...

Updating 8884714..94ccd86
Fast-forward
 swh/loader/mercurial/from_disk.py            | 136 +++++++++++++++++----------
 swh/loader/mercurial/tests/conftest.py       |   1 +
 swh/loader/mercurial/tests/test_from_disk.py |  47 ++++++++-
 3 files changed, 132 insertions(+), 52 deletions(-)
Changes applied before test
commit 94ccd86dd4eb0941825768908ffd54f29c9db504
Author: Raphaël Gomès <rgomes@octobus.net>
Date:   Wed May 5 17:23:47 2021 +0200

    Make the Mercurial loader incremental
    
    Before this change, if a previous snapshot of a given origin existed
    and new changes were detected, we would start from scratch.
    
    This change leverages the recent new db mapping for external ids (like
    Mercurial's node ids) to internal SWH ids to compute what has changed
    from the latest snapshot, now that it is possible to find an SWH id from
    a Mercurial node id.
    
    For revisions, the logic is simple: look at the heads we've saved and
    ask Mercurial for all the revisions that are not ancestors of these
    heads (themselves excluded). This is not as "clever" as the full
    Mercurial discovery algorithm, but is much simpler and good enough for
    the kinds of scales we're operating at on a single repository.
    
    For tags, the previous logic assumed that all possible target revisions
    were done in the same run. Here, we look at the difference between the
    tags Mercurial reports and the one form the previous snapshot; any new
    tag will either have its corresponding release in cache (because it was
    processed in the same run) or fetched from the database using the
    aforementioned mapping.

commit 773d872a81645505cd7c78e2c4ee819856eaad24
Author: Raphaël Gomès <rgomes@octobus.net>
Date:   Wed May 5 22:29:27 2021 +0200

    Move `os.environ` manipulation to pre_cleanup
    
    Simply initializing a loader would empty the environment, which can
    cause seemingly unrelated things to break. Moving the environment
    handling to the `pre_cleanup` phase ensures that `cleanup` will also
    be called and the environment will not be left in a broken state.
    
    We also add the `HGRCSKIPREPO` variable that I forgot to add in the
    test environment. This is still needed because the tests invoke
    `hg` directly. We could potentially have a wrapper util that uses a
    context-manager to do the environment manipulation closer to the issue,
    but we'd have to make sure that no other bare `hg` invocations can
    happen, even in random subprocesses.

See https://jenkins.softwareheritage.org/job/DLDHG/job/tests-on-diff/236/ for more details.

I haven't gotten around to doing the changes yet, will do so shortly.

Alphare marked an inline comment as done.

Include suggestions

From testing on a bunch of corrupted/verify-failed repositories,
there appears to be no indication that the fetching of a node's
ExtID can fail, meaning that the measures we've taken in previous
patches prevent this issue from happening.
If during testing or production we stumble upon such issues, we'll
tackle them, but it seems unlikely. In the mean time, remove the
overly cautious exceptions.

Build is green

Patch application report for D5687 (id=20396)

Rebasing onto 773d872a81...

Current branch diff-target is up to date.
Changes applied before test
commit 4630de8a500d67ebf70e91da32328c9fbc3aa37e
Author: Raphaël Gomès <rgomes@octobus.net>
Date:   Wed May 5 17:23:47 2021 +0200

    Make the Mercurial loader incremental
    
    Before this change, if a previous snapshot of a given origin existed
    and new changes were detected, we would start from scratch.
    
    This change leverages the recent new db mapping for external ids (like
    Mercurial's node ids) to internal SWH ids to compute what has changed
    from the latest snapshot, now that it is possible to find an SWH id from
    a Mercurial node id.
    
    For revisions, the logic is simple: look at the heads we've saved and
    ask Mercurial for all the revisions that are not ancestors of these
    heads (themselves excluded). This is not as "clever" as the full
    Mercurial discovery algorithm, but is much simpler and good enough for
    the kinds of scales we're operating at on a single repository.
    
    For tags, the previous logic assumed that all possible target revisions
    were done in the same run. Here, we look at the difference between the
    tags Mercurial reports and the one form the previous snapshot; any new
    tag will either have its corresponding release in cache (because it was
    processed in the same run) or fetched from the database using the
    aforementioned mapping.

See https://jenkins.softwareheritage.org/job/DLDHG/job/tests-on-diff/237/ for more details.

This revision is now accepted and ready to land.May 11 2021, 10:20 AM
This revision was automatically updated to reflect the committed changes.