Page MenuHomeSoftware Heritage

git: Ingest ordered tags then ordered branches references
Changes PlannedPublicDraft

Authored by ardumont on Oct 1 2021, 6:55 PM.

Details

Reviewers
None
Group Reviewers
Reviewers
Maniphest Tasks
T3625: Reduce git loader memory footprint
Summary

This should allow to gradually ingest the git history. If we don't deal with
{master/main/HEAD} references first, it sounds highly plausible we won't have a mostly
full connected graph at first (except for repositories with no tags). Started from tag
v0.1.0, v0.2.0, etc...

If we couple this with the option to ingest repository with multiple packfiles, this
should allow to ingest more incrementally large repositories (as it creates partial
snapshots at each fetch/store loop).

Related to T3625
Depends on D6386

Diff Detail

Repository
rDLDG Git loader
Branch
improve-loader-git-sort-refs
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 32830
Build 51442: Phabricator diff pipeline on jenkinsJenkins console · Jenkins
Build 51441: arc lint + arc unit

Event Timeline

Build has FAILED

Patch application report for D6392 (id=23249)

Could not rebase; Attempt merge onto 8d371e815c...

Merge made by the 'recursive' strategy.
 swh/loader/git/loader.py | 138 +++++++++++++++++++++++++++++++++++------------
 1 file changed, 103 insertions(+), 35 deletions(-)
Changes applied before test
commit fe69a9262bbe6e6a14ced2e50b4e95c3d0d371cc
Merge: 8d371e8 85db21a
Author: Jenkins user <jenkins@localhost>
Date:   Fri Oct 1 16:55:54 2021 +0000

    Merge branch 'diff-target' into HEAD

commit 85db21a3b2b802af8bb47834c76da5f124b0c4c2
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Fri Oct 1 18:45:36 2021 +0200

    git: Ingest in order tag then branche references
    
    It should allow to gradually ingest the git history in smaller packfiles first. If we
    don't deal with master/main/HEAD first, it sounds highly plausible we won't have a
    mostly full connected graph in the first packfiles (except for repositories with no
    tags)
    
    Related to T3625

commit 0acadb93ecc3b18712e06b36ed965a0eb8032b3d
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Thu Sep 30 15:36:48 2021 +0200

    git: Load git repository through multiple packfiles fetch operations
    
    The end goal is to decrease the current memory pressure when loading a large repository.
    This does not impact how small to medium repositories are ingested.
    
    To improve the current loading, we ensure that we retrieve unknown remote references
    into batch of 2000 references (by default). It's configurable at loader instantiation
    time. It's not perfect yet because it also depends on how the repository git graph
    is (for example, if it happens that first 2000 references are fully connex, then we will
    retrieve everything in one round).
    
    Implementation wise, this adapts the current graph walker (which is the one resolving
    the missing local references from the remote references) so we won't walk over already
    fetched references when multiple iterations is needed.
    
    Related to T3625

Link to build: https://jenkins.softwareheritage.org/job/DLDG/job/tests-on-diff/136/
See console output for more information: https://jenkins.softwareheritage.org/job/DLDG/job/tests-on-diff/136/console

Harbormaster returned this revision to the author for changes because remote builds failed.Oct 1 2021, 6:56 PM
Harbormaster failed remote builds in B24166: Diff 23249!

Build has FAILED

Patch application report for D6392 (id=23256)

Could not rebase; Attempt merge onto 8d371e815c...

Updating 8d371e8..828faca
Fast-forward
 requirements-swh.txt                |   2 +-
 swh/loader/git/loader.py            | 153 ++++++++++++++++++++++++++----------
 swh/loader/git/tests/test_loader.py |  36 +++++++++
 3 files changed, 150 insertions(+), 41 deletions(-)
Changes applied before test
commit 828faca8c7f9ba4e7e1d3d89311c1a67cbe06ad4
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Fri Oct 1 18:45:36 2021 +0200

    git: Ingest in order tag then branche references
    
    It should allow to gradually ingest the git history in smaller packfiles first. If we
    don't deal with master/main/HEAD first, it sounds highly plausible we won't have a
    mostly full connected graph in the first packfiles (except for repositories with no
    tags)
    
    Related to T3625

commit 07930184b93055772e7b38c9ca95dc477f5b9eac
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Sun Oct 3 15:23:50 2021 +0200

    git: Load git repository through multiple packfiles fetch operations
    
    on large repository. This should not impact how small to medium repositories are
    ingested.
    
    The end goal is to decrease the current memory pressure when loading a large repository.
    
    To improve the current loading, we ensure that we retrieve unknown remote references
    into batch of 200 references (by default, it's also configurable at loader instantiation
    time).
    
    It's not perfect yet because it also depends on how the repository git graph is (for
    example, if it happens that first 200 references are fully connected, then we will
    retrieve everything in one round).
    
    Implementation wise, this adapts the current graph walker (which is the one resolving
    the missing local references from the remote references) so we won't walk over already
    fetched references when multiple iterations is needed.
    
    This also makes the loader git explicitely create partial visit when fetching packfiles.
    That is, the loader now creates partial visits with snapshot after each packfile
    consumed. The end goal being to decrease the work the loader would have to do again if
    the initial visit would not complete for some reasons.
    
    Related to T3625

Link to build: https://jenkins.softwareheritage.org/job/DLDG/job/tests-on-diff/138/
See console output for more information: https://jenkins.softwareheritage.org/job/DLDG/job/tests-on-diff/138/console

Harbormaster returned this revision to the author for changes because remote builds failed.Oct 3 2021, 3:31 PM
Harbormaster failed remote builds in B24173: Diff 23256!

I don't understand the sorting aspect in this diff.

Why is sorting required here? I mean it seems you only need to put tag branches in front and non-tag ones at the back, right?

Wouldn't this be enought?

tags = set()
branches = set()
for ref_name, ref_target in refs.items():
    if utils.ignore_branch_name(ref_name) or ref_target in local_heads:
        continue
    if ref_name.startswith(b"refs/tags/"):
        tags.add(ref_target)
    else:
        branches.add(ref_target)
self.wanted_refs = list(tags) + list(branches)

I don't understand the sorting aspect in this diff.

Why is sorting required here?

The idea was to sort in ascending order the refs (assuming the tags would mostly be
first indeed) to progressively load commits in smaller connected chunks (in lesser-sized
packfiles). If so doing, that would deal with larger packfiles more smoothly than it
used to do while not changing much to the logic for smaller repositories.

And if a crash would happen during the loading (for any repositories), we'd rely on the
partial snapshotting done after each ingested packfiles to incrementally load again the
repository at the next iteration.

I mean it seems you only need to put tag branches in front and non-tag ones at the
back, right?

Maybe what you suggest would be enough, i'm not sure. I only had tested my logic at the
time which was sound enough to me. It had also some success iirc. That'd be worth
revisiting now that the main initial conditions of the git loader got fixed in v2.1.

Also i think now that a mix of our suggestions would be even better. That is, creating a
sorted list of tags (and then add the branches at the end).

Rebase and adapt according to discussion

ardumont retitled this revision from git: Ingest in order tag then branche references to git: Ingest ordered tags then ordered branches references.Nov 16 2022, 12:26 PM
ardumont edited the summary of this revision. (Show Details)

Build has FAILED

Patch application report for D6392 (id=31897)

Could not rebase; Attempt merge onto 92d9ada9b7...

Updating 92d9ada..ba71702
Fast-forward
 requirements-swh.txt                |   2 +-
 swh/loader/git/dumb.py              |   6 +-
 swh/loader/git/loader.py            | 325 ++++++++++++++++++++++++++++--------
 swh/loader/git/tests/test_loader.py |  73 +++++++-
 4 files changed, 328 insertions(+), 78 deletions(-)
Changes applied before test
commit ba71702ec16760fdfb92d5b26d31c4b349c94dea
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Fri Oct 1 18:45:36 2021 +0200

    git: Ingest ordered tags then ordered branches references
    
    This should allow to gradually ingest the git history. If we don't deal with
    {master/main/HEAD} references first, it sounds highly plausible we won't have a mostly
    full connected graph at first (except for repositories with no tags). Started from tag
    v0.1.0, v0.2.0, etc...
    
    If we couple this with the option to ingest repository with multiple packfiles, this
    should allow to ingest more incrementally large repositories (as it creates partial
    snapshots at each fetch/store loop).
    
    Related to T3625

commit 59131eb21484a13257e668f97033ad5e0e06eee2
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Sun Oct 3 15:23:50 2021 +0200

    git: Load git repository through multiple packfiles fetch operations
    
    This introduces the means to configure the packfile fetching policy. The default, as
    before, is to fetch one packfile to ingest everything unknown out of it. When
    fetch_multiple_packfiles is True (and the ingestion passes through the 'smart'
    protocol), the ingestion uses packfiles (with a given number_of_heads_per_packfile).
    After each packfile is loaded, a 'partial' (because incomplete) and 'incremental' (as in
    gathering seen refs so far) snapshot is created.
    
    Even if the new fetching policy were activated, this should not impact how small to
    medium repositories are ingested.
    
    The end goal is to decrease the potential issues of failure during loading large
    repositories (with large packfiles) and to allow the eventual next loading to pick up
    where the last loading failure occurred.
    
    It's not perfect yet because it also depends on how the repository git graph
    connectivity (for example, if it happens that first 200 references are fully connected,
    then we will retrieve everything in one round anyway).
    
    Implementation wise, this adapts the current graph walker (which is the one resolving
    the missing local references from the remote references) so it won't walk over already
    fetched references when multiple iterations is needed.
    
    This also makes the loader git explicitely create partial visit when fetching packfiles.
    That is, the loader now creates partial visits with snapshot after each packfile
    consumed. The end goal being to decrease the work the loader would have to do again if
    the initial visit would not complete for some reasons.
    
    Related to T3625

Link to build: https://jenkins.softwareheritage.org/job/DLDG/job/tests-on-diff/244/
See console output for more information: https://jenkins.softwareheritage.org/job/DLDG/job/tests-on-diff/244/console

Harbormaster returned this revision to the author for changes because remote builds failed.Nov 16 2022, 12:26 PM
Harbormaster failed remote builds in B32830: Diff 31897!