Page MenuHomeSoftware Heritage

git: Ingest in order tag then branche references
Changes PlannedPublicDraft

Authored by ardumont on Fri, Oct 1, 6:55 PM.

Details

Reviewers
None
Group Reviewers
Reviewers
Maniphest Tasks
T3625: Reduce git loader memory footprint
Summary

It should allow to gradually ingest the git history in smaller packfiles first. If we
don't deal with master/main/HEAD first, it sounds highly plausible we won't have a
mostly full connected graph in the first packfiles (except for repositories with no
tags)

Related to T3625
Depends on D6386

Diff Detail

Repository
rDLDG Git loader
Branch
improve-loader-git-with-vincent
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 24173
Build 37731: Phabricator diff pipeline on jenkinsJenkins console · Jenkins
Build 37730: arc lint + arc unit

Event Timeline

Build has FAILED

Patch application report for D6392 (id=23249)

Could not rebase; Attempt merge onto 8d371e815c...

Merge made by the 'recursive' strategy.
 swh/loader/git/loader.py | 138 +++++++++++++++++++++++++++++++++++------------
 1 file changed, 103 insertions(+), 35 deletions(-)
Changes applied before test
commit fe69a9262bbe6e6a14ced2e50b4e95c3d0d371cc
Merge: 8d371e8 85db21a
Author: Jenkins user <jenkins@localhost>
Date:   Fri Oct 1 16:55:54 2021 +0000

    Merge branch 'diff-target' into HEAD

commit 85db21a3b2b802af8bb47834c76da5f124b0c4c2
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Fri Oct 1 18:45:36 2021 +0200

    git: Ingest in order tag then branche references
    
    It should allow to gradually ingest the git history in smaller packfiles first. If we
    don't deal with master/main/HEAD first, it sounds highly plausible we won't have a
    mostly full connected graph in the first packfiles (except for repositories with no
    tags)
    
    Related to T3625

commit 0acadb93ecc3b18712e06b36ed965a0eb8032b3d
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Thu Sep 30 15:36:48 2021 +0200

    git: Load git repository through multiple packfiles fetch operations
    
    The end goal is to decrease the current memory pressure when loading a large repository.
    This does not impact how small to medium repositories are ingested.
    
    To improve the current loading, we ensure that we retrieve unknown remote references
    into batch of 2000 references (by default). It's configurable at loader instantiation
    time. It's not perfect yet because it also depends on how the repository git graph
    is (for example, if it happens that first 2000 references are fully connex, then we will
    retrieve everything in one round).
    
    Implementation wise, this adapts the current graph walker (which is the one resolving
    the missing local references from the remote references) so we won't walk over already
    fetched references when multiple iterations is needed.
    
    Related to T3625

Link to build: https://jenkins.softwareheritage.org/job/DLDG/job/tests-on-diff/136/
See console output for more information: https://jenkins.softwareheritage.org/job/DLDG/job/tests-on-diff/136/console

Harbormaster returned this revision to the author for changes because remote builds failed.Fri, Oct 1, 6:56 PM
Harbormaster failed remote builds in B24166: Diff 23249!

Build has FAILED

Patch application report for D6392 (id=23256)

Could not rebase; Attempt merge onto 8d371e815c...

Updating 8d371e8..828faca
Fast-forward
 requirements-swh.txt                |   2 +-
 swh/loader/git/loader.py            | 153 ++++++++++++++++++++++++++----------
 swh/loader/git/tests/test_loader.py |  36 +++++++++
 3 files changed, 150 insertions(+), 41 deletions(-)
Changes applied before test
commit 828faca8c7f9ba4e7e1d3d89311c1a67cbe06ad4
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Fri Oct 1 18:45:36 2021 +0200

    git: Ingest in order tag then branche references
    
    It should allow to gradually ingest the git history in smaller packfiles first. If we
    don't deal with master/main/HEAD first, it sounds highly plausible we won't have a
    mostly full connected graph in the first packfiles (except for repositories with no
    tags)
    
    Related to T3625

commit 07930184b93055772e7b38c9ca95dc477f5b9eac
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Sun Oct 3 15:23:50 2021 +0200

    git: Load git repository through multiple packfiles fetch operations
    
    on large repository. This should not impact how small to medium repositories are
    ingested.
    
    The end goal is to decrease the current memory pressure when loading a large repository.
    
    To improve the current loading, we ensure that we retrieve unknown remote references
    into batch of 200 references (by default, it's also configurable at loader instantiation
    time).
    
    It's not perfect yet because it also depends on how the repository git graph is (for
    example, if it happens that first 200 references are fully connected, then we will
    retrieve everything in one round).
    
    Implementation wise, this adapts the current graph walker (which is the one resolving
    the missing local references from the remote references) so we won't walk over already
    fetched references when multiple iterations is needed.
    
    This also makes the loader git explicitely create partial visit when fetching packfiles.
    That is, the loader now creates partial visits with snapshot after each packfile
    consumed. The end goal being to decrease the work the loader would have to do again if
    the initial visit would not complete for some reasons.
    
    Related to T3625

Link to build: https://jenkins.softwareheritage.org/job/DLDG/job/tests-on-diff/138/
See console output for more information: https://jenkins.softwareheritage.org/job/DLDG/job/tests-on-diff/138/console

Harbormaster returned this revision to the author for changes because remote builds failed.Sun, Oct 3, 3:31 PM
Harbormaster failed remote builds in B24173: Diff 23256!

I don't understand the sorting aspect in this diff.

Why is sorting required here? I mean it seems you only need to put tag branches in front and non-tag ones at the back, right?

Wouldn't this be enought?

tags = set()
branches = set()
for ref_name, ref_target in refs.items():
    if utils.ignore_branch_name(ref_name) or ref_target in local_heads:
        continue
    if ref_name.startswith(b"refs/tags/"):
        tags.add(ref_target)
    else:
        branches.add(ref_target)
self.wanted_refs = list(tags) + list(branches)