Page MenuHomeSoftware Heritage

git: Load git repository through multiple packfiles fetch operations on large repository
Changes PlannedPublicDraft

Authored by ardumont on Fri, Oct 1, 2:09 PM.

Details

Reviewers
None
Group Reviewers
Reviewers
Maniphest Tasks
T3625: Reduce git loader memory footprint
Summary

This should not impact how small to medium repositories are ingested.

The end goal is to decrease the current memory pressure when loading a large repository.

To improve the current loading, we ensure that we retrieve unknown remote references
into batch of 200 references (by default, it's also configurable at loader instantiation
time).

It's not perfect yet because it also depends on how the repository git graph is (for
example, if it happens that first 200 references are fully connected, then we will
retrieve everything in one round).

Implementation wise, this adapts the current graph walker (which is the one resolving
the missing local references from the remote references) so we won't walk over already
fetched references when multiple iterations is needed.

This also makes the loader git explicitely create partial visit when fetching packfiles.
That is, the loader now creates partial visits with snapshot after each packfile
consumed. The end goal being to decrease the work the loader would have to do again if
the initial visit would not complete for some reasons.

Related to T3625

Test Plan
  • tox failing without swh.loader.core release with D6380
  • pytest (happy)
  • docker-compose (happy) tests that we do ingest with the same snapshot.

The memory usage is consistenly smaller than the existing master code.

Large repositories ingestion ongoing.

Diff Detail

Repository
rDLDG Git loader
Branch
improve-loader-git-with-vincent
Lint
Lint Skipped
Unit
Unit Tests Skipped
Build Status
Buildable 24172
Build 37729: Phabricator diff pipeline on jenkinsJenkins console · Jenkins
Build 37728: arc lint + arc unit

Event Timeline

Build has FAILED

Patch application report for D6386 (id=23233)

Rebasing onto 1d69d2be3b...

Current branch diff-target is up to date.
Changes applied before test
commit 5fa3446823f049f66f1a492f02eac37dc14d9f03
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Thu Sep 30 15:36:48 2021 +0200

    git: Load git repository through multiple packfiles fetch operations
    
    The end goal is to decrease the current memory pressure when loading a large repository.
    This does not impact how small to medium repositories are ingested.
    
    To improve the current loading, we ensure that we retrieve unknown remote references
    into batch of 2000 references (by default). It's configurable at loader instantiation
    time. It's not perfect yet because it also depends on how the repository git graph
    is (for example, if it happens that first 2000 references are fully connex, then we will
    retrieve everything in one round).
    
    Implementation wise, this adapts the current graph walker (which is the one resolving
    the missing local references from the remote references) so we won't walk over already
    fetched references when multiple iterations is needed.
    
    Related to T3625

Link to build: https://jenkins.softwareheritage.org/job/DLDG/job/tests-on-diff/131/
See console output for more information: https://jenkins.softwareheritage.org/job/DLDG/job/tests-on-diff/131/console

Harbormaster returned this revision to the author for changes because remote builds failed.Fri, Oct 1, 2:10 PM
Harbormaster failed remote builds in B24150: Diff 23233!

This would work for some edge cases repository but would not work for other ones.

Refs correspond to branch tip commits so it might fix loading of repos containing thousand of branches
but not repos with a couple of branches and a really large number of commits inducing a too large pack file to retrieve.

I would rather try to put a size limit on the number of revisions that must be contain in each fetched pack
file. Nevertheless this would imply fetching the commits log using the graph walker and could impact
overall performance.

So I do not think that approach will fix all out of memory errors encountered in production.

ardumont retitled this revision from git: Load git repository through multiple packfiles fetch operations to wip: git: Load git repository through multiple packfiles fetch operations.Fri, Oct 1, 4:07 PM

So I do not think that approach will fix all out of memory errors encountered in production.

Agreed, for example, it doesn't change the linux repo ingestion because it has a number of refs < 2000.

Unfortunately we haven't found a way to have a deterministic number of revisions returned in the pack file (yeah without working a the revision level but it's not an option).
One good point of this diff I think, is it opens the path of an incremental ingestion we can improve later

We need to check if it can be improved like by adding a retroaction loop between the result of do_pack and the number of ref to fetch

Nevertheless this would imply fetching the commits log using the graph walker and could impact overall performance.

It still needs a pack file to be fetched to walk on commits graph and I do not see any way to fetch commit ids only
using smart transfer protocol so forgot about this.

Unfortunately we haven't found a way to have a deterministic number of revisions returned in the pack file (yeah without working a the revision level but it's not an option).
One good point of this diff I think, is it opens the path of an incremental ingestion we can improve later

Yes, that problem is not so easy to resolve.
I am wondering if we could not use the depth parameter of the dulwich.client.fetch_pack method, which would
enable to fetch the N most recent commits and their reachable objects in a pack file from the wanted refs and
iterate until all wanted objects have been retrieved.

One good point of this diff I think, is it opens the path of an incremental ingestion we can improve later

This should fix some big repos ingestion indeed, I am not against it but I would prefer a more generic solution.

One good point of this diff I think, is it opens the path of an incremental ingestion we can improve later

This should fix some big repos ingestion indeed, I am not against it but I would prefer a more generic solution.

I agree, I also have mixed feelings about the solutions, this is why it's still planned changes and we need to improve it.
But but in any case, thanks for the feedbacks

Use packet of 200 (2000 is a bit much)

Build has FAILED

Patch application report for D6386 (id=23248)

Rebasing onto 8d371e815c...

First, rewinding head to replay your work on top of it...
Applying: git: Load git repository through multiple packfiles fetch operations
Changes applied before test
commit 454453346b51ad30f5caeb517c418812c76c4a52
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Thu Sep 30 15:36:48 2021 +0200

    git: Load git repository through multiple packfiles fetch operations
    
    The end goal is to decrease the current memory pressure when loading a large repository.
    This does not impact how small to medium repositories are ingested.
    
    To improve the current loading, we ensure that we retrieve unknown remote references
    into batch of 2000 references (by default). It's configurable at loader instantiation
    time. It's not perfect yet because it also depends on how the repository git graph
    is (for example, if it happens that first 2000 references are fully connex, then we will
    retrieve everything in one round).
    
    Implementation wise, this adapts the current graph walker (which is the one resolving
    the missing local references from the remote references) so we won't walk over already
    fetched references when multiple iterations is needed.
    
    Related to T3625

Link to build: https://jenkins.softwareheritage.org/job/DLDG/job/tests-on-diff/135/
See console output for more information: https://jenkins.softwareheritage.org/job/DLDG/job/tests-on-diff/135/console

Harbormaster returned this revision to the author for changes because remote builds failed.Fri, Oct 1, 6:53 PM
Harbormaster failed remote builds in B24165: Diff 23248!

git: Load git repository through multiple packfiles fetch operations

on large repository. This should not impact how small to medium repositories are
ingested.

The end goal is to decrease the current memory pressure when loading a large repository.

To improve the current loading, we ensure that we retrieve unknown remote references
into batch of 200 references (by default, it's also configurable at loader instantiation
time).

It's not perfect yet because it also depends on how the repository git graph is (for
example, if it happens that first 200 references are fully connected, then we will
retrieve everything in one round).

Implementation wise, this adapts the current graph walker (which is the one resolving
the missing local references from the remote references) so we won't walk over already
fetched references when multiple iterations is needed.

This also makes the loader git explicitely create partial visit when fetching packfiles.
That is, the loader now creates partial visits with snapshot after each packfile
consumed. The end goal being to decrease the work the loader would have to do again if
the initial visit would not complete for some reasons.

Related to T3625

Build has FAILED

Patch application report for D6386 (id=23255)

Rebasing onto 8d371e815c...

Current branch diff-target is up to date.
Changes applied before test
commit 07930184b93055772e7b38c9ca95dc477f5b9eac
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Sun Oct 3 15:23:50 2021 +0200

    git: Load git repository through multiple packfiles fetch operations
    
    on large repository. This should not impact how small to medium repositories are
    ingested.
    
    The end goal is to decrease the current memory pressure when loading a large repository.
    
    To improve the current loading, we ensure that we retrieve unknown remote references
    into batch of 200 references (by default, it's also configurable at loader instantiation
    time).
    
    It's not perfect yet because it also depends on how the repository git graph is (for
    example, if it happens that first 200 references are fully connected, then we will
    retrieve everything in one round).
    
    Implementation wise, this adapts the current graph walker (which is the one resolving
    the missing local references from the remote references) so we won't walk over already
    fetched references when multiple iterations is needed.
    
    This also makes the loader git explicitely create partial visit when fetching packfiles.
    That is, the loader now creates partial visits with snapshot after each packfile
    consumed. The end goal being to decrease the work the loader would have to do again if
    the initial visit would not complete for some reasons.
    
    Related to T3625

Link to build: https://jenkins.softwareheritage.org/job/DLDG/job/tests-on-diff/137/
See console output for more information: https://jenkins.softwareheritage.org/job/DLDG/job/tests-on-diff/137/console

Harbormaster returned this revision to the author for changes because remote builds failed.Sun, Oct 3, 3:31 PM
Harbormaster failed remote builds in B24172: Diff 23255!
ardumont retitled this revision from wip: git: Load git repository through multiple packfiles fetch operations to git: Load git repository through multiple packfiles fetch operations on large repository.Sun, Oct 3, 3:31 PM
ardumont edited the summary of this revision. (Show Details)
swh/loader/git/loader.py
48

Do we need this?
I'd very much see this as silently ignored...

swh/loader/git/loader.py
527

That warning becomes actually tedious (and possibly strenuous for elasticsearch)...
On staging runs (because not landed yet), this can becomes a huge list [1].

As it will eventually be resolved if we get to the bottom of the repository, do we keep that log instruction?

[1] P1194