Page MenuHomeSoftware Heritage

git: Load git repository through multiple packfiles fetch operations
Changes PlannedPublicDraft

Authored by ardumont on Oct 1 2021, 2:09 PM.

Details

Reviewers
None
Group Reviewers
Reviewers
Maniphest Tasks
T3625: Reduce git loader memory footprint
Summary

This introduces the means to configure the packfile fetching policy. The default, as
before, is to fetch one packfile to ingest everything unknown out of it. When
fetch_multiple_packfiles is True (and the ingestion passes through the 'smart'
protocol), the ingestion uses packfiles (with a given number_of_heads_per_packfile).
After each packfile is loaded, a 'partial' (because incomplete) and 'incremental' (as in
gathering seen refs so far) snapshot is created.

Even if the new fetching policy were activated, this should not impact how small to
medium repositories are ingested.

The end goal is to decrease the potential issues of failure during loading large
repositories (with large packfiles) and to allow the eventual next loading to pick up
where the last loading failure occurred.

It's not perfect yet because it also depends on how the repository git graph
connectivity (for example, if it happens that first 200 references are fully connected,
then we will retrieve everything in one round anyway).

Implementation wise, this adapts the current graph walker (which is the one resolving
the missing local references from the remote references) so it won't walk over already
fetched references when multiple iterations is needed.

This also makes the loader git explicitely create partial visit when fetching packfiles.
That is, the loader now creates partial visits with snapshot after each packfile
consumed. The end goal being to decrease the work the loader would have to do again if
the initial visit would not complete for some reasons.

Related to T3625

Test Plan
  • tox failing without swh.loader.core release with D6380
  • pytest (happy)
  • docker-compose (happy) tests that we do ingest with the same snapshot.

The memory usage is consistenly smaller than the existing master code.

Large repositories ingestion ongoing.

Diff Detail

Repository
rDLDG Git loader
Branch
improve-loader-git-with-vincent
Lint
Lint Skipped
Unit
Unit Tests Skipped
Build Status
Buildable 24165
Build 37715: Phabricator diff pipeline on jenkinsJenkins console · Jenkins
Build 37714: arc lint + arc unit

Event Timeline

Build has FAILED

Patch application report for D6386 (id=23233)

Rebasing onto 1d69d2be3b...

Current branch diff-target is up to date.
Changes applied before test
commit 5fa3446823f049f66f1a492f02eac37dc14d9f03
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Thu Sep 30 15:36:48 2021 +0200

    git: Load git repository through multiple packfiles fetch operations
    
    The end goal is to decrease the current memory pressure when loading a large repository.
    This does not impact how small to medium repositories are ingested.
    
    To improve the current loading, we ensure that we retrieve unknown remote references
    into batch of 2000 references (by default). It's configurable at loader instantiation
    time. It's not perfect yet because it also depends on how the repository git graph
    is (for example, if it happens that first 2000 references are fully connex, then we will
    retrieve everything in one round).
    
    Implementation wise, this adapts the current graph walker (which is the one resolving
    the missing local references from the remote references) so we won't walk over already
    fetched references when multiple iterations is needed.
    
    Related to T3625

Link to build: https://jenkins.softwareheritage.org/job/DLDG/job/tests-on-diff/131/
See console output for more information: https://jenkins.softwareheritage.org/job/DLDG/job/tests-on-diff/131/console

Harbormaster returned this revision to the author for changes because remote builds failed.Oct 1 2021, 2:10 PM
Harbormaster failed remote builds in B24150: Diff 23233!

This would work for some edge cases repository but would not work for other ones.

Refs correspond to branch tip commits so it might fix loading of repos containing thousand of branches
but not repos with a couple of branches and a really large number of commits inducing a too large pack file to retrieve.

I would rather try to put a size limit on the number of revisions that must be contain in each fetched pack
file. Nevertheless this would imply fetching the commits log using the graph walker and could impact
overall performance.

So I do not think that approach will fix all out of memory errors encountered in production.

ardumont retitled this revision from git: Load git repository through multiple packfiles fetch operations to wip: git: Load git repository through multiple packfiles fetch operations.Oct 1 2021, 4:07 PM

So I do not think that approach will fix all out of memory errors encountered in production.

Agreed, for example, it doesn't change the linux repo ingestion because it has a number of refs < 2000.

Unfortunately we haven't found a way to have a deterministic number of revisions returned in the pack file (yeah without working a the revision level but it's not an option).
One good point of this diff I think, is it opens the path of an incremental ingestion we can improve later

We need to check if it can be improved like by adding a retroaction loop between the result of do_pack and the number of ref to fetch

Nevertheless this would imply fetching the commits log using the graph walker and could impact overall performance.

It still needs a pack file to be fetched to walk on commits graph and I do not see any way to fetch commit ids only
using smart transfer protocol so forgot about this.

Unfortunately we haven't found a way to have a deterministic number of revisions returned in the pack file (yeah without working a the revision level but it's not an option).
One good point of this diff I think, is it opens the path of an incremental ingestion we can improve later

Yes, that problem is not so easy to resolve.
I am wondering if we could not use the depth parameter of the dulwich.client.fetch_pack method, which would
enable to fetch the N most recent commits and their reachable objects in a pack file from the wanted refs and
iterate until all wanted objects have been retrieved.

One good point of this diff I think, is it opens the path of an incremental ingestion we can improve later

This should fix some big repos ingestion indeed, I am not against it but I would prefer a more generic solution.

One good point of this diff I think, is it opens the path of an incremental ingestion we can improve later

This should fix some big repos ingestion indeed, I am not against it but I would prefer a more generic solution.

I agree, I also have mixed feelings about the solutions, this is why it's still planned changes and we need to improve it.
But but in any case, thanks for the feedbacks

Use packet of 200 (2000 is a bit much)

Build has FAILED

Patch application report for D6386 (id=23248)

Rebasing onto 8d371e815c...

First, rewinding head to replay your work on top of it...
Applying: git: Load git repository through multiple packfiles fetch operations
Changes applied before test
commit 454453346b51ad30f5caeb517c418812c76c4a52
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Thu Sep 30 15:36:48 2021 +0200

    git: Load git repository through multiple packfiles fetch operations
    
    The end goal is to decrease the current memory pressure when loading a large repository.
    This does not impact how small to medium repositories are ingested.
    
    To improve the current loading, we ensure that we retrieve unknown remote references
    into batch of 2000 references (by default). It's configurable at loader instantiation
    time. It's not perfect yet because it also depends on how the repository git graph
    is (for example, if it happens that first 2000 references are fully connex, then we will
    retrieve everything in one round).
    
    Implementation wise, this adapts the current graph walker (which is the one resolving
    the missing local references from the remote references) so we won't walk over already
    fetched references when multiple iterations is needed.
    
    Related to T3625

Link to build: https://jenkins.softwareheritage.org/job/DLDG/job/tests-on-diff/135/
See console output for more information: https://jenkins.softwareheritage.org/job/DLDG/job/tests-on-diff/135/console

Harbormaster returned this revision to the author for changes because remote builds failed.Oct 1 2021, 6:53 PM
Harbormaster failed remote builds in B24165: Diff 23248!

git: Load git repository through multiple packfiles fetch operations

on large repository. This should not impact how small to medium repositories are
ingested.

The end goal is to decrease the current memory pressure when loading a large repository.

To improve the current loading, we ensure that we retrieve unknown remote references
into batch of 200 references (by default, it's also configurable at loader instantiation
time).

It's not perfect yet because it also depends on how the repository git graph is (for
example, if it happens that first 200 references are fully connected, then we will
retrieve everything in one round).

Implementation wise, this adapts the current graph walker (which is the one resolving
the missing local references from the remote references) so we won't walk over already
fetched references when multiple iterations is needed.

This also makes the loader git explicitely create partial visit when fetching packfiles.
That is, the loader now creates partial visits with snapshot after each packfile
consumed. The end goal being to decrease the work the loader would have to do again if
the initial visit would not complete for some reasons.

Related to T3625

Build has FAILED

Patch application report for D6386 (id=23255)

Rebasing onto 8d371e815c...

Current branch diff-target is up to date.
Changes applied before test
commit 07930184b93055772e7b38c9ca95dc477f5b9eac
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Sun Oct 3 15:23:50 2021 +0200

    git: Load git repository through multiple packfiles fetch operations
    
    on large repository. This should not impact how small to medium repositories are
    ingested.
    
    The end goal is to decrease the current memory pressure when loading a large repository.
    
    To improve the current loading, we ensure that we retrieve unknown remote references
    into batch of 200 references (by default, it's also configurable at loader instantiation
    time).
    
    It's not perfect yet because it also depends on how the repository git graph is (for
    example, if it happens that first 200 references are fully connected, then we will
    retrieve everything in one round).
    
    Implementation wise, this adapts the current graph walker (which is the one resolving
    the missing local references from the remote references) so we won't walk over already
    fetched references when multiple iterations is needed.
    
    This also makes the loader git explicitely create partial visit when fetching packfiles.
    That is, the loader now creates partial visits with snapshot after each packfile
    consumed. The end goal being to decrease the work the loader would have to do again if
    the initial visit would not complete for some reasons.
    
    Related to T3625

Link to build: https://jenkins.softwareheritage.org/job/DLDG/job/tests-on-diff/137/
See console output for more information: https://jenkins.softwareheritage.org/job/DLDG/job/tests-on-diff/137/console

Harbormaster returned this revision to the author for changes because remote builds failed.Oct 3 2021, 3:31 PM
Harbormaster failed remote builds in B24172: Diff 23255!
ardumont retitled this revision from wip: git: Load git repository through multiple packfiles fetch operations to git: Load git repository through multiple packfiles fetch operations on large repository.Oct 3 2021, 3:31 PM
ardumont edited the summary of this revision. (Show Details)
swh/loader/git/loader.py
47

Do we need this?
I'd very much see this as silently ignored...

swh/loader/git/loader.py
522

That warning becomes actually tedious (and possibly strenuous for elasticsearch)...
On staging runs (because not landed yet), this can becomes a huge list [1].

As it will eventually be resolved if we get to the bottom of the repository, do we keep that log instruction?

[1] P1194

ardumont edited the summary of this revision. (Show Details)

Rebase

Opened to check the diff inline

Build has FAILED

Patch application report for D6386 (id=31892)

Rebasing onto 92d9ada9b7...

Current branch diff-target is up to date.
Changes applied before test
commit 0f13f78892d03f8aff1a399c6d5d1c61312d013b
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Sun Oct 3 15:23:50 2021 +0200

    git: Load git repository through multiple packfiles fetch operations
    
    on large repository. This should not impact how small to medium repositories are
    ingested.
    
    The end goal is to decrease the current memory pressure when loading a large repository.
    
    To improve the current loading, we ensure that we retrieve unknown remote references
    into batch of 200 references (by default, it's also configurable at loader instantiation
    time).
    
    It's not perfect yet because it also depends on how the repository git graph is (for
    example, if it happens that first 200 references are fully connected, then we will
    retrieve everything in one round).
    
    Implementation wise, this adapts the current graph walker (which is the one resolving
    the missing local references from the remote references) so we won't walk over already
    fetched references when multiple iterations is needed.
    
    This also makes the loader git explicitely create partial visit when fetching packfiles.
    That is, the loader now creates partial visits with snapshot after each packfile
    consumed. The end goal being to decrease the work the loader would have to do again if
    the initial visit would not complete for some reasons.
    
    Related to T3625

Link to build: https://jenkins.softwareheritage.org/job/DLDG/job/tests-on-diff/240/
See console output for more information: https://jenkins.softwareheritage.org/job/DLDG/job/tests-on-diff/240/console

Harbormaster returned this revision to the author for changes because remote builds failed.Nov 15 2022, 5:28 PM
Harbormaster failed remote builds in B32825: Diff 31892!

Build has FAILED

Patch application report for D6386 (id=31893)

Rebasing onto 92d9ada9b7...

Current branch diff-target is up to date.
Changes applied before test
commit 50b55bb4692a96b752628eb669e6913060e67065
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Sun Oct 3 15:23:50 2021 +0200

    git: Load git repository through multiple packfiles fetch operations
    
    on large repository. This should not impact how small to medium repositories are
    ingested.
    
    The end goal is to decrease the current memory pressure when loading a large repository.
    
    To improve the current loading, we ensure that we retrieve unknown remote references
    into batch of 200 references (by default, it's also configurable at loader instantiation
    time).
    
    It's not perfect yet because it also depends on how the repository git graph is (for
    example, if it happens that first 200 references are fully connected, then we will
    retrieve everything in one round).
    
    Implementation wise, this adapts the current graph walker (which is the one resolving
    the missing local references from the remote references) so we won't walk over already
    fetched references when multiple iterations is needed.
    
    This also makes the loader git explicitely create partial visit when fetching packfiles.
    That is, the loader now creates partial visits with snapshot after each packfile
    consumed. The end goal being to decrease the work the loader would have to do again if
    the initial visit would not complete for some reasons.
    
    Related to T3625

Link to build: https://jenkins.softwareheritage.org/job/DLDG/job/tests-on-diff/241/
See console output for more information: https://jenkins.softwareheritage.org/job/DLDG/job/tests-on-diff/241/console

Harbormaster returned this revision to the author for changes because remote builds failed.Nov 15 2022, 5:37 PM
Harbormaster failed remote builds in B32826: Diff 31893!
ardumont retitled this revision from git: Load git repository through multiple packfiles fetch operations on large repository to git: Load git repository through multiple packfiles fetch operations.
  • Update docstrings and test
  • reword commit message

Build has FAILED

Patch application report for D6386 (id=31895)

Rebasing onto 92d9ada9b7...

Current branch diff-target is up to date.
Changes applied before test
commit 6fbc94f24b10fdd950a65d2beced3c1d47032309
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Sun Oct 3 15:23:50 2021 +0200

    git: Load git repository through multiple packfiles fetch operations
    
    This introduces the means to configure the packfile fetching policy. The default, as
    before, is to fetch one packfile to ingest everything unknown out of it. When
    fetch_multiple_packfiles is True (and the ingestion passes through the 'smart'
    protocol), the ingestion uses packfiles (with a given number_of_heads_per_packfile).
    After each packfile is loaded, a 'partial' (because incomplete) and 'incremental' (as in
    gathering seen refs so far) snapshot is created.
    
    Even if the new fetching policy were activated, this should not impact how small to
    medium repositories are ingested.
    
    The end goal is to decrease the potential issues of failure during loading large
    repositories (with large packfiles) and to allow the eventual next loading to pick up
    where the last loading failure occurred.
    
    It's not perfect yet because it also depends on how the repository git graph
    connectivity (for example, if it happens that first 200 references are fully connected,
    then we will retrieve everything in one round anyway).
    
    Implementation wise, this adapts the current graph walker (which is the one resolving
    the missing local references from the remote references) so it won't walk over already
    fetched references when multiple iterations is needed.
    
    This also makes the loader git explicitely create partial visit when fetching packfiles.
    That is, the loader now creates partial visits with snapshot after each packfile
    consumed. The end goal being to decrease the work the loader would have to do again if
    the initial visit would not complete for some reasons.
    
    Related to T3625

Link to build: https://jenkins.softwareheritage.org/job/DLDG/job/tests-on-diff/242/
See console output for more information: https://jenkins.softwareheritage.org/job/DLDG/job/tests-on-diff/242/console

Harbormaster returned this revision to the author for changes because remote builds failed.Nov 16 2022, 11:25 AM
Harbormaster failed remote builds in B32828: Diff 31895!

Build has FAILED

Patch application report for D6386 (id=31896)

Rebasing onto 92d9ada9b7...

Current branch diff-target is up to date.
Changes applied before test
commit 59131eb21484a13257e668f97033ad5e0e06eee2
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Sun Oct 3 15:23:50 2021 +0200

    git: Load git repository through multiple packfiles fetch operations
    
    This introduces the means to configure the packfile fetching policy. The default, as
    before, is to fetch one packfile to ingest everything unknown out of it. When
    fetch_multiple_packfiles is True (and the ingestion passes through the 'smart'
    protocol), the ingestion uses packfiles (with a given number_of_heads_per_packfile).
    After each packfile is loaded, a 'partial' (because incomplete) and 'incremental' (as in
    gathering seen refs so far) snapshot is created.
    
    Even if the new fetching policy were activated, this should not impact how small to
    medium repositories are ingested.
    
    The end goal is to decrease the potential issues of failure during loading large
    repositories (with large packfiles) and to allow the eventual next loading to pick up
    where the last loading failure occurred.
    
    It's not perfect yet because it also depends on how the repository git graph
    connectivity (for example, if it happens that first 200 references are fully connected,
    then we will retrieve everything in one round anyway).
    
    Implementation wise, this adapts the current graph walker (which is the one resolving
    the missing local references from the remote references) so it won't walk over already
    fetched references when multiple iterations is needed.
    
    This also makes the loader git explicitely create partial visit when fetching packfiles.
    That is, the loader now creates partial visits with snapshot after each packfile
    consumed. The end goal being to decrease the work the loader would have to do again if
    the initial visit would not complete for some reasons.
    
    Related to T3625

Link to build: https://jenkins.softwareheritage.org/job/DLDG/job/tests-on-diff/243/
See console output for more information: https://jenkins.softwareheritage.org/job/DLDG/job/tests-on-diff/243/console

Harbormaster returned this revision to the author for changes because remote builds failed.Nov 16 2022, 11:30 AM
Harbormaster failed remote builds in B32829: Diff 31896!
requirements-swh.txt
5 ↗(On Diff #31895)

alas, not working

swh/loader/git/loader.py
47

Do we need this?
I'd very much see this as silently ignored...

nvm, let's keep it.