Page MenuHomeSoftware Heritage

Run a graph traversal to compute objects in packfiles that we know we have
Changes PlannedPublicDraft

Authored by vlorentz on May 20 2022, 5:55 PM.

Details

Reviewers
None
Group Reviewers
Reviewers
Summary

With the 'smart' protocol, the loader tells the remote what head it
already has, which allows remotes not to include them or their
(transitive) ancestors in packfiles.
However, remotes can cache packfiles, and send many objects we do not
need, which wastes a lot of time in *_missing() RPC calls to the
storage.

This new graph traversal goes through all objects in the packfiles we
received, and computes which objects are ancestors of refs we already
have (assuming all objects in the path are in the packfile).

Depends on D7881 and D7876

Test Plan

This needs to be tested better; bugs in this algo would create holes
in the graph, which would be very bad.

Diff Detail

Repository
rDLDG Git loader
Branch
unwanted-packfiles
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 29827
Build 46618: Phabricator diff pipeline on jenkinsJenkins console · Jenkins
Build 46617: arc lint + arc unit

Event Timeline

Build has FAILED

Patch application report for D7882 (id=28439)

Could not rebase; Attempt merge onto 85a4794094...

Updating 85a4794..c04bb20
Fast-forward
 requirements-swh.txt                |   2 +-
 swh/loader/git/base.py              | 134 ++++++++++++++++++++++++++++
 swh/loader/git/from_disk.py         |   4 +-
 swh/loader/git/loader.py            | 112 ++++++++++++++++++++++--
 swh/loader/git/tests/test_loader.py | 169 ++++++++++++++++++++++++++++++++++--
 5 files changed, 406 insertions(+), 15 deletions(-)
 create mode 100644 swh/loader/git/base.py
Changes applied before test
commit c04bb20c96b0cd5bd2e65bec3d624f759e6113db
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Fri May 20 17:52:09 2022 +0200

    Run a graph traversal to compute objects in packfiles that we know we have
    
    With the 'smart' protocol, the loader tells the remote what head it
    already has, which allows remotes not to include them or their
    (transitive) ancestors in packfiles.
    However, remotes can cache packfiles, and send many objects we do not
    need, which wastes a lot of time in `*_missing()` RPC calls to the
    storage.
    
    This new graph traversal goes through all objects in the packfiles we
    received, and computes which objects are ancestors of refs we already
    have (assuming all objects in the path are in the packfile).

commit 356b5542852baff2bd5eb616703ff74566506f95
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Fri May 20 15:53:41 2022 +0200

    Log summary of filtered objects in store_data

commit 5ced09db7e6636c0959cca2bb36088b2e3a86e4f
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Fri May 20 15:10:55 2022 +0200

    Add an unweighted average for filtered_objects + fix existing metric name

commit f45ca1c2c0fac2f57cd1b43af984251078c89169
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Fri May 20 13:46:58 2022 +0200

    Add metrics in store_data on ratios of objects already stored

commit 083e1aa18e24cd3311162e259b15c1867b313060
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Fri May 20 12:16:23 2022 +0200

    Move store_data from DVCSLoader to a new BaseGitLoader
    
    In preparation for the removal of DVCSLoader from swh.loader.core,
    as the git loader is the only one to use it anymore.

Link to build: https://jenkins.softwareheritage.org/job/DLDG/job/tests-on-diff/222/
See console output for more information: https://jenkins.softwareheritage.org/job/DLDG/job/tests-on-diff/222/console

Harbormaster returned this revision to the author for changes because remote builds failed.May 20 2022, 5:55 PM
Harbormaster failed remote builds in B29530: Diff 28439!
  • Only traverse the tag/commit graph

Build has FAILED

Patch application report for D7882 (id=28745)

Rebasing onto 44e35e4e24...

First, rewinding head to replay your work on top of it...
Applying: Run a graph traversal to compute objects in packfiles that we know we have
Using index info to reconstruct a base tree...
M	swh/loader/git/base.py
M	swh/loader/git/tests/test_loader.py
Falling back to patching base and 3-way merge...
Auto-merging swh/loader/git/tests/test_loader.py
Auto-merging swh/loader/git/base.py
Applying: Only traverse the tag/commit graph
Changes applied before test
commit feb3e3075938cfe18da3b202baf6f4f01a505bf2
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Fri Jun 10 13:23:08 2022 +0200

    Only traverse the tag/commit graph
    
    On linux.git, the full traversal on trees and blobs uses over 50GB of RAM;
    while it is less than 1GB on tag/commits only.
    
    After running the traversal:
    
    1. build a pack index
    2. iterate over commit/tags in the packfile
    3. for each commit/tags we have not seen while traversing (ie. that are
       not reachable from known snapshots), recursively load the tree and
       blobs they reference
    
    In terms of benchmarks (on my machine, and using dulwich 0.20.43 +
    in-memory overlay storage on top of production):
    
    * 2h with this commit
    * 2.5h with a variant of this commit, using the on-disk indexes
      (instead of MemoryPackIndex)
    * 2.75h with the previous algorithm
    
    And this is despite this commit being naive, in that will repeatedly try
    to load the same trees and blobs when they were already.
    A future commit may improve this; eg. by marking what trees and blobs we
    loaded in the pack index.

commit 8df3a56d3cc02c142c35a86072e41dad35ea7ac0
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Fri May 20 17:52:09 2022 +0200

    Run a graph traversal to compute objects in packfiles that we know we have
    
    With the 'smart' protocol, the loader tells the remote what head it
    already has, which allows remotes not to include them or their
    (transitive) ancestors in packfiles.
    However, remotes can cache packfiles, and send many objects we do not
    need, which wastes a lot of time in `*_missing()` RPC calls to the
    storage.
    
    This new graph traversal goes through all objects in the packfiles we
    received, and computes which objects are ancestors of refs we already
    have (assuming all objects in the path are in the packfile).

Link to build: https://jenkins.softwareheritage.org/job/DLDG/job/tests-on-diff/224/
See console output for more information: https://jenkins.softwareheritage.org/job/DLDG/job/tests-on-diff/224/console

Harbormaster returned this revision to the author for changes because remote builds failed.Jun 10 2022, 1:29 PM
Harbormaster failed remote builds in B29827: Diff 28745!