wip: git: Group objects per type early to drop the packfile reference asap
Abandoned, Public

Authored by ardumont on Sep 30 2021, 9:27 AM.

Details

Reviewers
None
Group Reviewers
Reviewers
Maniphest Tasks
T3625: Reduce git loader memory footprint
Summary

Prior to this commit, the git loader kept the packfile reference and iterated over
it multiple times (per object type). This commit groups the objects per type early so
that the packfile reference can be dropped as soon as possible.

This should reduce memory pressure when loading very large repositories, assuming the
Python dict takes less space than the packfile.
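
To illustrate the idea described above, here is a minimal sketch, not the loader's actual code. It assumes pack entries are available as plain `(type name, object id, raw bytes)` tuples; the `PackEntry` alias and the `group_by_type` helper are hypothetical names introduced only for this example.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical stand-in for a parsed pack entry: (type name, object id, raw bytes).
PackEntry = Tuple[str, str, bytes]


def group_by_type(entries: Iterable[PackEntry]) -> Dict[str, List[Tuple[str, bytes]]]:
    """Walk the entries once and bucket them by git object type."""
    grouped = defaultdict(list)
    for type_name, object_id, raw in entries:
        grouped[type_name].append((object_id, raw))
    return dict(grouped)


# Illustrative data only; a real pack would be read from a file-like object.
entries = iter([
    ("blob", "a1", b"hello"),
    ("tree", "b2", b"tree-data"),
    ("commit", "c3", b"commit-data"),
    ("blob", "d4", b"world"),
])
objects = group_by_type(entries)
del entries  # the source is consumed in a single pass, so its reference can go
```

After this single pass, each per-type step reads from `objects` instead of iterating over the pack again. Whether that saves memory depends entirely on how the size of the dict compares to the size of the pack.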

Related to T3625

Test Plan

tox

(I'm actually checking that it does the right thing on staging with a venv)

It does not. That kills the process almost immediately [1].

[1] P1185

Diff Detail

Repository
rDLDG Git loader
Branch
improve-loader-git
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 24122
Build 37641: Phabricator diff pipeline on jenkins
Build 37640: arc lint + arc unit

Event Timeline

Build is green

Patch application report for D6377 (id=23205)

Rebasing onto 368674744c...

Current branch diff-target is up to date.
Changes applied before test
commit f17ccb5a810583c6d520b0da7ef7b644d4f88381
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Thu Sep 30 09:13:19 2021 +0200

    Group objects per type early to drop the packfile reference asap
    
    Prior to this commit, the loader git would keep the packfile reference and iterate over
    it multiple times per object type. In this commit, we try to drop the packfile reference
    earlier to release that reference as soon as possible.
    
    This should reduce the memory pressure on loading very large repository. Assuming the
    python dict takes less space than the packfile.

See https://jenkins.softwareheritage.org/job/DLDG/job/tests-on-diff/124/ for more details.

Drop the no-longer-needed rewind of the packfile prior to reading it

Build is green

Patch application report for D6377 (id=23206)

Rebasing onto 368674744c...

Current branch diff-target is up to date.
Changes applied before test
commit 1f5fddb0e86162da1cc77b739ff4270fd4c2de51
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Thu Sep 30 09:13:19 2021 +0200

See https://jenkins.softwareheritage.org/job/DLDG/job/tests-on-diff/125/ for more details.

ardumont retitled this revision from "git: Group objects per type early to drop the packfile reference asap" to "wip: git: Group objects per type early to drop the packfile reference asap". Sep 30 2021, 10:11 AM
ardumont edited the test plan for this revision.

I truly doubt that proceeding like this will reduce the memory consumption of the loader, as objects in a pack file are zlib-compressed and usually deltified to optimize size.

Your approach will store all git objects uncompressed in a dict, so it is unlikely to use less memory than the pack file here. Nevertheless, it will improve loading performance.
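
A rough way to see the size gap described in this comment is to compare a payload with its zlib-compressed form. This is only a sketch: the payload below is synthetic and highly repetitive, so the ratio it shows is far better than a real repository would give, and it ignores delta encoding entirely.

```python
import zlib

# Synthetic, highly repetitive payload standing in for uncompressed object data.
raw = b"def load(repo):\n    return repo.fetch()\n" * 1000

# Same bytes after zlib compression.
packed = zlib.compress(raw)

# Round-trip check: decompressing gives back the original bytes.
assert zlib.decompress(packed) == raw

print(f"uncompressed: {len(raw)} bytes, compressed: {len(packed)} bytes")
```

Keeping the decompressed form of every object in a dict means holding the larger of the two sizes in memory at once.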

Yes, I agree. That reduces the I/O, not the memory pressure.
I had updated the test plan to say as much.
I forgot to close this ;)