
Spool large packfiles to disk instead of consuming tons of memory
Accepted · Public

Authored by olasd on Fri, Apr 30, 8:25 PM.

Details

Reviewers
zack
Group Reviewers
Reviewers
Summary

This lowers memory consumption by writing packfiles above a given size
threshold to disk. This reduces memory pressure on workers (at the cost of more
disk churn), and also allows the git loader to run on more memory-constrained systems.

As we hold a single temporary file open, we can rely on the default Python
tempfile behavior, which unlinks the temporary file right after creating it.
The file is therefore reaped as soon as the process disappears, even if the
process gets killed. This avoids the need for any manual tempfile cleanup.
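
The approach described above can be sketched roughly as follows. This is a minimal illustration, not the code in this diff: it assumes the spooling is done with the standard library's tempfile.SpooledTemporaryFile, and the names fetch_pack_to_buffer and SPOOL_CUTOFF, as well as the cutoff value, are hypothetical.

```python
import tempfile

# Hypothetical cutoff for this sketch; the real threshold is not stated above.
SPOOL_CUTOFF = 1024  # bytes


def fetch_pack_to_buffer(chunks, cutoff=SPOOL_CUTOFF):
    """Collect pack data in memory, spilling to disk once it exceeds cutoff.

    `chunks` is any iterable of bytes objects (a stand-in for the data
    received from the remote repository).
    """
    # Data stays in an in-memory buffer until more than `cutoff` bytes have
    # been written; it is then moved to a tempfile.TemporaryFile on disk.
    buf = tempfile.SpooledTemporaryFile(max_size=cutoff)
    for chunk in chunks:
        buf.write(chunk)
    # Rewind so the caller can read the pack from the beginning.
    buf.seek(0)
    return buf
```

On POSIX systems the on-disk file created at rollover has no directory entry, so the kernel frees its space once the last file descriptor is closed, including when the process is killed. Callers read from the returned object the same way whether or not it was spilled to disk.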

Test Plan

This has been exercised on large repositories (e.g. linux.git yields
a packfile that is almost 4GiB).

Diff Detail

Repository
rDLDG Git loader
Branch
master
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 21200
Build 32909: Phabricator diff pipeline on jenkins (Jenkins console · Jenkins)
Build 32908: arc lint + arc unit

Event Timeline

Build is green

Patch application report for D5657 (id=20210)

Rebasing onto 15e12fae18...

First, rewinding head to replay your work on top of it...
Applying: Spool large packfiles to disk instead of consuming tons of memory
Changes applied before test
commit 39692e66ded1bed94c530074455999366e7d2613
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Mon Apr 26 20:57:36 2021 +0200

    Spool large packfiles to disk instead of consuming tons of memory

See https://jenkins.softwareheritage.org/job/DLDG/job/tests-on-diff/99/ for more details.

olasd requested review of this revision. Fri, Apr 30, 8:28 PM
zack added a subscriber: zack.

nice hack/trade-off !

This revision is now accepted and ready to land. Fri, Apr 30, 8:39 PM