Stop processing packfiles before sending objects

Description

Ever since its creation, the git loader has processed the packfile
downloaded from the remote repository to build an index of all objects,
then filtered them before sending them on to the storage. Now that this
filtering is implemented as a filter proxy in the storage API itself,
the built-in filtering by the git loader is redundant.
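
As a rough illustration of the filter proxy idea, the sketch below wraps
a storage backend and drops objects the backend already has before
adding the rest. The names InMemoryStorage and FilteringProxy, and the
dict-based content_add signature, are assumptions made for this example
and are not the actual storage API; only the <object_type>_missing
naming pattern comes from the description above.

```python
from typing import Dict, Iterable, List


class InMemoryStorage:
    """Hypothetical stand-in for a storage backend (illustration only)."""

    def __init__(self) -> None:
        self.contents: Dict[str, bytes] = {}
        self.missing_calls = 0  # counts calls to content_missing

    def content_missing(self, ids: Iterable[str]) -> List[str]:
        """Return the ids that the backend does not hold yet."""
        self.missing_calls += 1
        return [i for i in ids if i not in self.contents]

    def content_add(self, contents: Dict[str, bytes]) -> int:
        """Store the given contents and return how many were passed in."""
        self.contents.update(contents)
        return len(contents)


class FilteringProxy:
    """Hypothetical proxy: filters out known objects before adding."""

    def __init__(self, storage: InMemoryStorage) -> None:
        self.storage = storage

    def content_add(self, contents: Dict[str, bytes]) -> int:
        # One *_missing call per add; the caller does no filtering itself.
        missing = set(self.storage.content_missing(list(contents)))
        to_add = {i: data for i, data in contents.items() if i in missing}
        return self.storage.content_add(to_add)
```

With a proxy like this in place, a loader that also filtered on its own
would trigger a second content_missing call for the same objects, which
is the redundancy this change removes.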

The loader's filtering ran through the packfile six times: once for the
basic object id indexing, once to get content ids, then once for each
object type. This change removes the first two runs. Dropping the
duplicate filtering should also reduce the load on the backend storage:
with both the loader and the proxy filtering, the <object_type>_missing
endpoints were called twice.

Finally, because this change removes the global index of objects and
sends the converted objects to the storage as soon as they are read,
memory usage decreases substantially for large loads.
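
The memory effect can be sketched as follows: instead of collecting
every object into an index first, objects are converted lazily and
handed to the storage in bounded batches. The helper names batched and
send_streaming, the convert and send callables, and the batch size are
all hypothetical and chosen for illustration; the loader's actual code
may be organised differently.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List, TypeVar

T = TypeVar("T")
U = TypeVar("U")


def batched(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield lists of at most `size` items, reading lazily from `items`."""
    it = iter(items)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def send_streaming(
    raw_objects: Iterable[T],
    convert: Callable[[T], U],
    send: Callable[[List[U]], None],
    batch_size: int = 1000,
) -> int:
    """Convert and send objects batch by batch; return the number sent.

    At most `batch_size` converted objects are held in memory at once,
    because no global index of all objects is built beforehand.
    """
    sent = 0
    converted = (convert(obj) for obj in raw_objects)  # lazy generator
    for batch in batched(converted, batch_size):
        send(batch)
        sent += len(batch)
    return sent
```

In this sketch, peak memory depends on the batch size rather than on
the total number of objects in the packfile.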

Details

Provenance
olasd authored on Feb 25 2021, 3:42 PM
olasd pushed on Feb 25 2021, 6:46 PM
Differential Revision
D5147: Stop processing packfiles before sending objects
Parents
rDLDG5e434d6f6e1c: Drop unused get_fetch_history_result methods
Branches
Unknown
Tags
Unknown
Build Status
Buildable 19500
Build 30251: test-and-build