
Vault: Add a "git bare" tarball cooker
Closed, Resolved (Public)

Description

We currently use the git fast-import format to allow people to retrieve revisions from the archive using the vault. However, this is wildly inefficient and generally a bad idea for several reasons:

  • It's a poorly documented format (man git-fast-import), non-trivial to understand, and generally niche, which makes development hard
  • It's lossy. It doesn't always handle signed tags or tagged trees properly (see man git-fast-export -> /LIMITATIONS), so there is no guarantee that the imported objects end up with the same hashes as the originals.
  • The output is extremely large. The format was designed to be trivial to use as an in-memory piped interchange between a VCS and a git fast-import process; no consideration was given to space utilization. All the objects are deduplicated with no delta encoding.
  • Exporting it from the archive is expensive. Only modified files are exported at each commit, so finding out which files were modified requires diffing *all* the commit trees with their parents, which is stupidly expensive and virtually impossible to parallelize.
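
To illustrate the last point, here is a minimal sketch of the per-commit diffing that a fast-import export implies. It is not the vault's code: trees are modelled as flat path-to-blob-id dictionaries, history is linear, and the names `changed_paths` and `export_plan` are hypothetical.

```python
from typing import Dict, List, Optional, Set, Tuple

# Hypothetical model: a tree is a flat mapping of path -> blob id.
Tree = Dict[str, str]


def changed_paths(parent: Optional[Tree], tree: Tree) -> Set[str]:
    """Return the paths added, modified or deleted relative to the parent."""
    parent = parent or {}
    added_or_modified = {p for p, oid in tree.items() if parent.get(p) != oid}
    deleted = set(parent) - set(tree)
    return added_or_modified | deleted


def export_plan(history: List[Tuple[str, Optional[str], Tree]]) -> Dict[str, Set[str]]:
    """Compute what each commit would have to export.

    `history` lists (commit id, parent id or None, tree), parents first.
    Every commit needs its parent's tree to be available, so the whole
    history has to be walked and diffed, one commit at a time.
    """
    trees: Dict[str, Tree] = {}
    plan: Dict[str, Set[str]] = {}
    for commit_id, parent_id, tree in history:
        trees[commit_id] = tree
        parent_tree = trees.get(parent_id) if parent_id is not None else None
        plan[commit_id] = changed_paths(parent_tree, tree)
    return plan
```

In the real archive each tree is a nested directory object that has to be fetched from storage, so every one of these comparisons is far more costly than a dictionary lookup.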

A better option would be to create a "git bare" cooker: a bare Git repository (= with no working directory) where we put all the git objects in .git/objects directly. This is very fast and easily parallelizable on our side, and we can recompress all the objects together before caching the bundle by calling git repack.
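
A rough sketch of what such a cooker could do, assuming objects are written in git's loose-object format and repacked afterwards. The function names and the repository layout used here are illustrative assumptions, not the vault's actual implementation.

```python
import hashlib
import subprocess
import zlib
from pathlib import Path


def init_bare_layout(repo: Path) -> None:
    """Create a minimal bare-repository layout (assumed, not exhaustive)."""
    (repo / "objects").mkdir(parents=True, exist_ok=True)
    (repo / "refs" / "heads").mkdir(parents=True, exist_ok=True)
    # git needs a HEAD file to recognize the directory as a repository.
    (repo / "HEAD").write_text("ref: refs/heads/master\n")


def write_loose_object(repo: Path, obj_type: str, payload: bytes) -> str:
    """Store one object in loose format and return its hex SHA-1 id."""
    # A git object is "<type> <size>\0<payload>"; its id is the SHA-1 of that.
    raw = f"{obj_type} {len(payload)}".encode() + b"\0" + payload
    obj_id = hashlib.sha1(raw).hexdigest()
    path = repo / "objects" / obj_id[:2] / obj_id[2:]
    if not path.exists():
        path.parent.mkdir(exist_ok=True)
        path.write_bytes(zlib.compress(raw))
    return obj_id


def repack(repo: Path) -> None:
    """Pack the objects reachable from refs into one delta-compressed pack.

    Requires the git CLI; refs pointing at the written objects must
    already exist, otherwise there is nothing reachable to pack.
    """
    subprocess.run(["git", "-C", str(repo), "repack", "-a", "-d"], check=True)
```

Since each object is written to a path derived only from its own id, the writes are independent of each other and can be spread across workers.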

Once we have this bare repository, we could just create a tarball of it and cache this. Another option would be to investigate the git bundle format, which apparently serves a similar purpose, but could be simpler to import for the users.
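
Both packaging options could look roughly like this. It is a sketch under assumptions: the helper names are hypothetical, and the bundle variant requires the git CLI and a repository that has at least one ref.

```python
import subprocess
import tarfile
from pathlib import Path


def make_tarball(repo: Path, out: Path) -> None:
    """Archive the bare repository directory as a gzip-compressed tarball."""
    with tarfile.open(str(out), "w:gz") as tar:
        # Store entries under the repository's own directory name.
        tar.add(str(repo), arcname=repo.name)


def make_bundle(repo: Path, out: Path) -> None:
    """Alternative: write every ref and its objects into one git bundle file."""
    subprocess.run(
        ["git", "-C", str(repo), "bundle", "create", str(out.resolve()), "--all"],
        check=True,
    )
```

A user would unpack the tarball and clone from the extracted directory, whereas a bundle file can be passed to git clone directly.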

Revisions and Commits

rDSTO Storage manager
rDVAU Software Heritage Vault
D5658
rDMOD Data model
D5652
D5650

Event Timeline

seirl triaged this task as Wishlist priority. Nov 7 2017, 7:06 PM
seirl created this task.

And what would be the actual on-the-wire serialization format used to send the bare repo to users? Some sort of archive of the .git dir or what?

Maybe git-bundle(1) is an even better idea. I'm not sure if the format is documented somewhere though.

seirl removed seirl as the assignee of this task. Mar 22 2019, 1:29 PM
seirl raised the priority of this task from Wishlist to Normal. Apr 27 2021, 10:56 PM

done, and released as swh-vault v1.0.0 :)

Not yet publicly available; this is tracked by this task's parent task.