We currently use the git fast-import format to allow people to retrieve revisions from the archive using the vault. However, this is wildly inefficient and generally a bad idea for several reasons:
- It's a poorly documented format (man git-fast-import), non-trivial to understand, and generally niche, which makes development hard
- It's lossy. It doesn't always handle signed tags or tagged trees properly (see man git-fast-export -> /LIMITATIONS), and there is no guarantee that re-importing the output produces the same object hashes.
- The output is extremely large. The format was designed as a trivial in-memory interchange stream piped between a VCS and a git fast-import process; no consideration was given to space efficiency. All the objects are deduplicated, but there is no delta encoding.
- Exporting it from the archive is expensive. Only the files modified by each commit are exported, so finding out which files were modified requires diffing *all* the commit trees against their parents, which is stupidly expensive and virtually impossible to parallelize.
A better option would be to create a "git bare" cooker: a bare Git repository (= with no working directory) where we write all the git objects directly into its objects/ directory. This is very fast and easily parallelizable on our side, and we can recompress all the objects together before caching the result by running git repack.
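As a rough illustration, here is a minimal Python sketch of what such a cooker could look like: create a bare repository, write loose objects straight into its objects/ directory, then repack everything at the end. The function names are made up for the example; the loose-object encoding ("<type> <size>\0<content>", zlib-compressed, stored under objects/<2-char prefix>/<38-char suffix>) and the git invocations are standard.

```python
import hashlib
import os
import subprocess
import zlib


def init_bare_repo(path: str) -> None:
    # A bare repository has no working directory; its objects live directly
    # under <path>/objects, not under <path>/.git/objects.
    subprocess.run(["git", "init", "--bare", path], check=True)


def write_loose_object(repo: str, obj_type: str, content: bytes) -> str:
    # Standard loose-object encoding: "<type> <size>\0" header followed by the
    # raw content, zlib-compressed, stored under the object id.
    data = f"{obj_type} {len(content)}\0".encode() + content
    sha1 = hashlib.sha1(data).hexdigest()
    obj_dir = os.path.join(repo, "objects", sha1[:2])
    obj_path = os.path.join(obj_dir, sha1[2:])
    if not os.path.exists(obj_path):  # objects are immutable, skip duplicates
        os.makedirs(obj_dir, exist_ok=True)
        with open(obj_path, "wb") as f:
            f.write(zlib.compress(data))
    return sha1


def repack(repo: str) -> None:
    # Recompress all loose objects into a single delta-encoded packfile
    # before the repository gets cached.
    subprocess.run(["git", "-C", repo, "repack", "-a", "-d"], check=True)
```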
Once we have this bare repository, we could just create a tarball of it and cache that. Another option would be to investigate the git bundle format, which apparently serves a similar purpose but could be simpler for users to import.
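For comparison, here is a hedged sketch of both packaging options, assuming "repo" is the bare repository produced by the cooker; the output file names are purely illustrative.

```python
import os
import subprocess
import tarfile


def cook_tarball(repo: str, output: str) -> None:
    # Option 1: tar up the bare repository as-is; users extract it and
    # clone from the extracted directory.
    with tarfile.open(output, "w:gz") as tar:
        tar.add(repo, arcname="repo.git")


def cook_bundle(repo: str, output: str) -> None:
    # Option 2: a git bundle, a single file that users can clone or fetch
    # from directly (e.g. "git clone revision.bundle my-repo").
    subprocess.run(
        ["git", "-C", repo, "bundle", "create", os.path.abspath(output), "--all"],
        check=True,
    )
```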