
Using git to store objects
Closed, Wontfix (Public)

Description

Robin H. Johnson wrote:

To brainstorm parts of an idea, I'm wondering about Git's
still-in-development partial clone work, with the caveat that you intend
to NEVER check out the entire repository at the same time.

Ideally, using some manner of FUSE filesystem (similar to Git Virtual
Filesystem) with an index-only clone, naive clients could access the
object they wanted. It would be fetched on demand from the git server,
which holds mostly git packs and a few sparse objects that are waiting
for packing.
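
To make the on-demand access idea concrete, here is a minimal sketch. It is an illustration only, not something from the original message. It assumes a git client with partial clone support and a server that allows filtered fetches (`uploadpack.allowFilter`). The helper names `partial_clone_cmd`, `read_object_cmd` and `run` are hypothetical. The commands are built as lists so they can be inspected before anything is executed.

```python
import subprocess


def partial_clone_cmd(url, dest):
    """Build a clone command that fetches commits and trees but no blobs.

    --filter=blob:none asks the server to omit file contents, and
    --no-checkout avoids materialising a working tree (the "never check
    out the entire repository" caveat).
    """
    return ["git", "clone", "--filter=blob:none", "--no-checkout", url, dest]


def read_object_cmd(repo, oid):
    """Build a command that prints a single object by ID.

    In a partial clone, git fetches a missing object from the remote
    on demand when it is first read.
    """
    return ["git", "-C", repo, "cat-file", "-p", oid]


def run(cmd):
    """Execute a command built above and return its raw stdout."""
    return subprocess.run(cmd, check=True, capture_output=True).stdout
```

A client would call `run(partial_clone_cmd(url, dest))` once, then `run(read_object_cmd(dest, oid))` for each object it needs; a FUSE layer, as suggested above, could wrap the second call.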

The write path on ingest clients would involve sending back the new
data, and git background processes on some regular interval packing the
loose objects into new packfiles.

Running this on top of CephFS for now means that you get the ability to
move it to future storage systems more easily than any custom RBD/EOS
development you might do: bring up enough space, sync the files over,
profit.

Git handles the deduplication, compression, access methods, and
generates large pack files, which Ceph can store more optimally than the
plethora of tiny objects.
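
As a rough illustration of the deduplication point (again, not part of the original message): git names a blob by the SHA-1 of a short header followed by the content, so identical content always maps to the same object ID and is stored once. The `LooseStore` class below is a hypothetical in-memory stand-in for an object store, not git's actual storage code.

```python
import hashlib


def git_blob_id(data: bytes) -> str:
    """Return the object ID git assigns to a blob with this content (SHA-1 repositories)."""
    header = b"blob %d\0" % len(data)
    return hashlib.sha1(header + data).hexdigest()


class LooseStore:
    """Toy content-addressed store: one entry per distinct content."""

    def __init__(self):
        self.objects = {}

    def put(self, data: bytes) -> str:
        oid = git_blob_id(data)
        # Storing the same content again is a no-op: the key already exists.
        self.objects.setdefault(oid, data)
        return oid


store = LooseStore()
first = store.put(b"hello world\n")
second = store.put(b"hello world\n")
# Both calls return the same ID and the store holds a single object.
```

Compression and packing are separate steps in git; this sketch covers only the content addressing that makes deduplication automatic.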

[snip]

Being able to take a backup of the Git-on-CephFS is also made a lot
easier sin

Event Timeline

dachary changed the task status from Open to Work in Progress. (Feb 20 2021, 1:59 PM)
dachary triaged this task as Normal priority.
dachary created this task.
dachary created this object in space S1 Public.
dachary changed the task status from Work in Progress to Open. (Feb 22 2021, 12:17 AM)

While this is very creative, there is no benefit in storing small objects in git for the Software Heritage workload.