To brainstorm parts of an idea, I'm wondering about Git's
still-in-development partial clone work, with the caveat that you intend
to NEVER check out the entire repository at the same time.
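
As a rough sketch of what such a clone could look like from the client
side (untested against a real server; the helper names are mine, and it
assumes the server permits object filters, e.g. via
uploadpack.allowFilter), the following Python only builds the git
command lines. --filter=blob:none asks for a clone without any blobs,
--no-checkout avoids populating a work tree, and reading a single path
with cat-file should make Git fetch just that one missing blob:

```python
import subprocess
from typing import List


def partial_clone_cmd(url: str, dest: str) -> List[str]:
    """Command for a blob-less clone that never checks out a work tree."""
    return [
        "git", "clone",
        "--filter=blob:none",  # partial clone: skip all blobs up front
        "--no-checkout",       # do not populate the whole working tree
        url, dest,
    ]


def fetch_blob_cmd(repo: str, path: str) -> List[str]:
    """Command that reads one file at HEAD; in a partial clone a missing
    blob is fetched on demand from the remote."""
    return ["git", "-C", repo, "cat-file", "blob", f"HEAD:{path}"]


def run(cmd: List[str]) -> bytes:
    """Hypothetical helper: run a command and return its stdout."""
    return subprocess.run(cmd, check=True, stdout=subprocess.PIPE).stdout
```

Something like run(partial_clone_cmd(url, "repo")) followed by
run(fetch_blob_cmd("repo", "some/file")) would then pull a single object
without ever materialising the rest of the repository.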
Ideally, using some manner of FUSE filesystem (similar to Git Virtual
Filesystem) w/ an index-only clone, naive clients could access the
object they wanted, which would be fetched on demand from the Git server
which has mostly git packs and a few sparse objects that are waiting for
The write path on ingest clients would involve sending back the new
data, with background Git processes packing the loose objects into new
packfiles at some regular interval.
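
One way the background packing could be driven, purely as a sketch: poll
`git count-objects -v`, and once the number of loose objects passes some
threshold, run `git repack -d`, which packs the loose objects into a new
packfile and removes the now-redundant loose copies. The threshold and
the function names below are my own assumptions, not anything Git
prescribes:

```python
from typing import Dict, List


def parse_count_objects(output: str) -> Dict[str, int]:
    """Parse the 'key: value' lines printed by `git count-objects -v`."""
    stats: Dict[str, int] = {}
    for line in output.splitlines():
        key, sep, value = line.partition(":")
        value = value.strip()
        if sep and value.isdigit():
            stats[key.strip()] = int(value)
    return stats


def should_repack(stats: Dict[str, int], max_loose: int = 1000) -> bool:
    """'count' is the number of loose objects; max_loose is an assumed
    tuning knob."""
    return stats.get("count", 0) > max_loose


def repack_cmd(repo: str) -> List[str]:
    """Pack loose objects into a new packfile and drop redundant ones."""
    return ["git", "-C", repo, "repack", "-d"]
```

A cron job or a small loop on the server could feed the output of
count-objects into parse_count_objects() and run repack_cmd() whenever
should_repack() says so.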
Running this on top of CephFS for now means that you get the ability to
move it to future storage systems more easily than any custom RBD/EOS
development you might do: bring up enough space, sync the files over,
Git handles the deduplication, compression, and access methods, and
generates large packfiles, which Ceph can store more efficiently than a
plethora of tiny objects.
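
To illustrate where the deduplication comes from: Git names each blob by
a hash of its content (with the default SHA-1 object format, the SHA-1
of the header "blob <size>\0" followed by the data), so identical
content always maps to the same object. A minimal sketch, where the
BlobStore class is just a hypothetical stand-in for the object database:

```python
import hashlib
from typing import Dict


def git_blob_id(data: bytes) -> str:
    """Object ID Git assigns to a blob (SHA-1 object format assumed)."""
    header = b"blob %d\x00" % len(data)
    return hashlib.sha1(header + data).hexdigest()


class BlobStore:
    """Toy content-addressed store: identical content is kept once."""

    def __init__(self) -> None:
        self.objects: Dict[str, bytes] = {}

    def put(self, data: bytes) -> str:
        oid = git_blob_id(data)
        self.objects.setdefault(oid, data)  # no-op if already stored
        return oid
```

Ingesting the same file twice through put() yields the same ID and only
one stored copy, which is the behaviour the storage layer gets for free.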
Being able to take a backup of the Git-on-CephFS is also made a lot