
Using an RBD image to store artifacts
Closed, ResolvedPublic

Description

The initial idea is described in this thread https://sympa.inria.fr/sympa/arc/swh-devel/2021-01/msg00026.html
It is further discussed on the ceph user mailing list https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/RHQ5ZCHJISXIXOJSH3TU7DLYVYHRGTAT/
A well-documented blog post from 02/2020, with benchmarks at the scale of a billion objects: https://www.redhat.com/en/blog/scaling-ceph-billion-objects-and-beyond

"We've never managed 100TB+ in a single RBD volume. [...] Otherwise, yes RBD sounds very convenient for what you need."

Packing's obviously a good idea for storing these kinds of artifacts in Ceph, and hacking through the existing librbd might indeed be easier than building something up from raw RADOS, especially if you want to use stuff like rbd-mirror.

Event Timeline

dachary created this object in space S1 Public.

A trivial test case (attached) shows that an RBD image backed by a k=4,m=2 erasure coded pool (RAID6 equivalent) can store 4GB of data using 6GB of disk. The metadata overhead is small. It would be great if someone could repeat the test to make sure I did not obtain these results by accident.

$ venv/bin/ansible -i inventory -a 'du -sh /var/ceph/osd' ceph
ceph4 | CHANGED | rc=0 >>
644K	/var/ceph/osd
ceph2 | CHANGED | rc=0 >>
628K	/var/ceph/osd
ceph1 | CHANGED | rc=0 >>
628K	/var/ceph/osd
ceph5 | CHANGED | rc=0 >>
648K	/var/ceph/osd
ceph3 | CHANGED | rc=0 >>
656K	/var/ceph/osd
ceph6 | CHANGED | rc=0 >>
608K	/var/ceph/osd
$ scp bench.sh debian@10.11.12.211:/tmp/bench.sh ; ssh debian@10.11.12.211 sudo bash /tmp/bench.sh
bench.sh
--- RAW STORAGE ---
CLASS  SIZE     AVAIL    USED    RAW USED  %RAW USED
hdd    600 GiB  594 GiB  18 MiB   6.0 GiB       1.00
TOTAL  600 GiB  594 GiB  18 MiB   6.0 GiB       1.00
 
--- POOLS ---
POOL                   ID  PGS  STORED   OBJECTS  USED     %USED  MAX AVAIL
device_health_metrics   1    1      0 B        0      0 B      0    188 GiB
rbd                     3    4     19 B        1  192 KiB      0    188 GiB
swh                     4   32  2.7 KiB        0   64 KiB      0    376 GiB
+ ceph osd df
ID  CLASS  WEIGHT   REWEIGHT  SIZE     RAW USE  DATA     OMAP  META   AVAIL    %USE  VAR   PGS  STATUS
 4    hdd  0.09769   1.00000  100 GiB  1.0 GiB  2.8 MiB   0 B  1 GiB   99 GiB  1.00  1.00    4      up
 0    hdd  0.09769   1.00000  100 GiB  1.0 GiB  2.9 MiB   0 B  1 GiB   99 GiB  1.00  1.00    6      up
 1    hdd  0.09769   1.00000  100 GiB  1.0 GiB  2.9 MiB   0 B  1 GiB   99 GiB  1.00  1.00    7      up
 2    hdd  0.09769   1.00000  100 GiB  1.0 GiB  3.3 MiB   0 B  1 GiB   99 GiB  1.00  1.00    4      up
 3    hdd  0.09769   1.00000  100 GiB  1.0 GiB  2.9 MiB   0 B  1 GiB   99 GiB  1.00  1.00    6      up
 5    hdd  0.09769   1.00000  100 GiB  1.0 GiB  2.7 MiB   0 B  1 GiB   99 GiB  1.00  1.00    6      up
                       TOTAL  600 GiB  6.0 GiB   17 MiB   0 B  6 GiB  594 GiB  1.00                   
MIN/MAX VAR: 1.00/1.00  STDDEV: 0
+ dd if=/dev/urandom of=/dev/rbd0 count=4096 bs=1024k status=progress
4247781376 bytes (4.2 GB, 4.0 GiB) copied, 36 s, 118 MB/s
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB, 4.0 GiB) copied, 37.153 s, 116 MB/s
+ sleep 60
after ---------------------------------------------------------------------
--- RAW STORAGE ---
CLASS  SIZE     AVAIL    USED     RAW USED  %RAW USED
hdd    600 GiB  588 GiB  6.1 GiB    12 GiB       2.01
TOTAL  600 GiB  588 GiB  6.1 GiB    12 GiB       2.01
 
--- POOLS ---
POOL                   ID  PGS  STORED   OBJECTS  USED     %USED  MAX AVAIL
device_health_metrics   1    1      0 B        0      0 B      0    186 GiB
rbd                     3   32     35 B        4  384 KiB      0    186 GiB
swh                     4   32  4.0 GiB    1.02k  6.1 GiB   1.07    372 GiB
+ ceph osd df
ID  CLASS  WEIGHT   REWEIGHT  SIZE     RAW USE  DATA     OMAP  META   AVAIL    %USE  VAR   PGS  STATUS
 4    hdd  0.09769   1.00000  100 GiB  2.0 GiB  1.0 GiB   0 B  1 GiB   98 GiB  2.01  1.00   47      up
 0    hdd  0.09769   1.00000  100 GiB  2.0 GiB  1.0 GiB   0 B  1 GiB   98 GiB  2.01  1.00   46      up
 1    hdd  0.09769   1.00000  100 GiB  2.0 GiB  1.0 GiB   0 B  1 GiB   98 GiB  2.01  1.00   51      up
 2    hdd  0.09769   1.00000  100 GiB  2.0 GiB  1.0 GiB   0 B  1 GiB   98 GiB  2.01  1.00   47      up
 3    hdd  0.09769   1.00000  100 GiB  2.0 GiB  1.0 GiB   0 B  1 GiB   98 GiB  2.01  1.00   52      up
 5    hdd  0.09769   1.00000  100 GiB  2.0 GiB  1.0 GiB   0 B  1 GiB   98 GiB  2.01  1.00   48      up
                       TOTAL  600 GiB   12 GiB  6.1 GiB   0 B  6 GiB  588 GiB  2.01                   
MIN/MAX VAR: 1.00/1.00  STDDEV: 0
$ venv/bin/ansible -i inventory -a 'du -sh /var/ceph/osd' ceph
ceph2 | CHANGED | rc=0 >>
1.1G	/var/ceph/osd
ceph5 | CHANGED | rc=0 >>
1.1G	/var/ceph/osd
ceph1 | CHANGED | rc=0 >>
1.1G	/var/ceph/osd
ceph3 | CHANGED | rc=0 >>
1.1G	/var/ceph/osd
ceph4 | CHANGED | rc=0 >>
1.1G	/var/ceph/osd
ceph6 | CHANGED | rc=0 >>
1.1G	/var/ceph/osd
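
The ratio in the output above (4.0 GiB stored, 6.1 GiB used in the swh pool) matches what erasure coding arithmetic predicts: with k data chunks and m coding chunks, each stored byte costs (k+m)/k raw bytes. A minimal sketch of that calculation follows; the function name is illustrative, and it deliberately ignores per-object metadata and allocation padding.

```python
GiB = 1024 ** 3


def ec_raw_usage(stored_bytes: float, k: int, m: int) -> float:
    """Raw bytes consumed when stored_bytes are split into k data chunks
    and protected by m coding chunks (metadata overhead not included)."""
    if k <= 0 or m < 0:
        raise ValueError("k must be positive and m non-negative")
    return stored_bytes * (k + m) / k


raw = ec_raw_usage(4 * GiB, k=4, m=2)
print(raw / GiB)  # 6.0
```

For comparison, the same 4 GiB kept as three full replicas would occupy 12 GiB of raw space.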
zack triaged this task as Normal priority.Feb 3 2021, 3:23 PM

Benchmarking S3 in Ceph with COSBench could be interesting (the video is not yet available). In the past COSBench was difficult to use, but it may have improved. This is off-topic, but I don't know where else to write it down at the moment.

For the record, today's IRC log:

<dachary> Yeah, an infinitely growing volume was definitely not a good idea :-)
<zack> i'm still not clear about where you plan to store the index though (but i've already asked that on list, so i can wait for feedback there)
<dachary> Yes.
<zack> douardda: oh, pixz (https://github.com/vasi/pixz) from the ceph list thread, would be interesting for what i mentioned on list for the Vitam compression
<zack> although something like that but based on zstd would make me even happier :-P
<dachary> I'm having doubts about using RGW vs using RBD. Because both provide packing. I suppose RGW overhead will be higher but that deserves benchmarking. And if RGW wins... the index problem is solved.
<zack> anyway, great thread on the ceph list, thanks for starting/nurturing it
<dachary> I wonder if it would be practical to use tar or another format dedicated to archiving. A 1TB tar is unusual though. And they are probably not fit for quickly accessing a file given its name. They are most likely designed for sequential extraction and not random access.

<dachary> simple but still too complicated for Software Heritage
<dachary> there is no need for directories, nor xattr, just flat index => content
<dachary> "cold storage" may be good keywords

<olasd> cold storage often means compromising on latency
<dachary> indeed
<dachary> ltfs assumes sequential access media which is not what Software Heritage has. It is off topic but I'm amazed that it looks like something very well maintained and lively (last version of the spec is dated august 2020).
<dachary> olasd: I can't help but think there already exists a well known format to store content addressable archives in the simple way (key => offset,size + data). Rocksdb is the simplest I can think of but it could be a lot simpler.
<dachary> it's nothing more but storing a hash table really
<dachary> a read only hash table
<dachary> I distinctly feel I'm missing something :-)
<dachary> an equivalent of https://en.wikipedia.org/wiki/Berkeley_DB but with hash maps & no collisions instead of b-tree
<dachary> maybe "readonly key-value store" is a better set of keywords
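
The "read only hash table" idea discussed above (key => offset,size + data) can be sketched in a few lines. This is only an illustration under stated assumptions: the blob layout, the `build` and `lookup` names, and the use of linear probing are all hypothetical choices made for the example. In particular it uses ordinary open addressing rather than the collision-free hashing wished for in the log, and it assumes no content hashes to an all-zero SHA256 digest.

```python
import hashlib
import struct
from typing import Optional

SLOT = struct.Struct("<32sQQ")  # key (SHA256 digest), data offset, data size
EMPTY = bytes(32)               # an all-zero key marks an unused slot


def build(contents: list[bytes]) -> bytes:
    """Pack contents into one immutable blob: slot count, slot table, data."""
    nslots = max(1, 2 * len(contents))  # at most half full, so probing ends
    slots = [(EMPTY, 0, 0)] * nslots
    data = bytearray()
    for content in contents:
        key = hashlib.sha256(content).digest()
        i = int.from_bytes(key[:8], "little") % nslots
        while slots[i][0] not in (EMPTY, key):  # linear probing
            i = (i + 1) % nslots
        if slots[i][0] == EMPTY:  # identical content is stored only once
            slots[i] = (key, len(data), len(content))
            data += content
    table = b"".join(SLOT.pack(*s) for s in slots)
    return struct.pack("<Q", nslots) + table + bytes(data)


def lookup(blob: bytes, key: bytes) -> Optional[bytes]:
    """Return the content stored under key, or None if it is absent."""
    (nslots,) = struct.unpack_from("<Q", blob, 0)
    data_start = 8 + nslots * SLOT.size
    i = int.from_bytes(key[:8], "little") % nslots
    for _ in range(nslots):
        slot_key, offset, size = SLOT.unpack_from(blob, 8 + i * SLOT.size)
        if slot_key == EMPTY:
            return None
        if slot_key == key:
            start = data_start + offset
            return blob[start:start + size]
        i = (i + 1) % nslots
    return None


blob = build([b"hello", b"world", b"hello"])
print(lookup(blob, hashlib.sha256(b"world").digest()))  # b'world'
```

Because the blob is never modified after `build`, there is no need for deletion markers, resizing or locking, which is what makes a read-only format so much simpler than a general key-value store.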

<dachary> https://dbmx.net/tkrzw/#hashdbm_overview is close enough
<dachary> but not quite right
<dachary> https://en.wikipedia.org/wiki/Cdb_(software) even closer but still no focus on readonly storage
<dachary> http://www.unixuser.org/~euske/doc/cdbinternals/index.html
<dachary> https://css.csail.mit.edu/6.888/2015/papers/swang93.pdf describes something close to what is desired but in a slightly different context

For the record, yesterday's IRC log:

<zack> dachary: i wonder if "append-only storage" could be a better/alternative/complementary search keyword
<dachary> +1
<dachary> https://en.wikipedia.org/wiki/Append-only points to https://en.wikipedia.org/wiki/Log-structured_merge-tree which loops back to https://en.wikipedia.org/wiki/RocksDB
<dachary> It could be interesting to have rocksdb where each sst file (i.e. the rocksdb unit of storage) is a 1TB RBD image. level 0 is for insertion and sorting and there would only be level 1 and it would be read-only.
<dachary> however... when the level 0 is full, it will be merged in level 1 and will modify all underlying 1TB images: that's not read only
<dachary> the ideal implementation would be a library that maintains an index pointing to each object with SHA256 => id of the 1TB image,offset,size. Assuming the index is sorted by SHA256, looking for an object is O(logN)
<dachary> the library would implement writing by appending in a journal (SHA256 + content), then an update of the index, then a write to the actual 1TB device, then discarding the entry from the journal
<dachary> it bothers me that we can't find anything addressing this particular use case
<dachary> the discussion on the Ceph mailing list tells me we're not missing something that everybody knows about, that much is good :-)
<dachary> I have a good feeling about having a self contained 1TB RBD image. Self contained in the sense that it would include the index mapping SHA256 to the object within the 1TB RBD image as well as the data. There would be no metadata at all but it would be a storage unit. From a collection of 1TB containers, it would be possible to build a global index to speed up search from (number of containers)*O(logN) to O(logN). But the global index could be rebuilt from scratch, it would not be something precious that must not be lost.
<dachary> If I'm not mistaken, the Software Heritage database contains the SWHID of each object, hence the signature of the object as well as metadata (maybe the size). Would it be possible to add the index of container holding the data as well ? 16 bits more would address 65PB.
<dachary> A self contained 1TB container could be built by appending (size + SHA + content) to the container. Once it's full, it can be rebuilt with a sorted index of the SHA1 => offset,size starting at 1TB and the content starting at 0. It is then frozen/immutable. The index can then be read to update a global index (which could be the SWH database).
<dachary> This immutable 1TB container can then conveniently be mirrored elsewhere, with rbd-mirror, borg, dd or whatever. It can be signed and verified. Then there is the question of maintaining and curating an inventory of the 1TB images floating around, and which ones are trusted by whom.
<dachary> 1TB is a good container size. Anyone can store that on an external disk. Even small organizations could contribute 1TB to SWH by keeping a copy. And it is manageable: a thousand copies (which is not that much) is 1PB (which is 3 times the current size).
<dachary> By comparison, if objects are stored in an object storage, there are billions of small objects and nothing else. No object storage can enable someone to mirror "1TB worth of data" with off-the-shelf tools.
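
The self-contained container described in the log (content first, then a sorted index, then frozen) can be sketched as follows. The exact layout is an assumption made for the example: a fixed-size footer recording where the index starts, SHA256 keys, and the `freeze` and `find` names are all hypothetical, and the container is held in memory as bytes instead of living on an RBD image.

```python
import hashlib
import struct
from typing import Optional

ENTRY = struct.Struct(">32sQQ")  # SHA256 digest, content offset, content size
FOOTER = struct.Struct(">QQ")    # index offset, number of index entries


def freeze(contents: list[bytes]) -> bytes:
    """Build an immutable container: contents, sorted index, footer."""
    data = bytearray()
    index = {}
    for content in contents:
        key = hashlib.sha256(content).digest()
        if key not in index:  # identical content is stored only once
            index[key] = (len(data), len(content))
            data += content
    entries = b"".join(ENTRY.pack(k, *index[k]) for k in sorted(index))
    return bytes(data) + entries + FOOTER.pack(len(data), len(index))


def find(container: bytes, key: bytes) -> Optional[bytes]:
    """Binary search the sorted index: O(log N) index reads per lookup."""
    index_offset, count = FOOTER.unpack_from(
        container, len(container) - FOOTER.size
    )
    lo, hi = 0, count
    while lo < hi:
        mid = (lo + hi) // 2
        k, offset, size = ENTRY.unpack_from(
            container, index_offset + mid * ENTRY.size
        )
        if k == key:
            return container[offset:offset + size]
        if k < key:
            lo = mid + 1
        else:
            hi = mid
    return None


container = freeze([b"a", b"bb", b"a", b"ccc"])
print(find(container, hashlib.sha256(b"bb").digest()))  # b'bb'
```

Since the index travels with the data, a global index mapping each SHA256 to a container id can be rebuilt at any time by reading the footer and index of every container. With a 16-bit container id and 1TB containers, that addresses 2^16 x 1TB = 65,536TB, the 65PB mentioned in the log.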

https://github.com/vasi/pixz is a candidate for the 1TB archive content

The existing XZ Utils provide great compression in the .xz file format, but they produce just one big block of compressed data. Pixz instead produces a collection of smaller blocks which makes random access to the original data possible. This is especially useful for large tarballs.

Very quickly extract a single file, multi-core, also verifies that contents match index:
pixz -x dir/file < foo.tpxz | tar x
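
The principle quoted above, independent compressed blocks plus an index, can be illustrated with Python's standard lzma module. This is a sketch of the general technique only, not of the pixz file format: the block size, the in-memory index and the `compress_blocks` and `read_range` names are assumptions made for the example.

```python
import lzma

BLOCK_SIZE = 64 * 1024  # uncompressed bytes per block (illustrative value)


def compress_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Compress data as independent xz streams; return (blob, index).

    index[i] is the (offset, length) of compressed block i within blob.
    """
    blob = bytearray()
    index = []
    for start in range(0, len(data), block_size):
        block = lzma.compress(data[start:start + block_size])
        index.append((len(blob), len(block)))
        blob += block
    return bytes(blob), index


def read_range(blob, index, offset, size, block_size=BLOCK_SIZE):
    """Return data[offset:offset + size], decompressing only the blocks
    that the requested range overlaps."""
    if size <= 0:
        return b""
    first = offset // block_size
    last = (offset + size - 1) // block_size
    out = bytearray()
    for i in range(first, min(last, len(index) - 1) + 1):
        start, length = index[i]
        out += lzma.decompress(blob[start:start + length])
    skip = offset - first * block_size
    return bytes(out[skip:skip + size])


data = bytes(range(256)) * 1000
blob, index = compress_blocks(data, 4096)
print(read_range(blob, index, 5000, 16, 4096) == data[5000:5016])  # True
```

Smaller blocks make random reads cheaper but usually cost some compression ratio, because each block is compressed without knowledge of the others.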

https://news.ycombinator.com/item?id=17085391 discussion about random access and archive formats

This comment was removed by dachary.
dachary changed the task status from Open to Work in Progress.Feb 15 2021, 2:13 PM
dachary claimed this task.

This preliminary exploration is complete; the work now moves on to benchmarking to discover blockers.

One concern was not addressed: the metadata does not scale out, because it is a single RocksDB database.