"Low-level" here means anything below the Software Heritage Python code.
For all intents and purposes, this task focuses on the main object storage archive.
Software Heritage uses various servers with local storage to keep internal state for its various internal components.
The bulk of the storage is dedicated to what we call "the object storage" or "the archive".
The object storage is a big pool of small binary blocks (deduplicated and gzipped text content) stored in files whose individual sizes typically range from a few hundred bytes to a few kilobytes.
As of August 2019, it is entirely stored on a proprietary Dell MD3460 storage bay linked to the uffizi server by SAS 12Gb/s cables.
This primary bay is located in building #30 of the Rocquencourt campus.
A secondary bay (same model, same number of disks, same disk models) is linked to the banco server and was initially used to store a replica of the data contained on the first bay.
That secondary bay is located in building #9 of the Rocquencourt campus.
[on the bay]
[on uffizi, previously louvre]
That unified RAID0 volume is then split into 16x ~= 12TB volumes and 1x ~= 100TB volume:
├─vg--data-uffizi--data0 254:67 0 12.2T 0 lvm /srv/storage/0
├─vg--data-uffizi--data1 254:68 0 12.2T 0 lvm /srv/storage/1
├─vg--data-uffizi--data2 254:69 0 12T   0 lvm /srv/storage/2
├─vg--data-uffizi--data3 254:70 0 11.8T 0 lvm /srv/storage/3
├─vg--data-uffizi--data4 254:71 0 12.2T 0 lvm /srv/storage/4
├─vg--data-uffizi--data5 254:72 0 12.2T 0 lvm /srv/storage/5
├─vg--data-uffizi--data6 254:73 0 12.2T 0 lvm /srv/storage/6
├─vg--data-uffizi--data7 254:74 0 12.2T 0 lvm /srv/storage/7
├─vg--data-uffizi--data8 254:75 0 12.2T 0 lvm /srv/storage/8
├─vg--data-uffizi--data9 254:76 0 12.2T 0 lvm /srv/storage/9
├─vg--data-uffizi--dataa 254:77 0 12.2T 0 lvm /srv/storage/a
├─vg--data-uffizi--datab 254:78 0 12.2T 0 lvm /srv/storage/b
├─vg--data-uffizi--datac 254:79 0 12.2T 0 lvm /srv/storage/c
├─vg--data-uffizi--datad 254:80 0 12.2T 0 lvm /srv/storage/d
├─vg--data-uffizi--datae 254:81 0 12.2T 0 lvm /srv/storage/e
├─vg--data-uffizi--dataf 254:82 0 12.2T 0 lvm /srv/storage/f
└─vg--data-uffizi--space 254:83 0 100T  0 lvm /srv/storage/space
Only one machine, 192.168.100.31 (moma) mounts uffizi:/srv/softwareheritage/objects to /srv/softwareheritage/objects at this time.
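For illustration, the NFS mount on moma could be expressed in /etc/fstab with an entry of this shape (a sketch only; the actual mount options used in production are not recorded here and are assumptions):

```
uffizi:/srv/softwareheritage/objects  /srv/softwareheritage/objects  nfs  defaults  0  0
```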
The lowest levels of storage are the same as on the main bay.
The bays themselves are exactly the same models with the same kinds of disks.
One of the biggest differences is that multipath appears to actually be in use.
[on banco]
The unified RAID0 volume is split into 16x ~= 10TB volumes and a few additional bigger ones (barman and space).
As on the uffizi storage bay, the 16x ~= 10TB volumes are mounted in /srv/storage/0 to /srv/storage/f and contain xfs filesystems.
Subdirectories 00 to ff in these filesystems are then mounted with bind mounts into /srv/softwareheritage/objects/00 to /srv/softwareheritage/objects/ff, once again replicating the configuration of the primary uffizi bay.
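One of these bind mounts could be expressed in /etc/fstab with an entry of this shape (a sketch following the layout described above, not the actual provisioning configuration; the example uses the first-level directory "a0" of the filesystem mounted at /srv/storage/a):

```
# bind-mount one first-level object directory into the unified objects tree
/srv/storage/a/a0  /srv/softwareheritage/objects/a0  none  bind  0  0
```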
An individual object storage xfs filesystem has the following directory hierarchy:
16 first-level directories
 |
 *-- 256 second-level directories
      |
      *-- 256 third-level directories
           |
           *-- many individual files
Leaf files are named using hexadecimal digits, and the directory names correspond to the first six digits of the file names.
There are 16 xfs filesystems per bay, one per value of the first digit.
The first directory names reuse the filesystem mount point digit as their first character and have 0 to f for the second.
The full path required to reach a10203fce717bef5c8232c3a0ef413acabe6c24c on uffizi is thus of this form: /srv/storage/a/a1/02/03/a10203fce717bef5c8232c3a0ef413acabe6c24c .
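The path scheme above can be sketched as a small shell function (a sketch for illustration, not actual Software Heritage code; the helper name obj_path is hypothetical):

```shell
# Compute the on-disk path of an object from its hexadecimal identifier,
# following the layout described above: the mount point is the first digit,
# then three two-digit directory levels made of the first six digits.
# Uses bash substring expansion: ${var:offset:length}.
obj_path() {
  local obj="$1"
  printf '/srv/storage/%s/%s/%s/%s/%s\n' \
    "${obj:0:1}" "${obj:0:2}" "${obj:2:2}" "${obj:4:2}" "$obj"
}

obj_path a10203fce717bef5c8232c3a0ef413acabe6c24c
```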
The Dell MD3460 disk bays have ethernet ports and can be managed on top of a TCP/IP network by proprietary tools.
Both uffizi and banco are connected to dedicated private IP networks having only two hosts present: themselves and their storage bay.
Debian amd64 binaries are present in /opt/dell/mdstoragesoftware .
Some useful documentation can be found on
http://ftp.respmech.com/pub/MD3000/en/MDSM_CLI_Guide/scriptcm.htm#wp1299092
The storage bays have the following IP addresses:
A few useful commands:
CLIENT=/opt/dell/mdstoragesoftware/client/SMcli
BAY=192.168.128.101
${CLIENT} ${BAY} -c "show allPhysicalDisks;"
${CLIENT} ${BAY} -c "show allPhysicalDisks summary;"
${CLIENT} ${BAY} -c "show storageArray batteryAge;"
${CLIENT} ${BAY} -c "show allVirtualDisks summary;"
${CLIENT} ${BAY} -c "show diskgroup [swhbackup1];"
${CLIENT} ${BAY} -c "show storageArray unreadableSectors;"
A few comments:
Thanks.
Some work-in-progress Sphinxdoc documentation is visible in this Phabricator review: https://forge.softwareheritage.org/D2140 .
Documentation pushed to swh-docs in bc863ec6a56d539f57079b0b60e616a625c84f81 and 66b3e07ed9d9dbde2333cefe0e3375742dc76231.