
Document low-level storage layers
Closed, ResolvedPublic

Description

"Low-level" here means anything below the Software Heritage Python code.

For all intents and purposes, this task will focus on the main object storage archive.

Event Timeline

ftigeot created this task.Aug 27 2019, 2:29 PM
ftigeot changed the task status from Open to Work in Progress.Aug 29 2019, 4:33 PM

Software Heritage: low-level storage

Software Heritage uses various servers with local storage to keep internal state for its various internal components.

The bulk of the storage is dedicated to what we call "the object storage" or "the archive".

The object storage is a large pool of small binary blobs (deduplicated, gzip-compressed file contents) stored as individual files whose sizes typically range from a few hundred bytes to a few kilobytes.
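The storage principle can be sketched in a few shell commands. This is illustrative only: the actual hashing and compression pipeline lives in the Software Heritage Python code, and the file names and paths below are made up for the example.

```shell
# Sketch: an object is addressed by a hash of its content and stored
# gzip-compressed; exact naming is handled by the Python layer.
printf 'hello' > /tmp/blob
sha1=$(sha1sum /tmp/blob | cut -d' ' -f1)   # content-derived identifier
gzip -c /tmp/blob > "/tmp/${sha1}.gz"       # compressed on-disk copy
echo "$sha1"
```

Storing objects under a content-derived name is what makes deduplication automatic: two identical contents map to the same file.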

As of August 2019, it is entirely stored on a proprietary Dell MD3460 storage bay linked to the uffizi server by SAS 12Gb/s cables.
This primary bay is located in building #30 of the Rocquencourt campus.

A secondary bay (same model, same number of disks, same disk models) is linked to the banco server and was initially used to store a replica of the data contained on the first bay.
That secondary bay is located in building #9 of the Rocquencourt campus.

Main storage bay hierarchy:

[on the bay]

  • 60x 5.5 TB, 7200 RPM hard disks (sold as 6 TB drives by disk vendors)
  • 3x RAID60 volumes of 20 drives each.

    These volumes are exported to louvre as LUN 1, LUN 2 and LUN 3. The drive bay and louvre are connected via SAS 12Gb/s cables and HBAs.

[on uffizi, previously louvre]

  • three SCSI devices exported by the bay: sddr, sdds and sddt
  • three multipath devices created on top of the previous ones: dm-1, dm-4 and dm-7
  • three 92.7TB partitions: dm-5 on top of dm-1, dm-8 on top of dm-4 and dm-10 on top of dm-7
  • md0: a RAID0 mdadm device created on top of dm-5, dm-8 and dm-10
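The assembled stack corresponds to an mdadm array definition of roughly this shape (an illustrative /etc/mdadm/mdadm.conf excerpt; the real file would carry an array UUID rather than raw device paths, which are not stable across reboots):

```
# RAID0 across the three multipath-backed partitions (illustrative)
ARRAY /dev/md0 level=raid0 num-devices=3 devices=/dev/dm-5,/dev/dm-8,/dev/dm-10
```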

That unified RAID0 volume is then split into 16 volumes of roughly 12 TB each and one of roughly 100 TB:

├─vg--data-uffizi--data0  254:67   0  12.2T  0 lvm   /srv/storage/0
├─vg--data-uffizi--data1  254:68   0  12.2T  0 lvm   /srv/storage/1
├─vg--data-uffizi--data2  254:69   0    12T  0 lvm   /srv/storage/2
├─vg--data-uffizi--data3  254:70   0  11.8T  0 lvm   /srv/storage/3
├─vg--data-uffizi--data4  254:71   0  12.2T  0 lvm   /srv/storage/4
├─vg--data-uffizi--data5  254:72   0  12.2T  0 lvm   /srv/storage/5
├─vg--data-uffizi--data6  254:73   0  12.2T  0 lvm   /srv/storage/6
├─vg--data-uffizi--data7  254:74   0  12.2T  0 lvm   /srv/storage/7
├─vg--data-uffizi--data8  254:75   0  12.2T  0 lvm   /srv/storage/8
├─vg--data-uffizi--data9  254:76   0  12.2T  0 lvm   /srv/storage/9
├─vg--data-uffizi--dataa  254:77   0  12.2T  0 lvm   /srv/storage/a
├─vg--data-uffizi--datab  254:78   0  12.2T  0 lvm   /srv/storage/b
├─vg--data-uffizi--datac  254:79   0  12.2T  0 lvm   /srv/storage/c
├─vg--data-uffizi--datad  254:80   0  12.2T  0 lvm   /srv/storage/d
├─vg--data-uffizi--datae  254:81   0  12.2T  0 lvm   /srv/storage/e
├─vg--data-uffizi--dataf  254:82   0  12.2T  0 lvm   /srv/storage/f
└─vg--data-uffizi--space  254:83   0   100T  0 lvm   /srv/storage/space
  • The 16 ~12 TB volumes are mounted on /srv/storage/0 to /srv/storage/f and contain xfs filesystems
  • Subdirectories 00 to ff of these filesystems are then bind-mounted into /srv/softwareheritage/objects/00 to /srv/softwareheritage/objects/ff
  • The 256 directories /srv/softwareheritage/objects/00 to /srv/softwareheritage/objects/ff are then NFS-exported to various machines in 192.168.100.0/24 .
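The bind-mount and NFS-export scheme would look roughly like this for a single prefix (an illustrative /etc/fstab and /etc/exports excerpt; the mount and export options shown are assumptions, not copied from the real hosts):

```
# /etc/fstab (illustrative): bind one 256th of the object namespace
/srv/storage/a/a0  /srv/softwareheritage/objects/a0  none  bind  0  0

# /etc/exports (illustrative): export the unified tree to the internal network
/srv/softwareheritage/objects  192.168.100.0/24(ro,no_subtree_check)
```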

Only one machine, 192.168.100.31 (moma), mounts uffizi:/srv/softwareheritage/objects on /srv/softwareheritage/objects at this time.

Secondary storage bay:

The lowest storage levels are the same as on the main bay: the bays themselves are exactly the same model, populated with the same disks.

One notable difference is that multipath actually appears to be in use here.

[on banco]

  • six SCSI devices exported by the bay: sdb, sdc, sdd, sde, sdf and sdg
  • three multipath devices: dm-1 on top of (sdb + sde), dm-4 on top of (sdc + sdf) and dm-7 on top of (sdd + sdg)
  • three ~= 92.7 TB partitions on top of the previous devices: dm-5, dm-8 and dm-10 (same as on the main bay)
  • md0: a RAID0 mdadm device created on top of dm-5, dm-8 and dm-10 (same as on the primary bay)
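The two-paths-per-LUN layout above is a standard dm-multipath configuration. A minimal /etc/multipath.conf sketch (illustrative only, not the actual file from banco; the path grouping policy is an assumption):

```
# Aggregate the two SAS paths to each LUN into one dm device (illustrative);
# multibus keeps both paths active, failover would keep one as a spare
defaults {
    user_friendly_names yes
    path_grouping_policy multibus
}
```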

The unified RAID0 volume is split into 16 volumes of roughly 10 TB each and a few additional, bigger ones (barman and space).

As on the uffizi storage bay, the 16 ~10 TB volumes are mounted on /srv/storage/0 to /srv/storage/f and contain xfs filesystems.
Subdirectories 00 to ff of these filesystems are then bind-mounted into /srv/softwareheritage/objects/00 to /srv/softwareheritage/objects/ff , once again replicating the configuration of the primary uffizi bay.

Directory hashing

An individual object storage xfs filesystem has the following directory hierarchy:

16 first-level directories
   |
   *-- 256 second-level directories
       |
       *-- 256 third-level directories
           |
           *-- many individual files

Leaf files are named using hexadecimal digits, and the directory names correspond to the first six digits of the file names.

There are 16 xfs filesystems per bay, each corresponding to the first digit of the object names it holds.
First-level directory names reuse the filesystem mount point digit as their first character; their second character ranges from 0 to f.

The full path required to reach object a10203fce717bef5c8232c3a0ef413acabe6c24c on uffizi is thus: /srv/storage/a/a1/02/03/a10203fce717bef5c8232c3a0ef413acabe6c24c .
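The mapping above can be expressed as a small shell function (a sketch; the function name swh_object_path is made up for illustration and does not exist in the codebase):

```shell
# Sketch: map an object id to its on-disk path on uffizi. The first hex digit
# selects the filesystem; the first six digits give the three directory levels.
swh_object_path() {
    h=$1
    d1=$(printf '%s' "$h" | cut -c1)      # filesystem mount point digit
    l1=$(printf '%s' "$h" | cut -c1-2)    # first-level directory
    l2=$(printf '%s' "$h" | cut -c3-4)    # second-level directory
    l3=$(printf '%s' "$h" | cut -c5-6)    # third-level directory
    printf '/srv/storage/%s/%s/%s/%s/%s\n' "$d1" "$l1" "$l2" "$l3" "$h"
}

swh_object_path a10203fce717bef5c8232c3a0ef413acabe6c24c
```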

Storage bay management

The Dell MD3460 disk bays have Ethernet ports and can be managed over a TCP/IP network with proprietary tools.

Both uffizi and banco are connected to dedicated private IP networks with only two hosts each: themselves and their storage bay.
Debian amd64 binaries are present in /opt/dell/mdstoragesoftware .

Some useful documentation can be found on
http://ftp.respmech.com/pub/MD3000/en/MDSM_CLI_Guide/scriptcm.htm#wp1299092

The storage bays have the following IP addresses:

  • 192.168.128.101 on banco
  • 192.168.254.95 on uffizi

A few useful commands:

CLIENT=/opt/dell/mdstoragesoftware/client/SMcli
BAY=192.168.128.101   # banco's bay; use 192.168.254.95 for uffizi

# List all physical disks (full and summary views)
${CLIENT} ${BAY} -c "show allPhysicalDisks;"
${CLIENT} ${BAY} -c "show allPhysicalDisks summary;"

# Check the age of the controller batteries
${CLIENT} ${BAY} -c "show storageArray batteryAge;"

# List all virtual disks (LUNs)
${CLIENT} ${BAY} -c "show allVirtualDisks summary;"

# Show the state of a given disk group
${CLIENT} ${BAY} -c "show diskgroup [swhbackup1];"

# Report sectors the array could not read
${CLIENT} ${BAY} -c "show storageArray unreadableSectors;"

A few comments:

  • We want a description of the current storage infrastructure, not a sysadmin doc on how to manage this storage, so I think the "Storage bay management" section should not be part of this documentation.
  • You should start from the global view: please describe the overall storage architecture before going into the details of each machine (like uffizi). The overall storage architecture includes remote locations (Azure, AWS, etc.).
  • Having a written description is nice, but a bunch of diagrams should come first: (literally) draw the big picture *then* describe it.
  • Adding comments in a Phabricator task is not the best way to deliver this task: please provide a properly structured document (be it a Sphinx-based project or pages on the internal wiki).

Thanks.

zack triaged this task as Normal priority.Sep 5 2019, 9:01 PM

Some work-in-progress Sphinx documentation is visible in this Phabricator review: https://forge.softwareheritage.org/D2140 .