
Hardware architecture for the object storage
Status: Work in Progress · Priority: Normal · Public

Description

Given the benchmark results from T3149, what hardware architecture could support the object storage design? (See also T3054 for more context.)

Here is a high-level description of the minimal hardware setup.

Network

  • 10Gb 16-port switch

Write Storage

If the failure domain is the host, there must be two of each machine. If the failure domain is the disk, additional disks must be added for RAID5 or RAID6.

  • 1 Global Index: disks == 4TB nvme, nproc == 48, ram == 128GB, network == 10Gb
  • 1 Write ingestion: disks == 6TB nvme, nproc == 64, ram == 256GB, network == 10Gb

Once ingested in PostgreSQL, the global index uses 125 bytes per entry. Each entry holds a 32-byte cryptographic signature plus an 8-byte identifier of the shard in which the corresponding object can be found, and a unique index is created on the cryptographic signature.
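The per-entry arithmetic above can be sanity-checked with a short sketch. The object count used below is a hypothetical figure for illustration, not taken from this task:

```python
# Back-of-the-envelope check of the global index sizing.
SIGNATURE_BYTES = 32    # cryptographic signature per entry
SHARD_ID_BYTES = 8      # identifier of the shard holding the object
PAYLOAD_BYTES = SIGNATURE_BYTES + SHARD_ID_BYTES  # 40 bytes of payload
ROW_BYTES = 125         # observed size per entry once in PostgreSQL
                        # (row plus unique-index overhead)

def index_size_tb(entries: int) -> float:
    """Estimated on-disk size of the global index, in TB."""
    return entries * ROW_BYTES / 1e12

# Hypothetical count of 10 billion objects (illustrative assumption):
entries = 10_000_000_000
print(f"payload per entry: {PAYLOAD_BYTES} B, stored: {ROW_BYTES} B")
print(f"index size for {entries:,} entries: {index_size_tb(entries):.2f} TB")
```

At that scale the index would occupy roughly 1.25 TB, which comfortably fits the 4TB NVMe listed above.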

Read Storage

  • 3 monitor/orchestrator: disks == 500GB ssd + 4TB storage, nproc == 8, ram == 32GB, network == 10Gb
  • 7 osd: disks == 500GB ssd + 10 x 8TB/12TB, nproc == 16, ram == 128GB, network == 10Gb
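As a rough capacity check for the OSD fleet above, here is a sketch assuming the lower 8TB disk option and 3x replication (Ceph's default); neither assumption is stated in this task:

```python
# Raw vs. usable capacity of the Read Storage OSD hosts.
OSD_HOSTS = 7
DISKS_PER_HOST = 10
DISK_TB = 8        # the spec allows 8TB or 12TB drives (assumption: 8TB)
REPLICATION = 3    # assumed Ceph default replication factor

raw_tb = OSD_HOSTS * DISKS_PER_HOST * DISK_TB
usable_tb = raw_tb / REPLICATION
print(f"raw: {raw_tb} TB, usable at {REPLICATION}x replication: {usable_tb:.0f} TB")
```

With erasure coding instead of replication the usable fraction would be higher, at the cost of CPU and recovery time.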

Clients

Each client machine runs up to 20 daemons servicing client requests for the Read Storage and the Write Storage.

  • 2 daemons: disks == 500GB ssd + 4TB storage, nproc == 24, ram == 32GB, network == 10Gb

See also https://www.supermicro.com/en/solutions/red-hat-ceph

Event Timeline

dachary changed the task status from Open to Work in Progress. Sat, May 15, 1:07 PM
dachary created this task.
dachary created this object in space S1 Public.
dachary updated the task description.
dachary triaged this task as Normal priority. Mon, May 17, 1:59 PM
dachary updated the task description.

@olasd E. Lacour completed a study for a Ceph cluster today, with hardware specifications and pricing. He is available to discuss if you'd like.

This comment was removed by dachary.

E. Lacour @ easter-eggs recently finished a study for hardware procurement and the design of a Ceph cluster that is not too far from the minimum required for the Read Storage. He is available to help if needed and if possible.

Yeah, I think it would be useful to have a chat, at least to get a set of sensible ballpark figures against which we can measure our own quotes (and maybe get an idea of other providers we could get hardware from, if what we're getting isn't satisfactory).

Would you mind setting up a call with Emmanuel, @vsellier and myself this week? (Starting Tuesday, I don't have any hard scheduling constraints for what I'd expect would be a 30-minute call.)

The call is set to Wednesday June 2nd, 2021 4pm UTC+2 at https://meet.jit.si/ApparentStreetsJokeOk

My notes on the meeting:

manu

  • Remote access
  • ASINFO provided the necessary specs based on our requirements; it would have been too difficult for us to navigate the catalogue ourselves
  • We don't buy the hard drives from ASINFO because they are more expensive there
  • Careful with the choice of SSD and NVMe drives: it matters a lot, and the Ceph cluster wears them out very quickly (see Intel vs. others)
  • We added HBA cards with cache; otherwise it is slower
  • We have two pools for Ceph
    • Journal on SSD + HDD
    • Full SSD
  • It is difficult to find 2.5'' SSDs; NVMe drives are easier to find
  • The Ceph backend and frontend networks are worth separating for easier debugging (two 10Gb cards)
  • We tried a hyperconverged setup (VMs + Ceph) but had trouble debugging performance problems

olasd

  • For the PostgreSQL cluster
  • Dell with 5 years warranty
  • Two machines with cross replication
  • Ceph
  • Machines without warranty
  • Not necessarily Dell, maybe SuperMicro
  • ASINFO is our provider
  • We bought a disk array from ASINFO and they gave us a reasonable deal, but we did not research the market thoroughly
  • It is 2x cheaper than Dell
  • I noticed that NVMe can be in the same price range as SSD
  • Regarding the network, we have two switches that do 10Gb
  • I'm not sure if there is a need for link aggregation
  • We have Proxmox-based Ceph (hyperconverged)
  • For the Read Storage we are looking at a 100% dedicated Cluster
  • Maybe (at a later time) we could use the Ceph cluster for other workloads (if performance allows)
  • We will need to add a rack (3 PDU, 32A each)