Scale out object storage design
Started, Work in Progress, Normal, Public

Description

[Parent task for all related tasks]

Current status

An object storage design was discussed and described. Benchmarks need to be written to verify it is efficient (space and speed) for the intended use cases. The hardware to run the benchmarks has to be secured.

Description

The draft of the design for the object storage can be found here

Explorations

Discussions

Quantitative data

Current

  • Writes are limited by I/O to 10MB/s
  • Reads currently run at ~300 objects per second (25MB/s), down from ~500 objects per second (44MB/s) in the past
  • 50TB (30TB ZFS compressed) of objects added every month
  • Available space will be exhausted by the end of 2021
  • 10 billion objects
  • Objects occupy 750TB (350TB ZFS compressed) (see statistics as of February 2021)
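
A few derived figures help when reading the list above. The sketch below only does arithmetic on the numbers quoted in the list. The decimal units (1TB = 10^12 bytes) and the function names are assumptions made for illustration.

```python
# Figures copied from the "Current" list above.
# Assumption: decimal units (1 TB = 10**12 bytes, 1 MB = 10**6 bytes).
TB = 10**12
MB = 10**6

object_count = 10 * 10**9      # 10 billion objects
raw_bytes = 750 * TB           # before ZFS compression
compressed_bytes = 350 * TB    # after ZFS compression


def average_object_size(total_bytes, count):
    """Mean object size in bytes."""
    return total_bytes / count


def compression_ratio(compressed, raw):
    """Fraction of the raw size left after compression."""
    return compressed / raw


avg_size = average_object_size(raw_bytes, object_count)
overall_ratio = compression_ratio(compressed_bytes, raw_bytes)
monthly_ratio = compression_ratio(30 * TB, 50 * TB)

# Mean size of the objects being read, implied by the two read rates.
read_size_now = 25 * MB / 300
read_size_past = 44 * MB / 500

print(f"average object size: {avg_size / 1000:.0f} KB")
print(f"overall compression ratio: {overall_ratio:.2f}")
print(f"monthly compression ratio: {monthly_ratio:.2f}")
print(f"mean read size now/past: {read_size_now / 1000:.0f} KB / {read_size_past / 1000:.0f} KB")
```

The mean is dominated by large objects, so it says little about how many objects fall under a given size threshold.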

Goals

  • Write > 100MB/s, ~3,000 objects/s
  • Read > 100MB/s, ~3,000 objects/s
  • Durability overhead (erasure coding) 50% (2+1, 4+2)
  • Storage overhead (storage system) < 20%
  • Time to first byte (i.e. how long does it take for a client to get the first byte of an object after sending a read request to the server) < 100ms
  • 100 billion objects
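
The two overhead goals can be turned into a raw capacity budget. The sketch below assumes that a k+m erasure coding profile costs m/k extra space (which gives 50% for both 2+1 and 4+2) and that the storage system overhead is applied on top of it multiplicatively. Both the combination rule and the function names are assumptions, not part of the design.

```python
def erasure_overhead(data_chunks, parity_chunks):
    """Extra space used by a k+m erasure coding profile, as a fraction of the data."""
    return parity_chunks / data_chunks


def raw_capacity_tb(logical_tb, data_chunks, parity_chunks, storage_overhead):
    """Raw capacity needed to hold logical_tb of objects.

    Assumption: the storage system overhead multiplies the erasure coded size.
    """
    coded = logical_tb * (1 + erasure_overhead(data_chunks, parity_chunks))
    return coded * (1 + storage_overhead)


for k, m in [(2, 1), (4, 2)]:
    print(f"{k}+{m}: {erasure_overhead(k, m):.0%} durability overhead")

# Upper bound for today's 750TB of uncompressed objects with a 4+2 profile
# and the 20% storage overhead ceiling from the goals.
print(f"{raw_capacity_tb(750, 4, 2, 0.20):.0f} TB raw")
```

Under these assumptions the goals allow at most 1.5 × 1.2 = 1.8 bytes of raw storage per byte of object data.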

Event Timeline

@zack, very good point about having a target for the "time to first byte when reading an object".

I don't know what would be a "good" target for that metric; my gut says that staying within 100ms for any given object would be acceptable, as long as the number of parallel readers doesn't impact the amount too much (of course, within the IOPS of the underlying media, etc.).

In T3054#58874, @olasd wrote:

@zack, very good point about having a target for the "time to first byte when reading an object".

I don't know what would be a "good" target for that metric; my gut says that staying within 100ms for any given object would be acceptable, as long as the number of parallel readers doesn't impact the amount too much (of course, within the IOPS of the underlying media, etc.).

Updated the description with 100ms, thanks!

dachary updated the task description.

In the following, "small objects" are objects < 4KB, and "object storage software" refers to the software listed in the description for which there are no blockers.

  • Scale out: a "full scale out" (e.g. Swift or Ceph) requires optimizing the internals of the object storage to be friendly to the "small immutable objects" workload.
    • Pros
      • Work done to achieve the desired performance and space savings does not need to be revisited as storage grows.
      • Work done to achieve the desired performance and space savings can be contributed back to the object storage software and does not need to be maintained in the long run.
    • Cons
      • No object storage software is a perfect fit for small objects. They all have significant space amplification and/or degraded performance.
      • No object storage software targets immutable, never-deleted content. They miss optimizations that take advantage of these properties.
      • Improving the internals of an object storage software is a difficult task.
  • Metadata database: a "scale up metadata & scale out data" (e.g. EOS or SeaweedFS) requires writing glue to nicely bundle the database and the object storage.
    • Pros
      • It is not difficult to develop the software to glue together a database and an object storage.
    • Cons
      • The database scales up and will need to be revisited as storage grows.
      • The software needs to be maintained in the long run.
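
To make the "metadata database" option more concrete, here is a minimal sketch of what such glue could look like. It is not the proposed design: packing small objects into larger append-only blobs is an assumption made for illustration, SQLite stands in for the scale-up database, an in-memory dict stands in for the scale-out object storage, and `PackedStore`, its methods and the SHA-256 keys are all hypothetical.

```python
import hashlib
import sqlite3


class PackedStore:
    """Toy glue between a metadata database and a blob store (illustrative only)."""

    def __init__(self, pack_limit=4 * 1024 * 1024):
        # SQLite stands in for the scale-up metadata database.
        self.db = sqlite3.connect(":memory:")
        self.db.execute(
            "CREATE TABLE idx ("
            "obj_id TEXT PRIMARY KEY, pack_id INTEGER, start INTEGER, size INTEGER)"
        )
        # A dict of bytearrays stands in for the scale-out object storage.
        self.packs = {0: bytearray()}
        self.current = 0
        self.pack_limit = pack_limit

    def put(self, data: bytes) -> str:
        """Append data to the current pack and record where it lives."""
        obj_id = hashlib.sha256(data).hexdigest()
        known = self.db.execute(
            "SELECT 1 FROM idx WHERE obj_id = ?", (obj_id,)
        ).fetchone()
        if known is not None:
            return obj_id  # content is immutable, nothing to rewrite
        pack = self.packs[self.current]
        if pack and len(pack) + len(data) > self.pack_limit:
            # Current pack is full: start a new one.
            self.current += 1
            pack = self.packs[self.current] = bytearray()
        start = len(pack)
        pack.extend(data)
        self.db.execute(
            "INSERT INTO idx VALUES (?, ?, ?, ?)",
            (obj_id, self.current, start, len(data)),
        )
        return obj_id

    def get(self, obj_id: str) -> bytes:
        """Look up the pack and byte range in the database, then read it."""
        row = self.db.execute(
            "SELECT pack_id, start, size FROM idx WHERE obj_id = ?", (obj_id,)
        ).fetchone()
        if row is None:
            raise KeyError(obj_id)
        pack_id, start, size = row
        return bytes(self.packs[pack_id][start:start + size])


store = PackedStore(pack_limit=8)
first = store.put(b"hello")
second = store.put(b"world!")
print(store.get(first), store.get(second), len(store.packs))
```

Even in this toy form the cons listed above are visible: every read goes through the index table, which grows with the number of objects and lives in a single database.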
dachary updated the task description.

For the record, here is the half-baked benchmark script for the proposed design that I worked on today. To be continued!


dachary updated the task description.