
Benchmark software for the object storage
Started, Work in Progress, Normal, Public

Description

Benchmark software for layer 0 of the object storage.

Event Timeline

dachary changed the task status from Open to Work in Progress. Mar 17 2021, 4:15 PM
dachary triaged this task as Normal priority.
dachary created this task.
dachary created this object in space S1 Public.

First draft for layer 0.

  • tests pass
  • runs with a degraded configuration and pgsql as a database
  • requires 8 machines (libvirt)

The benchmarks are not fully functional but they produce a write load that matches the object storage design. They run (README.txt) via libvirt and are being tested on Grid5000 to ensure all the pieces are in place (i.e. does it actually work to reserve machines + provision them + run) before moving forward.
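For reference, reserving and deploying the Grid5000 machines boils down to something like this (node count, walltime and environment name are illustrative, not the exact commands used here):

oarsub -t deploy -l host=8,walltime=4:00:00 -I
kadeploy3 -f $OAR_NODE_FILE -e debian10-x64-base -k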

The benchmark runs and it's not too complicated, which is a relief. I'll clean up the mess I made and move forward to finish writing the software.

direnv: loading bench/.envrc
========================================================= test session starts =========================================================
platform linux -- Python 3.7.3, pytest-6.2.2, py-1.10.0, pluggy-0.13.1
rootdir: /root
plugins: mock-3.5.1, asyncio-0.14.0
collected 10 items

bench/test_bench.py ..........                                                                       [100%]

========================================================= 10 passed in 55.48s =========================================================
Connection to dahu-25.grenoble.grid5000.fr closed.

The benchmark is nowhere near meaningful yet, but I could not resist and tried it anyway. With 2 writers it gives:

WARNING:root:Objects write/seconds 1008/s                                                                                             
WARNING:root:Bytes write/seconds 20MB/s

and with 5 writers it gives:

WARNING:root:Objects write/seconds 2K/s
WARNING:root:Bytes write/seconds 41MB/s

Meanwhile the PostgreSQL host has way too many processors for the load :-)
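A quick sanity check on those numbers: 20MB/s ÷ 1008 objects/s ≈ 20KB per object with 2 writers, and 41MB/s ÷ 2K objects/s ≈ 20KB with 5 writers, so the average object size is consistent and the write rate scales roughly with the number of writers (~400-500 objects/s each).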

Refactored the cluster provisioning to use all available disks instead of the existing file system (using cephadm instead of a hand-made Ceph cluster).
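For context, with cephadm the cluster provisioning essentially boils down to the following (hostnames and IP addresses are placeholders):

cephadm bootstrap --mon-ip 192.168.100.1
ceph orch host add ceph2 192.168.100.2
ceph orch apply osd --all-available-devices

The last command is what picks up all available disks as OSDs instead of relying on a pre-existing file system.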

The benchmark was moved to a temporary repository for convenience (easier than uploading it here every time): https://git.easter-eggs.org/biceps/biceps

Today I figured out that the bottleneck of the benchmark was actually the CPU usage of the benchmark itself, originating from an excessive number of transactions. A single worker achieves ~500 object inserts per second, but with more than 5 workers it tops out at ~2.5K object inserts per second because of the CPU. Hacking it a little showed it can reach 7K object writes per second. I rewrote the benchmark to fix this properly; this is commit https://git.easter-eggs.org/biceps/biceps/-/commit/c0e79a2b6751cacb19ad4fad804a3b942047eb7f.
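The fix essentially amounts to grouping many object insertions into a single transaction instead of paying for one transaction per object. A minimal sketch of the idea with psycopg2 (the table and column names are made up and not the actual biceps schema):

from psycopg2.extras import execute_values

def insert_batch(conn, rows):
    # one transaction and one round-trip for the whole batch,
    # instead of one transaction per object
    with conn:
        with conn.cursor() as cur:
            execute_values(
                cur,
                "INSERT INTO objects (sha256, size) VALUES %s",
                rows,
                page_size=1000,
            )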

I ran out of time on Grid5000 to verify that the rewrite works as expected, but I'm confident it will. The next steps will be to:

  • verify --fake-ro can reach 10K object insertions per second with 20 workers
  • run the benchmark without --fake-ro and see how much MB/s it achieves

Results:

  • bench.py --file-count-ro 20 --rw-workers 20 --packer-workers 20 --file-size 1024 --fake-ro yields WARNING:root:Objects write 17.7K/s
  • bench.py --file-count-ro 40 --rw-workers 40 --packer-workers 20 --file-size 1024 --fake-ro yields WARNING:root:Objects write 13.8K/s
  • bench.py --file-count-ro 20 --rw-workers 20 --packer-workers 20 --file-size 1024
    • WARNING:root:Objects write 6.4K/s
    • WARNING:root:Bytes write 131.1MB/s
  • bench.py --file-count-ro 200 --rw-workers 20 --packer-workers 20 --file-size 1024
    • WARNING:root:Objects write 6.1K/s
    • WARNING:root:Bytes write 124.4MB/s

https://git.easter-eggs.org/biceps/biceps/-/commit/4552098bc6f364ab0e59df996551f23b2ec35049

  • Add reader to continuously read from images to simulate a read workload
  • Randomize the payload instead of using easily compressible data (postgres compresses it well, which does not reflect reality); see the sketch below
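For the payload randomization, the point is simply to write incompressible data; a sketch of what that means (not the actual bench.py code):

import os

def payload(size):
    # os.urandom() output is effectively incompressible, unlike a constant
    # or repeating filler that postgres happily compresses
    return os.urandom(size)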

I chased various Ceph installation issues when using 14 machines and got to a point where it is reliable by:

  • zapping known devices even when they don't show up via the orchestrator: the internal logic waits for the required data to be available and does a better job than an attempt to wait for them to show up (sometimes they don't and the reason is unclear)
  • using a Docker mirror registry to avoid hitting the rate limit
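Concretely this boils down to something like the following (host, device and mirror address are placeholders):

ceph orch device zap ceph3 /dev/sdb --force

plus, on every host, an /etc/docker/daemon.json pointing at a local pull-through cache:

{
  "registry-mirrors": ["https://registry-mirror.example:5000"]
}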

https://git.easter-eggs.org/biceps/biceps/-/commit/ffaf1cad18748377ec8e90b12beed83a862afd4f

Complete rewrite to:

  • Use one thread per worker (using asyncio for the workloads turned out to be too complicated because Python 3 lacks universal async support, e.g. for file I/O)
  • Merge the write/pack steps together for simplicity, since one always follows the other
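In outline, the new structure is one thread per read/write worker, each writing an image and packing it right away (a simplified sketch with placeholder functions, not the actual biceps code):

from concurrent.futures import ThreadPoolExecutor

def write_objects(image_id):
    ...  # write the objects belonging to this image (placeholder)

def pack_image(image_id):
    ...  # pack the image that was just written (placeholder)

def worker(image_id):
    # packing always immediately follows writing, so one thread does both
    write_objects(image_id)
    pack_image(image_id)

with ThreadPoolExecutor(max_workers=20) as executor:
    list(executor.map(worker, range(20)))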

Running the tests during ~24h showed:

  • there is no significant memory leak (although there is a small one): overall memory usage stayed at 27GB while the process went from 800MB RSS to 1.4GB RSS.
  • the throughput does not degrade over time: creating ~6,000 images for 6TB shows the same throughput as creating 20 images for 20GB.

There is a 3% space overhead on the RBD data pool: 6TB data + 3TB parity = 9TB expected, 9.3TB actual, i.e. ~+3%.

root@ceph1:~# ceph df
--- RAW STORAGE ---
CLASS  SIZE    AVAIL   USED     RAW USED  %RAW USED
hdd    25 TiB  16 TiB  9.3 TiB   9.4 TiB      36.72
TOTAL  25 TiB  16 TiB  9.3 TiB   9.4 TiB      36.72
 
--- POOLS ---
POOL                   ID  PGS  STORED   OBJECTS  USED     %USED  MAX AVAIL
device_health_metrics   1    1  214 KiB       12  642 KiB      0    4.7 TiB
ro-data                18   32  6.0 TiB    1.56M  9.3 TiB  40.04    9.3 TiB
ro                     19   32   11 MiB   12.16k  1.1 GiB      0    4.7 TiB
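Cross-checking with the ceph df output: the ro-data pool stores 6.0 TiB and uses 9.3 TiB raw; with 3TB of parity on top of the 6TB of data the expected usage is 9TB, so the extra ~0.3TB is the ~3% overhead.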

rbd bench on the created images:

for i in $(rbd --pool ro ls | head -6) ; do rbd --pool ro  bench --io-type readwrite --io-threads 16 --io-total 1G $i > $i.out & done
rm *.out ; for i in $(rbd --pool ro ls | head -12) ; do rbd --pool ro --io-size 4K  bench --io-pattern rand --io-type read --io-threads 16 --io-total 10M $i > $i.out & done
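The first loop runs a mixed read/write bench with 16 threads and 1 GiB of total I/O against each of the first 6 images; the second removes the previous output files and runs a random-read bench with 4K I/Os and 10 MB of total I/O against each of the first 12 images.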