Page MenuHomeSoftware Heritage

Benchmark software for the object storage
Closed, ResolvedPublic

Description

Benchmarks software for the object storage.

The version that will be used to run the benchmarks can be found at https://git.easter-eggs.org/biceps/biceps/-/tree/7d137fcd54f265253a27346b3652e26c6c5dd5e8

Event Timeline

dachary changed the task status from Open to Work in Progress.Mar 17 2021, 4:15 PM
dachary triaged this task as Normal priority.
dachary created this task.
dachary created this object in space S1 Public.

First draft for layer 0.

  • tests pass
  • runs with a degraded configuration and pgsql as a database
  • requires 8 machines (libvirt)

The benchmarks are not fully functional but they produce a write load that matches the object storage design. They run (README.txt) via libvirt and are being tested on Grid5000 to ensure all the pieces are in place (i.e. does it actually work to reserve machines + provision them + run) before moving forward.

The benchmark runs and it's not too complicated which is a relief. I'll cleanup the mess I made and move forward to finish writing the software.

direnv: loading bench/.envrc                                                                                                           ========================================================= test session starts =========================================================
platform linux -- Python 3.7.3, pytest-6.2.2, py-1.10.0, pluggy-0.13.1                                                                 
rootdir: /root                                                                                                                         
plugins: mock-3.5.1, asyncio-0.14.0                                                                                                    
collected 10 items                                                                                                                    
                                                            
bench/test_bench.py ..........                                                                                                  [100%] 
                                                                                                                  
========================================================= 10 passed in 55.48s =========================================================
Connection to dahu-25.grenoble.grid5000.fr closed.

I could not resist despite the fact that the benchmark is nowhere near meaningful and tried it anyway. With 2 writers it gives:

WARNING:root:Objects write/seconds 1008/s                                                                                             
WARNING:root:Bytes write/seconds 20MB/s

and with 5 writers it gives:

WARNING:root:Objects write/seconds 2K/s
WARNING:root:Bytes write/seconds 41MB/s

Meanwhile the PostgreSQL host has way too many processors for the load :-)

Refactored the custer provsioning to use all available disks instead of the existing file system (using cephadm instead of a hand made ceph cluster).

The benchmark was moved to a temporary repository for convenience (easier than uploading here every time). https://git.easter-eggs.org/biceps/biceps

Today I figured out the bottleneck of the benchmark was actually the CPU usage of the benchmark itself, originating from an excessive amount of transactions. A single worker achieves ~500 object insert per seconds but adding more than 5 workers it tops at ~2.5K objects inserts because of the CPU. Hacking it a little showed it can reach 7K object write per second. I rewrote the benchmark to fix this properly, this is commit https://git.easter-eggs.org/biceps/biceps/-/commit/c0e79a2b6751cacb19ad4fad804a3b942047eb7f.

I ran out of time on grid5000 to verify the rewrite works as expected but I'm confident it will. The next steps will be to:

  • verify --fake-ro can reach 10K object insertions per second with 20 workers
  • run the bench mark without --fake-ro and see how much MB/s it achieves
  • bench.py --file-count-ro 20 --rw-workers 20 --packer-workers 20 --file-size 1024 --fake-ro yields WARNING:root:Objects write 17.7K/s
  • bench.py --file-count-ro 40 --rw-workers 40 --packer-workers 20 --file-size 1024 --fake-ro yields WARNING:root:Objects write 13.8K/s
  • bench.py --file-count-ro 20 --rw-workers 20 --packer-workers 20 --file-size 1024
    • WARNING:root:Objects write 6.4K/s
    • WARNING:root:Bytes write 131.1MB/s
  • bench.py --file-count-ro 200 --rw-workers 20 --packer-workers 20 --file-size 1024
    • WARNING:root:Objects write 6.1K/s
    • WARNING:root:Bytes write 124.4MB/s

https://git.easter-eggs.org/biceps/biceps/-/commit/4552098bc6f364ab0e59df996551f23b2ec35049

  • Add reader to continuously read from images to simulate a read workload
  • Randomize the payload instead of using easily compressible data (postgres does a good job compressing them and this does not reflect the reality)

I chased various Ceph installation issues when using 14 machines and got to a point where it is reliable by:

  • zapping known devices even when they don't show up via the orchestrator: the internal logic waits for the required data to be available and does a better job than an attempt to wait for them to show up (sometime they don't and the reason is unclear)
  • using a Docker mirror registry to avoid hitting the rate limit

https://git.easter-eggs.org/biceps/biceps/-/commit/ffaf1cad18748377ec8e90b12beed83a862afd4f

Complete rewrite to:

  • Use one thread per worker (using asyncio for workloads turns out to be too complicated because python3 lacks universal support, for file I/O)
  • Merge the write/pack steps together for simplicity since one follows the other, always

Running the tests during ~24h showed:

  • there is no significant memory leak (but there is). The memory usage stayed at 27GB and went from 800MB RSS to 1.4GB RSS.
  • the throughput is does not degrade over time: creating ~6,000 image for 6TB shows the same througput as creating 20 images for 20GB.

There is a 3% space overhead on the RBD data pool. 6TB data, 3TB parity = 9TB. Actual 9.3TB, i.e. ~+3%.

root@ceph1:~# ceph df
--- RAW STORAGE ---
CLASS  SIZE    AVAIL   USED     RAW USED  %RAW USED
hdd    25 TiB  16 TiB  9.3 TiB   9.4 TiB      36.72
TOTAL  25 TiB  16 TiB  9.3 TiB   9.4 TiB      36.72
 
--- POOLS ---
POOL                   ID  PGS  STORED   OBJECTS  USED     %USED  MAX AVAIL
device_health_metrics   1    1  214 KiB       12  642 KiB      0    4.7 TiB
ro-data                18   32  6.0 TiB    1.56M  9.3 TiB  40.04    9.3 TiB
ro                     19   32   11 MiB   12.16k  1.1 GiB      0    4.7 TiB

rbd bench on the images created

for i in $(rbd --pool ro ls | head -6) ; do rbd --pool ro  bench --io-type readwrite --io-threads 16 --io-total 1G $i > $i.out & done
rm *.out ; for i in $(rbd --pool ro ls | head -12) ; do rbd --pool ro --io-size 4K  bench --io-pattern rand --io-type read --io-threads 16 --io-total 10M $i > $i.out & done

https://git.easter-eggs.org/biceps/biceps/-/commit/d10cf0ec8c6e18ebc79122d0bb3587edbcb4594e

Struggled most of today because there is a bottleneck when using threads and postgres, from a single client. However, when running 4 process, it performs as expected. The benchmark should be rewritten to use the process pool instead of the thread pool which should not be too complicated. I tried to add a warmup phase so that all concurrent threads/process do not start at the same time, but it does not really make any visible difference.

$ ansible-playbook -i inventory tests-run.yml && ssh -t $runner direnv exec bench python bench/bench.py --file-count-ro 200 --rw-workers 10 --ro-workers 10 --file-size 1024 --no-warmup

WARNING:root:Objects write 1.4K/s
WARNING:root:Bytes write 27.9MB/s
WARNING:root:Objects read 816/s
WARNING:root:Bytes read 61.6MB/s

Since there are 4 process delivering the same performances (more or less 10%), this is:

  • Objects write ~6K/s
  • Bytes write: ~120MB/s
  • Objects read: ~3K/s
  • Bytes read: ~150MB/s

https://git.easter-eggs.org/biceps/biceps/-/commit/d4702adfedfa5486674062aa7364766ca4d22ec1

The rewrite to use processes was trivial and preliminary tests yield the expected results. Most of the time was spent on two problems:

  • Postgres server side limitation to 100 clients which I originally mistook for a client side limitation because of the sqlalchemy pools (and the code indeed keeps way too many open connections)
  • The select when packing turned out to be very expensive and the reason why it slows down to a halt with 100GB Shards. Switched to psycopg2 instead to use server side cursors (and not use an insane amount of RAM) while enumerating all objects at a constant speed.

The grid5000 cluster was reserved for this weekend (from friday night to sunday night) to run tests and collect results which are hopefully final for the layer 0 benchmarks, using 100GB images and all available disk space. I will babysit the run to catch unexpected behavior.

$ bench.py --file-count-ro 200 --rw-workers 20 --ro-workers 80 --file-size 50000 --rand-ratio 10
...
WARNING:root:Objects write 5.8K/s
WARNING:root:Bytes write 118.4MB/s
WARNING:root:Objects read 12.3K/s
WARNING:root:Bytes read 850.3MB/s

I think the results are too optimistic for reads because it starts at the warmup phase (i.e. when a single Shard exists in the Read Storage) and it fits in RAM.

$ bench.py --file-count-ro 200 --rw-workers 20 --ro-workers 80 --file-size 50000 --no-warmup
...
WARNING:root:Objects write 5.8K/s
WARNING:root:Bytes write 117.9MB/s
WARNING:root:Objects read 1.3K/s
WARNING:root:Bytes read 100.4MB/s

Now the reads do not deliver the expected number of objects/s which I did not forsee.

https://git.easter-eggs.org/biceps/biceps/-/commit/83434e93db6a12988d1f12764cecc5447a9af549

  • Group the two postgresql nvme drives in a single logical volume to get more storage. We need 30 write workers using 100GB Shards require 3TB of postgresql storage
  • Setup a second postgresql server set as a standby replication of the master: it may negatively impact the performances of the master cluster and should be included in the benchmark
  • Explain the benchmark methodology & assumptions
$ ansible-playbook -i inventory tests-run.yml && ssh -t $runner direnv exec bench python bench/bench.py --file-count-ro 500 --rw-workers 40 --ro-workers 40 --file-size 50000 --no-warmup

Crashed because the postgres WAL grew over 700GB and filled the disk. It happened after the first 40 workers completed and after the next 40 were started.

$ ansible-playbook -i inventory tests-run.yml && ssh -t $runner direnv exec bench python bench/bench.py --file-count-ro 1000 --rw-workers 40 --ro-workers 40 --file-size 25000 --no-warmup

Should consume a maximum of 1TB of database space at any point in time (40 * 25GB) and that will allow monitoring of how the size of the WAL grows, specially after databases are deleted.

After writing 1TB in 40 DB (40 * 25GB), the WAL is ~200GB i.e. ~20%:

Sat 08 May 2021 10:18:20 AM CEST                                                                          
1.1T    /var/lib/postgresql/11/main/base                                                                  
198G    /var/lib/postgresql/11/main/pg_wal

and the standby server is ~200GB behind and its WAL is small which suggests it is not the one lagging behind:

Sat 08 May 2021 10:18:29 AM CEST        
900G    /var/lib/postgresql/11/main/base  
1.5G    /var/lib/postgresql/11/main/pg_wal

While the content of the Write Storage is moved over to the Read Storage, the standby catches up but not very fast:

Sat 08 May 2021 10:28:21 AM CEST
1.1T    /var/lib/postgresql/11/main/base
185G    /var/lib/postgresql/11/main/pg_wal

and the standby:

Sat 08 May 2021 10:28:00 AM CEST
919G    /var/lib/postgresql/11/main/base
1.6G    /var/lib/postgresql/11/main/pg_wal

Looking at the walreceiver process on the standby, it writes on disk at a rate of up to 500MB/s when the databases are being written. But when they are complete and there only are reads to move data from Write Storage to Read Storage, it goes down to 100MB/s although the standby is still ~200GB behind the master and there is ~200GB worth of WAL available on the master. Why does the receiver slow down?

After the first round of writers is done and 1TB worth of database is deleted from the master, the standby does not free the disk space:

Sat 08 May 2021 10:57:25 AM CEST
107G    /var/lib/postgresql/11/main/base
163G    /var/lib/postgresql/11/main/pg_wal

and on the standby:

Sat 08 May 2021 10:57:32 AM CEST
1022G   /var/lib/postgresql/11/main/base
1.4G    /var/lib/postgresql/11/main/pg_wal
$ ansible-playbook -i inventory tests-run.yml && ssh -t $runner direnv exec bench python bench/bench.py --file-count-ro 500 --rw-workers 40 --ro-workers 40 --file-size 50000 --no-warmup
...
WARNING:root:Objects write 6.8K/s
WARNING:root:Bytes write 137.7MB/s
WARNING:root:Objects read 1.5K/s
WARNING:root:Bytes read 109.9MB/s

There is a contention on reads: the individual OSDs are not overloaded (vmstat shows almost no iowait), the network bandwidth is at 50% capacity, the CPU is mostly idle, the RAM is used at less than 50%. An analysis of a running cluster is needed to figure out why the reads under perform. The the first task is to repeat the problem with a shorter run of the benchmark.

https://git.easter-eggs.org/biceps/biceps/-/commit/54e9c9e5b5ec00e22d2ae28f8d0ae91f658eac0d

Tune PostgreSQL and verify it improves the situation as follows:

  1. Add enough data to the Write Storage to have time to observe the long tail (at least 500GB)
$ ansible-playbook -i inventory tests-run.yml && ssh -t $runner direnv exec bench python bench/bench.py --file-count-ro 1000 --rw-workers 40 --ro-workers 40 --file-size 25000 --no-warmup
  1. ssh to the standby server and verify it writes ~500MB/2 and the WAL is at least 50GB (10%) on the master
  2. Kill the benchmark
  3. ssh to the standby server and verify it writes ~500MB/2 and the WAL decreases at the same speed on the master

https://git.easter-eggs.org/biceps/biceps/-/commit/45aedaa6887659dc4454e2109a6b7361abab5db9

With hot_standby = off the WAL is quickly flushed to the standby server when the write finish.
As soon as the write finish, the benchmark starts to read all databases as fast as it can which
significantly slows down the replication because it needs to ensure strong consistency between the
master and the standby.

Here is the disk usage on the standby towards the end of the write phase. The WAL has grown to 15GB
and is ~50GB on the master. It only takes a few minutes for the standby to catch up.

Mon 10 May 2021 02:46:19 PM CEST
382G    /var/lib/postgresql/11/main/base
15G     /var/lib/postgresql/11/main/pg_wal
Mon 10 May 2021 02:46:49 PM CEST
388G    /var/lib/postgresql/11/main/base
11G     /var/lib/postgresql/11/main/pg_wal
Mon 10 May 2021 02:47:19 PM CEST
396G    /var/lib/postgresql/11/main/base
7.7G    /var/lib/postgresql/11/main/pg_wal
Mon 10 May 2021 02:47:49 PM CEST
402G    /var/lib/postgresql/11/main/base
5.7G    /var/lib/postgresql/11/main/pg_wal
Mon 10 May 2021 02:48:19 PM CEST
409G    /var/lib/postgresql/11/main/base
4.3G    /var/lib/postgresql/11/main/pg_wal
Mon 10 May 2021 02:48:49 PM CEST
416G    /var/lib/postgresql/11/main/base
2.4G    /var/lib/postgresql/11/main/pg_wal
Mon 10 May 2021 02:49:19 PM CEST
421G    /var/lib/postgresql/11/main/base
2.4G    /var/lib/postgresql/11/main/pg_wal
Mon 10 May 2021 02:49:49 PM CEST
426G    /var/lib/postgresql/11/main/base
2.4G    /var/lib/postgresql/11/main/pg_wal
Mon 10 May 2021 02:50:19 PM CEST
431G    /var/lib/postgresql/11/main/base
1.1G    /var/lib/postgresql/11/main/pg_wal
Mon 10 May 2021 02:50:49 PM CEST
431G    /var/lib/postgresql/11/main/base
1.1G    /var/lib/postgresql/11/main/pg_wal
Mon 10 May 2021 02:51:19 PM CEST
431G    /var/lib/postgresql/11/main/base
1.1G    /var/lib/postgresql/11/main/pg_wal
Mon 10 May 2021 02:51:49 PM CEST
431G    /var/lib/postgresql/11/main/base
1.1G    /var/lib/postgresql/11/main/pg_wal
Mon 10 May 2021 02:52:19 PM CEST
431G    /var/lib/postgresql/11/main/base
1.1G    /var/lib/postgresql/11/main/pg_wal
Mon 10 May 2021 02:52:49 PM CEST
431G    /var/lib/postgresql/11/main/base
1.1G    /var/lib/postgresql/11/main/pg_wal
Mon 10 May 2021 02:53:20 PM CEST
431G    /var/lib/postgresql/11/main/base
1.1G    /var/lib/postgresql/11/main/pg_wal
Mon 10 May 2021 02:53:50 PM CEST
431G    /var/lib/postgresql/11/main/base
1.1G    /var/lib/postgresql/11/main/pg_wal

And the disk usage on the master:

Mon 10 May 2021 02:52:17 PM CEST
431G    /var/lib/postgresql/11/main/base
48G     /var/lib/postgresql/11/main/pg_wal
Mon 10 May 2021 02:52:48 PM CEST
431G    /var/lib/postgresql/11/main/base
48G     /var/lib/postgresql/11/main/pg_wal
Mon 10 May 2021 02:53:18 PM CEST
431G    /var/lib/postgresql/11/main/base
48G     /var/lib/postgresql/11/main/pg_wal
Mon 10 May 2021 02:53:48 PM CEST
431G    /var/lib/postgresql/11/main/base
48G     /var/lib/postgresql/11/main/pg_wal
Mon 10 May 2021 02:54:18 PM CEST
431G    /var/lib/postgresql/11/main/base
1.1G    /var/lib/postgresql/11/main/pg_wal
Mon 10 May 2021 02:54:48 PM CEST
431G    /var/lib/postgresql/11/main/base
1.1G    /var/lib/postgresql/11/main/pg_wal
Mon 10 May 2021 02:55:18 PM CEST
431G    /var/lib/postgresql/11/main/base
1.1G    /var/lib/postgresql/11/main/pg_wal

https://git.easter-eggs.org/biceps/biceps/-/commit/b23b78446aa485586d29ba599e55a341fc2db29c

  • remap RBD images readonly when they are full so that there is no need to acquire read-write (not sure it matters, just an idea at this point and it's a simple thing to do)
  • clobber postgres when starting the benchmarks, in case there are leftovers
  • the postgres standby does not need to be hot (see above)
  • add recommended tuning for PostgreSQL (assuming a machine that has 128GB RAM)
  • zap the grid5000 nvme for PostgreSQL because they are not reset when the machine is deployed

https://git.easter-eggs.org/biceps/biceps/-/commit/ca01c6d758b26cfe27e6a61993f6c0d94986b460

  • estimate the number of objects with sequential read based on the median size
  • implement read-only to experiment with various settings on an existing Read Storage

Now reads perform a lot better because the miscalculation is fixed but also because the RBD is mounted read-only. It must be throttled otherwise it puts too much pressure on the cluster which underperforms on writes.

$ bench.py --file-count-ro 500 --rw-workers 40 --ro-workers 40 --file-size 40000 --no-warmup
...
WARNING:root:Objects write 3.9K/s
WARNING:root:Bytes write 78.7MB/s
WARNING:root:Objects read 46.5K/s
WARNING:root:Bytes read 190.1MB/s

Reducing the number of read workers to 20 allows writes to perform as expected. The test results are collected in the README file for archive.

https://git.easter-eggs.org/biceps/biceps/-/commit/14a6cf08549712a4341db548c4aa984e1494d206

1GB Shards, 200GB Read Storage

$ bench.py --file-count-ro 200 --rw-workers 40 --ro-workers 20 --file-size 1024 --no-warmup
WARNING:root:Objects write 6.0K/s
WARNING:root:Bytes write 122.0MB/s
WARNING:root:Objects read 66.8K/s
WARNING:root:Bytes read 272.7MB/s

10GB Shards, 2TB Read Storage

$ bench.py --file-count-ro 200 --rw-workers 40 --ro-workers 20 --file-size 10240 --no-warmup
WARNING:root:Objects write 5.9K/s
WARNING:root:Bytes write 121.3MB/s
WARNING:root:Objects read 42.7K/s
WARNING:root:Bytes read 174.2MB/s

40GB Shards, 8TB Read Storage

$ bench.py --file-count-ro 200 --rw-workers 40 --ro-workers 20 --file-size 40000 --no-warmup
WARNING:root:Objects write 5.8K/s
WARNING:root:Bytes write 118.3MB/s
WARNING:root:Objects read 34.5K/s
WARNING:root:Bytes read 140.8MB/s

https://git.easter-eggs.org/biceps/biceps/-/commit/78259cc59251d817b730f75f1444d5a8a3d7ce1e

  • Add the generate script to ingest entries in the global index.

It proceeds in two phases a) use COPY to insert entries at 500K/s, b) build the index. Since building the index roughly takes the same time as inserting the rows, the effective ingestion speed is 250K/s. Each entry uses ~100 bytes of raw space when indexed. That's ~1 billion entries per hour. It should take 24h to create a 2.5TB index (roughly the maximum space that is currently available on a PostgreSQL server) with 25 billions entries.

When COPY inserts records while the index exists, the rate drops to 32K/s, i.e. it is an order of magnitude faster to create the index afterwards.

20 billions entries were inserted in the global index. After building the index it occupies 2.5TB, therefore each entry uses ~125 bytes of raw space. That's 25% more than with a 1 billion entries global index (i.e. 100 bytes)

It turns out building the index for 20 billions entries takes more time than expected. Creating the table took 10 hours. I stopped the create index after 18 hours. A 10 billions entries index should be created first to evaluate how long it really takes.

https://git.easter-eggs.org/biceps/biceps/-/tags/v2021-06-05

This week-end run was not very fruitful: since the global index could not be populated as expected and it was discovered Sunday morning, there was no time to fallback to a small one, for instance 10 billion entries. A run was launched and lasted ~24h to show:

  • time to first byte
  • insertion on an empty global index when a new object is written does not negatively impact the performances

The performances are as expected:

40GB Shards, 16TB Read Storage

$ bench.py --file-count-ro 400 --rw-workers 40 --ro-workers 20 --file-size $((40 * 1024)) --no-warmup
...
WARNING:root:Objects write 5.4K/s
WARNING:root:Bytes write 110.6MB/s
WARNING:root:Objects read 40.7K/s
WARNING:root:Bytes read 166.2MB/s
WARNING:root:Time to first byte 62ms

However the time to first byte is the average but it should be the maximum time to first byte and this needs fixing.

The grid5000 cluster was booked for next week-end with:

  • 2 yeti starting Friday evening until Monday morning, on which the global index will be built
  • 2 yeti + 30 dahu starting Saturday evening until Monday morning on which the benchmarks will be run using the pre-filled global index

and this needs fixing.

do you mean the bench code needs fixing (to report the proper stats)?

and this needs fixing.

do you mean the bench code needs fixing (to report the proper stats)?

Yes, that's what I'm working on right now :-)

While you're at it, could you report quantiles for the time to first byte, instead of just a raw maximum?

Something like:

  • best 1%
  • best 10%
  • best 25%
  • median
  • worst 25% / best 75%
  • worst 10%
  • worst 1%
  • maximum

(this all might be overkill, but...)

In T3149#65880, @olasd wrote:

While you're at it, could you report quantiles for the time to first byte, instead of just a raw maximum?

Something like:

  • best 1%
  • best 10%
  • best 25%
  • median
  • worst 25% / best 75%
  • worst 10%
  • worst 1%
  • maximum

(this all might be overkill, but...)

Preliminary testing shows that there is roughly 1 in a million reads taking longer than 100ms. I don't think it can be avoided, even in a controlled environment there will be bumps that create longer response time. Since we're probably not interested in how time to first byte varies when it's under 100ms I'm not very motivated to create and display this information. It would be interesting to figure out when the read takes longer than 100ms but this is a complex system and figuring the root cause of these exceptional events is difficult.

Preliminary testing shows that there is roughly 1 in a million reads taking longer than 100ms. I don't think it can be avoided, even in a controlled environment there will be bumps that create longer response time. Since we're probably not interested in how time to first byte varies when it's under 100ms I'm not very motivated to create and display this information. It would be interesting to figure out when the read takes longer than 100ms but this is a complex system and figuring the root cause of these exceptional events is difficult.

I agree that some latency spikes are probably unavoidable, or at least that it would take considerable profiling and observation to get a hang on them (which is outside of the scope of this set of benchmarks). I also agree that the current results for the benchmark look on target.

I still think that returning a histogram of response times, in buckets of 5 or 10 ms wide ranges, may be valuable? We can then derive percentiles from that if we're so inclined.

I still think that returning a histogram of response times, in buckets of 5 or 10 ms wide ranges, may be valuable? We can then derive percentiles from that if we're so inclined.

Ack: I made a note to not forget this topic.

how about just collecting all raw timings in an output CSV file (or several files if needed) and compute the stats downstream (e.g., with pandas)?
that would allow changing the percentiles later on as well as compute different stats, without having to rerun the benchmarks

In T3149#65906, @zack wrote:

how about just collecting all raw timings in an output CSV file (or several files if needed) and compute the stats downstream (e.g., with pandas)?
that would allow changing the percentiles later on as well as compute different stats, without having to rerun the benchmarks

I'll work on that next week.

For the record, creating 10 billions entries in the global index took:

real    872m59.325s
user    192m42.553s
sys     134m48.580s

I interrupted the benchmarks because it shows reads are not as expected, i.e. a large number of reads take very long and the number of reads per seconds is way more than what is needed. There is no throttling on reads only the number of workers is the limit. I was expecting they would be slowed down by other factors and not apply too much pressure on the cluster. But I was apparently wrong and throttling must be implemented to slow them down.

WARNING:root:Objects write 6.5K/s                                                                                                                           
WARNING:root:Bytes write 131.8MB/s                                                                                                                           
WARNING:root:Objects read 55.3K/s                                           
WARNING:root:Bytes read 226.1MB/s                                                                                                                            WARNING:root:17.84542486337239% of random reads took longer than 100.0ms

The writes are however performing in a way that is consistent with previous results despite the fact that it also writes in the global index populated with 10 billions entries.

Running the benchmark with a read workload only (the Ceph cluster is doing nothing else), with 20 workers shows 8% of requests with a latency above the threshold:

$ bench.py --file-count-ro 0 --file-size 40000 --rw-workers 0 --ro-workers 20 
WARNING:root:Objects read 64.2K/s
WARNING:root:Bytes read 261.6MB/s
WARNING:root:8.382380110322257% of random reads took longer than 100.0ms
WARNING:root:Worst times to first byte on random reads (ms) [699, 687, 685, 682, 666, 654, 649, 641, 637, 633, 632, 624, 622, 622, 601, 574, 560, 558, 557, 555, 553, 553, 550, 549, 548, 547, 546, 544, 542, 535, 534, 531, 520, 507, 500, 495, 493, 492, 485, 483, 476, 475, 473, 471, 469, 468, 468, 467, 467, 464, 464, 462, 460, 460, 460, 460, 455, 453, 449, 447, 447, 447, 441, 441, 439, 438, 436, 436, 435, 435, 435, 431, 429, 426, 419, 418, 417, 415, 414, 413, 412, 411, 409, 408, 407, 406, 405, 404, 403, 403, 403, 402, 400, 399, 396, 394, 394, 394, 394, 394]

Running 5 read workers in the same conditions shows latencies above the threshold are about 0.01%

$ bench.py --file-count-ro 0 --file-size 40000 --rw-workers 0 --ro-workers 5
WARNING:root:Objects read 36.8K/s                                                                                                                            
WARNING:root:Bytes read 151.1MB/s                                                                                                                            
WARNING:root:0.01038885483653137% of random reads took longer than 100.0ms                                                                                   
WARNING:root:Worst times to first byte on random reads (ms) [133, 129, 121, 119, 109, 107, 107, 105, 101, 100] 

When the benchmark write, the pressure of 40 workers slows down the reads significantly.

WARNING:root:Objects read 9.0K/s                                            
WARNING:root:Bytes read 36.6MB/s                                             
WARNING:root:19.843788137661704% of random reads took longer than 100.0ms  
WARNING:root:Worst times to first byte on random reads (ms) [4432, 3346, 3028, 2561, 2297, 2101, 1832, 1788, 1759, 1696, 1676, 1574, 1541, 1524, 1502, 1407,
1379, 1335, 1253, 1244, 1234, 1233, 1200, 1197, 1167, 1128, 1091, 1089, 1046, 1017, 1016, 1007, 980, 961, 958, 923, 905, 904, 882, 868, 864, 855, 849, 811, 7
97, 796, 794, 792, 789, 780, 775, 767, 757, 731, 726, 718, 712, 711, 708, 708, 693, 680, 665, 661, 659, 641, 626, 623, 609, 607, 594, 593, 588, 583, 581, 577
, 572, 560, 558, 554, 537, 530, 522, 522, 519, 514, 508, 495, 494, 493, 492, 489, 488, 483, 481, 481, 479, 476, 475, 474]

It is quite clear that they are competing. The next step is to try to reduce the number of write workers.

The equilibrium between reads and write is with 5 readers and 10 writers which leads to 1.2% random reads above the threshold, the worst one being 2sec. What it means is that care must be taken, application side, to throttle reads and writes otherwise the penalty is a significant degradation is latency.

$ bench.py --file-count-ro 0 --file-size 40000 --rw-workers 0 --ro-workers 5
WARNING:root:Objects read 29.4K/s                                                                                                                            
WARNING:root:Bytes read 119.9MB/s                                                                                                                            
WARNING:root:1.2761297725814886% of random reads took longer than 100.0ms                                                                                    
WARNING:root:Worst times to first byte on random reads (ms) [2123, 2011, 1952, 1939, 1812, 1777, 1756, 1710, 1587, 1565, 1559, 1546, 1543, 1512, 1494, 1472, 
1472, 1464, 1456, 1421, 1403, 1375, 1372, 1362, 1355, 1353, 1348, 1341, 1324, 1323, 1321, 1293, 1285, 1282, 1269, 1268, 1264, 1263, 1262, 1254, 1235, 1235, 1227, 1224, 1220, 1183, 1174, 1172, 1167, 1166, 1162, 1160, 1157, 1142, 1139, 1129, 1128, 1120, 1104, 1104, 1104, 1098, 1092, 1087, 1083, 1082, 1080, 1079, 1070, 1069, 1058, 1050, 1044, 1042, 1042, 1040, 1030, 1028, 1023, 1019, 1017, 1016, 1015, 1014, 1013, 1013, 1008, 1008, 1007, 1004, 1002, 1002, 1002, 1001, 999, 994, 987, 985, 982, 977]     
$ bench.py --file-count-ro 500 --rw-workers 10 --ro-workers 0 --file-size 40000 --no-warmup --no-reset
WARNING:root:Objects write 5.6K/s
WARNING:root:Bytes write 113.9MB/s

Creating a 20 billions global index fails because there is not enough disk space (2.9TB is full even with tunefs -m 0).

$ time python generate.py --count $((20 * 1024 * 1024 * 1024)) && time python generate.py --index
...
real    658m2.446s
user    392m2.581s
sys     268m17.226s



Traceback (most recent call last):
  File "generate.py", line 213, in <module>
    sys.exit(asyncio.run(Application().main(sys.argv[1:])))
  File "/usr/lib/python3.7/asyncio/runners.py", line 43, in run
    return loop.run_until_complete(main)
  File "/usr/lib/python3.7/asyncio/base_events.py", line 584, in run_until_complete                                                                         
    return future.result()
  File "generate.py", line 208, in main
    d.create_index()
  File "generate.py", line 76, in create_index
    c.execute("CREATE INDEX shards_index ON shards (signature)")
psycopg2.errors.DiskFull: could not extend file "base/16385/16427.358": No space left on device                                                             
HINT:  Check free disk space.


real    1081m25.672s
user    0m0.066s
sys     0m0.014s

When the Read Storage went over 20TB, the PGs of the Ceph pool were automatically increased (double). As a consequence backfilling started but it is throttled to not have a negative impact on performances.

18TB

$ ceph status                                                                                                  
  cluster:                                                                                                                   
    id:     80469a64-d059-11eb-b593-3cfdfe55e820                                                                             
    health: HEALTH_WARN                                                                                                      
            2 pool(s) do not have an application enabled                                                                     
                                                                                                                             
  services:                                                                                                                  
    mon: 5 daemons, quorum ceph1,ceph21,ceph24,ceph29,ceph11 (age 24h)                                                       
    mgr: ceph1.bvaofb(active, since 24h), standbys: ceph9.bnncbd                                                             
    osd: 28 osds: 28 up (since 24h), 28 in (since 24h)                                                                       
                                           
  data:                                                          
    pools:   3 pools, 65 pgs        
    objects: 2.94M objects, 11 TiB                               
    usage:   18 TiB used, 84 TiB / 102 TiB avail
    pgs:     65 active+clean
                                                        
  io:                  
    client:   8.3 MiB/s rd, 520 MiB/s wr, 76 op/s rd, 260 op/s wr

29TB

$ ceph status
  cluster:
    id:     80469a64-d059-11eb-b593-3cfdfe55e820
    health: HEALTH_WARN
            2 pool(s) do not have an application enabled

  services:
    mon: 5 daemons, quorum ceph1,ceph21,ceph24,ceph29,ceph11 (age 39h)                                                      
    mgr: ceph1.bvaofb(active, since 39h), standbys: ceph9.bnncbd                                                            
    osd: 28 osds: 28 up (since 9h), 28 in (since 39h); 61 remapped pgs                                                      

  data:
    pools:   3 pools, 161 pgs
    objects: 4.86M objects, 19 TiB
    usage:   29 TiB used, 73 TiB / 102 TiB avail
    pgs:     3592853/29185062 objects misplaced (12.311%)
             100 active+clean
             59  active+remapped+backfill_wait
             2   active+remapped+backfilling

  io:
    client:   105 MiB/s rd, 299 op/s rd, 0 op/s wr
    recovery: 35 MiB/s, 8 objects/s

https://git.easter-eggs.org/biceps/biceps/-/tree/v2021-06-18

$ bench.py --file-count-ro 350 --rw-workers 10 --ro-workers 5 --file-size $((100 * 1024)) --no-warmup
...
WARNING:root:Objects write 6.4K/s
WARNING:root:Bytes write 131.3MB/s
WARNING:root:Objects read 24.3K/s
WARNING:root:Bytes read 99.4MB/s
WARNING:root:2.0859388857985817% of random reads took longer than 100.0ms
WARNING:root:Worst times to first byte on random reads (ms) [10751, 8217, 7655, 7446, 7366, 6919, 6722, 6515, 6481, 6079, 5918, 5839, 5823, 5759, 5634, 5573, 5492, 5335, 5114, 5105, 5009, 4976, 4963, 4914, 4913, 4854, 4822, 4668, 4658, 4605, 4593, 4551, 4537, 4489, 4470, 4431, 4418, 4411, 4385, 4327, 4298, 4224, 4090, 4082, 4070, 4010, 3868, 3865, 3819, 3818, 3815, 3805, 3798, 3755, 3719, 3716, 3711, 3704, 3688, 3612, 3608, 3606, 3579, 3543, 3537, 3527, 3493, 3450, 3441, 3356, 3346, 3338, 3319, 3313, 3294, 3272, 3264, 3258, 3244, 3183, 3179, 3160, 3145, 3136, 3127, 3123, 3119, 3107, 3098, 3093, 3090, 3083, 3082, 3068, 3057, 3052, 3029, 3028, 3022, 3022]

Although the number of PGs increased during the benchmark run, causing recovery in the background, I don't think it had a very significant impact on performances. The 2% of slow random reads probably happen when the Read Storage are being created because the writers are not throttled and apply a high pressure on the cluster. If throttling was implemented, it is expected to be below 1% with maximum values that are not higher than 1 second.

The benchmarks will be modified to output raw data about its workload every 5 seconds for the duration of the benchmark to verify the latency is correlated with the writes.

https://git.easter-eggs.org/biceps/biceps/-/tree/f8fff3e951b9dfde33584f22a9f689cbf30579c2

  • The statistics are no longer displayed as the benchmark runs, they are stored in CSV files: one line is added every 5 seconds
  • IO stats are collected from the Ceph cluster every five seconds and included in the CSV files
  • The stats.py file is implemented to analyze the content of the CSV files and display statistics on the benchmark run

https://git.easter-eggs.org/biceps/biceps/-/tree/939dae6b7530dcd9152b60b0a2dc8e0cbb5c821f

New stats look like this, with a Ceph cluster of 15 OSDs:

$ ./run-bench.sh --file-count-ro 100 --rw-workers 10 --ro-workers 5 --file-size $((10 * 1024)) --no-warmup
Bytes write   188.6 MB/s
Objects write 9.2 Kobject/s
Bytes read    69.5 MB/s
Objects read  17.0 Kobject/s
24765 random reads take longer than 100ms (5.292028514770241%)

It confirms that reads compete with writes. At the beginning there are times when no writes are performed because the objects are written to the Write Storage. All workers start writing Shards on the Read Storage at the same time (because they all start at the same time). The reads then slow down significantly and small random reads take longer than 100ms.

For the record this blog post published April 2021 has pointers on how to benchmark and tune Ceph.

https://ceph.io/community/qos-study-with-mclock-and-wpq-schedulers/

https://git.easter-eggs.org/biceps/biceps/-/tree/51bb6d17ee2d568c6b3af34accd36e2a1df77347

  • Implement throttling for writes

The write pattern changes: instead of a continuous pressure, it spikes because the writers sleep. This does not help with the read latency: they still happen when the writes spike. The benchmark has no control over this, it happens when the RBD kernel driver think it appropriate. It could be controlled by using librbd instead of mounting the RBD device or modify the parameters of the kernel module.

https://git.easter-eggs.org/biceps/biceps/-/tree/8701731b4a98042fb8dee09a9c239928db0b1dc1

  • Add RBD QoS dynamically to avoid bursts
$  /run-bench.sh --reader-io 500 --writer-io 80 --file-count-ro 100 --rw-workers 10 --ro-workers 5 --file-size $((10 * 1024)) --no-warmup
Bytes write   99.5 MB/s
Objects write 4.9 Kobject/s
Bytes read    126.5 MB/s
Objects read  31.0 Kobject/s
18147 random reads take longer than 100ms (1.2006558053861789%)

https://git.easter-eggs.org/biceps/biceps/-/tree/v2021-06-25

With a warmup phase and 100GB Shards. The number of PGs was incorrectly set to the ro pool instead of the ro-data pool: background recovery happened during the last third of the run approximately.

$  /run-bench.sh --reader-io 150 --writer-io 85 --file-count-ro 350 --rw-workers 10 --ro-workers 5 --file-size $((100 * 1024)) 
Bytes write   108.7 MB/s
Objects write 5.3 Kobject/s
Bytes read    103.4 MB/s
Objects read  25.3 Kobject/s
494961 random reads take longer than 100ms (1.7177595208163046%)

After discussing with @olasd and @douardda today, it was decided that although the benchmarks are not 100% as expected, they are good enough. The next steps will be to:

After some cleanup, the final version is https://git.easter-eggs.org/biceps/biceps/-/tree/7d137fcd54f265253a27346b3652e26c6c5dd5e8. It concludes this (long) task and it can be closed.