Benchmark software for the object storage.
The version that will be used to run the benchmarks can be found at https://git.easter-eggs.org/biceps/biceps/-/tree/7d137fcd54f265253a27346b3652e26c6c5dd5e8
Status | Assigned | Task
---|---|---
Migrated | gitlab-migration | T3116 Roll out at least one operational mirror
Migrated | gitlab-migration | T3054 Scale out object storage design
Migrated | gitlab-migration | T3149 Benchmark software for the object storage
First draft for layer 0.
The benchmarks are not fully functional yet, but they produce a write load that matches the object storage design. They run via libvirt (see README.txt) and are being tested on Grid5000 to make sure all the pieces are in place (i.e. that reserving machines, provisioning them and running the benchmark actually works) before moving forward.
The benchmark runs and it is not too complicated, which is a relief. I'll clean up the mess I made and move on to finishing the software.
direnv: loading bench/.envrc
========================================================= test session starts =========================================================
platform linux -- Python 3.7.3, pytest-6.2.2, py-1.10.0, pluggy-0.13.1
rootdir: /root
plugins: mock-3.5.1, asyncio-0.14.0
collected 10 items

bench/test_bench.py ..........                                                                                                   [100%]

========================================================= 10 passed in 55.48s =========================================================
Connection to dahu-25.grenoble.grid5000.fr closed.
Even though the benchmark is nowhere near meaningful yet, I could not resist and tried it anyway. With 2 writers it gives:
WARNING:root:Objects write/seconds 1008/s
WARNING:root:Bytes write/seconds 20MB/s
and with 5 writers it gives:
WARNING:root:Objects write/seconds 2K/s
WARNING:root:Bytes write/seconds 41MB/s
Meanwhile the PostgreSQL host has way too many processors for the load :-)
Refactored the cluster provisioning to use all available disks instead of the existing file system (using cephadm instead of a hand-made Ceph cluster).
The benchmark was moved to a temporary repository for convenience (easier than uploading here every time). https://git.easter-eggs.org/biceps/biceps
Today I figured out that the bottleneck of the benchmark was actually the CPU usage of the benchmark itself, caused by an excessive number of transactions. A single worker achieves ~500 object inserts per second, but with more than 5 workers it tops out at ~2.5K object inserts per second because of the CPU. Hacking it a little showed it can reach 7K object writes per second. I rewrote the benchmark to fix this properly; this is commit https://git.easter-eggs.org/biceps/biceps/-/commit/c0e79a2b6751cacb19ad4fad804a3b942047eb7f.
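The details are in the commit above; as a rough illustration of the idea only, the fix amounts to committing once per batch of objects instead of once per object. The table and column names below are hypothetical, not the actual benchmark schema:

    from psycopg2.extras import execute_values

    def write_batch(conn, objects):
        # conn is a psycopg2 connection; one INSERT and one commit for the
        # whole batch keeps the client-side CPU usage low compared to one
        # transaction per object
        with conn.cursor() as cur:
            execute_values(
                cur,
                "INSERT INTO objects (sha256, content) VALUES %s",
                [(obj.sha256, obj.content) for obj in objects],
            )
        conn.commit()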
I ran out of time on grid5000 to verify the rewrite works as expected but I'm confident it will. The next steps will be to:
https://git.easter-eggs.org/biceps/biceps/-/commit/4552098bc6f364ab0e59df996551f23b2ec35049
I chased various Ceph installation issues when using 14 machines and got to a point where it is reliable by:
https://git.easter-eggs.org/biceps/biceps/-/commit/ffaf1cad18748377ec8e90b12beed83a862afd4f
Complete rewrite to:
Running the tests for ~24h showed:
There is a 3% space overhead on the RBD data pool. 6TB data, 3TB parity = 9TB. Actual 9.3TB, i.e. ~+3%.
root@ceph1:~# ceph df
--- RAW STORAGE ---
CLASS  SIZE    AVAIL   USED     RAW USED  %RAW USED
hdd    25 TiB  16 TiB  9.3 TiB  9.4 TiB   36.72
TOTAL  25 TiB  16 TiB  9.3 TiB  9.4 TiB   36.72

--- POOLS ---
POOL                   ID  PGS  STORED   OBJECTS  USED     %USED  MAX AVAIL
device_health_metrics   1    1  214 KiB       12  642 KiB      0    4.7 TiB
ro-data                18   32  6.0 TiB    1.56M  9.3 TiB  40.04    9.3 TiB
ro                     19   32   11 MiB   12.16k  1.1 GiB      0    4.7 TiB
rbd bench on the images created
for i in $(rbd --pool ro ls | head -6) ; do rbd --pool ro bench --io-type readwrite --io-threads 16 --io-total 1G $i > $i.out & done
rm *.out ; for i in $(rbd --pool ro ls | head -12) ; do rbd --pool ro --io-size 4K bench --io-pattern rand --io-type read --io-threads 16 --io-total 10M $i > $i.out & done
https://git.easter-eggs.org/biceps/biceps/-/commit/9ee7fa5db5766bc8b765e0967d40ed86185fe1e9
Completed the tests for the rewrite, it is working.
https://git.easter-eggs.org/biceps/biceps/-/commit/5463a5c133f95de2406902b173b1b1e0eeb78630
debug and use nvme on yeti
https://git.easter-eggs.org/biceps/biceps/-/commit/d10cf0ec8c6e18ebc79122d0bb3587edbcb4594e
Struggled most of today because there is a bottleneck when using threads and PostgreSQL from a single client. However, when running 4 processes, it performs as expected. The benchmark should be rewritten to use a process pool instead of the thread pool, which should not be too complicated (see the sketch below). I tried to add a warmup phase so that all concurrent threads/processes do not start at the same time, but it does not really make any visible difference.
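The switch itself is mostly a matter of swapping executors. A minimal sketch of the direction, where worker() and its arguments are hypothetical and each process is expected to open its own PostgreSQL connection:

    from concurrent.futures import ProcessPoolExecutor

    def worker(worker_id, args):
        # runs in its own process, so it gets its own interpreter (no GIL
        # contention), its own CPU time and its own PostgreSQL connection
        ...

    def run(worker_count, args):
        with ProcessPoolExecutor(max_workers=worker_count) as executor:
            futures = [executor.submit(worker, i, args) for i in range(worker_count)]
            return [future.result() for future in futures]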
$ ansible-playbook -i inventory tests-run.yml && ssh -t $runner direnv exec bench python bench/bench.py --file-count-ro 200 --rw-workers 10 --ro-workers 10 --file-size 1024 --no-warmup
WARNING:root:Objects write 1.4K/s
WARNING:root:Bytes write 27.9MB/s
WARNING:root:Objects read 816/s
WARNING:root:Bytes read 61.6MB/s
Since there are 4 processes delivering the same performance (within roughly 10%), this adds up to:
https://git.easter-eggs.org/biceps/biceps/-/commit/d4702adfedfa5486674062aa7364766ca4d22ec1
The rewrite to use processes was trivial and preliminary tests yielded the expected results. Most of the time was spent on two problems:
The Grid5000 cluster was reserved for this weekend (from Friday night to Sunday night) to run tests and collect results, hopefully final ones for the layer 0 benchmarks, using 100GB images and all available disk space. I will babysit the run to catch any unexpected behavior.
https://git.easter-eggs.org/biceps/biceps/-/commit/c5bb5a56291a00dd06529d4425cdfa05a36c2891
Fix a race condition that caused PostgreSQL database drops to fail.
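The actual fix is in the commit above. For context, a common way to avoid this kind of race (a sketch, not the benchmark's code) is to terminate any session still connected to the database before dropping it:

    def drop_database(admin_conn, name):
        # admin_conn is a psycopg2 connection to the "postgres" database;
        # DROP DATABASE cannot run inside a transaction block
        admin_conn.autocommit = True
        with admin_conn.cursor() as cur:
            # kick out lingering sessions, otherwise the DROP fails with
            # "database is being accessed by other users"
            cur.execute(
                "SELECT pg_terminate_backend(pid) FROM pg_stat_activity "
                "WHERE datname = %s AND pid <> pg_backend_pid()",
                (name,),
            )
            cur.execute('DROP DATABASE IF EXISTS "{}"'.format(name))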
$ bench.py --file-count-ro 200 --rw-workers 20 --ro-workers 80 --file-size 50000 --rand-ratio 10
...
WARNING:root:Objects write 5.8K/s
WARNING:root:Bytes write 118.4MB/s
WARNING:root:Objects read 12.3K/s
WARNING:root:Bytes read 850.3MB/s
I think the results are too optimistic for reads because reading starts during the warmup phase (i.e. when a single Shard exists in the Read Storage) and that Shard fits in RAM.
$ bench.py --file-count-ro 200 --rw-workers 20 --ro-workers 80 --file-size 50000 --no-warmup
...
WARNING:root:Objects write 5.8K/s
WARNING:root:Bytes write 117.9MB/s
WARNING:root:Objects read 1.3K/s
WARNING:root:Bytes read 100.4MB/s
Now the reads do not deliver the expected number of objects/s, which I did not foresee.
https://git.easter-eggs.org/biceps/biceps/-/commit/83434e93db6a12988d1f12764cecc5447a9af549
$ ansible-playbook -i inventory tests-run.yml && ssh -t $runner direnv exec bench python bench/bench.py --file-count-ro 500 --rw-workers 40 --ro-workers 40 --file-size 50000 --no-warmup
Crashed because the PostgreSQL WAL grew to over 700GB and filled the disk. It happened after the first 40 workers completed and the next 40 were started.
$ ansible-playbook -i inventory tests-run.yml && ssh -t $runner direnv exec bench python bench/bench.py --file-count-ro 1000 --rw-workers 40 --ro-workers 40 --file-size 25000 --no-warmup
It should consume a maximum of 1TB of database space at any point in time (40 * 25GB), which will allow monitoring how the size of the WAL grows, especially after databases are deleted.
After writing 1TB in 40 DBs (40 * 25GB), the WAL is ~200GB, i.e. ~20%:
Sat 08 May 2021 10:18:20 AM CEST 1.1T /var/lib/postgresql/11/main/base 198G /var/lib/postgresql/11/main/pg_wal
and the standby server is ~200GB behind while its own WAL is small, which suggests it is not the part that is lagging:
Sat 08 May 2021 10:18:29 AM CEST 900G /var/lib/postgresql/11/main/base 1.5G /var/lib/postgresql/11/main/pg_wal
While the content of the Write Storage is moved over to the Read Storage, the standby catches up but not very fast:
Sat 08 May 2021 10:28:21 AM CEST 1.1T /var/lib/postgresql/11/main/base 185G /var/lib/postgresql/11/main/pg_wal
and the standby:
Sat 08 May 2021 10:28:00 AM CEST 919G /var/lib/postgresql/11/main/base 1.6G /var/lib/postgresql/11/main/pg_wal
Looking at the walreceiver process on the standby, it writes to disk at a rate of up to 500MB/s while the databases are being written. But when the writes are complete and only reads remain (to move data from the Write Storage to the Read Storage), it goes down to 100MB/s, although the standby is still ~200GB behind the master and there is ~200GB worth of WAL available on the master. Why does the receiver slow down?
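To put numbers on the lag, the master can be asked how far behind the standby is. This query is not part of the benchmark, just a way to observe the behaviour on PostgreSQL 11; the connection parameters are placeholders:

    import psycopg2

    # on the master: bytes of WAL the standby has not replayed yet
    conn = psycopg2.connect(host="master", dbname="postgres")
    with conn.cursor() as cur:
        cur.execute(
            "SELECT client_addr, state,"
            " pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes"
            " FROM pg_stat_replication"
        )
        for client_addr, state, replay_lag_bytes in cur.fetchall():
            print(client_addr, state, replay_lag_bytes)
    conn.close()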
After the first round of writers is done and 1TB worth of databases is deleted from the master, the standby does not free the disk space. On the master:
Sat 08 May 2021 10:57:25 AM CEST 107G /var/lib/postgresql/11/main/base 163G /var/lib/postgresql/11/main/pg_wal
and on the standby:
Sat 08 May 2021 10:57:32 AM CEST 1022G /var/lib/postgresql/11/main/base 1.4G /var/lib/postgresql/11/main/pg_wal
$ ansible-playbook -i inventory tests-run.yml && ssh -t $runner direnv exec bench python bench/bench.py --file-count-ro 500 --rw-workers 40 --ro-workers 40 --file-size 50000 --no-warmup
...
WARNING:root:Objects write 6.8K/s
WARNING:root:Bytes write 137.7MB/s
WARNING:root:Objects read 1.5K/s
WARNING:root:Bytes read 109.9MB/s
There is contention on reads: the individual OSDs are not overloaded (vmstat shows almost no iowait), the network bandwidth is at 50% capacity, the CPU is mostly idle, and the RAM is less than 50% used. An analysis of a running cluster is needed to figure out why the reads underperform. The first task is to reproduce the problem with a shorter run of the benchmark.
https://git.easter-eggs.org/biceps/biceps/-/commit/54e9c9e5b5ec00e22d2ae28f8d0ae91f658eac0d
Tune PostgreSQL and verify it improves the situation as follows:
$ ansible-playbook -i inventory tests-run.yml && ssh -t $runner direnv exec bench python bench/bench.py --file-count-ro 1000 --rw-workers 40 --ro-workers 40 --file-size 25000 --no-warmup
https://git.easter-eggs.org/biceps/biceps/-/commit/45aedaa6887659dc4454e2109a6b7361abab5db9
With hot_standby = off the WAL is quickly flushed to the standby server when the writes finish. As soon as the writes finish, the benchmark starts to read all databases as fast as it can, which significantly slows down the replication because it needs to ensure strong consistency between the master and the standby.
Here is the disk usage on the standby towards the end of the write phase. The WAL has grown to 15GB
and is ~50GB on the master. It only takes a few minutes for the standby to catch up.
Mon 10 May 2021 02:46:19 PM CEST  382G /var/lib/postgresql/11/main/base  15G /var/lib/postgresql/11/main/pg_wal
Mon 10 May 2021 02:46:49 PM CEST  388G /var/lib/postgresql/11/main/base  11G /var/lib/postgresql/11/main/pg_wal
Mon 10 May 2021 02:47:19 PM CEST  396G /var/lib/postgresql/11/main/base  7.7G /var/lib/postgresql/11/main/pg_wal
Mon 10 May 2021 02:47:49 PM CEST  402G /var/lib/postgresql/11/main/base  5.7G /var/lib/postgresql/11/main/pg_wal
Mon 10 May 2021 02:48:19 PM CEST  409G /var/lib/postgresql/11/main/base  4.3G /var/lib/postgresql/11/main/pg_wal
Mon 10 May 2021 02:48:49 PM CEST  416G /var/lib/postgresql/11/main/base  2.4G /var/lib/postgresql/11/main/pg_wal
Mon 10 May 2021 02:49:19 PM CEST  421G /var/lib/postgresql/11/main/base  2.4G /var/lib/postgresql/11/main/pg_wal
Mon 10 May 2021 02:49:49 PM CEST  426G /var/lib/postgresql/11/main/base  2.4G /var/lib/postgresql/11/main/pg_wal
Mon 10 May 2021 02:50:19 PM CEST  431G /var/lib/postgresql/11/main/base  1.1G /var/lib/postgresql/11/main/pg_wal
Mon 10 May 2021 02:50:49 PM CEST  431G /var/lib/postgresql/11/main/base  1.1G /var/lib/postgresql/11/main/pg_wal
Mon 10 May 2021 02:51:19 PM CEST  431G /var/lib/postgresql/11/main/base  1.1G /var/lib/postgresql/11/main/pg_wal
Mon 10 May 2021 02:51:49 PM CEST  431G /var/lib/postgresql/11/main/base  1.1G /var/lib/postgresql/11/main/pg_wal
Mon 10 May 2021 02:52:19 PM CEST  431G /var/lib/postgresql/11/main/base  1.1G /var/lib/postgresql/11/main/pg_wal
Mon 10 May 2021 02:52:49 PM CEST  431G /var/lib/postgresql/11/main/base  1.1G /var/lib/postgresql/11/main/pg_wal
Mon 10 May 2021 02:53:20 PM CEST  431G /var/lib/postgresql/11/main/base  1.1G /var/lib/postgresql/11/main/pg_wal
Mon 10 May 2021 02:53:50 PM CEST  431G /var/lib/postgresql/11/main/base  1.1G /var/lib/postgresql/11/main/pg_wal
And the disk usage on the master:
Mon 10 May 2021 02:52:17 PM CEST  431G /var/lib/postgresql/11/main/base  48G /var/lib/postgresql/11/main/pg_wal
Mon 10 May 2021 02:52:48 PM CEST  431G /var/lib/postgresql/11/main/base  48G /var/lib/postgresql/11/main/pg_wal
Mon 10 May 2021 02:53:18 PM CEST  431G /var/lib/postgresql/11/main/base  48G /var/lib/postgresql/11/main/pg_wal
Mon 10 May 2021 02:53:48 PM CEST  431G /var/lib/postgresql/11/main/base  48G /var/lib/postgresql/11/main/pg_wal
Mon 10 May 2021 02:54:18 PM CEST  431G /var/lib/postgresql/11/main/base  1.1G /var/lib/postgresql/11/main/pg_wal
Mon 10 May 2021 02:54:48 PM CEST  431G /var/lib/postgresql/11/main/base  1.1G /var/lib/postgresql/11/main/pg_wal
Mon 10 May 2021 02:55:18 PM CEST  431G /var/lib/postgresql/11/main/base  1.1G /var/lib/postgresql/11/main/pg_wal
https://git.easter-eggs.org/biceps/biceps/-/commit/b23b78446aa485586d29ba599e55a341fc2db29c
https://git.easter-eggs.org/biceps/biceps/-/commit/ca01c6d758b26cfe27e6a61993f6c0d94986b460
Now reads perform a lot better, both because the miscalculation is fixed and because the RBD is mounted read-only. The read load must be throttled, otherwise it puts too much pressure on the cluster, which then underperforms on writes.
$ bench.py --file-count-ro 500 --rw-workers 40 --ro-workers 40 --file-size 40000 --no-warmup
...
WARNING:root:Objects write 3.9K/s
WARNING:root:Bytes write 78.7MB/s
WARNING:root:Objects read 46.5K/s
WARNING:root:Bytes read 190.1MB/s
Reducing the number of read workers to 20 allows writes to perform as expected. The test results are collected in the README file for archive.
https://git.easter-eggs.org/biceps/biceps/-/commit/14a6cf08549712a4341db548c4aa984e1494d206
$ bench.py --file-count-ro 200 --rw-workers 40 --ro-workers 20 --file-size 1024 --no-warmup
WARNING:root:Objects write 6.0K/s
WARNING:root:Bytes write 122.0MB/s
WARNING:root:Objects read 66.8K/s
WARNING:root:Bytes read 272.7MB/s
$ bench.py --file-count-ro 200 --rw-workers 40 --ro-workers 20 --file-size 10240 --no-warmup
WARNING:root:Objects write 5.9K/s
WARNING:root:Bytes write 121.3MB/s
WARNING:root:Objects read 42.7K/s
WARNING:root:Bytes read 174.2MB/s
$ bench.py --file-count-ro 200 --rw-workers 40 --ro-workers 20 --file-size 40000 --no-warmup
WARNING:root:Objects write 5.8K/s
WARNING:root:Bytes write 118.3MB/s
WARNING:root:Objects read 34.5K/s
WARNING:root:Bytes read 140.8MB/s
https://git.easter-eggs.org/biceps/biceps/-/commit/6e911e7f2f94c3cac2bca7c38edf7c4d1d138b81
https://git.easter-eggs.org/biceps/biceps/-/commit/78259cc59251d817b730f75f1444d5a8a3d7ce1e
It proceeds in two phases: a) use COPY to insert entries at 500K/s, then b) build the index. Since building the index takes roughly the same time as inserting the rows, the effective ingestion speed is 250K/s. Each entry uses ~100 bytes of raw space when indexed. That's ~1 billion entries per hour. It should take ~24h to create a 2.5TB index (roughly the maximum space currently available on a PostgreSQL server) with 25 billion entries.
When COPY inserts records while the index exists, the rate drops to 32K/s, i.e. it is an order of magnitude faster to create the index afterwards.
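A sketch of the two phases. The CREATE INDEX statement is the one generate.py uses (it also appears in the traceback further down); the shard column and the input format are assumptions made for the example:

    import io

    def populate(conn, entries):
        # phase a) bulk load with COPY while no index exists (~500K entries/s);
        # conn is a psycopg2 connection, entries yields (signature, shard) pairs
        buf = io.StringIO("".join(f"{signature}\t{shard}\n" for signature, shard in entries))
        with conn.cursor() as cur:
            cur.copy_expert("COPY shards (signature, shard) FROM STDIN", buf)
        conn.commit()

    def build_index(conn):
        # phase b) build the index once, after all the rows are loaded
        with conn.cursor() as cur:
            cur.execute("CREATE INDEX shards_index ON shards (signature)")
        conn.commit()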
https://git.easter-eggs.org/biceps/biceps/-/tree/2ca2ed877ba06c43956ef1221ebd0724337b4c03
20 billion entries were inserted in the global index. After building the index it occupies 2.5TB, therefore each entry uses ~125 bytes of raw space. That's 25% more than with a 1 billion entry global index (i.e. ~100 bytes).
It turns out building the index for 20 billion entries takes more time than expected. Creating the table took 10 hours. I stopped the CREATE INDEX after 18 hours. A 10 billion entry index should be created first to evaluate how long it really takes.
https://git.easter-eggs.org/biceps/biceps/-/tags/v2021-06-05
This weekend's run was not very fruitful: since the global index could not be populated as expected, and this was only discovered Sunday morning, there was no time to fall back to a smaller one, for instance 10 billion entries. A run was launched anyway and lasted ~24h to show:
The performance is as expected:
$ bench.py --file-count-ro 400 --rw-workers 40 --ro-workers 20 --file-size $((40 * 1024)) --no-warmup
...
WARNING:root:Objects write 5.4K/s
WARNING:root:Bytes write 110.6MB/s
WARNING:root:Objects read 40.7K/s
WARNING:root:Bytes read 166.2MB/s
WARNING:root:Time to first byte 62ms
However, the time to first byte that is reported is the average; it should be the maximum time to first byte, and this needs fixing.
The Grid5000 cluster was booked for next weekend with:
> and this needs fixing.
do you mean the bench code needs fixing (to report the proper stats)?
While you're at it, could you report quantiles for the time to first byte, instead of just a raw maximum?
Something like:
(this all might be overkill, but...)
https://git.easter-eggs.org/biceps/biceps/-/commit/17c591e1b31e1146ee397f93a244e0b6f5e60788
Preliminary testing shows that roughly 1 in a million reads takes longer than 100ms. I don't think it can be avoided: even in a controlled environment there will be bumps that create longer response times. Since we're probably not interested in how the time to first byte varies when it's under 100ms, I'm not very motivated to create and display this information. It would be interesting to figure out why a read takes longer than 100ms, but this is a complex system and finding the root cause of these exceptional events is difficult.
I agree that some latency spikes are probably unavoidable, or at least that it would take considerable profiling and observation to get a handle on them (which is outside the scope of this set of benchmarks). I also agree that the current results for the benchmark look on target.
I still think that returning a histogram of response times, in buckets of 5 or 10 ms wide ranges, may be valuable? We can then derive percentiles from that if we're so inclined.
> I still think that returning a histogram of response times, in buckets of 5 or 10 ms wide ranges, may be valuable? We can then derive percentiles from that if we're so inclined.
How about just collecting all raw timings in an output CSV file (or several files if needed) and computing the stats downstream (e.g. with pandas)?
That would allow changing the percentiles later on, as well as computing different stats, without having to rerun the benchmarks.
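For instance, something along these lines downstream, assuming the benchmark dumps one row per read with a ttfb_ms column (the file and column names are hypothetical):

    import pandas as pd

    timings = pd.read_csv("ttfb.csv")

    # percentiles, including the tail we care about
    print(timings["ttfb_ms"].describe(percentiles=[0.5, 0.9, 0.99, 0.999]))

    # histogram in 10ms buckets, from which percentiles can also be derived later;
    # reads above 200ms fall outside the buckets in this sketch
    buckets = pd.cut(timings["ttfb_ms"], bins=range(0, 210, 10))
    print(buckets.value_counts(sort=False))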
For the record, creating 10 billion entries in the global index took:
real    872m59.325s
user    192m42.553s
sys     134m48.580s
I interrupted the benchmarks because they show that reads do not behave as expected: a large number of reads take very long and the number of reads per second is far higher than what is needed. There is no throttling on reads; the number of workers is the only limit. I was expecting they would be slowed down by other factors and would not apply too much pressure on the cluster, but I was apparently wrong, and throttling must be implemented to slow them down.
WARNING:root:Objects write 6.5K/s
WARNING:root:Bytes write 131.8MB/s
WARNING:root:Objects read 55.3K/s
WARNING:root:Bytes read 226.1MB/s
WARNING:root:17.84542486337239% of random reads took longer than 100.0ms
The writes are, however, performing in a way that is consistent with previous results, despite the fact that they also write to the global index populated with 10 billion entries.
Running the benchmark with a read-only workload (the Ceph cluster is doing nothing else) and 20 workers shows 8% of requests with a latency above the threshold:
$ bench.py --file-count-ro 0 --file-size 40000 --rw-workers 0 --ro-workers 20
WARNING:root:Objects read 64.2K/s
WARNING:root:Bytes read 261.6MB/s
WARNING:root:8.382380110322257% of random reads took longer than 100.0ms
WARNING:root:Worst times to first byte on random reads (ms) [699, 687, 685, 682, 666, 654, 649, 641, 637, 633, 632, 624, 622, 622, 601, 574, 560, 558, 557, 555, 553, 553, 550, 549, 548, 547, 546, 544, 542, 535, 534, 531, 520, 507, 500, 495, 493, 492, 485, 483, 476, 475, 473, 471, 469, 468, 468, 467, 467, 464, 464, 462, 460, 460, 460, 460, 455, 453, 449, 447, 447, 447, 441, 441, 439, 438, 436, 436, 435, 435, 435, 431, 429, 426, 419, 418, 417, 415, 414, 413, 412, 411, 409, 408, 407, 406, 405, 404, 403, 403, 403, 402, 400, 399, 396, 394, 394, 394, 394, 394]
Running 5 read workers in the same conditions shows latencies above the threshold for about 0.01% of reads:
$ bench.py --file-count-ro 0 --file-size 40000 --rw-workers 0 --ro-workers 5
WARNING:root:Objects read 36.8K/s
WARNING:root:Bytes read 151.1MB/s
WARNING:root:0.01038885483653137% of random reads took longer than 100.0ms
WARNING:root:Worst times to first byte on random reads (ms) [133, 129, 121, 119, 109, 107, 107, 105, 101, 100]
When the benchmark writes, the pressure of 40 workers slows down the reads significantly.
WARNING:root:Objects read 9.0K/s
WARNING:root:Bytes read 36.6MB/s
WARNING:root:19.843788137661704% of random reads took longer than 100.0ms
WARNING:root:Worst times to first byte on random reads (ms) [4432, 3346, 3028, 2561, 2297, 2101, 1832, 1788, 1759, 1696, 1676, 1574, 1541, 1524, 1502, 1407, 1379, 1335, 1253, 1244, 1234, 1233, 1200, 1197, 1167, 1128, 1091, 1089, 1046, 1017, 1016, 1007, 980, 961, 958, 923, 905, 904, 882, 868, 864, 855, 849, 811, 797, 796, 794, 792, 789, 780, 775, 767, 757, 731, 726, 718, 712, 711, 708, 708, 693, 680, 665, 661, 659, 641, 626, 623, 609, 607, 594, 593, 588, 583, 581, 577, 572, 560, 558, 554, 537, 530, 522, 522, 519, 514, 508, 495, 494, 493, 492, 489, 488, 483, 481, 481, 479, 476, 475, 474]
It is quite clear that they are competing. The next step is to try to reduce the number of write workers.
The equilibrium between reads and writes is with 5 readers and 10 writers, which leads to 1.2% of random reads above the threshold, the worst one being 2 seconds. What this means is that care must be taken, application side, to throttle reads and writes, otherwise the penalty is a significant degradation in latency (a sketch of such a throttle follows the measurements below).
$ bench.py --file-count-ro 0 --file-size 40000 --rw-workers 0 --ro-workers 5
WARNING:root:Objects read 29.4K/s
WARNING:root:Bytes read 119.9MB/s
WARNING:root:1.2761297725814886% of random reads took longer than 100.0ms
WARNING:root:Worst times to first byte on random reads (ms) [2123, 2011, 1952, 1939, 1812, 1777, 1756, 1710, 1587, 1565, 1559, 1546, 1543, 1512, 1494, 1472, 1472, 1464, 1456, 1421, 1403, 1375, 1372, 1362, 1355, 1353, 1348, 1341, 1324, 1323, 1321, 1293, 1285, 1282, 1269, 1268, 1264, 1263, 1262, 1254, 1235, 1235, 1227, 1224, 1220, 1183, 1174, 1172, 1167, 1166, 1162, 1160, 1157, 1142, 1139, 1129, 1128, 1120, 1104, 1104, 1104, 1098, 1092, 1087, 1083, 1082, 1080, 1079, 1070, 1069, 1058, 1050, 1044, 1042, 1042, 1040, 1030, 1028, 1023, 1019, 1017, 1016, 1015, 1014, 1013, 1013, 1008, 1008, 1007, 1004, 1002, 1002, 1002, 1001, 999, 994, 987, 985, 982, 977]
$ bench.py --file-count-ro 500 --rw-workers 10 --ro-workers 0 --file-size 40000 --no-warmup --no-reset
WARNING:root:Objects write 5.6K/s
WARNING:root:Bytes write 113.9MB/s
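The sketch mentioned above: a very simple per-worker rate limit is one way to implement application-side throttling. The helper and the rate are hypothetical, not what the benchmark does:

    import time

    class Throttle:
        """Limit a worker to a maximum number of operations per second."""

        def __init__(self, ops_per_second):
            self.min_interval = 1.0 / ops_per_second
            self.last = 0.0

        def wait(self):
            # sleep just enough to keep the average rate under the limit
            delay = self.min_interval - (time.monotonic() - self.last)
            if delay > 0:
                time.sleep(delay)
            self.last = time.monotonic()

    # e.g. in a read worker, to cap random reads at 500 objects/s:
    # throttle = Throttle(500)
    # while running:
    #     throttle.wait()
    #     read_one_random_object()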
Creating a 20 billion entry global index fails because there is not enough disk space (the 2.9TB disk is full even with tune2fs -m 0).
$ time python generate.py --count $((20 * 1024 * 1024 * 1024)) && time python generate.py --index
...
real    658m2.446s
user    392m2.581s
sys     268m17.226s
Traceback (most recent call last):
  File "generate.py", line 213, in <module>
    sys.exit(asyncio.run(Application().main(sys.argv[1:])))
  File "/usr/lib/python3.7/asyncio/runners.py", line 43, in run
    return loop.run_until_complete(main)
  File "/usr/lib/python3.7/asyncio/base_events.py", line 584, in run_until_complete
    return future.result()
  File "generate.py", line 208, in main
    d.create_index()
  File "generate.py", line 76, in create_index
    c.execute("CREATE INDEX shards_index ON shards (signature)")
psycopg2.errors.DiskFull: could not extend file "base/16385/16427.358": No space left on device
HINT:  Check free disk space.

real    1081m25.672s
user    0m0.066s
sys     0m0.014s
When the Read Storage went over 20TB, the number of PGs of the Ceph pool was automatically increased (doubled). As a consequence, backfilling started, but it is throttled so as not to have a negative impact on performance.
$ ceph status
  cluster:
    id:     80469a64-d059-11eb-b593-3cfdfe55e820
    health: HEALTH_WARN
            2 pool(s) do not have an application enabled

  services:
    mon: 5 daemons, quorum ceph1,ceph21,ceph24,ceph29,ceph11 (age 24h)
    mgr: ceph1.bvaofb(active, since 24h), standbys: ceph9.bnncbd
    osd: 28 osds: 28 up (since 24h), 28 in (since 24h)

  data:
    pools:   3 pools, 65 pgs
    objects: 2.94M objects, 11 TiB
    usage:   18 TiB used, 84 TiB / 102 TiB avail
    pgs:     65 active+clean

  io:
    client: 8.3 MiB/s rd, 520 MiB/s wr, 76 op/s rd, 260 op/s wr
$ ceph status
  cluster:
    id:     80469a64-d059-11eb-b593-3cfdfe55e820
    health: HEALTH_WARN
            2 pool(s) do not have an application enabled

  services:
    mon: 5 daemons, quorum ceph1,ceph21,ceph24,ceph29,ceph11 (age 39h)
    mgr: ceph1.bvaofb(active, since 39h), standbys: ceph9.bnncbd
    osd: 28 osds: 28 up (since 9h), 28 in (since 39h); 61 remapped pgs

  data:
    pools:   3 pools, 161 pgs
    objects: 4.86M objects, 19 TiB
    usage:   29 TiB used, 73 TiB / 102 TiB avail
    pgs:     3592853/29185062 objects misplaced (12.311%)
             100 active+clean
             59  active+remapped+backfill_wait
             2   active+remapped+backfilling

  io:
    client:   105 MiB/s rd, 299 op/s rd, 0 op/s wr
    recovery: 35 MiB/s, 8 objects/s
https://git.easter-eggs.org/biceps/biceps/-/tree/v2021-06-18
$ bench.py --file-count-ro 350 --rw-workers 10 --ro-workers 5 --file-size $((100 * 1024)) --no-warmup
...
WARNING:root:Objects write 6.4K/s
WARNING:root:Bytes write 131.3MB/s
WARNING:root:Objects read 24.3K/s
WARNING:root:Bytes read 99.4MB/s
WARNING:root:2.0859388857985817% of random reads took longer than 100.0ms
WARNING:root:Worst times to first byte on random reads (ms) [10751, 8217, 7655, 7446, 7366, 6919, 6722, 6515, 6481, 6079, 5918, 5839, 5823, 5759, 5634, 5573, 5492, 5335, 5114, 5105, 5009, 4976, 4963, 4914, 4913, 4854, 4822, 4668, 4658, 4605, 4593, 4551, 4537, 4489, 4470, 4431, 4418, 4411, 4385, 4327, 4298, 4224, 4090, 4082, 4070, 4010, 3868, 3865, 3819, 3818, 3815, 3805, 3798, 3755, 3719, 3716, 3711, 3704, 3688, 3612, 3608, 3606, 3579, 3543, 3537, 3527, 3493, 3450, 3441, 3356, 3346, 3338, 3319, 3313, 3294, 3272, 3264, 3258, 3244, 3183, 3179, 3160, 3145, 3136, 3127, 3123, 3119, 3107, 3098, 3093, 3090, 3083, 3082, 3068, 3057, 3052, 3029, 3028, 3022, 3022]
Although the number of PGs increased during the benchmark run, causing recovery in the background, I don't think it had a very significant impact on performance. The 2% of slow random reads probably happen when the Read Storage Shards are being created, because the writers are not throttled and apply high pressure on the cluster. If throttling was implemented, the figure is expected to stay below 1%, with maximum values no higher than 1 second.
The benchmark will be modified to output raw data about its workload every 5 seconds for the duration of the run, to verify that the latency spikes are correlated with the writes.
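A sketch of what that periodic output could look like, assuming the workers increment shared counters (the counter names are hypothetical):

    import logging
    import threading

    def report_stats(counters, stop, interval=5.0):
        # log a snapshot of the workload every `interval` seconds so that read
        # latency spikes can be correlated with write activity afterwards
        while not stop.wait(interval):
            logging.warning(
                "objects written %d, objects read %d, reads > 100ms %d",
                counters["written"], counters["read"], counters["slow_reads"],
            )

    # stop = threading.Event()
    # threading.Thread(target=report_stats, args=(counters, stop), daemon=True).start()
    # ... run the benchmark ...
    # stop.set()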
https://git.easter-eggs.org/biceps/biceps/-/tree/f8fff3e951b9dfde33584f22a9f689cbf30579c2
https://git.easter-eggs.org/biceps/biceps/-/tree/939dae6b7530dcd9152b60b0a2dc8e0cbb5c821f
New stats look like this, with a Ceph cluster of 15 OSDs:
$ ./run-bench.sh --file-count-ro 100 --rw-workers 10 --ro-workers 5 --file-size $((10 * 1024)) --no-warmup
Bytes write 188.6 MB/s
Objects write 9.2 Kobject/s
Bytes read 69.5 MB/s
Objects read 17.0 Kobject/s
24765 random reads take longer than 100ms (5.292028514770241%)
It confirms that reads compete with writes. At the beginning there are periods during which no writes reach the cluster because the objects are being written to the Write Storage. Then all workers start writing Shards to the Read Storage at the same time (because they all started at the same time). The reads then slow down significantly and small random reads take longer than 100ms.
For the record, this blog post published in April 2021 has pointers on how to benchmark and tune Ceph:
https://ceph.io/community/qos-study-with-mclock-and-wpq-schedulers/
https://git.easter-eggs.org/biceps/biceps/-/tree/51bb6d17ee2d568c6b3af34accd36e2a1df77347
The write pattern changes: instead of continuous pressure, it spikes because the writers sleep. This does not help with the read latency: the slow reads still happen when the writes spike. The benchmark has no control over this; it happens when the RBD kernel driver thinks it appropriate. It could be controlled by using librbd instead of mounting the RBD device, or by modifying the parameters of the kernel module.
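For reference, writing through librbd via the Python bindings looks roughly like this (the ro pool is the one used above, the image name is hypothetical); whether it actually smooths out the spikes would need to be measured:

    import rados
    import rbd

    cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
    cluster.connect()
    ioctx = cluster.open_ioctx("ro")
    with rbd.Image(ioctx, "shard-0") as image:  # image name is made up
        # writes go through librbd, so the application decides when data is
        # flushed instead of depending on the kernel RBD driver's writeback
        image.write(b"\0" * 4096, 0)
        image.flush()
    ioctx.close()
    cluster.shutdown()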
https://git.easter-eggs.org/biceps/biceps/-/tree/8701731b4a98042fb8dee09a9c239928db0b1dc1
$ ./run-bench.sh --reader-io 500 --writer-io 80 --file-count-ro 100 --rw-workers 10 --ro-workers 5 --file-size $((10 * 1024)) --no-warmup
Bytes write 99.5 MB/s
Objects write 4.9 Kobject/s
Bytes read 126.5 MB/s
Objects read 31.0 Kobject/s
18147 random reads take longer than 100ms (1.2006558053861789%)
https://git.easter-eggs.org/biceps/biceps/-/tree/v2021-06-25
With a warmup phase and 100GB Shards. The number of PGs was incorrectly set on the ro pool instead of the ro-data pool: background recovery happened during approximately the last third of the run.
$ ./run-bench.sh --reader-io 150 --writer-io 85 --file-count-ro 350 --rw-workers 10 --ro-workers 5 --file-size $((100 * 1024))
Bytes write 108.7 MB/s
Objects write 5.3 Kobject/s
Bytes read 103.4 MB/s
Objects read 25.3 Kobject/s
494961 random reads take longer than 100ms (1.7177595208163046%)
After discussing with @olasd and @douardda today, it was decided that although the benchmarks are not 100% as expected, they are good enough. The next steps will be to:
After some cleanup, the final version is https://git.easter-eggs.org/biceps/biceps/-/tree/7d137fcd54f265253a27346b3652e26c6c5dd5e8. This concludes this (long) task, which can now be closed.