The benchmarks are not fully functional yet, but they produce a write load that matches the object storage design. They run (README.txt) via libvirt and are being tested on Grid5000 to make sure all the pieces are in place (i.e. that reserving machines, provisioning them, and running the benchmark actually works) before moving forward.
The benchmark runs and it is not too complicated, which is a relief. I'll clean up the mess I made and move on to finish writing the software.
direnv: loading bench/.envrc
========================================================= test session starts =========================================================
platform linux -- Python 3.7.3, pytest-6.2.2, py-1.10.0, pluggy-0.13.1
rootdir: /root
plugins: mock-3.5.1, asyncio-0.14.0
collected 10 items

bench/test_bench.py ..........                                                                                             [100%]

========================================================= 10 passed in 55.48s =========================================================
Connection to dahu-25.grenoble.grid5000.fr closed.
Even though the benchmark is nowhere near meaningful yet, I could not resist trying it anyway. With 2 writers it gives:
WARNING:root:Objects write/seconds 1008/s
WARNING:root:Bytes write/seconds 20MB/s
and with 5 writers it gives:
WARNING:root:Objects write/seconds 2K/s
WARNING:root:Bytes write/seconds 41MB/s
Meanwhile the PostgreSQL host has way too many processors for the load :-)
The benchmark was moved to a temporary repository for convenience (easier than uploading here every time). https://git.easter-eggs.org/biceps/biceps
Today I figured out that the bottleneck of the benchmark was actually the CPU usage of the benchmark itself, caused by an excessive number of transactions. A single worker achieves ~500 object inserts per second, but with more than 5 workers it tops out at ~2.5K object inserts per second because the benchmark is CPU bound. Hacking it a little showed it can reach 7K object writes per second. I rewrote the benchmark to fix this properly; this is commit https://git.easter-eggs.org/biceps/biceps/-/commit/c0e79a2b6751cacb19ad4fad804a3b942047eb7f.
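For illustration, the kind of change involved is grouping many inserts into one transaction instead of committing each object individually. A minimal sketch, assuming psycopg2 and a hypothetical objects table (not the actual biceps code):

import psycopg2
from psycopg2.extras import execute_values

def insert_batched(dsn, rows, batch_size=1000):
    # One transaction (and one commit) per batch of objects, instead of
    # one per object: the per-transaction CPU overhead is paid 1000x less.
    # The DSN, table and columns are hypothetical placeholders.
    conn = psycopg2.connect(dsn)
    for i in range(0, len(rows), batch_size):
        with conn:  # commits the transaction on exit
            with conn.cursor() as cur:
                execute_values(
                    cur,
                    "INSERT INTO objects (sha256, content) VALUES %s",
                    rows[i:i + batch_size],
                )
    conn.close()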
I ran out of time on Grid5000 to verify that the rewrite works as expected, but I'm confident it will. The next steps will be to:
- verify --fake-ro can reach 10K object insertions per second with 20 workers
- run the benchmark without --fake-ro and see how many MB/s it achieves
- bench.py --file-count-ro 20 --rw-workers 20 --packer-workers 20 --file-size 1024 --fake-ro yields:
  WARNING:root:Objects write 17.7K/s
- bench.py --file-count-ro 40 --rw-workers 40 --packer-workers 20 --file-size 1024 --fake-ro yields:
  WARNING:root:Objects write 13.8K/s
- bench.py --file-count-ro 20 --rw-workers 20 --packer-workers 20 --file-size 1024 yields:
  WARNING:root:Objects write 6.4K/s
  WARNING:root:Bytes write 131.1MB/s
- bench.py --file-count-ro 200 --rw-workers 20 --packer-workers 20 --file-size 1024 yields:
  WARNING:root:Objects write 6.1K/s
  WARNING:root:Bytes write 124.4MB/s
- Add a reader that continuously reads from the images, to simulate a read workload
- Randomize the payload instead of using easily compressible data (PostgreSQL compresses it well, which does not reflect reality; see the sketch after this list)
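A quick way to see why randomization matters: a repeated pattern compresses to almost nothing, while os.urandom output does not compress at all. A minimal sketch (the payload size is illustrative, not what bench.py uses):

import os
import zlib

pattern = b"x" * 4096           # easily compressible payload
random_data = os.urandom(4096)  # incompressible payload

print(len(zlib.compress(pattern)))      # a few dozen bytes
print(len(zlib.compress(random_data)))  # ~4096 bytes, no gain

PostgreSQL's TOAST compression gets the same kind of gain on repeated patterns, so benchmarking with them underestimates the real I/O volume.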
I chased various Ceph installation issues when using 14 machines and got it to a reliable state by:
- zapping known devices even when they do not show up via the orchestrator: the internal logic waits for the required data to be available and does a better job than trying to wait for the devices to show up (sometimes they don't, and the reason is unclear)
- using a Docker registry mirror to avoid hitting the Docker Hub rate limit (see the config sketch below)
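For reference, pointing the Docker daemon at a pull-through mirror is a one-line configuration in /etc/docker/daemon.json; the mirror URL below is a placeholder, not the one actually used:

{
  "registry-mirrors": ["https://docker-mirror.example.com"]
}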
Complete rewrite to:
- Use one thread per worker (using asyncio for the workloads turned out to be too complicated because Python 3 lacks universal async support, for file I/O in particular); see the sketch after this list
- Merge the write and pack steps together for simplicity, since one always follows the other
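A minimal sketch of the thread-per-worker structure with merged write/pack steps; write_object and pack_object are hypothetical placeholders, not the actual biceps API:

import queue
import threading

def write_object(payload: bytes) -> None:
    pass  # placeholder for the actual database write

def pack_object(payload: bytes) -> None:
    pass  # placeholder for the actual packing step

def worker(jobs: queue.Queue) -> None:
    # Each worker is a plain thread; writing and packing happen
    # back to back since packing always follows a write.
    while True:
        payload = jobs.get()
        if payload is None:  # sentinel: no more work
            break
        write_object(payload)
        pack_object(payload)

def run(payloads, rw_workers: int = 20) -> None:
    jobs: queue.Queue = queue.Queue()
    threads = [threading.Thread(target=worker, args=(jobs,))
               for _ in range(rw_workers)]
    for t in threads:
        t.start()
    for p in payloads:
        jobs.put(p)
    for _ in threads:
        jobs.put(None)  # one sentinel per worker
    for t in threads:
        t.join()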
Running the tests for ~24h showed:
- there is no significant memory leak (though there is a small one): the overall memory usage stayed at 27GB while the RSS went from 800MB to 1.4GB.
- the throughput does not degrade over time: creating ~6,000 images for 6TB shows the same throughput as creating 20 images for 20GB.
There is a ~3% space overhead on the RBD data pool: 6TB of data plus 3TB of parity should use 9TB, but the actual usage is 9.3TB, i.e. ~+3%.
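The arithmetic, as a one-liner check using the numbers above:

data_tb, parity_tb, actual_tb = 6.0, 3.0, 9.3
expected_tb = data_tb + parity_tb  # 9.0 TB expected for data + parity
print(f"overhead: {100 * (actual_tb / expected_tb - 1):.1f}%")  # ~3.3%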
root@ceph1:~# ceph df
--- RAW STORAGE ---
CLASS  SIZE    AVAIL   USED     RAW USED  %RAW USED
hdd    25 TiB  16 TiB  9.3 TiB  9.4 TiB   36.72
TOTAL  25 TiB  16 TiB  9.3 TiB  9.4 TiB   36.72

--- POOLS ---
POOL                   ID  PGS  STORED   OBJECTS  USED     %USED  MAX AVAIL
device_health_metrics  1   1    214 KiB  12       642 KiB  0      4.7 TiB
ro-data                18  32   6.0 TiB  1.56M    9.3 TiB  40.04  9.3 TiB
ro                     19  32   11 MiB   12.16k   1.1 GiB  0      4.7 TiB
rbd bench on the created images (first a sequential readwrite pass over 6 images, then a 4K random-read pass over 12 images):
for i in $(rbd --pool ro ls | head -6) ; do rbd --pool ro bench --io-type readwrite --io-threads 16 --io-total 1G $i > $i.out & done

rm *.out ; for i in $(rbd --pool ro ls | head -12) ; do rbd --pool ro --io-size 4K bench --io-pattern rand --io-type read --io-threads 16 --io-total 10M $i > $i.out & done
Completed the tests for the rewrite; it is working.
Debug and use NVMe on yeti.