Creating a 20 billion entry global index fails because there is not enough disk space (the 2.9TB file system is full, even with tunefs -m 0).
The equilibrium between reads and writes is reached with 5 readers and 10 writers, which leads to 1.2% of random reads above the threshold, the worst one being 2 seconds. What this means is that care must be taken, application side, to throttle reads and writes, otherwise the penalty is a significant degradation in latency.
When the benchmark writes, the pressure of 40 workers slows down the reads significantly.
Running the benchmark with a read-only workload (the Ceph cluster is doing nothing else) and 20 workers shows 8% of requests with a latency above the threshold:
I interrupted the benchmarks because they show reads are not behaving as expected, i.e. a large number of reads take very long and the number of reads per second is much higher than what is needed. There is no throttling on reads; only the number of workers limits them. I was expecting they would be slowed down by other factors and not apply too much pressure on the cluster. I was apparently wrong: throttling must be implemented to slow them down.
Sat, Jun 12
For the record, creating 10 billion entries in the global index took:
Mon, Jun 7
I still think that returning a histogram of response times, in buckets 5 or 10 ms wide, may be valuable. We can then derive percentiles from it if we're so inclined.
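The histogram-then-percentiles idea could be sketched as follows (function names and the 10 ms bucket width are illustrative, not part of bench.py):

```python
from collections import Counter

BUCKET_MS = 10  # bucket width; 5 ms would work the same way

def histogram(latencies_ms):
    """Bucket response times into BUCKET_MS-wide ranges."""
    return Counter((int(ms) // BUCKET_MS) * BUCKET_MS for ms in latencies_ms)

def percentile(hist, p):
    """Derive the p-th percentile (0-100) from the histogram: the upper
    bound of the first bucket whose cumulative count reaches p% of all
    samples. Accurate to within one bucket width."""
    total = sum(hist.values())
    cumulative = 0
    for bucket in sorted(hist):
        cumulative += hist[bucket]
        if cumulative >= total * p / 100:
            return bucket + BUCKET_MS
    return None
```

The histogram is cheap to collect and ship (one counter per bucket), and any percentile can be recovered afterwards to within one bucket width.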
- Collect and display the worst time to first byte, not the average
This weekend's run was not very fruitful: since the global index could not be populated as expected, and this was only discovered Sunday morning, there was no time to fall back to a smaller one, for instance 10 billion entries. A run was launched and lasted ~24h to show:
Sat, Jun 5
20 billion entries were inserted in the global index. After building, the index occupies 2.5TB, therefore each entry uses ~125 bytes of raw space. That's 25% more than with a 1 billion entry global index (i.e. ~100 bytes per entry).
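A quick check of the space accounting above:

```python
# 20 billion entries occupying 2.5TB of raw index space
entries = 20_000_000_000
raw_bytes = 2.5e12

per_entry = raw_bytes / entries       # bytes of raw space per entry
assert per_entry == 125

# versus ~100 bytes/entry measured with the 1 billion entry index
overhead = per_entry / 100 - 1
assert round(overhead * 100) == 25    # 25% more per entry
```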
- Add insertion in the global index to the benchmark
Wed, Jun 2
My notes on the meeting:
Mon, May 31
- Add the generate script to ingest entries in the global index.
The call is set to Wednesday June 2nd, 2021 4pm UTC+2 at https://meet.jit.si/ApparentStreetsJokeOk
Wed, May 19
@olasd E. Lacour completed a study for a Ceph cluster today, with hardware specifications and pricing. He is available to discuss it if you'd like.
Mon, May 17
- display the time to first byte for random reads
May 15 2021
Reducing the number of read workers to 20 allows writes to perform as expected. The test results are collected in the README file for the archive.
Now reads perform a lot better, because the miscalculation is fixed but also because the RBD is mounted read-only. Reads must be throttled, otherwise they put too much pressure on the cluster, which then underperforms on writes.
- estimate the number of objects with sequential read based on the median size
- implement a read-only mode to experiment with various settings on an existing Read Storage
May 10 2021
- remap RBD images read-only when they are full so that there is no need to acquire read-write access (not sure it matters, just an idea at this point, and it's a simple thing to do)
- clobber postgres when starting the benchmarks, in case there are leftovers
- the postgres standby does not need to be hot (see above)
- add recommended tuning for PostgreSQL (assuming a machine that has 128GB RAM)
- zap the grid5000 NVMe drives used for PostgreSQL because they are not reset when the machine is deployed
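The RBD remap and NVMe zap items above could look roughly like this (a sketch only; image names and device paths are hypothetical):

```
# Remap a full RBD image read-only (pool/image names are hypothetical)
rbd device unmap /dev/rbd0
rbd device map --read-only rbd/image-0

# Wipe leftover file system signatures from the grid5000 NVMe drives
# before giving them to PostgreSQL (device path is hypothetical)
wipefs --all /dev/nvme0n1
```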
With hot_standby = off, the WAL is quickly flushed to the standby server when the writes finish.
As soon as the writes finish, the benchmark starts to read all databases as fast as it can, which
significantly slows down the replication because it needs to ensure strong consistency between the
master and the standby.
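For reference, the setting in question lives in the standby's postgresql.conf; only this parameter comes from the notes above, the comment is mine:

```
# standby postgresql.conf
hot_standby = off    # replay WAL only, do not serve read-only queries
```

With hot_standby off the standby does nothing but apply WAL, so replication is not competing with read queries on the standby itself.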
Tune PostgreSQL and verify it improves the situation as follows:
$ ansible-playbook -i inventory tests-run.yml && \
  ssh -t $runner direnv exec bench python bench/bench.py \
    --file-count-ro 500 --rw-workers 40 --ro-workers 40 --file-size 50000 --no-warmup
...
WARNING:root:Objects write 6.8K/s
WARNING:root:Bytes write 137.7MB/s
WARNING:root:Objects read 1.5K/s
WARNING:root:Bytes read 109.9MB/s
May 8 2021
After writing 1TB in 40 DBs (40 * 25GB), the WAL is ~200GB, i.e. ~20%:
$ ansible-playbook -i inventory tests-run.yml && \
  ssh -t $runner direnv exec bench python bench/bench.py \
    --file-count-ro 500 --rw-workers 40 --ro-workers 40 --file-size 50000 --no-warmup
May 3 2021
While this is very creative, there is no benefit in storing small objects in git for the Software Heritage workload.
There is no need to use Ceph for the Write Storage: PostgreSQL performs well and there is no scaling problem. The size of the Write Storage is limited, by design.
It was discussed during the Ceph Developer Summit 2021, and the conclusion was that RADOS is not the place to implement immutable optimizations; RGW is a better fit.
- Group the two PostgreSQL NVMe drives in a single logical volume to get more storage: 30 write workers using 100GB shards require 3TB of PostgreSQL storage
- Set up a second PostgreSQL server as a standby replica of the master: it may negatively impact the performance of the master cluster and should be included in the benchmark
- Explain the benchmark methodology & assumptions
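The logical volume grouping from the first item could be done along these lines (a sketch; device paths, volume names, and mount point are hypothetical):

```
# Combine the two NVMe drives into one logical volume for PostgreSQL
pvcreate /dev/nvme0n1 /dev/nvme1n1
vgcreate pgdata /dev/nvme0n1 /dev/nvme1n1
lvcreate --extents 100%FREE --name pg pgdata
mkfs.ext4 /dev/pgdata/pg
mount /dev/pgdata/pg /var/lib/postgresql
```

A linear LV is enough here: the goal is capacity (3TB for the shards), not striping for performance.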
$ bench.py --file-count-ro 200 --rw-workers 20 --ro-workers 80 --file-size 50000 --no-warmup
...
WARNING:root:Objects write 5.8K/s
WARNING:root:Bytes write 117.9MB/s
WARNING:root:Objects read 1.3K/s
WARNING:root:Bytes read 100.4MB/s
May 2 2021
$ bench.py --file-count-ro 200 --rw-workers 20 --ro-workers 80 --file-size 50000 --rand-ratio 10
...
WARNING:root:Objects write 5.8K/s
WARNING:root:Bytes write 118.4MB/s
WARNING:root:Objects read 12.3K/s
WARNING:root:Bytes read 850.3MB/s
May 1 2021
Fixed a race condition that caused PostgreSQL database drops to fail.
Apr 27 2021
The rewrite to use processes was trivial and preliminary tests yield the expected results. Most of the time was spent on two problems:
Apr 20 2021
Struggled most of today because there is a bottleneck when using threads and postgres from a single client. However, when running 4 processes, it performs as expected. The benchmark should be rewritten to use a process pool instead of the thread pool, which should not be too complicated. I tried to add a warmup phase so that all concurrent threads/processes do not start at the same time, but it does not make any visible difference.
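The thread-to-process switch is mostly mechanical with concurrent.futures, since the two pools share an interface. A minimal sketch (the worker body is a hypothetical stand-in for one benchmark worker, not bench.py's actual code):

```python
from concurrent.futures import ProcessPoolExecutor
import hashlib

def worker(seed: int) -> str:
    # Stand-in for one worker's CPU-bound work; separate processes
    # sidestep the GIL bottleneck observed with threads.
    h = hashlib.sha256()
    for i in range(100_000):
        h.update(str(seed + i).encode())
    return h.hexdigest()

if __name__ == "__main__":
    # Swapping ThreadPoolExecutor for ProcessPoolExecutor is the whole
    # change at the pool level; workers just need picklable args/results.
    with ProcessPoolExecutor(max_workers=4) as pool:
        results = list(pool.map(worker, range(4)))
    print(len(results))
```

The main constraint the rewrite adds is that worker arguments and return values must be picklable, which is why it stays "not too complicated" for a benchmark with simple per-worker state.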
Apr 19 2021
Completed the tests for the rewrite; it is working.
Apr 18 2021
rbd bench on the images created
Apr 17 2021
There is a ~3% space overhead on the RBD data pool: 6TB data + 3TB parity = 9TB expected, versus 9.3TB actual, i.e. ~+3%.
https://www.grid5000.fr/w/Grenoble:Network shows the network topology
Complete rewrite to:
Apr 14 2021
- Add reader to continuously read from images to simulate a read workload
- Randomize the payload instead of using easily compressible data (postgres compresses it well, which does not reflect reality)
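The payload randomization in the second item can rely on os.urandom, which yields effectively incompressible bytes; a quick sketch demonstrating the difference (helper name and sizes are illustrative):

```python
import os
import zlib

def payload(size: int) -> bytes:
    # os.urandom produces incompressible bytes, unlike a repeated pattern
    return os.urandom(size)

compressible = b"a" * 4096
random_data = payload(4096)

# zlib shrinks the repeated pattern dramatically but barely touches the
# random bytes, so the benchmark's on-disk sizes stay realistic.
assert len(zlib.compress(compressible)) < 100
assert len(zlib.compress(random_data)) > 4000
```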
Apr 12 2021
- bench.py --file-count-ro 20 --rw-workers 20 --packer-workers 20 --file-size 1024 --fake-ro yields WARNING:root:Objects write 17.7K/s
- bench.py --file-count-ro 40 --rw-workers 40 --packer-workers 20 --file-size 1024 --fake-ro yields WARNING:root:Objects write 13.8K/s
Apr 7 2021
The benchmark was moved to a temporary repository for convenience (easier than uploading here every time). https://git.easter-eggs.org/biceps/biceps
Apr 6 2021
Takeaways from the session:
Mar 30 2021
Mar 26 2021
Mar 25 2021
Mar 24 2021
Refactored the cluster provisioning to use all available disks instead of the existing file system (using cephadm instead of a hand-made Ceph cluster).
Mar 23 2021
The benchmark runs and it's not too complicated, which is a relief. I'll clean up the mess I made and move forward to finish writing the software.
The benchmarks are not fully functional but they produce a write load that matches the object storage design. They run (README.txt) via libvirt and are being tested on Grid5000 to ensure all the pieces are in place (i.e. that reserving machines, provisioning them, and running actually works) before moving forward.
Mar 17 2021
Mail thread with Chris Lu on SeaweedFS use cases with 100+ billion objects.
First draft for layer 0.
Mar 15 2021
Mar 10 2021
With a little help from the mattermost channel, and after the account was approved, it was possible to boot a physical machine with Debian GNU/Linux installed from scratch and get root access to it.