@olasd E. Lacour completed a study for a Ceph cluster today, with hardware specifications and pricing. He is available to discuss if you'd like.
May 31 2021
May 19 2021
May 17 2021
- display the time to first byte for random reads (see the sketch below)
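A minimal sketch of how such a time-to-first-byte measurement could look, assuming a seekable file-like handle to a mounted image; the names are illustrative, not the actual bench.py code:

```python
import time

def time_to_first_byte(fobj, offset: int) -> float:
    # Seek to a random offset and time how long the first byte takes to arrive.
    start = time.monotonic()
    fobj.seek(offset)
    fobj.read(1)
    return time.monotonic() - start
```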
May 15 2021
Reducing the number of read workers to 20 allows writes to perform as expected. The test results are collected in the README file for the record.
Reads now perform a lot better, both because the miscalculation is fixed and because the RBD is mounted read-only. They must be throttled, otherwise they put too much pressure on the cluster, which then underperforms on writes.
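A minimal sketch of the kind of read throttle meant here, assuming a simple bandwidth cap per reader; the class name and numbers are illustrative, not the bench.py implementation:

```python
import time

class ReadThrottle:
    """Pace reads so the consumed bandwidth stays under a cap (illustrative)."""

    def __init__(self, max_bytes_per_second: float):
        self.max_bytes_per_second = max_bytes_per_second
        self.start = time.monotonic()
        self.consumed = 0

    def wait(self, nbytes: int) -> None:
        # Sleep just long enough that consumed / elapsed never exceeds the cap.
        self.consumed += nbytes
        target_elapsed = self.consumed / self.max_bytes_per_second
        elapsed = time.monotonic() - self.start
        if target_elapsed > elapsed:
            time.sleep(target_elapsed - elapsed)
```

Each read worker would call something like throttle.wait(len(chunk)) after every read.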
- estimate the number of objects read sequentially, based on the median object size (see the sketch after this list)
- implement a read-only mode to experiment with various settings on an existing Read Storage
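For the estimate mentioned in the first item above, a back-of-the-envelope sketch; the throughput and median object size are placeholders, not measured values:

```python
def estimated_objects_per_second(read_bytes_per_second: float, median_object_size: int) -> float:
    # Sequential read throughput divided by the median object size gives an
    # approximate object rate.
    return read_bytes_per_second / median_object_size

# e.g. 100 MB/s of sequential reads over ~4KB median objects -> ~25K objects/s
print(round(estimated_objects_per_second(100e6, 4000)))
```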
May 10 2021
- remap RBD images read-only when they are full so that there is no need to acquire them read-write (not sure it matters, just an idea at this point, and it's a simple thing to do)
- clobber postgres when starting the benchmarks, in case there are leftovers
- the postgres standby does not need to be hot (see above)
- add recommended tuning for PostgreSQL (assuming a machine that has 128GB RAM)
- zap the grid5000 nvme drives used for PostgreSQL because they are not reset when the machine is deployed
With hot_standby = off the WAL is quickly flushed to the standby server when the writes finish. As soon as the writes finish, the benchmark starts reading all databases as fast as it can, which significantly slows down the replication because it needs to ensure strong consistency between the master and the standby.
Tune PostgreSQL and verify it improves the situation as follows:
$ ansible-playbook -i inventory tests-run.yml && ssh -t $runner direnv exec bench python bench/bench.py --file-count-ro 500 --rw-workers 40 --ro-workers 40 --file-size 50000 --no-warmup
...
WARNING:root:Objects write 6.8K/s
WARNING:root:Bytes write 137.7MB/s
WARNING:root:Objects read 1.5K/s
WARNING:root:Bytes read 109.9MB/s
May 8 2021
After writing 1TB in 40 DB (40 * 25GB), the WAL is ~200GB i.e. ~20%:
$ ansible-playbook -i inventory tests-run.yml && ssh -t $runner direnv exec bench python bench/bench.py --file-count-ro 500 --rw-workers 40 --ro-workers 40 --file-size 50000 --no-warmup
May 3 2021
While this is very creative, there is no benefit in storing small objects in git for the Software Heritage workload.
There is no need to use Ceph for the Write Storage: PostgreSQL performs well and there is no scaling problem. The size of the Write Storage is limited, by design.
This was discussed during the Ceph Developer Summit 2021, and the conclusion was that RADOS is not the place to implement optimizations for immutable objects; RGW is a better fit.
- Group the two PostgreSQL NVMe drives into a single logical volume to get more storage: 30 write workers using 100GB Shards require 3TB of PostgreSQL storage
- Set up a second PostgreSQL server as a standby replica of the master: it may negatively impact the performance of the master and should be included in the benchmark
- Explain the benchmark methodology & assumptions
$ bench.py --file-count-ro 200 --rw-workers 20 --ro-workers 80 --file-size 50000 --no-warmup
...
WARNING:root:Objects write 5.8K/s
WARNING:root:Bytes write 117.9MB/s
WARNING:root:Objects read 1.3K/s
WARNING:root:Bytes read 100.4MB/s
May 2 2021
$ bench.py --file-count-ro 200 --rw-workers 20 --ro-workers 80 --file-size 50000 --rand-ratio 10
...
WARNING:root:Objects write 5.8K/s
WARNING:root:Bytes write 118.4MB/s
WARNING:root:Objects read 12.3K/s
WARNING:root:Bytes read 850.3MB/s
May 1 2021
Fixed a race condition that caused PostgreSQL database drops to fail.
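For context, a hedged sketch of one common cause of this kind of failure (not necessarily the bug fixed here): DROP DATABASE refuses to run while other sessions are still connected, so leftover benchmark connections must be terminated first. psycopg2 is used with an illustrative DSN:

```python
import psycopg2
from psycopg2 import sql

def drop_database(dsn: str, name: str) -> None:
    # Connect to a maintenance database (e.g. "postgres"), not the one to drop.
    conn = psycopg2.connect(dsn)
    conn.autocommit = True  # DROP DATABASE cannot run inside a transaction
    try:
        with conn.cursor() as cur:
            # Kick out any session still connected to the target database.
            cur.execute(
                "SELECT pg_terminate_backend(pid) FROM pg_stat_activity "
                "WHERE datname = %s AND pid <> pg_backend_pid()",
                (name,),
            )
            cur.execute(sql.SQL("DROP DATABASE IF EXISTS {}").format(sql.Identifier(name)))
    finally:
        conn.close()
```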
Apr 27 2021
The rewrite to use processes was trivial and preliminary tests yielded the expected results. Most of the time was spent on two problems:
Apr 20 2021
Struggled most of today because there is a bottleneck when using threads and postgres from a single client. However, when running 4 processes, it performs as expected. The benchmark should be rewritten to use a process pool instead of the thread pool, which should not be too complicated (see the sketch below). I tried to add a warmup phase so that all concurrent threads/processes do not start at the same time, but it does not make any visible difference.
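A minimal sketch of that thread-pool to process-pool switch using concurrent.futures, assuming a hypothetical write_worker entry point; the actual code lives in the biceps repository:

```python
# Sketch only: write_worker is an illustrative stand-in for the bench.py workers.
from concurrent.futures import ProcessPoolExecutor  # was ThreadPoolExecutor

def write_worker(worker_id: int) -> int:
    # Each process opens its own PostgreSQL connection, so workers are no
    # longer serialized behind a single client the way threads were.
    return worker_id

if __name__ == "__main__":
    # Workers must be top-level functions so they can be pickled for the pool.
    with ProcessPoolExecutor(max_workers=4) as executor:
        print(list(executor.map(write_worker, range(4))))
```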
Apr 19 2021
Completed the tests for the rewrite; it is working.
Apr 18 2021
rbd bench on the images created
Apr 17 2021
There is a 3% space overhead on the RBD data pool. 6TB data, 3TB parity = 9TB. Actual 9.3TB, i.e. ~+3%.
https://www.grid5000.fr/w/Grenoble:Network shows the network topology
Complete rewrite to:
Apr 14 2021
- Add a reader that continuously reads from images to simulate a read workload
- Randomize the payload instead of using easily compressible data (postgres compresses it well, which does not reflect reality); a sketch follows below
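A minimal sketch of the payload change mentioned in the last item, with illustrative helper names rather than the actual bench.py functions:

```python
import os

def compressible_payload(size: int) -> bytes:
    # What the benchmark effectively used before: trivially compressible data.
    return b"\x00" * size

def random_payload(size: int) -> bytes:
    # Incompressible random bytes, closer to the real Software Heritage objects.
    return os.urandom(size)
```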
Apr 12 2021
- bench.py --file-count-ro 20 --rw-workers 20 --packer-workers 20 --file-size 1024 --fake-ro yields WARNING:root:Objects write 17.7K/s
- bench.py --file-count-ro 40 --rw-workers 40 --packer-workers 20 --file-size 1024 --fake-ro yields WARNING:root:Objects write 13.8K/s
Apr 7 2021
The benchmark was moved to a temporary repository for convenience (easier than uploading here every time). https://git.easter-eggs.org/biceps/biceps
Apr 6 2021
Takeaways from the session:
Mar 30 2021
Mar 26 2021
Mar 25 2021
Mar 24 2021
Refactored the cluster provisioning to use all available disks instead of the existing file system (using cephadm instead of a hand-made Ceph cluster).
Mar 23 2021
The benchmark runs and it's not too complicated, which is a relief. I'll clean up the mess I made and move forward to finish writing the software.
The benchmarks are not fully functional but they produce a write load that matches the object storage design. They run (README.txt) via libvirt and are being tested on Grid5000 to ensure all the pieces are in place (i.e. that reserving machines, provisioning them, and running actually works) before moving forward.
Mar 17 2021
Mail thread with Chris Lu on SeaweedFS use cases with 100+ billion objects.
First draft for layer 0.
Mar 15 2021
Bookmarking https://leo-project.net/leofs/
Mar 10 2021
With a little help from the mattermost channel and after approval of the account, it was possible to boot a physical machine with Debian GNU/Linux installed from scratch and get root access to it.
Thanks for helping with the labelling @rdicosmo 👍
Added a section about TCO in the design document.
Mar 9 2021
There is a mattermost channel dedicated to Grid5000 but one has to be invited to join; it is not open to the public.
Additional nvme drives for yeti should be something similar to https://www.samsung.com/semiconductor/ssd/enterprise-ssd/ but confirmation is needed to verify the machines actually have the required SFF-8639 connectors to plug them in.
The account request was approved, I'll proceed with a minimal reservation to figure out how it is done.
Thanks for the feedback. https://www.grid5000.fr/w/Grenoble:Hardware#yeti has 1.6TB nvme drives, which seems better. It would be better to have a total of 4TB nvme available to get closer to the target global index size (i.e. 40 bytes × 100 billion entries = 4TB). I'm told it is possible to donate hardware to Grid5000: if testing with the current configuration is not convincing enough, 4 more nvme pcie drives could be donated and they would be installed in the machines. No idea how much delay to expect but it's good to know it is possible.
Looking at the available hardware, here is what could be used:
Followed the instructions at https://www.grid5000.fr/w/Grid5000:Get_an_account to get an account. Waiting for approval.