The number of slow random reads reaches ~3.5% presumably because there is too much write pressure (the throttling of writes was removed).
Aug 12 2021
The benchmarks were modified to (i) use a fixed number of random / sequential readers instead of a random choice, for better predictability, and (ii) introduce throttling to cap the sequential read speed at approximately 200MB/s. A read-only run was launched:
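As a rough illustration of that cap (a minimal sketch, not the actual bench.py code; the class name and usage below are hypothetical):

```python
import time

class BandwidthThrottle:
    """Cap the average read rate to `limit_bps` bytes per second."""

    def __init__(self, limit_bps):
        self.limit_bps = limit_bps
        self.start = time.monotonic()
        self.consumed = 0

    def wait(self, nbytes):
        # Record the bytes just read and sleep until the average rate
        # since `start` falls back under the limit.
        self.consumed += nbytes
        expected = self.consumed / self.limit_bps  # seconds the reads should have taken
        elapsed = time.monotonic() - self.start
        if expected > elapsed:
            time.sleep(expected - elapsed)

# Hypothetical usage in a sequential reader loop:
# throttle = BandwidthThrottle(200 * 1024 * 1024)
# for chunk in read_shard_sequentially(shard):
#     throttle.wait(len(chunk))
```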
The run terminated August 11th @ 15:21 because of what appears to be a rare race condition, but it was mostly finished. The results show an unexpected degradation of the read performance, which keeps getting worse over time and deserves further investigation. The write performance is however stable, which suggests the benchmark code itself may be responsible for the degradation: if the Ceph cluster were globally slowing down, both reads and writes would degrade, since previous benchmark results showed a correlation between the two.
Aug 2 2021
- Improve the readability of the graphs
- Rehearse the run and make minor updates to make sure it runs right away this Friday
Jul 20 2021
In the global read index, I would consider storing, for each object, alongside the shard id, the length and offset of the object (which are comparatively cheap to store)
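As an illustration of that suggestion, a hypothetical fixed-size entry for the global read index could look like the following (the field widths are assumptions, not an actual schema; the keys are 32 bytes as elsewhere in these notes):

```python
import struct

# Hypothetical global read index entry:
# 32-byte object id, 4-byte shard id, 8-byte offset, 8-byte length.
ENTRY = struct.Struct(">32sIQQ")  # 52 bytes per entry

def pack_entry(object_id, shard_id, offset, length):
    return ENTRY.pack(object_id, shard_id, offset, length)

def unpack_entry(raw):
    object_id, shard_id, offset, length = ENTRY.unpack(raw)
    return object_id, shard_id, offset, length
```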
Jul 19 2021
The Compress, Hash and Displace (CHD) algorithm described in http://cmph.sourceforge.net/papers/esa09.pdf generates a hash function under 4MB for ~30M keys of 32 bytes each.
A 100GB file can have 25M objects (4KB median size). If a perfect hash function requires 4 bits per entry, the function occupies 25M × 4 bits ≈ 12.5MB, i.e. ~12MB to read for every lookup.
I just realized that since a perfect hash function needs parameters that may require additional sequential reads at the beginning of the file, it would actually make more sense to use a regular hash function with a format that allows for collisions. Even if collisions are relatively frequent, the colliding entries may be stored adjacent to each other and will not require an additional read: they are likely to be in the same block most of the time. That would save the trouble of implementing a perfect hash function.
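A minimal sketch of that idea, assuming a fixed-size slot table with linear probing so that colliding entries end up in adjacent slots (names and sizes are illustrative, not the actual index format):

```python
SLOT_COUNT = 2 ** 25  # assumed table size, sized from the expected object count
EMPTY = None

def slot_of(key: bytes) -> int:
    # The keys are already content hashes (32 bytes), so a prefix of the key
    # is usable as a regular, non-perfect hash value.
    return int.from_bytes(key[:8], "big") % SLOT_COUNT

def insert(table, key, offset, length):
    # Linear probing: a colliding entry lands in the next adjacent slot,
    # hence most of the time in the same disk block as its home slot.
    i = slot_of(key)
    while table[i] is not EMPTY:
        i = (i + 1) % SLOT_COUNT
    table[i] = (key, offset, length)

def lookup(table, key):
    i = slot_of(key)
    while table[i] is not EMPTY:
        k, offset, length = table[i]
        if k == key:
            return offset, length
        i = (i + 1) % SLOT_COUNT
    return None
```

On disk, the table would be an array of fixed-size slots, so a lookup reads one block and scans at most a handful of neighbouring entries.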
what are "the parameters to the perfect hash functions"? what are the possible formats?
The content of a file:
On the topic of throttling, the following discussion happened on IRC:
I misrepresented @olasd's suggestions; here is the chat log on the matter.
In D6006#154829, @vlorentz wrote: why *args, **kwargs on all methods?
Jul 12 2021
We have a procedure for this kind of case; I added you to the
"oar-unrestricted-adv-reservations" group, which should lift all the
restrictions on advance reservations of resources. You should therefore be
able to redo your reservation with the correct walltime.
I set an expiration date of September 12th on this group to make sure that
is enough, but remember to submit another special usage request if you have
a new need outside the charter after the August one.
Mail sent today:
Jul 10 2021
$ oarsub -t exotic -l "{cluster='dahu'}/host=30+{cluster='yeti'}/host=3,walltime=216" --reservation '2021-08-06 19:00:00' -t deploy
[ADMISSION RULE] Include exotic resources in the set of reservable resources (this does NOT exclude non-exotic resources).
[ADMISSION RULE] Error: Walltime too big for this job, it is limited to 168 hours
Received yesterday:
Jul 6 2021
Quote for the write storage nodes.
- Storage node 8TB
Special permission request sent:
The benchmark results using grid5000 turned out to be good enough, so there will be no need to use the resources of the Sepia lab.
Using a hash table is a better option because lookups are O(1) instead of O(log(n)).
It is not worth the effort; using a hash table is a better option.
After some cleanup, the final version is https://git.easter-eggs.org/biceps/biceps/-/tree/7d137fcd54f265253a27346b3652e26c6c5dd5e8. This concludes this (long) task, which can be closed.
Jun 28 2021
Jun 26 2021
This run used a warmup phase and 100GB Shards. The number of PGs was incorrectly set on the ro pool instead of the ro-data pool: background recovery happened during approximately the last third of the run.
Jun 22 2021
- Add RBD QoS dynamically to avoid bursts (see the sketch after this list)
- Implement throttling for writes
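A hedged illustration of the RBD QoS item above, assuming a Ceph release (Nautilus or later) where per-image QoS settings such as rbd_qos_write_bps_limit can be changed with `rbd config image set`; the image name is hypothetical:

```python
import subprocess

def set_rbd_write_limit(image_spec, bytes_per_second):
    # Dynamically cap the write bandwidth of an RBD image to smooth out bursts.
    subprocess.run(
        ["rbd", "config", "image", "set", image_spec,
         "rbd_qos_write_bps_limit", str(bytes_per_second)],
        check=True,
    )

# Hypothetical usage: limit writes on a benchmark image to 200MB/s for the
# duration of a run, then set the limit back to 0 (unlimited) afterwards.
# set_rbd_write_limit("benchmark/write-image-1", 200 * 1024 * 1024)
```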
For the record, this blog post published in April 2021 has pointers on how to benchmark and tune Ceph.
Jun 21 2021
New stats look like this, with a Ceph cluster of 15 OSDs:
- The statistics are no longer displayed as the benchmark runs; they are stored in CSV files, with one line added every 5 seconds
- IO stats are collected from the Ceph cluster every five seconds and included in the CSV files
- A stats.py script was implemented to analyze the content of the CSV files and display statistics on the benchmark run (see the sketch after this list)
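As an illustration, the downstream analysis of those CSV files could look like the following sketch (the column names and file name are assumptions, not the actual stats.py schema):

```python
import pandas as pd

# One row every 5 seconds; the columns below are assumed, not the real schema:
# bytes_read / bytes_write are the bytes transferred during the interval and
# ttfb_ms is the time to first byte of a sampled random read.
df = pd.read_csv("bench-stats.csv")

interval = 5  # seconds between two rows
print("read throughput  (MB/s):", df["bytes_read"].mean() / interval / 1e6)
print("write throughput (MB/s):", df["bytes_write"].mean() / interval / 1e6)

# Percentiles of the time to first byte and share of slow random reads.
print(df["ttfb_ms"].quantile([0.01, 0.25, 0.50, 0.75, 0.90, 0.99]))
print("random reads slower than 100ms:", (df["ttfb_ms"] > 100).mean() * 100, "%")
```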
$ bench.py --file-count-ro 350 --rw-workers 10 --ro-workers 5 --file-size $((100 * 1024)) --no-warmup
...
WARNING:root:Objects write 6.4K/s
WARNING:root:Bytes write 131.3MB/s
WARNING:root:Objects read 24.3K/s
WARNING:root:Bytes read 99.4MB/s
WARNING:root:2.0859388857985817% of random reads took longer than 100.0ms
WARNING:root:Worst times to first byte on random reads (ms) [10751, 8217, 7655, 7446, 7366, 6919, 6722, 6515, 6481, 6079, 5918, 5839, 5823, 5759, 5634, 5573, 5492, 5335, 5114, 5105, 5009, 4976, 4963, 4914, 4913, 4854, 4822, 4668, 4658, 4605, 4593, 4551, 4537, 4489, 4470, 4431, 4418, 4411, 4385, 4327, 4298, 4224, 4090, 4082, 4070, 4010, 3868, 3865, 3819, 3818, 3815, 3805, 3798, 3755, 3719, 3716, 3711, 3704, 3688, 3612, 3608, 3606, 3579, 3543, 3537, 3527, 3493, 3450, 3441, 3356, 3346, 3338, 3319, 3313, 3294, 3272, 3264, 3258, 3244, 3183, 3179, 3160, 3145, 3136, 3127, 3123, 3119, 3107, 3098, 3093, 3090, 3083, 3082, 3068, 3057, 3052, 3029, 3028, 3022, 3022]
Jun 20 2021
When the Read Storage went over 20TB, the number of PGs of the Ceph pool was automatically doubled. As a consequence backfilling started, but it is throttled so as not to have a negative impact on performance.
Jun 13 2021
Creating a 20 billion entry global index fails because there is not enough disk space (the 2.9TB volume is full even with tunefs -m 0).
The equilibrium between reads and writes is at 5 readers and 10 writers, which leads to 1.2% of random reads above the threshold, the worst one taking 2sec. This means that care must be taken, application side, to throttle reads and writes, otherwise the penalty is a significant degradation in latency.
When the benchmark writes, the pressure of the 40 workers slows down the reads significantly.
Running the benchmark with a read-only workload (the Ceph cluster is doing nothing else) and 20 workers shows 8% of requests with a latency above the threshold:
I interrupted the benchmarks because they show reads are not behaving as expected, i.e. a large number of reads take very long and the number of reads per second is far more than what is needed. There is no throttling on reads; the number of workers is the only limit. I expected the reads would be slowed down by other factors and would not apply too much pressure on the cluster, but I was apparently wrong: throttling must be implemented to slow them down.
Jun 12 2021
For the record, creating 10 billion entries in the global index took:
Jun 7 2021
In T3149#65906, @zack wrote: how about just collecting all raw timings in an output CSV file (or several files if needed) and compute the stats downstream (e.g., with pandas)?
that would allow changing the percentiles later on as well as compute different stats, without having to rerun the benchmarks
I still think that returning a histogram of response times, in buckets of 5 or 10 ms wide ranges, may be valuable? We can then derive percentiles from that if we're so inclined.
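For illustration, deriving a percentile from such a histogram is straightforward; the sketch below assumes 10ms-wide buckets and is not part of the benchmark code:

```python
import numpy as np

def percentile_from_histogram(counts, bucket_width_ms, q):
    """Approximate the q-th percentile (0-100) of response times, where
    counts[i] is the number of requests that fell in the bucket
    [i * bucket_width_ms, (i + 1) * bucket_width_ms)."""
    cumulative = np.cumsum(np.asarray(counts, dtype=float))
    target = cumulative[-1] * q / 100.0
    i = int(np.searchsorted(cumulative, target))
    # Report the upper edge of the bucket containing the target rank.
    return (i + 1) * bucket_width_ms

# Example with made-up counts for buckets 0-10ms, 10-20ms, ...:
# percentile_from_histogram([500, 300, 120, 50, 20, 10], 10, 99)  # -> 50
```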
In T3149#65880, @olasd wrote: While you're at it, could you report quantiles for the time to first byte, instead of just a raw maximum?
Something like:
- best 1%
- best 10%
- best 25%
- median
- worst 25% / best 75%
- worst 10%
- worst 1%
- maximum
(this all might be overkill, but...)
- Collect and display the worst time to first byte, not the average
In T3149#65877, @douardda wrote: and this needs fixing.
do you mean the bench code needs fixing (to report the proper stats)?
This weekend's run was not very fruitful: the global index could not be populated as expected, and since this was only discovered Sunday morning there was no time to fall back to a smaller one, for instance 10 billion entries. A run was launched and lasted ~24h to show:
Jun 5 2021
20 billion entries were inserted in the global index. After building, the index occupies 2.5TB, i.e. ~125 bytes of raw space per entry. That's 25% more than with a 1 billion entry global index (~100 bytes per entry).
- Add insertion in the global index to the benchmark
Jun 2 2021
My notes on the meeting:
May 31 2021
- Add the generate script to ingest entries in the global index.
The call is set to Wednesday June 2nd, 2021 4pm UTC+2 at https://meet.jit.si/ApparentStreetsJokeOk