https://en.wikipedia.org/wiki/Perfect_hash_function could be used to get O(1) lookups instead of O(log(N)).
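For illustration only (not code from any existing implementation), here is a rough sketch of the "hash, displace" construction of a minimal perfect hash over a static set of object ids; the names `build_mph` and `slot_of` are made up for the example.

```python
# Sketch of a "hash, displace" minimal perfect hash over a static key set:
# a lookup becomes one table probe (O(1)) instead of a binary search over
# a sorted index (O(log N)).
import hashlib


def _h(key: bytes, salt: int) -> int:
    # Salted 64-bit hash derived from SHA-256.
    digest = hashlib.sha256(salt.to_bytes(8, "big") + key).digest()
    return int.from_bytes(digest[:8], "big")


def build_mph(keys):
    """Build a minimal perfect hash for a static set of unique byte keys."""
    n = len(keys)
    n_buckets = max(1, n // 4)
    buckets = [[] for _ in range(n_buckets)]
    for key in keys:
        buckets[_h(key, 0) % n_buckets].append(key)
    displacements = [0] * n_buckets
    occupied = [False] * n
    # Fit the largest buckets first: they are the hardest to place.
    for b in sorted(range(n_buckets), key=lambda i: -len(buckets[i])):
        if not buckets[b]:
            continue
        d = 1
        while True:
            slots = [_h(key, d) % n for key in buckets[b]]
            if len(set(slots)) == len(slots) and not any(occupied[s] for s in slots):
                for s in slots:
                    occupied[s] = True
                displacements[b] = d
                break
            d += 1
    return displacements, n


def slot_of(mph, key: bytes) -> int:
    # One hash to pick the bucket, one salted hash to pick the slot: O(1).
    displacements, n = mph
    d = displacements[_h(key, 0) % len(displacements)]
    return _h(key, d) % n
```

This only works because the set of keys in a read-only partition is fixed: the lookup cost is two hashes and one probe regardless of N, versus roughly log2(N) probes for a binary search over a sorted index.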
Mar 8 2021
Very interesting to see how this problem was presented & solved in the Hadoop ecosystem, thanks for the links.
Reopening for benchmarking purposes because there does not seem to be anything ready to use (see T3068).
jumpDB is 100% Python and therefore less than ideal for CPU performance, but for the purpose of benchmarking I/O and space usage it is conveniently ready to use.
There is not enough tooling to use SST files independently of RocksDB. Maybe RocksDB can be configured so that it only uses a single SST file?
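For what it's worth, an untested sketch of what that might look like with the python-rocksdb bindings; the option names are real RocksDB options, but the values (and whether this actually yields exactly one SST file) are assumptions.

```python
# Untested sketch: try to make RocksDB keep everything in one SST file by
# using a large memtable and target file size, then forcing a full manual
# compaction after the bulk load. Option values are guesses.
import rocksdb  # python-rocksdb bindings

opts = rocksdb.Options(
    create_if_missing=True,
    write_buffer_size=256 * 1024 * 1024,            # large memtable
    target_file_size_base=64 * 1024 * 1024 * 1024,  # very large target SST size
)
db = rocksdb.DB("bench.db", opts)

batch = rocksdb.WriteBatch()
for i in range(1_000_000):
    batch.put(b"key-%010d" % i, b"x" * 100)
    if i % 10_000 == 9_999:
        db.write(batch)
        batch = rocksdb.WriteBatch()
db.write(batch)

# Compact the whole key range down to the bottom level; with the settings
# above the result should be a single (or at least very few) SST file(s).
db.compact_range()
```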
Feb 22 2021
Ambry has been a great source of inspiration and the best fit for the Software Heritage use case. Including the partition UUID in the object id takes advantage of the immutability of the objects and gives all readers a scale-out object storage.
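As a toy illustration of that idea (the field layout below is made up, not Ambry's actual id format): because objects are immutable, the partition chosen at write time never changes, so any reader can route a read from the id alone, without a central index.

```python
# Toy illustration: an object id that embeds the partition UUID, so the id
# itself is enough to route reads to the right partition.
import hashlib
import uuid


def make_object_id(partition_uuid: uuid.UUID, content: bytes) -> bytes:
    # 16-byte partition UUID followed by the content hash (layout made up).
    return partition_uuid.bytes + hashlib.sha256(content).digest()


def partition_of(object_id: bytes) -> uuid.UUID:
    return uuid.UUID(bytes=object_id[:16])


oid = make_object_id(uuid.uuid4(), b"some immutable blob")
print(partition_of(oid))  # the partition that holds the object
```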
It turns out there are a number of suitable formats (SST from RocksDB, for one); there is no need to re-invent this wheel.
In the T3054 proposed design, objects are packed into larger files and there is no reason to reconsider this direction. There seems to be a consensus that tens of billions of individual objects are problematic: they take very long to enumerate, for one thing, and no one is doing that, which is not a great sign.
The T3054 design evolved and this benchmark won't be needed.
Feb 21 2021
Read-only partitions are stored in the Sorted String Table (SST) format.
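For context, a minimal sketch of what a Sorted String Table boils down to: records written in key order plus an index of offsets, so a read is one binary search over the index (or a single probe if a perfect hash is used, as noted above). The helper names are illustrative, not the format actually used.

```python
# Minimal Sorted String Table sketch: records written in key order, with an
# index of (key, offset) pairs kept for lookups.
import bisect
import struct


def write_sst(path, items):
    """Write (key, value) byte pairs in sorted order; return the index."""
    index = []
    with open(path, "wb") as f:
        for key, value in sorted(items):
            index.append((key, f.tell()))
            f.write(struct.pack(">II", len(key), len(value)))
            f.write(key)
            f.write(value)
    return index


def read_sst(path, index, key):
    """Binary-search the index, then read a single record from disk."""
    i = bisect.bisect_left(index, (key, 0))
    if i == len(index) or index[i][0] != key:
        return None
    with open(path, "rb") as f:
        f.seek(index[i][1])
        klen, vlen = struct.unpack(">II", f.read(8))
        f.read(klen)
        return f.read(vlen)


idx = write_sst("partition.sst", [(b"b", b"2"), (b"a", b"1")])
print(read_sst("partition.sst", idx, b"a"))  # -> b"1"
```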
"Open sourcing DataHub: LinkedIn's metadata search and discovery platform" explains how developers work on DataHub and the relationship between code internal to LinkedIn and what is published as Free Software. It is not about Ambry, and the Ambry team may work completely differently. A similar article about Ambry is dated 2016: