The current approach to the archive counters involves a significant amount of interaction with the main database, and this may lead to critical issues, as we have seen in T2828, so we need to find a way to maintain accurate counters without overloading the database.
A first attempt to maintain the archive counters without database interaction was made by having each loader adding to the global counters the number of new objects encountered. Unfortunately, this approach led to the counters ending up showing a value greater than the real one: indeed, if several workers happen to ingest forks of a same new project (and it does happen!), then the new objects are counted multiple times.
This tasks proposes a simple alternative approach to maintain archive counters that both drastically reduces interaction with the main database and avoids counter overrun.
The ArchiveCounter component
The key to the approach is a new ArchiveCounter service that maintains one standard Bloom filters and a global counter for each of the different kind of objects that are counted (contents, directories, revisions, releases, snapshots, origins). The Bloom filter allows very fast answers to the question "is this object new?" with no false negatives and a controlled probability of false positives. The global counter keeps the running count of the new objects seen so far (up to the probability of false positives).
In the following, we detail the operation just for the case of contents.
Loaders send to the ArchiveCounter the list newcontents of identifiers of the (supposedly) new contents they encounter, and ArchiveCounter performs the following simple operation:
for id in newcontents: if not(BloomFilter.add(id)): # BloomFilter.add returns True if id considered known contentcounter += 1
The operation BloomFilter.add will return a false positive (i.e. consider a content as already seen, while it is actually new, hence resulting in counter underrun), with a probability that we can control. For example (see details), we can get a 1% probability of false positives with up to 20billion objects with a 24GB Bloom filter and 7 hash functions, and we can go to 0.1% with a 36GB Bloom filter and 10 hash functions.
With these parameters, at any moment in time contentcounter contains an underapproximation of the real number of contents stored in the archive that is totally acceptable for our use case.
Every so often, one can correct the counter drift by running an expensive query on the database, and update contentcounter accordingly.
Dillinger and Panagiotis show in their 2004 paper "Bloom Filters in Probabilistic Verification", that one can resort to using just two hash functions, which reduces the computation cost w.r.t. the standard Bloom filter construction.
But in our case, the "keys" are already randomly distributed, as they are SHA1 cryptographic hashes! So we could simply skip the Bloom hashing phase altogether, and do the 2 hashing construction using (fragments of) the key to be inserted in the filter.
In order to get to nominal operation, one can progressively fill the Bloom filters as follow:
- turn ArchiveCounter on, with an empty Bloom filter
- reconfigure loaders (or swh-storage) to write to ArchiveCounter
- dump the list of hashes from swh-storage or swh-journal
- switch the archive counters to use ArchiveCounter
This is sounds because all objects created after step 2 will be written directly, and all objects created before step 3 will be backfilled by step 3. (and the overlap is dealt with by ArchiveCounter)
Implementing a Bloom filter is straightforward, and there are many libraries available, but with our space requirements, we need to benchmark existing implementations. I came across this Python3 library that has the added benefit of using a memory mapped data structure, that allows to save the filter on disk, and it could be a good candidate, but the ultimate choice needs to be made based on performance and stability, independently of the programming language used.
Dedicated page on Devopedia