
Next generation archive counters
Closed, Resolved · Public

Description

The current approach to the archive counters involves a significant amount of interaction with the main database, which can lead to critical issues, as we have seen in T2828. We need to find a way to maintain accurate counters without overloading the database.

A first attempt to maintain the archive counters without database interaction was made by having each loader add the number of new objects it encountered to the global counters. Unfortunately, this approach led to the counters showing a value greater than the real one: if several workers happen to ingest forks of the same new project (and it does happen!), the new objects are counted multiple times.

This task proposes a simple alternative approach to maintaining the archive counters that both drastically reduces interaction with the main database and avoids counter overrun.

The ArchiveCounter component

The key to the approach is a new ArchiveCounter service that maintains one standard Bloom filter and a global counter for each of the different kinds of objects that are counted (contents, directories, revisions, releases, snapshots, origins). The Bloom filter allows very fast answers to the question "is this object new?", with no false negatives and a controlled probability of false positives. The global counter keeps the running count of new objects seen so far (up to the probability of false positives).

In the following, we detail the operation just for the case of contents.

Loaders send to the ArchiveCounter the list newcontents of identifiers of the (supposedly) new contents they encounter, and ArchiveCounter performs the following simple operation:

for id in newcontents:
    if not BloomFilter.add(id):  # BloomFilter.add returns True if id is already (probably) known
        contentcounter += 1

The operation BloomFilter.add will return a false positive (i.e. consider a content as already seen while it is actually new, hence resulting in counter underrun) with a probability that we can control. For example (see details), we can get a 1% probability of false positives for up to 20 billion objects with a 24GB Bloom filter and 7 hash functions, and we can go down to 0.1% with a 36GB Bloom filter and 10 hash functions.
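
As a back-of-the-envelope check of these parameters (a sketch using the standard approximation p ≈ (1 - e^(-kn/m))^k; the exact figures depend on whether "GB" means 10^9 or 2^30 bytes):

from math import exp

def false_positive_rate(m_bits, n_items, k_hashes):
    # approximate false positive probability of a standard Bloom filter
    return (1 - exp(-k_hashes * n_items / m_bits)) ** k_hashes

GB_BITS = 8 * 2 ** 30
print(false_positive_rate(24 * GB_BITS, 20e9, 7))   # ~0.007  (roughly the 1% figure)
print(false_positive_rate(36 * GB_BITS, 20e9, 10))  # ~0.0006 (roughly the 0.1% figure)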

With these parameters, at any moment in time contentcounter contains an under-approximation of the real number of contents stored in the archive, which is totally acceptable for our use case.

Every so often, one can correct the counter drift by running an expensive query on the database and updating contentcounter accordingly.

Algorithmic remarks

Dillinger and Manolios show in their 2004 paper "Bloom Filters in Probabilistic Verification" that one can resort to using just two hash functions, which reduces the computation cost w.r.t. the standard Bloom filter construction.

But in our case, the "keys" are already randomly distributed, as they are SHA1 cryptographic hashes! So we could simply skip the Bloom hashing phase altogether, and perform the two-hash construction using (fragments of) the key to be inserted in the filter.
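
A minimal sketch of that idea, assuming we take two 64-bit fragments of the digest as the two hashes of the double-hashing construction (the fragment sizes and the exact combination are open choices, not a settled design):

def bloom_indexes(sha1, k, m):
    # h1 and h2 are just two fragments of the (already uniform) 160-bit digest;
    # the k probe positions are h1 + i*h2 mod m. h2 is forced to be odd so the
    # positions cannot all collapse onto h1 when the fragment happens to be 0.
    h1 = int.from_bytes(sha1[:8], "big")
    h2 = int.from_bytes(sha1[8:16], "big") | 1
    return [(h1 + i * h2) % m for i in range(k)]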

Initialisation

In order to get to nominal operation, one can progressively fill the Bloom filters as follows:

  1. turn ArchiveCounter on, with an empty Bloom filter
  2. reconfigure loaders (or swh-storage) to write to ArchiveCounter
  3. dump the list of hashes from swh-storage or swh-journal
  4. switch the archive counters to use ArchiveCounter

This is sound because all objects created after step 2 will be written directly, and all objects created before step 3 will be backfilled by step 3 (the overlap is dealt with by ArchiveCounter).

Implementation

Implementing a Bloom filter is straightforward, and there are many libraries available, but given our space requirements, we need to benchmark existing implementations. I came across this Python3 library, which has the added benefit of using a memory-mapped data structure that allows saving the filter on disk; it could be a good candidate, but the ultimate choice needs to be made based on performance and stability, independently of the programming language used.

Resources
Dedicated page on Devopedia

Event Timeline


A Python library may be an issue, as it requires a central process with a global lock. Sharding by hash may fix the issue, though.

ArchiveCounter needs to be filled up with all the known objects in the database: this requires stopping archival

Not necessarily. We can also do it like this:

  1. turn ArchiveCounter on, with an empty Bloom filter
  2. reconfigure loaders (or swh-storage) to write to ArchiveCounter
  3. dump the list of hashes from swh-storage or swh-journal
  4. switch the archive counters to use ArchiveCounter

This is sound because all objects created after step 2 will be written directly, and all objects created before step 3 will be backfilled by step 3 (the overlap is dealt with by ArchiveCounter).

A Python library may be an issue, as it requires a central process with a global lock. Sharding by hash may fix the issue, though.

Sure, no obligation to stick to Python for this (the underlying implementation of the library pointed above is in C/C++)

ArchiveCounter needs to be filled up with all the known objects in the database: this requires stopping archival

Not necessarily. We can also do it like this:

  1. turn ArchiveCounter on, with an empty Bloom filter
  2. reconfigure loaders (or swh-storage) to write to ArchiveCounter
  3. dump the list of hashes from swh-storage or swh-journal
  4. switch the archive counters to use ArchiveCounter

This is sound because all objects created after step 2 will be written directly, and all objects created before step 3 will be backfilled by step 3 (the overlap is dealt with by ArchiveCounter).

Good point!

Updated the proposal with your suggestions, thanks!

Thanks for sketching out this proposal! It looks quite promising (and neat!).

The current PostgreSQL implementation of counters works, but we're walking a tightrope balancing resource usage/efficiency/usefulness, and there's a concern that Cassandra will completely blow this up. It's great to be exploring a more effective design that should be more durable than any given storage implementation.

I think the basic design, as well as the initialization plan, are sound.

The object types and amounts we want to count are:

object type     | current count | relative to content (%)
content         | 9 555 554 872 | 100
directory       | 8 040 838 713 | 84.14
revision        | 2 007 830 522 | 21.01
origin_visit    |   868 742 559 | 9.091
origin          |   150 465 324 | 1.574
snapshot        |   142 312 730 | 1.489
release         |    16 744 148 | 0.1752
skipped_content |       140 787 | 0.001473

Presumably, the origin_visit table will grow faster than the snapshot table, which will in turn grow faster than the origin table itself. We should definitely tune the size of the Bloom filters accordingly, to avoid using tons of memory for little benefit.

I think we should be able to decouple these counters completely from the loaders, and have them directly updated/handled by a client of the swh-journal. This would be a "centralized" component, but which we can parallelize quite heavily thanks to basic kafka design. We can also leverage the way kafka clients do parallelism to sidestep the locking issues arising in a potentially distributed filter.

There are a few pros and cons to the "full journal client" approach that I can think of:

  • cons:
    • the journal has very slightly more objects than the database, as some objects can be written to the journal and be eventually rejected by postgres
      • (this is a transient issue and an overwhelming majority of objects should eventually make it into postgres; that overcount can be handled within the false positive margin of the bloom filters)
    • one more separate component to manage
      • (but we need to manage the component that would allow distributed access to the bloom filters for workers anyway)
    • centralized load in the counter management component instead of diffuse load across workers (it's also a pro!)
    • if we use sharded bloom filters/counters along the lines of kafka partitioning, correcting the counters for false positives is harder than if we shard objects using their SWHID (the number of objects within a range of SWHIDs can be obtained with a postgres query, while we'd need to roll our own "get an exact count of unique swhids per partition" journal client).
  • pros:
    • no additional load or even code on the loaders
    • centralized monitoring and "load tuning" of the counter component via the monitoring of kafka consumer offsets
    • built-in parallelization design with guaranteed consistent sharding of objects (orthogonal to the swhids) removes the filter locking issue: we can map one bloom filter per kafka partition, which will only ever be consumed by a single client process at once (see the sketch after this list). Considering the "natural kafka sharding" is orthogonal to the object SWHIDs, we should be able to use the SWHID directly as input to the bloom filter without fear of inefficiency caused by non-uniform SWHIDs in a given shard.
    • using a journal client removes the need for a specific filter initialization step; we initialize using the full journal, and we can switch over the "display" when the consumer offset is small enough
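
A minimal sketch of the "one bloom filter per kafka partition" idea mentioned above, assuming a plain confluent-kafka consumer; the broker address, the topic name and the use of a Python set as a stand-in for the real Bloom filter are illustrative assumptions, not the actual swh-journal client machinery:

from collections import defaultdict
from confluent_kafka import Consumer

filters = defaultdict(set)   # partition -> "Bloom filter" (a plain set() stands in here)
counters = defaultdict(int)  # partition -> number of new objects seen

consumer = Consumer({
    "bootstrap.servers": "journal.example.org:9092",  # placeholder address
    "group.id": "archive-counter-content",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["swh.journal.objects.content"])  # exact topic name may differ

while True:
    msg = consumer.poll(timeout=1.0)
    if msg is None or msg.error():
        continue
    part = msg.partition()
    key = msg.key()  # the object id, already uniformly distributed
    if key not in filters[part]:
        filters[part].add(key)
        counters[part] += 1
    # the displayed counter is sum(counters.values())

Since each partition is only ever consumed by a single process, no locking is needed on the per-partition filter.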

I also have the "full journal" approach in mind after a quick read of this neat proposal :-)

In T2912#55849, @olasd wrote:

Thanks for sketching out this proposal! It looks quite promising (and neat!).

Happy you like it!

I think we should be able to decouple these counters completely from the loaders, and have them directly updated/handled by a client of the swh-journal. This would be a "centralized" component, but which we can parallelize quite heavily thanks to basic kafka design. We can also leverage the way kafka clients do parallelism to sidestep the locking issues arising in a potentially distributed filter.

Maybe my writing was not all that clear: I also had in mind a single centralised component (the ArchiveCounter) per Bloom filter, receiving the lists newcontents of ids from the loaders.
Getting the feed of ids from swh-journal instead of from the loaders is really neat: we avoid touching the loader code, and we gain a better capability of monitoring the load on the ArchiveCounter, so I'm all for it :-)

A few comments about the cons below:

There are a few pros and cons to the "full journal client" approach that I can think of:

  • cons:
    • the journal has very slightly more objects than the database, as some objects can be written to the journal and be eventually rejected by postgres
      • (this is a transient issue and an overwhelming majority of objects should eventually make it into postgres; that overcount can be handled within the false positive margin of the bloom filters)

Considering the Bloom filter's slight undercount, and how marginal the phenomenon is, this does not seem to be an issue indeed.

  • one more separate component to manage
    • (but we need to manage the component that would allow distributed access to the bloom filters for workers anyway)

Agree, this is not really an issue.

  • centralized load in the counter management component instead of diffuse load across workers (it's also a pro!)

I see it also as a big pro: we can control the load better if we have only one source of inputs, instead of a gazillion loaders. We should expect much higher throughput in our case than with a standard Bloom filter, though, as we can bypass the costly hash computations by using (fragments of) the SWHIDs as the two hashes that are sufficient for the construction proposed by Dillinger and Manolios. Some care needs to be taken (when all bits happen to be 0), but the speedup should be significant compared to computing 10 hashes. A real test will tell :-)

  • if we use sharded bloom filters/counters along the lines of kafka partitioning, correcting the counters for false positives is harder than if we shard objects using their SWHID (the number of objects within a range of SWHIDs can be obtained with a postgres query, while we'd need to roll our own "get an exact count of unique swhids per partition" journal client).

If the filters are as fast as I hope (see point above), I believe we can avoid this extra complication and just keep one filter per counter.

In T2912#55849, @olasd wrote:

I think we should be able to decouple these counters completely from the loaders, and have them directly updated/handled by a client of the swh-journal. This would be a "centralized" component, but which we can parallelize quite heavily thanks to basic kafka design. We can also leverage the way kafka clients do parallelism to sidestep the locking issues arising in a potentially distributed filter.

Maybe my writing was not all that clear: I also had in mind a single centralised component (the ArchiveCounter) per Bloom filter, receiving the lists newcontents of ids from the loaders.
Getting the feed of ids from swh-journal instead of from the loaders is really neat: we avoid touching the loader code, and we gain a better capability of monitoring the load on the ArchiveCounter, so I'm all for it :-)

It looks like you already agree, but FWIW I'd also like to have a dedicated (micro)service that keeps an up-to-date bloom filter for the entire archive, with a REST API.
It might be useful for other use cases (swh-scanner comes to mind, but I'm sure we'll find others as time passes).

It looks like you already agree, but FWIW I'd also like to have a dedicated (micro)service that keeps an up-to-date bloom filter for the entire archive, with a REST API.
It might be useful for other use cases (swh-scanner comes to mind, but I'm sure we'll find others as time passes).

Very good point! So we have an extra reason to move forward ;-)

It seems redis has a HyperLogLog functionality[1] that can match the requirements (bloom filter / limited deviation / small memory footprint / efficiency).

I made a small poc[2] to count the number of content objects from the content topic, and it seems to work quite well, with less than 1% deviation over 130 000 000 content objects in staging (it took ~2 hours on my laptop).

The usage is quite simple, with basically two redis instructions:

To add an element (the key can be used as-is, without deserializing it):

redis_client.pfadd(collection, message.key)

To retrieve the estimated number of elements:

current_count = redis_client.pfcount(collection)
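
For instance, with the redis-py client (the host, the redis key name and the example values are placeholders; in the poc the values are the raw kafka message keys):

import redis

redis_client = redis.Redis(host="localhost", port=6379)

# message keys straight from the journal (raw sha1 bytes), no deserialization needed
batch_of_keys = [bytes.fromhex("da39a3ee5e6b4b0d3255bfef95601890afd80709"),
                 bytes.fromhex("34973274ccef6ab4dfaaf86599792fa9c3fe4689")]

redis_client.pfadd("counters:content", *batch_of_keys)   # add to the HyperLogLog
print(redis_client.pfcount("counters:content"))          # estimated number of distinct keys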

Perhaps it can help to have a first simple version focused on the plumbing.

[1] https://redis.io/commands/pfcount / http://antirez.com/news/75
[2] https://forge.softwareheritage.org/source/snippets/browse/master/vsellier/counters/counter.py

Thanks @vsellier

Bloom filters are still on the table for other use cases, like testing super quickly for contents that we do not have, but if nobody has strong objections, this seems the way to go for the counters (very small footprint, small under/over counting errors, thanks to Philippe Flajolet's magic :-))

For information, the poc was launched on the content topic of production; the results seem acceptable, with a slightly higher count on the redis counter, probably due to some messages sent to kafka but not persisted in the database.

Database count | redis count   | Ratio
9 737 467 015  | 9 774 001 572 | 1.003751957

thanks to Philippe Flajolet's magic :-))

I agree, it really feels like magic to have this working with only a 12 kB collection

❯ time psql service=mirror-swh -P pager=off -c "select now(), count(*) from content"
              now              |   count    
-------------------------------+------------
 2021-01-29 08:47:32.527878+00 | 9737467015
(1 row)

Redis count :

01-29-2021 09:47:30 redis_counter=9774001572 (0m/s redis: 0.000000s)                                                                                                                      
01-29-2021 09:47:35 redis_counter=9774001572 (0m/s redis: 0.000000s)

I don't think this solves the issue of overestimating the number of objects, when two threads insert the same objects at the same time.

With PostgreSQL you can probably get the number of inserted objects after committing the transaction and increment the counter with that number; but that is not true with Cassandra, because it is only eventually consistent and will silently merge identical rows.

I'm not sure I understand: the hyperloglog function is precisely used to deduplicate the messages based on their keys (at least in the poc).

The HyperLogLog data structure can be used in order to count unique elements in a set using just a small constant amount of memory, specifically 12k bytes for every HyperLogLog (plus a few bytes for the key itself).

(from https://redis.io/commands/pfcount)

One situation I see that is not covered is when entries are inserted in the journal but the database insert is rolled back.

I don't think this solves the issue of overestimating the number of objects, when two threads insert the same objects at the same time.

In T2912#57655, @vsellier wrote:

I'm not sure I understand,

Just to clarify: both the Bloom filter approach and HyperLogLog are probabilistic, so one gets an estimate and not the real value.
The difference is that with a Bloom filter we are guaranteed to always underestimate the real value, while HyperLogLog may underestimate or overestimate.
Underestimating seems preferable, as one can always bump up the shown counter to the real value every now and then by updating a drift counter.
With an overestimation, updating to the real value every now and then may show up as a negative bump in the graph (as has happened before), which is counterintuitive for our users.

But if it turns out we can keep the error below 1%, then it becomes quite attractive to use only 12 kB instead of several GBs, and one can find a way to smooth the graph when a catch-up is needed.

The question is whether the error figure will stay below 1% even for contents, revisions and directories, which are in the range of 10^10.
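
For reference, the theoretical standard error of HyperLogLog is 1.04/sqrt(m), where m is the number of registers, and it is roughly independent of the cardinality being estimated; Redis's dense encoding uses 2^14 registers (the 12 kB mentioned above), which gives about 0.81%:

from math import sqrt

m = 2 ** 14                 # registers in Redis's dense HyperLogLog (~12 kB)
std_error = 1.04 / sqrt(m)  # ~0.0081
print(f"{std_error:.2%}")   # 0.81%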

As said before, there are other uses for the Bloom filter approach so we'll need to look into it in any case, but if we have a quick and cheap satisfactory solution for the counters, why not ... :-)

Oh indeed, it sounds good then. :)

These are the results for the count of directories and revisions (the content count is still running, so these are fresh statistics):

Object type | Database count | Counter       | Ratio
Content     | 9 738 530 517  | 9 774 472 095 | 1.0037
Directory   | 8 194 371 079  | 8 188 615 221 | 0.9993
Revision    | 2 044 980 813  | 2 059 724 031 | 1.0072

It seems to remain in the <1% range.

Thanks @vsellier, that seems quite ok indeed. The only question left is whether the estimator implemented is monotonic (i.e. we will never have negative bumps in the graph :-))

Thanks @vsellier, that seems quite ok indeed. The only question left is whether the estimator implemented is monotonic (i.e. we will never have negative bumps in the graph :-))

The question is interesting per se, but may I suggest (for reasons discussed in the past) to just remove the graphs from the main archive.s.o page (we can keep them in grafana or elsewhere), instead of adding this as a requirement for what otherwise already looks like a good solution to the efficient counter problem?

In T2912#58063, @zack wrote:

Thanks @vsellier, that seems quite ok indeed. The only question left is whether the estimator implemented is monotonic (i.e. we will never have negative bumps in the graph :-))

may I suggest (for reasons discussed in the past) to just remove the graphs from the main archive.s.o page

We decided to keep the counters.

The question is not an abstract one: there are implementations of HyperLogLog that are monotonic; maybe the Redis one already is, we just need to know.

Another bonus point of this approach is that we could also unstick the indexer
counters (the graphs for those have been stuck since November 2020) [1] [2]

They share the same difficulties when it comes to counting data there.

[1] https://grafana.softwareheritage.org/goto/oMRgRsLGk

[2] the indexer storage is able to write to its own topics now; I don't immediately
recall the blocker for why that's not deployed yet (there is one, but it's
unrelated anyhow)

The question is not an abstract one: there are implementations of HyperLogLog that are monotonic; maybe the Redis one already is, we just need to know.

Yes, sure, it's important; I will record the counter values over the next few days to make sure the progression is indeed monotonic.

I asked one of the authors of the original HyperLogLog paper (not Philippe, who unfortunately passed away years ago :-().
The original HyperLogLog has three different behaviours: one for small cardinalities, another for medium cardinalities, and a third for very large cardinalities.
There is indeed a risk of breaking monotonicity at the boundaries between segments, but within each segment it is monotonic.
Our counters are already in the "very large cardinality" zone, so we should be safe with any implementation.

So, at this point, I would say it's a definite go for this solution.
Thanks @vsellier :-)

Nice, thanks for confirming this at the source.

This is the behavior of the counters since my last comment (~3 days)
They seem to increase step by step but look monotonic over this short period.

Directories, Contents, Revisions: [counter graphs attached to the task]

@vsellier nice. Note that if we draw these with a y-axis starting from 0, the step shape will be really negligible, so IMHO it's really not a problem.

I wrote a proposal for the next steps [1] so we could start the work on these counters. All comments/contributions are welcome.

[1] https://hedgedoc.softwareheritage.org/s/hIqe-DAOL

vsellier changed the task status from Open to Work in Progress. Mar 5 2021, 11:07 AM
vsellier claimed this task.

Let's start the subject ;)

Staging webapp[1] and webapp1 on production [2] are now configured to use swh-counters to display the historical values and the live object counts.

As planned, the authors field is no longer present, as the information is not directly available from the journal.
If the field needs to be reactivated, it will be possible to compute the information from the revisions and the releases, but the journal client will need to be improved.

If the final result is considered correct, the configuration can be deployed on the main webapp.

The old counters could be left in place for some time in case a rollback is needed.

[1] https://webapp.staging.swh.network
[2] https://webapp1.internal.softwareheritage.org

Staging webapp[1] and webapp1 on production [2] are now configured to use swh-counters to display the historical values and the live object counts.

Great to see this getting to the finish line, congrats!

The authors field is no longer present, as the information is not directly available from the journal.
If the field needs to be reactivated, it will be possible to compute the information from the revisions and the releases, but the journal client will need to be improved.

I would really like to keep the author counter: how complex is it to add it?

I would really like to keep the author counter: how complex is it to add it?

There is no real complexity: it's just a matter of implementing a dedicated journal client that deserializes the messages to extract the information, instead of just sending the message's key to redis.
This kind of journal client will be necessary in any case if we want to extend the usage of the counters to other scopes (metadata count, origins per forge, ...)
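
A minimal sketch of what such a deserializing client could look like, assuming the journal client hands over already-decoded revision/release dicts carrying an author fullname, and using a hypothetical "counters:person" redis key (all names here are illustrative, not the deployed swh-counters code):

import redis

redis_client = redis.Redis(host="localhost", port=6379)  # placeholder

def process_objects(object_type, objects):
    # Count distinct authors by feeding their fullnames to a HyperLogLog,
    # instead of the kafka message keys used for the other counters.
    if object_type in ("revision", "release"):
        fullnames = [obj["author"]["fullname"]
                     for obj in objects
                     if obj.get("author")]  # releases may have no author
        if fullnames:
            redis_client.pfadd("counters:person", *fullnames)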

This kind of journal client will be necessary in any case if we want to extend the usage of the counters to other scopes (metadata count, origins per forge, ...)

Let's go for it, then. Could you take this over?

Let's go for it, then. Could you take this over?

Yes, @vsellier opened T3251 for that part ;)

Last bits deployed on archive.s.o (including the author counters).

Last bits deployed on archive.s.o (including the author counters).

Great, works like a charm.

What about the old counter pipeline? Has it been decommissioned already?

What about the old counter pipeline? Has it been decommissioned already?

I don't think so as I do not recall seeing diffs about clean up.

In any case, it's not part of what's currently deployed (so there's no risk of
data mangling, if that's part of the concern).

What about the old counter pipeline? Has it been decommissioned already?

I don't think so as I do not recall seeing diffs about clean up.

In any case, it's not part of what's currently deployed (so there's no risk of
data mangling, if that's part of the concern).

I also recall now that vincent added a graph [1] recently enough.
This is to try and compare the counter approaches a bit.

So that's still using the old plumbing at least for that part.

[1] https://grafana.softwareheritage.org/goto/BlkwHorMz

I also recall now that vincent added a graph [1] recently enough.

This is to try and compare the counter approaches a bit.

So that's still using the old plumbing at least for that part.

[1] https://grafana.softwareheritage.org/goto/BlkwHorMz

Ok thanks, we can keep this for a while to be fully confident (a couple of months).
Then it will be nice to just disable the old pipeline, to reduce the load on the DB (which was our initial objective :-))

vsellier moved this task from Backlog to done on the System administration board.

The cleanup of the old counters is done, so this task can be closed.