I asked one of the authors of the original HyperLogLog paper (not Philippe, who unfortunately passed away years ago :-()
The original HyperLogLog has three different behaviours: one for small cardinalities, another for medium cardinalities, and a third for very large cardinalities.
There is indeed a risk of breaking monotonicity at the boundaries between segments, but in each segment it is monotonic.
Our counters are already in the "very large cardinality" zone, so we should be safe with any implementation.
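For reference, a minimal sketch of those three regimes as described in the original Flajolet et al. paper (32-bit hash variant; the constants follow the paper, not necessarily what Redis implements internally):

import math

def hll_estimate(registers):
    """Cardinality estimate from a list of HyperLogLog registers, showing the three
    regimes mentioned above (original 2007 algorithm, 32-bit hashes, m >= 128)."""
    m = len(registers)
    alpha = 0.7213 / (1 + 1.079 / m)                 # bias correction constant
    raw = alpha * m * m / sum(2.0 ** -r for r in registers)

    if raw <= 2.5 * m:                               # small cardinalities: linear counting
        zeros = registers.count(0)
        return m * math.log(m / zeros) if zeros else raw
    if raw <= (1 << 32) / 30:                        # medium cardinalities: raw estimate
        return raw
    # very large cardinalities: correct for 32-bit hash collisions
    return -(1 << 32) * math.log(1 - raw / (1 << 32))

The possible discontinuities are exactly at the two thresholds above; within each branch the estimate only grows as the registers increase.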
Feb 4 2021
The question is not an abstract one: there are implementations of HyperLogLog that are monotonic; maybe the Redis one already is, we just need to know.
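One way to find out empirically: a quick check against a local Redis with the redis-py client (throwaway key name and volumes; this only probes the behaviour, it proves nothing):

import os
import redis

r = redis.Redis()
key = "hll:monotonicity-check"                   # arbitrary throwaway key
r.delete(key)

previous = 0
for i in range(100_000):
    r.pfadd(key, os.urandom(20))                 # random 20-byte "sha1-like" ids
    current = r.pfcount(key)
    if current < previous:
        print(f"negative bump after insertion {i}: {previous} -> {current}")
    previous = current
print("final estimate:", previous)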
Feb 3 2021
Another bonus point with this approach is that we could also unstick the indexer
counters (graphs for those have been stuck since November 2020) [1] [2]
In T2912#58063, @zack wrote:
In T2912#58062, @rdicosmo wrote:
Thanks @vsellier, that seems quite ok indeed. The only question left is to know if the estimator implemented is monotonic (i.e. we will never have negative bumps in the graph :-))
may I suggest (for reasons discussed in the past) just removing the graphs from the main archive.s.o page
We decided to keep the counters.
In T2912#58062, @rdicosmo wrote:
Thanks @vsellier, that seems quite ok indeed. The only question left is to know if the estimator implemented is monotonic (i.e. we will never have negative bumps in the graph :-))
Thanks @vsellier, that seems quite ok indeed. The only question left is to know if the estimator implemented is monotonic (i.e. we will never have negative bumps in the graph :-))
Feb 1 2021
These are the results for the count of the directories and revisions (the content count is still running, so fresher statistics will follow):
Jan 29 2021
Oh indeed, it sounds good then. :)
In T2912#57643, @vlorentz wrote:
I don't think this solves the issue of overestimating the number of objects, when two threads insert the same objects at the same time.
In T2912#57655, @vsellier wrote:
I'm not sure I understand: the hyperloglog function is precisely used to deduplicate the messages based on their keys (at least in the poc).
I'm not sure I understand: the hyperloglog function is precisely used to deduplicate the messages based on their keys (at least in the poc).
I don't think this solves the issue of overestimating the number of objects, when two threads insert the same objects at the same time.
For information, the poc was launched on the content topic of production; the results seem acceptable, with a slightly higher count on the redis counter, probably due to some messages sent to kafka but not persisted in the database.
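For reference, a rough sketch of the kind of consumer used in such a poc, written directly against confluent-kafka and redis-py (broker, group id, topic and key names are placeholders, not the actual poc configuration):

import redis
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "kafka1:9092",          # placeholder broker
    "group.id": "counters-poc",                  # placeholder consumer group
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["swh.journal.objects.content"])   # placeholder topic name

r = redis.Redis(host="counters-redis")           # placeholder redis host

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    # The message key is the object id; PFADD is idempotent, so seeing the same id
    # twice (two loaders, journal replays, ...) does not inflate the estimate.
    r.pfadd("counters:content", msg.key())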
Jan 28 2021
Bloom filters are still on the table for other use cases, like testing super quickly for contents that we do not have, but if nobody has strong objections, this seems to be the way to go for the counters (very small footprint, small under/over counting errors, thanks to Philippe Flajolet's magic :-))
Jan 25 2021
It seems redis has a Hyperloglog functionality[1] that matches the requirements (bloom filter / limited deviation / small memory footprint / efficiency).
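A quick way to see why it fits the footprint/deviation requirements, using redis-py (key name and volumes are arbitrary):

import os
import redis

r = redis.Redis()
key = "hll:sizing-demo"                          # arbitrary throwaway key
r.delete(key)

n = 1_000_000
for _ in range(n // 10_000):
    r.pfadd(key, *(os.urandom(20) for _ in range(10_000)))   # 10k random "ids" per call

estimate = r.pfcount(key)
print("estimate: %d (error %.2f%%)" % (estimate, 100 * abs(estimate - n) / n))
print("memory used:", r.memory_usage(key), "bytes")          # stays around 12 KB

The Redis documentation gives a standard error of about 0.81% whatever the cardinality, with a per-key footprint that never exceeds roughly 12 KB.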
Jan 6 2021
The last check no longer appears in icinga.
Jan 5 2021
It looks like you already agree, but FWIW I'd also like to have a dedicated (micro)service that keeps an up-to-date bloom filter for the entire archive, with a REST API.
It might be useful for other use cases (swh-scanner comes to mind, but I'm sure we'll find others as time passes).
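For what it's worth, a very rough sketch of what such a service could look like, with Flask and a naive in-memory Bloom filter (the endpoints, sizing and filter implementation are all placeholders, not a design proposal):

import hashlib

from flask import Flask, jsonify, request

class BloomFilter:
    def __init__(self, size_bits=2**30, num_hashes=7):   # ~128 MiB of bits, made-up sizing
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item: bytes):
        # derive num_hashes bit positions from salted sha256 digests
        for i in range(self.num_hashes):
            digest = hashlib.sha256(i.to_bytes(2, "big") + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item: bytes) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: bytes) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))

app = Flask(__name__)
bloom = BloomFilter()

@app.route("/add", methods=["POST"])
def add():
    # expects a JSON list of hex-encoded object ids
    for hex_id in request.get_json():
        bloom.add(bytes.fromhex(hex_id))
    return jsonify(status="ok")

@app.route("/contains/<hex_id>")
def contains(hex_id):
    # "maybe present" (false positives possible) or "definitely absent"
    return jsonify(present=bytes.fromhex(hex_id) in bloom)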
In T2912#55860, @rdicosmo wrote:
In T2912#55849, @olasd wrote:
I think we should be able to decouple these counters completely from the loaders, and have them directly updated/handled by a client of the swh-journal. This would be a "centralized" component, but which we can parallelize quite heavily thanks to basic kafka design. We can also leverage the way kafka clients do parallelism to sidestep the locking issues arising in a potentially distributed filter.
Maybe my writing was not all that clear: I also had in mind a single centralised component (the ArchiveCounter) per Bloom filter, receiving the newcontents lists of ids from the loaders.
Getting the feed of ids from swh-journal instead of from the loaders is really neat: we avoid touching the loader code, and we gain better monitoring of the load on the ArchiveCounter, so I'm all for it :-)
In T2912#55849, @olasd wrote:
Thanks for sketching out this proposal! It looks quite promising (and neat!).
I also have the "full journal" approach in mind after a quick reading of this neat proposal :-)
Thanks for sketching out this proposal! It looks quite promising (and neat!).
Jan 4 2021
Dec 22 2020
Updated the proposal with your suggestions, thanks!
In T2912#55487, @vlorentz wrote:
A Python library may be an issue, as it requires a central process with a global lock. Sharding by hash may fix the issue, though.
A Python library may be an issue, as it requires a central process with a global lock. Sharding by hash may fix the issue, though.
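For illustration, sharding by hash could look roughly like this, here with one Redis HyperLogLog key per shard (shard count and key names are made up):

import redis

NUM_SHARDS = 16
r = redis.Redis()

def shard_of(object_id: bytes) -> int:
    # object ids are already uniformly distributed hashes (sha1/sha1_git),
    # so the first byte spreads them evenly across shards
    return object_id[0] % NUM_SHARDS

def record(object_id: bytes) -> None:
    # each shard key can be owned by a separate process, so no global lock is needed
    r.pfadd(f"counters:content:{shard_of(object_id)}", object_id)

def total() -> int:
    # the shards partition the id space, so the union over all shard keys
    # gives the overall count
    return r.pfcount(*(f"counters:content:{s}" for s in range(NUM_SHARDS)))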
Dec 8 2020
Changing the status to "Resolved", as it seems there is nothing more to do on this task now that the counters are being updated again.
Dec 7 2020
- an increment of 250GB was added via the proxmox UI (for a total of 500GB now)
- the disk was resized on the OS side:
root@pergamon:~# parted /dev/vdc
GNU Parted 3.2
Using /dev/vdc
Welcome to GNU Parted! Type 'help' to view a list of commands.
(parted) print
Model: Virtio Block Device (virtblk)
Disk /dev/vdc: 537GB
Sector size (logical/physical): 512B/512B
Partition Table: gpt
Disk Flags:
The disk will be resized to avoid service disruption in the short term, before looking at T1362
Dec 2 2020
I've now done softwareheritage=> update object_counts set single_update=true;, which will make all counters get their updates via cron. I've also shortened the cron delay to be 2 hours instead of 4 (providing an update for each counter every 18 hours).
One more:
Current status of counting the objects in one go:
I took the time to create a diagram of the pipeline to help me summarize the subject.
It should help to deploy the counters in staging.
SVG:
Dec 1 2020
Btw, the main point of using buckets for object counts of large tables is that *very long running* transactions kill performance for the whole database, and have knock-on effects for logical replication. In effect, if the time we take to update the buckets is 300 times larger than making a single update, then we need to rethink this tradeoff...
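For the record, the bucketed approach boils down to something like this (an illustrative psycopg2 sketch with made-up table and column names, not the actual stored procedures):

import psycopg2

conn = psycopg2.connect("dbname=softwareheritage")   # placeholder connection string
conn.autocommit = True                               # each bucket runs in its own short transaction

BUCKET_SIZE = 10_000_000
total = 0
with conn.cursor() as cur:
    cur.execute("SELECT max(object_id) FROM content")    # made-up surrogate key column
    max_id = cur.fetchone()[0] or 0
    for low in range(0, max_id + 1, BUCKET_SIZE):
        # many cheap, short-lived queries instead of one COUNT(*) holding a
        # very long transaction open
        cur.execute(
            "SELECT count(*) FROM content WHERE object_id >= %s AND object_id < %s",
            (low, low + BUCKET_SIZE),
        )
        total += cur.fetchone()[0]
print("bucketed row count:", total)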
I was right until the last point :). My mistake was to base my reasoning on an old version of the stored procedure.
Thanks again for the explanation, it's crystal clear now.
In T2828#53767, @vsellier wrote:
Thanks for the clarification.
I missed those counters, I was only focused on the sql_swh_archive_object_count metrics. Could you give some pointers or information on how it's called? I can only find the stored procedure declaration on storage [1].

My understanding of the "Object added by time period" dashboard is that it uses the sql_swh_archive_object_count prometheus metrics.
Thanks for the clarification.
I missed those counters, I was only focused on the sql_swh_archive_object_count metrics. Could you give some pointers or information on how it's called? I can only find the stored procedure declaration on storage [1].
In T2828#53753, @vsellier wrote:
Erratum: the counters are not yet visible on the "Object added by time period" dashboard due to the aggregation per day.
All the stopped workers have been restarted:
vsellier@pergamon ~ % sudo clush -b -w @swh-workers16 'puppet agent --enable; systemctl default'
Erratum: the counters are not yet visible on the "Object added by time period" dashboard due to the aggregation per day.
In T2828#53740, @rdicosmo wrote:
Thanks for looking into this. It would be great to make sure that statistics are collected only once at a time (every X hours), and cached, so as to avoid rerunning expensive queries regularly.
The postgresql statistics are back online [1].
The "Object added by time period" dashboard[2] also has data to display.
Thanks for looking into this. It would be great to make sure that statistics are collected only once at a time (every X hours), and cached, so as to avoid rerunning expensive queries regularly.
D4635 has been landed.
As the slowness of the monitoring requests doesn't seem to be related to the direct load on the database, the indexers were restarted:
vsellier@pergamon ~ % sudo clush -b -w @azure-workers 'cd /etc/systemd/system/multi-user.target.wants; for unit in "swh-worker@indexer_origin_intrinsic_metadata.service swh-worker@indexer_fossology_license.service swh-worker@indexer_content_mimetype.service"; do systemctl enable $unit; done; systemctl start swh-worker@indexer_origin_intrinsic_metadata.service swh-worker@indexer_fossology_license.service swh-worker@indexer_content_mimetype.service; puppet agent --enable'
D4635 is a proposal to solve the performance issues on the statistics queries.
Nov 30 2020
@olasd has stopped the backfilling with:
pkill -2 -u swhstorage -f revision
(this lets the process flush its logs before exiting)
Half of the workers were stopped:
root@pergamon:~# sudo clush -b -w @swh-workers16 'puppet agent --disable "Reduce load of belvedere"; cd /etc/systemd/system/multi-user.target.wants; for unit in swh-worker@*; do systemctl disable $unit; done; systemctl stop swh-worker@*'
worker12: Removed /etc/systemd/system/multi-user.target.wants/swh-worker@checker_deposit.service.
worker11: Removed /etc/systemd/system/multi-user.target.wants/swh-worker@checker_deposit.service.
worker10: Removed /etc/systemd/system/multi-user.target.wants/swh-worker@checker_deposit.service.
worker09: Removed /etc/systemd/system/multi-user.target.wants/swh-worker@checker_deposit.service.
worker12: Removed /etc/systemd/system/multi-user.target.wants/swh-worker@lister.service.
...
It seems there is no other solution than reducing the load on belvedere.
There is an aggressive backfill in progress from getty (192.168.100.102):
postgres=# select client_addr, count(datid) from pg_stat_activity where state != 'idle' group by client_addr;
   client_addr   | count
-----------------+-------
                 |     3
 192.168.100.18  |     0
 ::1             |     1
 192.168.100.210 |    60
 192.168.100.102 |    64
(5 rows)
I don't want to kill the job that has been running for several days (since 2020-11-27), to avoid losing any work. The temporary solution is to reduce the number of workers to relieve the load on belvedere.
In T2828#53672, @rdicosmo wrote:
Hmmm... there is definitely no need to update the counters more than once a day
Let's try a temporary workaround:
root@belvedere:/etc/prometheus-sql-exporter# puppet agent --disable "Diagnose prometheus-exporter timeout"
root@belvedere:/etc/prometheus-sql-exporter# mv swh-scheduler.yml ~
root@belvedere:/etc/prometheus-sql-exporter# systemctl restart prometheus-sql-exporter
Hmmm... there is definitely no need to update the counters more than once a day
It seems some queries are executed on the database each time the metrics are requested.
This one is too long (on the swh-scheduler instance):
After retracing the counter computation pipeline, it seems the counters are computed from the values stored in prometheus.