
Feb 4 2021

rdicosmo added a comment to T2912: Next generation archive counters.

I asked one of the authors of the original HyperLogLog paper (not Philippe, who unfortunately passed away years ago :-()
The original HyperLogLog has three different behaviours: one for small cardinalities, another for medium cardinalities, and a third for very large cardinalities.
There is indeed a risk of breaking monotonicity at the boundaries between segments, but within each segment it is monotonic.
Our counters are already in the "very large cardinality" zone, so we should be safe with any implementation.
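
To make the three regimes concrete, here is a toy sketch of the estimator as described in the original HyperLogLog paper (32-bit hashes, linear counting for small values, raw estimate in the middle, logarithmic correction for very large values). The class name, the choice of SHA-1 as hash and the default of 4096 registers are assumptions made for illustration; this is not the Redis implementation, which uses its own variant.

```python
import hashlib
import math


class HyperLogLog:
    """Toy HyperLogLog with the three-regime estimator of the original paper."""

    def __init__(self, p: int = 12) -> None:
        self.p = p                      # number of bits used to pick a register
        self.m = 1 << p                 # number of registers
        self.registers = [0] * self.m
        # Bias-correction constant from the paper, valid for m >= 128.
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, item: bytes) -> None:
        # Illustrative 32-bit hash: the first 4 bytes of SHA-1.
        h = int.from_bytes(hashlib.sha1(item).digest()[:4], "big")
        idx = h >> (32 - self.p)                  # first p bits select a register
        rest = h & ((1 << (32 - self.p)) - 1)     # remaining 32 - p bits
        # Rank: 1-based position of the leftmost 1-bit in `rest`.
        rank = (32 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self) -> float:
        m = self.m
        raw = self.alpha * m * m / sum(2.0 ** -r for r in self.registers)
        if raw <= 2.5 * m:
            # Small range: fall back to linear counting on empty registers.
            zeros = self.registers.count(0)
            return m * math.log(m / zeros) if zeros else raw
        if raw <= 2 ** 32 / 30:
            # Intermediate range: the raw estimate is used as is.
            return raw
        # Very large range: correct for 32-bit hash collisions.
        return -(2 ** 32) * math.log(1 - raw / 2 ** 32)
```

The two `if` boundaries are the segment boundaries mentioned above: the estimate can jump when the computation switches from one formula to the next.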

Feb 4 2021, 10:31 PM · Roadmap 2021, System administration, Monitoring, Web app
vsellier added a comment to T2912: Next generation archive counters.

The question is not an abstract one: there are implementations of HyperLogLog that are monotonic, and maybe the Redis one already is; we just need to find out.

Feb 4 2021, 9:48 AM · Roadmap 2021, System administration, Monitoring, Web app

Feb 3 2021

ardumont added a comment to T2912: Next generation archive counters.

Another bonus point with this approach is that we could also unstick the indexer
counters (graphs for those have been stuck since November 2020) [1] [2]

Feb 3 2021, 4:26 PM · Roadmap 2021, System administration, Monitoring, Web app
rdicosmo added a comment to T2912: Next generation archive counters.
In T2912#58063, @zack wrote:

Thanks @vsellier, that seems quite ok indeed. The only question left is to know if the estimator implemented is monotonic (i.e. we will never have negative bumps in the graph :-))

may I suggest (for reasons discussed in the past) to just remove the graphs from the main archive.s.o page

We decided to keep the counters.

Feb 3 2021, 4:21 PM · Roadmap 2021, System administration, Monitoring, Web app
zack added a comment to T2912: Next generation archive counters.

Thanks @vsellier, that seems quite ok indeed. The only question left is to know if the estimator implemented is monotonic (i.e. we will never have negative bumps in the graph :-))
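
As an illustration of what "no negative bumps" means in practice, one generic safeguard that works with any estimator is to publish the running maximum of the estimates instead of the estimates themselves. This is only a sketch of that idea, not something proposed in this thread; the class name and the sample readings are made up.

```python
class MonotonicCounter:
    """Expose the running maximum of a possibly non-monotonic estimate."""

    def __init__(self) -> None:
        self._high_water = 0

    def observe(self, estimate: int) -> int:
        # Never publish a value lower than one already published.
        self._high_water = max(self._high_water, estimate)
        return self._high_water


# Hypothetical sequence of estimates, with one small dip at the third reading.
readings = [100, 105, 103, 110]
counter = MonotonicCounter()
published = [counter.observe(r) for r in readings]
```

The price of this approach is that a dip in the estimate is hidden rather than corrected, so the published value may stay slightly too high until the estimate catches up.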

Feb 3 2021, 4:08 PM · Roadmap 2021, System administration, Monitoring, Web app
rdicosmo added a comment to T2912: Next generation archive counters.

Thanks @vsellier, that seems quite ok indeed. The only question left is to know if the estimator implemented is monotonic (i.e. we will never have negative bumps in the graph :-))

Feb 3 2021, 4:01 PM · Roadmap 2021, System administration, Monitoring, Web app

Feb 1 2021

vsellier added a comment to T2912: Next generation archive counters.

Here are the results for the directory and revision counts (the content count is still running, so there are some fresh statistics):

Feb 1 2021, 10:02 AM · Roadmap 2021, System administration, Monitoring, Web app

Jan 29 2021

vlorentz added a comment to T2912: Next generation archive counters.

Oh indeed, it sounds good then. :)

Jan 29 2021, 1:11 PM · Roadmap 2021, System administration, Monitoring, Web app
rdicosmo added a comment to T2912: Next generation archive counters.

I don't think this solves the issue of overestimating the number of objects, when two threads insert the same objects at the same time.

In T2912#57655, @vsellier wrote:

I'm not sure I understand,

Jan 29 2021, 1:10 PM · Roadmap 2021, System administration, Monitoring, Web app
vsellier added a comment to T2912: Next generation archive counters.

I'm not sure I understand: the HyperLogLog function is precisely used to deduplicate the messages based on their keys (at least in the PoC).

Jan 29 2021, 12:10 PM · Roadmap 2021, System administration, Monitoring, Web app
vlorentz added a comment to T2912: Next generation archive counters.

I don't think this solves the issue of overestimating the number of objects, when two threads insert the same objects at the same time.

Jan 29 2021, 11:58 AM · Roadmap 2021, System administration, Monitoring, Web app
vsellier added a comment to T2912: Next generation archive counters.

For information, the PoC was launched on the production content topic. The results seem acceptable, with a slightly higher count on the Redis counter, probably due to some messages sent to Kafka but not persisted in the database.

Jan 29 2021, 11:12 AM · Roadmap 2021, System administration, Monitoring, Web app

Jan 28 2021

rdicosmo added a comment to T2912: Next generation archive counters.

Bloom filters are still on the table for other use cases, like testing super quickly for contents that we do not have, but if nobody has strong objections, this seems the way to go for the counters (very small footprint, small under/over counting errors, thanks to Philippe Flajolet's magic :-))

Jan 28 2021, 7:27 PM · Roadmap 2021, System administration, Monitoring, Web app

Jan 25 2021

vsellier added a comment to T2912: Next generation archive counters.

It seems Redis has a HyperLogLog functionality[1] that can match the requirements (bloom filter / limited deviation / small memory footprint / efficiency).
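
A minimal sketch of how such a counter could be fed, assuming a client that exposes the Redis `PFADD` and `PFCOUNT` commands under the method names `pfadd` and `pfcount` (as the redis-py client does). The key naming scheme and the function name are hypothetical, and the set-based stand-in exists only so that the sketch runs without a Redis server; it counts exactly, whereas Redis returns an approximation.

```python
from typing import Iterable, Protocol


class HllClient(Protocol):
    """The two HyperLogLog operations the sketch relies on."""

    def pfadd(self, name: str, *values: bytes) -> int: ...
    def pfcount(self, *sources: str) -> int: ...


def record_objects(client: HllClient, object_type: str, ids: Iterable[bytes]) -> int:
    """Add object ids to the per-type counter and return the current count."""
    key = f"counter:{object_type}"      # hypothetical key naming scheme
    batch = list(ids)
    if batch:
        client.pfadd(key, *batch)       # duplicates do not inflate the count
    return client.pfcount(key)


class InMemoryStandIn:
    """Exact, set-based stand-in used only to run the sketch locally."""

    def __init__(self) -> None:
        self._sets: dict[str, set[bytes]] = {}

    def pfadd(self, name: str, *values: bytes) -> int:
        seen = self._sets.setdefault(name, set())
        before = len(seen)
        seen.update(values)
        return int(len(seen) != before)

    def pfcount(self, *sources: str) -> int:
        return len(set().union(*(self._sets.get(k, set()) for k in sources)))


client = InMemoryStandIn()
first = record_objects(client, "content", [b"a", b"b", b"a"])
second = record_objects(client, "content", [b"b", b"c"])
```

With a real server, an instance of `redis.Redis` could be passed in place of `InMemoryStandIn`, since it provides methods with the same names.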

Jan 25 2021, 12:52 PM · Roadmap 2021, System administration, Monitoring, Web app

Jan 6 2021

ardumont added a comment to T2770: Fix all icinga checks on staging webapp.

The last check no longer appears in icinga.

Jan 6 2021, 4:36 PM · Monitoring, System administration, Staging environment
ardumont closed T2770: Fix all icinga checks on staging webapp as Resolved.
Jan 6 2021, 4:36 PM · Monitoring, System administration, Staging environment
ardumont changed the status of T2770: Fix all icinga checks on staging webapp from Open to Work in Progress.
Jan 6 2021, 4:36 PM · Monitoring, System administration, Staging environment
ardumont moved T2727: Investigate end-to-end monitoring which no longer reports issues from Backlog to deployed/landed/monitoring on the System administration board.
Jan 6 2021, 3:45 PM · Monitoring, System administration

Jan 5 2021

rdicosmo added a comment to T2912: Next generation archive counters.

It looks like you already agree, but FWIW I'd also like to have a dedicated (micro)service that keeps an up-to-date bloom filter for the entire archive, with a REST API.
It might be useful for other use cases (swh-scanner comes to mind, but I'm sure we'll find others as time passes).
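
For readers unfamiliar with the data structure, here is a small Bloom filter sketch. It answers "definitely absent" or "probably present", which is what makes it suitable for quickly ruling out contents the archive does not have. The sizes, the use of SHA-256 and the double-hashing scheme are illustrative assumptions, not a description of any existing Software Heritage service.

```python
import hashlib
from typing import Iterator


class BloomFilter:
    """Fixed-size Bloom filter using double hashing over a SHA-256 digest."""

    def __init__(self, num_bits: int = 1 << 20, num_hashes: int = 7) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: bytes) -> Iterator[int]:
        digest = hashlib.sha256(item).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1   # force an odd step
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: bytes) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: bytes) -> bool:
        # False means "definitely not added"; True means "probably added".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )
```

A filter like this never reports a false negative, so a lookup that returns `False` can skip the more expensive check against the archive.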

Jan 5 2021, 6:05 PM · Roadmap 2021, System administration, Monitoring, Web app
zack added a comment to T2912: Next generation archive counters.
In T2912#55849, @olasd wrote:

I think we should be able to decouple these counters completely from the loaders, and have them directly updated/handled by a client of the swh-journal. This would be a "centralized" component, but which we can parallelize quite heavily thanks to basic kafka design. We can also leverage the way kafka clients do parallelism to sidestep the locking issues arising in a potentially distributed filter.

Maybe my writing was not all that clear: I also had in mind a single centralised component (the ArchiveCounter) per Bloom filter, receiving the lists newcontents of ids from the loaders.
Getting the feed of ids from swh-journal instead of from the loaders is really neat: we avoid touching the loader code, and we gain a better capability of monitoring the load on the ArchiveCounter, so I'm all for it :-)

Jan 5 2021, 6:01 PM · Roadmap 2021, System administration, Monitoring, Web app
rdicosmo added a comment to T2912: Next generation archive counters.
In T2912#55849, @olasd wrote:

Thanks for sketching out this proposal! It looks quite promising (and neat!).

Jan 5 2021, 5:00 PM · Roadmap 2021, System administration, Monitoring, Web app
douardda added a comment to T2912: Next generation archive counters.

I'm also having the "full journal" approach in mind after a quick reading of this neat proposal :-)

Jan 5 2021, 4:24 PM · Roadmap 2021, System administration, Monitoring, Web app
olasd added a comment to T2912: Next generation archive counters.

Thanks for sketching out this proposal! It looks quite promising (and neat!).

Jan 5 2021, 4:00 PM · Roadmap 2021, System administration, Monitoring, Web app

Jan 4 2021

rdicosmo updated the task description for T2912: Next generation archive counters.
Jan 4 2021, 6:35 PM · Roadmap 2021, System administration, Monitoring, Web app
rdicosmo updated the task description for T2912: Next generation archive counters.
Jan 4 2021, 12:04 PM · Roadmap 2021, System administration, Monitoring, Web app

Dec 22 2020

rdicosmo added a comment to T2912: Next generation archive counters.

Updated the proposal with your suggestions, thanks!

Dec 22 2020, 2:59 PM · Roadmap 2021, System administration, Monitoring, Web app
rdicosmo updated the task description for T2912: Next generation archive counters.
Dec 22 2020, 2:59 PM · Roadmap 2021, System administration, Monitoring, Web app
rdicosmo added a comment to T2912: Next generation archive counters.

A Python library may be an issue, as it requires a central process with a global lock. Sharding by hash may fix the issue, though.

Dec 22 2020, 2:55 PM · Roadmap 2021, System administration, Monitoring, Web app
vlorentz added a comment to T2912: Next generation archive counters.

A Python library may be an issue, as it requires a central process with a global lock. Sharding by hash may fix the issue, though.
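
The sharding idea can be sketched as follows: object ids are routed to a shard by hash, and each shard has its own lock, so writers touching different shards do not contend on a single global lock. Plain Python sets stand in for the per-shard filters here, and the class name and shard count are hypothetical.

```python
import hashlib
import threading


class ShardedSeenSet:
    """Partition object ids by hash so that each shard has its own lock."""

    def __init__(self, num_shards: int = 16) -> None:
        self._shards: list[set[bytes]] = [set() for _ in range(num_shards)]
        self._locks = [threading.Lock() for _ in range(num_shards)]

    def _shard_of(self, object_id: bytes) -> int:
        prefix = hashlib.sha1(object_id).digest()[:4]
        return int.from_bytes(prefix, "big") % len(self._shards)

    def add(self, object_id: bytes) -> bool:
        """Return True if the id was new, False if it was already recorded."""
        i = self._shard_of(object_id)
        with self._locks[i]:            # only this shard is locked
            if object_id in self._shards[i]:
                return False
            self._shards[i].add(object_id)
            return True

    def count(self) -> int:
        # Not locked: meant to be read once the writers are done.
        return sum(len(shard) for shard in self._shards)
```

Because the same id always lands in the same shard, two threads inserting the same object at the same time serialize on that shard's lock and the object is counted once.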

Dec 22 2020, 2:46 PM · Roadmap 2021, System administration, Monitoring, Web app
rdicosmo updated the task description for T2912: Next generation archive counters.
Dec 22 2020, 1:29 PM · Roadmap 2021, System administration, Monitoring, Web app
rdicosmo updated the task description for T2912: Next generation archive counters.
Dec 22 2020, 1:28 PM · Roadmap 2021, System administration, Monitoring, Web app
rdicosmo triaged T2912: Next generation archive counters as Normal priority.
Dec 22 2020, 12:57 PM · Roadmap 2021, System administration, Monitoring, Web app

Dec 8 2020

vsellier closed T2828: Archive counters are no longer updated in production as Resolved.

Changing the status to "Resolved", as there seems to be nothing more to do on this task now that the counters are being updated again.

Dec 8 2020, 7:30 PM · Monitoring, Web app, System administration

Dec 7 2020

vsellier closed T2859: Out of disk space on prometheus storage as Resolved.
Dec 7 2020, 2:38 PM · Monitoring, System administration
vsellier added a comment to T2859: Out of disk space on prometheus storage.
  • an increment of 250 GB was added via the Proxmox UI (for a total of 500 GB now)
  • the disk was resized on the OS side:
root@pergamon:~# parted /dev/vdc
GNU Parted 3.2
Using /dev/vdc
Welcome to GNU Parted! Type 'help' to view a list of commands.
(parted) print                                                            
Model: Virtio Block Device (virtblk)
Disk /dev/vdc: 537GB
Sector size (logical/physical): 512B/512B
Partition Table: gpt
Disk Flags:
Dec 7 2020, 2:38 PM · Monitoring, System administration
vsellier added a comment to T2859: Out of disk space on prometheus storage.

The disk will be resized to avoid a service disruption in the short term, before looking at T1362.

Dec 7 2020, 12:41 PM · Monitoring, System administration
vsellier changed the status of T2859: Out of disk space on prometheus storage from Open to Work in Progress.
Dec 7 2020, 12:38 PM · Monitoring, System administration
vsellier claimed T2859: Out of disk space on prometheus storage.
Dec 7 2020, 12:38 PM · Monitoring, System administration
vsellier triaged T2859: Out of disk space on prometheus storage as High priority.
Dec 7 2020, 12:24 PM · Monitoring, System administration

Dec 2 2020

olasd added a comment to T2828: Archive counters are no longer updated in production.

I've now done softwareheritage=> update object_counts set single_update=true;, which will make all counters get their updates via cron. I've also shortened the cron delay to be 2 hours instead of 4 (providing an update for each counter every 18 hours).
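
The figures above can be related with a little arithmetic, under the assumption (not stated in the comment) that each cron run refreshes a single counter in turn. The number of counters is inferred from the two figures given, and the function name is hypothetical.

```python
def full_cycle_hours(num_counters: int, cron_interval_hours: int) -> int:
    """Hours needed to refresh every counter once, one counter per cron run."""
    return num_counters * cron_interval_hours


# Inferred from "every 18 hours" at a 2-hour cron delay (assumption).
num_counters = 18 // 2

new_cycle = full_cycle_hours(num_counters, 2)   # with the shortened delay
old_cycle = full_cycle_hours(num_counters, 4)   # with the previous delay
```

Under that assumption, halving the cron delay halves the time between two updates of the same counter.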

Dec 2 2020, 12:30 PM · Monitoring, Web app, System administration
olasd added a comment to T2828: Archive counters are no longer updated in production.

One more:

Dec 2 2020, 12:27 PM · Monitoring, Web app, System administration
olasd added a comment to T2828: Archive counters are no longer updated in production.

Current status of counting the objects in one go:

Dec 2 2020, 11:55 AM · Monitoring, Web app, System administration
vsellier added a comment to T2828: Archive counters are no longer updated in production.

I took the time to create a diagram of the pipeline to help me summarize the subject.
It should help to deploy the counters in staging.
SVG:

Dec 2 2020, 9:18 AM · Monitoring, Web app, System administration

Dec 1 2020

olasd added a comment to T2828: Archive counters are no longer updated in production.

Btw, the main point of using buckets for object counts of large tables is that *very long running* transactions kill performance for the whole database, and have knock-on effects for logical replication. In effect, if the time we take to update the buckets is 300 times larger than making a single update, then we need to rethink this tradeoff...

Dec 1 2020, 3:31 PM · Monitoring, Web app, System administration
vsellier added a comment to T2828: Archive counters are no longer updated in production.

I was right until the last point :). My mistake was basing my reasoning on an old version of the stored procedure.
Thanks again for the explanation, it's crystal clear now.

Dec 1 2020, 3:18 PM · Monitoring, Web app, System administration
olasd added a comment to T2828: Archive counters are no longer updated in production.

Thanks for the clarification.
I missed those counters, I was only focused on the sql_swh_archive_object_count metrics. Could you give some pointers or information on how it's called? I could only find the stored procedure declaration on storage [1].

My understanding of the "Objects added by time period dashboard" is that it uses the sql_swh_archive_object_count Prometheus metrics.

Dec 1 2020, 2:59 PM · Monitoring, Web app, System administration
vsellier added a comment to T2828: Archive counters are no longer updated in production.

Thanks for the clarification.
I missed those counters, I was only focused on the sql_swh_archive_object_count metrics. Could you give some pointers or information on how it's called? I could only find the stored procedure declaration on storage [1].

Dec 1 2020, 2:47 PM · Monitoring, Web app, System administration
olasd added a comment to T2828: Archive counters are no longer updated in production.

Erratum: the counters are not yet visible on the "Object added by time period" dashboard due to the aggregation per day.

Dec 1 2020, 1:10 PM · Monitoring, Web app, System administration
vsellier added a comment to T2828: Archive counters are no longer updated in production.

All stopped workers have been restarted:

vsellier@pergamon ~ % sudo clush -b -w @swh-workers16 'puppet agent --enable; systemctl default'
Dec 1 2020, 12:48 PM · Monitoring, Web app, System administration
vsellier added a comment to T2828: Archive counters are no longer updated in production.

Erratum: the counters are not yet visible on the "Object added by time period" dashboard due to the aggregation per day.

Dec 1 2020, 12:26 PM · Monitoring, Web app, System administration
vsellier added a comment to T2828: Archive counters are no longer updated in production.

Thanks for looking into this. It would be great to make sure that statistics are collected only once at a time (every X hours), and cached, so to avoid rerunning expensive queries regularly.

Dec 1 2020, 12:22 PM · Monitoring, Web app, System administration
vsellier added a comment to T2828: Archive counters are no longer updated in production.

The PostgreSQL statistics are back online [1].
The "Object added by time period" dashboard[2] also has data to display.

Dec 1 2020, 12:06 PM · Monitoring, Web app, System administration
rdicosmo added a comment to T2828: Archive counters are no longer updated in production.

Thanks for looking into this. It would be great to make sure that statistics are collected only once at a time (every X hours), and cached, so to avoid rerunning expensive queries regularly.
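
The "collect once every X hours and cache" idea can be sketched as a small time-based cache around the expensive query. The class name, the injectable clock and the way the query is passed in as a callable are assumptions for illustration, not a description of how the exporter is configured.

```python
import time
from typing import Callable, Generic, Optional, TypeVar

T = TypeVar("T")


class CachedStatistic(Generic[T]):
    """Recompute an expensive value at most once per time-to-live period."""

    def __init__(
        self,
        compute: Callable[[], T],
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._compute = compute
        self._ttl = ttl_seconds
        self._clock = clock
        self._expires_at: Optional[float] = None
        self._value: Optional[T] = None

    def get(self) -> T:
        now = self._clock()
        if self._expires_at is None or now >= self._expires_at:
            # Cache is empty or stale: run the expensive computation once.
            self._value = self._compute()
            self._expires_at = now + self._ttl
        return self._value  # type: ignore[return-value]
```

Every metrics request in between two refreshes is then served from memory, so the database only sees the query once per period regardless of how often the metrics are scraped.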

Dec 1 2020, 11:55 AM · Monitoring, Web app, System administration
vsellier added a comment to T2828: Archive counters are no longer updated in production.

D4635 has landed.

Dec 1 2020, 11:48 AM · Monitoring, Web app, System administration
vsellier added a comment to T2828: Archive counters are no longer updated in production.

As the slowness of the monitoring requests doesn't seem to be related to the direct load on the database, the indexers were restarted:

vsellier@pergamon ~ % sudo clush -b -w @azure-workers 'cd /etc/systemd/system/multi-user.target.wants; for unit in "swh-worker@indexer_origin_intrinsic_metadata.service swh-worker@indexer_fossology_license.service swh-worker@indexer_content_mimetype.service"; do systemctl enable $unit; done; systemctl start swh-worker@indexer_origin_intrinsic_metadata.service swh-worker@indexer_fossology_license.service swh-worker@indexer_content_mimetype.service; puppet agent --enable'
Dec 1 2020, 9:37 AM · Monitoring, Web app, System administration
vsellier added a comment to T2828: Archive counters are no longer updated in production.

D4635 is a proposal to solve the performance issues with the statistics queries.

Dec 1 2020, 9:22 AM · Monitoring, Web app, System administration
vsellier added a revision to T2828: Archive counters are no longer updated in production: D4635: exclude temporary schemas from the statistics.
Dec 1 2020, 9:21 AM · Monitoring, Web app, System administration

Nov 30 2020

vsellier updated subscribers of T2828: Archive counters are no longer updated in production.

@olasd has stopped the backfilling with:

pkill -2 -u swhstorage -f revision

(this allows the logs to be flushed before exiting)

Nov 30 2020, 7:49 PM · Monitoring, Web app, System administration
vsellier added a comment to T2828: Archive counters are no longer updated in production.

Half of the workers were stopped:

root@pergamon:~# sudo clush -b -w @swh-workers16 'puppet agent --disable "Reduce load of belvedere"; cd /etc/systemd/system/multi-user.target.wants; for unit in swh-worker@*; do systemctl disable $unit; done; systemctl stop swh-worker@*'
worker12: Removed /etc/systemd/system/multi-user.target.wants/swh-worker@checker_deposit.service.
worker11: Removed /etc/systemd/system/multi-user.target.wants/swh-worker@checker_deposit.service.
worker10: Removed /etc/systemd/system/multi-user.target.wants/swh-worker@checker_deposit.service.
worker09: Removed /etc/systemd/system/multi-user.target.wants/swh-worker@checker_deposit.service.
worker12: Removed /etc/systemd/system/multi-user.target.wants/swh-worker@lister.service.
...
Nov 30 2020, 6:09 PM · Monitoring, Web app, System administration
vsellier triaged T2831: sql exporter is failing to retrieve the number of running queries as Normal priority.
Nov 30 2020, 6:05 PM · Monitoring, System administration
vsellier added a comment to T2828: Archive counters are no longer updated in production.

It seems there is no other solution than reducing the load on belvedere.
There is an aggressive backfill in progress from getty (192.168.100.102):

postgres=# select client_addr, count(datid) from pg_stat_activity where state != 'idle' group by client_addr;
   client_addr   | count 
-----------------+-------
                 |     3
 192.168.100.18  |     0
 ::1             |     1
 192.168.100.210 |    60
 192.168.100.102 |    64
(5 rows)

I don't want to kill the job, which has been running for several days (since 2020-11-27), to avoid losing any work. The temporary solution is to reduce the number of workers to relieve the load on belvedere.

Nov 30 2020, 5:36 PM · Monitoring, Web app, System administration
vsellier added a comment to T2828: Archive counters are no longer updated in production.

Hmmm... there is definitely no need to update the counters more than once a day

Nov 30 2020, 5:19 PM · Monitoring, Web app, System administration
vsellier added a comment to T2828: Archive counters are no longer updated in production.

Let's try a temporary workaround:

root@belvedere:/etc/prometheus-sql-exporter# puppet agent --disable "Diagnose prometheus-exporter timeout" 
root@belvedere:/etc/prometheus-sql-exporter# mv swh-scheduler.yml ~
root@belvedere:/etc/prometheus-sql-exporter# systemctl restart prometheus-sql-exporter
Nov 30 2020, 4:18 PM · Monitoring, Web app, System administration
rdicosmo added a comment to T2828: Archive counters are no longer updated in production.

Hmmm... there is definitely no need to update the counters more than once a day

Nov 30 2020, 4:14 PM · Monitoring, Web app, System administration
vsellier added a comment to T2828: Archive counters are no longer updated in production.

It seems some queries are executed on the database each time the metrics are requested.
This one takes too long (on the swh-scheduler instance):

Nov 30 2020, 4:04 PM · Monitoring, Web app, System administration
zack renamed T2828: Archive counters are no longer updated in production from Production counters not up to date to Archive counters are no longer updated in production.
Nov 30 2020, 4:02 PM · Monitoring, Web app, System administration
zack raised the priority of T2828: Archive counters are no longer updated in production from High to Unbreak Now!.
Nov 30 2020, 4:02 PM · Monitoring, Web app, System administration
vsellier added a comment to T2828: Archive counters are no longer updated in production.

After retracing the counter computation pipeline, it seems the counters are computed from the values stored in Prometheus.

Nov 30 2020, 3:34 PM · Monitoring, Web app, System administration
vsellier changed the status of T2828: Archive counters are no longer updated in production from Open to Work in Progress.
Nov 30 2020, 3:20 PM · Monitoring, Web app, System administration

Nov 13 2020

ardumont added a project to T2770: Fix all icinga checks on staging webapp: Monitoring.
Nov 13 2020, 1:20 PM · Monitoring, System administration, Staging environment
ardumont added a project to T2727: Investigate end-to-end monitoring which no longer reports issues: Monitoring.
Nov 13 2020, 1:19 PM · Monitoring, System administration
ardumont added a project to T2774: Fix vault end-to-end check: Monitoring.
Nov 13 2020, 1:18 PM · Vault, System administration, Monitoring
ardumont created Monitoring.
Nov 13 2020, 1:18 PM