Since 2020-11-25, the object counters have not been computed.
It impacts:
- the production counters
- at least the "Objects added by time period" grafana dashboard [1]
Related revision: D4635 (rSPSITE puppet-swh-site, rSPSITEe0e677dca3ff): exclude temporary schemas from the statistics
After retracing the counter computation pipeline, it seems they are computed from the values stored in prometheus.
The query executed by the sql exporter on belvedere runs in a couple of milliseconds:
softwareheritage=> select label as object_type, value from swh_stat_counters();
   object_type   |   value
-----------------+------------
 content         | 9152229086
 directory       | 7733163858
 origin          |  144960097
 origin_visit    |  817462103
 person          |   39726003
 release         |   16325031
 revision        | 1934143430
 skipped_content |     139323
 snapshot        |  137260210
(9 rows)

Time: 21.589 ms
The best lead at the moment is that pergamon can't read the values from the sql exporter:
vsellier@pergamon /etc/prometheus/exported-configs % time curl http://192.168.100.210:9237/metrics
^C
curl http://192.168.100.210:9237/metrics  0.01s user 0.01s system 0% cpu 5:21.46 total
It never succeeds, or takes far too long.
Belvedere has been under heavy load since 2020-11-25 around 23:45 UTC, which seems to match the last complete prometheus counter update.
It seems some queries are executed on the database each time the metrics are requested.
This one takes too long (on the swh-scheduler instance):
postgres=# select now(); select * from pg_stat_activity where state != 'idle'; select now();
-[ RECORD 1 ]----------------------
now | 2020-11-30 15:00:17.358361+00
...
-[ RECORD 1 ]----+--------------------------------------------------------------------------
datid            | 16402
datname          | softwareheritage-scheduler
pid              | 3360876
usesysid         | 10
usename          | postgres
application_name | prometheus-sql-exporter
client_addr      |
client_hostname  |
client_port      | -1
backend_start    | 2020-11-30 13:41:09.391675+00
xact_start       | 2020-11-30 14:56:03.064342+00
query_start      | 2020-11-30 14:56:03.064342+00
state_change     | 2020-11-30 14:56:03.064353+00
wait_event_type  | IO
wait_event       | DataFileRead
state            | active
backend_xid      |
backend_xmin     | 3436325849
query            | with task_count_by_bucket as (
                 |   -- get the count of tasks by delay bucket. Tasks are grouped by their
                 |   -- characteristics (type, status, policy, priority, current interval),
                 |   -- then by delay buckets that are 1 hour wide between -24 and +24 hours,
                 |   -- and 1 day wide outside of this range.
                 |   -- A positive delay means the task execution is late wrt scheduling.
                 |   select
                 |     "type",
                 |     status,
                 |     "policy",
                 |     priority,
                 |     current_interval,
                 |     (
                 |       -- select the bucket widths
                 |       case when delay between - 24 * 3600 and 24 * 3600 then
                 |         (ceil(delay / 3600)::bigint) * 3600
                 |       else
                 |         (ceil(delay / (24 * 3600))::bigint) * 24 * 3600
                 |       end
                 |     ) as delay_bucket,
                 |     count(*)
                 |   from
                 |     task
                 |     join lateral (
                 |       -- this is where the "positive = late" convention is set
                 |       select
                 |         extract(epoch from (now() - next_run)) as delay
                 |     ) as d on true
                 |   group by
                 |     "type",
                 |     status,
                 |     "policy",
                 |     priority,
                 |     current_interval,
                 |     delay_bucket
                 |   order by
backend_type     | client backend
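For reference, exporter queries stuck like this can also be cancelled server-side with pg_cancel_backend() (a hedged aside; in this case the exporter configuration was removed instead, see below):

-- ask postgresql to cancel the running queries of the sql exporter
select pid, pg_cancel_backend(pid)
  from pg_stat_activity
 where application_name = 'prometheus-sql-exporter'
   and state = 'active';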
Several instances of it are running in parallel:

postgres=# select query, count(datid) from pg_stat_activity where application_name='prometheus-sql-exporter' and state != 'idle' group by query;
-[ RECORD 1 ]---------------------------------------------------------------------
query | with task_count_by_bucket as (
      |   -- (same task delay bucketing query as above)
      |   ...
count | 3

Time: 1.362 ms
(Comment edited to add a filter excluding idle queries.)
Let's try a temporary workaround:
root@belvedere:/etc/prometheus-sql-exporter# puppet agent --disable "Diagnose prometheus-exporter timeout"
root@belvedere:/etc/prometheus-sql-exporter# mv swh-scheduler.yml ~
root@belvedere:/etc/prometheus-sql-exporter# systemctl restart prometheus-sql-exporter
That query is not executed anymore, but another one takes a long time, on the main swh database this time:
postgres=# select query, count(datid) from pg_stat_activity where application_name='prometheus-sql-exporter' and state != 'idle' group by query;
-[ RECORD 1 ]-------------------------------------------------------------------------
query | SELECT current_database()::text AS datname, COALESCE(schemaname::text, 'null') AS schemaname, COALESCE(relname::text, 'null') AS relname, SUM(COALESCE(heap_blks_read, 0) + COALESCE(heap_blks_hit, 0) + COALESCE(idx_blks_hit, 0) + COALESCE(idx_blks_read, 0) + COALESCE(toast_blks_hit, 0) + COALESCE(toast_blks_read, 0) + COALESCE(tidx_blks_hit, 0) + COALESCE(tidx_blks_read, 0)) * 8192::bigint as reads FROM pg_statio_user_tables FULL JOIN (VALUES(0)) filler(i) ON TRUE GROUP BY 1, 2, 3
count | 1
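As an aside, D4635 (mentioned above) excludes temporary schemas from these statistics; a query along these lines (illustrative only) shows how many relations each schema contributes to the pg_statio_user_tables view:

-- count the relations per schema feeding the I/O statistics view
select schemaname, count(*) as relations
  from pg_statio_user_tables
 group by schemaname
 order by relations desc;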
The counters themselves are correctly updated once a day.
The problem is with the probes used to follow the behavior of the archive; these are the counters that are aggregated to compute the daily counter (if I have understood the whole pipeline correctly ;-)).
It seems there is no other solution than reducing the load on belvedere.
There is an aggressive backfill in progress from getty (192.168.100.102):
postgres=# select client_addr, count(datid) from pg_stat_activity where state != 'idle' group by client_addr;
   client_addr   | count
-----------------+-------
                 |     3
 192.168.100.18  |     0
 ::1             |     1
 192.168.100.210 |    60
 192.168.100.102 |    64
(5 rows)
I don't want to kill the job that has been running for several days (since 2020-11-27), to avoid losing any work. The temporary solution is to reduce the number of workers to relieve the load on belvedere.
Half of the workers were stopped:
root@pergamon:~# sudo clush -b -w @swh-workers16 'puppet agent --disable "Reduce load of belvedere"; cd /etc/systemd/system/multi-user.target.wants; for unit in swh-worker@*; do systemctl disable $unit; done; systemctl stop swh-worker@*'
worker12: Removed /etc/systemd/system/multi-user.target.wants/swh-worker@checker_deposit.service.
worker11: Removed /etc/systemd/system/multi-user.target.wants/swh-worker@checker_deposit.service.
worker10: Removed /etc/systemd/system/multi-user.target.wants/swh-worker@checker_deposit.service.
worker09: Removed /etc/systemd/system/multi-user.target.wants/swh-worker@checker_deposit.service.
worker12: Removed /etc/systemd/system/multi-user.target.wants/swh-worker@lister.service.
...
The load on belvedere fell from more than 130 to less than ~100, but it seems that is not enough to get the metrics within a decent delay.
@olasd has stopped the backfilling with:
pkill -2 -u swhstorage -f revision
(this lets the process flush its logs before exiting)
A lot of queries doing ranged scans on the contents remain; they could be issued by the indexers restarted this morning (a query like the one below can list them).
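This is an illustrative query matching on the query text, not an exact record of what was run:

-- group the active queries touching the content table
select left(query, 80) as query_prefix, count(*)
  from pg_stat_activity
 where state != 'idle'
   and query ilike '%content%'
 group by query_prefix
 order by count(*) desc;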
The indexers were also stopped:
root@pergamon:/etc/clustershell# sudo clush -b -w @azure-workers 'puppet agent --disable "Reduce belvedere load"; cd /etc/systemd/system/multi-user.target.wants; for unit in "swh-worker@indexer_origin_intrinsic_metadata.service swh-worker@indexer_fossology_license.service swh-worker@indexer_content_mimetype.service"; do systemctl disable $unit; done; systemctl stop swh-worker@indexer_origin_intrinsic_metadata.service swh-worker@indexer_fossology_license.service swh-worker@indexer_content_mimetype.service'
As the slowness of the monitoring queries doesn't seem to be related to the direct load on the database, the indexers were restarted:
vsellier@pergamon ~ % sudo clush -b -w @azure-workers 'cd /etc/systemd/system/multi-user.target.wants; for unit in "swh-worker@indexer_origin_intrinsic_metadata.service swh-worker@indexer_fossology_license.service swh-worker@indexer_content_mimetype.service"; do systemctl enable $unit; done; systemctl start swh-worker@indexer_origin_intrinsic_metadata.service swh-worker@indexer_fossology_license.service swh-worker@indexer_content_mimetype.service; puppet agent --enable'
D4635 has landed.
The metrics are now retrieved in less than 20s:
curl http://192.168.100.210:9237/metrics 0.01s user 0.03s system 0% cpu 18.079 total
Let's see if things are getting better
Thanks for looking into this. It would be great to make sure that statistics are collected only once at a time (every X hours), and cached, so as to avoid rerunning expensive queries regularly.
The postgresql statistics came back online [1].
The "Objects added by time period" dashboard [2] also has data to display.
[1] https://grafana.softwareheritage.org/goto/-dNByJ0Mk
[2] https://grafana.softwareheritage.org/goto/RRG3s1AMz
The problem was with some queries used for the internal monitoring of postgresql (dashboard [1] in my previous comment), which start to misbehave when postgresql is under heavy load.
The swh object counter stats themselves are not expensive; the missing updates were collateral damage of the monitoring issue.
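For future reference, when postgresql is under this kind of load, a generic diagnostic query (not specific to this incident) shows what the active backends are waiting on:

-- summarize the wait events of active backends
select wait_event_type, wait_event, count(*)
  from pg_stat_activity
 where state = 'active'
 group by wait_event_type, wait_event
 order by count(*) desc;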
Erratum: the counters are not yet visible on the "Objects added by time period" dashboard due to the per-day aggregation.
All stopped workers were restarted:
vsellier@pergamon ~ % sudo clush -b -w @swh-workers16 'puppet agent --enable; systemctl default'
Yes and no.
There are two types of archive counters:
- the single-shot counters, stored in the object_counts table and refreshed by a cron job;
- the bucketed counters, stored in the object_counts_bucketed table and refreshed by swh_update_counter_bucketed() running in a loop.
The main issue is that the update of "counts per bucket" is now taking a (very) long time, for some object types (currently, revision):
Tue Dec  1 11:59:11 2020 (every 0.1s)

 swh_update_counter_bucketed
-----------------------------

(1 row)

Time: 237774.102 ms (03:57.774)
We don't parallelize bucketed counts by object type, and there are a lot of buckets (256 or 4096 per object type), so the counters don't really get updated that often...
Maybe updating the counters "one shot" for all tables would be more sensible now... This needs to be timed.
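For instance, the timing can be measured from psql in a read-only transaction (this is exactly what is done a few comments below):

-- time a one-shot count without holding any write locks
begin;
set transaction read only;
\timing on
select count(*) from revision;
rollback;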
Thanks for the clarification.
I missed those counters; I was only focused on the sql_swh_archive_object_count metric. Could you give some pointers or information on how it's called? I can only find the stored procedure declaration in swh-storage [1].
My understanding of the "Objects added by time period" dashboard is that it uses the sql_swh_archive_object_count prometheus metric.
This metric seems to be populated by the prometheus sql exporter, which executes the following query calling the swh_stat_counters() stored procedure:
select label as object_type, value from swh_stat_counters()
This stored procedure[1] returns the estimated number of rows of the different tables based on the postgresql statistics.
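(For reference, a statistics-based estimate of this kind would look like the query below; as the following comments clarify, this is not what swh_stat_counters() actually does anymore.)

-- row-count estimates from the planner statistics
select relname, reltuples::bigint as estimated_rows
  from pg_class
 where relname in ('content', 'directory', 'revision');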
Did I miss something?
[1] https://forge.softwareheritage.org/source/swh-storage/browse/master/swh/storage/sql/40-funcs.sql$886
[2] https://forge.softwareheritage.org/source/swh-storage/browse/master/swh/storage/sql/40-funcs.sql$844
Absolutely.
> This metric seems to be populated by the prometheus sql exporter which executes the following query calling the swh_stat_counters() stored procedure:
> select label as object_type, value from swh_stat_counters()
That is correct.
> This stored procedure[1] returns the estimated number of rows of the different tables based on the postgresql statistics.
> [2] https://forge.softwareheritage.org/source/swh-storage/browse/master/swh/storage/sql/40-funcs.sql$844
(I guess you meant [2]) No, these functions do not use postgresql statistics anymore. Count estimates using the postgresql statistics can go backwards, which generates confused questions from users.
The swh_stat_counters() function uses the object_counts table, which is updated by both:
- a cron job, for the object types flagged single_update = true:
  29 2-22/4 * * * /usr/bin/chronic /usr/bin/flock -xn /srv/softwareheritage/postgres/swh-update-counter.lock /usr/bin/psql -p 5433 softwareheritage -c "select swh_update_counter(object_type) from object_counts where single_update = true order by last_update limit 1"
- the bucketed counters: updates of the object_counts_bucketed table are done by the swh_update_counter_bucketed() function, which is run in a loop on belvedere. That function does an exact count() on the range with the oldest last_update in the object_counts_bucketed table.
Now, it turns out that one iteration of the loop to update one of the bucketed revision counts takes four minutes (there are 4096 of them, so a full pass over the revision buckets would take on the order of 11 days), while doing a full count of the table takes... 17 minutes. *sigh*.
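To make the comparison concrete, here is a hedged sketch of the work done by a single bucketed iteration for revisions (the identifier range below is purely illustrative; the real range bounds live in the object_counts_bucketed table):

-- exact count over one of the 4096 identifier ranges of the revision table
select count(*)
  from revision
 where id >= '\x0aa0000000000000000000000000000000000000'
   and id <  '\x0ab0000000000000000000000000000000000000';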
I was right until the last point :). My mistake was to base my reasoning on an old version of the stored procedure.
Thanks again for the explanation, it's crystal clear now.
Btw, the main point of using buckets for object counts of large tables is that *very long running* transactions kill performance for the whole database, and have knock-on effects for logical replication. In effect, if the time we take to update the buckets is 300 times larger than making a single update, then we need to rethink this tradeoff...
I took the time to create a diagram of the pipeline to help me summarize the subject.
It should help to deploy the counters in staging.
SVG:
PNG:
Source: P888
Current status of counting the objects in one go:
$ psql service=swh
psql (13.1 (Debian 13.1-1+b1), server 12.5 (Debian 12.5-1.pgdg100+1))
SSL connection (protocol: TLSv1.3, cipher: TLS_AES_256_GCM_SHA384, bits: 256, compression: off)
Type "help" for help.

10:19 guest@softwareheritage
=> begin; set transaction read only; select count(*) from directory; rollback;
BEGIN
Time: 13.052 ms
SET
Time: 11.685 ms
   count
────────────
 7792777182
(1 row)

Time: 2689236.266 ms (44:49.236)
ROLLBACK
Time: 12.618 ms

11:05 guest@softwareheritage
=> begin; set transaction read only; select count(*) from content; rollback;
BEGIN
Time: 21.801 ms
SET
Time: 19.987 ms
   count
────────────
 9240031866
(1 row)

Time: 1858340.245 ms (30:58.340)
ROLLBACK
Time: 11.412 ms
This is much shorter than what it used to be when the buckets were implemented (thanks to parallel workers, I guess), so I think we should just make all counters update one-shot now.
One more:
11:38 guest@softwareheritage
=> begin; set transaction read only; select count(*) from revision; rollback;
BEGIN
Time: 32.545 ms
SET
Time: 16.530 ms
   count
────────────
 1947575558
(1 row)

Time: 1210886.321 ms (20:10.886)
ROLLBACK
Time: 12.746 ms
So, yeah, looks like doing this is tractable.
I've now done:
softwareheritage=> update object_counts set single_update = true;
which will make all counters get their updates via cron. I've also shortened the cron delay to 2 hours instead of 4 (providing an update for each counter every 18 hours).
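To check that the cron now cycles through every object type, the object_counts table can be inspected directly (the columns below are the ones referenced by the cron entry quoted earlier):

-- last update time of each counter, oldest first
select object_type, single_update, last_update
  from object_counts
 order by last_update;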
Changing the status to "Resolved", as there seems to be nothing more to do on this task: the counters are being updated again.