
Archive counters are no longer updated in production
Closed, Migrated (Edits Locked)

Description

Since 2020-11-25, the object counters have not been computed.
This impacts:

  • the production counters
  • at least the "Objects added by time period" grafana dashboard [1]

[1] https://grafana.softwareheritage.org/d/BAQISLkGk/objects-added-by-time-period?orgId=1&var-environment=production&var-interval=1d&var-object_type=All

Event Timeline

vsellier changed the task status from Open to Work in Progress. (Nov 30 2020, 3:20 PM)
vsellier triaged this task as High priority.
vsellier created this task.

After retracing the counter computation pipeline, it seems they are computed from the values stored in Prometheus.

The query executed by the sql exporter on belvedere runs fine, in a couple of milliseconds:

softwareheritage=> select label as object_type, value from swh_stat_counters();
   object_type   |   value    
-----------------+------------
 content         | 9152229086
 directory       | 7733163858
 origin          |  144960097
 origin_visit    |  817462103
 person          |   39726003
 release         |   16325031
 revision        | 1934143430
 skipped_content |     139323
 snapshot        |  137260210
(9 rows)

Time: 21.589 ms

The best lead at the moment: it seems pergamon can't read the values from the sql exporter on belvedere:

vsellier@pergamon /etc/prometheus/exported-configs
 % time curl http://192.168.100.210:9237/metrics
^C
curl http://192.168.100.210:9237/metrics  0.01s user 0.01s system 0% cpu 5:21.46 total

It never succeeds, or takes far too long.

Belvedere has been under heavy load since 2020-11-25 around 23:45 UTC, which seems to match the last complete update of the prometheus counters.

zack renamed this task from "Production counters not up to date" to "Archive counters are no longer updated in production". (Nov 30 2020, 4:02 PM)
zack raised the priority of this task from High to Unbreak Now!.
zack added a subscriber: rdicosmo.

It seems some queries are executed on the database each time the metrics are requested.
This one takes too long (on the swh-scheduler instance):

postgres=# select now(); select * from pg_stat_activity where state != 'idle';    
select now();
-[ RECORD 1 ]----------------------
now | 2020-11-30 15:00:17.358361+00
...
-[ RECORD 1 ]----+------------------------------
datid            | 16402
datname          | softwareheritage-scheduler
pid              | 3360876
usesysid         | 10
usename          | postgres
application_name | prometheus-sql-exporter
client_addr      | 
client_hostname  | 
client_port      | -1
backend_start    | 2020-11-30 13:41:09.391675+00
xact_start       | 2020-11-30 14:56:03.064342+00
query_start      | 2020-11-30 14:56:03.064342+00
state_change     | 2020-11-30 14:56:03.064353+00
wait_event_type  | IO
wait_event       | DataFileRead
state            | active
backend_xid      | 
backend_xmin     | 3436325849
query            | with task_count_by_bucket as (                                            +
                 |   -- get the count of tasks by delay bucket. Tasks are grouped by their   +
                 |   -- characteristics (type, status, policy, priority, current interval),  +
                 |   -- then by delay buckets that are 1 hour wide between -24 and +24 hours,+
                 |   -- and 1 day wide outside of this range.                                +
                 |   -- A positive delay means the task execution is late wrt scheduling.    +
                 |   select                                                                  +
                 |     "type",                                                               +
                 |     status,                                                               +
                 |     "policy",                                                             +
                 |     priority,                                                             +
                 |     current_interval,                                                     +
                 |     (                                                                     +
                 |       -- select the bucket widths                                         +
                 |       case when delay between - 24 * 3600 and 24 * 3600 then              +
                 |         (ceil(delay / 3600)::bigint) * 3600                               +
                 |       else                                                                +
                 |         (ceil(delay / (24 * 3600))::bigint) * 24 * 3600                   +
                 |       end                                                                 +
                 |     ) as delay_bucket,                                                    +
                 |     count(*)                                                              +
                 |   from                                                                    +
                 |     task                                                                  +
                 |     join lateral (                                                        +
                 |       -- this is where the "positive = late" convention is set            +
                 |       select                                                              +
                 |         extract(epoch from (now() - next_run)) as delay                   +
                 |     ) as d on true                                                        +
                 |     group by                                                              +
                 |       "type",                                                             +
                 |       status,                                                             +
                 |       "policy",                                                           +
                 |       priority,                                                           +
                 |       current_interval,                                                   +
                 |       delay_bucket                                                        +
                 |     order by                                                              +
                 |   
backend_type     | client backend

Several are running in parallel:

postgres=# select query, count(datid) from pg_stat_activity where application_name='prometheus-sql-exporter' and state != 'idle' group by query;
select query, count(datid) from pg_stat_activity where application_name='prometheus-sql-exporter' and state != 'idle' group by query;
-[ RECORD 1 ]---------------------------------------------------------------------
query | with task_count_by_bucket as (                                            +
      |   -- get the count of tasks by delay bucket. Tasks are grouped by their   +
      |   -- characteristics (type, status, policy, priority, current interval),  +
      |   -- then by delay buckets that are 1 hour wide between -24 and +24 hours,+
      |   -- and 1 day wide outside of this range.                                +
      |   -- A positive delay means the task execution is late wrt scheduling.    +
      |   select                                                                  +
      |     "type",                                                               +
      |     status,                                                               +
      |     "policy",                                                             +
      |     priority,                                                             +
      |     current_interval,                                                     +
      |     (                                                                     +
      |       -- select the bucket widths                                         +
      |       case when delay between - 24 * 3600 and 24 * 3600 then              +
      |         (ceil(delay / 3600)::bigint) * 3600                               +
      |       else                                                                +
      |         (ceil(delay / (24 * 3600))::bigint) * 24 * 3600                   +
      |       end                                                                 +
      |     ) as delay_bucket,                                                    +
      |     count(*)                                                              +
      |   from                                                                    +
      |     task                                                                  +
      |     join lateral (                                                        +
      |       -- this is where the "positive = late" convention is set            +
      |       select                                                              +
      |         extract(epoch from (now() - next_run)) as delay                   +
      |     ) as d on true                                                        +
      |     group by                                                              +
      |       "type",                                                             +
      |       status,                                                             +
      |       "policy",                                                           +
      |       priority,                                                           +
      |       current_interval,                                                   +
      |       delay_bucket                                                        +
      |     order by                                                              +
      |   
count | 3

Time: 1.362 ms

(Comment edited to add a filter on idle queries.)

Hmmm... there is definitely no need to update the counters more than once a day

Let's try a temporary workaround:

root@belvedere:/etc/prometheus-sql-exporter# puppet agent --disable "Diagnose prometheus-exporter timeout" 
root@belvedere:/etc/prometheus-sql-exporter# mv swh-scheduler.yml ~
root@belvedere:/etc/prometheus-sql-exporter# systemctl restart prometheus-sql-exporter

That query is no longer executed, but there is another one taking a long time, this time on the main swh database:

postgres=# select query, count(datid) from pg_stat_activity where application_name='prometheus-sql-exporter' and state != 'idle' group by query;
select query, count(datid) from pg_stat_activity where application_name='prometheus-sql-exporter' and state != 'idle' group by query;
-[ RECORD 1 ]----------------------------------------------------
query | SELECT current_database()::text AS datname, COALESCE(schemaname::text, 'null') AS schemaname, COALESCE(relname::text, 'null') AS relname, SUM(COALESCE(heap_blks_read, 0) + COALESCE(heap_blks_hit, 0) + COALESCE(idx_blks_hit, 0) + COALESCE(idx_blks_read, 0) + COALESCE(toast_blks_hit, 0) + COALESCE(toast_blks_read, 0) + COALESCE(tidx_blks_hit, 0) + COALESCE(tidx_blks_read, 0)) * 8192::bigint as reads FROM pg_statio_user_tables FULL JOIN (VALUES(0)) filler(i) ON TRUE GROUP BY 1, 2, 3
count | 1

Hmmm... there is definitely no need to update the counters more than once a day

The counters themselves are indeed updated once a day.

The problem is with the probes used to monitor the behavior of the archive; these are the counters that are aggregated to compute the daily counter (if I have understood the whole pipeline correctly ;) ).

It seems there is no other solution than reducing the load on belvedere.
There is an aggressive backfill in progress from getty (192.168.100.102):

postgres=# select client_addr, count(datid) from pg_stat_activity where state != 'idle' group by client_addr;
select client_addr, count(datid) from pg_stat_activity where state != 'idle' group by client_addr;
   client_addr   | count 
-----------------+-------
                 |     3
 192.168.100.18  |     0
 ::1             |     1
 192.168.100.210 |    60
 192.168.100.102 |    64
(5 rows)

I don't want to kill the job that has been running for several days (since 2020-11-27), to avoid losing any work. The temporary solution is to reduce the number of workers to relieve the load on belvedere.

Half of the workers were stopped:

root@pergamon:~# sudo clush -b -w @swh-workers16 'puppet agent --disable "Reduce load of belvedere"; cd /etc/systemd/system/multi-user.target.wants; for unit in swh-worker@*; do systemctl disable $unit; done; systemctl stop swh-worker@*'
worker12: Removed /etc/systemd/system/multi-user.target.wants/swh-worker@checker_deposit.service.
worker11: Removed /etc/systemd/system/multi-user.target.wants/swh-worker@checker_deposit.service.
worker10: Removed /etc/systemd/system/multi-user.target.wants/swh-worker@checker_deposit.service.
worker09: Removed /etc/systemd/system/multi-user.target.wants/swh-worker@checker_deposit.service.
worker12: Removed /etc/systemd/system/multi-user.target.wants/swh-worker@lister.service.
...

The load on belvedere fell from more than 130 to under ~100, but it seems that is not enough to get the metrics within a reasonable delay.

@olasd stopped the backfilling with:

pkill -2 -u swhstorage -f revision

(SIGINT lets the process flush its logs before exiting.)

A lot of requests doing ranged queries on the contents remain; they could be issued by the indexers restarted this morning.
They were also stopped:

root@pergamon:/etc/clustershell# sudo clush -b -w @azure-workers 'puppet agent --disable "Reduce belvedere load"; cd /etc/systemd/system/multi-user.target.wants; for unit in "swh-worker@indexer_origin_intrinsic_metadata.service swh-worker@indexer_fossology_license.service swh-worker@indexer_content_mimetype.service"; do systemctl disable $unit; done; systemctl stop swh-worker@indexer_origin_intrinsic_metadata.service swh-worker@indexer_fossology_license.service swh-worker@indexer_content_mimetype.service'

D4635 is a proposal to solve the performance issues with the statistics queries.

As the slowness of the monitoring queries doesn't seem to be related to the direct load on the database, the indexers were restarted:

vsellier@pergamon ~ % sudo clush -b -w @azure-workers 'cd /etc/systemd/system/multi-user.target.wants; for unit in "swh-worker@indexer_origin_intrinsic_metadata.service swh-worker@indexer_fossology_license.service swh-worker@indexer_content_mimetype.service"; do systemctl enable $unit; done; systemctl start swh-worker@indexer_origin_intrinsic_metadata.service swh-worker@indexer_fossology_license.service swh-worker@indexer_content_mimetype.service; puppet agent --enable'

D4635 has landed.

The metrics are now retrieved in less than 20s

curl http://192.168.100.210:9237/metrics  0.01s user 0.03s system 0% cpu 18.079 total

Let's see if things are getting better

Thanks for looking into this. It would be great to make sure that statistics are collected only once at a time (every X hours), and cached, so to avoid rerunning expensive queries regularly.

The postgresql statistics are back online [1].
The "Objects added by time period" dashboard [2] also has data to display again.

[1] https://grafana.softwareheritage.org/goto/-dNByJ0Mk
[2] https://grafana.softwareheritage.org/goto/RRG3s1AMz

Thanks for looking into this. It would be great to make sure that statistics are collected only once at a time (every X hours), and cached, so to avoid rerunning expensive queries regularly.

The problem was with some queries used for the internal monitoring of postgresql (dashboard [1] in my previous comment), which start to misbehave when postgresql is under heavy load.
The swh object counter stats themselves are not expensive. The missing updates were collateral damage of the monitoring issue.

Erratum: the counters are not yet visible on the "Object added by time period" dashboard due to the aggregation per day.

All stopped workers have been restarted:

vsellier@pergamon ~ % sudo clush -b -w @swh-workers16 'puppet agent --enable; systemctl default'

Erratum: the counters are not yet visible on the "Object added by time period" dashboard due to the aggregation per day.

Yes and no.

There are two types of archive counters (a simplified schema sketch follows the list):

  • single-update counters, where the size of the table is counted with a single count on the full table; there are 5 of these counters (skipped content, origin_visit, origin_visit_status, snapshot, person), and one of them is updated (by cron) every 4 hours, so each is updated every 20 hours
  • bucketed counters, where the size of the table is counted by object range (content, directory, revision, release). Ranges are updated continuously, one after the other, and the main table counter is updated once for every 256 updates of a bucket.
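
A minimal sketch of how these two families could be represented, just to fix ideas (table and column names are illustrative, not the authoritative swh-storage schema):

  -- illustrative sketch only, not the real swh-storage schema
  create table object_counts (
      object_type   text primary key,   -- 'content', 'revision', ...
      value         bigint,             -- last counted value
      last_update   timestamptz,        -- when the value was last refreshed
      single_update boolean             -- true: refreshed in one shot by cron
  );

  create table object_counts_bucketed (
      object_type  text,                -- e.g. 'revision'
      bucket_start bytea,               -- start of the object id range covered by this bucket
      bucket_end   bytea,               -- end of the range
      value        bigint,              -- exact count of objects in the range
      last_update  timestamptz          -- buckets are refreshed oldest-first, one at a time
  );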

The main issue is that the update of "counts per bucket" is now taking a (very) long time, for some object types (currently, revision):

Tue Dec  1 11:59:11 2020 (every 0.1s)

 swh_update_counter_bucketed 
-----------------------------
 
(1 row)

Time: 237774.102 ms (03:57.774)

We don't parallelize bucketed counts by object type, and there are a lot of buckets (256 or 4096 per object type), so the counters don't really get updated that often...

Maybe updating the counters "one shot" for all tables would be more sensible now... This needs to be timed.

Thanks for the clarification.
I missed those counters; I was only focused on the sql_swh_archive_object_count metrics. Could you give some pointers or information on how they are called? I could only find the stored procedure declaration in swh-storage [1].

My understanding of the "Objects added by time period" dashboard is that it uses the sql_swh_archive_object_count prometheus metric.

This metric seems to be populated by the prometheus sql exporter, which executes the following query calling the swh_stat_counters() stored procedure:

select label as object_type, value from swh_stat_counters()

This stored procedure[1] returns the estimated number of rows of the different tables based on the postgresql statistics.

Did I miss something?

[1] https://forge.softwareheritage.org/source/swh-storage/browse/master/swh/storage/sql/40-funcs.sql$886
[2] https://forge.softwareheritage.org/source/swh-storage/browse/master/swh/storage/sql/40-funcs.sql$844

Thanks for the clarification.
I missed those counters; I was only focused on the sql_swh_archive_object_count metrics. Could you give some pointers or information on how they are called? I could only find the stored procedure declaration in swh-storage [1].

My understanding of the "Objects added by time period" dashboard is that it uses the sql_swh_archive_object_count prometheus metric.

Absolutely.

This metric seems to be populated by the prometheus sql exporter, which executes the following query calling the swh_stat_counters() stored procedure:

select label as object_type, value from swh_stat_counters()

That is correct.

This stored procedure[1] returns the estimated number of rows of the different tables based on the postgresql statistics.

[2] https://forge.softwareheritage.org/source/swh-storage/browse/master/swh/storage/sql/40-funcs.sql$844

(I guess you meant [2]) No, these functions do not use postgresql statistics anymore. Count estimates using the postgresql statistics can go backwards, which generates confused questions from users.
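
For context, "count estimates using the postgresql statistics" refers to reading the planner's row estimates instead of counting rows, along the lines of the query below (an illustration of the technique, not necessarily the exact query the old function used). These estimates are only refreshed by (auto)analyze/vacuum and can decrease between refreshes, hence counters "going backwards":

  -- planner row estimates: cheap, but approximate and non-monotonic
  select relname, reltuples::bigint as estimated_rows
  from pg_class
  where relname in ('content', 'directory', 'revision');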

The swh_stat_counters() function uses the object_counts table, which is updated by both:

  • the swh_update_counter function for "single-update counters". This does an exact count(*) on the table passed as argument. This function is called by the following crontab entry for postgres@belvedere:
29 2-22/4  *   *   *     /usr/bin/chronic /usr/bin/flock -xn /srv/softwareheritage/postgres/swh-update-counter.lock /usr/bin/psql -p 5433 softwareheritage -c "select swh_update_counter(object_type) from object_counts where single_update = true order by last_update limit 1"
  • the swh_update_counters_from_buckets trigger on the object_counts_bucketed table, which happens every 256 updates of that table.

Updates of the object_counts_bucketed table are done by the swh_update_counter_bucketed() function, which is run in a loop on belvedere. That function does an exact count() on the range with the oldest last_update in the object_counts_bucketed table.
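
To make the single-update path concrete, here is a simplified sketch of what swh_update_counter plausibly does, based on the description above (the authoritative definition lives in swh/storage/sql/40-funcs.sql, not here):

  -- simplified sketch; see swh-storage's 40-funcs.sql for the real definition
  create or replace function swh_update_counter(object_type text)
      returns void
      language plpgsql
  as $$
  begin
      -- exact count(*) on the table named by object_type, upserted into object_counts
      execute format('
          insert into object_counts (object_type, value, last_update)
          values (%1$L, (select count(*) from %1$I), now())
          on conflict (object_type) do update
            set value = excluded.value, last_update = excluded.last_update',
          object_type);
  end;
  $$;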

Now, it turns out that one iteration of the loop to update one of the bucketed revision counts takes four minutes (there are 4096 of them), while doing a full count of the table takes... 17 minutes. *sigh*.
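
Rough back-of-the-envelope arithmetic from the figures above:

  4096 buckets x ~4 min/bucket ≈ 16384 min ≈ 11 days for one full pass over the revision buckets,
  versus ~17 min for a single full count(*) of the revision table.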

I was right until the last point :). My mistake was to base my reasoning on an old version of the stored procedure.
Thanks again for the explanation, it's crystal clear now.

Btw, the main point of using buckets for object counts of large tables is that *very long running* transactions kill performance for the whole database, and have knock-on effects for logical replication. In effect, if the time we take to update the buckets is 300 times larger than making a single update, then we need to rethink this tradeoff...

I took the time to create a diagram of the pipeline to help me summarize the subject.
It should help with deploying the counters in staging.
The diagram is attached as SVG and PNG; source: P888.

Current status of counting the objects in one go:

$ psql service=swh              
psql (13.1 (Debian 13.1-1+b1), server 12.5 (Debian 12.5-1.pgdg100+1))
SSL connection (protocol: TLSv1.3, cipher: TLS_AES_256_GCM_SHA384, bits: 256, compression: off)
Type "help" for help.

10:19 guest@softwareheritage => begin; set transaction read only; select count(*) from directory; rollback;
BEGIN
Time: 13.052 ms
SET
Time: 11.685 ms
   count    
────────────
 7792777182
(1 row)

Time: 2689236.266 ms (44:49.236)
ROLLBACK
Time: 12.618 ms
11:05 guest@softwareheritage => begin; set transaction read only; select count(*) from content; rollback;
BEGIN
Time: 21.801 ms
SET
Time: 19.987 ms
   count    
────────────
 9240031866
(1 row)

Time: 1858340.245 ms (30:58.340)
ROLLBACK
Time: 11.412 ms

This is much shorter than what it used to be when the buckets were implemented (thanks to parallel workers, I guess), so I think we should just make all counters update one-shot now.

One more:

11:38 guest@softwareheritage => begin; set transaction read only; select count(*) from revision; rollback;
BEGIN
Time: 32.545 ms
SET
Time: 16.530 ms
   count    
────────────
 1947575558
(1 row)

Time: 1210886.321 ms (20:10.886)
ROLLBACK
Time: 12.746 ms

So, yeah, looks like doing this is tractable.

I've now run update object_counts set single_update = true;, which will make all counters get their updates via cron. I've also shortened the cron interval from 4 hours to 2 (providing an update for each counter every 18 hours).
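
The exact new crontab entry isn't quoted here; a hypothetical schedule matching the description (same command as the 4-hourly entry shown earlier in the thread) would be:

29 */2  *   *   *     /usr/bin/chronic /usr/bin/flock -xn /srv/softwareheritage/postgres/swh-update-counter.lock /usr/bin/psql -p 5433 softwareheritage -c "select swh_update_counter(object_type) from object_counts where single_update = true order by last_update limit 1"

With 9 object types refreshed round-robin, one per run, each counter gets recomputed roughly every 18 hours.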

Changing the status to "Resolved", as it seems there is nothing more to do on this task: the counters have started to be updated again.