Page MenuHomeSoftware Heritage

scrubber: Investigate the apparent lock (staging)
Closed, ResolvedPublic

Description

Plenty of apparently stuck insert queries in the db [1] with a db still empty [2] (out of [3])
So something must be wrong somewhere.

Let's investigate and fix that.

[1]

1022114 swh-scrubber                       swh-scrubber     127.0.0.1/32    0.0  0.0       0B       0B     503 h    transactionid    N             active   INSERT INTO datastore (package, class, instance) VALUES ('storage', 'postgresql',
'user=guest password=xxx dbname=swh host=db1.internal.staging.swh.network port=5432') ON CONFLICT DO NOTHING RETURNING id

And nothing actually written in the scrubber db as of yet since its deployment around 3 weeks ago.

15:24:26 swh-scrubber@db1:5432=> select * from corrupt_object ;
+----+-----------+--------+------------------+
| id | datastore | object | first_occurrence |
+----+-----------+--------+------------------+
+----+-----------+--------+------------------+
(0 rows)

Time: 7.313 ms
15:29:39 swh-scrubber@db1:5432=> select * from datastore;
+----+---------+-------+----------+
| id | package | class | instance |
+----+---------+-------+----------+
+----+---------+-------+----------+
(0 rows)

Time: 13.705 ms
15:29:43 swh-scrubber@db1:5432=> select * from object_origin ;
+-----------+------------+--------------+
| object_id | origin_url | last_attempt |
+-----------+------------+--------------+
+-----------+------------+--------------+
(0 rows)

Time: 5.291 ms

[3]

root@db1:~# sudo -i -u postgres pg_activity -p 5433 dbname=swh-scrubber

Event Timeline

ardumont triaged this task as Normal priority.May 10 2022, 3:30 PM
ardumont created this task.

Well, i don't get it the apparent locking situation.
Services that do need to write to the db have the proper credentials for that.

I've tentatively took the query that was the "apparent" source of the "stuckness" and run it myself.
Note that while the other services were running, the query would apparently be locked as well.
So i stopped the servicse first. And then the query ran successfully.

Hopefully, that will "unstuck" the check services.
For now, i no longer see what's described in the task description.

swh-scrubber=# INSERT INTO datastore (package, class, instance) VALUES ('storage', 'postgresql',
'user=guest password=xxx dbname=swh host=db1.internal.staging.swh.network port=5432') ON CONFLICT DO NOTHING RETURNING
id;
 id
----
 39
(1 row)

INSERT 0 1
swh-scrubber=# select * from dbversion ;
 version |            release            |    description
---------+-------------------------------+-------------------
       1 | 2022-04-06 14:34:47.749172+00 | DB initialization
(1 row)

swh-scrubber=# select * from datastore;
 id | package |   class    |                                      instance
----+---------+------------+------------------------------------------------------------------------------------
 39 | storage | postgresql | user=guest password=xxx dbname=swh host=db1.internal.staging.swh.network port=5432
(1 row)
ardumont changed the task status from Open to Work in Progress.May 25 2022, 2:07 PM
ardumont moved this task from Backlog to in-progress on the System administration board.

And now time to unstuck the debian build...

And now time to unstuck the debian build...

ok now [1], python3-swh.scrubber_0.0.2-1~swh2~bpo10+1_all.deb built.

[1] https://jenkins.softwareheritage.org/view/swh%20master%20(draft)/job/debian/job/packages/job/DSCRUB/job/gbp-buildpackage/

It was still not ok as the upgrade tooling also expects the sql/upgrades/<final-version>.sql to exist and be packaged.
There was nothing there hence [1]

[1] rDSCRUBf24850469c02a54ac7f621c0d318510a675b8c67

Finally, after installing postgresql-client-11 and stopping scrubber services [1]:

swhworker@scrubber0:~$ swh db --config-file /etc/softwareheritage/scrubber/postgres.yml upgrade scrubber --to-version=2 --module-config-key=scrubber_db
INFO:swh.core.db.db_utils:Executing migration script '/usr/lib/python3/dist-packages/swh/scrubber/sql/upgrades/2.sql'

[1]

root@scrubber0:~# apt install -y postgresql-client-11
...
root@scrubber0:~# systemctl list-units | grep swh-scrubber | awk '{print $1}' | xargs -r systemctl stop

There seem to still exist staling queries [1] and nothing gets written to the db [2]:

[1]

942102 swh-scrubber                       swh-scrubber     127.0.0.1/32    0.0  0.0       0B       0B  03:56.23       ClientRead    N      idle in trans   INSERT INTO corrupt_object (id, datastore, object) VALUES
('swh:1:rev:00241b15fa8ccd21e1b43726ea131a189c14074e', 39, '\x8ca76d657373616765c44d46697273742076657273696f6e73206f6620746865206d616e207061676573206175746f2d67656e6572617465642066726f6d2074686520594f444c0a736f757263652e0a4a6572656d792e0aa
6617574686f7283a866756c6c6e616d65c41e4a6572656d7920416c6c69736f6e203c6a72614073616d62612e6f72673ea46e616d65c40e4a6572656d7920416c6c69736f6ea5656d61696cc40d6a72614073616d62612e6f7267a9636f6d6d697474657283a866756c6c6e616d65c41e4a6572656d7920
416c6c69736f6e203c6a72614073616d62612e6f72673ea46e616d65c40e4a6572656d7920416c6c69736f6ea5656d61696cc40d6a72614073616d62612e6f7267a46461746582a974696d657374616d7082a77365636f6e6473ce3648e7f6ac6d6963726f7365636f6e647300ac6f66667365745f62797
46573c4052b30303030ae636f6d6d69747465725f6461746582a974696d657374616d7082a77365636f6e6473ce3648e7f6ac6d6963726f7365636f6e647300ac6f66667365745f6279746573c4052b30303030a474797065a3676974a96469726563746f7279c414c89ae2f4c6c38e
1202586 swh                                        None            local    5.8  0.9   21.58M       0B  02:37.00                     N             active   autovacuum: VACUUM pg_catalog.pg_depend
1534606 swh                                        None            local    6.7  0.5   25.15M       0B  01:59.00                     N             active   autovacuum: VACUUM ANALYZE pg_catalog.pg_attribute
877019 swh-scrubber                       swh-scrubber     127.0.0.1/32    0.0  0.0       0B       0B  01:36.67       ClientRead    N      idle in trans   INSERT INTO corrupt_object (id, datastore, object) VALUES
('swh:1:rev:bbda3dd7422a434f19ca8968dab8f9f210254c2f', 39, '\x8ca76d657373616765c501374368616e676520746865206465736372697074696f6e206f6e202d2d776974682d70616d5f736d627061737320746f206d616b6520697420636c6561726572202d20736f6d652070656f70650
a7365656d20746f207468696e6b20796f75206e6565642074686973206d6f64756c6520666f72206e6f726d616c2073616d62612f50414d206f7065726174696f6e2e0a0a726572756e206175746f636f6e660a726572756e206175746f6865616465720a0a284e6f74652074686174202d2d776974682d
70616d5f736d6270617373207374696c6c20646f65736e2774206275696c642c20627574206174206c65617374207768656e20697420646f65730a616761696e206f6e6c792070656f706c652077686f2061637475616c6c79206e6565642069742077696c6c2073656c656374206974292e0aa66175746
86f7283a866756c6c6e616d65c424416e6472657720426172746c657474203c61626172746c65744073616d62612e6f72673ea46e616d65c40f416e6472657720426172746c657474a5656d61696cc41261626172746c65744073616d62612e6f7267a9636f6d6d697474657283a866
20080  swh                                         swh     127.0.0.1/32    0.9 13.5    5.79M    7.10K  00:31.91     DataFileRead    N             active   SELECT swh_content_add()
842292 swh-scrubber                       swh-scrubber     127.0.0.1/32    0.0  0.0       0B       0B  00:14.72       ClientRead    N      idle in trans   INSERT INTO corrupt_object (id, datastore, object) VALUES
('swh:1:rev:4463ffeed91163314d2d903a271b29e87edaadcf', 39, '\x8ca76d657373616765c431466978207479706f20696e205749465349474e414c45442028617320706572205761696465722773207265706f7274290aa6617574686f7283a866756c6c6e616d65c420416c6578616e6465722
0426f6b6f766f79203c61624073616d62612e6f72673ea46e616d65c411416c6578616e64657220426f6b6f766f79a5656d61696cc40c61624073616d62612e6f7267a9636f6d6d697474657283a866756c6c6e616d65c420416c6578616e64657220426f6b6f766f79203c61624073616d62612e6f7267
3ea46e616d65c411416c6578616e64657220426f6b6f766f79a5656d61696cc40c61624073616d62612e6f7267a46461746582a974696d657374616d7082a77365636f6e6473ce3e4bc7d8ac6d6963726f7365636f6e647300ac6f66667365745f6279746573c4052b30303030ae636f6d6d69747465725
f6461746582a974696d657374616d7082a77365636f6e6473ce3e4bc7d8ac6d6963726f7365636f6e647300ac6f66667365745f6279746573c4052b30303030a474797065a3676974a96469726563746f7279c414f80a727ad48e6239eb15b2713251c1407096c959a973796e746865
786221 swh                                         swh     127.0.0.1/32    0.9  8.4    4.18M       0B  00:09.66     DataFileRead    N             active   SELECT swh_directory_add()
898654 swh-scrubber                       swh-scrubber     127.0.0.1/32    0.0  0.0       0B       0B  00:07.89       ClientRead    N      idle in trans   INSERT INTO corrupt_object (id, datastore, object) VALUES
('swh:1:rev:88a97d2174eee884de52e848d30bc2c459bf28bb', 39, '\x8ca76d657373616765c40b6d616b652070726f746f0aa6617574686f7283a866756c6c6e616d65c41b54696d20506f74746572203c74706f744073616d62612e6f72673ea46e616d65c40a54696d20506f74746572a5656d6
1696cc40e74706f744073616d62612e6f7267a9636f6d6d697474657283a866756c6c6e616d65c41b54696d20506f74746572203c74706f744073616d62612e6f72673ea46e616d65c40a54696d20506f74746572a5656d61696cc40e74706f744073616d62612e6f7267a46461746582a974696d657374
616d7082a77365636f6e6473ce3d7316f0ac6d6963726f7365636f6e647300ac6f66667365745f6279746573c4052b30303030ae636f6d6d69747465725f6461746582a974696d657374616d7082a77365636f6e6473ce3d7316f0ac6d6963726f7365636f6e647300ac6f66667365745f6279746573c40
52b30303030a474797065a3676974a96469726563746f7279c4140db16b279ea775d24c1bce63bf5821583825c8eba973796e746865746963c2a86d65746164617461c0a7706172656e747391c414ac84b220aa2910abec309f2dc048adb84f609786a26964c41488a97d2174eee884

[2]

18:34:36 swh-scrubber@db1:5432=> select now(), count(*) from corrupt_object ;
+-------------------------------+-------+
|              now              | count |
+-------------------------------+-------+
| 2022-05-30 16:34:44.382227+00 |     0 |
+-------------------------------+-------+
(1 row)

Time: 20.708 ms

We have plenty of processes stuck with "idle in transaction". According to [1], this means
"waiting for client inside a BEGIN block", so there might be issues in the scrubber code
[2]?

[1] https://www.postgresql.org/docs/current/monitoring-ps.html

[2]

postgres@db1:~$ psql -c 'SHOW cluster_name'
 cluster_name
--------------
 12/main
(1 row)
postgres@db1:~$ date; ps aux|grep "12/main" | grep -i scrubber
Mon May 30 16:46:54 UTC 2022
postgres  842292  0.0  0.0 34459184 47672 ?      Ss   16:11   0:00 postgres: 12/main: swh-scrubber swh-scrubber 127.0.0.1(47018) idle in transaction
postgres  877019  0.0  0.0 34459184 47444 ?      Ss   16:12   0:00 postgres: 12/main: swh-scrubber swh-scrubber 127.0.0.1(47022) idle in transaction
postgres  898654  0.0  0.0 34459184 46948 ?      Ss   16:12   0:00 postgres: 12/main: swh-scrubber swh-scrubber 127.0.0.1(47028) idle in transaction
postgres  942102  0.0  0.0 34459184 50244 ?      Ss   16:13   0:00 postgres: 12/main: swh-scrubber swh-scrubber 127.0.0.1(47034) idle in transaction
postgres 1552799  0.0  0.0 34459052 30020 ?      Ss   16:31   0:00 postgres: 12/main: postgres swh-scrubber 127.0.0.1(47078) idle
postgres 1608296  0.0  0.4 34460248 431276 ?     Ss   16:34   0:00 postgres: 12/main: swh-scrubber swh-scrubber 127.0.0.1(47108) idle in transaction
postgres 2009649  0.0  0.0 34461424 46660 ?      Ss   13:41   0:01 postgres: 12/main: postgres swh-scrubber [local] idle

Nothing bulged, same id process...
(Well, new ones appeared but old ones are still waiting for something)

postgres@db1:~$ date; ps aux|grep "12/main" | grep -i scrubber
Tue May 31 07:04:29 UTC 2022
postgres  842292  0.0  0.1 34459184 126352 ?     Ss   May30   0:01 postgres: 12/main: swh-scrubber swh-scrubber 127.0.0.1(47018) idle in transaction
postgres  877019  0.0  0.1 34459184 119512 ?     Ss   May30   0:01 postgres: 12/main: swh-scrubber swh-scrubber 127.0.0.1(47022) idle in transaction
postgres  898654  0.0  0.1 34459184 120492 ?     Ss   May30   0:01 postgres: 12/main: swh-scrubber swh-scrubber 127.0.0.1(47028) idle in transaction
postgres  942102  0.0  0.1 34459184 128324 ?     Ss   May30   0:01 postgres: 12/main: swh-scrubber swh-scrubber 127.0.0.1(47034) idle in transaction
postgres 1015246  0.0  0.0 34459296 47228 ?      Ss   May30   0:00 postgres: 12/main: swh-scrubber swh-scrubber 127.0.0.1(47744) idle in transaction
postgres 1608296  0.0  0.4 34460248 431384 ?     Ss   May30   0:00 postgres: 12/main: swh-scrubber swh-scrubber 127.0.0.1(47108) idle in transaction
postgres 1905265  0.0  0.0 34459184 44716 ?      Ss   May30   0:00 postgres: 12/main: swh-scrubber swh-scrubber 127.0.0.1(47148) idle in transaction
postgres 2441677  0.0  0.0 34461424 44632 ?      Ss   05:41   0:00 postgres: 12/main: postgres swh-scrubber [local] idle
postgres 2881245  0.0  0.0 34459184 44796 ?      Ss   May30   0:00 postgres: 12/main: swh-scrubber swh-scrubber 127.0.0.1(47332) idle in transaction
postgres 3819510  0.0  0.0 34459052 29652 ?      Ss   07:01   0:00 postgres: 12/main: postgres swh-scrubber 127.0.0.1(49700) idle
postgres 3819547  0.8  0.1 34459296 166492 ?     Ss   07:01   0:01 postgres: 12/main: swh-scrubber swh-scrubber 127.0.0.1(49704) idle in transaction
postgres 3819643  0.8  0.1 34459296 167740 ?     Ss   07:01   0:01 postgres: 12/main: swh-scrubber swh-scrubber 127.0.0.1(49710) idle in transaction
postgres 3820081  0.8  0.1 34459296 166948 ?     Ss   07:01   0:01 postgres: 12/main: swh-scrubber swh-scrubber 127.0.0.1(49716) idle in transaction
postgres 3820171  0.8  0.1 34459484 167128 ?     Ss   07:01   0:01 postgres: 12/main: swh-scrubber swh-scrubber 127.0.0.1(49722) idle in transaction

Stopping all services and letting run only 1 checker service.

The insert into corrupt objects query [1] appears and gets locked on its own [2].
All the while, the checker process continues on checking the remaining ranges...
So no data gets inserted somehow.x

[1]

Tue May 31 08:55:29 UTC 2022
postgres 1343325  0.0  0.0 34459580 33296 ?      Ss   08:42   0:00 postgres: 12/main: postgres swh-scrubber 127.0.0.1(50056) idle
postgres 1345429  1.6  0.0 34460784 45996 ?      Ss   08:42   0:12 postgres: 12/main: postgres swh-scrubber [local] idle
postgres 1386205  0.0  0.0 34460068 69728 ?      Ss   08:44   0:00 postgres: 12/main: swh-scrubber swh-scrubber 127.0.0.1(50064) idle
postgres 1455131  0.0  0.0 34459184 41044 ?      Ss   08:51   0:00 postgres: 12/main: swh-scrubber swh-scrubber 127.0.0.1(50078) idle in transaction
postgres 2441677  0.0  0.0 34461424 46588 ?      Ss   05:41   0:01 postgres: 12/main: postgres swh-scrubber [local] idle

[2]

1455131 swh-scrubber                       swh-scrubber     127.0.0.1/32    0.0  0.0       0B       0B  00:16.40       ClientRead    N      idle in trans   INSERT INTO corrupt_object (id, datastore, object) VALUES
('swh:1:rev:003600223028e16d0d77bef5c40874363ed508d2', 39, '\x8ca76d657373616765c4a1446f6e277420626f7468657220747279696e6720746f20636f6e6e65637420746f2077696e62696e6464206461656d6f6e2069662074686520554e495820736f636b657420646f65730a6e6f742
065786973742e0a0a5265696e7374617465642074686520736563757269747920636865636b206f6e2074686520554e495820736f636b6574202d206e6f7420737572652077686572652069742077656e742e0aa6617574686f7283a866756c6c6e616d65c41b54696d20506f74746572203c74706f7440
73616d62612e6f72673ea46e616d65c40a54696d20506f74746572a5656d61696cc40e74706f744073616d62612e6f7267a9636f6d6d697474657283a866756c6c6e616d65c41b54696d20506f74746572203c74706f744073616d62612e6f72673ea46e616d65c40a54696d20506f74746572a5656d616
96cc40e74706f744073616d62612e6f7267a46461746582a974696d657374616d7082a77365636f6e6473ce3906621eac6d6963726f7365636f6e647300ac6f66667365745f6279746573c4052b30303030ae636f6d6d69747465725f6461746582a974696d657374616d7082a77365

With [1] applied, the insert query got executed!

[1] D7914

And v0.0.6 tagged and deployed.

We now have corrupt object being stored in the db:

11:26:40 swh-scrubber@db1:5432=> select count(*) from corrupt_object ;
+-------+
| count |
+-------+
| 59238 |
+-------+
(1 row)

Time: 21.458 ms

No more idle in transaction queries and corrupt_object still growing [1].
We can close this.

[1] That's mitigated good news. For the good part, scrubber checkers are now actually able to write to the db.
The not so good part is that we do have corrupt objects [2].

[2]

14:22:03 swh-scrubber@db1:5432=> select count(*) from corrupt_object ;
+-------+
| count |
+-------+
| 75454 |
+-------+
(1 row)

Time: 226.231 ms
ardumont claimed this task.
ardumont moved this task from deployed/landed/monitoring to done on the System administration board.