Page MenuHomeSoftware Heritage

Clean up raw_extrinsic_metadata table
Closed, ResolvedPublic

Description

It's currently holding information we migrated away from (loaders now write metadata on
revision objects instead of snapshots, D4142 D4242).

The table is also big for a fresh new table (~2 months old), that won't be sustainable.

So here is the new plan (initial plan [1]):

  • Drop raw_extrinsic_metadata from the replication subscription
  • Then apply P846

[1]

> delete from raw_extrinsic_metadata 
where id like 'swh:1:snp:%' 
  and  (format = 'replicate-npm-package-json'
           or format = 'pypi-project-json'
  );
> explain ...
 Delete on raw_extrinsic_metadata  (cost=0.00..2366.00 rows=1 width=6)
   ->  Seq Scan on raw_extrinsic_metadata  (cost=0.00..2366.00 rows=1 width=6)
         Filter: ((id ~~ 'swh:1:snp:%'::text) AND ((format = 'replicate-npm-package-json'::text) OR (format = 'pypi-project-json'::text)))
(3 rows)

Event Timeline

ardumont triaged this task as Normal priority.Nov 2 2020, 2:02 PM
ardumont created this task.
ardumont updated the task description. (Show Details)

It's actually less happy from the main db though:

psql service=admin-swh
psql (12.2, server 12.4 (Debian 12.4-1.pgdg100+1))
SSL connection (protocol: TLSv1.3, cipher: TLS_AES_256_GCM_SHA384, bits: 256, compression: off)
Type "help" for help.

softwareheritage=> explain delete from raw_extrinsic_metadata
where id like 'swh:1:snp:%'
  and  (format = 'replicate-npm-package-json'
           or format = 'pypi-project-json'
  );
ERROR:  cannot delete from table "raw_extrinsic_metadata" because it does not have a replica identity and publishes deletes
HINT:  To enable deleting from the table, set REPLICA IDENTITY using ALTER TABLE.

What's the problem in the ERMDS?
Can you be more precise in the title, because seems it's a formatting issue for NPM or PyPi packages.. but I'm not sure.

@moranegg the table grew over 500GB over a couple of months.

We changed loaders to write metadata on revision objects instead of snapshots: D4142 and D4242

What's suggested by the psql prompt seems to the point. As far as my understanding goes, [2] from [1] would actually writes in replication logs the rows to be deleted from the replica.

[1]

alter table raw_extrinsic_metadata set replica identity;

[2] https://www.postgresql.org/docs/current/sql-altertable.html

Instead of a like query which is not indexed, you should use a ranged query which will be able to use an index instead (id >= 'swh:1:snp:' and id < 'swh:1:snp;')

What's suggested by the psql prompt seems to the point. As far as my understanding goes, [2] from [1] would actually writes in replication logs the rows to be deleted from the replica.

[1]

alter table raw_extrinsic_metadata set replica identity;

[2] https://www.postgresql.org/docs/current/sql-altertable.html

I don't think this is needed:

  • the replica isn't really initialized yet
  • there's a bunch of planned changes to this table, most notably adding a primary key (T2703)

In the current situation of the replica, you can just drop the table from the publication while you're doing the removal.

In the current situation of the replica, you can just drop the table from the publication while you're doing the removal.

Ack, thanks.

In the current situation of the replica, you can just drop the table from the
publication while you're doing the removal.

Ack, thanks.

Ok, so here it goes, dropping the table from the publication:

On the main db host:

softwareheritage=# \conninfo
You are connected to database "softwareheritage" as user "postgres" via socket in "/var/run/postgresql" at port "5433".
softwareheritage=# alter publication softwareheritage drop table raw_extrinsic_metadata;
ALTER PUBLICATION

On the replica host, updating the replication subscription:

softwareheritage=# alter subscription softwareheritage_replica refresh publication;
ALTER SUBSCRIPTION
softwareheritage=# select relname from pg_subscription_rel left join pg_class on srrelid=oid where relname like 'raw%';
 relname
---------
(0 rows)

Now on to your initial hints about ranges which I missed the first time...
(Thanks ;)

Now on to your initial hints about ranges which I missed the first time...
(Thanks ;)

P846

ardumont changed the task status from Open to Work in Progress.Nov 2 2020, 5:48 PM

running on belevedere in a root tmux session btw (first delete query passed [1], I triggered the rest in that tmux session):

root@belvedere:~# tmux attach -t 2749-cleanup-raw-extrinsic-metadata

[1]

softwareheritage=# delete from raw_extrinsic_metadata
where id >= 'swh:1:snp:' and id < 'swh:1:snp:1111111111111111111111111111111111111111'
and (format = 'replicate-npm-package-json'
     or format = 'pypi-project-json');
DELETE 4216052

ETA should be roughly done during the night, tomorrow morning for sure (currently at the range between 888... and 999...).

(between yesterday evening and this morning, a typo in one of the query prevented from running all the ranges so only 3 got done ¯\_(ツ)_/¯).

Almost, one last range eee... to fff... to run [1]

[1] I had yet another typos in my initial paste i fixed yesterday evening.

Almost, one last range eee... to fff... to run

done


I triggerred a last query after it to ensure nothing was left:

softwareheritage=# delete from raw_extrinsic_metadata
where id >= 'swh:1:snp:' and id < 'swh:1:snp;'
and (format = 'replicate-npm-package-json'
     or format = 'pypi-project-json');
DELETE 0

it is.

Closing.

ardumont claimed this task.