Page MenuHomeSoftware Heritage

google import: Clean up wrong revisions
Closed, ResolvedPublic

Description

At one point, the mercurial loader was not creating the right revisions (wrong hash on parents' revision column).
This resulted in wrong hashes.

This behavior is fixed with 2caf1b1fe56835b6a03f34c94dec74e514e15fbf.

We need to clean up those entries though.

Tables impacted:

  • revision (entries with type 'hg')
  • revision_history

Event Timeline

ardumont renamed this task from google import: Clean up wrong revision to google import: Clean up wrong revisions.
ardumont updated the task description. (Show Details)

Analysis

  1. We can 'simply' delete the revision of type 'hg' as no other mercurial revision exists as of today.
  1. That will be slow though as no index exists on the revision's type.

Implementation possibilities

constraint

There is no 'on delete cascade' in our model.
So either, adding a constraint and delete from the revision table (on delete cascade, then taking care of the deletion in the revision_history).

  1. That could be costly and rebuild an index...
  2. dangerous even

queries

So, i guess remains only to delete from revision_history first and then revision.
Something like:

delete from revision_history where id in (select id from revision where type='hg');
delete from revision where type='hg';

CTE

From discussion on irc (thanks to @olasd):

with deleted as (delete from revision where type='hg' returning id) delete from revision_history where id in (select id from deleted);

Note: I'm skeptical on that one though regarding the foreign key constraint. Explain is ok with this though.

stored procedure

Probably using a stored procedure would be better to reuse the listing.
Listing one time and iterate over then delete form revision_history then from revision.

In any case, we need to make a backup dump prior to touching those tables!

ardumont added a comment.EditedFeb 20 2018, 3:45 PM

In any case, we need to make a backup dump prior to touching those tables!

Backup running on prado:

postgres@prado:~$ pg_dump --format tar --table revision_history --table revision softwareheritage | gzip -c - > /srv/remote-backups/postgres/T970/revision-revision-history.tar.gz
olasd added a comment.Feb 20 2018, 4:14 PM

In any case, we need to make a backup dump prior to touching those tables!
Backup running on prado:

postgres@prado:~$ pg_dump --format tar --table revision_history --table revision softwareheritage | gzip -c - > /srv/remote-backups/postgres/T970/revision-revision-history.tar.gz

Running a full backup of the table for the handful of revisions concerned here is a bit overkill! (better be safe than sorry and all that, but still...)

In the future, please use /srv/softwareheritage/postgres/export (local ssd storage) as initial storage to make the backup go as fast as possible.

As for the actual removal query, you can use:

with to_delete as (
    select id from revision where type='hg'
), delete_history as (
    delete from revision_history where id in (select id from to_delete)
) delete from revision where id in (select id from to_delete);

which will only perform one full scan of the revision table.

Running a full backup of the table for the handful of revisions concerned here is a bit overkill! (better be safe than sorry and all that, but still...)

Yes. Well, we said a backup.

What would have been more reasonable, only the data?

In the future, please use /srv/softwareheritage/postgres/export (local ssd storage) as initial storage to make the backup go as fast as possible.

Ack. I did not know nor see that partition. Thanks for the heads up.

As for the actual removal query, you can use:
...
which will only perform one full scan of the revision table.

Thanks!
It indeed worked on my local db!

ardumont updated the task description. (Show Details)Feb 20 2018, 4:25 PM

postgres@prado:~$ pg_dump --format tar --table revision_history --table revision softwareheritage | gzip -c - > /srv/remote-backups/postgres/T970/revision-revision-history.tar.gz

Not the right way as the / disk got used and quickly got filled.
This has been stopped and restarted according to the prior exchange by @olasd and prior to any server explosion.

TL;DR Command is ok but the location is not. Place yourself to the right folder before starting to dump... (as disk is used).

olasd added a comment.Feb 21 2018, 4:57 PM

FWIW the backup has now completed.

Thanks for the heads up.

ardumont changed the task status from Open to Work in Progress.Feb 23 2018, 10:30 AM

Status:

  • [DONE] backup
  • [IN-PROGRESS] Clean up in progress
ardumont closed this task as Resolved.Feb 24 2018, 5:32 PM