
hg loader: Clean up wrong snapshots/releases during hg loading of googlecode
Closed, Resolved (Public)

Description

As T1156 explained, wrong snapshots and releases were created during the mercurial loading of the googlecode origins.
Those need to be cleaned up prior to rescheduling them.

Event Timeline

ardumont triaged this task as Normal priority.
ardumont renamed this task from Clean up wrong releases to hg loader: Clean up wrong releases during hg loading of googlecode.
ardumont renamed this task from hg loader: Clean up wrong releases during hg loading of googlecode to hg loader: Clean up wrong snapshots/releases during hg loading of googlecode. Jul 25 2018, 2:04 PM
ardumont updated the task description.
ardumont added a comment. Edited Jul 25 2018, 2:25 PM

To find those releases:

select count(r.*)
from origin o
inner join origin_visit ov on o.id=ov.origin
inner join snapshot s on ov.snapshot_id=s.object_id
inner join snapshot_branches sbs on sbs.snapshot_id=s.object_id
inner join snapshot_branch sb on sbs.branch_id=sb.object_id
inner join release r on r.id=sb.target
where sb.target_type='release' and o.type='hg';
┌─────────┐
│  count  │
├─────────┤
│ 1069235 │
└─────────┘
(1 row)

Extending the query up to the revision returns no revision at all.
This is expected since those releases target non-existent revisions (their ids are really changeset ids):

select count(rev.*)
from origin o
inner join origin_visit ov on o.id=ov.origin
inner join snapshot s on ov.snapshot_id=s.object_id
inner join snapshot_branches sbs on sbs.snapshot_id=s.object_id
inner join snapshot_branch sb on sbs.branch_id=sb.object_id
inner join release r on r.id=sb.target
inner join revision rev on rev.id=r.target
where sb.target_type='release' and o.type='hg';
┌───────┐
│ count │
├───────┤
│     0 │
└───────┘
(1 row)
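
The zero count comes from the inner join on revision, which drops every release whose target is missing. As a minimal sketch, assuming only the tables and columns already used in the queries above, the same check can be written as an anti-join that counts the dangling releases directly:

```sql
-- Count hg releases reachable from a snapshot whose target has no
-- matching row in revision (i.e. a dangling target).
-- count(distinct r.id) counts each release once, whereas count(r.*)
-- in the queries above counts one row per join path.
select count(distinct r.id)
from origin o
inner join origin_visit ov on o.id=ov.origin
inner join snapshot s on ov.snapshot_id=s.object_id
inner join snapshot_branches sbs on sbs.snapshot_id=s.object_id
inner join snapshot_branch sb on sbs.branch_id=sb.object_id
inner join release r on r.id=sb.target
left join revision rev on rev.id=r.target   -- keep releases without a revision
where sb.target_type='release'
  and o.type='hg'
  and rev.id is null;                        -- only the dangling ones
```

If every hg release is wrong, as the counts above suggest, this should match the number of distinct releases found by the first query.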

P286 is still WIP as a not yet understood exception occurs.

18:24:34 *softwareheritage-dev@[local]:5432=#   select * from cleanup_wrong_data();
ERROR:  value for domain sha1_git violates check constraint "sha1_git_check" <~~~~~ what is it talking about...? No clue yet... The releases have already been cleaned up
                                                                                    and at that point, it's supposed to only deal with bigint (snapshot and branch
                                                                                    identifiers) there...
CONTEXT:  PL/pgSQL function cleanup_wrong_snapshots_and_releases() line 11 at FOR over SELECT rows
SQL statement "SELECT cleanup_wrong_snapshots_and_releases()"
PL/pgSQL function cleanup_wrong_data() line 3 at PERFORM
Time: 1.033 ms
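
For reference, this class of error is raised whenever a value is coerced into a domain type and fails the domain's check. The sketch below reproduces it with a hypothetical domain, demo_sha1_git, which assumes a 20-byte length check. The actual definition of sha1_git is not shown in this task, and this does not claim to be the cause of the failure above.

```sql
begin;

-- Hypothetical stand-in for sha1_git: a bytea restricted to 20 bytes.
create domain demo_sha1_git as bytea check (length(value) = 20);

-- 40 hex digits decode to 20 bytes: the cast is accepted.
select decode(repeat('ab', 20), 'hex')::demo_sha1_git;

-- 1 byte: the cast is rejected with a "violates check constraint" error.
select decode('ab', 'hex')::demo_sha1_git;

-- The failed statement aborts the transaction; rollback also drops the domain.
rollback;
```

In a PL/pgSQL function the same check fires when a selected value is assigned to a variable or column declared with the domain type, so the types of the loop variables are worth checking first.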

Still, the gist of it is there.

ardumont changed the task status from Open to Work in Progress. Jul 25 2018, 6:28 PM

> P286 is still WIP as a not yet understood exception occurs.

It's now fixed and ready.

@zack or @olasd, if you have some time to review P286 at some point, that would be awesome.

Local tests are fine.

In the end, after:

  • ingesting repositories with v0.0.11 loader mercurial (bugged version)
  • cleaning up the model (P286)
  • running the mercurial origins loading with latest loader mercurial version (v0.0.12 with anlambert's fix T1155)
  • checking the snapshots up to the revision, we have data:
select distinct r.id, s.object_id, sbs.branch_id
from origin o
inner join origin_visit ov on (o.id=ov.origin and o.type='hg')
inner join snapshot s on ov.snapshot_id=s.object_id
inner join snapshot_branches sbs on s.object_id=sbs.snapshot_id
inner join snapshot_branch sb on sbs.branch_id=sb.object_id
inner join release r on (sb.target_type='release' and r.id=sb.target)
limit 10;

Something that we currently do not have in the swh db.

zack added a comment. Jul 26 2018, 6:18 PM

> @zack or @olasd, if you have some time to review P286 at some point, that would be awesome.

I punt to @olasd, as I won't have bandwidth to take care of this one while away.

Heads up on this btw.

As asked on IRC, P286 was reworked to list all impacted objects (at first, only snapshots* and releases were listed that way).
There are now also "temporary" tables for origin_visit and fetch_history, which are impacted as well.

All objects that need to be cleaned up are now copied into temp tables (prior to any cleanup step):

  • temp_to_cleanup_release
  • temp_to_cleanup_snapshot_branches
  • temp_to_cleanup_origin_visit
  • temp_to_cleanup_fetch_history
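
A minimal sketch of this stage-then-delete pattern for the release table is given below. It is not the content of P286. It assumes a real temporary table with a single id column (the "temporary" tables here may well be regular tables), reuses the o.type='hg' predicate from the queries above, and leaves out the deletion order needed across tables that reference each other.

```sql
begin;

-- Stage the ids to remove before touching any data, so the set can be
-- reviewed (and reused) independently of the deletion itself.
create temporary table temp_to_cleanup_release as
select distinct r.id
from origin o
inner join origin_visit ov on o.id=ov.origin
inner join snapshot s on ov.snapshot_id=s.object_id
inner join snapshot_branches sbs on sbs.snapshot_id=s.object_id
inner join snapshot_branch sb on sbs.branch_id=sb.object_id
inner join release r on r.id=sb.target
where sb.target_type='release' and o.type='hg';

-- Review step: compare with the count found earlier.
select count(*) from temp_to_cleanup_release;

-- Cleanup step: delete only what was staged.
delete from release r
using temp_to_cleanup_release t
where r.id = t.id;

-- Dry run by default; replace with commit once the counts look right.
rollback;
```

Keeping the listing and the deletion as separate steps is what allows running only the copy part first.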

Only the part that copies to the temporary tables has been run so far on prado [1].

P286 has been adapted to reflect those changes [2].

[1] https://forge.softwareheritage.org/P286$161

[2] https://forge.softwareheritage.org/P286$163#1956

ardumont closed this task as Resolved. Sep 20 2018, 11:29 AM
ardumont claimed this task.