As T1156 explained, wrong snapshots and releases were created during the mercurial loading of the googlecode origins.
Those need to be cleaned up prior to rescheduling them.
Description
Status | Assigned | Task
---|---|---
Unknown Object (Maniphest Task) | |
Migrated | gitlab-migration | T367 ingest Google Code repositories
Migrated | gitlab-migration | T682 Ingest Google Code Mercurial repositories
Migrated | gitlab-migration | T1156 Fix release targets of already loaded mercurial type origins
Migrated | gitlab-migration | T1158 hg loader: Clean up wrong snapshots/releases during hg loading of googlecode
Migrated | gitlab-migration | T1159 hg loader: Schedule oneshot tasks for googlecode origin ingestion
Event Timeline
To find those releases:
select count(r.*)
from origin o
inner join origin_visit ov on o.id=ov.origin
inner join snapshot s on ov.snapshot_id=s.object_id
inner join snapshot_branches sbs on sbs.snapshot_id=s.object_id
inner join snapshot_branch sb on sbs.branch_id=sb.object_id
inner join release r on r.id=sb.target
where sb.target_type='release' and o.type='hg';
┌─────────┐
│  count  │
├─────────┤
│ 1069235 │
└─────────┘
(1 row)
Indeed, querying up to the revision results in no revision at all.
This is expected, since those releases target non-existent revisions (their ids are really changeset ids):
select count(rev.*)
from origin o
inner join origin_visit ov on o.id=ov.origin
inner join snapshot s on ov.snapshot_id=s.object_id
inner join snapshot_branches sbs on sbs.snapshot_id=s.object_id
inner join snapshot_branch sb on sbs.branch_id=sb.object_id
inner join release r on r.id=sb.target
inner join revision rev on rev.id=r.target
where sb.target_type='release' and o.type='hg';
┌───────┐
│ count │
├───────┤
│     0 │
└───────┘
(1 row)
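The same kind of check can be tried out on a toy dataset. The following is a minimal sketch, assuming a drastically simplified and hypothetical schema (only `release` and `revision` tables with text ids in an in-memory SQLite database, not the real swh schema). Instead of counting the rows that survive an inner join, it uses an anti-join to list the releases whose target matches no revision:

```python
import sqlite3

# Toy, hypothetical schema: NOT the real swh schema, just enough to show the anti-join.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table revision (id text primary key);
    create table release (id text primary key, target text);
    insert into revision values ('rev-1');
    insert into release values
        ('rel-ok', 'rev-1'),          -- targets an existing revision
        ('rel-bad', 'changeset-42');  -- targets a changeset id, not a revision id
""")


def dangling_releases(conn):
    """Return the ids of releases whose target matches no row in revision."""
    rows = conn.execute("""
        select r.id
        from release r
        left join revision rev on rev.id = r.target
        where rev.id is null
        order by r.id
    """).fetchall()
    return [row[0] for row in rows]


print(dangling_releases(conn))
```

On the real database the same idea would have to go through the origin, visit, snapshot and branch joins shown above to restrict the result to `hg` origins.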
P286 is still WIP, as a so-far unexplained exception occurs.
18:24:34 *softwareheritage-dev@[local]:5432=# select * from cleanup_wrong_data();
ERROR: value for domain sha1_git violates check constraint "sha1_git_check"
<~~~~~ what is it talking about...? No clue yet... The releases are done being cleaned up and, at that moment, it's supposed to only deal with bigint (snapshot and branch identifiers) there...
CONTEXT: PL/pgSQL function cleanup_wrong_snapshots_and_releases() line 11 at FOR over SELECT rows
SQL statement "SELECT cleanup_wrong_snapshots_and_releases()"
PL/pgSQL function cleanup_wrong_data() line 3 at PERFORM
Time: 1.033 ms
Still the gist of it is there.
Local tests are fine.
In the end, after:
- ingesting repositories with v0.0.11 loader mercurial (bugged version)
- cleaning up the model (P286)
- running the mercurial origins loading with latest loader mercurial version (v0.0.12 with anlambert's fix T1155)
- checking the snapshots up to the revision, we have data:
select distinct r.id, s.object_id, sbs.branch_id
from origin o
inner join origin_visit ov on (o.id=ov.origin and o.type='hg')
inner join snapshot s on ov.snapshot_id=s.object_id
inner join snapshot_branches sbs on s.object_id=sbs.snapshot_id
inner join snapshot_branch sb on sbs.branch_id=sb.object_id
inner join release r on (sb.target_type='release' and r.id=sb.target)
limit 10;
Something that we currently do not have in the swh db.
Heads up on this btw.
As asked on irc, P286 was reworked to trigger the listing of all objects (only snapshots* and releases were created that way at first).
There are now also "temporary" tables for origin_visit and fetch_history, which are impacted as well.
All objects that need to be cleaned up are now listed in temp tables (prior to any cleanup step):
- temp_to_cleanup_release
- temp_to_cleanup_snapshot_branches
- temp_to_cleanup_origin_visit
- temp_to_cleanup_fetch_history
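The two-phase approach (list first, delete later) can be sketched as follows. This is only an illustration on a toy in-memory SQLite database, not P286 itself: the `temp_to_cleanup_release` name comes from the list above, while the simplified `release`/`revision` layout and the selection criterion (target absent from `revision`) are assumptions made for the example.

```python
import sqlite3

# Toy, hypothetical schema: NOT the real swh schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table revision (id text primary key);
    create table release (id text primary key, target text);
    insert into revision values ('rev-1');
    insert into release values ('rel-ok', 'rev-1'), ('rel-bad', 'changeset-42');
""")


def stage_releases(conn):
    """Phase 1: record the releases to remove in a temp table; delete nothing."""
    conn.execute("""
        create temp table temp_to_cleanup_release as
        select r.id
        from release r
        left join revision rev on rev.id = r.target
        where rev.id is null
    """)
    cur = conn.execute("select id from temp_to_cleanup_release order by id")
    return [row[0] for row in cur]


def cleanup_releases(conn):
    """Phase 2: delete exactly the staged rows and return how many were removed."""
    cur = conn.execute(
        "delete from release where id in (select id from temp_to_cleanup_release)"
    )
    conn.commit()
    return cur.rowcount


staged = stage_releases(conn)      # can be reviewed before anything is deleted
deleted = cleanup_releases(conn)
remaining = [row[0] for row in conn.execute("select id from release order by id")]
print(staged, deleted, remaining)
```

Keeping the listing separate from the deletion makes it possible to inspect the staged ids before running the destructive step.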
Only the part copying to the temporary tables has been triggered so far on prado [1].
P286 has been adapted to reflect those changes [2].