
Check and complete the import
Started, Work in Progress, High, Public


Our Gitorious import is not complete. For example:

olasd@uffizi:/srv/storage/space/mirrors/$ grep worldview full_mapping.txt
worldview/worldview-gitorious-wiki.git
worldview/worldview.git

but no trace of these repositories shows up in our search results.

We need to:

  • cross-check what has been ingested
  • complete what is missing

Also, create a process/checklist for future ingestions to avoid these situations.

Event Timeline

rdicosmo triaged this task as High priority. Tue, May 19, 9:49 AM
rdicosmo created this task.

After dumping all Gitorious origins in the archive:

\copy (select url from origin where url >= '' and url < 'https://gitorious.org0') to origins csv;

and pulling the full mapping from uffizi, we have:

$ comm -1 -3 <(sort origins) <(cut -f1 -s -d' ' < full_mapping.txt | sort) | wc -l

4021 missing origins.
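
For a future checklist, the same cross-check could be scripted. The following is only a minimal sketch, assuming the archive dump has one URL per line and that each mapping line starts with the origin URL followed by a space; the function name `missing_origins` is hypothetical. Unlike `comm`, it deduplicates entries and does not depend on sort order or locale.

```python
def missing_origins(archived_urls, mapping_lines):
    """Return the mapping URLs that are absent from the archive dump, sorted.

    archived_urls: iterable of lines, one origin URL per line (assumed format).
    mapping_lines: iterable of lines whose first space-separated field is
                   the origin URL (assumed format).
    """
    archived = {url.strip() for url in archived_urls if url.strip()}
    mapped = set()
    for line in mapping_lines:
        # Mirror `cut -f1 -s -d' '`: keep the first field and skip lines
        # that contain no space delimiter at all.
        if " " not in line:
            continue
        mapped.add(line.split(" ", 1)[0])
    # Equivalent in spirit to `comm -1 -3`: entries only in the mapping.
    return sorted(mapped - archived)
```

With both files open, `len(missing_origins(open("origins"), open("full_mapping.txt")))` would give the count of missing origins.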

We also have a single origin with no full visit:

select url from origin where url >= '' and url < 'https://gitorious.org0' and not exists (select 1 from origin_visit where origin.id = origin_visit.origin and origin_visit.snapshot is not null);
(1 row)
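
The "no full visit" condition can be tried out on a toy database before running it against the archive. This is a sketch only: it uses an in-memory SQLite database, and the two-table schema (an `origin` table with `id` and `url`, an `origin_visit` table with `origin` and `snapshot`) is a simplified assumption rather than the production schema. The URL range filter is left out for brevity.

```python
import sqlite3


def origins_without_full_visit(conn):
    """List origin URLs that have no visit with a non-null snapshot.

    Assumes a simplified schema: origin(id, url) and
    origin_visit(origin, visit, snapshot).
    """
    rows = conn.execute(
        """
        select url from origin
        where not exists (
            select 1 from origin_visit
            where origin_visit.origin = origin.id
              and origin_visit.snapshot is not null
        )
        order by url
        """
    )
    return [url for (url,) in rows]
```

Note that an origin with no visits at all also matches, since `not exists` is true when there are no rows to check.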
olasd changed the task status from Open to Work in Progress. Tue, May 19, 5:02 PM

The code for loading git repositories from disk hasn't been run in production in a while, so I've decided to run the imports of the missing repos manually.

So far (after 374 repos processed), the only (recurring) issue has been with empty repositories whose HEAD points at a non-existent refs/heads/master branch: the loader crashes because "dangling" aliases are currently forbidden in snapshots.

I'll do a further pass on these repos; I expect the simplest solution would be to prune branch aliases that point to non-existent branches (which will make these repos end up with an empty snapshot).
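
The pruning idea could look roughly like the sketch below. It is not the loader's actual code: it assumes a snapshot's branches are available as a plain dict mapping branch names to dicts with `target_type` and `target` keys, and the function name `prune_dangling_aliases` is hypothetical. An alias is kept only if following it (possibly through other aliases) ends at a non-alias branch that exists in the mapping.

```python
def prune_dangling_aliases(branches):
    """Return a copy of `branches` without aliases that resolve to nothing.

    branches: dict of name -> {"target_type": ..., "target": ...}
              (assumed representation; every value is a dict).
    """

    def resolves(name):
        # Follow the alias chain until a non-alias branch is reached.
        seen = set()
        while name in branches and name not in seen:
            seen.add(name)
            branch = branches[name]
            if branch["target_type"] != "alias":
                return True
            name = branch["target"]
        # Either the target is missing or the aliases form a cycle.
        return False

    return {
        name: branch
        for name, branch in branches.items()
        if branch["target_type"] != "alias" or resolves(name)
    }
```

For an empty repository whose only entry is a HEAD alias to refs/heads/master, this returns an empty dict, which matches the expected empty snapshot.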