
Check and complete the import
Closed, Resolved (Public)


Our import of is not complete, for example

olasd@uffizi:/srv/storage/space/mirrors/$ grep worldview full_mapping.txt
worldview/worldview-gitorious-wiki.git
worldview/worldview.git

but no trace of shows up in our search results.

We need to:

  • cross-check what has been ingested
  • complete what is missing

Also, create a process/checklist for future ingestions to avoid these situations.

Event Timeline

rdicosmo triaged this task as High priority. May 19 2020, 9:49 AM
rdicosmo created this task.

After dumping all origins starting with in the archive:

\copy (select url from origin where url >= '' and url < 'https://gitorious.org0') to origins csv;

And pulling the full mapping from uffizi

We have

$ comm -1 -3 <(sort origins) <(cut -f1 -s -d' ' < full_mapping.txt | sort) | wc -l

4021 missing origins.
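For reference, the same cross-check can be sketched in Python. This is only an illustration of what the `comm`/`cut` pipeline computes, assuming (as the `cut -d' ' -f1` call suggests) that each mapping line starts with the origin URL followed by a space. The function name and the sample URLs are hypothetical. Unlike `comm`, it works on sets, so duplicate lines are counted once.

```python
def missing_origins(archived_lines, mapping_lines):
    """Return the mapping URLs that are absent from the archive dump, sorted."""
    archived = {line.strip() for line in archived_lines if line.strip()}
    mapped = set()
    for line in mapping_lines:
        line = line.rstrip("\n")
        # Like `cut -s -d' ' -f1`: ignore lines that have no space delimiter.
        if " " not in line:
            continue
        mapped.add(line.split(" ", 1)[0])
    # Like `comm -1 -3`: keep only what appears in the mapping alone.
    return sorted(mapped - archived)


# Hypothetical sample data, not taken from the real mapping.
sample_archived = ["https://example.org/a/a.git\n"]
sample_mapping = [
    "https://example.org/a/a.git a/a.git\n",
    "https://example.org/b/b.git b/b.git\n",
]
print(len(missing_origins(sample_archived, sample_mapping)))
```

On the real files, the two arguments would be the open `origins` and `full_mapping.txt` file objects.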

We also have a single origin with no full visit:

select url from origin where url >= '' and url < 'https://gitorious.org0' and not exists (select 1 from origin_visit where = origin_visit.origin and origin_visit.snapshot is not null);
(1 row)
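The logic of that `not exists` query, restated as a small Python sketch. It assumes visits are available as `(origin_url, snapshot_id)` pairs, which is a simplification of the real schema (the query joins through the origin identifier rather than the URL), and the function name is hypothetical.

```python
def origins_without_full_visit(origin_urls, visits):
    """List origins that have no visit with a non-null snapshot.

    `visits` is an iterable of (origin_url, snapshot_id_or_None) pairs;
    this flat layout is an assumption made for illustration.
    """
    # An origin counts as fully visited once any visit produced a snapshot.
    fully_visited = {url for url, snapshot in visits if snapshot is not None}
    return [url for url in origin_urls if url not in fully_visited]
```

An origin with no visits at all is reported too, which matches the SQL: the subquery finds no row for it either.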
olasd changed the task status from Open to Work in Progress. May 19 2020, 5:02 PM

The code for loading git repositories from disk hasn't been run in production in a while, so I've decided to run the imports of the missing repos manually.

So far (after 374 repos processed), the single (recurrent) issue has been on empty repositories with a HEAD pointing at a (non-existent) refs/heads/master branch: the loader crashes as "dangling" aliases are currently forbidden in snapshots.

I'll do a further pass on these repos; I expect the simplest solution would be to prune branch aliases that point to non-existent branches (which will make these repos end up with an empty snapshot).
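A minimal sketch of that pruning idea, not the actual loader code: branches are modelled here as a plain dict mapping a branch name to a `(target_type, target)` tuple, which is an assumption for illustration and not the swh.model API. The function name is hypothetical as well.

```python
def prune_dangling_aliases(branches):
    """Drop alias branches whose target is not itself a branch.

    `branches` maps a branch name to a (target_type, target) tuple.
    This plain-dict layout is an illustrative assumption.
    """
    pruned = dict(branches)
    changed = True
    # Repeat until stable: removing one alias can leave another one dangling.
    while changed:
        changed = False
        for name, (target_type, target) in list(pruned.items()):
            if target_type == "alias" and target not in pruned:
                del pruned[name]
                changed = True
    return pruned


# An empty repository whose HEAD points at a branch that does not exist.
empty_repo = {b"HEAD": ("alias", b"refs/heads/master")}
print(prune_dangling_aliases(empty_repo))
```

With this approach the empty repositories described above would end up with no branches at all, i.e. an empty snapshot, while aliases that resolve to a real branch are left untouched.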

olasd added a comment. May 29 2020, 5:16 PM

After the first (naive, I guess) pass, 1470 repositories are still missing.

I've landed a few fixes to swh.loader.git and swh.loader.core, deployed them, and re-started importing these missing repositories.

The following repositories failed to import. Their on-disk structure is either completely empty, or only contains refs (no actual git objects stored):

The following repository fails to import due to a MemoryError:

All other repositories have been imported successfully.

olasd closed this task as Resolved. Jun 19 2020, 10:20 AM
olasd claimed this task.

We still need to try to ingest the zeq2 repo, but that can be done in a followup task.