Check and complete the gitorious.org import
Closed, Migrated (Edits Locked)

Description

Our import of gitorious.org is not complete. For example, the mapping file lists the worldview repositories:

olasd@uffizi:/srv/storage/space/mirrors/gitorious.org$ grep worldview full_mapping.txt 
https://gitorious.org/worldview/worldview-gitorious-wiki.git worldview/worldview-gitorious-wiki.git
https://gitorious.org/worldview/worldview.git worldview/worldview.git

but no trace of gitorious.org/worldview shows up in our search results.

We need to:

  • cross-check what has been ingested
  • complete what is missing

We should also create a process/checklist for future ingestions to avoid these situations.

Event Timeline

rdicosmo created this task.

After dumping all origins starting with https://gitorious.org/ in the archive:

\copy (select url from origin where url >= 'https://gitorious.org/' and url < 'https://gitorious.org0') to origins csv;

and pulling the full mapping from uffizi, we have:

$ comm -1 -3 <(sort origins) <(cut -f1 -s -d' ' < full_mapping.txt | sort) | wc -l
4021

That is, 4021 origins listed in the mapping are missing from the archive.
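For a reusable version of this cross-check (e.g. as part of a checklist for future ingestions), here is a minimal Python sketch of the same set difference. The function name `missing_origins` is hypothetical; the input formats are assumed to match the files used above (one URL per line for the origins dump, `<url> <path>` per line for the mapping). Unlike `comm`, it deduplicates its inputs.

```python
def missing_origins(archived_urls, mapping_lines):
    """Return the URLs present in the mapping but absent from the archive dump.

    archived_urls: iterable of lines, one origin URL per line.
    mapping_lines: iterable of lines shaped like "<url> <path>".
    """
    archived = {line.strip() for line in archived_urls if line.strip()}

    expected = set()
    for line in mapping_lines:
        # Same behaviour as `cut -s`: skip lines that have no delimiter.
        if " " not in line:
            continue
        expected.add(line.split(" ", 1)[0])

    # Equivalent of `comm -1 -3`: keep what is only in the second input.
    return sorted(expected - archived)


if __name__ == "__main__":
    with open("origins") as origins, open("full_mapping.txt") as mapping:
        missing = missing_origins(origins, mapping)
    print(len(missing))
```

Run against the two files above, this should print the same count as the `comm` pipeline, provided neither file contains duplicate lines.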

We also have a single origin with no full visit:

select url from origin where url >= 'https://gitorious.org/' and url < 'https://gitorious.org0' and not exists (select 1 from origin_visit where origin.id = origin_visit.origin and origin_visit.snapshot is not null);
                                    url                                     
────────────────────────────────────────────────────────────────────────────
 https://gitorious.org/haskell-threads-pool/adepts-haskell-threads-pool.git
(1 row)
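
To make the logic of that query easier to check, here is a small self-contained sketch that runs the same condition against an in-memory SQLite database. The two-table schema is a toy assumption that keeps only the columns the query uses, not the real archive schema, and the helper names are hypothetical. The upper bound 'https://gitorious.org0' works because '0' is the character right after '/' in ASCII, so the range matches exactly the URLs starting with the prefix under a bytewise comparison.

```python
import sqlite3


def origins_without_full_visit(conn, prefix="https://gitorious.org/"):
    """List origins under `prefix` that have no visit with a snapshot."""
    # Bump the last character of the prefix by one code point
    # ('/' becomes '0') to get the exclusive upper bound of the range.
    upper = prefix[:-1] + chr(ord(prefix[-1]) + 1)
    rows = conn.execute(
        """
        SELECT url FROM origin
        WHERE url >= ? AND url < ?
          AND NOT EXISTS (
            SELECT 1 FROM origin_visit
            WHERE origin_visit.origin = origin.id
              AND origin_visit.snapshot IS NOT NULL
          )
        ORDER BY url
        """,
        (prefix, upper),
    ).fetchall()
    return [url for (url,) in rows]


def build_toy_db():
    """Create a toy database: one full visit, one partial, one out of range."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE origin (id INTEGER PRIMARY KEY, url TEXT);
        CREATE TABLE origin_visit (origin INTEGER, snapshot BLOB);
        INSERT INTO origin VALUES
          (1, 'https://gitorious.org/a/a.git'),
          (2, 'https://gitorious.org/b/b.git'),
          (3, 'https://gitorious.organic.example/c.git');
        INSERT INTO origin_visit VALUES (1, x'01'), (2, NULL);
        """
    )
    return conn


if __name__ == "__main__":
    print(origins_without_full_visit(build_toy_db()))
```

In the toy data, origin 2 has a visit without a snapshot and origin 3 has no visit at all, but origin 3 falls outside the prefix range, so only origin 2 is reported.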

olasd changed the task status from Open to Work in Progress. (May 19 2020, 5:02 PM)

The code for loading git repositories from disk hasn't been run in production in a while, so I've decided to run the imports of the missing repos manually.

So far (after 374 repos processed), the only (recurring) issue has been with empty repositories whose HEAD points at a non-existent refs/heads/master branch: the loader crashes because "dangling" aliases are currently forbidden in snapshots.

I'll do a further pass on these repos; I expect the simplest solution would be to prune branch aliases that point to non-existent branches (which will make these repos end up with an empty snapshot).
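
A rough sketch of that pruning idea, under the assumption that a snapshot can be modelled as a plain dict mapping branch names to `{"target_type": ..., "target": ...}` entries. This is a simplified stand-in for illustration, not the actual swh.loader.git code or data model, and `prune_dangling_aliases` is a hypothetical name.

```python
def prune_dangling_aliases(branches):
    """Return a copy of `branches` without aliases to missing branches.

    branches: dict mapping a branch name to a dict with keys
    "target_type" (e.g. "alias" or "revision") and "target".
    For an alias, "target" is the name of another branch.
    """
    pruned = dict(branches)
    changed = True
    # Repeat until nothing changes, so that an alias pointing at another
    # alias that just got removed is removed as well.
    while changed:
        changed = False
        for name, branch in list(pruned.items()):
            if branch["target_type"] == "alias" and branch["target"] not in pruned:
                del pruned[name]
                changed = True
    return pruned
```

For an empty repository whose only ref is a HEAD alias to refs/heads/master, this returns an empty dict, which matches the "empty snapshot" outcome described above.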

After the first (naive, I guess) pass, 1470 repositories are still missing.

I've landed a few fixes to swh.loader.git and swh.loader.core, deployed them, and re-started importing these missing repositories.

The following repositories failed to import. Their on-disk structure is either completely empty, or only contains refs (no actual git objects stored):

https://gitorious.org/amusewikifarm/amusewikifarm-gitorious-wiki.git
https://gitorious.org/autopkgtest/autopkgtest-gitorious-wiki.git
https://gitorious.org/colibri4k/colibri4k.git
https://gitorious.org/debian-samba/debian-samba-gitorious-wiki.git
https://gitorious.org/dotfiles-glatzor/dotfiles-glatzor-gitorious-wiki.git
https://gitorious.org/e2c2/e2c2-gitorious-wiki.git
https://gitorious.org/edf6flda/edf6flda-gitorious-wiki.git
https://gitorious.org/eso-addon-librarian-data-convertor/eso-addon-librarian-data-convertor-gitorious-wiki.git
https://gitorious.org/jamclouds/jamclouds.git
https://gitorious.org/spd-sample/spd-sample-gitorious-wiki.git
https://gitorious.org/traction-edge/traction-edge-gitorious-wiki.git
https://gitorious.org/unixc/unixc.git
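
For future ingestions, repositories like these could be flagged before running the loader. Below is a minimal sketch, assuming bare repositories with the standard git on-disk layout (loose objects under objects/<2 hex chars>/, packs under objects/pack/). The function name `has_git_objects` is hypothetical, and the check ignores objects borrowed through objects/info/alternates.

```python
from pathlib import Path

HEX_DIGITS = set("0123456789abcdef")


def has_git_objects(repo_dir):
    """Return True if the bare repository at `repo_dir` stores any object."""
    objects = Path(repo_dir) / "objects"
    if not objects.is_dir():
        return False

    # Packed objects: objects/pack/*.pack
    pack_dir = objects / "pack"
    if pack_dir.is_dir() and any(pack_dir.glob("*.pack")):
        return True

    # Loose objects: objects/<first 2 hex chars>/<rest of the object id>
    for sub in objects.iterdir():
        if sub.is_dir() and len(sub.name) == 2 and set(sub.name) <= HEX_DIGITS:
            if any(sub.iterdir()):
                return True

    return False
```

A repository for which this returns False is either completely empty or only contains refs, which are the two cases listed above.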

The following repository fails to import due to a MemoryError:

https://gitorious.org/zeq2/zeq2.git

All other repositories have been imported successfully.

olasd claimed this task.

We still need to try to ingest the zeq2 repo, but that can be done in a followup task.