
ingest Google Code Git repositories
Closed, Migrated

Description

We have retrieved the git repositories but have not yet ingested them.
This task is about the actual ingestion, using our loader-git.

Note:

  • Like every other mirror/backup, the data is stored under /srv/storage/space/mirrors/, in a dedicated root directory 'code.google.com' (on uffizi).
  • /srv/storage/space/mirrors/code.google.com/sources/INDEX.filesystem lists all of googlecode's repositories on disk.

Requirements:

  • Filter the git repositories. For now we only have INDEX.filesystem, which lists all googlecode repositories regardless of type (git, svn or hg). A project.json file in the same folder as each archive contains a 'repoType' field whose value is 'git', 'svn' or 'hg'.
  • As we did for googlecode's svn repositories, reconstruct the origin url: https://<project-name>.googlecode.com/
  • All git repositories are archive files (mostly zip). So we either need to uncompress every archive beforehand or, as with the googlecode svn loader, let the worker uncompress the archive into a temporary directory and then load the git repository.
  • Finally, generate an INDEX-git-archive (same structure as the one we used for gitorious) with the format: <origin_url> <path-to-git-repository-tree-or-archive>
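
The "uncompress in a temporary directory, then load" option from the requirements can be sketched as follows. This is only an illustration, not the actual loader-git or worker code: the names find_git_dir and with_unpacked_repo and the load callback are hypothetical, and it assumes a zip archive containing either a bare repository or a working tree with a .git directory.

```python
import os
import tempfile
import zipfile
from pathlib import Path


def find_git_dir(root):
    """Return the first directory under root that looks like a git repository.

    Heuristic (an assumption): a git directory contains a HEAD file and an
    objects/ subdirectory, whether it is a bare repository or a .git folder.
    """
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # make the traversal order deterministic
        if "HEAD" in filenames and "objects" in dirnames:
            return Path(dirpath)
    return None


def with_unpacked_repo(archive_path, load):
    """Unpack archive_path into a temporary directory and call load(repo_dir).

    The temporary directory is removed when the call returns, even on error.
    """
    with tempfile.TemporaryDirectory(prefix="googlecode-git-") as tmp:
        with zipfile.ZipFile(archive_path) as zf:
            zf.extractall(tmp)
        repo_dir = find_git_dir(tmp)
        if repo_dir is None:
            raise ValueError("no git repository found in %s" % archive_path)
        load(repo_dir)
```

Unpacking per task keeps the disk footprint bounded to the archives currently being loaded, at the cost of repeating the extraction whenever a load is rescheduled.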

Event Timeline

zack renamed this task from "inject googlecode's git repositories into swh" to "ingest Google Code Git repositories". Feb 12 2017, 6:14 PM
zack added a project: Restricted Project.
zack moved this task from Restricted Project Column to Restricted Project Column on the Restricted Project board. Feb 12 2017, 6:37 PM
Finally, generate a full_mapping.txt (mirroring the one from gitorious) listing <origin_url> <path-to-git-repository-tree-or-archive>.

INDEX-git-archives is the file listing only the zip archives that contain git repositories.
It also maps each archive to the origin url to use for the injection.

$ head /srv/storage/space/mirrors/code.google.com/sources/INDEX-git-archives
https://00000books.googlecode.com /srv/storage/space/mirrors/code.google.com/sources/v2/code.google.com/0/00000books/00000books-source-archive.zip
https://0000som143-osmand.googlecode.com /srv/storage/space/mirrors/code.google.com/sources/v2/code.google.com/0/0000som143-osmand/0000som143-osmand-source-archive.zip
https://001coldblade-authenticator.googlecode.com /srv/storage/space/mirrors/code.google.com/sources/v2/code.google.com/0/001coldblade-authenticator/001coldblade-authenticator-source-archive.zip
https://0043113-aa.googlecode.com /srv/storage/space/mirrors/code.google.com/sources/v2/code.google.com/0/0043113-aa/0043113-aa-source-archive.zip
https://005-iphone-project.googlecode.com /srv/storage/space/mirrors/code.google.com/sources/v2/code.google.com/0/005-iphone-project/005-iphone-project-source-archive.zip
https://007xsq-sadsad.googlecode.com /srv/storage/space/mirrors/code.google.com/sources/v2/code.google.com/0/007xsq-sadsad/007xsq-sadsad-source-archive.zip
https://00mohit-mohit.googlecode.com /srv/storage/space/mirrors/code.google.com/sources/v2/code.google.com/0/00mohit-mohit/00mohit-mohit-source-archive.zip
https://010pepe010-moog.googlecode.com /srv/storage/space/mirrors/code.google.com/sources/v2/code.google.com/0/010pepe010-moog/010pepe010-moog-source-archive.zip
https://010smithzhang-ddd.googlecode.com /srv/storage/space/mirrors/code.google.com/sources/v2/code.google.com/0/010smithzhang-ddd/010smithzhang-ddd-source-archive.zip
https://014miharu-hirayama.googlecode.com /srv/storage/space/mirrors/code.google.com/sources/v2/code.google.com/0/014miharu-hirayama/014miharu-hirayama-source-archive.zip
...
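
A listing in this format could be derived from the archive paths along these lines. This is a minimal sketch under stated assumptions, not the script that was actually used: it assumes the input is one archive path per line (the real INDEX.filesystem format is not shown in this task), that project.json sits next to the archive, and that the project name is the name of the archive's parent directory, as the paths above suggest. The function names are hypothetical.

```python
import json
from pathlib import Path


def index_line(archive_path):
    """Return '<origin_url> <archive_path>' for a git project, else None."""
    archive = Path(archive_path)
    try:
        with open(archive.parent / "project.json") as f:
            meta = json.load(f)
    except (OSError, ValueError):
        # Missing or unreadable metadata: the repository type is unknown.
        return None
    if not isinstance(meta, dict) or meta.get("repoType") != "git":
        return None
    project_name = archive.parent.name
    # No trailing slash, to match the listing shown above.
    return "https://%s.googlecode.com %s" % (project_name, archive)


def generate_index(archive_paths):
    """Yield one index line per git archive, skipping svn and hg projects."""
    for raw in archive_paths:
        path = raw.strip()
        if not path:
            continue
        line = index_line(path)
        if line is not None:
            yield line
```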

As in T617, the origin date to use for injection is 'Tue, 3 May 2016 17:16:32 +0200'. We retrieved all googlecode repositories together (git, svn, hg).

starting-date: 2017-02-15 14:42:27,724

The command to trigger the messages (run from worker01, but should be limited to it) is:

$ cat /srv/storage/space/mirrors/code.google.com/sources/INDEX-git-archives | SWH_WORKER_INSTANCE=swh_loader_git_archive ./load_googlecode.py --visit-date 'Tue, 3 May 2016 17:16:32 +0200'
{origin_url: 'https://zvaigzdinas-hih.googlecode.com', archive_path: '/srv/storage/space/mirrors/code.google.com/sources/v2/code.google.com/z/zvaigzdinas-hih/zvaigzdinas-hih-source-archive.zip', date: 'Tue, 3 May 2016 17:16:32 +0200'}
{origin_url: 'https://zvalenti-generations.googlecode.com', archive_path: '/srv/storage/space/mirrors/code.google.com/sources/v2/code.google.com/z/zvalenti-generations/zvalenti-generations-source-archive.zip', date: 'Tue, 3 May 2016 17:16:32 +0200'}
{origin_url: 'https://zvarioz.googlecode.com', archive_path: '/srv/storage/space/mirrors/code.google.com/sources/v2/code.google.com/z/zvarioz/zvarioz-source-archive.zip', date: 'Tue, 3 May 2016 17:16:32 +0200'}

where:

Visit dates have been fixed for the origins already injected.
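
For reference, the mapping from an INDEX-git-archives line to the message printed by the command above can be sketched like this. It is not the source of load_googlecode.py, which is not shown in this task; the function names are hypothetical, and the step that actually sends the message to the worker is deliberately left out.

```python
def parse_index_line(line, visit_date):
    """Turn '<origin_url> <archive_path>' into a message dictionary."""
    # Split on the first space only, so a path containing spaces stays whole.
    origin_url, archive_path = line.rstrip("\n").split(" ", 1)
    return {
        "origin_url": origin_url,
        "archive_path": archive_path,
        "date": visit_date,
    }


def messages(lines, visit_date):
    """Yield one message per non-empty index line."""
    for line in lines:
        if line.strip():
            yield parse_index_line(line, visit_date)
```

Every message carries the same visit date because all googlecode repositories (git, svn, hg) were retrieved together.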

ardumont changed the task status from Open to Work in Progress. Feb 15 2017, 7:53 PM
zack removed projects: Restricted Project, Git loader. Apr 5 2017, 2:04 PM

It's the same explanation as for the gitorious injection (T312). Only the numbers change:

  • Missing: 5.19% (4.5k out of 88.3k)
  • Failure: 13.8% (11.6k out of 83.7k; the other 72.1k succeeded)

Rescheduled and currently running.

After multiple (re)schedulings, the ingestion is now done.

86314 / 88307 have been ingested with full visits.

This gives ~2.25% of errors.
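
The error rate follows directly from the two counts above; the helper below is only an illustrative check of that arithmetic, and its name is hypothetical.

```python
def error_rate(total, ingested):
    """Return (number of origins not fully ingested, fraction of the total)."""
    errors = total - ingested
    return errors, errors / total


errors, rate = error_rate(total=88307, ingested=86314)
print("%d origins in error, %.2f%% of the total" % (errors, rate * 100))
```

With these counts, 1993 origins remain in error, which is in line with the approximate figure quoted above.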

Those errors need to be analyzed (T675).