Gitorious import: ingest repositories
Closed, Migrated, Edits Locked
Assigned To

Authored By

	zack
	Feb 22 2016, 12:28 PM

Description

The list of all old gitorious repositories as well as their actual content is now available as the gitorious valhalla, maintained by Archiveteam. We should inject all those repositories into Software Heritage.

It's not a lot of content (~120K Git repositories).

Here is what they say to people interested in mirroring:

Please don't try to mirror the contents of this web server. It's 5 terabytes (after deduplication!) and the storage is slow at the moment. If you'd like to copy the data out, please email first and we can arrange something better for everyone.

The contact email address is: gitorious-%25@xrtc.net

Related Objects

Status	Assigned	Task
		Unknown Object (Maniphest Task)
Migrated	gitlab-migration	T312 Gitorious import: ingest repositories
Migrated	gitlab-migration	T343 retrieve gitorious repositories from the gitorious valhalla
Migrated	gitlab-migration	T360 create gid 5000 and add swhworker to it (to ingest gitorious repos)
Migrated	gitlab-migration	T674 Gitorious import: Examine ingestion logs for errors and list them if any
Migrated	gitlab-migration	T815 Gitorious import: Release time conversion issue when no release date is provided
Migrated	gitlab-migration	T814 Gitorious import: unexisting object retrieval makes the loading fail
Migrated	gitlab-migration	T816 Gitorious import: loose object parsing error with corrupted file as empty one
Migrated	gitlab-migration	T819 Gitorious import: ObjectFormatException raised when badly formatted tag object exists in the repository
Migrated	gitlab-migration	T822 Gitorious import: ObjectFormatException raised when badly formatted object (around date?)
Migrated	gitlab-migration	T911 gitorious import: UnicodeDecodeError when reading references

Event Timeline

zack created this task. Feb 22 2016, 12:28 PM

Herald added a project: Staff. Feb 22 2016, 12:28 PM

zack renamed this task from ingest archived gitorious repositories to ingest gitorious repositories. Feb 22 2016, 12:37 PM

zack added a project: Origin-Gitorious.

Here is the complete list of URLs that can be used to "git clone" (via HTTPS) all the repositories available from the Gitorious valhalla:

gitorious-list.txt.gz (790 KB)

(FWIW I'm not suggesting to start using them, we should first try to contact them and see if there are better options. But this remains a viable plan B.)

We are now all set to start (after having automated it properly…) the transfer of Gitorious stuff to SWH.

Below is the last exchange with the Gitorious valhalla people, with the needed technical details.

On Fri, Mar 04, 2016 at 11:25:10AM +0100, Stefano Zacchiroli wrote:

[snip] how to go forward with the Gitorious transfer to Software Heritage. Here is a summary of the open issues to discuss before proceeding:

is sending a physical drive back and forth (paid by us) an option?

It's far too much hassle, IP transit is much simpler.

failing that, is paying your bandwidth an option? any idea of how much will it be?

The weird thing is I'm not sure ... the datacenter says that they bill us 95th percentile for it but so far as I know we've never gotten a bill for bandwidth. So don't crank it up too much and it should not be a problem?

failing that, we'll do the month-long transfer. In which case we still need to discuss the following:

a) who will do the traffic shaping? we can do it locally on your machine using something as simple as pv. Would that be OK with you?

That sounds good to me.

b) to avoid interruptions that would force restarting from scratch we propose to split the device in 1GB blocks (with "dd seek=...") and transfer each of them separately (e.g., with nc over an SSH tunnel)

Sure, if you'd like. That sounds like a thing which is best automated on your end. :) If you are able to share, I'm interested in seeing what you come up with.

When I transferred this fs to valhalla, I had to restart the transfer only once. I told dd (sending and receiving both) to seek back to the most recent 10 GB boundary, and it worked just fine.

c) for compression we propose to use lz4. Can you install liblz4-tool on your machine?

Done. Please run it under 'nice'.

d) even with the above precaution, transferring a mounted FS with dd is pretty scary (as there are changes that might happen even with read-only mounted FS). Do you have the option of creating some block-level snapshot, e.g., with LVM?

The filesystem is very much not going to change. In case you're worried still:

It is mounted read-only.

It is served by a network block device server that has been configured to not accept writes.

The fs itself is an ext4 image inside another filesystem, image is chmod 0444, and the outer filesystem mounted read-only.

And the LVM logical volume that contains this all is set to read-only.

As for durability, the volume is replicated with RAID1 (mirror) by LVM. No other data is presently stored on that volume group.

If you're ready to start copying, go ahead. You should have permission to read /dev/nbd0 already.
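The block-by-block transfer discussed in (b) and (c) could be planned roughly as follows. This is only a sketch, not the script that was actually used: the function names and the rate value are made up for illustration, and only the sending side of the pipeline is shown. Note that when reading from the device, dd's input offset is given with skip= (seek= is the output-side offset, used when writing on the receiving end).

```python
# Sketch only: plan a resumable, block-by-block copy of a read-only device.
# All names here are hypothetical; the rate passed to pv is an arbitrary example.
GIB = 1 << 30
MIB = 1 << 20


def block_plan(device_bytes, block_bytes=GIB):
    """Yield (index, offset, length) for each block; the last one may be short."""
    index = 0
    offset = 0
    while offset < device_bytes:
        length = min(block_bytes, device_bytes - offset)
        yield index, offset, length
        index += 1
        offset += length


def sender_command(index, device="/dev/nbd0", rate="5M", block_bytes=GIB):
    """Build the shell pipeline that sends one block.

    dd reads in 1 MiB units and skip= is the input offset in those units.
    pv -L throttles the stream, and lz4 runs under nice as requested.
    """
    units = block_bytes // MIB
    return (
        f"dd if={device} bs=1M skip={index * units} count={units} status=none"
        f" | pv -q -L {rate}"
        f" | nice lz4 -c"
    )


def resume_index(bytes_done, block_bytes=GIB):
    """After an interruption, restart from the last complete block boundary."""
    return bytes_done // block_bytes
```

Each block would then be piped to the receiving side (e.g., with nc over an SSH tunnel) and written at the matching output offset; a failed block can be resent on its own without restarting from scratch.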

zack raised the priority of this task from Normal to High. Mar 5 2016, 10:38 AM

zack lowered the priority of this task from High to Normal. Mar 5 2016, 10:41 AM

zack mentioned this in T343: retrieve gitorious repositories from the gitorious valhalla.

zack created subtask T343: retrieve gitorious repositories from the gitorious valhalla.

zack changed the status of subtask T343: retrieve gitorious repositories from the gitorious valhalla from Open to Work in Progress. Mar 9 2016, 10:27 AM

zack removed projects: Developers, Staff. Mar 10 2016, 5:51 PM

zack closed subtask T343: retrieve gitorious repositories from the gitorious valhalla as Resolved. Mar 29 2016, 12:05 PM

zack created subtask T360: create gid 5000 and add swhworker to it (to ingest gitorious repos). Apr 1 2016, 12:02 PM

olasd closed subtask T360: create gid 5000 and add swhworker to it (to ingest gitorious repos) as Resolved. May 12 2016, 2:44 PM

Here is all the information I have about the on-disk gitorious layout (credit: astrid):

Can you tell me more about the file layout/organization?
(disclaimer: I've never looked into how Gitorious, the software
platform, stores Git repositories) Are the hardlinks just the result
of asynchronous deduplication (e.g., with tools like fdupes) run on
a bunch of bare Git repositories, or is it more complex than that
(e.g., a huge, global Git loose object store)?

It's in fact rather simple, but with some wrinkles.

Each repository is stored as a bare git repository (as is created by
'git clone --bare'), so it can be worked with directly. It is my
understanding that gitorious used to run 'git gc' on a rolling
schedule, but I'm not sure how recently that has been done. I
certainly haven't.

When a user clicks 'clone', a full clone is made with git-clone; they
are on the same filesystem so git automatically uses hardlinks to
avoid copying objects unnecessarily. If the original repository is
named e.g. '/gitorious/mainline.git', and the user who clicks "clone"
is named 'zopa', then the cloned repository is named
'/gitorious/zopas-mainline.git'.

Each user has a wiki, which is named as
'/username/username-gitorious-wiki.git'. It seems that wikis were
created for all users regardless of whether they ever used them, so
there are many empty wiki repositories.

Originally, every repository was named with hashed names, such as
'/93f/8ba/205e4107d3822f26332a5c42cbd55f39ce.git'

When they were preparing to send me the data, the gitorious folks
started to rename them to the canonical names as I explained above.
However, because they created one directory for each user, they ran
into the maximum hardlinks that you can make in ext4. So about half
of them got renamed and half of them are still in hashed form. They
gave me a list of all the hashed-name mappings, in
'/home/astrid/mapping.txt.gz'.

Because this was a complete mess, I created a directory of symlinks
outside the image with the canonical names, pointing into it:

lrwxrwxrwx 1 root root 50 Jun 30 2015 /srv/gitorious/repositories/gitorious:mainline.git -> /mnt/gitorious/repositories/gitorious/mainline.git
lrwxrwxrwx 1 root root 74 Jun 30 2015 /srv/gitorious/repositories/zzn:zzn.git -> /mnt/gitorious/repositories/93f/8ba/205e4107d3822f26332a5c42cbd55f39ce.git

I have the image mounted on '/mnt/gitorious'. So to return the data
for 'zzn/zzn.git', the webserver transforms the '/' into a ':' and
serves the request with '/srv/gitorious' as the http root directory,
following symlinks.
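The naming rules described above can be summarized in a few lines. This is a sketch inferred from the examples in the message (the function names are hypothetical, and the clone rule is generalized from the single 'zopas-mainline.git' example), not code taken from Gitorious or from the valhalla web server.

```python
import posixpath

# Directory of symlinks with canonical names, as shown in the listing above.
SYMLINK_ROOT = "/srv/gitorious/repositories"


def symlink_for(repo_path):
    """Map 'zzn/zzn.git' to its symlink by turning '/' into ':'."""
    name = repo_path.strip("/").replace("/", ":")
    return posixpath.join(SYMLINK_ROOT, name)


def clone_name(user, repo_name):
    """Name of a user's clone, assuming the '<user>s-<repo>' pattern holds."""
    return f"{user}s-{repo_name}"


def wiki_path(username):
    """Per-user wiki repository, created whether or not it was ever used."""
    return f"/{username}/{username}-gitorious-wiki.git"
```

For repositories still stored under a hashed name, the symlink target is the hashed path, so resolving the symlink (or looking the name up in the mapping file) is still needed to find the on-disk location.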

I've collapsed the two mappings into a single file: /srv/softwareheritage/mirrors/gitorious.org/full_mapping.txt

I'm now running a git fsck on all the repositories. Output and results in worker01:/tmp/fsck.
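For reference, a fsck pass over many bare repositories can be driven along these lines. This is a minimal sketch, not the command actually run on worker01; the function names are hypothetical and the list of repository directories is assumed to come from the mapping file.

```python
import subprocess


def fsck_command(repo_dir):
    """Command line that checks one bare repository in place."""
    return ["git", f"--git-dir={repo_dir}", "fsck"]


def fsck_all(repo_dirs, run=subprocess.run):
    """Run git fsck on each repository and collect (return code, output).

    Output is decoded with errors='replace' because corrupted repositories
    may produce bytes that are not valid UTF-8.
    """
    results = {}
    for repo in repo_dirs:
        proc = run(
            fsck_command(repo),
            capture_output=True,
            text=True,
            errors="replace",
        )
        results[repo] = (proc.returncode, proc.stdout + proc.stderr)
    return results


def failing(results):
    """Repositories for which git fsck exited with a non-zero status."""
    return [repo for repo, (code, _) in results.items() if code != 0]
```

The run parameter is only there so that the loop can be exercised without touching real repositories.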

olasd changed the visibility from "All Users" to "Public (No Login Required)". May 13 2016, 5:08 PM

rdicosmo added a parent task: Unknown Object (Maniphest Task). May 25 2016, 4:05 PM

The full mapping of gitorious repositories URLs to on-disk location is at uffizi:/srv/storage/space/mirrors/gitorious.org/full_mapping.txt

ardumont created subtask T674: Gitorious import: Examine ingestion logs for errors and list them if any. Feb 10 2017, 12:42 PM

ardumont mentioned this in rDSNIP46e1a7fa7d8f: Add load_gitorious.py script to send repositories for ingestion. Feb 10 2017, 3:16 PM

ardumont mentioned this in rSPPROFe458f7b405c4: swh::deploy::worker::swh_loader_git_disk: Add disk injection worker. Feb 10 2017, 3:52 PM

ardumont mentioned this in rSPSITE1c4e5337c34e: data/defaults: Add swh-loader-git-disk instance.

ardumont mentioned this in rSPSITE1c7d84449f1f: data/hostname/swh-workers: Deploy swh-loader-git-disk worker. Feb 10 2017, 4:07 PM

ardumont mentioned this in rSPPROFd8d3fc32b656: worker::swh_loader_git_disk: Don't consume from swh_loader_git queue. Feb 10 2017, 4:12 PM

start-date: Fri Feb 10 16:40:00 UTC 2017

ardumont changed the task status from Open to Work in Progress. Feb 10 2017, 8:15 PM

ardumont claimed this task.

Command to trigger the messages (from worker01):

cat /srv/storage/space/mirrors/gitorious.org/full_mapping.txt | SWH_WORKER_INSTANCE=swh_loader_git_disk ./load_gitorious.py --root-repositories /srv/storage/space/mirrors/gitorious.org/mnt/repositories

(The script defaults to use the right queue 'swh_loader_git_express' and the right origin-date 'Wed, 30 Mar 2016 09:40:04 +0200')

source: load_gitorious.py
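To give an idea of what the command above feeds into, here is a rough sketch of turning mapping lines into loading tasks. It is not the actual load_gitorious.py: the mapping line format (origin URL, whitespace, path relative to the repositories root) and the task field names are assumptions made for illustration only.

```python
import posixpath
from email.utils import parsedate_to_datetime

# Default origin date quoted in the comment above.
ORIGIN_DATE = "Wed, 30 Mar 2016 09:40:04 +0200"


def tasks_from_mapping(lines, root, origin_date=ORIGIN_DATE):
    """Yield one task description per non-empty mapping line.

    Assumed (hypothetical) line format: '<origin url> <relative path>'.
    """
    date = parsedate_to_datetime(origin_date).isoformat()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        origin_url, relative_path = line.split(None, 1)
        yield {
            "origin_url": origin_url,
            "directory": posixpath.join(root, relative_path),
            "date": date,
        }
```

In the real setup each task would be sent as a message to the 'swh_loader_git_express' queue; that part is left out here because it depends on the worker configuration.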

zack added a project: Restricted Project. Feb 12 2017, 6:13 PM

zack renamed this task from ingest gitorious repositories to ingest Gitorious repositories. Feb 12 2017, 6:37 PM

zack moved this task from Restricted Project Column to Restricted Project Column on the Restricted Project board.

zack removed a subtask: T674: Gitorious import: Examine ingestion logs for errors and list them if any. Feb 15 2017, 4:13 PM

zack added a parent task: T674: Gitorious import: Examine ingestion logs for errors and list them if any.

ardumont mentioned this in rDLDG1e8df3b1b1fc: loader: Fix fetch_date override. Feb 15 2017, 6:44 PM

Visit dates have been fixed for the origins already injected.

zack added a project: Archive content. Apr 7 2017, 11:00 AM

Update on this.

The initial import finished around the 4th of April 2017.

After analysis, there were:

missing repositories: 11.9% (14.3k out of 120.3k). They were not logged as errors. Either I missed them initially (Occam's razor and everything, but that does not feel right to me...), or this is the db issue we had (fixed by @olasd): loaders unable to connect to the db, and thus no logs.
repositories in error: 3.5% of the 106k that were attempted (102.3k loaded, so about 3.7k failed), mostly due to the same issue referenced in the googlecode svn ingestion task.

All those have been rescheduled since 20th of April 2017 (well in multiple steps...).
They are currently being consumed.

ardumont mentioned this in T673: ingest Google Code Git repositories. Apr 26 2017, 11:08 AM

As of now, ingestion, after multiple (re)schedulings, has been done.

116188 / 120381 have been ingested with full visits.

This gives ~3.48% of errors.

Those errors need to be analyzed (T674).
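For the record, the error rate above follows directly from the two counts. The helper name is arbitrary.

```python
def error_rate(ingested, total):
    """Return (number of failures, failure percentage) for an ingestion run."""
    failed = total - ingested
    return failed, 100 * failed / total


failed, percent = error_rate(ingested=116188, total=120381)
print(failed, round(percent, 2))  # prints: 4193 3.48
```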

ardumont renamed this task from ingest Gitorious repositories to Gitorious import: ingest repositories. Oct 3 2017, 10:13 AM

ardumont removed a parent task: T674: Gitorious import: Examine ingestion logs for errors and list them if any.

ardumont added a subtask: T674: Gitorious import: Examine ingestion logs for errors and list them if any.

ardumont closed this task as Resolved. Apr 12 2018, 2:04 PM

ardumont closed subtask T674: Gitorious import: Examine ingestion logs for errors and list them if any as Resolved.

ardumont mentioned this in rSPSITEe458f7b405c4: swh::deploy::worker::swh_loader_git_disk: Add disk injection worker. Jun 15 2018, 2:29 PM