Gitorious import: ingest repositories
Closed, Resolved, Public

Description

The list of all old gitorious repositories as well as their actual content is now available as the gitorious valhalla, maintained by Archiveteam. We should inject all those repositories into Software Heritage.

It's not a lot of content (~120K Git repositories).

Here is what they say to people interested in mirroring:

Please don't try to mirror the contents of this web server. It's 5 terabytes (after deduplication!) and the storage is slow at the moment. If you'd like to copy the data out, please email first and we can arrange something better for everyone.

The contact email address is: gitorious-%25@xrtc.net

Event Timeline

zack renamed this task from ingest archived gitorious repositories to ingest gitorious repositories.Feb 22 2016, 12:37 PM
zack added a project: Origin-Gitorious.

Here is the complete list of URLs that can be used to "git clone" (via HTTPS) all the repositories available from the Gitorious valhalla:

.

(FWIW I'm not suggesting to start using them, we should first try to contact them and see if there are better options. But this remains a viable plan B.)

We are now all set to start (after having automated it properly…) the transfer of Gitorious stuff to SWH.

Below is the last exchange with the Gitorious valhalla people, with the needed technical details.

On Fri, Mar 04, 2016 at 11:25:10AM +0100, Stefano Zacchiroli wrote:

[snip] how to go forward with the Gitorious transfer to Software Heritage. Here is a summary of the open issues to discuss before proceeding:

  1. is sending a physical drive back and forth (paid by us) an option?

It's far too much hassle, IP transit is much simpler.

  2. failing that, is paying your bandwidth an option? any idea of how much will it be?

The weird thing is I'm not sure ... the datacenter says that they bill us 95th percentile for it but so far as I know we've never gotten a bill for bandwidth. So don't crank it up too much and it should not be a problem?

  3. failing that, we'll do the month-long transfer. In which case we still need to discuss the following:

a) who will do the traffic shaping? we can do it locally on your machine using something as simple as pv. Would that be OK with you?

That sounds good to me.

b) to avoid interruptions that would force restarting from scratch we propose to split the device in 1GB blocks (with "dd seek=...") and transfer each of them separately (e.g., with nc over an SSH tunnel)

Sure, if you'd like. That sounds like a thing which is best automated on your end. :) If you are able to share, I'm interested in seeing what you come up with.

When I transferred this fs to valhalla, I had to restart the transfer only once. I told dd (sending and receiving both) to seek back to the most recent 10 GB boundary, and it worked just fine.

c) for compression we propose to use lz4. Can you install liblz4-tool on your machine?

Done. Please run it under 'nice'.

d) even with the above precaution, transferring a mounted FS with dd is pretty scary (as there are changes that might happen even with read-only mounted FS). Do you have the option of creating some block-level snapshot, e.g., with LVM?

The filesystem is very much not going to change. In case you're worried still:

  1. It is mounted read-only.
  2. It is served by a network block device server that has been configured to not accept writes.
  3. The fs itself is an ext4 image inside another filesystem, image is chmod 0444, and the outer filesystem mounted read-only.
  4. And the LVM logical volume that contains this all is set to read-only.
  5. As for durability, the volume is replicated with RAID1 (mirror) by LVM. No other data is presently stored on that volume group.

If you're ready to start copying, go ahead. You should have permission to read /dev/nbd0 already.
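
A minimal sketch, in Python, of the block-wise transfer plan discussed in points a) to c) above. The 1 GB block size, /dev/nbd0, pv, lz4 and nice come from the exchange; the function names, the rate value and the exact pipeline are assumptions for illustration, not the script that was actually used. Note that the sending side reads with dd's skip= (seek= positions the output, so it belongs on the receiving side).

```python
from typing import Iterator, NamedTuple

MIB = 1024 ** 2
GIB = 1024 ** 3


class Block(NamedTuple):
    index: int   # position of the block, counted in blocks
    offset: int  # byte offset of the block on the device
    length: int  # bytes in the block (the last one may be shorter)


def plan_blocks(device_size: int, block_size: int = GIB) -> Iterator[Block]:
    """Split a device of device_size bytes into fixed-size blocks."""
    for index, offset in enumerate(range(0, device_size, block_size)):
        yield Block(index, offset, min(block_size, device_size - offset))


def resume_index(bytes_received: int, block_size: int = GIB) -> int:
    """Index of the block to restart from: the most recent block boundary."""
    return bytes_received // block_size


def sender_pipeline(block: Block, device: str = "/dev/nbd0",
                    rate_bytes_per_s: int = 10 * MIB) -> str:
    """Shell pipeline that reads one block, compresses it and rate-limits it.

    Assumes block offsets are multiples of 1 MiB. The rate is a made-up
    default; the output would still have to be piped into the transport
    (e.g. nc over an SSH tunnel), which is not shown here.
    """
    skip = block.offset // MIB
    count = -(-block.length // MIB)  # ceiling division, for a short last block
    return (f"nice dd if={device} bs=1M skip={skip} count={count} "
            f"| nice lz4 -c | pv -q -L {rate_bytes_per_s}")
```

Because every block is addressed by its own offset, an interrupted transfer only has to redo the block that was in flight, which is the same idea as seeking back to the most recent boundary.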

zack raised the priority of this task from Normal to High.Mar 5 2016, 10:38 AM

Here is all the information I have about the on-disk gitorious layout (credit: astrid):

Can you tell me more about the file layout/organization?
(disclaimer: I've never looked into how Gitorious, the software
platform, stores Git repositories) Are the hardlinks just the result
of asynchronous deduplication (e.g., with tools like fdupes) run on
a bunch of bare Git repositories, or is it more complex than that
(e.g., a huge, global Git loose object store)?

It's in fact rather simple, but with some wrinkles.

Each repository is stored as a bare git repository (as is created by
'git clone --bare'), so it can be worked with directly. It is my
understanding that gitorious used to run 'git gc' on a rolling
schedule, but I'm not sure how recently that has been done. I
certainly haven't.

When a user clicks 'clone', a full clone is made with git-clone; they
are on the same filesystem so git automatically uses hardlinks to
avoid copying objects unnecessarily. If the original repository is
named e.g. '/gitorious/mainline.git', and the user who clicks "clone"
is named 'zopa', then the cloned repository is named
'/gitorious/zopas-mainline.git'.

Each user has a wiki, which is named as
'/username/username-gitorious-wiki.git'. It seems that wikis were
created for all users regardless of whether they ever used them, so
there are many empty wiki repositories.

Originally, every repository was named with hashed names, such as
'/93f/8ba/205e4107d3822f26332a5c42cbd55f39ce.git'

When they were preparing to send me the data, the gitorious folks
started to rename them to the canonical names as I explained above.
However, because they created one directory for each user, they ran
into the maximum hardlinks that you can make in ext4. So about half
of them got renamed and half of them are still in hashed form. They
gave me a list of all the hashed-name mappings, in
'/home/astrid/mapping.txt.gz'.

Because this was a complete mess, I created a directory of symlinks
outside the image with the canonical names, pointing into it:

lrwxrwxrwx 1 root root 50 Jun 30 2015 /srv/gitorious/repositories/gitorious:mainline.git -> /mnt/gitorious/repositories/gitorious/mainline.git
lrwxrwxrwx 1 root root 74 Jun 30 2015 /srv/gitorious/repositories/zzn:zzn.git -> /mnt/gitorious/repositories/93f/8ba/205e4107d3822f26332a5c42cbd55f39ce.git

I have the image mounted on '/mnt/gitorious'. So to return the data
for 'zzn/zzn.git', the webserver transforms the '/' into a ':' and
serves the request with '/srv/gitorious' as the http root directory,
following symlinks.
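
The naming rules described in this message can be sketched in Python as follows. The symlink root and the '/' to ':' transformation are taken from the listing above; the function names are hypothetical, and the pattern for hashed names (3 + 3 + 34 hexadecimal characters) is an assumption inferred from the single example shown.

```python
import re

# Directory of canonical-name symlinks, as shown in the listing above.
SYMLINK_ROOT = "/srv/gitorious/repositories"

# Assumed shape of a hashed repository path, inferred from one example.
HASHED_RE = re.compile(r"^/?[0-9a-f]{3}/[0-9a-f]{3}/[0-9a-f]{34}\.git$")


def symlink_path(repo: str, root: str = SYMLINK_ROOT) -> str:
    """Map 'project/name.git' to its symlink, replacing '/' with ':'."""
    project, sep, name = repo.strip("/").partition("/")
    if not sep or not name or "/" in name:
        raise ValueError(f"expected 'project/name.git', got {repo!r}")
    return f"{root}/{project}:{name}"


def is_hashed(path: str) -> bool:
    """True if the on-disk path still uses the original hashed form."""
    return HASHED_RE.match(path) is not None
```

For example, symlink_path("zzn/zzn.git") gives "/srv/gitorious/repositories/zzn:zzn.git", matching the second symlink in the listing.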

I've collapsed the two mappings into a single file: /srv/softwareheritage/mirrors/gitorious.org/full_mapping.txt

I'm now running a git fsck on all the repositories. Output and results in worker01:/tmp/fsck.
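
For reference, a rough Python sketch of what a per-repository fsck run could look like. The actual invocation used on worker01 is not recorded here, so the function names and the log naming scheme are assumptions; only `git --git-dir=<path> fsck` itself is standard Git.

```python
import subprocess
from pathlib import Path
from typing import List


def fsck_command(repo: Path) -> List[str]:
    """Command line checking one bare repository."""
    return ["git", f"--git-dir={repo}", "fsck"]


def log_name(repo: Path) -> str:
    """Flat, collision-free log file name derived from the full repo path."""
    return str(repo).strip("/").replace("/", ":") + ".log"


def fsck_repo(repo: Path, out_dir: Path) -> int:
    """Run git fsck on repo, store its output in out_dir, return the exit code."""
    with (out_dir / log_name(repo)).open("wb") as log:
        proc = subprocess.run(fsck_command(repo), stdout=log,
                              stderr=subprocess.STDOUT)
    return proc.returncode
```

A non-zero return code from fsck_repo flags a repository whose log is worth inspecting before loading.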

olasd changed the visibility from "All Users" to "Public (No Login Required)".May 13 2016, 5:08 PM
rdicosmo added a parent task: Unknown Object (Maniphest Task).May 25 2016, 4:05 PM

The full mapping of gitorious repositories URLs to on-disk location is at uffizi:/srv/storage/space/mirrors/gitorious.org/full_mapping.txt

start-date: Fri Feb 10 16:40:00 UTC 2017

ardumont changed the task status from Open to Work in Progress.Feb 10 2017, 8:15 PM
ardumont claimed this task.

Command to trigger the messages (from worker01):

cat /srv/storage/space/mirrors/gitorious.org/full_mapping.txt | SWH_WORKER_INSTANCE=swh_loader_git_disk ./load_gitorious.py --root-repositories /srv/storage/space/mirrors/gitorious.org/mnt/repositories

(The script defaults to use the right queue 'swh_loader_git_express' and the right origin-date 'Wed, 30 Mar 2016 09:40:04 +0200')

source: load_gitorious.py

zack added a project: Restricted Project.Feb 12 2017, 6:13 PM
zack renamed this task from ingest gitorious repositories to ingest Gitorious repositories.Feb 12 2017, 6:37 PM
zack moved this task from Restricted Project Column to Restricted Project Column on the Restricted Project board.

Visit dates have been fixed for the origins already injected.

Update on this.

The initial import finished around 4 April 2017.

After analysis, there were:

  • missing repositories: 11.9% (14.3k out of 120.3k). They were not logged as errors. Either I missed them initially (Occam's razor and everything, but that does not feel right to me...), or this is the db issue we had (which @olasd fixed): loaders were unable to connect to the db, hence no log...
  • repositories in error: 3.5% of the 106k actually loaded (102.3k out of 106k succeeded), mostly due to the same issue referenced in the googlecode svn ingestion task.

All those have been rescheduled since 20th of April 2017 (well in multiple steps...).
They are currently being consumed.

Ingestion is now done, after multiple (re)schedulings.

116188 / 120381 have been ingested with full visits.

This gives ~3.48% of errors.

Those errors need to be analyzed (T674).
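
As a quick check of the final figures, the error rate can be recomputed from the two counts reported above. This is only an arithmetic sketch; the helper name is made up.

```python
def percent(part: float, total: float) -> float:
    """Share of part in total, as a percentage rounded to two decimals."""
    return round(100 * part / total, 2)


total = 120381     # repositories in the mapping
ingested = 116188  # repositories ingested with full visits
failed = total - ingested

print(failed, percent(failed, total))  # 4193 3.48
```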