retrieve code.google.com repositories
Closed, Migrated; Edits Locked

Assigned To

Authored By

	zack
	Apr 9 2016, 8:49 AM

Description

We already have URL indexes for all of them, obtained in collaboration with the relevant Google team.
The indexes are on uffizi:/srv/softwareheritage/mirrors/code.google.com/indexes

Related Objects

Status	Assigned	Task
		Unknown Object (Maniphest Task)
Migrated	gitlab-migration	T367 ingest Google Code repositories
Migrated	gitlab-migration	T368 retrieve code.google.com repositories

Event Timeline

zack created this task. Apr 9 2016, 8:49 AM

Attention: as of today, we have a bit less than two months left before Google erases *all* the original VCS repositories from Google Code. After that date, only the archived versions will remain, and they may be incorrect.
So we have a bit less than two months to report bugs to them.

rdicosmo created subtask Unknown Object (Maniphest Task). Apr 9 2016, 8:26 PM

rdicosmo removed a subtask: Unknown Object (Maniphest Task). Apr 9 2016, 8:46 PM

zack assigned this task to ardumont. Apr 11 2016, 10:32 AM

repository: https://forge.softwareheritage.org/diffusion/61/

worker01 is now fetching and checking the source archives from the Google archive.

As of Tue Apr 12 15:58:05 CEST 2016, 1,379,243 archives to fetch.

Details for one job:

parse a gs:// URL and transform it according to the README's rule (uffizi:/srv/softwareheritage/mirrors/code.google.com/indexes/README)
derive the URL of the file's metadata (mediaLink, length, crc32c, md5Hash, etc.) and retrieve it
write that metadata file to disk (same location as the zip to retrieve, with the suffix .json)
derive the URL of the project.json metadata file, retrieve it, and store it on disk
retrieve the actual content from the mediaLink entry (exactly the URL given there)
write that content to disk
check that the content file's md5 and length match those described in the file metadata
if they do not match, flag the file as corrupted by renaming it with the suffix .corrupted
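
The last two steps (check, then flag) could look roughly like the sketch below. It assumes that the md5Hash value in the metadata file is the base64-encoded MD5 digest, as in the Google Cloud Storage JSON API. The function names and the way the expected values are passed in are illustrative and not taken from the actual fetcher.

```python
import base64
import hashlib
import os


def md5_base64(path, chunk_size=1 << 20):
    """Return the base64-encoded MD5 digest of the file at `path`."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Read in chunks so large archives do not need to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return base64.b64encode(digest.digest()).decode("ascii")


def check_archive(path, expected_md5_b64, expected_length):
    """Check a fetched archive against its metadata (hypothetical helper).

    Returns `path` unchanged when both length and md5 match. Otherwise the
    file is renamed with the suffix .corrupted and the new path is returned.
    """
    matches = (
        os.path.getsize(path) == expected_length
        and md5_base64(path) == expected_md5_b64
    )
    if matches:
        return path
    corrupted = path + ".corrupted"
    os.rename(path, corrupted)
    return corrupted
```

Checking the length first is cheap and avoids hashing a file that is obviously truncated.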

worker01 is done.

Uffizi's disk state regarding those source archives (including project.json):

ardumont@uffizi:/srv/storage/space/mirrors/code.google.com/sources/v2$ date; du -sh *; date
Sun May  1 14:08:23 CEST 2016
9.0G    apache-extras.org
44T     code.google.com
82G     eclipselabs.org
Sun May  1 17:07:31 CEST 2016

In T368#5716, @ardumont wrote:

worker01 is done.

Great! (So can this task be closed?)

But then, what's happening to the swh_fetcher_googlecode_archive queue on RabbitMQ? It has been filled up again and now holds 1.2M jobs, with one consumer working on it.

It's a second pass.

I need to check for possible failures tomorrow (I did not have time today).
So instead of leaving the worker idle, I let it work some more.

After checks, there are:

342 files in error (problem at fetch time)
158 corrupted files (length or md5 checksum mismatch)

I will reschedule them in the queue and keep any error messages so they can be sent back to our Google contacts.
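
As a rough illustration of how the corrupted files could be collected for rescheduling, the sketch below walks a directory tree and lists every file carrying the .corrupted suffix, together with the archive path it was renamed from. The helper name and the idea of scanning the disk are assumptions; the thread does not say how the list was actually built.

```python
import os


def find_corrupted(root, suffix=".corrupted"):
    """Yield (flagged_path, original_path) for each flagged file under `root`.

    `original_path` is the flagged path with the suffix stripped, i.e. the
    location where a re-fetched archive would be written.
    """
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            if name.endswith(suffix):
                flagged = os.path.join(dirpath, name)
                yield flagged, flagged[: -len(suffix)]
```

Each original path can then be mapped back to its job and pushed onto the queue again.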

Rescheduled and no more errors now.

ardumont closed this task as Resolved. May 3 2016, 2:10 PM

olasd changed the visibility from "All Users" to "Public (No Login Required)". May 13 2016, 5:09 PM

This task has been migrated to GitLab.
