
retrieve code.google.com repositories
Closed, Resolved, Public

Description

We already have URL indexes for all of them, obtained in collaboration with the relevant Google team.
The indexes are on uffizi:/srv/softwareheritage/mirrors/code.google.com/indexes

Related Objects

Status    Assigned  Task
Open      None
Resolved  ardumont

Event Timeline

zack created this task. Apr 9 2016, 8:49 AM

Attention: as of today, we have a bit less than two months left before Google erases *all* the original VCS repositories from Google Code. After that date, only the archived versions will remain, and those may be incorrect.
So we have a bit less than two months to report bugs to them.

rdicosmo created subtask Unknown Object (Maniphest Task). Apr 9 2016, 8:26 PM
rdicosmo removed a subtask: Unknown Object (Maniphest Task). Apr 9 2016, 8:46 PM
zack assigned this task to ardumont. Apr 11 2016, 10:32 AM
ardumont changed the task status from Open to Work in Progress. Apr 12 2016, 1:37 PM
ardumont added a comment. Edited Apr 12 2016, 4:01 PM

worker01 is now fetching and checking the source archives from the Google archive.

As of Tue Apr 12 15:58:05 CEST 2016, there are 1,379,243 archives to fetch.

Details for one job:

  • parses a gs:// URL and transforms it according to the README's rule (uffizi:/srv/softwareheritage/mirrors/code.google.com/indexes/README)
  • derives the URL of the file's metadata (mediaLink, length, crc32c, md5Hash, etc.)
  • writes that metadata file to disk (same location as the zip to retrieve, with suffix .json)
  • derives the URL of the project.json metadata file, retrieves it, and stores it on disk
  • derives the URL of the actual content from the mediaLink entry (used exactly as given)
  • writes that content to disk
  • checks that the content file's properties (md5, length) match those described in the file metadata
  • flags the file as corrupted if they do not (by renaming it with suffix .corrupted).
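
The last two steps (check, then flag) can be sketched as follows. This is only an illustration, not the fetcher's actual code: the function name `verify_archive` and its parameters are hypothetical, and it assumes the `md5Hash` value in the metadata is the base64-encoded MD5 digest, as in the Google Cloud Storage JSON API.

```python
import base64
import hashlib
import os


def verify_archive(path, expected_length, expected_md5_b64):
    """Check a fetched file against its expected length and MD5.

    Returns True when both match. Otherwise renames the file with a
    `.corrupted` suffix and returns False.
    Assumption: `expected_md5_b64` is a base64-encoded MD5 digest.
    """
    md5 = hashlib.md5()
    length = 0
    # Read in chunks so large archives are never loaded fully in memory.
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            md5.update(chunk)
            length += len(chunk)
    actual_md5_b64 = base64.b64encode(md5.digest()).decode("ascii")
    if length == int(expected_length) and actual_md5_b64 == expected_md5_b64:
        return True
    # Flag the file as corrupted by renaming it.
    os.rename(path, path + ".corrupted")
    return False
```

Renaming rather than deleting keeps the bad payload around, which helps when reporting problems upstream.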

worker01 is done.

Uffizi's disk state regarding those source archives (including project.json):

ardumont@uffizi:/srv/storage/space/mirrors/code.google.com/sources/v2$ date; du -sh *; date
Sun May  1 14:08:23 CEST 2016
9.0G    apache-extras.org
44T     code.google.com
82G     eclipselabs.org
Sun May  1 17:07:31 CEST 2016
zack added a comment. Edited May 1 2016, 7:18 PM
In T368#5716, @ardumont wrote:

worker01 is done.

Great! (So can this task be closed?)

But then, what's happening with the swh_fetcher_googlecode_archive queue on RabbitMQ? It has been filled up again and now holds up to 1.2M jobs, with one consumer working on it.

It's a second pass.

I need to check for possible failures tomorrow (I did not have time to today).
So instead of leaving it idle, I let it work some more.

After checks, there are:

  • 342 files in error (problem during fetch time)
  • 158 corrupted files (length or md5 checksum mismatch)

I will reschedule them in the queue and keep any error messages so we can send them back to our Google contacts.

Rescheduled and no more errors now.
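
Since corrupted files are flagged by a `.corrupted` suffix, the set to reschedule can be collected by walking the mirror directory. A minimal sketch, assuming that convention; the helper name `find_corrupted` is hypothetical and the actual rescheduling into the queue is not shown:

```python
import os


def find_corrupted(root):
    """Yield the original paths of files flagged with a `.corrupted` suffix."""
    suffix = ".corrupted"
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(suffix):
                # Strip the suffix to recover the path the fetcher writes to.
                yield os.path.join(dirpath, name[: -len(suffix)])
```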

ardumont closed this task as Resolved. May 3 2016, 2:10 PM
olasd changed the visibility from "All Users" to "Public (No Login Required)". May 13 2016, 5:09 PM