We already have URL indexes for all of them, obtained thanks to collaboration with the relevant Google team.
The indexes are on uffizi:/srv/softwareheritage/mirrors/

Attention: as of today, we have a bit less than two months left before Google erases *all* the original VCS repositories from Google Code. After that date, only the archived versions will remain, and they may be incorrect.
So we have only a bit less than two months to report bugs to them.

worker01 is now fetching and checking the source archives from the Google archive.

As of Tue Apr 12 15:58:05 CEST 2016, 1.379.243 archives to fetch.

Details for one job:

  • parse a gs:// url and transform it according to the README's rule (uffizi:/srv/softwareheritage/mirrors/
  • derive the url of the file's metadata (mediaLink, length, crc32c, md5Hash, etc.)
  • write that metadata file to disk (same location as the zip to retrieve, with suffix .json)
  • derive the url of the project.json metadata file, retrieve it and store it on disk
  • fetch the actual content from the mediaLink entry (using exactly the url given there)
  • write that content to disk
  • check that the content file's md5 and length match the ones described in the file metadata
  • if they do not, flag the file as corrupted (by renaming it with suffix .corrupted).
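
For illustration, the last two steps (check, then flag) could look like the minimal sketch below. It is not the actual fetcher code. It assumes the .json metadata file uses the Google Cloud Storage object fields `size` (decimal string) and `md5Hash` (base64-encoded MD5 digest); the function name `check_archive` is hypothetical.

```python
import base64
import hashlib
import json
import os


def check_archive(archive_path, metadata_path):
    """Return True if the archive matches its metadata.

    Otherwise rename the archive with the suffix .corrupted and return False.
    Assumption: the metadata has 'size' and a base64-encoded 'md5Hash'.
    """
    with open(metadata_path) as f:
        meta = json.load(f)

    expected_length = int(meta["size"])
    expected_md5 = base64.b64decode(meta["md5Hash"]).hex()

    # Read in chunks so that large archives are not loaded into memory at once.
    md5 = hashlib.md5()
    length = 0
    with open(archive_path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            md5.update(chunk)
            length += len(chunk)

    if length == expected_length and md5.hexdigest() == expected_md5:
        return True

    os.rename(archive_path, archive_path + ".corrupted")
    return False
```

Checking the length as well as the md5 is cheap and makes truncated downloads easy to tell apart from other kinds of corruption when reporting them.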

worker01 is done.

Uffizi's disk state regarding those source archives (including project.json):

ardumont@uffizi:/srv/storage/space/mirrors/$ date; du -sh *; date
Sun May  1 14:08:23 CEST 2016
Sun May  1 17:07:31 CEST 2016
In T368#5716, @ardumont wrote:

worker01 is done.

Great! (So can this task be closed?)

But then, what's happening to the swh_fetcher_googlecode_archive queue on rabbitmq? It has been filled up again, and it's up to 1.2M jobs in there, with one consumer working on it.

It's a second round-trip.

I need to check for possible failures tomorrow (today, I did not have time to).
So instead of letting it do nothing, I let it work some more.

After checks, there are:

  • 342 files in error (problem during fetch time)
  • 158 corrupted files (length or md5 checksum mismatch)

I will reschedule them in the queue and keep any error messages so we can send them back to our Google contacts.

Rescheduled and no more errors now.

