We have URL indexes for all of them already, retrieved thanks to collaboration with the relevant Google team.
The indexes are on uffizi:/srv/softwareheritage/mirrors/code.google.com/indexes
Description
Status | Assigned | Task
---|---|---
Migrated | gitlab-migration | T367 ingest Google Code repositories
Migrated | gitlab-migration | T368 retrieve code.google.com repositories
Event Timeline
Attention: as of today, we have a bit less than two months left before Google erases *all* the original VCS content from Google Code. After that date, only the archived version will remain, which may be incorrect.
So we have a bit less than two months left to report bugs back to them.
worker01 is now fetching and checking the source archives from the Google Code Archive.
As of Tue Apr 12 15:58:05 CEST 2016, there are 1,379,243 archives to fetch.
Details for one job (a sketch follows this list):
- parse a gs:// url and transform it according to the README's rule (uffizi:/srv/softwareheritage/mirrors/code.google.com/indexes/README)
- derive the file's metadata url (mediaLink, length, crc32c, md5Hash, etc.)
- write that metadata file to disk (same location as the zip to retrieve, with a .json suffix)
- derive the project.json metadata file's url, retrieve it and store it on disk
- retrieve the actual content from the mediaLink entry (i.e. the url given in the metadata)
- write that content to disk
- check that the content file's properties (md5, length) match the ones described in the file metadata
- flag the file as corrupted if they do not (by renaming it with a .corrupted suffix)
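For reference, a minimal Python sketch of what one such job does, assuming the gs:// entries resolve through the public GCS JSON API (the actual url transformation rule is the one in the README above and is not reproduced here). The names gs_to_metadata_url, fetch_one and dest_path are illustrative, the field names follow the GCS object metadata (mediaLink, size, md5Hash, crc32c) and may differ slightly from the worker's own metadata file, and the project.json step is elided:

```python
import base64
import hashlib
import json
import os
from urllib.parse import quote

import requests


def gs_to_metadata_url(gs_url):
    """Map a gs://<bucket>/<object> url to its GCS JSON API metadata url.

    Stand-in for the transformation rule documented in the indexes README.
    """
    bucket, _, obj = gs_url[len('gs://'):].partition('/')
    return ('https://www.googleapis.com/storage/v1/b/%s/o/%s'
            % (bucket, quote(obj, safe='')))


def fetch_one(gs_url, dest_path):
    """Fetch one source archive plus its metadata, then verify it."""
    # retrieve and store the file's metadata (mediaLink, size, md5Hash, crc32c, ...)
    meta = requests.get(gs_to_metadata_url(gs_url)).json()
    with open(dest_path + '.json', 'w') as f:
        json.dump(meta, f)

    # retrieve the actual content through the mediaLink entry
    r = requests.get(meta['mediaLink'], stream=True)
    h = hashlib.md5()
    length = 0
    with open(dest_path, 'wb') as f:
        for chunk in r.iter_content(chunk_size=1024 * 1024):
            h.update(chunk)
            length += len(chunk)
            f.write(chunk)

    # check length and md5 against the metadata (GCS serves md5Hash base64-encoded)
    expected_md5 = base64.b64decode(meta['md5Hash'])
    if length != int(meta['size']) or h.digest() != expected_md5:
        # flag as corrupted by renaming with a .corrupted suffix
        os.rename(dest_path, dest_path + '.corrupted')
```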
worker01 is done.
Uffizi's disk state regarding those source archives (including project.json):
ardumont@uffizi:/srv/storage/space/mirrors/code.google.com/sources/v2$ date; du -sh *; date
Sun May 1 14:08:23 CEST 2016
9.0G    apache-extras.org
44T     code.google.com
82G     eclipselabs.org
Sun May 1 17:07:31 CEST 2016
Great! (So can this task be closed?)
But then, what's happening to the swh_fetcher_googlecode_archive queue on rabbitmq? It has been filled up again, and it's up to 1.2M jobs in there, with one consumer working on it.
It's a second round-trip.
I need to check for possible failures tomorrow (I did not have time today).
So instead of letting the worker sit idle, I let it work some more.
After checking, there are:
- 342 files in error (problems at fetch time)
- 158 corrupted files (length or md5 checksum mismatch)
I will reschedule them in the queue (a rough sketch follows) and keep any error messages to send back to our Google contacts.
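A hedged sketch of what such a rescheduling pass could look like, assuming a celery worker consumes the swh_fetcher_googlecode_archive queue; the broker url, task name and task arguments here are placeholders, the real values live in the worker configuration:

```python
import os

from celery import Celery

# Placeholder broker url and task name; the real values come from the
# swh worker configuration.
app = Celery(broker='amqp://rabbitmq//')
TASK_NAME = 'swh.fetcher.googlecode.tasks.fetch_archive'  # hypothetical

SOURCES = '/srv/storage/space/mirrors/code.google.com/sources/v2'


def reschedule_corrupted(root=SOURCES):
    """Re-enqueue a fetch for every archive flagged as .corrupted."""
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith('.corrupted'):
                continue
            path = os.path.join(dirpath, name)
            os.remove(path)  # drop the bad copy so the worker refetches it
            # assumed task signature: the destination path of the archive
            app.send_task(TASK_NAME,
                          args=[path[:-len('.corrupted')]],
                          queue='swh_fetcher_googlecode_archive')
```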