# Goal
The goal of this repository is to retrieve, from antelink's storage,
the data that swh does not already store, in order to reduce the
ongoing cloud service cost of storing antelink's data.
Out of scope for now:
- The check/deletion of the replicated data from s3 (to happen once
the loading is done).
- The metadata relative to contents, which will be retrieved later.
# Status
There exist:
- an old backup on the machine sesi-pv-lc2.inria.fr (sesi-pv-lc2) in
`/antelink/`
- an `antelink` db on swh's side with one table `content`, whose
columns are `id` (the sha1 of the uncompressed file) and `path` (the
path to the compressed content on the sesi-pv-lc2 machine); see the
sketch below.
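
For illustration, a minimal sketch of streaming that table with
psycopg2; the DSN and batch size are assumptions, not the repository's
actual configuration:

```python
import psycopg2

def iter_contents(dsn="service=antelink", batch_size=10000):
    """Yield (id, path) rows from the antelink 'content' table."""
    with psycopg2.connect(dsn) as db:
        # Named (server-side) cursor so the ~315M rows are streamed
        # in batches instead of being loaded into memory at once.
        with db.cursor(name="content_reader") as cur:
            cur.itersize = batch_size
            cur.execute("select id, path from content")
            for sha1, path in cur:
                yield sha1, path
```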
# Notes
314,899,904 contents are referenced in the table `content` (db
antelink). This is an idealized view of the backup on sesi-pv-lc2 and
no longer matches reality.
But:
- some data were lost by the inria admins back in October/November (a
handling mistake); those are to be considered missing.
- some other data may have been corrupted (their checksums no longer
match); those are also to be considered missing. A check along these
lines is sketched after this list.
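
Since corruption here means "the sha1 of the gunzipped bytes no longer
matches the `id`", a minimal check could look like the sketch below;
the function name and chunk size are illustrative assumptions:

```python
import gzip
import hashlib

def is_intact(path, expected_sha1_hex, chunk_size=1024 * 1024):
    """True if the gunzipped file at path hashes to the expected sha1."""
    h = hashlib.sha1()
    try:
        with gzip.open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                h.update(chunk)
    except (OSError, EOFError):
        # Unreadable or truncated gzip: treat as corrupted, hence missing.
        return False
    return h.hexdigest() == expected_sha1_hex
```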
Also:
- sesi-pv-lc2 is not complete with regard to the s3 storage, so
Guillaume produced an extract via aws-cli (`aws s3 ls`). This extract
is to be used as input in conjunction with the antelink db; a parsing
sketch follows this list.
- The s3 bucket contains gzipped contents.
- sesi-pv-lc2 contains gzipped contents.
- The size reported by `aws s3 ls` is that of the compressed data.
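
Assuming the extract keeps the usual `aws s3 ls` line layout (date,
time, size, key), a hedged parsing sketch; the helper name is
hypothetical:

```python
def iter_s3_listing(listing_path):
    """Yield (key, compressed_size_in_bytes) pairs from a saved extract."""
    with open(listing_path) as f:
        for line in f:
            if not line.strip():
                continue
            # Split on the first three whitespace runs; keys may contain
            # spaces, so the remainder of the line is the key.
            _date, _time, size, key = line.split(None, 3)
            yield key.rstrip("\n"), int(size)
```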
Implementation detail changes in storage:
- A new table `content_large` will be created in swh-storage, with the
same structure as `content` and `content_missing`. It will store
content metadata whenever the size exceeds our current threshold size.
- The antelink contents are to be injected as blobs (a dispatch sketch
follows).
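
A hedged sketch of that dispatch; the threshold value and the storage
method names (`store_blob`, `store_large_metadata`) are assumptions,
not the actual swh-storage API:

```python
THRESHOLD_BYTES = 100 * 1024 * 1024  # assumed threshold size (100 MiB)

def inject(sha1, data, storage):
    """Route a content to blob storage or to the content_large table."""
    if len(data) <= THRESHOLD_BYTES:
        storage.store_blob(sha1, data)  # regular content injection
    else:
        # Too large: record only its metadata in content_large.
        storage.store_large_metadata(sha1, len(data))
```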