retrieve content from s3 and store it in SWH storage
Closed, Migrated, Edits Locked

Assigned To

Authored By

	ardumont
	Feb 9 2016, 12:00 PM

Description

Goal

The goal of this repository is to retrieve from antelink's storage
the data that swh does not already store.

This is done in order to reduce the current cloud service cost of
storing antelink's data.
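
As a rough illustration of the goal, the selection amounts to a set difference on content ids. This is a minimal sketch only: `missing_from_swh` and its arguments are hypothetical names, and a real run over hundreds of millions of ids would presumably query swh-storage in batches rather than hold a set in memory.

```python
from typing import Iterable, Iterator, Set


def missing_from_swh(antelink_ids: Iterable[str], swh_ids: Set[str]) -> Iterator[str]:
    """Yield the antelink content ids (sha1 hex digests) that swh does not store yet.

    Both arguments are assumptions made for this sketch: `antelink_ids` stands
    for the ids known on antelink's side, `swh_ids` for the ids already in swh.
    """
    for sha1 in antelink_ids:
        if sha1 not in swh_ids:
            yield sha1
```

Only the ids yielded here would need to be downloaded from s3.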

Out of scope for now:

Checking and deleting the replicated data from s3 once the loading is done (cf. T309).
Retrieving the metadata associated with the contents, which will be done later (cf. T310).

Status

There exists:

an old backup on the machine sesi-pv-lc2.inria.fr (sesi-pv-lc2) in /antelink/
an antelink db on swh's side with one table 'content' with columns id (sha1 of the uncompressed file) and path (the path to the compressed content on the sesi-pv-lc2 machine).

Notes

314 899 904 contents are referenced in the table 'content' (db
antelink). This is an idealized picture of the backup on
sesi-pv-lc2 and no longer matches reality.

But:

some data were lost by the inria admins back in October/November (mishandling). Those are to be considered missing.
some other data may have been corrupted (checksums may no longer match). Those are to be considered missing.
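
The corruption check can be sketched as follows, given that the id is the sha1 of the uncompressed file while the stored content is gzipped. This is a minimal sketch under assumptions: `sha1_of_gzipped` and `is_missing` are hypothetical helpers, not part of swh, and the whole blob is decompressed in memory, which would not suit very large contents.

```python
import gzip
import hashlib
import zlib
from typing import Optional


def sha1_of_gzipped(data: bytes) -> str:
    """Return the sha1 hex digest of the uncompressed payload of a gzipped blob."""
    return hashlib.sha1(gzip.decompress(data)).hexdigest()


def is_missing(data: Optional[bytes], expected_sha1: str) -> bool:
    """Tell whether a content must be considered missing.

    A content is treated as missing when it is absent (None), when it cannot
    be decompressed, or when its checksum no longer matches the expected id.
    """
    if data is None:
        return True
    try:
        return sha1_of_gzipped(data) != expected_sha1
    except (OSError, EOFError, zlib.error):
        # Not a gzip stream, truncated stream, or corrupted compressed data.
        return True
```

Lost and corrupted contents end up in the same bucket, matching the rule that both are to be considered missing.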

Also:

the sesi-pv-lc2 backup is incomplete with respect to the s3 storage, so Guillaume produced a listing of the bucket with aws-cli s3 ls. This listing is to be used as input together with the antelink db.
the s3 bucket contains gzipped contents.
sesi-pv-lc2 contains gzipped contents.
the size reported by aws-cli s3 ls is the size of the compressed data.
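
Reading that listing could look like the sketch below. It assumes the default `aws s3 ls` line layout (date, time, size in bytes, key) and skips any line that does not match it, such as `PRE` prefix lines. `S3Entry` and `parse_s3_ls` are hypothetical names, and the exact options used for the extract are not stated in this task.

```python
from typing import Iterable, Iterator, NamedTuple


class S3Entry(NamedTuple):
    key: str
    compressed_size: int  # size of the gzipped object, in bytes


def parse_s3_ls(lines: Iterable[str]) -> Iterator[S3Entry]:
    """Parse lines shaped like '<date> <time> <size> <key>' from an s3 listing."""
    for line in lines:
        # Split on whitespace at most 3 times so that keys containing
        # spaces stay in one piece.
        parts = line.split(None, 3)
        if len(parts) != 4 or not parts[2].isdigit():
            continue  # blank line, 'PRE <prefix>/' line, or unexpected layout
        yield S3Entry(key=parts[3].rstrip("\n"), compressed_size=int(parts[2]))
```

Since the recorded size is that of the compressed data, it cannot be compared directly with the size of the uncompressed content.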

Implementation detail change in storage:

A new table content_large will be created in swh-storage, with the
same structure as content and content_missing. This table will be
used to store content metadata when the size is larger than our
current threshold size.
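
The routing rule can be sketched as below. `target_table` is a hypothetical helper, and the threshold is passed in as a parameter because its value is not given in this task. Which size is compared against it (compressed or uncompressed) is also an assumption left to the caller.

```python
def target_table(size: int, threshold: int) -> str:
    """Pick the table that should hold a content's metadata.

    Contents strictly larger than `threshold` bytes go to content_large,
    all others to content.
    """
    return "content_large" if size > threshold else "content"
```

A content exactly at the threshold stays in content, following the "larger than" wording above.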

The antelink contents are to be injected as blobs.

Related Objects

Mentioned Here: T309: Delete duplicated antelink/antepedia content from s3
T310: import antelink metadata

Event Timeline

ardumont created this task. Feb 9 2016, 12:00 PM

Herald added a project: Staff. Feb 9 2016, 12:00 PM

ardumont added projects: Antelink loader, Developers. Feb 9 2016, 12:18 PM

ardumont added a project: Storage manager.

ardumont updated the task description. Feb 9 2016, 12:20 PM

ardumont renamed this task from "swh-loader-antelink bootstrap - Retrieve content from s3 and store inside swh-storage" to "Retrieve content from s3 and store inside swh-storage". Feb 9 2016, 12:29 PM

ardumont merged a task: T319: S3 content files downloader and injection in swh. Feb 23 2016, 4:12 PM

zack removed a project: Staff. Mar 10 2016, 5:49 PM

zack removed a project: Developers. Mar 10 2016, 5:54 PM

zack renamed this task from "Retrieve content from s3 and store inside swh-storage" to "retrieve content from s3 and store it in SWH storage". Apr 27 2016, 9:01 PM

zack assigned this task to ardumont.

zack lowered the priority of this task from High to Normal.

ardumont closed this task as Resolved. May 11 2016, 1:50 PM

olasd changed the visibility from "All Users" to "Public (No Login Required)". May 13 2016, 5:08 PM

This task has been migrated to GitLab.
