
Update the archiver specification

Authored by qcampos on Jul 29 2016, 3:33 PM.

Diff Detail

rDSTO Storage manager
Automatic diff as part of commit; lint not applicable.
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

qcampos retitled this revision to "Update the archiver specification".
qcampos updated this object.
qcampos edited the test plan for this revision.
qcampos added reviewers: olasd, zack.
zack requested changes to this revision. Jul 31 2016, 6:02 PM
zack edited edge metadata.
zack added inline comments.

"Peer-to-peer" (or P2P) is the standard spelling of this notion


A more substantial comment here is that the recent changes in how the archiver works didn't really turn it into a P2P system. Most notably we still rely on a centralized director that: a) spots that more copies are needed, and b) decides what-to-copy-where.

So I propose to call this section "Peer-to-peer topology". And maybe add a paragraph to state that, whereas peers are in general considered to be equal, coordination is currently centralized.


Here is a plug to reiterate, if needed, that the source/destination decision is not up to individual nodes.


why is the discussion of mtime gone from here?


same as above: why is this gone?


I don't understand what "but elapsed time" means here, please clarify/rewrite.


nitpick: there are capital letters at the beginning of bullet points, whereas they aren't there all over the document


nitpick: writing this "for each (content, source, destination)" would make it less odd


typo: "transfert" -> "transfer"


typo: "api" -> "API"


maybe add a comment here that these are sample/initial archive names


The schema of the JSON here should be described, in a way that allows validation, ideally using json-schema.
See sql/json/ in swh-storage for examples.

This revision now requires changes to proceed. Jul 31 2016, 6:02 PM
qcampos edited edge metadata.
qcampos marked 11 inline comments as done.

Correct the specs and add the JSON schema description


The new specification does not check /mtime/ in the director. It only counts the number of copies that are effectively present to select the contents that need more copies.
The /mtime/ check for the /ongoing/ status is now done in the worker only (note: it was previously done twice).

This makes the batches less accurate about which contents really need archival, but they are generated more quickly, and since this check was done in the worker anyway, we save one DB request per content.
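
The split described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual archiver code: the function names, the `RETENTION_COPIES` and `ONGOING_DELAY` values, the shape of the copies mapping, and the rule that a recent /ongoing/ copy counts as covered are all hypothetical.

```python
from typing import Dict, Iterable, Iterator, Tuple

# Hypothetical record: archive name -> (status, mtime as a Unix timestamp).
Copies = Dict[str, Tuple[str, float]]

RETENTION_COPIES = 2    # assumed number of required copies
ONGOING_DELAY = 3600.0  # assumed delay (seconds) before an 'ongoing' copy is treated as stalled


def count_present(copies: Copies) -> int:
    """Count the copies whose status is 'present'."""
    return sum(1 for status, _ in copies.values() if status == "present")


def select_for_archival(contents: Iterable[Tuple[str, Copies]]) -> Iterator[str]:
    """Director side: count present copies only, never look at mtime."""
    for content_id, copies in contents:
        if count_present(copies) < RETENTION_COPIES:
            yield content_id


def worker_needs_copy(copies: Copies, now: float) -> bool:
    """Worker side: also count 'ongoing' copies, but only if their mtime is recent."""
    effective = 0
    for status, mtime in copies.values():
        if status == "present":
            effective += 1
        elif status == "ongoing" and now - mtime < ONGOING_DELAY:
            effective += 1
    return effective < RETENTION_COPIES
```

With this split the director may put a content in a batch even though a recent /ongoing/ copy already covers it; the worker then drops it, which is the loss of accuracy mentioned above.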

zack requested changes to this revision. Aug 1 2016, 2:36 PM
zack edited edge metadata.

All good! Just a couple of new nits remaining.


OK, fair enough. Thanks for clarifying.


Several "," separators are missing in this JSON.
Pro tip: you can check whether it's correct or not using json-glib-format (which I've just installed on your machine ;-))
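
If json-glib-format is not at hand, the same kind of check can be done with Python's standard `json` module. A minimal sketch; the helper name `json_error` is made up for illustration:

```python
import json
from typing import Optional


def json_error(text: str) -> Optional[str]:
    """Return None if text parses as JSON, else a short description of the error."""
    try:
        json.loads(text)
    except json.JSONDecodeError as exc:
        # lineno/colno point at the place where parsing failed.
        return f"line {exc.lineno}, column {exc.colno}: {exc.msg}"
    return None
```

For instance, `json_error('{"a": 1 "b": 2}')` returns an error description because the "," between the two members is missing, whereas `json_error('{"a": 1, "b": 2}')` returns None.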


A description of valid values is missing here, adding the following line here should do:

"enum": ["missing", "ongoing", "present", "corrupted"]

(but please check the JSON Schema spec because I'm not 100% sure)
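
To illustrate what such an "enum" constraint is meant to enforce, here is a small hand-rolled check. It is a sketch only, not a JSON Schema validator: the `STATUS_SCHEMA` fragment and the `is_valid_status` helper are hypothetical, and only the "enum" line itself comes from the comment above.

```python
# Hypothetical schema fragment for the status field; only "enum" is taken from the review.
STATUS_SCHEMA = {
    "type": "string",
    "enum": ["missing", "ongoing", "present", "corrupted"],
}


def is_valid_status(value: object) -> bool:
    """Accept a value only if it is a string listed in the schema's enum."""
    return isinstance(value, str) and value in STATUS_SCHEMA["enum"]
```

Here `is_valid_status("present")` holds, while `is_valid_status("done")` and `is_valid_status(1)` do not.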

This revision now requires changes to proceed. Aug 1 2016, 2:36 PM
qcampos edited edge metadata.
qcampos marked 5 inline comments as done.

Correct the JSON format and move the integrity check from the copier to the worker (the previous position was a mistake)


"enum" seems valid. Thanks!

zack edited edge metadata.
This revision is now accepted and ready to land. Aug 1 2016, 2:50 PM
This revision was automatically updated to reflect the committed changes.