Page MenuHomeSoftware Heritage

Update the archiver specification
ClosedPublic

Authored by qcampos on Jul 29 2016, 3:33 PM.

Diff Detail

Repository
rDSTO Storage manager
Branch
T512-doc
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 326
Build 485: Software Heritage Python tests
Build 484: arc lint + arc unit

Event Timeline

qcampos retitled this revision from to Update the archiver specification.
qcampos updated this object.
qcampos edited the test plan for this revision. (Show Details)
qcampos added reviewers: olasd, zack.
zack requested changes to this revision.Jul 31 2016, 6:02 PM
zack edited edge metadata.
zack added inline comments.
docs/archiver-blueprint.md
15

"Peer-to-peer" (or P2P) is the standard spelling of this notion

15–24

A more substantial comment here is that the recent changes in how the archiver works didn't really turn it into a P2P system. Most notably we still rely on a centralized director that: a) spots that more copies are needed, and b) decides what-to-copy-where.

So I propose to call this section "Peer-to-peer topology". And maybe add a paragraph to state that, whereas peers are in general considered to be equal, coordination is currently centralized.

34–40

Here is a plug to reiterate, if needed, that the source/destination decision is not up to individual nodes.

97–103

why is the discussion of mtime gone from here?

114

same as above: why is this gone?

119

I don't understand what "but elapsed time" means here, please clarify/rewrite.

128–134

nitpick: there are capital letters at the beginning of bullet points, whereas they aren't there all over the document

130

nitpick: writing this "for each (content, source, destination)" would make it less odd

133

typo: "transfert" -> "transfer"

163–174

typo: "api" -> "API"

182–183

maybe add a comment here that these are sample/initial archive names

201

The schema of the JSON here should be described, in a way that allows validation, ideally using json-schema.
See sql/json/ in swh-storage for examples.

This revision now requires changes to proceed.Jul 31 2016, 6:02 PM
qcampos edited edge metadata.
qcampos marked 11 inline comments as done.

Corrections of the specs and add the json schema description

docs/archiver-blueprint.md
97–103

The new specification does not check /mtime/ in the director. It only count the number of effectively present copies to select contents that need or not more copies.
The check of the /mtime/s of /ongoing/ status is now done in the worker only (note: this was previously done twice).

This makes the batches less accurate about contents that really needs archival, but they are generated quicker, and as this check was done anyway in the worker, we can save a db request for each content.

zack requested changes to this revision.Aug 1 2016, 2:36 PM
zack edited edge metadata.

All good! Just a couple of new nits remaining.

docs/archiver-blueprint.md
97–103

OK, fair enough. Thanks for clarifying.

210–229

Several separator "," are missing in this JSON.
Pro tip: you can check whether it's correct or not using json-glib-format (which I've just installed on your machine ;-))

220–221

A description of valid values is missing here, adding the following line here should do:

"enum": ["missing", "ongoing", "present", "corrupted"]

(but please check the JSON Schema spec because I'm not 100% sure)

This revision now requires changes to proceed.Aug 1 2016, 2:36 PM
qcampos edited edge metadata.
qcampos marked 5 inline comments as done.

Correct json format and move integrity check from copier to worker (previous position was a mistake)

docs/archiver-blueprint.md
220–221

"enum" seems valid. Thanks !

zack edited edge metadata.
This revision is now accepted and ready to land.Aug 1 2016, 2:50 PM
This revision was automatically updated to reflect the committed changes.