Details

Reviewers

Group Reviewers

Maniphest Tasks

T423: Make a content integrity checker that can run on a Tier 1 node
T304: content integrity checker

Commits

rDSTOCeb8ada459224: Add some tests for the content integrity checker
rDSTOCe4881f0a2959: Create a content integrity checker that runs in local to verify objects
rDSTOC26e933b0eb3e: Add some methods to the object storage in order to allow a content integrity…
rDSTOCb21613965b98: Also, add a get_random_contents access to the remote API
R65:e4881f0a2959: Create a content integrity checker that runs in local to verify objects
R65:26e933b0eb3e: Add some methods to the object storage in order to allow a content integrity…
R65:eb8ada459224: Add some tests for the content integrity checker
R65:b21613965b98: Also, add a get_random_contents access to the remote API
rDSTOb21613965b98: Also, add a get_random_contents access to the remote API
rDSTOeb8ada459224: Add some tests for the content integrity checker
rDSTOe4881f0a2959: Create a content integrity checker that runs in local to verify objects
rDSTO26e933b0eb3e: Add some methods to the object storage in order to allow a content integrity…

Summary

Add a restore_bytes method to the object storage
Create a content integrity checker that runs locally to verify objects
Add some tests for the content integrity checker

Test Plan

Detection of valid/invalid content is fine
Repairing a content is possible
A content that cannot be repaired does not make the checker crash

Diff Detail

Repository

rDSTO Storage manager

Branch

T304

Lint

No Linters Available

Unit

No Unit Test Coverage

Build Status

Buildable 128
Build 185: Software Heritage Python tests
Build 184: arc lint + arc unit

Event Timeline

qcampos updated this revision to Diff 97. May 26 2016, 12:17 PM

qcampos added a task: T304: content integrity checker.

qcampos retitled this revision from to Create a content integrity checker.

qcampos updated this object.

qcampos edited the test plan for this revision.

Herald added a reviewer: Reviewers. May 26 2016, 12:17 PM

qcampos added inline comments. May 26 2016, 12:21 PM

swh/storage/objstorage/objstorage.py
194–221	As you can see on lines 180-181, there is `if obj_id in self: return obj_id`. That allows the objstorage to skip writing a file that is already present, but if the content is corrupted there is no way to restore it (at least in the API). That's the reason I added a `restore_bytes` method that is so similar.
swh/storage/tests/test_checker.py
48–49	Add a mock that simulates the needed part of a storage in order to simplify the tests.

Add the 'Closes T304' tag to the last commit messages.

Herald edited edge metadata. May 26 2016, 12:23 PM

ardumont added a subscriber: ardumont. May 26 2016, 4:48 PM

ardumont added inline comments.

swh/storage/checker/checker.py
9	barf, missing c, AbstractChecker ^^

In D31#633, @qcampos wrote:

Add the 'Closes T304' tag to the last commit messages.

swh/storage/checker/checker.py
24	Why don't you use the same docstring in the abstract functions? I like those in the main docstring (here) better than the function ones ^^
51	`which content`
57	`should implement`
swh/storage/checker/local_checker.py
12 ↗	(On Diff #98)	Barf, missing c, ContentChecker. You are consistent ^^
19 ↗	(On Diff #98)	`dictionary`
73 ↗	(On Diff #98)	I'm curious about this, why would there be no content?
swh/storage/objstorage/objstorage.py
194–221	OK. Maybe adding a flag to the initial `add_bytes` function to check for existing presence (defaulting to true to keep the existing behavior) would be OK too, to avoid duplication. What do you think? Something like `def add_bytes(self, bytes, obj_id=None, check_presence=True):`. Then, in `add_bytes`: `if check_presence and obj_id in self: return obj_id`. Finally, your `restore_bytes` is just `add_bytes` with that `check_presence` flag set to False.
195	`Restore`
swh/storage/tests/test_checker.py
37	I created a flag 'fs' for filesystem instead of db when it's fs related in swh-model. I am under the impression it's only fs related here.
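
The `check_presence` suggestion for `add_bytes` above could look roughly like the sketch below. It is only an illustration: `InMemoryObjStorage` is a hypothetical dict-backed stand-in, not the real objstorage class, and using a plain SHA-1 hex digest as the object id is an assumption.

```python
import hashlib


class InMemoryObjStorage:
    """Hypothetical dict-backed stand-in for the object storage."""

    def __init__(self):
        self._objects = {}

    def __contains__(self, obj_id):
        return obj_id in self._objects

    def add_bytes(self, content, obj_id=None, check_presence=True):
        # Assumption: the object id is the plain SHA-1 hex digest.
        if obj_id is None:
            obj_id = hashlib.sha1(content).hexdigest()
        # Default behavior: never rewrite an object that is already stored.
        if check_presence and obj_id in self:
            return obj_id
        self._objects[obj_id] = content
        return obj_id

    def restore_bytes(self, content, obj_id=None):
        # Same code path as add_bytes, but overwrite even if present.
        return self.add_bytes(content, obj_id, check_presence=False)

    def get_bytes(self, obj_id):
        return self._objects[obj_id]
```

With this shape, a corrupted entry survives a plain `add_bytes` call and is only replaced by `restore_bytes`, which keeps the existing behavior for every current caller.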

ardumont added a comment. May 26 2016, 5:26 PM

This comment was removed by ardumont.

Overall, this seems good, just:

beware the typos, especially in the class and function/methods names
add an entry point to trigger an actual check ^^ (or did i miss it?)

And this is good to go.

Typos corrections
Removing useless abstract class
Add a way to launch the checker in cl

Herald edited edge metadata. May 26 2016, 7:11 PM

Actually, the checker is not yet ready, as there is still an important question about how to choose the content that should be checked.
We want a stateless probabilistic method (See zack's comment on T304).

swh/storage/checker/local_checker.py
73 ↗	(On Diff #98)	In case the storage we are using doesn't have the content we want. When objstorage raises an `ObjNotFoundError` on a get, the storage puts a `None` in the list, probably to allow the other contents to be retrieved. A problem here is that we don't know which content failed (well, we do de facto by looking at the code, but nothing in the `storage.content_get` contract gives us assurance that the responses and the requests are in the same order). So here, should I use the fact that it's the same order? Or do the requests for the repair one by one?

ardumont added inline comments. May 26 2016, 7:45 PM

swh/storage/checker/checker.py
100	Continuing the discussion here since phab won't let me do it at the end of the previous comment.
> In case the storage we are using doesn't have the content we want. When objstorage raises an ObjNotFoundError on a get, the storage puts a None in the list, probably to allow the other contents to be retrieved.
Ah, yes, indeed. Thanks for refreshing me on this.
> So here, should I use the fact that it's the same order?
No. Or, yes, if you update its contract about that order being the same as the input. Also, the contract does not state that an unknown content should be returned as None either. My point is, this is not the definitive API. This was opened initially for the web-ui and was pretty basic. So you could improve the content_get API to what would be the best here (and update the docstring accordingly). Bunch of questions: is it normal to ask the backup server for a content it does not have? I realize I am not sure what a backup server is. Could it be a slave node (in the content archiver sense)? Also, could it make sense that it's not one backup server but multiple ones then?
> Do the requests for the repair one-by-one ?
I don't like that idea much. If the API knows how to deal in batch, keep it in batch.
swh/storage/objstorage/objstorage.py
195	you missed the typo ^^
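
To illustrate the `content_get` discussion above: if the contract were updated to guarantee one entry per requested id, in request order, with `None` for unknown contents, the caller could stay in batch mode and still identify what is missing. This is a minimal sketch under that assumption; `content_get` here is a hypothetical stand-in over a plain dict, not the real storage API.

```python
def content_get(store, obj_ids):
    """Hypothetical batch getter.

    Assumed contract: exactly one entry per requested id, in request
    order, with None for ids the store does not know about.
    """
    return [store.get(obj_id) for obj_id in obj_ids]


def find_missing(store, obj_ids):
    """Return the requested ids for which the store had no content."""
    obj_ids = list(obj_ids)
    results = content_get(store, obj_ids)
    # Pairing by position is only valid because of the ordering contract.
    return [obj_id for obj_id, data in zip(obj_ids, results) if data is None]
```

Without the ordering guarantee in the docstring, the `zip` pairing would rely on an implementation detail, which is the concern raised in the comment.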

Actually, the checker is not yet ready, as there is still an important question about how to choose the content that should be checked.
We want a stateless probabilistic method (See zack's comment on T304).

Ack.

Still, nothing prevents this from being merged and patched later with a smarter approach.
I don't want to be pushy, just sayin' ^^, that's what we did with the content archiver IIRC.

Add a method to get random objects from an object storage (local & remote API)
Use the previous method to get a sample in the checker
Add a way to log contents that could not be restored by the checker
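
The three changes listed above (random sampling, using the sample in the checker, logging unrecoverable contents) might fit together as in the sketch below. All of it is hypothetical: `objects` and `backup` are plain dicts standing in for the local objstorage and the backup storage, the id is assumed to be a plain SHA-1 hex digest, and this `get_random_contents` only borrows its name from the commit message, not its real signature.

```python
import hashlib
import logging
import random

logger = logging.getLogger(__name__)


def get_random_contents(objects, batch_size, rng=random):
    """Pick up to batch_size object ids at random, keeping no state."""
    ids = list(objects)
    return rng.sample(ids, min(batch_size, len(ids)))


def is_valid(obj_id, content):
    # Assumption: the id is the SHA-1 hex digest of the content.
    return hashlib.sha1(content).hexdigest() == obj_id


def check_and_repair(objects, backup, batch_size, rng=random):
    """Check a random sample, repair from backup, log what cannot be fixed."""
    unrecoverable = []
    for obj_id in get_random_contents(objects, batch_size, rng):
        if is_valid(obj_id, objects[obj_id]):
            continue
        replacement = backup.get(obj_id)
        if replacement is not None and is_valid(obj_id, replacement):
            objects[obj_id] = replacement
        else:
            # Do not crash: record the failure and move on.
            logger.error("content %s is corrupted and could not be restored", obj_id)
            unrecoverable.append(obj_id)
    return unrecoverable
```

Because each run draws a fresh sample, nothing needs to be remembered between runs, which is one way to read the "stateless probabilistic" requirement.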

Herald edited edge metadata. May 27 2016, 1:28 PM

qcampos added a task: T423: Make a content integrity checker that can run on a Tier 1 node. May 27 2016, 1:28 PM

Correct some misleading documentation

Herald edited edge metadata. May 27 2016, 1:32 PM

Aside from some small remarks, this is good to go.

swh/storage/objstorage/api/server.py
60	Are you sure the `list` is needed here? `encode_data` comes from `swh.storage.api.common.encode_data_server`. This uses `swh.core.serializers.msgpack_dumps` which checks for generator type and already consumes as list if it is. Can you please double check by removing it and running a test for example?
swh/storage/objstorage/objstorage.py
327	This is just a note to maybe trigger a conversation, not necessarily a change on the code ^^ I know we have a 'tmp' folder in /srv/softwareheritage/objects. I don't remember the purpose of it though. Anyway, this seems like an implementation detail which could be abstracted away (as a blacklist folders option for example). (Again not right now)
swh/storage/tests/test_objstorage.py
137	Can you please add a pre-check to validate the content is not already there. Then add it and then check it's indeed there.

Remove an unnecessary list() transformation.

Herald edited edge metadata. May 27 2016, 2:26 PM

qcampos added inline comments. May 27 2016, 2:26 PM

swh/storage/objstorage/api/server.py
60	Works fine without it. Thanks! Didn't notice that encode_data did the work.
swh/storage/objstorage/objstorage.py
327	The tmp folder is used by the object storage to create a temporary file as it computes the checksums on the fly. It then moves the file to its right location, renaming it according to the sha1. I know it lacks genericity, but as we are in the objstorage implementation I thought it was OK.
swh/storage/tests/test_objstorage.py
137	You mean the content that is added to the storage at line 136?
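
The tmp-folder mechanism described in the comment on line 327 can be sketched as follows. This is an illustration only: `add_file`, the `tmp` subfolder name under `root`, and the one-level `obj_id[:2]` directory layout are assumptions, not the actual objstorage code.

```python
import hashlib
import os
import tempfile


def add_file(root, chunks):
    """Write chunks to a temp file while hashing, then move it into place."""
    tmp_dir = os.path.join(root, "tmp")
    os.makedirs(tmp_dir, exist_ok=True)

    sha1 = hashlib.sha1()
    fd, tmp_path = tempfile.mkstemp(dir=tmp_dir)
    with os.fdopen(fd, "wb") as tmp_file:
        for chunk in chunks:
            sha1.update(chunk)  # checksum computed on the fly
            tmp_file.write(chunk)

    obj_id = sha1.hexdigest()
    # Assumed layout: <root>/<first two hex chars>/<sha1>
    final_dir = os.path.join(root, obj_id[:2])
    os.makedirs(final_dir, exist_ok=True)
    # Same filesystem, so the move is a rename and readers never
    # see a partially written object.
    os.replace(tmp_path, os.path.join(final_dir, obj_id))
    return obj_id
```

Since the temp folder lives inside the storage root, any code that walks the root (for example to pick random objects) has to skip it, which is the implementation detail the comment is about.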

qcampos marked 2 inline comments as done. May 27 2016, 2:27 PM

ardumont added inline comments. May 27 2016, 2:36 PM

swh/storage/objstorage/objstorage.py
327	> The tmp folder is used by the object storage to create a temporary file as it computes the checksums on the fly. It then moves the file to its right location, renaming it according to the sha1.
Awesome, thanks for refreshing me on this.
> I know it lacks genericity, but as we are in the objstorage implementation I thought it was OK.
Like I said, it's not that big of a deal.
swh/storage/tests/test_objstorage.py
137	Yes ^^ I don't know if the test class is stateful (meaning it shares state, so there could already be something inside which matches). So making sure it's not there before adding, and then checking it's there, seems reasonable.

qcampos added inline comments. May 27 2016, 2:46 PM

swh/storage/tests/test_objstorage.py
137	Tests are stateless, as each test runs on its own empty objstorage (`setUp` is called each time). So I guess there is no need for a check before. Also, isn't the after-adding check a little redundant with the `add_bytes` method test?

ardumont added inline comments. May 27 2016, 2:48 PM

swh/storage/tests/test_objstorage.py
137	> Tests are stateless, as each test runs on its own empty objstorage (setUp is called each time). So I guess there is no need for a test before.
Awesome then!
> Also, isn't the after-adding test a little redundant with the add_bytes method test ?
Indeed! ^^ Good to go then
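
The point about `setUp` settled in the thread above can be shown with a small `unittest` sketch. The test class and the dict used as storage are hypothetical; the only claim is the standard `unittest` behavior that `setUp` runs before every test method.

```python
import unittest


class TestFreshStorage(unittest.TestCase):
    """Hypothetical test case: a dict stands in for the objstorage."""

    def setUp(self):
        # Called before each test method, so every test starts empty.
        self.storage = {}

    def test_add(self):
        self.storage["some-id"] = b"content"
        self.assertIn("some-id", self.storage)

    def test_starts_empty(self):
        # Passes regardless of test order: test_add's write is not visible.
        self.assertEqual(self.storage, {})
```

Because no state leaks between test methods, a pre-check that the content is absent before adding it would always pass and adds little.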