- Add a restore_bytes method to the object storage
- Create a content integrity checker that runs in local to verify objects
- Add some tests for the content integrity checker
Details
- Reviewers
ardumont - Group Reviewers
Reviewers - Maniphest Tasks
- T423: Make a content integrity checker that can run on a Tier 1 node
T304: content integrity checker - Commits
- rDSTOCeb8ada459224: Add some tests for the content integrity checker
rDSTOCe4881f0a2959: Create a content integrity checker that runs in local to verify objects
rDSTOC26e933b0eb3e: Add some methods to the object storage in order to allow a content integrity…
rDSTOCb21613965b98: Also, add a get_random_contents access to the remote API
R65:e4881f0a2959: Create a content integrity checker that runs in local to verify objects
R65:26e933b0eb3e: Add some methods to the object storage in order to allow a content integrity…
R65:eb8ada459224: Add some tests for the content integrity checker
R65:b21613965b98: Also, add a get_random_contents access to the remote API
rDSTOb21613965b98: Also, add a get_random_contents access to the remote API
rDSTOeb8ada459224: Add some tests for the content integrity checker
rDSTOe4881f0a2959: Create a content integrity checker that runs in local to verify objects
rDSTO26e933b0eb3e: Add some methods to the object storage in order to allow a content integrity…
- Detection of valid/invalid content is fine
- A content is repaired is possible
- A content that cannot be repaired does not make the checker crash
Diff Detail
- Repository
- rDSTO Storage manager
- Branch
- T304
- Lint
No Linters Available - Unit
No Unit Test Coverage - Build Status
Buildable 128 Build 185: Software Heritage Python tests Build 184: arc lint + arc unit
Event Timeline
swh/storage/objstorage/objstorage.py | ||
---|---|---|
194–221 | As you can see lines (180-181), there is : if obj_id in self: return obj_id That allow the objstorage to not write a file if already present, but in case the content is corrupted, there is no way to restore it (At least in the API). | |
swh/storage/tests/test_checker.py | ||
48–49 | Add a mock that simulate the needed part of a storage in order to simplify the tests. |
swh/storage/checker/checker.py | ||
---|---|---|
9 | barf, missing c, AbstractChecker ^^ |
^^
swh/storage/checker/checker.py | ||
---|---|---|
24 | Why don't you use the same docstring in the abstract functions? | |
51 | which content | |
57 | should implement | |
swh/storage/checker/local_checker.py | ||
12 ↗ | (On Diff #98) | Barf, missing c, ContentChecker. |
19 ↗ | (On Diff #98) | dictionary |
73 ↗ | (On Diff #98) | I'm curious about this, why would there be no content? |
swh/storage/objstorage/objstorage.py | ||
194–221 | ok. Maybe adding a flag to the initial add_bytes function to check for existing presence (defaulting to true to keep the existing behavior) would be ok too. To avoid duplication. Something like def add_bytes(self, bytes, obj_id=None, check_presence=True): Then, in add_bytes: if check_presence and obj_id in self: return obj_id At last, your restore_bytes is just add_bytes with that check_presence flag to False. | |
195 | Restore | |
swh/storage/tests/test_checker.py | ||
37 | I created a flag 'fs' for filesystem instead of db when it's fs related in swh-model. |
Overall, this seems good, just:
- beware the typos, especially in the class and function/methods names
- add an entry point to trigger an actual check ^^ (or did i miss it?)
And this is good to go.
- Typos corrections
- Removing useless abstract class
- Add a way to launch the checker in cl
Actually, the checker is not yet ready, as there is still an important question about how to choose the content that should be checker.
We want a stateless probabilistic method (See zack's comment on T304).
swh/storage/checker/local_checker.py | ||
---|---|---|
73 ↗ | (On Diff #98) | In case the storage we are using don't have the content we want. When objstorage raises a ObjNotFoundError on a get, the storage put a None in the list, probably to allow the other contents to be retrieved. A problem here is we don't know which content failed (well, we do de facto by looking at the code, but nothing in storage.content_get contract gives us insurrance that the responses and the requests are in the same order). |
swh/storage/checker/checker.py | ||
---|---|---|
100 | Continuing the discussion here since phab won't let me do it at the end of the previous comment.
Ah, yes, Indeed.
No. Also, the contract does not state that an unknown content shoud be returned as None either. Bunch of questions:
I don't like that idea much. | |
swh/storage/objstorage/objstorage.py | ||
195 | you missed the typo ^^ |
Actually, the checker is not yet ready, as there is still an important question about how to choose the content that should be checker.
We want a stateless probabilistic method (See zack's comment on T304).
Ack.
Still, nothing prevents this from being merged and patched later with a smarter approach.
I don't want to be pushy, just sayin' ^^, that's what we did with the content archiver IIRC.
- Add a method to get random objects from an object storage (local & remote API)
- Use the previous method to get a sample in the checker
- Add a way to log contents that could not be restored by the checker
Aside from some small remarks, this is good to go.
swh/storage/objstorage/api/server.py | ||
---|---|---|
60 | Are you sure the list is needed here? encode_data comes from swh.storage.api.common.encode_data_server. Can you please double check by removing it and running a test for example? | |
swh/storage/objstorage/objstorage.py | ||
327 | This is just a note to maybe trigger a conversation, not necessarily a change on the code ^^ I know we have a 'tmp' folder in /srv/softwareheritage/objects. Anyway, this seems like an implementation detail which could be abstracted away (as a blacklist folders option for example). | |
swh/storage/tests/test_objstorage.py | ||
137 | Can you please add a pre-check to validate the content is not already there. |
swh/storage/objstorage/api/server.py | ||
---|---|---|
60 | Works fine without it. Thanks! | |
swh/storage/objstorage/objstorage.py | ||
327 | The tmp folder is used by the object storage to create a temporary file as it compute the checksums on the fly. It then move the file to it's right location, renaming it according to the sha1. I know it lack of genericity, but as we are into objstorage implementation I thought it was ok. | |
swh/storage/tests/test_objstorage.py | ||
137 | You mean the content that is added to the storage at line 136 ? |
swh/storage/objstorage/objstorage.py | ||
---|---|---|
327 |
Awesome, thanks for refreshing me on this.
Like i said, it's not that big of a deal. | |
swh/storage/tests/test_objstorage.py | ||
137 | Yes ^^ I don't know if the test class is stateful (meaning it shares state, there could be something already inside which match). So making sure it's not there before adding and checking it's there seems reasonable. |
swh/storage/tests/test_objstorage.py | ||
---|---|---|
137 | Tests are stateless, as each test runs on its own empty objstorage (setUp is called each time). So I guess there is no need for a test before. Also, isn't the after-adding test a little redundant with the add_bytes method test ? |
swh/storage/tests/test_objstorage.py | ||
---|---|---|
137 |
Awesome then!
Indeed! ^^ Good to go then |