Page MenuHomeSoftware Heritage

Create a content integrity checker
ClosedPublic

Authored by qcampos on May 26 2016, 12:17 PM.

Diff Detail

Repository
rDSTO Storage manager
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

qcampos retitled this revision from to Create a content integrity checker.
qcampos updated this object.
qcampos edited the test plan for this revision. (Show Details)
swh/storage/objstorage/objstorage.py
194–221

As you can see lines (180-181), there is :

if obj_id in self:
        return obj_id

That allow the objstorage to not write a file if already present, but in case the content is corrupted, there is no way to restore it (At least in the API).
Thats the reason I added a restore_bytes method that is so similar.

swh/storage/tests/test_checker.py
48–49 ↗(On Diff #97)

Add a mock that simulate the needed part of a storage in order to simplify the tests.

qcampos edited edge metadata.

Add the 'Closes T304' tag to the last commit messages.

ardumont added inline comments.
swh/storage/checker/checker.py
9 ↗(On Diff #98)

barf, missing c, AbstractChecker ^^

In D31#633, @qcampos wrote:

Add the 'Closes T304' tag to the last commit messages.

^^

swh/storage/checker/checker.py
24 ↗(On Diff #98)

Why don't you use the same docstring in the abstract functions?
I like those in the main docstring (here) better than the function ones ^^

51 ↗(On Diff #98)

which content

57 ↗(On Diff #98)

should implement

swh/storage/checker/local_checker.py
12 ↗(On Diff #98)

Barf, missing c, ContentChecker.
You are consistent ^^

19 ↗(On Diff #98)

dictionary

73 ↗(On Diff #98)

I'm curious about this, why would there be no content?

swh/storage/objstorage/objstorage.py
194–221

ok.

Maybe adding a flag to the initial add_bytes function to check for existing presence (defaulting to true to keep the existing behavior) would be ok too. To avoid duplication.
What do you think?

Something like

def add_bytes(self, bytes, obj_id=None, check_presence=True):

Then, in add_bytes:

if check_presence and obj_id in self:
        return obj_id

At last, your restore_bytes is just add_bytes with that check_presence flag to False.

195

Restore

swh/storage/tests/test_checker.py
37 ↗(On Diff #98)

I created a flag 'fs' for filesystem instead of db when it's fs related in swh-model.
I am under the impression it's only fs related here.

This comment was removed by ardumont.

Overall, this seems good, just:

  • beware the typos, especially in the class and function/methods names
  • add an entry point to trigger an actual check ^^ (or did i miss it?)

And this is good to go.

qcampos marked 11 inline comments as done.
qcampos edited edge metadata.
  • Typos corrections
  • Removing useless abstract class
  • Add a way to launch the checker in cl

Actually, the checker is not yet ready, as there is still an important question about how to choose the content that should be checker.
We want a stateless probabilistic method (See zack's comment on T304).

swh/storage/checker/local_checker.py
73 ↗(On Diff #98)

In case the storage we are using don't have the content we want. When objstorage raises a ObjNotFoundError on a get, the storage put a None in the list, probably to allow the other contents to be retrieved.

A problem here is we don't know which content failed (well, we do de facto by looking at the code, but nothing in storage.content_get contract gives us insurrance that the responses and the requests are in the same order).
So here, should I use the fact that it's the same order? Do the requests for the repair one-by-one ?

swh/storage/checker/checker.py
100 ↗(On Diff #100)

Continuing the discussion here since phab won't let me do it at the end of the previous comment.

In case the storage we are using don't have the content we want.
When objstorage raises a ObjNotFoundError on a get, the storage put a None in the list, probably to allow the other contents to be retrieved.

Ah, yes, Indeed.
Thanks for refreshing me on this.

So here, should I use the fact that it's the same order?

No.
Or, yes, if you update its contract about that order being the same as the input.

Also, the contract does not state that an unknown content shoud be returned as None either.
My point is, this is not the definitive api.
This was opened initially for the web-ui and was pretty basic.
So you could improve the content_get api to what would be the best here (and update the docstring accordingly).

Bunch of questions:

  • is it normal to ask the backup server for a content it does not have?
  • I realize i am not sure what a backup server is. Could it be slave node (in the content archiver sense)?
  • Also could it make sense it's not one backup server but multiple ones then?

Do the requests for the repair one-by-one ?

I don't like that idea much.
If the api knows how to deal in batch, keep it in batch.

swh/storage/objstorage/objstorage.py
195

you missed the typo ^^

Actually, the checker is not yet ready, as there is still an important question about how to choose the content that should be checker.
We want a stateless probabilistic method (See zack's comment on T304).

Ack.

Still, nothing prevents this from being merged and patched later with a smarter approach.
I don't want to be pushy, just sayin' ^^, that's what we did with the content archiver IIRC.

qcampos marked an inline comment as done.
qcampos edited edge metadata.
  • Add a method to get random objects from an object storage (local & remote API)
  • Use the previous method to get a sample in the checker
  • Add a way to log contents that could not be restored by the checker
qcampos edited edge metadata.

Correct some misleading documentation

Aside from some small remarks, this is good to go.

swh/storage/objstorage/api/server.py
61 ↗(On Diff #102)

Are you sure the list is needed here?

encode_data comes from swh.storage.api.common.encode_data_server.
This uses swh.core.serializers.msgpack_dumps which checks for generator type and already consumes as list if it is.

Can you please double check by removing it and running a test for example?

swh/storage/objstorage/objstorage.py
327

This is just a note to maybe trigger a conversation, not necessarily a change on the code ^^

I know we have a 'tmp' folder in /srv/softwareheritage/objects.
I don't remember the purpose of it though.

Anyway, this seems like an implementation detail which could be abstracted away (as a blacklist folders option for example).
(Again not right now)

swh/storage/tests/test_objstorage.py
137

Can you please add a pre-check to validate the content is not already there.
Then add it and then check it's indeed there.

qcampos edited edge metadata.

Remove an unnecessary list() transformation.

swh/storage/objstorage/api/server.py
61 ↗(On Diff #102)

Works fine without it. Thanks!
Didn't notice that encode_data did the work.

swh/storage/objstorage/objstorage.py
327

The tmp folder is used by the object storage to create a temporary file as it compute the checksums on the fly. It then move the file to it's right location, renaming it according to the sha1.

I know it lack of genericity, but as we are into objstorage implementation I thought it was ok.

swh/storage/tests/test_objstorage.py
137

You mean the content that is added to the storage at line 136 ?

swh/storage/objstorage/objstorage.py
327

The tmp folder is used by the object storage to create a temporary file as it compute the checksums on the fly. It then move the file to it's right location, renaming it according to the sha1.

Awesome, thanks for refreshing me on this.

I know it lack of genericity, but as we are into objstorage implementation I thought it was ok.

Like i said, it's not that big of a deal.

swh/storage/tests/test_objstorage.py
137

Yes ^^

I don't know if the test class is stateful (meaning it shares state, there could be something already inside which match).

So making sure it's not there before adding and checking it's there seems reasonable.

swh/storage/tests/test_objstorage.py
137

Tests are stateless, as each test runs on its own empty objstorage (setUp is called each time). So I guess there is no need for a test before.

Also, isn't the after-adding test a little redundant with the add_bytes method test ?

swh/storage/tests/test_objstorage.py
137

Tests are stateless, as each test runs on its own empty objstorage (setUp is called each time). So I guess there is no need for a test before.

Awesome then!

Also, isn't the after-adding test a little redundant with the add_bytes method test ?

Indeed! ^^

Good to go then

This revision is now accepted and ready to land.May 27 2016, 2:48 PM
This revision was automatically updated to reflect the committed changes.
zack changed the visibility from "All Users" to "Public (No Login Required)".