Page MenuHomeSoftware Heritage

Add a validating storage proxy, to check ids before insertion.
ClosedPublic

Authored by vlorentz on Aug 25 2020, 6:08 PM.

Details

Summary

It can be used to prevent buggy or malicious clients of the storage
from adding invalid objects.

On the other hand, it will be computationally expensive, especially for contents.
What do you think?

Diff Detail

Repository
rDSTO Storage manager
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

Build has FAILED

Patch application report for D3844 (id=13539)

Rebasing onto 4532a4dc64...

First, rewinding head to replay your work on top of it...
Applying: Add a validating storage proxy, to check ids before insertion.
Changes applied before test
commit ffb5a75b456d8ec23df1ba9f519406a133eaca7a
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Tue Aug 25 18:08:27 2020 +0200

    Add a validating storage proxy, to check ids before insertion.
    
    It can be used to prevent buggy or malicious clients of the storage
    from adding invalid objects.

Link to build: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/875/
See console output for more information: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/875/console

Build is green

Patch application report for D3844 (id=13540)

Rebasing onto 4532a4dc64...

First, rewinding head to replay your work on top of it...
Applying: Add a validating storage proxy, to check ids before insertion.
Changes applied before test
commit a9ac2164bfba38b198b3bbcbe3426d8a9a0f04b8
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Tue Aug 25 18:08:27 2020 +0200

    Add a validating storage proxy, to check ids before insertion.
    
    It can be used to prevent buggy or malicious clients of the storage
    from adding invalid objects.

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/876/ for more details.

On the other hand, it will be computationally expensive, especially for contents.
What do you think?

It's computationally expensive but not IO expensive (no queries to the backend), so this cost is rather trivial to assume: buy more hardware and add more swh.storage instances. It would be interesting to give it a try on a test session (docker or staging) and run every loader on a bunch of origins to see if ti catches some bugs, at least.

@douardda but even if it's not a bottleneck, we'll have to check it doesn't slow down loaders too much because of the extra latency.

I did not look into details but I think that was initially what we wanted to do
with the "real" validate proxy in the first place [1]

In that regard, if it can prevent issues, i think it's fine.
Although, we don't have yet other untrusted clients which push to storage...

Maybe that could catch potential serialization issues or something though?

[1] recall the other "validate" one (removed now) we transformed into a "convert me those dicts to data model objects" layer ¯\_(ツ)_/¯)

looks good to me

if we want to test it on staging or docker first, then needs to be in master in the first place heh ;)

swh/storage/tests/test_validate.py
48

test_validating...

This revision is now accepted and ready to land.Aug 26 2020, 4:36 PM
This revision was landed with ongoing or failed builds.Aug 28 2020, 1:19 PM
This revision was automatically updated to reflect the committed changes.

Build is green

Patch application report for D3844 (id=13606)

Rebasing onto 5afd985ebd...

First, rewinding head to replay your work on top of it...
Fast-forwarded diff-target to base-revision-882-D3844.
Changes applied before test

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/882/ for more details.