
storage_checker: Do not re-check ranges already marked as checked
ClosedPublic

Authored by vlorentz on Oct 7 2022, 3:42 PM.

Details

Summary

For now this is a naive implementation, which never rechecks.

Depends on D8609

Resolves T4527.

Diff Detail

Repository
rDSCRUB Datastore Scrubber
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

Build is green

Patch application report for D8641 (id=31211)

Could not rebase; Attempt merge onto fef8a513df...

Updating fef8a51..2822428
Fast-forward
 docs/README.rst                               |   4 +
 swh/scrubber/db.py                            |   4 +-
 swh/scrubber/storage_checker.py               |  73 ++++++++++-
 swh/scrubber/tests/test_storage_postgresql.py | 172 ++++++++++++++++++++++++--
 4 files changed, 242 insertions(+), 11 deletions(-)
Changes applied before test
commit 2822428737168ec2ae8e147c261f860fdbd5e359
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Fri Oct 7 15:42:20 2022 +0200

    storage_checker: Do not re-check ranges already marked as checked
    
    For now this is a naive implementation, which does never rechecks.

commit 84fa17c00be8746aa2c08d15eeae68913da40842
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Tue Oct 4 14:18:50 2022 +0200

    storage_checker: Notify database when ranges are fully checked
    
    For now, this does not use this information to deduplicate work.

See https://jenkins.softwareheritage.org/job/DSCRUB/job/tests-on-diff/71/ for more details.

douardda added a subscriber: douardda.

lgtm

swh/scrubber/storage_checker.py
148

why 'currently' here? is it something you want to change in the future?

This revision is now accepted and ready to land. Oct 11 2022, 10:27 AM
swh/scrubber/storage_checker.py
148

Yes because, for example, changing the size of intervals would cause everything to be re-checked immediately, which is wasteful.

But it's also not super easy to do efficiently (interval trees would be optimal but complicated; there is probably some good-enough heuristic to be found that works for us).
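
To make the trade-off above concrete, here is a minimal sketch, not the code from this diff. It assumes ranges are half-open integer intervals `[start, end)` kept in memory; the class `CheckedRanges` and its methods are hypothetical names, and the real scrubber stores this state in its database. The exact-match lookup stands in for the naive approach, which misses as soon as the interval size changes. The merged-interval lookup is one possible "good-enough" heuristic, simpler than an interval tree.

```python
from bisect import bisect_right


class CheckedRanges:
    """Hypothetical in-memory record of half-open [start, end) ranges
    that were already checked. Illustration only, not the scrubber's API."""

    def __init__(self):
        self._exact = set()  # exact (start, end) pairs, for the naive lookup
        self._merged = []    # sorted, non-overlapping, coalesced intervals

    def mark_checked(self, start, end):
        self._exact.add((start, end))
        # Coalesce the new range with any stored interval it overlaps or touches.
        kept = []
        for s, e in self._merged:
            if e < start or s > end:
                kept.append((s, e))  # disjoint and not adjacent: keep as-is
            else:
                start, end = min(s, start), max(e, end)
        kept.append((start, end))
        kept.sort()
        self._merged = kept

    def is_checked_naive(self, start, end):
        # Naive: only a range with the exact same bounds counts as checked.
        return (start, end) in self._exact

    def is_covered(self, start, end):
        # Heuristic: checked if one coalesced interval contains the whole range.
        i = bisect_right(self._merged, (start, float("inf"))) - 1
        if i < 0:
            return False
        s, e = self._merged[i]
        return s <= start and end <= e


ranges = CheckedRanges()
ranges.mark_checked(0, 10)
ranges.mark_checked(10, 20)

# After doubling the interval size, the naive lookup would re-check [0, 20),
# while the coverage lookup recognizes it as already done.
needs_recheck_naive = not ranges.is_checked_naive(0, 20)
needs_recheck_covered = not ranges.is_covered(0, 20)
```

`mark_checked` is linear in the number of stored intervals, which stays small when checked ranges are mostly contiguous; an interval tree would only pay off if the checked set were heavily fragmented.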