storage_checker: Do not re-check ranges already marked as checked
ClosedPublic

Authored by vlorentz on Oct 7 2022, 3:42 PM.

Details

Summary

For now this is a naive implementation, which never re-checks a range once it is marked as checked.

Depends on D8609

Resolves T4527.

Diff Detail

Repository
rDSCRUB Datastore Scrubber
Branch
checkpoint
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 32166
Build 50371: Phabricator diff pipeline on jenkins
Build 50370: arc lint + arc unit

Event Timeline

Build is green

Patch application report for D8641 (id=31211)

Could not rebase; Attempt merge onto fef8a513df...

Updating fef8a51..2822428
Fast-forward
 docs/README.rst                               |   4 +
 swh/scrubber/db.py                            |   4 +-
 swh/scrubber/storage_checker.py               |  73 ++++++++++-
 swh/scrubber/tests/test_storage_postgresql.py | 172 ++++++++++++++++++++++++--
 4 files changed, 242 insertions(+), 11 deletions(-)
Changes applied before test
commit 2822428737168ec2ae8e147c261f860fdbd5e359
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Fri Oct 7 15:42:20 2022 +0200

    storage_checker: Do not re-check ranges already marked as checked
    
    For now this is a naive implementation, which does never rechecks.

commit 84fa17c00be8746aa2c08d15eeae68913da40842
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Tue Oct 4 14:18:50 2022 +0200

    storage_checker: Notify database when ranges are fully checked
    
    For now, this does not use this information to deduplicate work.

See https://jenkins.softwareheritage.org/job/DSCRUB/job/tests-on-diff/71/ for more details.

douardda added a subscriber: douardda.

lgtm

swh/scrubber/storage_checker.py
149

why 'currently' here? is it something you want to change in the future?

This revision is now accepted and ready to land. Oct 11 2022, 10:27 AM
swh/scrubber/storage_checker.py
149

Yes, because changing the size of intervals, for example, would cause everything to be re-checked immediately, which is wasteful.

But it's also not super easy to do it efficiently (interval trees would be optimal but complicated; there is probably some good-enough heuristic to be found that works for us).
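
To illustrate the trade-off discussed above, here is a minimal sketch of one possible "good-enough" approach. It is not the code in this diff. Checked ranges are kept as a sorted list of disjoint intervals that are merged on insertion, so a query with a different interval size can still be recognized as already covered. The class `CheckedRanges`, its methods, and the use of plain integers for range bounds are all hypothetical stand-ins; the scrubber's actual range representation and database schema are not shown here.

```python
import bisect
from typing import List, Tuple


class CheckedRanges:
    """Hypothetical helper: disjoint, sorted, half-open ranges [start, end)."""

    def __init__(self) -> None:
        # Parallel sorted lists; range i is [self._starts[i], self._ends[i]).
        self._starts: List[int] = []
        self._ends: List[int] = []

    def add(self, start: int, end: int) -> None:
        """Mark [start, end) as checked, merging with touching/overlapping ranges."""
        if start >= end:
            return
        # Index of the first stored range whose end is >= start.
        lo = bisect.bisect_left(self._ends, start)
        # Number of stored ranges whose start is <= end.
        hi = bisect.bisect_right(self._starts, end)
        if lo < hi:
            # Ranges lo..hi-1 overlap or touch the new one: absorb them.
            start = min(start, self._starts[lo])
            end = max(end, self._ends[hi - 1])
        self._starts[lo:hi] = [start]
        self._ends[lo:hi] = [end]

    def is_covered(self, start: int, end: int) -> bool:
        """Return True if [start, end) lies entirely inside one stored range."""
        i = bisect.bisect_right(self._starts, start) - 1
        return i >= 0 and self._ends[i] >= end

    def ranges(self) -> List[Tuple[int, int]]:
        return list(zip(self._starts, self._ends))


checked = CheckedRanges()
checked.add(0, 4)
checked.add(4, 8)

# An exact-match lookup on (start, end) pairs would miss this query,
# because no range (0, 8) was ever recorded as such.
exact_match = {(0, 4), (4, 8)}
print((0, 8) in exact_match)
print(checked.is_covered(0, 8))
```

With exact-match bookkeeping, doubling the interval size makes every lookup miss, which is the wasteful re-check described above. The merged list answers coverage queries in logarithmic time, but insertion can be linear in the number of stored ranges, and this sketch ignores when each range was checked, so an interval tree or a database-side query may still be preferable in practice.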