Details
- Reviewers
  - douardda
- Group Reviewers
  - Reviewers
- Maniphest Tasks
  - T4527: scrubber: keep a state file for postgresql datastores
- Commits
  - rDSCRUB282242873716: storage_checker: Do not re-check ranges already marked as checked
Diff Detail
- Repository
  - rDSCRUB Datastore Scrubber
- Lint
  - Automatic diff as part of commit; lint not applicable.
- Unit
  - Automatic diff as part of commit; unit tests not applicable.
Event Timeline
Build is green
Patch application report for D8641 (id=31211)
Could not rebase; Attempt merge onto fef8a513df...
Updating fef8a51..2822428
Fast-forward
 docs/README.rst                               |   4 +
 swh/scrubber/db.py                            |   4 +-
 swh/scrubber/storage_checker.py               |  73 ++++++++++-
 swh/scrubber/tests/test_storage_postgresql.py | 172 ++++++++++++++++++++++++--
 4 files changed, 242 insertions(+), 11 deletions(-)
Changes applied before test
commit 2822428737168ec2ae8e147c261f860fdbd5e359
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Fri Oct 7 15:42:20 2022 +0200

    storage_checker: Do not re-check ranges already marked as checked

    For now this is a naive implementation, which does never rechecks.

commit 84fa17c00be8746aa2c08d15eeae68913da40842
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Tue Oct 4 14:18:50 2022 +0200

    storage_checker: Notify database when ranges are fully checked

    For now, this does not use this information to deduplicate work.
See https://jenkins.softwareheritage.org/job/DSCRUB/job/tests-on-diff/71/ for more details.
lgtm
swh/scrubber/storage_checker.py, line 148:
why 'currently' here? is it something you want to change in the future?
swh/scrubber/storage_checker.py, line 148:
Yes, because, for example, changing the size of intervals would cause everything to be re-checked immediately, which is wasteful. But it's also not super easy to do efficiently (interval trees would be optimal but complicated; there is probably some good-enough heuristic to be found that works for us).
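
To illustrate the kind of "good-enough heuristic" mentioned in the reply, here is a minimal sketch that merges previously checked ranges into sorted, disjoint intervals and then tests whether a new range is already covered. It is not the scrubber's implementation: the names `merge_intervals` and `is_covered` are hypothetical, and ranges are modelled as half-open integer intervals purely for readability.

```python
from bisect import bisect_right
from typing import List, Tuple

# Assumption for illustration: a range is a half-open integer interval [start, end).
Interval = Tuple[int, int]


def merge_intervals(intervals: List[Interval]) -> List[Interval]:
    """Return sorted, disjoint intervals covering the same points."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous interval: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def is_covered(merged: List[Interval], start: int, end: int) -> bool:
    """Tell whether [start, end) lies inside one of the merged intervals."""
    # Locate the last merged interval whose start is <= `start`.
    index = bisect_right(merged, (start, float("inf"))) - 1
    if index < 0:
        return False
    return merged[index][1] >= end


# Ranges checked with a partition width of 4, then queried with a width of 8.
checked = merge_intervals([(4, 8), (0, 4), (12, 16)])
print(is_covered(checked, 0, 8))   # covered by the two adjacent old ranges
print(is_covered(checked, 8, 16))  # only partially checked
```

Because adjacent ranges are merged, changing the interval size would not by itself force a full re-check in this sketch; only ranges that extend past already covered regions would need to be checked again.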