Page MenuHomeSoftware Heritage

storage_checker: Notify database when ranges are fully checked
ClosedPublic

Authored by vlorentz on Oct 4 2022, 2:19 PM.

Details

Summary

For now, this does not use this information to deduplicate work (T4527)

Depends on D8608.

Diff Detail

Repository
rDSCRUB Datastore Scrubber
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

Build has FAILED

Patch application report for D8609 (id=31094)

Could not rebase; Attempt merge onto 630001ce7b...

Merge made by the 'recursive' strategy.
 docs/README.rst                               |   4 +
 swh/scrubber/db.py                            |  79 +++++++++++++++-
 swh/scrubber/sql/30-schema.sql                |  17 ++++
 swh/scrubber/sql/60-indexes.sql               |   6 ++
 swh/scrubber/sql/upgrades/4.sql               |  21 +++++
 swh/scrubber/storage_checker.py               |  53 ++++++++++-
 swh/scrubber/tests/test_db.py                 |  58 ++++++++++++
 swh/scrubber/tests/test_storage_postgresql.py | 131 ++++++++++++++++++++++++--
 8 files changed, 358 insertions(+), 11 deletions(-)
 create mode 100644 swh/scrubber/sql/upgrades/4.sql
 create mode 100644 swh/scrubber/tests/test_db.py
Changes applied before test
commit 4779dd8f9b4b1b1758c9a904daa938315b76a7c1
Merge: 630001c c4f91f6
Author: Jenkins user <jenkins@localhost>
Date:   Tue Oct 4 12:19:24 2022 +0000

    Merge branch 'diff-target' into HEAD

commit c4f91f62fe3ed6ebb16fa6decd0a3b09d018ef87
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Tue Oct 4 14:18:50 2022 +0200

    storage_checker: Notify database when ranges are fully checked
    
    For now, this does not use this information to deduplicate work.

commit d64e557c18213034772c1784eeedc2e89a1c4a2f
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Tue Oct 4 14:17:12 2022 +0200

    db: Add table checked_ranges
    
    It will be used by the storage_checker to 'remember' what ranges
    it already checked recently across runs (and crashes), and to
    monitor progress.

Link to build: https://jenkins.softwareheritage.org/job/DSCRUB/job/tests-on-diff/64/
See console output for more information: https://jenkins.softwareheritage.org/job/DSCRUB/job/tests-on-diff/64/console

Harbormaster returned this revision to the author for changes because remote builds failed.Oct 4 2022, 2:19 PM
Harbormaster failed remote builds in B32064: Diff 31094!

Build is green

Patch application report for D8609 (id=31095)

Could not rebase; Attempt merge onto 630001ce7b...

Merge made by the 'recursive' strategy.
 docs/README.rst                               |   4 +
 swh/scrubber/db.py                            |  79 +++++++++++++++-
 swh/scrubber/sql/30-schema.sql                |  17 ++++
 swh/scrubber/sql/60-indexes.sql               |   6 ++
 swh/scrubber/sql/upgrades/4.sql               |  21 +++++
 swh/scrubber/storage_checker.py               |  53 ++++++++++-
 swh/scrubber/tests/test_db.py                 |  58 ++++++++++++
 swh/scrubber/tests/test_storage_postgresql.py | 131 ++++++++++++++++++++++++--
 8 files changed, 358 insertions(+), 11 deletions(-)
 create mode 100644 swh/scrubber/sql/upgrades/4.sql
 create mode 100644 swh/scrubber/tests/test_db.py
Changes applied before test
commit d2daac04a622ea5020cfa0602cbea06350cd8b08
Merge: 630001c c48f096
Author: Jenkins user <jenkins@localhost>
Date:   Tue Oct 4 12:25:20 2022 +0000

    Merge branch 'diff-target' into HEAD

commit c48f096a433c156ac647273728437eb3afe9d278
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Tue Oct 4 14:18:50 2022 +0200

    storage_checker: Notify database when ranges are fully checked
    
    For now, this does not use this information to deduplicate work.

commit d64e557c18213034772c1784eeedc2e89a1c4a2f
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Tue Oct 4 14:17:12 2022 +0200

    db: Add table checked_ranges
    
    It will be used by the storage_checker to 'remember' what ranges
    it already checked recently across runs (and crashes), and to
    monitor progress.

See https://jenkins.softwareheritage.org/job/DSCRUB/job/tests-on-diff/65/ for more details.

ardumont added inline comments.
swh/scrubber/storage_checker.py
52

[a, b[ writing to exclude b from the interval?

swh/scrubber/storage_checker.py
52

[a, b[ is the French notation, [a, b) is the English one.

Build is green

Patch application report for D8609 (id=31198)

Could not rebase; Attempt merge onto 630001ce7b...

Merge made by the 'recursive' strategy.
 docs/README.rst                               |   4 +
 swh/scrubber/db.py                            |  79 +++++++++++++++-
 swh/scrubber/sql/30-schema.sql                |  17 ++++
 swh/scrubber/sql/60-indexes.sql               |   6 ++
 swh/scrubber/sql/upgrades/4.sql               |  21 +++++
 swh/scrubber/storage_checker.py               |  53 ++++++++++-
 swh/scrubber/tests/test_db.py                 |  58 ++++++++++++
 swh/scrubber/tests/test_storage_postgresql.py | 131 ++++++++++++++++++++++++--
 8 files changed, 358 insertions(+), 11 deletions(-)
 create mode 100644 swh/scrubber/sql/upgrades/4.sql
 create mode 100644 swh/scrubber/tests/test_db.py
Changes applied before test
commit d5e8cd26b0da48cdb8c764584a2c37785ed73c54
Merge: 630001c 85c3514
Author: Jenkins user <jenkins@localhost>
Date:   Fri Oct 7 09:14:43 2022 +0000

    Merge branch 'diff-target' into HEAD

commit 85c35148face203df9f5653efda33ad6c729ca90
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Tue Oct 4 14:18:50 2022 +0200

    storage_checker: Notify database when ranges are fully checked
    
    For now, this does not use this information to deduplicate work.

commit 1f3a00dbcbdaadda7b1625cd186a55febf21c3f2
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Tue Oct 4 14:17:12 2022 +0200

    db: Add table checked_ranges
    
    It will be used by the storage_checker to 'remember' what ranges
    it already checked recently across runs (and crashes), and to
    monitor progress.

See https://jenkins.softwareheritage.org/job/DSCRUB/job/tests-on-diff/67/ for more details.

Build is green

Patch application report for D8609 (id=31209)

Could not rebase; Attempt merge onto 630001ce7b...

Updating 630001c..84fa17c
Fast-forward
 docs/README.rst                               |   4 +
 swh/scrubber/db.py                            |  79 +++++++++++++++-
 swh/scrubber/sql/30-schema.sql                |  17 ++++
 swh/scrubber/sql/60-indexes.sql               |   6 ++
 swh/scrubber/sql/upgrades/4.sql               |  21 +++++
 swh/scrubber/storage_checker.py               |  53 ++++++++++-
 swh/scrubber/tests/test_db.py                 |  58 ++++++++++++
 swh/scrubber/tests/test_storage_postgresql.py | 131 ++++++++++++++++++++++++--
 8 files changed, 358 insertions(+), 11 deletions(-)
 create mode 100644 swh/scrubber/sql/upgrades/4.sql
 create mode 100644 swh/scrubber/tests/test_db.py
Changes applied before test
commit 84fa17c00be8746aa2c08d15eeae68913da40842
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Tue Oct 4 14:18:50 2022 +0200

    storage_checker: Notify database when ranges are fully checked
    
    For now, this does not use this information to deduplicate work.

commit fef8a513dfead265788a24e5f580fbe9c746bd32
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Tue Oct 4 14:17:12 2022 +0200

    db: Add table checked_ranges
    
    It will be used by the storage_checker to 'remember' what ranges
    it already checked recently across runs (and crashes), and to
    monitor progress.

See https://jenkins.softwareheritage.org/job/DSCRUB/job/tests-on-diff/70/ for more details.

douardda added a subscriber: douardda.

lgtm (but with a couple of comments)

swh/scrubber/storage_checker.py
138

wouldn't swhids.ObjectType[self.object_type.upper()] work here?

swh/scrubber/tests/test_storage_postgresql.py
134

I don't really see the purpose of these test datasets. Not really a problem, but looks strange you felt like testing half the cases for the first digit (and the first digit only)

This revision is now accepted and ready to land.Oct 13 2022, 11:49 AM
swh/scrubber/storage_checker.py
138

maybe, I get confused with all the enum syntaxes

swh/scrubber/tests/test_storage_postgresql.py
134

isn't meant to test range generation (because it's imported from swh.storage.backfill).

initially I made the test compute these values dynamically, but the code was messy so I opted to write them out like this