Page MenuHomeSoftware Heritage

Add Fixer class, which re-loads corrupt objects from origins
ClosedPublic

Authored by vlorentz on Mar 25 2022, 2:39 PM.

Diff Detail

Event Timeline

Build has FAILED

Patch application report for D7433 (id=26900)

Could not rebase; Attempt merge onto 4144ac789f...

Updating 4144ac7..b75e0c3
Fast-forward
 swh/scrubber/cli.py                       |  63 +++++-
 swh/scrubber/db.py                        | 214 ++++++++++++++++++-
 swh/scrubber/fixer.py                     | 202 ++++++++++++++++++
 swh/scrubber/journal_checker.py           |   4 +-
 swh/scrubber/origin_locator.py            |  91 ++++++++
 swh/scrubber/sql/30-schema.sql            |  28 ++-
 swh/scrubber/sql/60-indexes.sql           |  18 +-
 swh/scrubber/storage_checker.py           |   4 +-
 swh/scrubber/tests/test_cli.py            |  44 ++++
 swh/scrubber/tests/test_fixer.py          | 331 ++++++++++++++++++++++++++++++
 swh/scrubber/tests/test_origin_locator.py | 170 +++++++++++++++
 swh/scrubber/utils.py                     |  46 +++++
 12 files changed, 1197 insertions(+), 18 deletions(-)
 create mode 100644 swh/scrubber/fixer.py
 create mode 100644 swh/scrubber/origin_locator.py
 create mode 100644 swh/scrubber/tests/test_fixer.py
 create mode 100644 swh/scrubber/tests/test_origin_locator.py
 create mode 100644 swh/scrubber/utils.py
Changes applied before test
commit b75e0c374efc44817a10f44c07f88da7c109204b
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Fri Mar 25 14:39:45 2022 +0100

    Add Fixer class, which re-loads corrupt objects from origins

commit d70dc6b0958ae0ca0b8dc9312eddbbd52191b937
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Wed Mar 23 13:49:19 2022 +0100

    Add OriginLocator, to list candidate origins to recover objects from

Link to build: https://jenkins.softwareheritage.org/job/DSCRUB/job/tests-on-diff/21/
See console output for more information: https://jenkins.softwareheritage.org/job/DSCRUB/job/tests-on-diff/21/console

Harbormaster returned this revision to the author for changes because remote builds failed.Mar 25 2022, 2:40 PM
Harbormaster failed remote builds in B27814: Diff 26900!

Build has FAILED

Patch application report for D7433 (id=26902)

Could not rebase; Attempt merge onto 4144ac789f...

Updating 4144ac7..f1a887b
Fast-forward
 mypy.ini                                  |   3 +
 requirements-test.txt                     |   1 +
 swh/scrubber/cli.py                       |  63 +++++-
 swh/scrubber/db.py                        | 214 ++++++++++++++++++-
 swh/scrubber/fixer.py                     | 202 ++++++++++++++++++
 swh/scrubber/journal_checker.py           |   4 +-
 swh/scrubber/origin_locator.py            |  91 ++++++++
 swh/scrubber/sql/30-schema.sql            |  28 ++-
 swh/scrubber/sql/60-indexes.sql           |  18 +-
 swh/scrubber/storage_checker.py           |   4 +-
 swh/scrubber/tests/test_cli.py            |  44 ++++
 swh/scrubber/tests/test_fixer.py          | 331 ++++++++++++++++++++++++++++++
 swh/scrubber/tests/test_origin_locator.py | 170 +++++++++++++++
 swh/scrubber/utils.py                     |  46 +++++
 14 files changed, 1201 insertions(+), 18 deletions(-)
 create mode 100644 swh/scrubber/fixer.py
 create mode 100644 swh/scrubber/origin_locator.py
 create mode 100644 swh/scrubber/tests/test_fixer.py
 create mode 100644 swh/scrubber/tests/test_origin_locator.py
 create mode 100644 swh/scrubber/utils.py
Changes applied before test
commit f1a887bd84f9f91e69263a8f5db5392e291e3810
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Fri Mar 25 14:39:45 2022 +0100

    Add Fixer class, which re-loads corrupt objects from origins

commit ce497b9800818cf5929301a4a6f174a019120038
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Wed Mar 23 13:49:19 2022 +0100

    Add OriginLocator, to list candidate origins to recover objects from

Link to build: https://jenkins.softwareheritage.org/job/DSCRUB/job/tests-on-diff/23/
See console output for more information: https://jenkins.softwareheritage.org/job/DSCRUB/job/tests-on-diff/23/console

Harbormaster returned this revision to the author for changes because remote builds failed.Mar 25 2022, 2:43 PM
Harbormaster failed remote builds in B27816: Diff 26902!

Build has FAILED

Patch application report for D7433 (id=26907)

Rebasing onto 4144ac789f...

Current branch diff-target is up to date.
Changes applied before test
commit ce497b9800818cf5929301a4a6f174a019120038
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Wed Mar 23 13:49:19 2022 +0100

    Add OriginLocator, to list candidate origins to recover objects from

Link to build: https://jenkins.softwareheritage.org/job/DSCRUB/job/tests-on-diff/25/
See console output for more information: https://jenkins.softwareheritage.org/job/DSCRUB/job/tests-on-diff/25/console

Harbormaster returned this revision to the author for changes because remote builds failed.Mar 25 2022, 3:28 PM
Harbormaster failed remote builds in B27820: Diff 26907!

Build fails because the latest swh-graph release can't be uploaded on pypi

Build has FAILED

Patch application report for D7433 (id=26911)

Could not rebase; Attempt merge onto 4144ac789f...

Updating 4144ac7..ea8d1c4
Fast-forward
 mypy.ini                                  |   7 +
 requirements-swh.txt                      |   1 +
 requirements-test.txt                     |   1 +
 swh/scrubber/cli.py                       |  63 +++++-
 swh/scrubber/db.py                        | 214 ++++++++++++++++++-
 swh/scrubber/fixer.py                     | 202 ++++++++++++++++++
 swh/scrubber/journal_checker.py           |   4 +-
 swh/scrubber/origin_locator.py            |  91 ++++++++
 swh/scrubber/sql/30-schema.sql            |  28 ++-
 swh/scrubber/sql/60-indexes.sql           |  18 +-
 swh/scrubber/storage_checker.py           |   4 +-
 swh/scrubber/tests/test_cli.py            |  44 ++++
 swh/scrubber/tests/test_fixer.py          | 331 ++++++++++++++++++++++++++++++
 swh/scrubber/tests/test_origin_locator.py | 170 +++++++++++++++
 swh/scrubber/utils.py                     |  46 +++++
 15 files changed, 1206 insertions(+), 18 deletions(-)
 create mode 100644 swh/scrubber/fixer.py
 create mode 100644 swh/scrubber/origin_locator.py
 create mode 100644 swh/scrubber/tests/test_fixer.py
 create mode 100644 swh/scrubber/tests/test_origin_locator.py
 create mode 100644 swh/scrubber/utils.py
Changes applied before test
commit ea8d1c42a4193e5ba36c22d74650321531cedcd3
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Fri Mar 25 14:39:45 2022 +0100

    Add Fixer class, which re-loads corrupt objects from origins

commit ce497b9800818cf5929301a4a6f174a019120038
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Wed Mar 23 13:49:19 2022 +0100

    Add OriginLocator, to list candidate origins to recover objects from

Link to build: https://jenkins.softwareheritage.org/job/DSCRUB/job/tests-on-diff/26/
See console output for more information: https://jenkins.softwareheritage.org/job/DSCRUB/job/tests-on-diff/26/console

Build is green

Patch application report for D7433 (id=26911)

Could not rebase; Attempt merge onto 4144ac789f...

Updating 4144ac7..ea8d1c4
Fast-forward
 mypy.ini                                  |   7 +
 requirements-swh.txt                      |   1 +
 requirements-test.txt                     |   1 +
 swh/scrubber/cli.py                       |  63 +++++-
 swh/scrubber/db.py                        | 214 ++++++++++++++++++-
 swh/scrubber/fixer.py                     | 202 ++++++++++++++++++
 swh/scrubber/journal_checker.py           |   4 +-
 swh/scrubber/origin_locator.py            |  91 ++++++++
 swh/scrubber/sql/30-schema.sql            |  28 ++-
 swh/scrubber/sql/60-indexes.sql           |  18 +-
 swh/scrubber/storage_checker.py           |   4 +-
 swh/scrubber/tests/test_cli.py            |  44 ++++
 swh/scrubber/tests/test_fixer.py          | 331 ++++++++++++++++++++++++++++++
 swh/scrubber/tests/test_origin_locator.py | 170 +++++++++++++++
 swh/scrubber/utils.py                     |  46 +++++
 15 files changed, 1206 insertions(+), 18 deletions(-)
 create mode 100644 swh/scrubber/fixer.py
 create mode 100644 swh/scrubber/origin_locator.py
 create mode 100644 swh/scrubber/tests/test_fixer.py
 create mode 100644 swh/scrubber/tests/test_origin_locator.py
 create mode 100644 swh/scrubber/utils.py
Changes applied before test
commit ea8d1c42a4193e5ba36c22d74650321531cedcd3
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Fri Mar 25 14:39:45 2022 +0100

    Add Fixer class, which re-loads corrupt objects from origins

commit ce497b9800818cf5929301a4a6f174a019120038
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Wed Mar 23 13:49:19 2022 +0100

    Add OriginLocator, to list candidate origins to recover objects from

See https://jenkins.softwareheritage.org/job/DSCRUB/job/tests-on-diff/28/ for more details.

Build is green

Patch application report for D7433 (id=27102)

Rebasing onto 21ee7e2f67...

Current branch diff-target is up to date.
Changes applied before test
commit 78749949f762196b35ad8bc4de6e6eca7afd1717
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Fri Mar 25 14:39:45 2022 +0100

    Add Fixer class, which re-loads corrupt objects from origins

See https://jenkins.softwareheritage.org/job/DSCRUB/job/tests-on-diff/33/ for more details.

I think I'd prefer not to have 3 versions of almost the same function (corrupt_object_(get|grab_xxx) ).

Also one thing that bothers me a bit id the fact the whole fixer thing is purely git; IMHO it should be either made more agnostic in its architecture, with only support for git for now, or explicitly advertized as git specific somehow (put 'git' in the name of the class, etc.)

The think is the Fixer class is not limited to git revisions by principle, but it is by the actual implementation.

I get this can be improved later on, but I would have loved an explicit explanation somewhere.
Also, the commit message could give some information on what's going on in this revision. "Add Fixer class, which re-loads corrupt objects from origins" seems quite an understatement (or simplification) of what this diff does.

swh/scrubber/cli.py
159–160

<nitpick> do we agree this is unrelated?</nitpick>

swh/scrubber/fixer.py
138

not sure I buy the "dataclass everywhere" approach. Why using a dataclass here?

swh/scrubber/utils.py
25

does origin_url set means looks for corrupted objects related to this origin only? (aka filter fetched corrupted objects on their relation a given origin url)?

would be nice to have a small sentence telling this (or the correct version of "this")

fix module docstring + explain better in the commit msg

I think I'd prefer not to have 3 versions of almost the same function (corrupt_object_(get|grab_xxx) ).

The alternative is convoluted SQL query building. I tried that, it's unreadable.

Also one thing that bothers me a bit id the fact the whole fixer thing is purely git; IMHO it should be either made more agnostic in its architecture, with only support for git for now, or explicitly advertized as git specific somehow (put 'git' in the name of the class, etc.)

The think is the Fixer class is not limited to git revisions by principle, but it is by the actual implementation.

Hmm, true. I don't know how I am going to handle switching between VCSs.

I get this can be improved later on, but I would have loved an explicit explanation somewhere.

Done.

Also, the commit message could give some information on what's going on in this revision. "Add Fixer class, which re-loads corrupt objects from origins" seems quite an understatement (or simplification) of what this diff does.

Done.

swh/scrubber/cli.py
159–160

yes

swh/scrubber/fixer.py
138

because it saves me 5 lines of __init__

Build is green

Patch application report for D7433 (id=27248)

Rebasing onto 21ee7e2f67...

Current branch diff-target is up to date.
Changes applied before test
commit 2653c567750404737fe13dfdc5753e95bccc8c47
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Fri Mar 25 14:39:45 2022 +0100

    Add Fixer class, which re-loads corrupt objects from origins
    
    This is heavily inspired by https://archive.softwareheritage.org/swh:1:cnt:c6dda7698c6aecf71f744e4e4c01bc3e115db880;origin=https://forge.softwareheritage.org/source/snippets.git;visit=swh:1:snp:c2e3170b9927f3d356e120546b00f5a9d25a224c;anchor=swh:1:rev:fa9e387ba5e3470b93a5c5a8773ed598ef03d211;path=/vlorentz/analyze_consistency_failures.py
    but reorganized to the database instead of ad-hoc text and pickle files.
    
    Currently, this only implements recovering from Git origins.

See https://jenkins.softwareheritage.org/job/DSCRUB/job/tests-on-diff/34/ for more details.

As I said, I'm not completely enthusiastic by this code (especially the lack of flexibility and versatility to provision for other origin sources than git) but it's a nice start at least.
Thanks

swh/scrubber/fixer.py
138

but it scrambles the intent of the code IMHO (no bid deal, I just find it confusing).

This revision is now accepted and ready to land.Apr 6 2022, 3:02 PM
This revision was landed with ongoing or failed builds.Apr 7 2022, 12:18 PM
This revision was automatically updated to reflect the committed changes.

Build is green

Patch application report for D7433 (id=27287)

Rebasing onto 88e953ab88...

First, rewinding head to replay your work on top of it...
Fast-forwarded diff-target to base-revision-35-D7433.
Changes applied before test

See https://jenkins.softwareheritage.org/job/DSCRUB/job/tests-on-diff/35/ for more details.