Software Heritage

Add support for redis-based reporting of invalid mirrored objects
ClosedPublic

Authored by douardda on Oct 27 2021, 6:24 PM.

Details

Summary

The idea is to check the BaseModel validity at journal deserialization
time, so that we still have access to the raw object from kafka for
complete reporting (object id plus raw message from kafka).

This uses a new ModelObjectDeserializer class that is responsible for
deserializing the kafka message (still using kafka_to_value), then
immediately creating the BaseModel object from that dict. Its convert
method is then passed as the value_deserializer argument of the
JournalClient.

Then, for each object deserialized from kafka, if it is a HashableObject,
its validity is checked by comparing the computed hash with its id.

If it is invalid, the error is reported in the logs and, if configured,
the invalid object is registered via the reporter callback.

In the cli code, Redis.set() is used as such a callback (if configured):
it stores invalid objects using the object id (typically its swhid) as
key, and the raw kafka message value as value.
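As a rough illustration of the mechanism described above, here is a minimal sketch. All names besides ModelObjectDeserializer, convert and kafka_to_value mirror the summary loosely; compute_hash is a simplified stand-in for HashableObject.compute_hash(), and the real implementation (in swh.storage.replay) builds actual swh.model objects:

```python
import hashlib
import json
import logging
from typing import Any, Callable, Dict, Optional

logger = logging.getLogger(__name__)


def compute_hash(d: Dict[str, Any]) -> str:
    # Stand-in for HashableObject.compute_hash(): hash every field but "id".
    payload = json.dumps({k: v for k, v in d.items() if k != "id"}, sort_keys=True)
    return hashlib.sha1(payload.encode()).hexdigest()


class ModelObjectDeserializer:
    """Deserialize a kafka message, then validate the resulting object."""

    def __init__(self, reporter: Optional[Callable[[str, bytes], None]] = None):
        self.reporter = reporter

    def convert(self, object_type: str, msg: bytes) -> Optional[Dict[str, Any]]:
        # The real code uses kafka_to_value(msg); json stands in here.
        d = json.loads(msg)
        if "id" in d and d["id"] != compute_hash(d):
            logger.error("Invalid %s object: %s", object_type, d["id"])
            if self.reporter:
                # key: the object id (typically its swhid);
                # value: the raw kafka message
                self.reporter(d["id"], msg)
            return None  # invalid objects are dropped in this sketch
        return d
```

With a redis client, the reporter can then simply be `redis_client.set`, which stores the raw message under the object id, matching the cli behavior described above.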

Related to T3693.

Also refactor fixer.fix_objects() to extract the inner object_fixers dict,
allowing this dict to be used independently of the fix_objects() function.
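The shape of that refactoring could be sketched roughly as follows (illustrative names and fixer body only; the real fixers live in swh.storage.fixer):

```python
from typing import Any, Callable, Dict, Iterable, List

ObjDict = Dict[str, Any]


def _fix_revision(d: ObjDict) -> ObjDict:
    # Example fixer: drop a legacy field (purely illustrative).
    d = dict(d)
    d.pop("metadata", None)
    return d


# Previously defined inside fix_objects(); now module-level so it can be
# used independently (e.g. by the replayer).
object_fixers: Dict[str, Callable[[ObjDict], ObjDict]] = {
    "revision": _fix_revision,
}


def fix_objects(object_type: str, objects: Iterable[ObjDict]) -> List[ObjDict]:
    fixer = object_fixers.get(object_type)
    if fixer is None:
        return list(objects)
    return [fixer(o) for o in objects]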

Depends on D6570.

Diff Detail

Event Timeline

Build has FAILED

Patch application report for D6571 (id=23873)

Could not rebase; Attempt merge onto 49a932c989...

Updating 49a932c9..3b4fcb98
Fast-forward
 swh/storage/cli.py                 |  24 ++++-
 swh/storage/fixer.py               |  94 +++---------------
 swh/storage/replay.py              | 194 ++++++++++++++++++++-----------------
 swh/storage/tests/test_backfill.py |   7 +-
 swh/storage/tests/test_replay.py   |  12 +--
 5 files changed, 154 insertions(+), 177 deletions(-)
Changes applied before test
commit 3b4fcb980514425f279aeddb2fbf253c5ce119e2
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Oct 27 17:31:18 2021 +0200

    Add support for redis-based reporting of invalid mirrored objects

    The idea is that we check the BaseModel validity at journal
    deserialization time, so that we still have access to the raw object from
    kafka for complete reporting (object id plus raw message from kafka).

    This uses a new ModelObjectDeserializer class that is responsible for
    deserializing the kafka message (still using kafka_to_value), then
    immediately creating the BaseModel object from that dict. Its convert
    method is then passed as the value_deserializer argument of the
    JournalClient.

    Then, for each object deserialized from kafka, if it is a HashableObject,
    its validity is checked by comparing the computed hash with its id.

    If it is invalid, the error is reported in the logs and, if configured,
    the invalid object is registered via the reporter callback.

    In the cli code, Redis.set() is used as such a callback (if configured):
    it stores invalid objects using the object id (typically its swhid) as
    key, and the raw kafka message value as value.
    
    Related to T3693.

commit c25e595cfa85fa1249996a43aab6a59ccf2ee8c2
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Oct 27 16:56:53 2021 +0200

    Refactor fixer.fix_objects() to extract the inner object_fixers dict
    
    allowing this dict to be used independently of the fix_objects() function.

commit b44a1eea08f081cef2d5ddb3663403b1f8cd4ed8
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Oct 27 16:55:03 2021 +0200

    Remove now useless revision date checker in fixer

commit 74dbb2b0b148750de272f2f1a523c72b6b2e1365
Author: David Douard <david.douard@sdfa3.org>
Date:   Tue Oct 26 16:38:26 2021 +0200

    Add a --type option to 'swh storage replay'
    
    allows choosing replayed object types from the cli.

Link to build: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1465/
See console output for more information: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1465/console

Harbormaster returned this revision to the author for changes because remote builds failed. Oct 27 2021, 6:25 PM
Harbormaster failed remote builds in B24755: Diff 23873!

Fix issues and add tests

Note: this depends on D6565, so do not expect tests to pass right now; a dependency bump on swh.journal will be needed before landing.

douardda retitled this revision from [WIP] Add support for redis-based reporting of invalid mirrored objects to Add support for redis-based reporting of invalid mirrored objects. Oct 28 2021, 11:25 AM
douardda edited the summary of this revision. (Show Details)

Build has FAILED

Patch application report for D6571 (id=23884)

Could not rebase; Attempt merge onto 49a932c989...

Updating 49a932c9..c233a90b
Fast-forward
 requirements-test.txt              |   2 +
 swh/storage/cli.py                 |  24 ++++-
 swh/storage/fixer.py               |  95 ++++--------------
 swh/storage/replay.py              | 195 ++++++++++++++++++++-----------------
 swh/storage/tests/test_backfill.py |   7 +-
 swh/storage/tests/test_replay.py   | 187 +++++++++++++++++++++++++++++++++--
 6 files changed, 330 insertions(+), 180 deletions(-)
Changes applied before test
commit c233a90b0e3542d5256665ba88a7214b7b7de0e6
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Oct 27 17:31:18 2021 +0200

    Add support for redis-based reporting of invalid mirrored objects

    The idea is that we check the BaseModel validity at journal
    deserialization time, so that we still have access to the raw object from
    kafka for complete reporting (object id plus raw message from kafka).

    This uses a new ModelObjectDeserializer class that is responsible for
    deserializing the kafka message (still using kafka_to_value), then
    immediately creating the BaseModel object from that dict. Its `convert`
    method is then passed as the `value_deserializer` argument of the
    `JournalClient`.

    Then, for each object deserialized from kafka, if it is a HashableObject,
    its validity is checked by comparing the computed hash with its id.

    If it is invalid, the error is reported in the logs and, if configured,
    the invalid object is registered via the `reporter` callback.

    In the cli code, `Redis.set()` is used as such a callback (if configured):
    it stores invalid objects using the object id (typically its swhid) as
    key, and the raw kafka message value as value.
    
    Related to T3693.

commit fcd73734f0fdff74c9b4d066d8fc207b49cebedf
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Oct 27 16:56:53 2021 +0200

    Refactor fixer.fix_objects() to extract the inner object_fixers dict
    
    allowing this dict to be used independently of the fix_objects() function.

commit 693acf2760117d1a5f04a1143721662c903d1276
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Oct 27 16:55:03 2021 +0200

    Remove now useless revision date checker in fixer

commit 04ba5910d53c8e9eb55a4bc2396840f9a47a5ac5
Author: David Douard <david.douard@sdfa3.org>
Date:   Tue Oct 26 16:38:26 2021 +0200

    Add a --type option to 'swh storage replay'
    
    allows choosing replayed object types from the cli.

Link to build: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1469/
See console output for more information: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1469/console

Harbormaster returned this revision to the author for changes because remote builds failed. Oct 28 2021, 11:25 AM
Harbormaster failed remote builds in B24764: Diff 23884!

Asking for review even though tests are expected to fail, since this depends on D6565 (in swh-journal).

Bump the dependency on swh-journal to 0.9

Build has FAILED

Patch application report for D6571 (id=23936)

Could not rebase; Attempt merge onto a5bfe5b514...

Merge made by the 'recursive' strategy.
 requirements-swh-journal.txt       |   2 +-
 requirements-test.txt              |   2 +
 requirements.txt                   |  11 ++-
 swh/storage/cli.py                 |  24 ++++-
 swh/storage/fixer.py               |  95 ++++--------------
 swh/storage/replay.py              | 195 ++++++++++++++++++++-----------------
 swh/storage/tests/test_backfill.py |   7 +-
 swh/storage/tests/test_replay.py   | 187 +++++++++++++++++++++++++++++++++--
 8 files changed, 337 insertions(+), 186 deletions(-)
Changes applied before test
commit c4cd575dc1d56fafe37fe902cd708a099d1d9cbd
Merge: a5bfe5b5 06b1629d
Author: Jenkins user <jenkins@localhost>
Date:   Thu Oct 28 16:41:00 2021 +0000

    Merge branch 'diff-target' into HEAD

commit 06b1629d1f75820f6451f1ec82c23b455edbab3e
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Oct 27 17:31:18 2021 +0200

    Add support for redis-based reporting of invalid mirrored objects

    The idea is that we check the BaseModel validity at journal
    deserialization time, so that we still have access to the raw object from
    kafka for complete reporting (object id plus raw message from kafka).

    This uses a new ModelObjectDeserializer class that is responsible for
    deserializing the kafka message (still using kafka_to_value), then
    immediately creating the BaseModel object from that dict. Its `convert`
    method is then passed as the `value_deserializer` argument of the
    `JournalClient`.

    Then, for each object deserialized from kafka, if it is a HashableObject,
    its validity is checked by comparing the computed hash with its id.

    If it is invalid, the error is reported in the logs and, if configured,
    the invalid object is registered via the `reporter` callback.

    In the cli code, `Redis.set()` is used as such a callback (if configured):
    it stores invalid objects using the object id (typically its swhid) as
    key, and the raw kafka message value as value.
    
    Related to T3693.

commit bc4556bd1120503c86d9e8b3e5315e02c407a672
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Oct 27 16:56:53 2021 +0200

    Refactor fixer.fix_objects() to extract the inner object_fixers dict
    
    allowing this dict to be used independently of the fix_objects() function.

commit 8283f462321234941a3e16b4a6c308dcaf247694
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Oct 27 16:55:03 2021 +0200

    Remove now useless revision date checker in fixer

commit 920c71c8047946ca55f4ff9d1bdf32b83b279d24
Author: David Douard <david.douard@sdfa3.org>
Date:   Tue Oct 26 16:38:26 2021 +0200

    Add a --type option to 'swh storage replay'
    
    allows choosing replayed object types from the cli.

Link to build: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1471/
See console output for more information: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1471/console

Build has FAILED

Patch application report for D6571 (id=23940)

Could not rebase; Attempt merge onto a5bfe5b514...

Merge made by the 'recursive' strategy.
 requirements-swh-journal.txt       |   2 +-
 requirements-test.txt              |   2 +
 requirements.txt                   |  11 ++-
 swh/storage/cli.py                 |  24 ++++-
 swh/storage/fixer.py               |  95 ++++--------------
 swh/storage/replay.py              | 195 ++++++++++++++++++++-----------------
 swh/storage/tests/test_backfill.py |   7 +-
 swh/storage/tests/test_replay.py   | 187 +++++++++++++++++++++++++++++++++--
 8 files changed, 337 insertions(+), 186 deletions(-)
Changes applied before test
commit f2d2697735ee32701348d6b63f4bbdc69cdcaca2
Merge: a5bfe5b5 fb8245e5
Author: Jenkins user <jenkins@localhost>
Date:   Fri Oct 29 08:26:34 2021 +0000

    Merge branch 'diff-target' into HEAD

commit fb8245e5b457152ee2e45b01f3d0e9483b6b4775
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Oct 27 17:31:18 2021 +0200

    Add support for redis-based reporting of invalid mirrored objects

    The idea is that we check the BaseModel validity at journal
    deserialization time, so that we still have access to the raw object from
    kafka for complete reporting (object id plus raw message from kafka).

    This uses a new ModelObjectDeserializer class that is responsible for
    deserializing the kafka message (still using kafka_to_value), then
    immediately creating the BaseModel object from that dict. Its `convert`
    method is then passed as the `value_deserializer` argument of the
    `JournalClient`.

    Then, for each object deserialized from kafka, if it is a HashableObject,
    its validity is checked by comparing the computed hash with its id.

    If it is invalid, the error is reported in the logs and, if configured,
    the invalid object is registered via the `reporter` callback.

    In the cli code, `Redis.set()` is used as such a callback (if configured):
    it stores invalid objects using the object id (typically its swhid) as
    key, and the raw kafka message value as value.
    
    Related to T3693.

commit bc4556bd1120503c86d9e8b3e5315e02c407a672
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Oct 27 16:56:53 2021 +0200

    Refactor fixer.fix_objects() to extract the inner object_fixers dict
    
    allowing this dict to be used independently of the fix_objects() function.

commit 8283f462321234941a3e16b4a6c308dcaf247694
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Oct 27 16:55:03 2021 +0200

    Remove now useless revision date checker in fixer

commit 920c71c8047946ca55f4ff9d1bdf32b83b279d24
Author: David Douard <david.douard@sdfa3.org>
Date:   Tue Oct 26 16:38:26 2021 +0200

    Add a --type option to 'swh storage replay'
    
    allows choosing replayed object types from the cli.

Link to build: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1475/
See console output for more information: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1475/console

rebase and a few fixes

  • add a config sanity check in cli.replay: anonymized replay cannot be validated
  • fix test_storage_replay_anonymized accordingly
  • drop usage of storage.fixer; we only need to drop Revision.metadata now
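The config sanity check from the first bullet could look roughly like the sketch below. This is hypothetical: the actual option names in swh.storage.cli (the "privileged" journal flag and the object-validation switch) are assumptions here; the point is only that hash validation must be refused on an anonymized journal, since anonymization rewrites fields and the recomputed hash can no longer match the stored id:

```python
def check_replay_config(config: dict) -> None:
    """Refuse to validate objects when replaying an anonymized journal.

    Key names ("privileged", "check_objects") are illustrative assumptions.
    """
    journal_cfg = config.get("journal_client", {})
    if journal_cfg.get("privileged") is False and config.get("check_objects"):
        raise ValueError(
            "Cannot validate objects when replaying an anonymized journal"
        )
```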

Build has FAILED

Patch application report for D6571 (id=23942)

Could not rebase; Attempt merge onto a5bfe5b514...

Merge made by the 'recursive' strategy.
 requirements-swh-journal.txt       |   2 +-
 requirements-test.txt              |   2 +
 requirements.txt                   |  11 +-
 swh/storage/cli.py                 |  32 +++-
 swh/storage/fixer.py               | 290 ++-----------------------------------
 swh/storage/replay.py              | 206 ++++++++++++++------------
 swh/storage/tests/test_backfill.py |   7 +-
 swh/storage/tests/test_replay.py   | 193 ++++++++++++++++++++++--
 8 files changed, 352 insertions(+), 391 deletions(-)
Changes applied before test
commit 96bd83743fca6e1e49d9e38b9c5fc1870b1aeedc
Merge: a5bfe5b5 36665ece
Author: Jenkins user <jenkins@localhost>
Date:   Fri Oct 29 10:31:04 2021 +0000

    Merge branch 'diff-target' into HEAD

commit 36665ecec8eebff8dbc62dcb46706b93f16de07f
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Oct 27 17:31:18 2021 +0200

    Add support for redis-based reporting of invalid mirrored objects

    The idea is that we check the BaseModel validity at journal
    deserialization time, so that we still have access to the raw object from
    kafka for complete reporting (object id plus raw message from kafka).

    This uses a new ModelObjectDeserializer class that is responsible for
    deserializing the kafka message (still using kafka_to_value), then
    immediately creating the BaseModel object from that dict. Its `convert`
    method is then passed as the `value_deserializer` argument of the
    `JournalClient`.

    Then, for each object deserialized from kafka, if it is a HashableObject,
    its validity is checked by comparing the computed hash with its id.

    If it is invalid, the error is reported in the logs and, if configured,
    the invalid object is registered via the `reporter` callback.

    In the cli code, `Redis.set()` is used as such a callback (if configured):
    it stores invalid objects using the object id (typically its swhid) as
    key, and the raw kafka message value as value.
    
    Related to T3693.

commit 25002613a9755d39859bdd00fb573b2f61fbb831
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Oct 27 16:56:53 2021 +0200

    Refactor fixer.fix_objects() to extract the inner object_fixers dict
    
    allowing this dict to be used independently of the fix_objects() function.

commit f8aecc8691525eb9a677e663eb350975987bdf1a
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Oct 27 16:55:03 2021 +0200

    Remove now useless fixers
    
    keep the fix_objects() function for backward compatibility for now.

commit 920c71c8047946ca55f4ff9d1bdf32b83b279d24
Author: David Douard <david.douard@sdfa3.org>
Date:   Tue Oct 26 16:38:26 2021 +0200

    Add a --type option to 'swh storage replay'
    
    allows choosing replayed object types from the cli.

Link to build: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1476/
See console output for more information: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1476/console

vlorentz added inline comments.
swh/storage/replay.py
146–148

use swh.storage.utils.remove_keys

164–200

why not use functools.partial instead of returning a callable?

swh/storage/tests/test_replay.py
458

shorter, and consistent with the code below

swh/storage/replay.py
146–148

thx

164–200

would work too yes, why not

swh/storage/tests/test_replay.py
458

agreed, thx
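For illustration, the alternative raised in the `164–200` comment (functools.partial versus a factory returning a closure) could be sketched as below. The function names are simplified stand-ins, not the actual swh.storage.replay code:

```python
import functools
from typing import Callable, Iterable, List, Tuple


def insert_objects(storage: List[Tuple[str, object]], object_type: str,
                   objects: Iterable) -> None:
    # Stand-in for the replay logic writing objects to storage.
    storage.extend((object_type, o) for o in objects)


# Option 1: a factory returning a closure over the bound arguments.
def make_inserter(storage: list, object_type: str) -> Callable[[Iterable], None]:
    def inserter(objects: Iterable) -> None:
        insert_objects(storage, object_type, objects)
    return inserter


# Option 2: functools.partial, as suggested in the review: same behavior,
# no nested function definition.
def make_inserter_partial(storage: list, object_type: str):
    return functools.partial(insert_objects, storage, object_type)
```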

Build is green

Patch application report for D6571 (id=23943)

Could not rebase; Attempt merge onto a5bfe5b514...

Merge made by the 'recursive' strategy.
 requirements-swh-journal.txt       |   2 +-
 requirements-test.txt              |   2 +
 requirements.txt                   |  11 +-
 swh/storage/cli.py                 |  32 +++-
 swh/storage/fixer.py               | 290 ++-----------------------------------
 swh/storage/replay.py              | 206 ++++++++++++++------------
 swh/storage/tests/test_backfill.py |   7 +-
 swh/storage/tests/test_replay.py   | 193 ++++++++++++++++++++++--
 8 files changed, 352 insertions(+), 391 deletions(-)
Changes applied before test
commit 56f332ca8f60451e5634eeda63425f2beb359af4
Merge: a5bfe5b5 ae50f362
Author: Jenkins user <jenkins@localhost>
Date:   Fri Oct 29 11:28:35 2021 +0000

    Merge branch 'diff-target' into HEAD

commit ae50f362bf1bfb55a3cc44069744e2e6b81640e8
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Oct 27 17:31:18 2021 +0200

    Add support for redis-based reporting of invalid mirrored objects

    The idea is that we check the BaseModel validity at journal
    deserialization time, so that we still have access to the raw object from
    kafka for complete reporting (object id plus raw message from kafka).

    This uses a new ModelObjectDeserializer class that is responsible for
    deserializing the kafka message (still using kafka_to_value), then
    immediately creating the BaseModel object from that dict. Its `convert`
    method is then passed as the `value_deserializer` argument of the
    `JournalClient`.

    Then, for each object deserialized from kafka, if it is a HashableObject,
    its validity is checked by comparing the computed hash with its id.

    If it is invalid, the error is reported in the logs and, if configured,
    the invalid object is registered via the `reporter` callback.

    In the cli code, `Redis.set()` is used as such a callback (if configured):
    it stores invalid objects using the object id (typically its swhid) as
    key, and the raw kafka message value as value.
    
    Related to T3693.

commit 25002613a9755d39859bdd00fb573b2f61fbb831
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Oct 27 16:56:53 2021 +0200

    Refactor fixer.fix_objects() to extract the inner object_fixers dict
    
    allowing this dict to be used independently of the fix_objects() function.

commit f8aecc8691525eb9a677e663eb350975987bdf1a
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Oct 27 16:55:03 2021 +0200

    Remove now useless fixers
    
    keep the fix_objects() function for bw compat for now.

commit 920c71c8047946ca55f4ff9d1bdf32b83b279d24
Author: David Douard <david.douard@sdfa3.org>
Date:   Tue Oct 26 16:38:26 2021 +0200

    Add a --type option to 'swh storage replay'
    
    allows choosing the replayed object types from the cli.

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1478/ for more details.
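The ModelObjectDeserializer described in the commit message above can be sketched roughly as follows. This is an illustrative outline only: the real class lives in swh/storage/replay.py and relies on kafka_to_value and the swh.model HashableObject API, which are replaced here by stand-ins (a toy 20-byte-id message format and a sha1 recomputation) so the sketch is self-contained.

```python
import hashlib
from typing import Any, Callable, Dict, Optional


class ModelObjectDeserializer:
    """Sketch: deserialize a kafka message, validate it, report if invalid.

    `reporter` is a callback taking (object id as hex string, raw kafka
    message bytes); in the actual cli code Redis.set() plays this role.
    """

    def __init__(self, reporter: Optional[Callable[[str, bytes], None]] = None):
        self.reporter = reporter

    def convert(self, object_type: str, msg: bytes) -> Optional[Dict[str, Any]]:
        # Stand-in for kafka_to_value(): parse the raw kafka message.
        obj = self.deserialize(msg)
        # Stand-in for the HashableObject check: recompute the hash from
        # the object's content and compare it with the stored id.
        computed = hashlib.sha1(obj["data"]).digest()
        if computed != obj["id"]:
            if self.reporter is not None:
                # Report with the raw kafka message, still available here.
                self.reporter(obj["id"].hex(), msg)
            return None  # drop the invalid object from the replay stream
        return obj

    @staticmethod
    def deserialize(msg: bytes) -> Dict[str, Any]:
        # Toy wire format: a 20-byte id followed by the payload.
        return {"id": msg[:20], "data": msg[20:]}
```

Its `convert` method has the shape expected of a `value_deserializer`, so an instance can be handed to the journal client while keeping access to the raw message for reporting.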

apply vlorentz' comments/suggestions

Build is green

Patch application report for D6571 (id=23944)

Could not rebase; Attempt merge onto a5bfe5b514...

Merge made by the 'recursive' strategy.
 requirements-swh-journal.txt       |   2 +-
 requirements-test.txt              |   2 +
 requirements.txt                   |  11 +-
 swh/storage/cli.py                 |  32 +++-
 swh/storage/fixer.py               | 290 ++-----------------------------------
 swh/storage/replay.py              | 161 ++++++++++----------
 swh/storage/tests/test_backfill.py |   7 +-
 swh/storage/tests/test_replay.py   | 193 ++++++++++++++++++++++--
 8 files changed, 326 insertions(+), 372 deletions(-)
Changes applied before test
commit 381c2f34801ad9137ac859b1e13d98f6a27edd5f
Merge: a5bfe5b5 c89447ee
Author: Jenkins user <jenkins@localhost>
Date:   Fri Oct 29 11:45:08 2021 +0000

    Merge branch 'diff-target' into HEAD

commit c89447ee01681736b80d3029ce695dacdc854916
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Oct 27 17:31:18 2021 +0200

    Add support for redis-based reporting of invalid mirrored objects
    
    The idea is to check BaseModel validity at journal deserialization
    time, so that we still have access to the raw object from kafka for
    complete reporting (object id plus raw kafka message).
    
    This uses a new ModelObjectDeserializer class that is responsible for
    deserializing the kafka message (still using kafka_to_value) and then
    immediately creating the BaseModel object from that dict. Its `convert`
    method is then passed as the `value_deserializer` argument of the
    `JournalClient`.
    
    Then, for each object deserialized from kafka, if it is a HashableObject,
    check its validity by comparing the computed hash with its id.
    
    If it is invalid, report the error in the logs and, if configured,
    register the invalid object via the `reporter` callback.
    
    In the cli code, a `Redis.set()` is used as such a callback (if
    configured): it simply stores invalid objects using the object id as key
    (typically its SWHID) and the raw kafka message value as value.
    
    Related to T3693.

commit 25002613a9755d39859bdd00fb573b2f61fbb831
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Oct 27 16:56:53 2021 +0200

    Refactor fixer.fix_objects() to extract the inner object_fixers dict
    
    allowing this dict to be used independently of the fix_objects() function.

commit f8aecc8691525eb9a677e663eb350975987bdf1a
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Oct 27 16:55:03 2021 +0200

    Remove now useless fixers
    
    keep the fix_objects() function for bw compat for now.

commit 920c71c8047946ca55f4ff9d1bdf32b83b279d24
Author: David Douard <david.douard@sdfa3.org>
Date:   Tue Oct 26 16:38:26 2021 +0200

    Add a --type option to 'swh storage replay'
    
    allows choosing the replayed object types from the cli.

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1479/ for more details.

swh/storage/replay.py
49–147

can you rename object_converter_fn and object_fixers to be in caps? (it's a little confusing below when using them, they look like variables)

157

could you add a docstring? in particular, explain the args of reporter

186

TIL this comment syntax

188–190

Do we have any HashableObject class without a swhid method?

swh/storage/tests/test_replay.py
416

share this regexp with test_storage_replayer_with_validation_nok (maybe the whole for loop too). Otherwise a single typo in the regexp could make the test miss issues
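The sharing suggested here could look like the sketch below: a module-level constant that both test_storage_replayer_with_validation_ok and _nok reference, so a typo can only occur in one place. The pattern itself is a made-up placeholder, not the actual regexp from test_replay.py.

```python
import re

# Hypothetical shared pattern; the real one in test_replay.py differs.
INVALID_OBJ_RE = re.compile(r"Invalid (\w+) object: swh:1:\w+:[0-9a-f]{40}")


def match_invalid(log_line: str):
    """Return a match object if the log line reports an invalid object."""
    return INVALID_OBJ_RE.search(log_line)
```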

douardda added inline comments.
swh/storage/replay.py
188–190

ExtID

douardda marked an inline comment as done.

apply vlorentz' comments

  • add a docstring in ModelObjectDeserializer
  • make global vars uppercase
  • use a common regex var in test_replay

Build has FAILED

Patch application report for D6571 (id=24038)

Could not rebase; Attempt merge onto a5bfe5b514...

Merge made by the 'recursive' strategy.
 requirements-swh-journal.txt       |   2 +-
 requirements-test.txt              |   2 +
 requirements.txt                   |  11 +-
 swh/storage/backfill.py            |   4 +-
 swh/storage/cli.py                 |  51 ++++++-
 swh/storage/fixer.py               | 290 ++-----------------------------------
 swh/storage/replay.py              | 195 +++++++++++++++----------
 swh/storage/tests/test_backfill.py |   7 +-
 swh/storage/tests/test_cli.py      |  14 ++
 swh/storage/tests/test_replay.py   | 193 ++++++++++++++++++++++--
 10 files changed, 394 insertions(+), 375 deletions(-)
Changes applied before test
commit bd38fdc207d6c90f43098824042a6474cc49aafc
Merge: a5bfe5b5 2fd1de4c
Author: Jenkins user <jenkins@localhost>
Date:   Mon Nov 8 14:29:33 2021 +0000

    Merge branch 'diff-target' into HEAD

commit 2fd1de4cc96a9b42350e562a1fea118714d581a8
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Oct 27 17:31:18 2021 +0200

    Add support for redis-based reporting of invalid mirrored objects
    
    The idea is to check BaseModel validity at journal deserialization
    time, so that we still have access to the raw object from kafka for
    complete reporting (object id plus raw kafka message).
    
    This uses a new ModelObjectDeserializer class that is responsible for
    deserializing the kafka message (still using kafka_to_value) and then
    immediately creating the BaseModel object from that dict. Its `convert`
    method is then passed as the `value_deserializer` argument of the
    `JournalClient`.
    
    Then, for each object deserialized from kafka, if it is a HashableObject,
    check its validity by comparing the computed hash with its id.
    
    If it is invalid, report the error in the logs and, if configured,
    register the invalid object via the `reporter` callback.
    
    In the cli code, a `Redis.set()` is used as such a callback (if
    configured): it simply stores invalid objects using the object id as key
    (typically its SWHID) and the raw kafka message value as value.
    
    Related to T3693.

commit 17282b47287a8258f3dca66f54a865485548ccc4
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Oct 27 16:56:53 2021 +0200

    Refactor fixer.fix_objects() to extract the inner object_fixers dict
    
    allowing this dict to be used independently of the fix_objects() function.

commit 5e8d18c2ec676448d02391b1fe96f1d24f328027
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Oct 27 16:55:03 2021 +0200

    Remove now useless fixers
    
    keep the fix_objects() function for bw compat for now.

commit 627a8a82c6b5496a2d6aec416a359b8bb360c124
Author: David Douard <david.douard@sdfa3.org>
Date:   Tue Oct 26 16:38:26 2021 +0200

    Add a --type option to 'swh storage replay'
    
    allows choosing the replayed object types from the cli.

Link to build: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1482/
See console output for more information: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1482/console

Build is green

Patch application report for D6571 (id=24039)

Could not rebase; Attempt merge onto a5bfe5b514...

Merge made by the 'recursive' strategy.
 requirements-swh-journal.txt       |   2 +-
 requirements-test.txt              |   2 +
 requirements.txt                   |  11 +-
 swh/storage/backfill.py            |   4 +-
 swh/storage/cli.py                 |  51 ++++++-
 swh/storage/fixer.py               | 290 ++-----------------------------------
 swh/storage/replay.py              | 195 +++++++++++++++----------
 swh/storage/tests/test_backfill.py |   7 +-
 swh/storage/tests/test_cli.py      |  14 ++
 swh/storage/tests/test_replay.py   | 193 ++++++++++++++++++++++--
 10 files changed, 394 insertions(+), 375 deletions(-)
Changes applied before test
commit 9596e1e57f3e32cdbc40660d650e741fa711aad2
Merge: a5bfe5b5 8b731bcb
Author: Jenkins user <jenkins@localhost>
Date:   Mon Nov 8 14:44:21 2021 +0000

    Merge branch 'diff-target' into HEAD

commit 8b731bcb89199899cb409e3b3e3dc0942dc1cdbb
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Oct 27 17:31:18 2021 +0200

    Add support for redis-based reporting of invalid mirrored objects
    
    The idea is to check BaseModel validity at journal deserialization
    time, so that we still have access to the raw object from kafka for
    complete reporting (object id plus raw kafka message).
    
    This uses a new ModelObjectDeserializer class that is responsible for
    deserializing the kafka message (still using kafka_to_value) and then
    immediately creating the BaseModel object from that dict. Its `convert`
    method is then passed as the `value_deserializer` argument of the
    `JournalClient`.
    
    Then, for each object deserialized from kafka, if it is a HashableObject,
    check its validity by comparing the computed hash with its id.
    
    If it is invalid, report the error in the logs and, if configured,
    register the invalid object via the `reporter` callback.
    
    In the cli code, a `Redis.set()` is used as such a callback (if
    configured): it simply stores invalid objects using the object id as key
    (typically its SWHID) and the raw kafka message value as value.
    
    Related to T3693.

commit 17282b47287a8258f3dca66f54a865485548ccc4
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Oct 27 16:56:53 2021 +0200

    Refactor fixer.fix_objects() to extract the inner object_fixers dict
    
    allowing this dict to be used independently of the fix_objects() function.

commit 5e8d18c2ec676448d02391b1fe96f1d24f328027
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Oct 27 16:55:03 2021 +0200

    Remove now useless fixers
    
    keep the fix_objects() function for bw compat for now.

commit 627a8a82c6b5496a2d6aec416a359b8bb360c124
Author: David Douard <david.douard@sdfa3.org>
Date:   Tue Oct 26 16:38:26 2021 +0200

    Add a --type option to 'swh storage replay'
    
    allows choosing the replayed object types from the cli.

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1483/ for more details.

This revision is now accepted and ready to land. Nov 9 2021, 11:41 AM

Build is green

Patch application report for D6571 (id=24087)

Could not rebase; Attempt merge onto 0262f1c191...

Updating 0262f1c1..850a7553
Fast-forward
 requirements-swh-journal.txt       |   2 +-
 requirements-test.txt              |   2 +
 requirements.txt                   |  11 +-
 swh/storage/backfill.py            |   4 +-
 swh/storage/cli.py                 |  51 ++++++-
 swh/storage/fixer.py               | 290 ++-----------------------------------
 swh/storage/replay.py              | 195 +++++++++++++++----------
 swh/storage/tests/test_backfill.py |   7 +-
 swh/storage/tests/test_cli.py      |  14 ++
 swh/storage/tests/test_replay.py   | 193 ++++++++++++++++++++++--
 10 files changed, 394 insertions(+), 375 deletions(-)
Changes applied before test
commit 850a7553b6d5b76847668bb2214cd9d538fcb667
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Oct 27 17:31:18 2021 +0200

    Add support for redis-based reporting of invalid mirrored objects
    
    The idea is to check BaseModel validity at journal deserialization
    time, so that we still have access to the raw object from kafka for
    complete reporting (object id plus raw kafka message).
    
    This uses a new ModelObjectDeserializer class that is responsible for
    deserializing the kafka message (still using kafka_to_value) and then
    immediately creating the BaseModel object from that dict. Its `convert`
    method is then passed as the `value_deserializer` argument of the
    `JournalClient`.
    
    Then, for each object deserialized from kafka, if it is a HashableObject,
    check its validity by comparing the computed hash with its id.
    
    If it is invalid, report the error in the logs and, if configured,
    register the invalid object via the `reporter` callback.
    
    In the cli code, a `Redis.set()` is used as such a callback (if
    configured): it simply stores invalid objects using the object id as key
    (typically its SWHID) and the raw kafka message value as value.
    
    Related to T3693.

commit 04bd15a0bca87fc21139ffdccd322b5e17fa959e
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Oct 27 16:56:53 2021 +0200

    Refactor fixer.fix_objects() to extract the inner object_fixers dict
    
    allowing this dict to be used independently of the fix_objects() function.

commit d655c8581d3c3a522ba083ee309157c4786e2de0
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Oct 27 16:55:03 2021 +0200

    Remove now useless fixers
    
    keep the fix_objects() function for bw compat for now.

commit 55eed77b519c4c3fc22b95554afa3343e013a253
Author: David Douard <david.douard@sdfa3.org>
Date:   Tue Oct 26 16:38:26 2021 +0200

    Add a --type option to 'swh storage replay'
    
    allows choosing the replayed object types from the cli.

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1488/ for more details.
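The `Redis.set()`-based reporter described in the commit message can be sketched as below. To keep the sketch runnable without a Redis server, a plain dict stands in for the Redis client; in the actual cli code the callback wraps `Redis.set()`, with the object id (typically its SWHID) as key and the raw kafka message as value. All names here are illustrative assumptions, not the real swh/storage/cli.py code.

```python
from typing import Callable, Dict


def make_dict_reporter(store: Dict[bytes, bytes]) -> Callable[[bytes, bytes], None]:
    """Build a reporter callback; `store` stands in for a Redis client."""

    def report(obj_id: bytes, raw_msg: bytes) -> None:
        # Mirrors Redis.set(obj_id, raw_msg): key is the object id
        # (typically its SWHID), value is the raw kafka message, so the
        # offending message can be inspected or replayed later.
        store[obj_id] = raw_msg

    return report
```

Such a callback matches the `reporter` hook the deserializer calls for each invalid object.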