Software Heritage

Add support for redis-based reporting of invalid mirrored objects
ClosedPublic

Authored by douardda on Oct 27 2021, 6:24 PM.

Details

Summary

The idea is to check the BaseModel validity at journal deserialization
time, so that we still have access to the raw object from kafka for
complete reporting (object id plus raw message from kafka).

This uses a new ModelObjectDeserializer class that is responsible for
deserializing the kafka message (still using kafka_to_value), then
immediately creating the BaseModel object from that dict. Its convert
method is then passed as the value_deserializer argument of the
JournalClient.

Then, for each object deserialized from kafka, if it is a HashableObject,
its validity is checked by comparing the computed hash with its id.

If it is invalid, the error is reported in the logs and, if configured,
the invalid object is registered via the reporter callback.

In the cli code, Redis.set() is used as such a callback (if configured):
it stores invalid objects using the object id (typically its swhid) as
key, and the raw kafka message value as value.
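As a rough illustration of the mechanism described above, here is a minimal sketch. All names besides ModelObjectDeserializer, convert and kafka_to_value mirror the summary loosely; compute_hash is a simplified stand-in for HashableObject.compute_hash(), and the real implementation (in swh.storage.replay) builds actual swh.model objects:

```python
import hashlib
import json
import logging
from typing import Any, Callable, Dict, Optional

logger = logging.getLogger(__name__)


def compute_hash(d: Dict[str, Any]) -> str:
    # Stand-in for HashableObject.compute_hash(): hash every field but "id".
    payload = json.dumps({k: v for k, v in d.items() if k != "id"}, sort_keys=True)
    return hashlib.sha1(payload.encode()).hexdigest()


class ModelObjectDeserializer:
    """Deserialize a kafka message, then validate the resulting object."""

    def __init__(self, reporter: Optional[Callable[[str, bytes], None]] = None):
        self.reporter = reporter

    def convert(self, object_type: str, msg: bytes) -> Optional[Dict[str, Any]]:
        # The real code uses kafka_to_value(msg); json stands in here.
        d = json.loads(msg)
        if "id" in d and d["id"] != compute_hash(d):
            logger.error("Invalid %s object: %s", object_type, d["id"])
            if self.reporter:
                # key: the object id (typically its swhid);
                # value: the raw kafka message
                self.reporter(d["id"], msg)
            return None  # invalid objects are dropped in this sketch
        return d
```

With a redis client, the reporter can then simply be `redis_client.set`, which stores the raw message under the object id, matching the cli behavior described above.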

Related to T3693.

Also refactor fixer.fix_objects() to extract the inner object_fixers dict,
allowing this dict to be used independently of the fix_objects() function.
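The shape of that refactoring could be sketched roughly as follows (illustrative names and fixer body only; the real fixers live in swh.storage.fixer):

```python
from typing import Any, Callable, Dict, Iterable, List

ObjDict = Dict[str, Any]


def _fix_revision(d: ObjDict) -> ObjDict:
    # Example fixer: drop a legacy field (purely illustrative).
    d = dict(d)
    d.pop("metadata", None)
    return d


# Previously defined inside fix_objects(); now module-level so it can be
# used independently (e.g. by the replayer).
object_fixers: Dict[str, Callable[[ObjDict], ObjDict]] = {
    "revision": _fix_revision,
}


def fix_objects(object_type: str, objects: Iterable[ObjDict]) -> List[ObjDict]:
    fixer = object_fixers.get(object_type)
    if fixer is None:
        return list(objects)
    return [fixer(o) for o in objects]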

Depends on D6570.

Diff Detail

Event Timeline

Build has FAILED

Patch application report for D6571 (id=23873)

Could not rebase; Attempt merge onto 49a932c989...

Updating 49a932c9..3b4fcb98
Fast-forward
 swh/storage/cli.py                 |  24 ++++-
 swh/storage/fixer.py               |  94 +++---------------
 swh/storage/replay.py              | 194 ++++++++++++++++++++-----------------
 swh/storage/tests/test_backfill.py |   7 +-
 swh/storage/tests/test_replay.py   |  12 +--
 5 files changed, 154 insertions(+), 177 deletions(-)
Changes applied before test
commit 3b4fcb980514425f279aeddb2fbf253c5ce119e2
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Oct 27 17:31:18 2021 +0200

    Add support for redis-based reporting of invalid mirrored objects

    The idea is that we check the BaseModel validity at journal
    deserialization time, so that we still have access to the raw object from
    kafka for complete reporting (object id plus raw message from kafka).

    This uses a new ModelObjectDeserializer class that is responsible for
    deserializing the kafka message (still using kafka_to_value), then
    immediately creating the BaseModel object from that dict. Its convert
    method is then passed as the value_deserializer argument of the
    JournalClient.

    Then, for each object deserialized from kafka, if it is a HashableObject,
    its validity is checked by comparing the computed hash with its id.

    If it is invalid, the error is reported in the logs and, if configured,
    the invalid object is registered via the reporter callback.

    In the cli code, Redis.set() is used as such a callback (if configured):
    it stores invalid objects using the object id (typically its swhid) as
    key, and the raw kafka message value as value.
    
    Related to T3693.

commit c25e595cfa85fa1249996a43aab6a59ccf2ee8c2
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Oct 27 16:56:53 2021 +0200

    Refactor fixer.fix_objects() to extract the inner object_fixers dict
    
    allowing this dict to be used independently of the fix_objects() function.

commit b44a1eea08f081cef2d5ddb3663403b1f8cd4ed8
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Oct 27 16:55:03 2021 +0200

    Remove now useless revision date checker in fixer

commit 74dbb2b0b148750de272f2f1a523c72b6b2e1365
Author: David Douard <david.douard@sdfa3.org>
Date:   Tue Oct 26 16:38:26 2021 +0200

    Add a --type option to 'swh storage replay'
    
    allows choosing replayed object types from the cli.

Link to build: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1465/
See console output for more information: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1465/console

Harbormaster returned this revision to the author for changes because remote builds failed. Oct 27 2021, 6:25 PM
Harbormaster failed remote builds in B24755: Diff 23873!

Fix issues and add tests

Note: this depends on D6565, so do not expect tests to pass right now; a dependency bump on swh.journal will be needed before landing.

douardda retitled this revision from [WIP] Add support for redis-based reporting of invalid mirrored objects to Add support for redis-based reporting of invalid mirrored objects. Oct 28 2021, 11:25 AM
douardda edited the summary of this revision. (Show Details)

Build has FAILED

Patch application report for D6571 (id=23884)

Could not rebase; Attempt merge onto 49a932c989...

Updating 49a932c9..c233a90b
Fast-forward
 requirements-test.txt              |   2 +
 swh/storage/cli.py                 |  24 ++++-
 swh/storage/fixer.py               |  95 ++++--------------
 swh/storage/replay.py              | 195 ++++++++++++++++++++-----------------
 swh/storage/tests/test_backfill.py |   7 +-
 swh/storage/tests/test_replay.py   | 187 +++++++++++++++++++++++++++++++++--
 6 files changed, 330 insertions(+), 180 deletions(-)
Changes applied before test
commit c233a90b0e3542d5256665ba88a7214b7b7de0e6
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Oct 27 17:31:18 2021 +0200

    Add support for redis-based reporting of invalid mirrored objects

    The idea is that we check the BaseModel validity at journal
    deserialization time, so that we still have access to the raw object from
    kafka for complete reporting (object id plus raw message from kafka).

    This uses a new ModelObjectDeserializer class that is responsible for
    deserializing the kafka message (still using kafka_to_value), then
    immediately creating the BaseModel object from that dict. Its `convert`
    method is then passed as the `value_deserializer` argument of the
    `JournalClient`.

    Then, for each object deserialized from kafka, if it is a HashableObject,
    its validity is checked by comparing the computed hash with its id.

    If it is invalid, the error is reported in the logs and, if configured,
    the invalid object is registered via the `reporter` callback.

    In the cli code, `Redis.set()` is used as such a callback (if configured):
    it stores invalid objects using the object id (typically its swhid) as
    key, and the raw kafka message value as value.
    
    Related to T3693.

commit fcd73734f0fdff74c9b4d066d8fc207b49cebedf
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Oct 27 16:56:53 2021 +0200

    Refactor fixer.fix_objects() to extract the inner object_fixers dict
    
    allowing this dict to be used independently of the fix_objects() function.

commit 693acf2760117d1a5f04a1143721662c903d1276
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Oct 27 16:55:03 2021 +0200

    Remove now useless revision date checker in fixer

commit 04ba5910d53c8e9eb55a4bc2396840f9a47a5ac5
Author: David Douard <david.douard@sdfa3.org>
Date:   Tue Oct 26 16:38:26 2021 +0200

    Add a --type option to 'swh storage replay'
    
    allows choosing replayed object types from the cli.

Link to build: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1469/
See console output for more information: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1469/console

Harbormaster returned this revision to the author for changes because remote builds failed. Oct 28 2021, 11:25 AM
Harbormaster failed remote builds in B24764: Diff 23884!

Asking for review even though tests are expected to fail, since this depends on D6565 (in swh-journal).

Bump the dependency on swh-journal to 0.9

Build has FAILED

Patch application report for D6571 (id=23936)

Could not rebase; Attempt merge onto a5bfe5b514...

Merge made by the 'recursive' strategy.
 requirements-swh-journal.txt       |   2 +-
 requirements-test.txt              |   2 +
 requirements.txt                   |  11 ++-
 swh/storage/cli.py                 |  24 ++++-
 swh/storage/fixer.py               |  95 ++++--------------
 swh/storage/replay.py              | 195 ++++++++++++++++++++-----------------
 swh/storage/tests/test_backfill.py |   7 +-
 swh/storage/tests/test_replay.py   | 187 +++++++++++++++++++++++++++++++++--
 8 files changed, 337 insertions(+), 186 deletions(-)
Changes applied before test
commit c4cd575dc1d56fafe37fe902cd708a099d1d9cbd
Merge: a5bfe5b5 06b1629d
Author: Jenkins user <jenkins@localhost>
Date:   Thu Oct 28 16:41:00 2021 +0000

    Merge branch 'diff-target' into HEAD

commit 06b1629d1f75820f6451f1ec82c23b455edbab3e
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Oct 27 17:31:18 2021 +0200

    Add support for redis-based reporting of invalid mirrored objects

    The idea is that we check the BaseModel validity at journal
    deserialization time, so that we still have access to the raw object from
    kafka for complete reporting (object id plus raw message from kafka).

    This uses a new ModelObjectDeserializer class that is responsible for
    deserializing the kafka message (still using kafka_to_value), then
    immediately creating the BaseModel object from that dict. Its `convert`
    method is then passed as the `value_deserializer` argument of the
    `JournalClient`.

    Then, for each object deserialized from kafka, if it is a HashableObject,
    its validity is checked by comparing the computed hash with its id.

    If it is invalid, the error is reported in the logs and, if configured,
    the invalid object is registered via the `reporter` callback.

    In the cli code, `Redis.set()` is used as such a callback (if configured):
    it stores invalid objects using the object id (typically its swhid) as
    key, and the raw kafka message value as value.
    
    Related to T3693.

commit bc4556bd1120503c86d9e8b3e5315e02c407a672
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Oct 27 16:56:53 2021 +0200

    Refactor fixer.fix_objects() to extract the inner object_fixers dict
    
    allowing this dict to be used independently of the fix_objects() function.

commit 8283f462321234941a3e16b4a6c308dcaf247694
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Oct 27 16:55:03 2021 +0200

    Remove now useless revision date checker in fixer

commit 920c71c8047946ca55f4ff9d1bdf32b83b279d24
Author: David Douard <david.douard@sdfa3.org>
Date:   Tue Oct 26 16:38:26 2021 +0200

    Add a --type option to 'swh storage replay'
    
    allows choosing replayed object types from the cli.

Link to build: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1471/
See console output for more information: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1471/console

Build has FAILED

Patch application report for D6571 (id=23940)

Could not rebase; Attempt merge onto a5bfe5b514...

Merge made by the 'recursive' strategy.
 requirements-swh-journal.txt       |   2 +-
 requirements-test.txt              |   2 +
 requirements.txt                   |  11 ++-
 swh/storage/cli.py                 |  24 ++++-
 swh/storage/fixer.py               |  95 ++++--------------
 swh/storage/replay.py              | 195 ++++++++++++++++++++-----------------
 swh/storage/tests/test_backfill.py |   7 +-
 swh/storage/tests/test_replay.py   | 187 +++++++++++++++++++++++++++++++++--
 8 files changed, 337 insertions(+), 186 deletions(-)
Changes applied before test
commit f2d2697735ee32701348d6b63f4bbdc69cdcaca2
Merge: a5bfe5b5 fb8245e5
Author: Jenkins user <jenkins@localhost>
Date:   Fri Oct 29 08:26:34 2021 +0000

    Merge branch 'diff-target' into HEAD

commit fb8245e5b457152ee2e45b01f3d0e9483b6b4775
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Oct 27 17:31:18 2021 +0200

    Add support for redis-based reporting of invalid mirrored objects

    The idea is that we check the BaseModel validity at journal
    deserialization time, so that we still have access to the raw object from
    kafka for complete reporting (object id plus raw message from kafka).

    This uses a new ModelObjectDeserializer class that is responsible for
    deserializing the kafka message (still using kafka_to_value), then
    immediately creating the BaseModel object from that dict. Its `convert`
    method is then passed as the `value_deserializer` argument of the
    `JournalClient`.

    Then, for each object deserialized from kafka, if it is a HashableObject,
    its validity is checked by comparing the computed hash with its id.

    If it is invalid, the error is reported in the logs and, if configured,
    the invalid object is registered via the `reporter` callback.

    In the cli code, `Redis.set()` is used as such a callback (if configured):
    it stores invalid objects using the object id (typically its swhid) as
    key, and the raw kafka message value as value.
    
    Related to T3693.

commit bc4556bd1120503c86d9e8b3e5315e02c407a672
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Oct 27 16:56:53 2021 +0200

    Refactor fixer.fix_objects() to extract the inner object_fixers dict
    
    allowing this dict to be used independently of the fix_objects() function.

commit 8283f462321234941a3e16b4a6c308dcaf247694
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Oct 27 16:55:03 2021 +0200

    Remove now useless revision date checker in fixer

commit 920c71c8047946ca55f4ff9d1bdf32b83b279d24
Author: David Douard <david.douard@sdfa3.org>
Date:   Tue Oct 26 16:38:26 2021 +0200

    Add a --type option to 'swh storage replay'
    
    allows choosing replayed object types from the cli.

Link to build: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1475/
See console output for more information: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1475/console

rebase and a few fixes

  • add a config sanity check in cli.replay: anonymized replay cannot be validated
  • fix test_storage_replay_anonymized accordingly
  • drop usage of storage.fixer; we only need to drop Revision.metadata now
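The config sanity check from the first bullet could look roughly like the sketch below. This is hypothetical: the actual option names in swh.storage.cli (the "privileged" journal flag and the object-validation switch) are assumptions here; the point is only that hash validation must be refused on an anonymized journal, since anonymization rewrites fields and the recomputed hash can no longer match the stored id:

```python
def check_replay_config(config: dict) -> None:
    """Refuse to validate objects when replaying an anonymized journal.

    Key names ("privileged", "check_objects") are illustrative assumptions.
    """
    journal_cfg = config.get("journal_client", {})
    if journal_cfg.get("privileged") is False and config.get("check_objects"):
        raise ValueError(
            "Cannot validate objects when replaying an anonymized journal"
        )
```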

Build has FAILED

Patch application report for D6571 (id=23942)

Could not rebase; Attempt merge onto a5bfe5b514...

Merge made by the 'recursive' strategy.
 requirements-swh-journal.txt       |   2 +-
 requirements-test.txt              |   2 +
 requirements.txt                   |  11 +-
 swh/storage/cli.py                 |  32 +++-
 swh/storage/fixer.py               | 290 ++-----------------------------------
 swh/storage/replay.py              | 206 ++++++++++++++------------
 swh/storage/tests/test_backfill.py |   7 +-
 swh/storage/tests/test_replay.py   | 193 ++++++++++++++++++++++--
 8 files changed, 352 insertions(+), 391 deletions(-)
Changes applied before test
commit 96bd83743fca6e1e49d9e38b9c5fc1870b1aeedc
Merge: a5bfe5b5 36665ece
Author: Jenkins user <jenkins@localhost>
Date:   Fri Oct 29 10:31:04 2021 +0000

    Merge branch 'diff-target' into HEAD

commit 36665ecec8eebff8dbc62dcb46706b93f16de07f
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Oct 27 17:31:18 2021 +0200

    Add support for redis-based reporting of invalid mirrored objects

    The idea is that we check the BaseModel validity at journal
    deserialization time, so that we still have access to the raw object from
    kafka for complete reporting (object id plus raw message from kafka).

    This uses a new ModelObjectDeserializer class that is responsible for
    deserializing the kafka message (still using kafka_to_value), then
    immediately creating the BaseModel object from that dict. Its `convert`
    method is then passed as the `value_deserializer` argument of the
    `JournalClient`.

    Then, for each object deserialized from kafka, if it is a HashableObject,
    its validity is checked by comparing the computed hash with its id.

    If it is invalid, the error is reported in the logs and, if configured,
    the invalid object is registered via the `reporter` callback.

    In the cli code, `Redis.set()` is used as such a callback (if configured):
    it stores invalid objects using the object id (typically its swhid) as
    key, and the raw kafka message value as value.
    
    Related to T3693.

commit 25002613a9755d39859bdd00fb573b2f61fbb831
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Oct 27 16:56:53 2021 +0200

    Refactor fixer.fix_objects() to extract the inner object_fixers dict
    
    allowing this dict to be used independently of the fix_objects() function.

commit f8aecc8691525eb9a677e663eb350975987bdf1a
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Oct 27 16:55:03 2021 +0200

    Remove now useless fixers
    
    keep the fix_objects() function for backward compatibility for now.

commit 920c71c8047946ca55f4ff9d1bdf32b83b279d24
Author: David Douard <david.douard@sdfa3.org>
Date:   Tue Oct 26 16:38:26 2021 +0200

    Add a --type option to 'swh storage replay'
    
    allows choosing replayed object types from the cli.

Link to build: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1476/
See console output for more information: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1476/console

vlorentz added inline comments.
swh/storage/replay.py
146–148

use swh.storage.utils.remove_keys

164–200

why not use functools.partial instead of returning a callable?

swh/storage/tests/test_replay.py
458

shorter, and consistent with the code below

swh/storage/replay.py
146–148

thx

164–200

would work too yes, why not

swh/storage/tests/test_replay.py
458

agreed, thx
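For illustration, the alternative raised in the `164–200` comment (functools.partial versus a factory returning a closure) could be sketched as below. The function names are simplified stand-ins, not the actual swh.storage.replay code:

```python
import functools
from typing import Callable, Iterable, List, Tuple


def insert_objects(storage: List[Tuple[str, object]], object_type: str,
                   objects: Iterable) -> None:
    # Stand-in for the replay logic writing objects to storage.
    storage.extend((object_type, o) for o in objects)


# Option 1: a factory returning a closure over the bound arguments.
def make_inserter(storage: list, object_type: str) -> Callable[[Iterable], None]:
    def inserter(objects: Iterable) -> None:
        insert_objects(storage, object_type, objects)
    return inserter


# Option 2: functools.partial, as suggested in the review: same behavior,
# no nested function definition.
def make_inserter_partial(storage: list, object_type: str):
    return functools.partial(insert_objects, storage, object_type)
```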

Build is green

Patch application report for D6571 (id=23943)

Could not rebase; Attempt merge onto a5bfe5b514...

Merge made by the 'recursive' strategy.
 requirements-swh-journal.txt       |   2 +-
 requirements-test.txt              |   2 +
 requirements.txt                   |  11 +-
 swh/storage/cli.py                 |  32 +++-
 swh/storage/fixer.py               | 290 ++-----------------------------------
 swh/storage/replay.py              | 206 ++++++++++++++------------
 swh/storage/tests/test_backfill.py |   7 +-
 swh/storage/tests/test_replay.py   | 193 ++++++++++++++++++++++--
 8 files changed, 352 insertions(+), 391 deletions(-)
Changes applied before test
commit 56f332ca8f60451e5634eeda63425f2beb359af4
Merge: a5bfe5b5 ae50f362
Author: Jenkins user <jenkins@localhost>
Date:   Fri Oct 29 11:28:35 2021 +0000

    Merge branch 'diff-target' into HEAD

commit ae50f362bf1bfb55a3cc44069744e2e6b81640e8
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Oct 27 17:31:18 2021 +0200

    Add support for redis-based reporting of invalid mirrored objects

    The idea is that we check the BaseModel validity at journal
    deserialization time, so that we still have access to the raw object from
    kafka for complete reporting (object id plus raw message from kafka).

    This uses a new ModelObjectDeserializer class that is responsible for
    deserializing the kafka message (still using kafka_to_value), then
    immediately creating the BaseModel object from that dict. Its `convert`
    method is then passed as the `value_deserializer` argument of the
    `JournalClient`.

    Then, for each object deserialized from kafka, if it is a HashableObject,
    its validity is checked by comparing the computed hash with its id.

    If it is invalid, the error is reported in the logs and, if configured,
    the invalid object is registered via the `reporter` callback.

    In the cli code, `Redis.set()` is used as such a callback (if configured):
    it stores invalid objects using the object id (typically its swhid) as
    key, and the raw kafka message value as value.
    
    Related to T3693.

commit 25002613a9755d39859bdd00fb573b2f61fbb831
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Oct 27 16:56:53 2021 +0200

    Refactor fixer.fix_objects() to extract the inner object_fixers dict
    
    allowing this dict to be used independently of the fix_objects() function.

commit f8aecc8691525eb9a677e663eb350975987bdf1a
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Oct 27 16:55:03 2021 +0200

    Remove now useless fixers
    
    keep the fix_objects() function for bw compat for now.

commit 920c71c8047946ca55f4ff9d1bdf32b83b279d24
Author: David Douard <david.douard@sdfa3.org>
Date:   Tue Oct 26 16:38:26 2021 +0200

    Add a --type option to 'swh storage replay'
    
    allows choosing the replayed object types from the cli.

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1478/ for more details.
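The ModelObjectDeserializer described in the commit message above can be sketched roughly as follows. This is an illustrative outline only: the real class lives in swh/storage/replay.py and relies on kafka_to_value and the swh.model HashableObject API, which are replaced here by stand-ins (a toy 20-byte-id message format and a sha1 recomputation) so the sketch is self-contained.

```python
import hashlib
from typing import Any, Callable, Dict, Optional


class ModelObjectDeserializer:
    """Sketch: deserialize a kafka message, validate it, report if invalid.

    `reporter` is a callback taking (object id as hex string, raw kafka
    message bytes); in the actual cli code Redis.set() plays this role.
    """

    def __init__(self, reporter: Optional[Callable[[str, bytes], None]] = None):
        self.reporter = reporter

    def convert(self, object_type: str, msg: bytes) -> Optional[Dict[str, Any]]:
        # Stand-in for kafka_to_value(): parse the raw kafka message.
        obj = self.deserialize(msg)
        # Stand-in for the HashableObject check: recompute the hash from
        # the object's content and compare it with the stored id.
        computed = hashlib.sha1(obj["data"]).digest()
        if computed != obj["id"]:
            if self.reporter is not None:
                # Report with the raw kafka message, still available here.
                self.reporter(obj["id"].hex(), msg)
            return None  # drop the invalid object from the replay stream
        return obj

    @staticmethod
    def deserialize(msg: bytes) -> Dict[str, Any]:
        # Toy wire format: a 20-byte id followed by the payload.
        return {"id": msg[:20], "data": msg[20:]}
```

Its `convert` method has the shape expected of a `value_deserializer`, so an instance can be handed to the journal client while keeping access to the raw message for reporting.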

apply vlorentz' comments/suggestions

Build is green

Patch application report for D6571 (id=23944)

Could not rebase; Attempt merge onto a5bfe5b514...

Merge made by the 'recursive' strategy.
 requirements-swh-journal.txt       |   2 +-
 requirements-test.txt              |   2 +
 requirements.txt                   |  11 +-
 swh/storage/cli.py                 |  32 +++-
 swh/storage/fixer.py               | 290 ++-----------------------------------
 swh/storage/replay.py              | 161 ++++++++++----------
 swh/storage/tests/test_backfill.py |   7 +-
 swh/storage/tests/test_replay.py   | 193 ++++++++++++++++++++++--
 8 files changed, 326 insertions(+), 372 deletions(-)
Changes applied before test
commit 381c2f34801ad9137ac859b1e13d98f6a27edd5f
Merge: a5bfe5b5 c89447ee
Author: Jenkins user <jenkins@localhost>
Date:   Fri Oct 29 11:45:08 2021 +0000

    Merge branch 'diff-target' into HEAD

commit c89447ee01681736b80d3029ce695dacdc854916
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Oct 27 17:31:18 2021 +0200

    Add support for redis-based reporting of invalid mirrored objects
    
    The idea is to check BaseModel validity at journal deserialization
    time, so that we still have access to the raw object from kafka for
    complete reporting (object id plus raw kafka message).
    
    This uses a new ModelObjectDeserializer class that is responsible for
    deserializing the kafka message (still using kafka_to_value) and then
    immediately creating the BaseModel object from that dict. Its `convert`
    method is then passed as the `value_deserializer` argument of the
    `JournalClient`.
    
    Then, for each object deserialized from kafka, if it is a HashableObject,
    check its validity by comparing the computed hash with its id.
    
    If it is invalid, report the error in the logs and, if configured,
    register the invalid object via the `reporter` callback.
    
    In the cli code, a `Redis.set()` is used as such a callback (if
    configured): it simply stores invalid objects using the object id as key
    (typically its SWHID) and the raw kafka message value as value.
    
    Related to T3693.

commit 25002613a9755d39859bdd00fb573b2f61fbb831
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Oct 27 16:56:53 2021 +0200

    Refactor fixer.fix_objects() to extract the inner object_fixers dict
    
    allowing this dict to be used independently of the fix_objects() function.

commit f8aecc8691525eb9a677e663eb350975987bdf1a
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Oct 27 16:55:03 2021 +0200

    Remove now useless fixers
    
    keep the fix_objects() function for bw compat for now.

commit 920c71c8047946ca55f4ff9d1bdf32b83b279d24
Author: David Douard <david.douard@sdfa3.org>
Date:   Tue Oct 26 16:38:26 2021 +0200

    Add a --type option to 'swh storage replay'
    
    allows choosing the replayed object types from the cli.

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1479/ for more details.

swh/storage/replay.py
49–147

can you rename object_converter_fn and object_fixers to be in caps? (it's a little confusing below when using them, they look like variables)

157

could you add a docstring? in particular, explain the args of reporter

186

TIL this comment syntax

188–190

Do we have any HashableObject class without a swhid method?

swh/storage/tests/test_replay.py
416

share this regexp with test_storage_replayer_with_validation_nok (maybe the whole for loop too). Otherwise a single typo in the regexp could make the test miss issues
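The sharing suggested here could look like the sketch below: a module-level constant that both test_storage_replayer_with_validation_ok and _nok reference, so a typo can only occur in one place. The pattern itself is a made-up placeholder, not the actual regexp from test_replay.py.

```python
import re

# Hypothetical shared pattern; the real one in test_replay.py differs.
INVALID_OBJ_RE = re.compile(r"Invalid (\w+) object: swh:1:\w+:[0-9a-f]{40}")


def match_invalid(log_line: str):
    """Return a match object if the log line reports an invalid object."""
    return INVALID_OBJ_RE.search(log_line)
```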

douardda added inline comments.
swh/storage/replay.py
188–190

ExtID

douardda marked an inline comment as done.

apply vlorentz' comments

  • add a docstring in ModelObjectDeserializer
  • make global vars uppercase
  • use a common regex var in test_replay

Build has FAILED

Patch application report for D6571 (id=24038)

Could not rebase; Attempt merge onto a5bfe5b514...

Merge made by the 'recursive' strategy.
 requirements-swh-journal.txt       |   2 +-
 requirements-test.txt              |   2 +
 requirements.txt                   |  11 +-
 swh/storage/backfill.py            |   4 +-
 swh/storage/cli.py                 |  51 ++++++-
 swh/storage/fixer.py               | 290 ++-----------------------------------
 swh/storage/replay.py              | 195 +++++++++++++++----------
 swh/storage/tests/test_backfill.py |   7 +-
 swh/storage/tests/test_cli.py      |  14 ++
 swh/storage/tests/test_replay.py   | 193 ++++++++++++++++++++++--
 10 files changed, 394 insertions(+), 375 deletions(-)
Changes applied before test
commit bd38fdc207d6c90f43098824042a6474cc49aafc
Merge: a5bfe5b5 2fd1de4c
Author: Jenkins user <jenkins@localhost>
Date:   Mon Nov 8 14:29:33 2021 +0000

    Merge branch 'diff-target' into HEAD

commit 2fd1de4cc96a9b42350e562a1fea118714d581a8
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Oct 27 17:31:18 2021 +0200

    Add support for redis-based reporting of invalid mirrored objects
    
    The idea is to check BaseModel validity at journal deserialization
    time, so that we still have access to the raw object from kafka for
    complete reporting (object id plus raw kafka message).
    
    This uses a new ModelObjectDeserializer class that is responsible for
    deserializing the kafka message (still using kafka_to_value) and then
    immediately creating the BaseModel object from that dict. Its `convert`
    method is then passed as the `value_deserializer` argument of the
    `JournalClient`.
    
    Then, for each object deserialized from kafka, if it is a HashableObject,
    check its validity by comparing the computed hash with its id.
    
    If it is invalid, report the error in the logs and, if configured,
    register the invalid object via the `reporter` callback.
    
    In the cli code, a `Redis.set()` is used as such a callback (if
    configured): it simply stores invalid objects using the object id as key
    (typically its SWHID) and the raw kafka message value as value.
    
    Related to T3693.

commit 17282b47287a8258f3dca66f54a865485548ccc4
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Oct 27 16:56:53 2021 +0200

    Refactor fixer.fix_objects() to extract the inner object_fixers dict
    
    allowing this dict to be used independently of the fix_objects() function.

commit 5e8d18c2ec676448d02391b1fe96f1d24f328027
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Oct 27 16:55:03 2021 +0200

    Remove now useless fixers
    
    keep the fix_objects() function for bw compat for now.

commit 627a8a82c6b5496a2d6aec416a359b8bb360c124
Author: David Douard <david.douard@sdfa3.org>
Date:   Tue Oct 26 16:38:26 2021 +0200

    Add a --type option to 'swh storage replay'
    
    allows choosing the replayed object types from the cli.

Link to build: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1482/
See console output for more information: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1482/console

Build is green

Patch application report for D6571 (id=24039)

Could not rebase; Attempt merge onto a5bfe5b514...

Merge made by the 'recursive' strategy.
 requirements-swh-journal.txt       |   2 +-
 requirements-test.txt              |   2 +
 requirements.txt                   |  11 +-
 swh/storage/backfill.py            |   4 +-
 swh/storage/cli.py                 |  51 ++++++-
 swh/storage/fixer.py               | 290 ++-----------------------------------
 swh/storage/replay.py              | 195 +++++++++++++++----------
 swh/storage/tests/test_backfill.py |   7 +-
 swh/storage/tests/test_cli.py      |  14 ++
 swh/storage/tests/test_replay.py   | 193 ++++++++++++++++++++++--
 10 files changed, 394 insertions(+), 375 deletions(-)
Changes applied before test
commit 9596e1e57f3e32cdbc40660d650e741fa711aad2
Merge: a5bfe5b5 8b731bcb
Author: Jenkins user <jenkins@localhost>
Date:   Mon Nov 8 14:44:21 2021 +0000

    Merge branch 'diff-target' into HEAD

commit 8b731bcb89199899cb409e3b3e3dc0942dc1cdbb
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Oct 27 17:31:18 2021 +0200

    Add support for redis-based reporting of invalid mirrored objects
    
    The idea is to check BaseModel validity at journal deserialization
    time, so that we still have access to the raw object from kafka for
    complete reporting (object id plus raw kafka message).
    
    This uses a new ModelObjectDeserializer class that is responsible for
    deserializing the kafka message (still using kafka_to_value) and then
    immediately creating the BaseModel object from that dict. Its `convert`
    method is then passed as the `value_deserializer` argument of the
    `JournalClient`.
    
    Then, for each object deserialized from kafka, if it is a HashableObject,
    check its validity by comparing the computed hash with its id.
    
    If it is invalid, report the error in the logs and, if configured,
    register the invalid object via the `reporter` callback.
    
    In the cli code, a `Redis.set()` is used as such a callback (if
    configured): it simply stores invalid objects using the object id as key
    (typically its SWHID) and the raw kafka message value as value.
    
    Related to T3693.

commit 17282b47287a8258f3dca66f54a865485548ccc4
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Oct 27 16:56:53 2021 +0200

    Refactor fixer.fix_objects() to extract the inner object_fixers dict
    
    allowing this dict to be used independently of the fix_objects() function.

commit 5e8d18c2ec676448d02391b1fe96f1d24f328027
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Oct 27 16:55:03 2021 +0200

    Remove now useless fixers
    
    keep the fix_objects() function for bw compat for now.

commit 627a8a82c6b5496a2d6aec416a359b8bb360c124
Author: David Douard <david.douard@sdfa3.org>
Date:   Tue Oct 26 16:38:26 2021 +0200

    Add a --type option to 'swh storage replay'
    
    allows choosing the replayed object types from the cli.

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1483/ for more details.

This revision is now accepted and ready to land. Nov 9 2021, 11:41 AM

Build is green

Patch application report for D6571 (id=24087)

Could not rebase; Attempt merge onto 0262f1c191...

Updating 0262f1c1..850a7553
Fast-forward
 requirements-swh-journal.txt       |   2 +-
 requirements-test.txt              |   2 +
 requirements.txt                   |  11 +-
 swh/storage/backfill.py            |   4 +-
 swh/storage/cli.py                 |  51 ++++++-
 swh/storage/fixer.py               | 290 ++-----------------------------------
 swh/storage/replay.py              | 195 +++++++++++++++----------
 swh/storage/tests/test_backfill.py |   7 +-
 swh/storage/tests/test_cli.py      |  14 ++
 swh/storage/tests/test_replay.py   | 193 ++++++++++++++++++++++--
 10 files changed, 394 insertions(+), 375 deletions(-)
Changes applied before test
commit 850a7553b6d5b76847668bb2214cd9d538fcb667
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Oct 27 17:31:18 2021 +0200

    Add support for redis-based reporting of invalid mirrored objects
    
    The idea is to check BaseModel validity at journal deserialization
    time, so that we still have access to the raw object from kafka for
    complete reporting (object id plus raw kafka message).
    
    This uses a new ModelObjectDeserializer class that is responsible for
    deserializing the kafka message (still using kafka_to_value) and then
    immediately creating the BaseModel object from that dict. Its `convert`
    method is then passed as the `value_deserializer` argument of the
    `JournalClient`.
    
    Then, for each object deserialized from kafka, if it is a HashableObject,
    check its validity by comparing the computed hash with its id.
    
    If it is invalid, report the error in the logs and, if configured,
    register the invalid object via the `reporter` callback.
    
    In the cli code, a `Redis.set()` is used as such a callback (if
    configured): it simply stores invalid objects using the object id as key
    (typically its SWHID) and the raw kafka message value as value.
    
    Related to T3693.

commit 04bd15a0bca87fc21139ffdccd322b5e17fa959e
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Oct 27 16:56:53 2021 +0200

    Refactor fixer.fix_objects() to extract the inner object_fixers dict
    
    allowing this dict to be used independently of the fix_objects() function.

commit d655c8581d3c3a522ba083ee309157c4786e2de0
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Oct 27 16:55:03 2021 +0200

    Remove now useless fixers
    
    keep the fix_objects() function for bw compat for now.

commit 55eed77b519c4c3fc22b95554afa3343e013a253
Author: David Douard <david.douard@sdfa3.org>
Date:   Tue Oct 26 16:38:26 2021 +0200

    Add a --type option to 'swh storage replay'
    
    allows choosing the replayed object types from the cli.

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1488/ for more details.
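The `Redis.set()`-based reporter described in the commit message can be sketched as below. To keep the sketch runnable without a Redis server, a plain dict stands in for the Redis client; in the actual cli code the callback wraps `Redis.set()`, with the object id (typically its SWHID) as key and the raw kafka message as value. All names here are illustrative assumptions, not the real swh/storage/cli.py code.

```python
from typing import Callable, Dict


def make_dict_reporter(store: Dict[bytes, bytes]) -> Callable[[bytes, bytes], None]:
    """Build a reporter callback; `store` stands in for a Redis client."""

    def report(obj_id: bytes, raw_msg: bytes) -> None:
        # Mirrors Redis.set(obj_id, raw_msg): key is the object id
        # (typically its SWHID), value is the raw kafka message, so the
        # offending message can be inspected or replayed later.
        store[obj_id] = raw_msg

    return report
```

Such a callback matches the `reporter` hook the deserializer calls for each invalid object.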