Page MenuHomeSoftware Heritage

Add a journal replayer for the revision layer
ClosedPublic

Authored by douardda on Oct 11 2022, 12:26 PM.

Details

Summary

This simple replayer implementation is probably not good enought for a
real life scenario; especially since it will probably not be robust
enought against out-of-order kafka messages handling.

Should be ok now (maybe).

As usual when kafka is involved in tests, provided new tests are a bit
slow...

Depends on D8657
Related to T4616

Diff Detail

Event Timeline

Build was aborted

Patch application report for D8658 (id=31246)

Could not rebase; Attempt merge onto 6f4a193e90...

Updating 6f4a193..94ead8d
Fast-forward
 swh/provenance/provenance.py                       |  11 +-
 swh/provenance/storage/__init__.py                 |   9 +
 swh/provenance/storage/interface.py                |  10 +-
 swh/provenance/storage/journal.py                  | 152 ++++++++++++++++
 swh/provenance/storage/postgresql.py               |  20 +--
 swh/provenance/storage/rabbitmq/client.py          |   1 +
 swh/provenance/storage/rabbitmq/server.py          |   2 +-
 swh/provenance/storage/replay.py                   | 120 +++++++++++++
 .../tests/test_provenance_journal_writer.py        | 193 +++++++++++++++++++++
 .../tests/test_provenance_journal_writer_kafka.py  |  41 +++++
 swh/provenance/tests/test_provenance_storage.py    |  30 ++--
 swh/provenance/tests/test_replay.py                | 169 ++++++++++++++++++
 .../tests/test_revision_content_layer.py           |   6 +-
 13 files changed, 727 insertions(+), 37 deletions(-)
 create mode 100644 swh/provenance/storage/journal.py
 create mode 100644 swh/provenance/storage/replay.py
 create mode 100644 swh/provenance/tests/test_provenance_journal_writer.py
 create mode 100644 swh/provenance/tests/test_provenance_journal_writer_kafka.py
 create mode 100644 swh/provenance/tests/test_replay.py
Changes applied before test
commit 94ead8d7ed6523dc8411742efd0a8576788358bf
Author: David Douard <david.douard@sdfa3.org>
Date:   Tue Oct 11 12:17:39 2022 +0200

    Add a journal replayer for the revision layer
    
    This simple replayer implementation is probably not good enought for a
    real life scenario; especially since it will probably not be robust
    enought against out-of-order kafka messages handling.
    
    As usual when kafka is involved in tests, provided new tests are a bit
    slow...

commit c8ddd305cb942eb5bd491f305b3fedaf8d094666
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Oct 7 18:23:09 2022 +0200

    Add support for kafka journalization of the ProvenanceStorageInterface
    
    the new ProvenanceStorageJournal is a proxy ProvenanceStorageInterface
    that will push added objects in a swh-journal (typ. a kafka).
    
    Journal messages are simple dicts with 2 keys: id (the sharding key) and
    value (a serialiazable version of the argument of the xxx_add() method).

commit d182fcdbc532f1769bec129b2e37c36a59d021de
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Oct 7 14:51:09 2022 +0200

    Normalize _add() methods of the ProvenanceStorage interface
    
    make them all accept a Dict[Sha1Git, xxx] as argument, ie:
    
    - remove support for Iterable[bytes] in revision_add, and
    - replace Iterable[bytes] by Dict[Sha1Git, bytes] for location_add
    
    Currently, the sha1 of location path in location_add() is not really
    used by any backend, so the computation of said hashed is a waste of
    resource, but it makes the API of this interface much more consistent
    which will be helpful for coming features (like kafka journal).

Link to build: https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/675/
See console output for more information: https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/675/console

Harbormaster returned this revision to the author for changes because remote builds failed.Oct 11 2022, 12:43 PM
Harbormaster failed remote builds in B32198: Diff 31246!

Build is green

Patch application report for D8658 (id=31251)

Could not rebase; Attempt merge onto 6f4a193e90...

Updating 6f4a193..db5342e
Fast-forward
 swh/provenance/provenance.py                       |  11 +-
 swh/provenance/storage/__init__.py                 |   9 +
 swh/provenance/storage/interface.py                |  10 +-
 swh/provenance/storage/journal.py                  | 152 ++++++++++++++++
 swh/provenance/storage/postgresql.py               |  20 +--
 swh/provenance/storage/rabbitmq/client.py          |   1 +
 swh/provenance/storage/rabbitmq/server.py          |   2 +-
 swh/provenance/storage/replay.py                   | 120 +++++++++++++
 swh/provenance/tests/test_journal_client.py        |   2 +
 .../tests/test_provenance_journal_writer.py        | 193 +++++++++++++++++++++
 .../tests/test_provenance_journal_writer_kafka.py  |  46 +++++
 swh/provenance/tests/test_provenance_storage.py    |  30 ++--
 swh/provenance/tests/test_replay.py                | 170 ++++++++++++++++++
 .../tests/test_revision_content_layer.py           |   6 +-
 tox.ini                                            |   4 +-
 15 files changed, 737 insertions(+), 39 deletions(-)
 create mode 100644 swh/provenance/storage/journal.py
 create mode 100644 swh/provenance/storage/replay.py
 create mode 100644 swh/provenance/tests/test_provenance_journal_writer.py
 create mode 100644 swh/provenance/tests/test_provenance_journal_writer_kafka.py
 create mode 100644 swh/provenance/tests/test_replay.py
Changes applied before test
commit db5342ed4b9fddc9e5372ff196f163f480b7b4ae
Author: David Douard <david.douard@sdfa3.org>
Date:   Tue Oct 11 12:17:39 2022 +0200

    Add a journal replayer for the revision layer
    
    This simple replayer implementation is probably not good enought for a
    real life scenario; especially since it will probably not be robust
    enought against out-of-order kafka messages handling.
    
    As usual when kafka is involved in tests, provided new tests are a bit
    slow...

commit b0535b51c49af1f7a8ac82b05785c086369f96b1
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Oct 7 18:23:09 2022 +0200

    Add support for kafka journalization of the ProvenanceStorageInterface
    
    the new ProvenanceStorageJournal is a proxy ProvenanceStorageInterface
    that will push added objects in a swh-journal (typ. a kafka).
    
    Journal messages are simple dicts with 2 keys: id (the sharding key) and
    value (a serialiazable version of the argument of the xxx_add() method).
    
    Use the 'kafka' pytest marker for all kafka-related tests (especially
    used for tox, see tox.ini).

commit 6b539ecbe1673caf539e0636de76881a3f8ed171
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Oct 7 14:51:09 2022 +0200

    Normalize _add() methods of the ProvenanceStorage interface
    
    make them all accept a Dict[Sha1Git, xxx] as argument, ie:
    
    - remove support for Iterable[bytes] in revision_add, and
    - replace Iterable[bytes] by Dict[Sha1Git, bytes] for location_add
    
    Currently, the sha1 of location path in location_add() is not really
    used by any backend, so the computation of said hashed is a waste of
    resource, but it makes the API of this interface much more consistent
    which will be helpful for coming features (like kafka journal).

See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/678/ for more details.

vlorentz added inline comments.
swh/provenance/storage/replay.py
113–120

method isn't needed.

Also, what's this dict(next()) thing? Doesn't it discard all but one object type at random?

Build was aborted

Patch application report for D8658 (id=31275)

Could not rebase; Attempt merge onto 6f4a193e90...

Updating 6f4a193..f140cec
Fast-forward
 swh/provenance/algos/directory.py                  |  10 +-
 swh/provenance/interface.py                        |   8 +-
 swh/provenance/provenance.py                       |   9 +-
 swh/provenance/storage/__init__.py                 |   9 +
 swh/provenance/storage/interface.py                |  16 +-
 swh/provenance/storage/journal.py                  | 152 ++++++++++++++++
 swh/provenance/storage/postgresql.py               |  24 ++-
 swh/provenance/storage/rabbitmq/client.py          |   1 +
 swh/provenance/storage/rabbitmq/server.py          |   2 +-
 swh/provenance/storage/replay.py                   | 120 +++++++++++++
 swh/provenance/tests/test_directory_flatten.py     |   4 +-
 swh/provenance/tests/test_journal_client.py        |   2 +
 .../tests/test_provenance_journal_writer.py        | 193 +++++++++++++++++++++
 .../tests/test_provenance_journal_writer_kafka.py  |  46 +++++
 swh/provenance/tests/test_provenance_storage.py    |  30 ++--
 swh/provenance/tests/test_replay.py                | 170 ++++++++++++++++++
 .../tests/test_revision_content_layer.py           |   8 +-
 tox.ini                                            |   4 +-
 18 files changed, 753 insertions(+), 55 deletions(-)
 create mode 100644 swh/provenance/storage/journal.py
 create mode 100644 swh/provenance/storage/replay.py
 create mode 100644 swh/provenance/tests/test_provenance_journal_writer.py
 create mode 100644 swh/provenance/tests/test_provenance_journal_writer_kafka.py
 create mode 100644 swh/provenance/tests/test_replay.py
Changes applied before test
commit f140cec057052c6ee26a0f6bab33d60df7b50c1c
Author: David Douard <david.douard@sdfa3.org>
Date:   Tue Oct 11 12:17:39 2022 +0200

    Add a journal replayer for the revision layer
    
    This simple replayer implementation is probably not good enought for a
    real life scenario; especially since it will probably not be robust
    enought against out-of-order kafka messages handling.
    
    As usual when kafka is involved in tests, provided new tests are a bit
    slow...

commit 08f2e604b0743845acb17b6cf7ea4b0fc749e1e3
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Oct 7 18:23:09 2022 +0200

    Add support for kafka journalization of the ProvenanceStorageInterface
    
    the new ProvenanceStorageJournal is a proxy ProvenanceStorageInterface
    that will push added objects in a swh-journal (typ. a kafka).
    
    Journal messages are simple dicts with 2 keys: id (the sharding key) and
    value (a serialiazable version of the argument of the xxx_add() method).
    
    Use the 'kafka' pytest marker for all kafka-related tests (especially
    used for tox, see tox.ini).

commit 7e6a62c990b76ac63ee53be1f4c1c147bba4b806
Author: David Douard <david.douard@sdfa3.org>
Date:   Tue Oct 11 16:30:46 2022 +0200

    Rename ProvenanceInterface.directory_xxx_flattenned as directory_xxx_flattened
    
    and fix all occurrences of the typo.

commit 2bd74fc7d97d40d7132a6530cd0078e3ffb8c614
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Oct 7 14:51:09 2022 +0200

    Normalize _add() methods of the ProvenanceStorage interface
    
    make them all accept a Dict[Sha1Git, xxx] as argument, ie:
    
    - remove support for Iterable[bytes] in revision_add, and
    - replace Iterable[bytes] by Dict[Sha1Git, bytes] for location_add
    
    Currently, the sha1 of location path in location_add() is not really
    used by any backend, so the computation of said hashed is a waste of
    resource, but it makes the API of this interface much more consistent
    which will be helpful for coming features (like kafka journal).

Link to build: https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/686/
See console output for more information: https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/686/console

swh/provenance/storage/replay.py
113–120

geez you're right, it's much better that way...

the dict(next()) is because at this point we get a list of 1-element dict like [{id1: val1}, {id2: val2}. ...] which is converted as a simple dict {id1: val1, id2: val2, ...}

All this is not very pretty, I know...

simplify a bit the code (vlorentz' suggestion)

Build is green

Patch application report for D8658 (id=31279)

Could not rebase; Attempt merge onto 6f4a193e90...

Updating 6f4a193..88c94ea
Fast-forward
 swh/provenance/algos/directory.py                  |  10 +-
 swh/provenance/interface.py                        |   8 +-
 swh/provenance/provenance.py                       |   9 +-
 swh/provenance/storage/__init__.py                 |   9 +
 swh/provenance/storage/interface.py                |  16 +-
 swh/provenance/storage/journal.py                  | 152 ++++++++++++++++
 swh/provenance/storage/postgresql.py               |  24 ++-
 swh/provenance/storage/rabbitmq/client.py          |   1 +
 swh/provenance/storage/rabbitmq/server.py          |   2 +-
 swh/provenance/storage/replay.py                   | 117 +++++++++++++
 swh/provenance/tests/test_directory_flatten.py     |   4 +-
 swh/provenance/tests/test_journal_client.py        |   2 +
 .../tests/test_provenance_journal_writer.py        | 193 +++++++++++++++++++++
 .../tests/test_provenance_journal_writer_kafka.py  |  46 +++++
 swh/provenance/tests/test_provenance_storage.py    |  30 ++--
 swh/provenance/tests/test_replay.py                | 170 ++++++++++++++++++
 .../tests/test_revision_content_layer.py           |   8 +-
 tox.ini                                            |   4 +-
 18 files changed, 750 insertions(+), 55 deletions(-)
 create mode 100644 swh/provenance/storage/journal.py
 create mode 100644 swh/provenance/storage/replay.py
 create mode 100644 swh/provenance/tests/test_provenance_journal_writer.py
 create mode 100644 swh/provenance/tests/test_provenance_journal_writer_kafka.py
 create mode 100644 swh/provenance/tests/test_replay.py
Changes applied before test
commit 88c94ea44f33bd43952f1cd308f873b08ad5b30c
Author: David Douard <david.douard@sdfa3.org>
Date:   Tue Oct 11 12:17:39 2022 +0200

    Add a journal replayer for the revision layer
    
    This simple replayer implementation is probably not good enought for a
    real life scenario; especially since it will probably not be robust
    enought against out-of-order kafka messages handling.
    
    As usual when kafka is involved in tests, provided new tests are a bit
    slow...

commit 08f2e604b0743845acb17b6cf7ea4b0fc749e1e3
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Oct 7 18:23:09 2022 +0200

    Add support for kafka journalization of the ProvenanceStorageInterface
    
    the new ProvenanceStorageJournal is a proxy ProvenanceStorageInterface
    that will push added objects in a swh-journal (typ. a kafka).
    
    Journal messages are simple dicts with 2 keys: id (the sharding key) and
    value (a serialiazable version of the argument of the xxx_add() method).
    
    Use the 'kafka' pytest marker for all kafka-related tests (especially
    used for tox, see tox.ini).

commit 7e6a62c990b76ac63ee53be1f4c1c147bba4b806
Author: David Douard <david.douard@sdfa3.org>
Date:   Tue Oct 11 16:30:46 2022 +0200

    Rename ProvenanceInterface.directory_xxx_flattenned as directory_xxx_flattened
    
    and fix all occurrences of the typo.

commit 2bd74fc7d97d40d7132a6530cd0078e3ffb8c614
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Oct 7 14:51:09 2022 +0200

    Normalize _add() methods of the ProvenanceStorage interface
    
    make them all accept a Dict[Sha1Git, xxx] as argument, ie:
    
    - remove support for Iterable[bytes] in revision_add, and
    - replace Iterable[bytes] by Dict[Sha1Git, bytes] for location_add
    
    Currently, the sha1 of location path in location_add() is not really
    used by any backend, so the computation of said hashed is a waste of
    resource, but it makes the API of this interface much more consistent
    which will be helpful for coming features (like kafka journal).

See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/687/ for more details.

Attempt to make replaying work in the general situation

namely adding a preps revision in this diff that makes relation_add sql
function prefill entity tables if needed.

Depends on D8668

Build was aborted

Patch application report for D8658 (id=31307)

Could not rebase; Attempt merge onto b3fa1f5924...

Updating b3fa1f5..0c70e5c
Fast-forward
 swh/provenance/provenance.py                       |  16 -
 swh/provenance/sql/15-flavor.sql                   |   8 +-
 swh/provenance/sql/30-schema.sql                   |   5 +-
 swh/provenance/sql/40-funcs.sql                    | 364 +++------------------
 swh/provenance/sql/upgrades/004.sql                |  26 ++
 swh/provenance/storage/interface.py                |   4 -
 swh/provenance/storage/journal.py                  |   3 -
 swh/provenance/storage/postgresql.py               |  29 +-
 swh/provenance/storage/replay.py                   | 103 ++++++
 swh/provenance/tests/conftest.py                   |   4 +-
 swh/provenance/tests/test_cli.py                   |  17 +-
 swh/provenance/tests/test_provenance_db.py         |   6 +-
 swh/provenance/tests/test_provenance_storage.py    | 112 ++-----
 ....py => test_provenance_storage_denormalized.py} |   2 +-
 .../tests/test_provenance_storage_rabbitmq.py      |   7 +-
 ...st_provenance_storage_with_path_denormalized.py |  24 --
 ...provenance_storage_without_path_denormalized.py |  24 --
 swh/provenance/tests/test_replay.py                | 170 ++++++++++
 .../tests/test_revision_content_layer.py           |  44 +--
 19 files changed, 427 insertions(+), 541 deletions(-)
 create mode 100644 swh/provenance/sql/upgrades/004.sql
 create mode 100644 swh/provenance/storage/replay.py
 rename swh/provenance/tests/{test_provenance_storage_without_path.py => test_provenance_storage_denormalized.py} (95%)
 delete mode 100644 swh/provenance/tests/test_provenance_storage_with_path_denormalized.py
 delete mode 100644 swh/provenance/tests/test_provenance_storage_without_path_denormalized.py
 create mode 100644 swh/provenance/tests/test_replay.py
Changes applied before test
commit 0c70e5ce406b8ba9879776e29a5d939b620a9def
Author: David Douard <david.douard@sdfa3.org>
Date:   Tue Oct 11 12:17:39 2022 +0200

    Add a journal replayer for the revision layer
    
    This simple replayer implementation is probably not good enought for a
    real life scenario; especially since it will probably not be robust
    enought against out-of-order kafka messages handling.
    
    As usual when kafka is involved in tests, provided new tests are a bit
    slow...

commit 571fd56ae51311f60acb4d938785919f63e3e3d7
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Oct 12 15:18:52 2022 +0200

    Make relation_add sql function prefill entity tables if needed
    
    instead of depending on the proper behavior of the user of the
    ProvenanceStoragePostgresql user.

commit 9f33b2df31c9449e52e00984bbc614a0152bdfd1
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Oct 12 15:16:54 2022 +0200

    Remove the without-path db flavor
    
    it's not used, and keeping it makes code unnecessarly complex.

Link to build: https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/689/
See console output for more information: https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/689/console

Build was aborted

Patch application report for D8658 (id=31310)

Could not rebase; Attempt merge onto b3fa1f5924...

Updating b3fa1f5..e654062
Fast-forward
 swh/provenance/provenance.py                       |  16 -
 swh/provenance/sql/15-flavor.sql                   |   8 +-
 swh/provenance/sql/30-schema.sql                   |   5 +-
 swh/provenance/sql/40-funcs.sql                    | 364 +++------------------
 swh/provenance/sql/upgrades/004.sql                |  26 ++
 swh/provenance/storage/interface.py                |   4 -
 swh/provenance/storage/journal.py                  |   3 -
 swh/provenance/storage/postgresql.py               |  29 +-
 swh/provenance/storage/replay.py                   | 103 ++++++
 swh/provenance/tests/conftest.py                   |   4 +-
 swh/provenance/tests/test_cli.py                   |  17 +-
 swh/provenance/tests/test_provenance_db.py         |   6 +-
 swh/provenance/tests/test_provenance_storage.py    | 112 ++-----
 ....py => test_provenance_storage_denormalized.py} |   2 +-
 .../tests/test_provenance_storage_rabbitmq.py      |   7 +-
 ...st_provenance_storage_with_path_denormalized.py |  24 --
 ...provenance_storage_without_path_denormalized.py |  24 --
 swh/provenance/tests/test_replay.py                | 170 ++++++++++
 .../tests/test_revision_content_layer.py           |  44 +--
 19 files changed, 427 insertions(+), 541 deletions(-)
 create mode 100644 swh/provenance/sql/upgrades/004.sql
 create mode 100644 swh/provenance/storage/replay.py
 rename swh/provenance/tests/{test_provenance_storage_without_path.py => test_provenance_storage_denormalized.py} (95%)
 delete mode 100644 swh/provenance/tests/test_provenance_storage_with_path_denormalized.py
 delete mode 100644 swh/provenance/tests/test_provenance_storage_without_path_denormalized.py
 create mode 100644 swh/provenance/tests/test_replay.py
Changes applied before test
commit e654062bcf7cf05ae23e9c4514e3e5ecb055c766
Author: David Douard <david.douard@sdfa3.org>
Date:   Tue Oct 11 12:17:39 2022 +0200

    Add a journal replayer for the revision layer
    
    This simple replayer implementation is probably not good enought for a
    real life scenario; especially since it will probably not be robust
    enought against out-of-order kafka messages handling.
    
    As usual when kafka is involved in tests, provided new tests are a bit
    slow...

commit 8246c5ebe7b5ee10dbe714b3c087d2a9534a8d18
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Oct 12 15:18:52 2022 +0200

    Make relation_add sql function prefill entity tables if needed
    
    instead of depending on the proper behavior of the user of the
    ProvenanceStoragePostgresql user.

commit 9f33b2df31c9449e52e00984bbc614a0152bdfd1
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Oct 12 15:16:54 2022 +0200

    Remove the without-path db flavor
    
    it's not used, and keeping it makes code unnecessarly complex.

Link to build: https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/690/
See console output for more information: https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/690/console

Build is green

Patch application report for D8658 (id=31310)

Could not rebase; Attempt merge onto b3fa1f5924...

Updating b3fa1f5..e654062
Fast-forward
 swh/provenance/provenance.py                       |  16 -
 swh/provenance/sql/15-flavor.sql                   |   8 +-
 swh/provenance/sql/30-schema.sql                   |   5 +-
 swh/provenance/sql/40-funcs.sql                    | 364 +++------------------
 swh/provenance/sql/upgrades/004.sql                |  26 ++
 swh/provenance/storage/interface.py                |   4 -
 swh/provenance/storage/journal.py                  |   3 -
 swh/provenance/storage/postgresql.py               |  29 +-
 swh/provenance/storage/replay.py                   | 103 ++++++
 swh/provenance/tests/conftest.py                   |   4 +-
 swh/provenance/tests/test_cli.py                   |  17 +-
 swh/provenance/tests/test_provenance_db.py         |   6 +-
 swh/provenance/tests/test_provenance_storage.py    | 112 ++-----
 ....py => test_provenance_storage_denormalized.py} |   2 +-
 .../tests/test_provenance_storage_rabbitmq.py      |   7 +-
 ...st_provenance_storage_with_path_denormalized.py |  24 --
 ...provenance_storage_without_path_denormalized.py |  24 --
 swh/provenance/tests/test_replay.py                | 170 ++++++++++
 .../tests/test_revision_content_layer.py           |  44 +--
 19 files changed, 427 insertions(+), 541 deletions(-)
 create mode 100644 swh/provenance/sql/upgrades/004.sql
 create mode 100644 swh/provenance/storage/replay.py
 rename swh/provenance/tests/{test_provenance_storage_without_path.py => test_provenance_storage_denormalized.py} (95%)
 delete mode 100644 swh/provenance/tests/test_provenance_storage_with_path_denormalized.py
 delete mode 100644 swh/provenance/tests/test_provenance_storage_without_path_denormalized.py
 create mode 100644 swh/provenance/tests/test_replay.py
Changes applied before test
commit e654062bcf7cf05ae23e9c4514e3e5ecb055c766
Author: David Douard <david.douard@sdfa3.org>
Date:   Tue Oct 11 12:17:39 2022 +0200

    Add a journal replayer for the revision layer
    
    This simple replayer implementation is probably not good enought for a
    real life scenario; especially since it will probably not be robust
    enought against out-of-order kafka messages handling.
    
    As usual when kafka is involved in tests, provided new tests are a bit
    slow...

commit 8246c5ebe7b5ee10dbe714b3c087d2a9534a8d18
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Oct 12 15:18:52 2022 +0200

    Make relation_add sql function prefill entity tables if needed
    
    instead of depending on the proper behavior of the user of the
    ProvenanceStoragePostgresql user.

commit 9f33b2df31c9449e52e00984bbc614a0152bdfd1
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Oct 12 15:16:54 2022 +0200

    Remove the without-path db flavor
    
    it's not used, and keeping it makes code unnecessarly complex.

See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/691/ for more details.

swh/provenance/storage/replay.py
113–120

where do this singleton dicts come from? swh-journal?

swh/provenance/storage/replay.py
113–120

I'd say it's related with the way I put objects in kafka where I do not "use" the message key so this later is repeated in the message payload (dict), which probably is not necessary...

(not sure to remember why I did this that way; I suspect the reason I chose to do that vanished at some point in a later refactoring but I did not notice).

This seems to work:

diff --git a/swh/provenance/storage/replay.py b/swh/provenance/storage/replay.py
index d42d134..63e44cb 100644
--- a/swh/provenance/storage/replay.py
+++ b/swh/provenance/storage/replay.py
@@ -4,7 +4,7 @@
 # See top-level LICENSE file for more information
 
 import logging
-from typing import Any, Callable, Dict, List, Optional
+from typing import Any, Callable, Dict, List, Optional, Tuple
 
 try:
     from systemd.daemon import notify
@@ -29,19 +29,19 @@
 
 
 def cvrt_directory(msg_d):
-    return {msg_d["id"]: DirectoryData(**msg_d["value"])}
+    return (msg_d["id"], DirectoryData(**msg_d["value"]))
 
 
 def cvrt_revision(msg_d):
-    return {msg_d["id"]: RevisionData(**msg_d["value"])}
+    return (msg_d["id"], RevisionData(**msg_d["value"]))
 
 
 def cvrt_default(msg_d):
-    return {msg_d["id"]: msg_d["value"]}
+    return (msg_d["id"], msg_d["value"])
 
 
 def cvrt_relation(msg_d):
-    return {msg_d["id"]: {RelationData(**v) for v in msg_d["value"]}}
+    return (msg_d["id"], {RelationData(**v) for v in msg_d["value"]})
 
 
 OBJECT_CONVERTERS: Dict[str, Callable[[Dict], Dict]] = {
@@ -75,7 +75,9 @@ def report_failure(self, msg: bytes, obj: Dict):
 
 
 def process_replay_objects(
-    all_objects: Dict[str, List[Dict]], *, storage: ProvenanceStorageInterface
+    all_objects: Dict[str, List[Tuple[bytes, Any]]],
+    *,
+    storage: ProvenanceStorageInterface,
 ) -> None:
     for object_type, objects in all_objects.items():
         logger.debug("Inserting %s %s objects", len(objects), object_type)
@@ -89,14 +91,16 @@ def process_replay_objects(
 
 
 def _insert_objects(
-    object_type: str, objects: List[Any], storage: ProvenanceStorageInterface
+    object_type: str,
+    objects: List[Tuple[bytes, Any]],
+    storage: ProvenanceStorageInterface,
 ) -> None:
     """Insert objects of type object_type in the storage."""
     if object_type not in OBJECT_CONVERTERS:
         logger.warning("Received a series of %s, this should not happen", object_type)
         return
 
-    data = dict(next(iter(obj.items())) for obj in objects)
+    data = dict(objects)
     if "_in_" in object_type:
         storage.relation_add(relation=RelationType(object_type), data=data)
     else:
swh/provenance/storage/replay.py
113–120

ah no, the way swh-journal currently works make it hard to access the kafka message key, so I had to embed it in the msg.value.

This seems to work:

diff --git a/swh/provenance/storage/replay.py b/swh/provenance/storage/replay.py
index d42d134..63e44cb 100644
--- a/swh/provenance/storage/replay.py
+++ b/swh/provenance/storage/replay.py
@@ -4,7 +4,7 @@
 # See top-level LICENSE file for more information
 
 import logging
-from typing import Any, Callable, Dict, List, Optional
+from typing import Any, Callable, Dict, List, Optional, Tuple
 
 try:
     from systemd.daemon import notify
@@ -29,19 +29,19 @@
 
 
 def cvrt_directory(msg_d):
-    return {msg_d["id"]: DirectoryData(**msg_d["value"])}
+    return (msg_d["id"], DirectoryData(**msg_d["value"]))
 
 
 def cvrt_revision(msg_d):
-    return {msg_d["id"]: RevisionData(**msg_d["value"])}
+    return (msg_d["id"], RevisionData(**msg_d["value"]))
 
 
 def cvrt_default(msg_d):
-    return {msg_d["id"]: msg_d["value"]}
+    return (msg_d["id"], msg_d["value"])
 
 
 def cvrt_relation(msg_d):
-    return {msg_d["id"]: {RelationData(**v) for v in msg_d["value"]}}
+    return (msg_d["id"], {RelationData(**v) for v in msg_d["value"]})
 
 
 OBJECT_CONVERTERS: Dict[str, Callable[[Dict], Dict]] = {
@@ -75,7 +75,9 @@ def report_failure(self, msg: bytes, obj: Dict):
 
 
 def process_replay_objects(
-    all_objects: Dict[str, List[Dict]], *, storage: ProvenanceStorageInterface
+    all_objects: Dict[str, List[Tuple[bytes, Any]]],
+    *,
+    storage: ProvenanceStorageInterface,
 ) -> None:
     for object_type, objects in all_objects.items():
         logger.debug("Inserting %s %s objects", len(objects), object_type)
@@ -89,14 +91,16 @@ def process_replay_objects(
 
 
 def _insert_objects(
-    object_type: str, objects: List[Any], storage: ProvenanceStorageInterface
+    object_type: str,
+    objects: List[Tuple[bytes, Any]],
+    storage: ProvenanceStorageInterface,
 ) -> None:
     """Insert objects of type object_type in the storage."""
     if object_type not in OBJECT_CONVERTERS:
         logger.warning("Received a series of %s, this should not happen", object_type)
         return
 
-    data = dict(next(iter(obj.items())) for obj in objects)
+    data = dict(objects)
     if "_in_" in object_type:
         storage.relation_add(relation=RelationType(object_type), data=data)
     else:

obviously yes! doh! (thx)

This revision is now accepted and ready to land.Oct 13 2022, 11:19 AM

Build is green

Patch application report for D8658 (id=31319)

Could not rebase; Attempt merge onto b3fa1f5924...

Updating b3fa1f5..0850a39
Fast-forward
 swh/provenance/provenance.py                       |  16 -
 swh/provenance/sql/15-flavor.sql                   |   8 +-
 swh/provenance/sql/30-schema.sql                   |   5 +-
 swh/provenance/sql/40-funcs.sql                    | 365 +++------------------
 swh/provenance/sql/upgrades/004.sql                |  29 ++
 swh/provenance/storage/interface.py                |   4 -
 swh/provenance/storage/journal.py                  |   3 -
 swh/provenance/storage/postgresql.py               |  29 +-
 swh/provenance/storage/replay.py                   | 107 ++++++
 swh/provenance/tests/conftest.py                   |   4 +-
 swh/provenance/tests/test_cli.py                   |  17 +-
 swh/provenance/tests/test_provenance_db.py         |   6 +-
 swh/provenance/tests/test_provenance_storage.py    | 112 ++-----
 ....py => test_provenance_storage_denormalized.py} |   2 +-
 .../tests/test_provenance_storage_rabbitmq.py      |   7 +-
 ...st_provenance_storage_with_path_denormalized.py |  24 --
 ...provenance_storage_without_path_denormalized.py |  24 --
 swh/provenance/tests/test_replay.py                | 170 ++++++++++
 .../tests/test_revision_content_layer.py           |  44 +--
 19 files changed, 434 insertions(+), 542 deletions(-)
 create mode 100644 swh/provenance/sql/upgrades/004.sql
 create mode 100644 swh/provenance/storage/replay.py
 rename swh/provenance/tests/{test_provenance_storage_without_path.py => test_provenance_storage_denormalized.py} (95%)
 delete mode 100644 swh/provenance/tests/test_provenance_storage_with_path_denormalized.py
 delete mode 100644 swh/provenance/tests/test_provenance_storage_without_path_denormalized.py
 create mode 100644 swh/provenance/tests/test_replay.py
Changes applied before test
commit 0850a3943df235a94724c2bc0ba7c3fd702d67b6
Author: David Douard <david.douard@sdfa3.org>
Date:   Tue Oct 11 12:17:39 2022 +0200

    Add a journal replayer for the revision layer
    
    As usual when kafka is involved in tests, provided new tests are a bit
    slow...

commit e1da37d4375f859c139f247f14b971228cf43a7b
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Oct 12 15:18:52 2022 +0200

    Make relation_add sql function prefill entity tables if needed
    
    instead of depending on the proper behavior of the user of the
    ProvenanceStoragePostgresql user.

commit a0b2f0e9da09b8d2ddbaa80dfd57473185327dd7
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Oct 12 15:16:54 2022 +0200

    Remove the without-path db flavor
    
    it's not used, and keeping it makes code unnecessarly complex.

See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/694/ for more details.