Add support for kafka journalization of the ProvenanceStorageInterface
Closed, Public

Authored by douardda on Oct 11 2022, 12:25 PM.

Details

Summary

The new ProvenanceStorageJournal is a proxy ProvenanceStorageInterface
that pushes added objects to a swh-journal (typically Kafka).

Journal messages are simple dicts with 2 keys: id (the sharding key) and
value (a serializable version of the argument of the xxx_add() method).
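
A minimal sketch of the proxy idea described above, not the code from this diff. The classes InMemoryJournalWriter, InMemoryStorage and JournalingProxy are hypothetical stand-ins, and the choice to journal before calling the backend, as well as emitting one message per dict entry, are assumptions made for illustration.

```python
from typing import Any, Dict, List, Tuple


class InMemoryJournalWriter:
    """Hypothetical stand-in for a journal writer; it only records messages."""

    def __init__(self) -> None:
        self.messages: List[Tuple[str, Dict[str, Any]]] = []

    def write_addition(self, object_type: str, message: Dict[str, Any]) -> None:
        self.messages.append((object_type, message))


class InMemoryStorage:
    """Hypothetical backend exposing a single xxx_add() method."""

    def __init__(self) -> None:
        self.locations: Dict[bytes, bytes] = {}

    def location_add(self, paths: Dict[bytes, bytes]) -> bool:
        self.locations.update(paths)
        return True


class JournalingProxy:
    """Journal every added object, then forward the call to the wrapped storage."""

    def __init__(self, storage: InMemoryStorage, journal: InMemoryJournalWriter) -> None:
        self.storage = storage
        self.journal = journal

    def location_add(self, paths: Dict[bytes, bytes]) -> bool:
        for key, value in paths.items():
            # "id" is the sharding key, "value" the serializable payload.
            self.journal.write_addition("location", {"id": key, "value": value})
        return self.storage.location_add(paths)


journal = InMemoryJournalWriter()
proxy = JournalingProxy(InMemoryStorage(), journal)
proxy.location_add({b"\x01" * 20: b"src/main.c"})
```

Because the proxy exposes the same method as the backend it wraps, callers do not need to know whether journaling is enabled.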

Depends on D8656
Related to T4616

Event Timeline

Build was aborted

Patch application report for D8657 (id=31245)

Could not rebase; Attempt merge onto 6f4a193e90...

Updating 6f4a193..c8ddd30
Fast-forward
 swh/provenance/provenance.py                       |  11 +-
 swh/provenance/storage/__init__.py                 |   9 +
 swh/provenance/storage/interface.py                |  10 +-
 swh/provenance/storage/journal.py                  | 152 ++++++++++++++++
 swh/provenance/storage/postgresql.py               |  20 +--
 swh/provenance/storage/rabbitmq/client.py          |   1 +
 swh/provenance/storage/rabbitmq/server.py          |   2 +-
 .../tests/test_provenance_journal_writer.py        | 193 +++++++++++++++++++++
 .../tests/test_provenance_journal_writer_kafka.py  |  41 +++++
 swh/provenance/tests/test_provenance_storage.py    |  30 ++--
 .../tests/test_revision_content_layer.py           |   6 +-
 11 files changed, 438 insertions(+), 37 deletions(-)
 create mode 100644 swh/provenance/storage/journal.py
 create mode 100644 swh/provenance/tests/test_provenance_journal_writer.py
 create mode 100644 swh/provenance/tests/test_provenance_journal_writer_kafka.py
Changes applied before test
commit c8ddd305cb942eb5bd491f305b3fedaf8d094666
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Oct 7 18:23:09 2022 +0200

    Add support for kafka journalization of the ProvenanceStorageInterface
    
    the new ProvenanceStorageJournal is a proxy ProvenanceStorageInterface
    that will push added objects to a swh-journal (typically Kafka).
    
    Journal messages are simple dicts with 2 keys: id (the sharding key) and
    value (a serializable version of the argument of the xxx_add() method).

commit d182fcdbc532f1769bec129b2e37c36a59d021de
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Oct 7 14:51:09 2022 +0200

    Normalize _add() methods of the ProvenanceStorage interface
    
    make them all accept a Dict[Sha1Git, xxx] as argument, i.e.:
    
    - remove support for Iterable[bytes] in revision_add, and
    - replace Iterable[bytes] by Dict[Sha1Git, bytes] for location_add
    
    Currently, the sha1 of location path in location_add() is not really
    used by any backend, so the computation of said hash is a waste of
    resources, but it makes the API of this interface much more consistent,
    which will be helpful for coming features (like kafka journal).

Link to build: https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/674/
See console output for more information: https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/674/console
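
The normalization described in the second commit message above (location_add taking a Dict[Sha1Git, bytes] instead of an Iterable[bytes]) could look roughly like this from the caller's side. This is only a sketch: the helper name paths_to_location_dict is hypothetical, Sha1Git is aliased to bytes here as an assumption, and keying each path by the plain SHA-1 of its bytes is a guess at what "the sha1 of location path" means.

```python
import hashlib
from typing import Dict, Iterable

# Assumption: a 20-byte digest, standing in for the Sha1Git type named in the commit.
Sha1Git = bytes


def paths_to_location_dict(paths: Iterable[bytes]) -> Dict[Sha1Git, bytes]:
    """Key each path by the SHA-1 digest of its bytes (duplicates collapse)."""
    return {hashlib.sha1(path).digest(): path for path in paths}


locations = paths_to_location_dict([b"src/main.c", b"README"])
```

Giving every added object a key is what lets a journal proxy use that key as the message id, which is the consistency benefit the commit message mentions.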

Harbormaster returned this revision to the author for changes because remote builds failed. Oct 11 2022, 12:42 PM
Harbormaster failed remote builds in B32197: Diff 31245!

rebase, add (and use in tox.ini) a 'kafka' pytest marker

Build was aborted

Patch application report for D8657 (id=31250)

Could not rebase; Attempt merge onto 6f4a193e90...

Updating 6f4a193..b0535b5
Fast-forward
 swh/provenance/provenance.py                       |  11 +-
 swh/provenance/storage/__init__.py                 |   9 +
 swh/provenance/storage/interface.py                |  10 +-
 swh/provenance/storage/journal.py                  | 152 ++++++++++++++++
 swh/provenance/storage/postgresql.py               |  20 +--
 swh/provenance/storage/rabbitmq/client.py          |   1 +
 swh/provenance/storage/rabbitmq/server.py          |   2 +-
 swh/provenance/tests/test_journal_client.py        |   2 +
 .../tests/test_provenance_journal_writer.py        | 193 +++++++++++++++++++++
 .../tests/test_provenance_journal_writer_kafka.py  |  46 +++++
 swh/provenance/tests/test_provenance_storage.py    |  30 ++--
 .../tests/test_revision_content_layer.py           |   6 +-
 tox.ini                                            |   4 +-
 13 files changed, 447 insertions(+), 39 deletions(-)
 create mode 100644 swh/provenance/storage/journal.py
 create mode 100644 swh/provenance/tests/test_provenance_journal_writer.py
 create mode 100644 swh/provenance/tests/test_provenance_journal_writer_kafka.py
Changes applied before test
commit b0535b51c49af1f7a8ac82b05785c086369f96b1
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Oct 7 18:23:09 2022 +0200

    Add support for kafka journalization of the ProvenanceStorageInterface
    
    the new ProvenanceStorageJournal is a proxy ProvenanceStorageInterface
    that will push added objects to a swh-journal (typically Kafka).
    
    Journal messages are simple dicts with 2 keys: id (the sharding key) and
    value (a serializable version of the argument of the xxx_add() method).
    
    Use the 'kafka' pytest marker for all kafka-related tests (especially
    used for tox, see tox.ini).

commit 6b539ecbe1673caf539e0636de76881a3f8ed171
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Oct 7 14:51:09 2022 +0200

    Normalize _add() methods of the ProvenanceStorage interface
    
    make them all accept a Dict[Sha1Git, xxx] as argument, i.e.:
    
    - remove support for Iterable[bytes] in revision_add, and
    - replace Iterable[bytes] by Dict[Sha1Git, bytes] for location_add
    
    Currently, the sha1 of location path in location_add() is not really
    used by any backend, so the computation of said hash is a waste of
    resources, but it makes the API of this interface much more consistent,
    which will be helpful for coming features (like kafka journal).

Link to build: https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/677/
See console output for more information: https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/677/console

Harbormaster returned this revision to the author for changes because remote builds failed. Oct 11 2022, 2:19 PM
Harbormaster failed remote builds in B32202: Diff 31250!

Build is green

Patch application report for D8657 (id=31250)

See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/680/ for more details.

vlorentz added a subscriber: vlorentz.
vlorentz added inline comments.
swh/provenance/storage/journal.py
93–96

s/flattenned/flattened/ btw

swh/provenance/tests/test_provenance_journal_writer.py
46–50

(same below)

This revision is now accepted and ready to land. Oct 11 2022, 4:00 PM
douardda added inline comments.
swh/provenance/storage/journal.py
93–96

I know but that's the way the interface is defined... but nothing really prevents me from fixing it I guess...

swh/provenance/tests/test_provenance_journal_writer.py
46–50

ah yes I remember seeing these and thinking about "fixing" them. seems I forgot then.

Build was aborted

Patch application report for D8657 (id=31272)

Could not rebase; Attempt merge onto 6f4a193e90...

Updating 6f4a193..2b0d957
Fast-forward
 swh/provenance/provenance.py                       |   5 +-
 swh/provenance/storage/__init__.py                 |   9 +
 swh/provenance/storage/interface.py                |  10 +-
 swh/provenance/storage/journal.py                  | 152 ++++++++++++++++
 swh/provenance/storage/postgresql.py               |  20 +--
 swh/provenance/storage/rabbitmq/client.py          |   1 +
 swh/provenance/storage/rabbitmq/server.py          |   2 +-
 swh/provenance/tests/test_journal_client.py        |   2 +
 .../tests/test_provenance_journal_writer.py        | 193 +++++++++++++++++++++
 .../tests/test_provenance_journal_writer_kafka.py  |  46 +++++
 swh/provenance/tests/test_provenance_storage.py    |  30 ++--
 .../tests/test_revision_content_layer.py           |   6 +-
 tox.ini                                            |   4 +-
 13 files changed, 444 insertions(+), 36 deletions(-)
 create mode 100644 swh/provenance/storage/journal.py
 create mode 100644 swh/provenance/tests/test_provenance_journal_writer.py
 create mode 100644 swh/provenance/tests/test_provenance_journal_writer_kafka.py
Changes applied before test
commit 2b0d9572035ab5420955203537872799289733ff
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Oct 7 18:23:09 2022 +0200

    Add support for kafka journalization of the ProvenanceStorageInterface
    
    the new ProvenanceStorageJournal is a proxy ProvenanceStorageInterface
    that will push added objects to a swh-journal (typically Kafka).
    
    Journal messages are simple dicts with 2 keys: id (the sharding key) and
    value (a serializable version of the argument of the xxx_add() method).
    
    Use the 'kafka' pytest marker for all kafka-related tests (especially
    used for tox, see tox.ini).

commit 2bd74fc7d97d40d7132a6530cd0078e3ffb8c614
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Oct 7 14:51:09 2022 +0200

    Normalize _add() methods of the ProvenanceStorage interface
    
    make them all accept a Dict[Sha1Git, xxx] as argument, i.e.:
    
    - remove support for Iterable[bytes] in revision_add, and
    - replace Iterable[bytes] by Dict[Sha1Git, bytes] for location_add
    
    Currently, the sha1 of location path in location_add() is not really
    used by any backend, so the computation of said hash is a waste of
    resources, but it makes the API of this interface much more consistent,
    which will be helpful for coming features (like kafka journal).

Link to build: https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/682/
See console output for more information: https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/682/console

douardda marked an inline comment as done.

rebase

Build was aborted

Patch application report for D8657 (id=31274)

Could not rebase; Attempt merge onto 6f4a193e90...

Updating 6f4a193..08f2e60
Fast-forward
 swh/provenance/algos/directory.py                  |  10 +-
 swh/provenance/interface.py                        |   8 +-
 swh/provenance/provenance.py                       |   9 +-
 swh/provenance/storage/__init__.py                 |   9 +
 swh/provenance/storage/interface.py                |  16 +-
 swh/provenance/storage/journal.py                  | 152 ++++++++++++++++
 swh/provenance/storage/postgresql.py               |  24 ++-
 swh/provenance/storage/rabbitmq/client.py          |   1 +
 swh/provenance/storage/rabbitmq/server.py          |   2 +-
 swh/provenance/tests/test_directory_flatten.py     |   4 +-
 swh/provenance/tests/test_journal_client.py        |   2 +
 .../tests/test_provenance_journal_writer.py        | 193 +++++++++++++++++++++
 .../tests/test_provenance_journal_writer_kafka.py  |  46 +++++
 swh/provenance/tests/test_provenance_storage.py    |  30 ++--
 .../tests/test_revision_content_layer.py           |   8 +-
 tox.ini                                            |   4 +-
 16 files changed, 463 insertions(+), 55 deletions(-)
 create mode 100644 swh/provenance/storage/journal.py
 create mode 100644 swh/provenance/tests/test_provenance_journal_writer.py
 create mode 100644 swh/provenance/tests/test_provenance_journal_writer_kafka.py
Changes applied before test
commit 08f2e604b0743845acb17b6cf7ea4b0fc749e1e3
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Oct 7 18:23:09 2022 +0200

    Add support for kafka journalization of the ProvenanceStorageInterface
    
    the new ProvenanceStorageJournal is a proxy ProvenanceStorageInterface
    that will push added objects to a swh-journal (typically Kafka).
    
    Journal messages are simple dicts with 2 keys: id (the sharding key) and
    value (a serializable version of the argument of the xxx_add() method).
    
    Use the 'kafka' pytest marker for all kafka-related tests (especially
    used for tox, see tox.ini).

commit 7e6a62c990b76ac63ee53be1f4c1c147bba4b806
Author: David Douard <david.douard@sdfa3.org>
Date:   Tue Oct 11 16:30:46 2022 +0200

    Rename ProvenanceInterface.directory_xxx_flattenned as directory_xxx_flattened
    
    and fix all occurrences of the typo.

commit 2bd74fc7d97d40d7132a6530cd0078e3ffb8c614
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Oct 7 14:51:09 2022 +0200

    Normalize _add() methods of the ProvenanceStorage interface
    
    make them all accept a Dict[Sha1Git, xxx] as argument, i.e.:
    
    - remove support for Iterable[bytes] in revision_add, and
    - replace Iterable[bytes] by Dict[Sha1Git, bytes] for location_add
    
    Currently, the sha1 of location path in location_add() is not really
    used by any backend, so the computation of said hash is a waste of
    resources, but it makes the API of this interface much more consistent,
    which will be helpful for coming features (like kafka journal).

Link to build: https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/684/
See console output for more information: https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/684/console

Build is green

Patch application report for D8657 (id=31274)

See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/685/ for more details.