
Add new RabbitMQ-based client/server API
Closed, Public

Authored by aeviso on Aug 31 2021, 2:03 PM.

Details

Summary

New conflict resolution layer implementation using RabbitMQ to communicate
between client and server processes.

For each set method in the ProvenanceStorageInterface, the client dispatches
the information to be stored to a queue chosen by the id of the associated
entity (for a relation, the source entity).

The server spawns one sub-process per queue to handle those requests. The
split policy is defined in the server class in such a way that no write
conflicts should occur in the underlying storage.

For the get methods, the client directly accesses the underlying storage
object, for which it has its own connection (i.e. no communication through
RabbitMQ occurs).
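
As a rough illustration of the split described above, here is a minimal in-memory sketch. It is not the code from this diff: `SketchClient`, `queue_for`, `NUM_QUEUES` and the first-byte partitioning rule are all assumptions made for the example, and plain lists stand in for RabbitMQ queues.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

NUM_QUEUES = 16  # hypothetical partition count for this sketch


def queue_for(entity_id: bytes, num_queues: int = NUM_QUEUES) -> int:
    """Pick a queue index from the first byte of the entity id (assumed rule)."""
    return entity_id[0] % num_queues


class SketchClient:
    """Hypothetical client: writes go to queues, reads go straight to storage."""

    def __init__(self, storage: Dict[bytes, str]) -> None:
        # Stands in for the client's own connection to the underlying storage.
        self.storage = storage
        # Stands in for the per-partition RabbitMQ queues.
        self.queues: Dict[int, List[Tuple[bytes, str]]] = defaultdict(list)

    def content_add(self, entity_id: bytes, value: str) -> None:
        # Set method: enqueue the request on the queue that owns this id.
        self.queues[queue_for(entity_id)].append((entity_id, value))

    def content_get(self, entity_id: bytes) -> Optional[str]:
        # Get method: read directly from storage, no broker involved.
        return self.storage.get(entity_id)
```

Because a given id always maps to the same queue, a single server sub-process sees every write for that id, which is what makes the writes conflict-free in this sketch.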

Docs: https://hedgedoc.softwareheritage.org/RJQjBSR2TmuVzD6NRFFCeg

Diff Detail

Event Timeline


Build is green

Patch application report for D6165 (id=23044)

Could not rebase; Attempt merge onto 4c087ea0ec...

Updating 4c087ea..2e43218
Fast-forward
 mypy.ini                                |   3 +
 swh/provenance/__init__.py              |  29 +-
 swh/provenance/api/client.py            | 537 +++++++++++++++++++++-
 swh/provenance/api/server.py            | 792 +++++++++++++++++++++++++++++++-
 swh/provenance/cli.py                   |  34 +-
 swh/provenance/interface.py             |   9 +
 swh/provenance/mongo/backend.py         |   3 +
 swh/provenance/postgresql/provenance.py |   3 +
 swh/provenance/provenance.py            |   3 +
 swh/provenance/tests/conftest.py        |  35 +-
 10 files changed, 1411 insertions(+), 37 deletions(-)
Changes applied before test
commit 2e432185109a272a03aad42d7d380bd6ae2de4be
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Wed Sep 15 13:39:59 2021 +0200

    Improve server/client shutdown logic and error handling

commit 4ec6d3359e20920469cde43516bc120d4352f915
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Tue Aug 31 13:36:34 2021 +0200

    Rework `ProvenanceStorageRabbitMQWorker` to handle connection loss
    
    Use `pika.SelectConnection` and handle its life-cycle explicitly.
    Improve connection error handling on both client and server side.
    
    Change the RabbitMQ scheme to use 5 exchanges (one per entity + location).
    Each exchange handles all entity-related insertions, dispatching to different
    queues depending on the requested `ProvenanceStorageInterface` method (16
    queues per method). For instance, the `content` exchange handles all requests
    for `content_add` and `relation_add` for both relations `CNT_EARLY_IN_REV` and
    `CNT_IN_DIR` (i.e. relations with content as source). In each case, requests
    are forwarded to 1 of 16 possible workers, depending on the sha1 id of the
    content.

commit f6f174fb76b2c42435b6c75eb90c175d9cb0fca5
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Fri Aug 20 12:21:27 2021 +0200

    Add new RabbitMQ-based client/server API
    
    Set methods in the `ProvenanceStorageInterface` are called through a server that
    guarantees conflict-free writes to the underlying database.
    
    Get methods are called directly from the client to avoid RPC overhead for reads.
    
    The server spawns multiple processes to handle independent requests concurrently.

commit 437c2b4ad60e15c0b102988a99766b2c503c91d5
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Fri Sep 24 11:08:08 2021 +0200

    Add `close` method to both `ProvenanceInterface` and `ProvenanceStorageInterface`
    
    The idea is to have a mechanism to explicitly release resources when needed.

commit 6c3071493b5d3f187113493275d402a27866da95
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Wed Sep 15 16:14:10 2021 +0200

    Rename remote storage backend classes
    
    Make names consistent with the naming convention used for other components.

See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/371/ for more details.
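
The exchange scheme described in the `Rework ProvenanceStorageRabbitMQWorker` commit message above could be sketched roughly as follows. Only the `content` exchange mapping and the figure of 16 workers come from that message; the `route` helper, the routing-key format and the use of the first sha1 byte to pick a worker are assumptions made for illustration.

```python
from typing import Tuple

WORKERS_PER_METHOD = 16  # from the commit message: 16 queues per method

# Only the mapping spelled out in the commit message; other relations omitted.
RELATION_EXCHANGE = {
    "CNT_EARLY_IN_REV": "content",
    "CNT_IN_DIR": "content",
}


def route(method: str, sha1: bytes, relation: str = "") -> Tuple[str, str]:
    """Return (exchange, routing_key) for an insertion request.

    The routing-key format is invented for this sketch.
    """
    if method == "relation_add":
        # Relations are routed by their source entity.
        exchange = RELATION_EXCHANGE[relation]
        name = f"{method}_{relation}"
    else:
        # "content_add" -> exchange "content"
        exchange = method[: -len("_add")]
        name = method
    # Assumed rule: the first byte of the sha1 selects 1 of 16 workers.
    worker = sha1[0] % WORKERS_PER_METHOD
    return exchange, f"{name}.{worker:x}"
```

With this rule, `content_add` and both content-sourced relations for the same sha1 land on the same exchange and the same worker index, so one process serialises all writes touching that content.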

Build is green

Patch application report for D6165 (id=23046)

Could not rebase; Attempt merge onto 4c087ea0ec...

Updating 4c087ea..ba22e90
Fast-forward
 mypy.ini                                |   3 +
 swh/provenance/__init__.py              |  29 +-
 swh/provenance/api/client.py            | 537 +++++++++++++++++++++-
 swh/provenance/api/server.py            | 792 +++++++++++++++++++++++++++++++-
 swh/provenance/cli.py                   |  34 +-
 swh/provenance/interface.py             |   9 +
 swh/provenance/mongo/backend.py         |   3 +
 swh/provenance/postgresql/provenance.py |   3 +
 swh/provenance/provenance.py            |   3 +
 swh/provenance/tests/conftest.py        |  33 +-
 10 files changed, 1410 insertions(+), 36 deletions(-)
Changes applied before test
commit ba22e90f707688aed716d20253256ee12c414a33
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Wed Sep 15 13:39:59 2021 +0200

    Improve server/client shutdown logic and error handling

commit 4ec6d3359e20920469cde43516bc120d4352f915
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Tue Aug 31 13:36:34 2021 +0200

    Rework `ProvenanceStorageRabbitMQWorker` to handle connection loss
    
    Use `pika.SelectConnection` and handle its life-cycle explicitly.
    Improve connection error handling on both client and server side.
    
    Change the RabbitMQ scheme to use 5 exchanges (one per entity + location).
    Each exchange handles all entity-related insertions, dispatching to different
    queues depending on the requested `ProvenanceStorageInterface` method (16
    queues per method). For instance, the `content` exchange handles all requests
    for `content_add` and `relation_add` for both relations `CNT_EARLY_IN_REV` and
    `CNT_IN_DIR` (i.e. relations with content as source). In each case, requests
    are forwarded to 1 of 16 possible workers, depending on the sha1 id of the
    content.

commit f6f174fb76b2c42435b6c75eb90c175d9cb0fca5
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Fri Aug 20 12:21:27 2021 +0200

    Add new RabbitMQ-based client/server API
    
    Set methods in the `ProvenanceStorageInterface` are called through a server that
    guarantees conflict-free writes to the underlying database.
    
    Get methods are called directly from the client to avoid RPC overhead for reads.
    
    The server spawns multiple processes to handle independent requests concurrently.

commit 437c2b4ad60e15c0b102988a99766b2c503c91d5
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Fri Sep 24 11:08:08 2021 +0200

    Add `close` method to both `ProvenanceInterface` and `ProvenanceStorageInterface`
    
    The idea is to have a mechanism to explicitly release resources when needed.

commit 6c3071493b5d3f187113493275d402a27866da95
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Wed Sep 15 16:14:10 2021 +0200

    Rename remote storage backend classes
    
    Make names consistent with the naming convention used for other components.

See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/373/ for more details.

Thanks for this massive implementation work!

I still want to do a deeper dive in this code (and give others the chance to do so), but I think that before that, and now that bugs and wrinkles have been ironed out and this code seems to be working, we need a large pass of updating the docstrings to describe the actual behavior of the code.

I expect a lot of this is present inside the hedgedoc document, so you should try to land it as documentation at the same time as this code.

When reading this diff, I would like to find the following:

  • a description of all threads and subprocesses (on the client and server side), as well as their associated workflows (who does what)
  • a description of how RabbitMQ queues and exchanges are handled (the request queues, the response queues, the way the acknowledgements are managed)
  • a description of how objects are serialised to be passed on to the queues
  • a description of what queues feed to what server processes, and how the messages are "bundled" before being sent to the database
  • a list of "tunables" (number of queues, batch sizes, timeouts, etc.) to watch out for

I would suggest documenting the "lifecycle" of the client and server threads/processes, for instance by writing a summarised list of all the methods that are called in sequence, on initialization of the classes, with how the callbacks mesh together.

When this lifecycle doc is available (centrally), I think most of the "boilerplate" documentation that's been pulled from the pika example code can go away (with a shorter reference to the full lifecycle documentation).

In reply to @olasd's comment above (D6165#164547):

I agree about documentation. My only concern is that this will delay the experiments on the server (and other machines if possible), and I'll actually have plenty of time to work on docs while those experiments run.

Build is green

Patch application report for D6165 (id=23170)

Could not rebase; Attempt merge onto 4c087ea0ec...

Updating 4c087ea..2ca0c9b
Fast-forward
 mypy.ini                                |   3 +
 requirements.txt                        |   1 +
 swh/provenance/__init__.py              |  43 +-
 swh/provenance/api/client.py            | 537 ++++++++++++++++++++-
 swh/provenance/api/server.py            | 793 +++++++++++++++++++++++++++++++-
 swh/provenance/cli.py                   |  40 +-
 swh/provenance/interface.py             |  20 +
 swh/provenance/mongo/backend.py         |  19 +-
 swh/provenance/postgresql/provenance.py |  18 +-
 swh/provenance/provenance.py            |   6 +
 swh/provenance/tests/conftest.py        |  52 ++-
 11 files changed, 1473 insertions(+), 59 deletions(-)
Changes applied before test
commit 2ca0c9bdf640c7cbbcf84580aa211d8cb51d55a4
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Wed Sep 15 13:39:59 2021 +0200

    Improve server/client shutdown logic and error handling

commit 0ec7250ea299d42697cb3b480171efcd2926d049
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Tue Aug 31 13:36:34 2021 +0200

    Rework `ProvenanceStorageRabbitMQWorker` to handle connection loss
    
    Use `pika.SelectConnection` and handle its life-cycle explicitly.
    Improve connection error handling on both client and server side.
    
    Change the RabbitMQ scheme to use 5 exchanges (one per entity + location).
    Each exchange handles all entity-related insertions, dispatching to different
    queues depending on the requested `ProvenanceStorageInterface` method (16
    queues per method). For instance, the `content` exchange handles all requests
    for `content_add` and `relation_add` for both relations `CNT_EARLY_IN_REV` and
    `CNT_IN_DIR` (i.e. relations with content as source). In each case, requests
    are forwarded to 1 of 16 possible workers, depending on the sha1 id of the
    content.

commit f0f3a584ea6965021990ddba926ae13c29b9560a
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Fri Aug 20 12:21:27 2021 +0200

    Add new RabbitMQ-based client/server API
    
    Set methods in the `ProvenanceStorageInterface` are called through a server that
    guarantees conflict-free writes to the underlying database.
    
    Get methods are called directly from the client to avoid RPC overhead for reads.
    
    The server spawns multiple processes to handle independent requests concurrently.

commit 846b20e0e9995a13591a1641bf92036ff3764be5
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Fri Sep 24 11:08:08 2021 +0200

    Add `open`/`close` methods to both `ProvenanceInterface` and `ProvenanceStorageInterface`
    
    The idea is to have a mechanism to explicitly allocate/release resources when needed.

commit 6c3071493b5d3f187113493275d402a27866da95
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Wed Sep 15 16:14:10 2021 +0200

    Rename remote storage backend classes
    
    Make names consistent with the naming convention used for other components.

See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/395/ for more details.

move later in the commit history

Build is green

Patch application report for D6165 (id=23187)

Could not rebase; Attempt merge onto 4c087ea0ec...

Updating 4c087ea..efdce8a
Fast-forward
 mypy.ini                                |   3 +
 requirements-test.txt                   |   1 -
 requirements.txt                        |   1 +
 swh/provenance/__init__.py              |  28 +-
 swh/provenance/api/client.py            | 545 ++++++++++++++++++++-
 swh/provenance/api/server.py            | 844 +++++++++++++++++++++++++++++---
 swh/provenance/cli.py                   |  39 +-
 swh/provenance/graph.py                 |   9 +
 swh/provenance/interface.py             |  20 +
 swh/provenance/mongo/backend.py         |  41 +-
 swh/provenance/origin.py                |  17 +-
 swh/provenance/postgresql/archive.py    |  15 +-
 swh/provenance/postgresql/provenance.py |  40 +-
 swh/provenance/provenance.py            | 149 +++---
 swh/provenance/revision.py              |  31 +-
 swh/provenance/storage/archive.py       |  15 +-
 swh/provenance/tests/conftest.py        |  63 +--
 17 files changed, 1592 insertions(+), 269 deletions(-)
Changes applied before test
commit efdce8a40cfd5864efbe6971dd26365d133b9260
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Wed Sep 15 13:39:59 2021 +0200

    Improve server/client shutdown logic and error handling
    
    Add StatsD support to client to be compliant with the other provenance
    storage implementations

commit 2049a13dcd69b2e9f5d853f75158967ecaf0ead3
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Tue Aug 31 13:36:34 2021 +0200

    Rework `ProvenanceStorageRabbitMQWorker` to handle connection loss
    
    Use `pika.SelectConnection` and handle its life-cycle explicitly.
    Improve connection error handling on both client and server side.
    
    Change the RabbitMQ scheme to use 5 exchanges (one per entity + location).
    Each exchange handles all entity-related insertions, dispatching to different
    queues depending on the requested `ProvenanceStorageInterface` method (16
    queues per method). For instance, the `content` exchange handles all requests
    for `content_add` and `relation_add` for both relations `CNT_EARLY_IN_REV` and
    `CNT_IN_DIR` (i.e. relations with content as source). In each case, requests
    are forwarded to 1 of 16 possible workers, depending on the sha1 id of the
    content.

commit 462d9f84facf6d7020000673d57000326df9dd4a
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Fri Aug 20 12:21:27 2021 +0200

    Add new RabbitMQ-based client/server API
    
    Set methods in the `ProvenanceStorageInterface` are called through a server that
    guarantees conflict-free writes to the underlying database.
    
    Get methods are called directly from the client to avoid RPC overhead for reads.
    
    The server spawns multiple processes to handle independent requests concurrently.

commit 61b427c0956bd35596213df7c0f4655966227449
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Mon Sep 27 15:59:38 2021 +0200

    Make old StatsD metrics style compliant with the rest of the module

commit d9a00102c66284f358c6ced5e3fdf1057a9ba62d
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Mon Sep 27 14:08:10 2021 +0200

    Add StatsD support to graph submodule
    
    Record timing stats for graph creation and a counter of invalidated isochrone frontiers

commit 0cf3d9185f3eb8528c9cf2031ea8f94d83977ca2
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Mon Sep 27 13:53:57 2021 +0200

    Add StatsD support to provenance storage implementations

commit 0160d4f7c3cfc3f0193b729f0c04bd2ff7ad7129
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Mon Sep 27 15:21:42 2021 +0200

    Add StatsD support to provenance backend

commit 4f6bf0a4670e69730e47f519ac8bca6673be29f6
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Mon Sep 27 15:17:34 2021 +0200

    Split `Provenance::flush` method in two (one per layer)

commit 8d401db34539f5df2ce2bd37080ec8ae1557417b
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Wed Sep 1 11:27:02 2021 +0200

    Remove old client/server storage based on `swh.core.api.RPCClient`
    
    This implementation was a first attempt at conflict resolution that didn't work as expected.

commit 846b20e0e9995a13591a1641bf92036ff3764be5
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Fri Sep 24 11:08:08 2021 +0200

    Add `open`/`close` methods to both `ProvenanceInterface` and `ProvenanceStorageInterface`
    
    The idea is to have a mechanism to explicitly allocate/release resources when needed.

commit 6c3071493b5d3f187113493275d402a27866da95
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Wed Sep 15 16:14:10 2021 +0200

    Rename remote storage backend classes
    
    Make names consistent with the naming convention used for other components.

See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/409/ for more details.

Build is green

Patch application report for D6165 (id=23193)

Could not rebase; Attempt merge onto 4c087ea0ec...

Updating 4c087ea..ca1ac92
Fast-forward
 mypy.ini                                |   3 +
 requirements-test.txt                   |   1 -
 requirements.txt                        |   1 +
 swh/provenance/__init__.py              |  28 +-
 swh/provenance/api/client.py            | 541 +++++++++++++++++++-
 swh/provenance/api/server.py            | 848 +++++++++++++++++++++++++++++---
 swh/provenance/cli.py                   |  39 +-
 swh/provenance/graph.py                 |   9 +
 swh/provenance/interface.py             |  20 +
 swh/provenance/mongo/backend.py         |  41 +-
 swh/provenance/origin.py                |  17 +-
 swh/provenance/postgresql/archive.py    |  15 +-
 swh/provenance/postgresql/provenance.py |  40 +-
 swh/provenance/provenance.py            | 149 +++---
 swh/provenance/revision.py              |  31 +-
 swh/provenance/storage/archive.py       |  15 +-
 swh/provenance/tests/conftest.py        |  63 +--
 17 files changed, 1590 insertions(+), 271 deletions(-)
Changes applied before test
commit ca1ac92d5bddc8abad4a6b2283373ccf88acc3f3
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Wed Sep 15 13:39:59 2021 +0200

    Improve server/client shutdown logic and error handling
    
    Add StatsD support to client to be compliant with the other provenance
    storage implementations

commit ddd49fbd1841475f30faeb9257839a936cf2a815
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Tue Aug 31 13:36:34 2021 +0200

    Rework `ProvenanceStorageRabbitMQWorker` to handle connection loss
    
    Use `pika.SelectConnection` and handle its life-cycle explicitly.
    Improve connection error handling on both client and server side.
    
    Change the RabbitMQ scheme to use 5 exchanges (one per entity + location).
    Each exchange handles all entity-related insertions, dispatching to different
    queues depending on the requested `ProvenanceStorageInterface` method (16
    queues per method). For instance, the `content` exchange handles all requests
    for `content_add` and `relation_add` for both relations `CNT_EARLY_IN_REV` and
    `CNT_IN_DIR` (i.e. relations with content as source). In each case, requests
    are forwarded to 1 of 16 possible workers, depending on the sha1 id of the
    content.

commit a74ed65b748dd6bf3d1ec7f694a80889a7bb3b35
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Fri Aug 20 12:21:27 2021 +0200

    Add new RabbitMQ-based client/server API
    
    Set methods in the `ProvenanceStorageInterface` are called through a server that
    guarantees conflict-free writes to the underlying database.
    
    Get methods are called directly from the client to avoid RPC overhead for reads.
    
    The server spawns multiple processes to handle independent requests concurrently.

commit 61b427c0956bd35596213df7c0f4655966227449
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Mon Sep 27 15:59:38 2021 +0200

    Make old StatsD metrics style compliant with the rest of the module

commit d9a00102c66284f358c6ced5e3fdf1057a9ba62d
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Mon Sep 27 14:08:10 2021 +0200

    Add StatsD support to graph submodule
    
    Record timing stats for graph creation and a counter of invalidated isochrone frontiers

commit 0cf3d9185f3eb8528c9cf2031ea8f94d83977ca2
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Mon Sep 27 13:53:57 2021 +0200

    Add StatsD support to provenance storage implementations

commit 0160d4f7c3cfc3f0193b729f0c04bd2ff7ad7129
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Mon Sep 27 15:21:42 2021 +0200

    Add StatsD support to provenance backend

commit 4f6bf0a4670e69730e47f519ac8bca6673be29f6
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Mon Sep 27 15:17:34 2021 +0200

    Split `Provenance::flush` method in two (one per layer)

commit 8d401db34539f5df2ce2bd37080ec8ae1557417b
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Wed Sep 1 11:27:02 2021 +0200

    Remove old client/server storage based on `swh.core.api.RPCClient`
    
    This implementation was a first attempt at conflict resolution that didn't work as expected.

commit 846b20e0e9995a13591a1641bf92036ff3764be5
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Fri Sep 24 11:08:08 2021 +0200

    Add `open`/`close` methods to both `ProvenanceInterface` and `ProvenanceStorageInterface`
    
    The idea is to have a mechanism to explicitly allocate/release resources when needed.

commit 6c3071493b5d3f187113493275d402a27866da95
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Wed Sep 15 16:14:10 2021 +0200

    Rename remote storage backend classes
    
    Make names consistent with the naming convention used for other components.

See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/411/ for more details.

Build is green

Patch application report for D6165 (id=23213)

Could not rebase; Attempt merge onto 4c087ea0ec...

Updating 4c087ea..5992c08
Fast-forward
 mypy.ini                                |   3 +
 requirements-test.txt                   |   1 -
 requirements.txt                        |   2 +
 swh/provenance/__init__.py              |  28 +-
 swh/provenance/api/client.py            | 541 +++++++++++++++++++-
 swh/provenance/api/server.py            | 848 +++++++++++++++++++++++++++++---
 swh/provenance/cli.py                   |  39 +-
 swh/provenance/graph.py                 |   9 +
 swh/provenance/interface.py             |  20 +
 swh/provenance/mongo/backend.py         |  41 +-
 swh/provenance/origin.py                |  17 +-
 swh/provenance/postgresql/archive.py    |  15 +-
 swh/provenance/postgresql/provenance.py |  40 +-
 swh/provenance/provenance.py            | 149 +++---
 swh/provenance/revision.py              |  31 +-
 swh/provenance/storage/archive.py       |  15 +-
 swh/provenance/tests/conftest.py        |  63 +--
 17 files changed, 1591 insertions(+), 271 deletions(-)
Changes applied before test
commit 5992c08ff5c8f4eb2cdd8c5148375a050a4da738
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Wed Sep 15 13:39:59 2021 +0200

    Improve server/client shutdown logic and error handling
    
    Add StatsD support to client to be compliant with the other provenance
    storage implementations

commit 7513e6eff6979f8014dc23ebe0ea5c9e937de53c
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Tue Aug 31 13:36:34 2021 +0200

    Rework `ProvenanceStorageRabbitMQWorker` to handle connection loss
    
    Use `pika.SelectConnection` and handle its life-cycle explicitly.
    Improve connection error handling on both client and server side.
    
    Change the RabbitMQ scheme to use 5 exchanges (one per entity + location).
    Each exchange handles all entity-related insertions, dispatching to different
    queues depending on the requested `ProvenanceStorageInterface` method (16
    queues per method). For instance, the `content` exchange handles all requests
    for `content_add` and `relation_add` for both relations `CNT_EARLY_IN_REV` and
    `CNT_IN_DIR` (i.e. relations with content as source). In each case, requests
    are forwarded to 1 of 16 possible workers, depending on the sha1 id of the
    content.

commit af1fe50f5a6548c6f4c31adbcf8d5124796d691b
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Fri Aug 20 12:21:27 2021 +0200

    Add new RabbitMQ-based client/server API
    
    Set methods in the `ProvenanceStorageInterface` are called through a server that
    guarantees conflict-free writes to the underlying database.
    
    Get methods are called directly from the client to avoid RPC overhead for reads.
    
    The server spawns multiple processes to handle independent requests concurrently.

commit 61b427c0956bd35596213df7c0f4655966227449
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Mon Sep 27 15:59:38 2021 +0200

    Make old StatsD metrics style compliant with the rest of the module

commit d9a00102c66284f358c6ced5e3fdf1057a9ba62d
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Mon Sep 27 14:08:10 2021 +0200

    Add StatsD support to graph submodule
    
    Record timing stats for graph creation and a counter of invalidated isochrone frontiers

commit 0cf3d9185f3eb8528c9cf2031ea8f94d83977ca2
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Mon Sep 27 13:53:57 2021 +0200

    Add StatsD support to provenance storage implementations

commit 0160d4f7c3cfc3f0193b729f0c04bd2ff7ad7129
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Mon Sep 27 15:21:42 2021 +0200

    Add StatsD support to provenance backend

commit 4f6bf0a4670e69730e47f519ac8bca6673be29f6
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Mon Sep 27 15:17:34 2021 +0200

    Split `Provenance::flush` method in two (one per layer)

commit 8d401db34539f5df2ce2bd37080ec8ae1557417b
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Wed Sep 1 11:27:02 2021 +0200

    Remove old client/server storage based on `swh.core.api.RPCClient`
    
    This implementation was a first attempt at conflict resolution that didn't work as expected.

commit 846b20e0e9995a13591a1641bf92036ff3764be5
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Fri Sep 24 11:08:08 2021 +0200

    Add `open`/`close` methods to both `ProvenanceInterface` and `ProvenanceStorageInterface`
    
    The idea is to have a mechanism to explicitly allocate/release resources when needed.

commit 6c3071493b5d3f187113493275d402a27866da95
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Wed Sep 15 16:14:10 2021 +0200

    Rename remote storage backend classes
    
    Make names consistent with the naming convention used for other components.

See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/413/ for more details.

Build is green

Patch application report for D6165 (id=23275)

Could not rebase; Attempt merge onto 4c087ea0ec...

Updating 4c087ea..cd4056b
Fast-forward
 mypy.ini                                |   3 +
 requirements-test.txt                   |   1 -
 requirements.txt                        |   3 +-
 swh/provenance/__init__.py              |  28 +-
 swh/provenance/api/client.py            | 557 ++++++++++++++++++++-
 swh/provenance/api/server.py            | 846 +++++++++++++++++++++++++++++---
 swh/provenance/cli.py                   |  97 ++--
 swh/provenance/graph.py                 |   9 +
 swh/provenance/interface.py             |  47 +-
 swh/provenance/mongo/backend.py         |  61 ++-
 swh/provenance/origin.py                |  17 +-
 swh/provenance/postgresql/archive.py    |  15 +-
 swh/provenance/postgresql/provenance.py |  58 ++-
 swh/provenance/provenance.py            | 165 ++++---
 swh/provenance/revision.py              |  31 +-
 swh/provenance/storage/archive.py       |  15 +-
 swh/provenance/tests/conftest.py        |  61 +--
 17 files changed, 1701 insertions(+), 313 deletions(-)
Changes applied before test
commit cd4056be39e8152276cc5d65f39d512021714d84
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Wed Sep 15 13:39:59 2021 +0200

    Improve server/client shutdown logic and error handling
    
    Add StatsD support to client to be compliant with the other provenance
    storage implementations

commit 2eaf7200e8a97e42b291313f04cf24ae6c6ce9f2
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Tue Aug 31 13:36:34 2021 +0200

    Rework `ProvenanceStorageRabbitMQWorker` to handle connection loss
    
    Use `pika.SelectConnection` and handle its life-cycle explicitly.
    Improve connection error handling on both client and server side.
    
    Change the RabbitMQ scheme to use 5 exchanges (one per entity + location).
    Each exchange handles all entity-related insertions, dispatching to different
    queues depending on the requested `ProvenanceStorageInterface` method (16
    queues per method). For instance, the `content` exchange handles all requests
    for `content_add` and `relation_add` for both relations `CNT_EARLY_IN_REV` and
    `CNT_IN_DIR` (i.e. relations with content as source). In each case, requests
    are forwarded to 1 of 16 possible workers, depending on the sha1 id of the
    content.

commit d1913496c81be2fae1eff1cf93a17c8439991706
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Fri Aug 20 12:21:27 2021 +0200

    Add new RabbitMQ-based client/server API
    
    Set methods in the `ProvenanceStorageInterface` are called through a server that
    guarantees conflict-free writes to the underlying database.
    
    Get methods are called directly from the client to avoid RPC overhead for reads.
    
    The server spawns multiple processes to handle independent requests concurrently.

commit 04ff73ea98f5f239cee6a126c75767f4617e330c
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Mon Sep 27 15:59:38 2021 +0200

    Make old StatsD metrics style compliant with the rest of the module

commit 1bd6b22aae6a356e18f65005fc7e1c162e6f38c6
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Mon Sep 27 14:08:10 2021 +0200

    Add StatsD support to graph submodule
    
    Time stats for graph creation and a counter of invalidated isochrone frontiers

commit 1ad78362fb415ea1d88a1d416da9991896e68d43
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Mon Sep 27 13:53:57 2021 +0200

    Add StatsD support to provenance storage implementations

commit e2a1843d5ebe01a9cdfe46b6b74dde1e293b8c01
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Mon Sep 27 15:21:42 2021 +0200

    Add StatsD support to provenance backend

commit 246e55f9b7e3475ea4509e08370827a3190db916
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Mon Sep 27 15:17:34 2021 +0200

    Split `Provenance::flush` method in two (one per layer)

commit f0210c3753c3a4122ee3c54f7fac97d170a142fa
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Fri Sep 24 11:08:08 2021 +0200

    Add `open`/`close` methods to both `ProvenanceInterface` and `ProvenanceStorageInterface`
    
    This provides an explicit mechanism to allocate/release resources when needed.
    The methods needed to use the classes implementing these interfaces as context
    managers are added as well (ie. `__enter__`/`__exit__`).

commit 172e327c25883bee768a9c16b850ce6aab7e2eb2
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Wed Sep 15 16:14:10 2021 +0200

    Remove remote provenance storage based on `swh.core.api.RPCClient`
    
    This implementation was a first attempt at conflict resolution that didn't work as expected.

See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/425/ for more details.
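
The `open`/`close` commit in the log above describes adding `__enter__`/`__exit__` so that implementations can be used as context managers. A minimal sketch of that pattern follows. It is an illustration only, not the code from the diff: the class name `DummyStorage` and its `opened` attribute are hypothetical.

```python
from types import TracebackType
from typing import Optional, Type


class DummyStorage:
    """Hypothetical storage with explicit resource management."""

    def __init__(self) -> None:
        self.opened = False

    def open(self) -> None:
        # Allocate resources (connections, channels, ...) here.
        self.opened = True

    def close(self) -> None:
        # Release whatever open() allocated.
        self.opened = False

    def __enter__(self) -> "DummyStorage":
        self.open()
        return self

    def __exit__(
        self,
        exc_type: Optional[Type[BaseException]],
        exc_val: Optional[BaseException],
        exc_tb: Optional[TracebackType],
    ) -> None:
        # Always release resources, even if the with-block raised.
        self.close()
```

With this shape, `with DummyStorage() as storage: ...` guarantees that `close()` runs when the block exits, including on error.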

As others (and I) said, this must come with actual documentation.
As is, I have a hard time understanding how this actually works (even after reading the document in HedgeDoc).

Before landing this, please give credit to the origin of the example code used as the starting point for a substantial part of the code in this diff (and make sure there is no license caveat).

This revision is now accepted and ready to land. Oct 5 2021, 10:39 AM

Also there is no real value in keeping 3 revisions: the last 2 revisions actually improve/modify the code from the first revision.

  • Add new RabbitMQ-based client/server API
  • Rework ProvenanceStorageRabbitMQWorker to handle connection loss
  • Improve server/client shutdown logic and error handling
  • Improve routing key computation for paths
  • Fix config file parsing for server initialization
  • Send several items per message in the remote provenance storage

Build is green

Patch application report for D6165 (id=23453)

Could not rebase; Attempt merge onto 04ff73ea98...

Updating 04ff73e..3f56270
Fast-forward
 mypy.ini                                        |   3 +
 requirements.txt                                |   1 +
 swh/provenance/__init__.py                      |   8 +
 swh/provenance/api/client.py                    | 582 +++++++++++++++++
 swh/provenance/api/server.py                    | 794 +++++++++++++++++++++++-
 swh/provenance/cli.py                           |  26 +-
 swh/provenance/model.py                         |   6 +-
 swh/provenance/provenance.py                    |  11 +-
 swh/provenance/tests/test_provenance_storage.py |  21 +-
 swh/provenance/util.py                          |  15 +
 10 files changed, 1432 insertions(+), 35 deletions(-)
 create mode 100644 swh/provenance/util.py
Changes applied before test
commit 3f56270a0f912e909312e52f864081bf6720cfce
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Mon Oct 11 16:06:03 2021 +0200

    Send several items per message in the remote provenance storage

commit 03dc27f6f2eb1d99084fbc3a3f9ecdaa7c9edb27
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Fri Oct 8 14:49:44 2021 +0200

    Fix config file parsing for server initialization

commit 2759b4977f933be21951ea90fd70f7c16c69aea1
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Fri Oct 8 14:41:42 2021 +0200

    Improve routing key computation for paths

commit f146fac61a1ac44f489739caba3fe2b2f21de8d3
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Wed Sep 15 13:39:59 2021 +0200

    Improve server/client shutdown logic and error handling
    
    Add StatsD support to client to be compliant with the other provenance
    storage implementations

commit 327be11571eef3e44c38705990f9c931661a7591
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Tue Aug 31 13:36:34 2021 +0200

    Rework `ProvenanceStorageRabbitMQWorker` to handle connection loss
    
    Use `pika.SelectConnection` and handle its life-cycle explicitly.
    Improve connection error handling on both client and server side.
    
    Change the RabbitMQ scheme to use 5 exchanges (one per entity + location).
    Each exchange handles all entity-related insertions, dispatching to different
    queues depending on the requested `ProvenanceStorageInterface` method (16
    queues per method). For instance, the `content` exchange handles all requests
    for `content_add` and `relation_add` for both relations `CNT_EARLY_IN_REV` and
    `CNT_IN_DIR` (ie. relations with content as source). In each case, requests
    are forwarded to 1 of 16 possible workers, depending on the sha1 id of the
    content.

commit bccbf59fcb16d8727d347c3d7b9a623704a80467
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Fri Aug 20 12:21:27 2021 +0200

    Add new RabbitMQ-based client/server API
    
    Set methods in the `ProvenanceStorageInterface` are called through a server that
    guarantees conflict-free writes to the underlying database.
    
    Get methods are called directly from the client to avoid RPC overhead for reads.
    
    The server spawns multiple processes to handle independent requests concurrently.

commit 3e87301a2868a7a9aa42403e150a60489f22708e
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Fri Oct 8 14:40:39 2021 +0200

    Move path normalization function to `util` submodule

commit 2c9ef5673b369f2baa83b11ec9256c6aafc3a855
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Tue Oct 5 12:01:25 2021 +0200

    Remove direct dependencies on deprecated `swh.model.identifiers` module

See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/436/ for more details.
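
The routing scheme described in the `ProvenanceStorageRabbitMQWorker` commit above (requests forwarded to 1 of 16 workers depending on the sha1 id) can be sketched as follows. The key format `<method>.<hex digit>` and the function names are assumptions made for illustration; the actual routing keys used in `client.py`/`server.py` are not shown on this page.

```python
from typing import Dict, List

NUM_WORKERS = 16  # "1 of 16 possible workers", per the commit message


def routing_key(method: str, sha1: bytes) -> str:
    """Hypothetical key: '<method>.<high nibble of the first id byte, in hex>'."""
    if not sha1:
        raise ValueError("empty id")
    bucket = sha1[0] >> 4  # value in 0..15, one bucket per worker
    return f"{method}.{bucket:x}"


def split_by_key(method: str, ids: List[bytes]) -> Dict[str, List[bytes]]:
    """Group ids so that each group can be published to a single queue."""
    groups: Dict[str, List[bytes]] = {}
    for sha1 in ids:
        groups.setdefault(routing_key(method, sha1), []).append(sha1)
    return groups
```

Under this assumption a given id always maps to the same bucket, so two workers never handle writes for the same entity, which is the property the conflict-free claim relies on.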

Build is green

Patch application report for D6165 (id=23517)

Could not rebase; Attempt merge onto 3e87301a28...

Updating 3e87301..b5a1d84
Fast-forward
 mypy.ini                         |   3 +
 requirements.txt                 |   1 +
 swh/provenance/__init__.py       |   8 +
 swh/provenance/api/client.py     | 582 ++++++++++++++++++++++++++++
 swh/provenance/api/server.py     | 794 ++++++++++++++++++++++++++++++++++++++-
 swh/provenance/cli.py            |  26 +-
 swh/provenance/sql/30-schema.sql |  20 +-
 swh/provenance/sql/40-funcs.sql  |  50 +--
 swh/provenance/util.py           |   5 +
 9 files changed, 1446 insertions(+), 43 deletions(-)
Changes applied before test
commit b5a1d8414f313af6a59211afa66816228c14172c
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Mon Oct 11 16:06:03 2021 +0200

    Send several items per message in the remote provenance storage

commit 5eb1fc59d3d34890f66f43314217412310fd919a
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Fri Oct 8 14:49:44 2021 +0200

    Fix config file parsing for server initialization

commit ce649ddb6db25dc3e68f128a7ae6174b8b31e8a0
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Fri Oct 8 14:41:42 2021 +0200

    Improve routing key computation for paths

commit 87108b1ea2ceb9909d4bbec6012004acdf96c08a
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Wed Sep 15 13:39:59 2021 +0200

    Improve server/client shutdown logic and error handling
    
    Add StatsD support to client to be compliant with the other provenance
    storage implementations

commit e6588cb24368b5edbc8bfdb82b5a4ba9b690d444
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Tue Aug 31 13:36:34 2021 +0200

    Rework `ProvenanceStorageRabbitMQWorker` to handle connection loss
    
    Use `pika.SelectConnection` and handle its life-cycle explicitly.
    Improve connection error handling on both client and server side.
    
    Change the RabbitMQ scheme to use 5 exchanges (one per entity + location).
    Each exchange handles all entity-related insertions, dispatching to different
    queues depending on the requested `ProvenanceStorageInterface` method (16
    queues per method). For instance, the `content` exchange handles all requests
    for `content_add` and `relation_add` for both relations `CNT_EARLY_IN_REV` and
    `CNT_IN_DIR` (ie. relations with content as source). In each case, requests
    are forwarded to 1 of 16 possible workers, depending on the sha1 id of the
    content.

commit 5a85b6c17f172315e784724158b7e08b6bdf9c61
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Fri Aug 20 12:21:27 2021 +0200

    Add new RabbitMQ-based client/server API
    
    Set methods in the `ProvenanceStorageInterface` are called through a server that
    guarantees conflict-free writes to the underlying database.
    
    Get methods are called directly from the client to avoid RPC overhead for reads.
    
    The server spawns multiple processes to handle independent requests concurrently.

commit 37da3774d8dc34365b7b1cbed469d970c51ecc58
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Thu Oct 14 12:03:47 2021 +0200

    Improve PostgreSQL storage scheme for the `with-path-denormalized` flavor
    
    Previous version was storing arrays of strings representing tuples for the
    denormalized relations (`dst` and `loc` of the relation resp.). While that
    simplified the check for duplicates, it turned out to be very inefficient
    in terms of disk usage. The new version has two distinct lists of `bigint`
    (ie. internal ids) for `dst` and `loc` resp. To check for duplicates the
    lists should be zipped, and repeated tuples filtered.

See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/444/ for more details.
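
The duplicate check described in the `with-path-denormalized` commit above (zip the `dst` and `loc` lists, then filter repeated tuples) can be illustrated in Python. This is only a sketch of the idea: the real logic lives in SQL (`40-funcs.sql`) and is not shown here, and the function name `merge_denormalized` is hypothetical.

```python
from typing import List, Set, Tuple


def merge_denormalized(
    dst: List[int], loc: List[int], new_dst: List[int], new_loc: List[int]
) -> Tuple[List[int], List[int]]:
    """Append new (dst, loc) pairs to the stored lists, skipping repeated pairs."""
    if len(dst) != len(loc) or len(new_dst) != len(new_loc):
        raise ValueError("dst and loc lists must have the same length")
    seen: Set[Tuple[int, int]] = set()
    out_dst: List[int] = []
    out_loc: List[int] = []
    # Zip both columns back into tuples; keep the first occurrence of each pair.
    for pair in list(zip(dst, loc)) + list(zip(new_dst, new_loc)):
        if pair not in seen:
            seen.add(pair)
            out_dst.append(pair[0])
            out_loc.append(pair[1])
    return out_dst, out_loc
```

Note that the same `dst` may legitimately appear with different `loc` values, which is why the comparison is on the zipped pair rather than on either list alone.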

  • Fix config file parsing for server initialization
  • Send several items per message in the remote provenance storage
  • Export batch size and prefetch count as parameters for remote storage

Build is green

Patch application report for D6165 (id=23635)

Could not rebase; Attempt merge onto 3e87301a28...

Updating 3e87301..effb5b0
Fast-forward
 mypy.ini                                |   3 +
 requirements.txt                        |   1 +
 swh/provenance/__init__.py              |   8 +
 swh/provenance/api/client.py            | 588 +++++++++++++++++++++++
 swh/provenance/api/server.py            | 808 +++++++++++++++++++++++++++++++-
 swh/provenance/cli.py                   |  31 +-
 swh/provenance/graph.py                 |   2 +-
 swh/provenance/postgresql/provenance.py |  29 +-
 swh/provenance/provenance.py            |  63 +++
 swh/provenance/sql/30-schema.sql        |  20 +-
 swh/provenance/sql/40-funcs.sql         |  50 +-
 swh/provenance/util.py                  |   5 +
 12 files changed, 1557 insertions(+), 51 deletions(-)
Changes applied before test
commit effb5b099a9e6928da42cdb491db532d6a75e988
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Mon Oct 18 11:52:04 2021 +0200

    Export batch size and prefetch count as parameters for remote storage

commit 6e11a8e528850ae05375243469ed74ab9e0956ee
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Mon Oct 11 16:06:03 2021 +0200

    Send several items per message in the remote provenance storage

commit d18e8bd4e9ca7814c9cdbfa7a21c155a3d7d3a08
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Fri Oct 8 14:49:44 2021 +0200

    Fix config file parsing for server initialization

commit 018ab5106e5ac9be36830c8242f1d94c229b878a
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Fri Oct 8 14:41:42 2021 +0200

    Improve routing key computation for paths

commit 647d0ae75b85043b0c2ef0f528be9f6891c91ce9
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Wed Sep 15 13:39:59 2021 +0200

    Improve server/client shutdown logic and error handling
    
    Add StatsD support to client to be compliant with the other provenance
    storage implementations

commit cdddb2d573763ab005c7d3c754c5d85a263220e9
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Tue Aug 31 13:36:34 2021 +0200

    Rework `ProvenanceStorageRabbitMQWorker` to handle connection loss
    
    Use `pika.SelectConnection` and handle its life-cycle explicitly.
    Improve connection error handling on both client and server side.
    
    Change the RabbitMQ scheme to use 5 exchanges (one per entity + location).
    Each exchange handles all entity-related insertions, dispatching to different
    queues depending on the requested `ProvenanceStorageInterface` method (16
    queues per method). For instance, the `content` exchange handles all requests
    for `content_add` and `relation_add` for both relations `CNT_EARLY_IN_REV` and
    `CNT_IN_DIR` (ie. relations with content as source). In each case, requests
    are forwarded to 1 of 16 possible workers, depending on the sha1 id of the
    content.

commit 35f03480581d52d1b7b705d0b974151fa49ba546
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Fri Aug 20 12:21:27 2021 +0200

    Add new RabbitMQ-based client/server API
    
    Set methods in the `ProvenanceStorageInterface` are called through a server that
    guarantees conflict-free writes to the underlying database.
    
    Get methods are called directly from the client to avoid RPC overhead for reads.
    
    The server spawns multiple processes to handle independent requests concurrently.

commit 8168ab4fc3f0fc3556623dd3de854f222ffe5d7e
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Thu Oct 14 12:03:47 2021 +0200

    Improve PostgreSQL storage scheme for the `with-path-denormalized` flavor
    
    Previous version was storing arrays of strings representing tuples for the
    denormalized relations (`dst` and `loc` of the relation resp.). While that
    simplified the check for duplicates, it turned out to be very inefficient
    in terms of disk usage. The new version has two distinct lists of `bigint`
    (ie. internal ids) for `dst` and `loc` resp. To check for duplicates the
    lists should be zipped, and repeated tuples filtered.

commit c7ae90e08b39919da9d67ad3436a71d47a6ad5e7
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Mon Oct 18 12:10:10 2021 +0200

    Add metrics on retries when flushing cache on the provenance backend

commit bfea53a97c588aa85ddd2ea93fa3dcf17b34a6a4
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Tue Oct 19 16:12:23 2021 +0200

    Export page size as a parameter for postgresql storage

See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/459/ for more details.

Build is green

Patch application report for D6165 (id=23660)

Could not rebase; Attempt merge onto ef49e3100c...

Updating ef49e31..b280343
Fast-forward
 mypy.ini                         |   3 +
 requirements.txt                 |   1 +
 swh/provenance/__init__.py       |   8 +
 swh/provenance/api/client.py     | 588 ++++++++++++++++++++++++++++
 swh/provenance/api/server.py     | 808 ++++++++++++++++++++++++++++++++++++++-
 swh/provenance/cli.py            |  31 +-
 swh/provenance/sql/30-schema.sql |  20 +-
 swh/provenance/sql/40-funcs.sql  |  50 ++-
 swh/provenance/util.py           |   5 +
 9 files changed, 1470 insertions(+), 44 deletions(-)
Changes applied before test
commit b2803436de6bc67c6d0dc01b5624ebba18689ca2
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Mon Oct 18 11:52:04 2021 +0200

    Export batch size and prefetch count as parameters for remote storage

commit 3fee8fbfbdaac3ed1adb5003adb32c52c99c8d37
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Mon Oct 11 16:06:03 2021 +0200

    Send several items per message in the remote provenance storage

commit d8c0aae0093f700f5e362d60fe8cf6b51f374fa2
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Fri Oct 8 14:49:44 2021 +0200

    Fix config file parsing for server initialization

commit 8e0d98fb67ace71342867b68c0703b551e47e7f0
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Fri Oct 8 14:41:42 2021 +0200

    Improve routing key computation for paths

commit 1e66d6940749e33ff4442dfe8c2495567b6c50a5
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Wed Sep 15 13:39:59 2021 +0200

    Improve server/client shutdown logic and error handling
    
    Add StatsD support to client to be compliant with the other provenance
    storage implementations

commit 5e8c8a54cdac94a2a41662468f33893df97e7c6b
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Tue Aug 31 13:36:34 2021 +0200

    Rework `ProvenanceStorageRabbitMQWorker` to handle connection loss
    
    Use `pika.SelectConnection` and handle its life-cycle explicitly.
    Improve connection error handling on both client and server side.
    
    Change the RabbitMQ scheme to use 5 exchanges (one per entity + location).
    Each exchange handles all entity-related insertions, dispatching to different
    queues depending on the requested `ProvenanceStorageInterface` method (16
    queues per method). For instance, the `content` exchange handles all requests
    for `content_add` and `relation_add` for both relations `CNT_EARLY_IN_REV` and
    `CNT_IN_DIR` (ie. relations with content as source). In each case, requests
    are forwarded to 1 of 16 possible workers, depending on the sha1 id of the
    content.

commit 2a5fe87fbd76aa30e44ea7703a85d5a4b70e574c
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Fri Aug 20 12:21:27 2021 +0200

    Add new RabbitMQ-based client/server API
    
    Set methods in the `ProvenanceStorageInterface` are called through a server that
    guarantees conflict-free writes to the underlying database.
    
    Get methods are called directly from the client to avoid RPC overhead for reads.
    
    The server spawns multiple processes to handle independent requests concurrently.

commit 62884e23dd1164274fd89a09acedae8977a8e0f3
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Thu Oct 14 12:03:47 2021 +0200

    Improve PostgreSQL storage scheme for the `with-path-denormalized` flavor
    
    Previous version was storing arrays of strings representing tuples for the
    denormalized relations (`dst` and `loc` of the relation resp.). While that
    simplified the check for duplicates, it turned out to be very inefficient
    in terms of disk usage. The new version has two distinct lists of `bigint`
    (ie. internal ids) for `dst` and `loc` resp. To check for duplicates the
    lists should be zipped, and repeated tuples filtered.

See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/464/ for more details.

  • Improve timeout logic on remote storage client side

Build is green

Patch application report for D6165 (id=23906)

Could not rebase; Attempt merge onto ef49e3100c...

Updating ef49e31..9358df8
Fast-forward
 mypy.ini                                   |   3 +
 requirements.txt                           |   1 +
 swh/provenance/__init__.py                 |   8 +
 swh/provenance/api/client.py               | 597 +++++++++++++++++++++
 swh/provenance/api/server.py               | 808 ++++++++++++++++++++++++++++-
 swh/provenance/cli.py                      |  31 +-
 swh/provenance/sql/30-schema.sql           |  20 +-
 swh/provenance/sql/40-funcs.sql            |  50 +-
 swh/provenance/tests/data/generate_repo.py |   2 +-
 swh/provenance/util.py                     |   5 +
 10 files changed, 1480 insertions(+), 45 deletions(-)
Changes applied before test
commit 9358df82cc7255340caadaa13ae3b53fbe5e1cc7
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Thu Oct 28 13:59:00 2021 +0200

    Improve timeout logic on remote storage client side

commit aa8dc0ea8f67748e53076f2143ba2f6dad150498
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Mon Oct 18 11:52:04 2021 +0200

    Export batch size and prefetch count as parameters for remote storage

commit a9bc8845740f18bcf4befe9c521c2b1b8c4fd769
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Mon Oct 11 16:06:03 2021 +0200

    Send several items per message in the remote provenance storage

commit fa5c6b763913bef84a128d152cb25f081edf399d
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Fri Oct 8 14:49:44 2021 +0200

    Fix config file parsing for server initialization

commit eaf8ad8026de592629d8c9286cf19db2690acfa0
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Fri Oct 8 14:41:42 2021 +0200

    Improve routing key computation for paths

commit 4243290997d281ece591c711e6748de341599e2d
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Wed Sep 15 13:39:59 2021 +0200

    Improve server/client shutdown logic and error handling
    
    Add StatsD support to client to be compliant with the other provenance
    storage implementations

commit df083f60f1eeeb9257992a639c9c1a9937ce62f4
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Tue Aug 31 13:36:34 2021 +0200

    Rework `ProvenanceStorageRabbitMQWorker` to handle connection loss
    
    Use `pika.SelectConnection` and handle its life-cycle explicitly.
    Improve connection error handling on both client and server side.
    
    Change the RabbitMQ scheme to use 5 exchanges (one per entity + location).
    Each exchange handles all entity-related insertions, dispatching to different
    queues depending on the requested `ProvenanceStorageInterface` method (16
    queues per method). For instance, the `content` exchange handles all requests
    for `content_add` and `relation_add` for both relations `CNT_EARLY_IN_REV` and
    `CNT_IN_DIR` (ie. relations with content as source). In each case, requests
    are forwarded to 1 of 16 possible workers, depending on the sha1 id of the
    content.

commit 69596d600a120c13d0cd2ed0d4e48584e8b9dc7c
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Fri Aug 20 12:21:27 2021 +0200

    Add new RabbitMQ-based client/server API
    
    Set methods in the `ProvenanceStorageInterface` are called through a server that
    guarantees conflict-free writes to the underlying database.
    
    Get methods are called directly from the client to avoid RPC overhead for reads.
    
    The server spawns multiple processes to handle independent requests concurrently.

commit 743b5954068fcc98203d9d254c53c076856e3426
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Thu Oct 14 12:03:47 2021 +0200

    Improve PostgreSQL storage scheme for the `with-path-denormalized` flavor
    
    Previous version was storing arrays of strings representing tuples for the
    denormalized relations (`dst` and `loc` of the relation resp.). While that
    simplified the check for duplicates, it turned out to be very inefficient
    in terms of disk usage. The new version has two distinct lists of `bigint`
    (ie. internal ids) for `dst` and `loc` resp. To check for duplicates the
    lists should be zipped, and repeated tuples filtered.

commit 30d8899bcfd60019b84064eba6916af0b2b5173e
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Thu Oct 28 13:58:32 2021 +0200

    Fix `yaml.load` deprecation warning

See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/474/ for more details.

Build is green

Patch application report for D6165 (id=24271)

Could not rebase; Attempt merge onto 94baaab052...

Updating 94baaab..81ccb6a
Fast-forward
 mypy.ini                             |   3 +
 requirements.txt                     |   1 +
 swh/provenance/__init__.py           |   8 +
 swh/provenance/api/client.py         | 597 ++++++++++++++++++++++++++
 swh/provenance/api/server.py         | 808 ++++++++++++++++++++++++++++++++++-
 swh/provenance/archive.py            |   2 +-
 swh/provenance/cli.py                |  35 +-
 swh/provenance/graph.py              |   3 +-
 swh/provenance/model.py              |   4 +-
 swh/provenance/postgresql/archive.py |  15 +-
 swh/provenance/provenance.py         |  77 ++--
 swh/provenance/revision.py           |  12 +-
 swh/provenance/sql/30-schema.sql     |  20 +-
 swh/provenance/sql/40-funcs.sql      |  50 ++-
 swh/provenance/storage/archive.py    |  16 +-
 swh/provenance/tests/conftest.py     |  34 +-
 swh/provenance/util.py               |   5 +
 17 files changed, 1560 insertions(+), 130 deletions(-)
Changes applied before test
commit 81ccb6a310249e96e6393fd183dd36af66421083
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Thu Oct 28 13:59:00 2021 +0200

    Improve timeout logic on remote storage client side

commit bdfc2c23c18a3db5b32edc53172797b510442c7c
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Mon Oct 18 11:52:04 2021 +0200

    Export batch size and prefetch count as parameters for remote storage

commit 785c156f7148bf20c0b1606736a4d9b99f701d7e
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Mon Oct 11 16:06:03 2021 +0200

    Send several items per message in the remote provenance storage

commit ea93e1f94cff82372d1236158e8c4c36ff2747cd
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Fri Oct 8 14:49:44 2021 +0200

    Fix config file parsing for server initialization

commit aa9f923259352d59a4825b17196c1f9df9ae4c9d
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Fri Oct 8 14:41:42 2021 +0200

    Improve routing key computation for paths

commit f5a548512b4d5f602bd5dbea5d66705325ef3da1
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Wed Sep 15 13:39:59 2021 +0200

    Improve server/client shutdown logic and error handling
    
    Add StatsD support to client to be compliant with the other provenance
    storage implementations

commit 9f765cf93dd47d92dfd170fc14aacc69aa102a8a
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Tue Aug 31 13:36:34 2021 +0200

    Rework `ProvenanceStorageRabbitMQWorker` to handle connection loss
    
    Use `pika.SelectConnection` and handle its life-cycle explicitly.
    Improve connection error handling on both client and server side.
    
    Change the RabbitMQ scheme to use 5 exchanges (one per entity + location).
    Each exchange handles all entity-related insertions, dispatching to different
    queues depending on the requested `ProvenanceStorageInterface` method (16
    queues per method). For instance, the `content` exchange handles all requests
    for `content_add` and `relation_add` for both relations `CNT_EARLY_IN_REV` and
    `CNT_IN_DIR` (ie. relations with content as source). In each case, requests
    are forwarded to 1 of 16 possible workers, depending on the sha1 id of the
    content.

commit 0eacff7a66a5a117b6d81d97d1a554d6cab4920c
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Fri Aug 20 12:21:27 2021 +0200

    Add new RabbitMQ-based client/server API
    
    Set methods in the `ProvenanceStorageInterface` are called through a server that
    guarantees conflict-free writes to the underlying database.
    
    Get methods are called directly from the client to avoid RPC overhead for reads.
    
    The server spawns multiple processes to handle independent requests concurrently.

commit 579c3bd35e5668ad9ef5fea58c20d5c66e5699f2
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Thu Oct 14 12:03:47 2021 +0200

    Improve PostgreSQL storage scheme for the `with-path-denormalized` flavor
    
    Previous version was storing arrays of strings representing tuples for the
    denormalized relations (`dst` and `loc` of the relation resp.). While that
    simplified the check for duplicates, it turned out to be very inefficient
    in terms of disk usage. The new version has two distinct lists of `bigint`
    (ie. internal ids) for `dst` and `loc` resp. To check for duplicates the
    lists should be zipped, and repeated tuples filtered.

commit 584845d3715ea6c536e7cf5f697cac628032416f
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Thu Oct 28 14:21:52 2021 +0200

    Add support to filter files by a minimum size
    
    The idea is to be able to filter out files that are not meaningful from the
    provenance point of view, for instance the empty file. This modification
    allows defining a minimum size for files to be considered for the
    provenance index.

commit 966fe3e8d506ce8b4fddf6e9ad29db4dae9943ab
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Tue Nov 23 16:11:09 2021 +0100

    Reorder flushing operations to avoid unnecessary updates in the storage

commit 62a31f6f986bb38ced99331ab66eb0717600ea5b
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Wed Nov 24 11:10:40 2021 +0100

    Rework conftest and improve type annotations

See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/489/ for more details.
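
The minimum-size filter described in the log above (skipping files, such as the empty file, that carry no provenance information) could look roughly like this. The names `FileEntry`, `filter_by_size` and `min_size` are hypothetical, and the sketch assumes files strictly smaller than the threshold are dropped; the commit message does not say whether the bound is inclusive.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass(frozen=True)
class FileEntry:
    """Hypothetical stand-in for a file found in a directory."""

    name: bytes
    size: int  # size in bytes


def filter_by_size(files: Iterable[FileEntry], min_size: int = 0) -> Iterator[FileEntry]:
    """Yield only the files whose size is at least min_size bytes."""
    for entry in files:
        if entry.size >= min_size:
            yield entry
```

With `min_size=1`, the empty file is excluded from the index while every non-empty file is kept; the default of `0` keeps everything.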

  • Add documentation for the remote storage backend

Build is green

Patch application report for D6165 (id=24323)

Could not rebase; Attempt merge onto 579c3bd35e...

Updating 579c3bd..b6db561
Fast-forward
 docs/storage/remote.rst      | 340 +++++++++++++++++++++++
 mypy.ini                     |   3 +
 requirements.txt             |   1 +
 swh/provenance/__init__.py   |   8 +
 swh/provenance/api/client.py | 463 +++++++++++++++++++++++++++++++
 swh/provenance/api/server.py | 646 ++++++++++++++++++++++++++++++++++++++++++-
 swh/provenance/cli.py        |  31 ++-
 swh/provenance/util.py       |   5 +
 8 files changed, 1486 insertions(+), 11 deletions(-)
 create mode 100644 docs/storage/remote.rst
Changes applied before test
commit b6db561c102db1536fc956a4e5be5efbbd372e08
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Fri Nov 26 16:21:19 2021 +0100

    Add documentation for the remote storage backend
    
    Clean up code

commit 81ccb6a310249e96e6393fd183dd36af66421083
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Thu Oct 28 13:59:00 2021 +0200

    Improve timeout logic on remote storage client side

commit bdfc2c23c18a3db5b32edc53172797b510442c7c
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Mon Oct 18 11:52:04 2021 +0200

    Export batch size and prefetch count as parameters for remote storage

commit 785c156f7148bf20c0b1606736a4d9b99f701d7e
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Mon Oct 11 16:06:03 2021 +0200

    Send several items per message in the remote provenance storage

commit ea93e1f94cff82372d1236158e8c4c36ff2747cd
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Fri Oct 8 14:49:44 2021 +0200

    Fix config file parsing for server initialization

commit aa9f923259352d59a4825b17196c1f9df9ae4c9d
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Fri Oct 8 14:41:42 2021 +0200

    Improve routing key computation for paths

commit f5a548512b4d5f602bd5dbea5d66705325ef3da1
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Wed Sep 15 13:39:59 2021 +0200

    Improve server/client shutdown logic and error handling
    
    Add StatsD support to client to be compliant with the other provenance
    storage implementations

commit 9f765cf93dd47d92dfd170fc14aacc69aa102a8a
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Tue Aug 31 13:36:34 2021 +0200

    Rework `ProvenanceStorageRabbitMQWorker` to handle connection loss
    
    Use `pika.SelectConnection` and manage its life-cycle explicitly.
    Improve connection error handling on both client and server side.
    
    Change the RabbitMQ scheme to use 5 exchanges (one per entity + location).
    Each exchange handles all entity related insertions, dispatching to different
    queues depending on the requested `ProvenanceStorageInterface` method (16
    queues per method). For instance, the `content` exchange handles all requests
    for `content_add` and `relation_add` for both relations `CNT_EARLY_IN_REV` and
    `CNT_IN_DIR` (ie. relations with content as source). In each case, requests
    are forwarded to 1 of 16 possible workers, depending on the sha1 id of the
    content.
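
The dispatch scheme described in this commit message can be sketched as follows. This is only an illustration under stated assumptions, not the code from `swh/provenance/api/client.py`: the function name `routing_key`, the `<method>.<worker>` key layout, and the choice of the first hex digit of the sha1 as worker index are hypothetical. The exchange names other than `content` and `location` are assumed from the entity types.

```python
import hashlib

# Assumption: one exchange per entity plus one for locations (5 in total).
# Only `content` and `location` are named in the commit message itself.
EXCHANGES = ("content", "directory", "location", "origin", "revision")
QUEUES_PER_METHOD = 16


def routing_key(exchange: str, method: str, sha1: bytes) -> str:
    """Return a hypothetical routing key of the form '<method>.<worker>'."""
    if exchange not in EXCHANGES:
        raise ValueError(f"unknown exchange: {exchange}")
    if not sha1:
        raise ValueError("empty id")
    # The high nibble of the first byte takes 16 values (0x0 to 0xf),
    # so every id maps to exactly one of the 16 queues of a method.
    worker = sha1[0] >> 4
    return f"{method}.{worker:x}"


if __name__ == "__main__":
    sha1 = hashlib.sha1(b"example content").digest()
    # Both calls target the `content` exchange and the same worker index,
    # because the source entity of the relation is the same content.
    print(routing_key("content", "content_add", sha1))
    print(routing_key("content", "relation_add", sha1))
```

Because the worker index depends only on the id of the source entity, two requests touching the same entity through the same method always reach the same queue, which is what lets a single worker serialize the writes for that entity.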

commit 0eacff7a66a5a117b6d81d97d1a554d6cab4920c
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Fri Aug 20 12:21:27 2021 +0200

    Add new RabbitMQ-based client/server API
    
    Set methods in the `ProvenanceStorageInterface` are called through a server that
    guarantees conflict-free writes to the underlying database.
    
    Get methods are called directly from the client to avoid RPC overhead for reads.
    
    The server spawns multiple processes to handle independent requests concurrently.

See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/492/ for more details.

Build is green

Patch application report for D6165 (id=24326)

Rebasing onto 579c3bd35e...

Current branch diff-target is up to date.
Changes applied before test
commit ecaaeb2e57762f109ddb6f3a6e8529cd48c86085
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Fri Aug 20 12:21:27 2021 +0200

    Add new RabbitMQ-based client/server API
    
    Write methods in the `ProvenanceStorageInterface` are called through a server that
    guarantees conflict-free writes to the underlying database.
    
    Read methods are called directly from the client to avoid RPC overhead for reads.
    
    The server spawns multiple sub-processes to handle independent requests concurrently.
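
The read/write split described above can be sketched as follows. This is a toy model under stated assumptions, not the actual client or server: `SplitClient` and `drain` are hypothetical names, an in-process `queue.Queue` stands in for RabbitMQ, a `dict` stands in for the database, and the "keep the earliest date" merge rule is assumed purely for illustration.

```python
import queue
from typing import Dict, Optional


class SplitClient:
    """Toy client: writes are queued for a worker, reads hit storage directly."""

    def __init__(self, storage: Dict[bytes, int]) -> None:
        self.storage = storage  # stand-in for the underlying database
        self.writes: queue.Queue = queue.Queue()  # stand-in for a RabbitMQ queue

    def content_add(self, sha1: bytes, date: int) -> None:
        # Write path: nothing is stored here, the request is only dispatched.
        self.writes.put((sha1, date))

    def content_get(self, sha1: bytes) -> Optional[int]:
        # Read path: direct access to storage, no message round trip.
        return self.storage.get(sha1)


def drain(client: SplitClient) -> None:
    """Worker side: apply queued writes one at a time (assumed merge rule:
    keep the earliest date seen for each id)."""
    while True:
        try:
            sha1, date = client.writes.get_nowait()
        except queue.Empty:
            return
        current = client.storage.get(sha1)
        if current is None or date < current:
            client.storage[sha1] = date


if __name__ == "__main__":
    client = SplitClient({})
    client.content_add(b"\x01", 20)
    client.content_add(b"\x01", 10)
    print(client.content_get(b"\x01"))  # not yet visible: writes are queued
    drain(client)
    print(client.content_get(b"\x01"))
```

A single consumer per queue applies the writes sequentially, so two conflicting requests for the same id are resolved in one place instead of racing in the database.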

See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/494/ for more details.

Build is green

Patch application report for D6165 (id=24329)

Rebasing onto 579c3bd35e...

Current branch diff-target is up to date.
Changes applied before test
commit 84a62793d5de0b51aa7706b04a773366f5411391
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Fri Aug 20 12:21:27 2021 +0200

    Add new RabbitMQ-based client/server API
    
    Write methods in the `ProvenanceStorageInterface` are called through a server that
    guarantees conflict-free writes to the underlying database.
    
    Read methods are called directly from the client to avoid RPC overhead for reads.
    
    The server spawns multiple sub-processes to handle independent requests concurrently.

See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/496/ for more details.

Build is green

Patch application report for D6165 (id=24358)

Rebasing onto 579c3bd35e...

Current branch diff-target is up to date.
Changes applied before test
commit f8411a522e937b961d7fc8270c653d83039268fd
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Fri Aug 20 12:21:27 2021 +0200

    Add new RabbitMQ-based client/server API
    
    Write methods in the `ProvenanceStorageInterface` are called through a server that
    guarantees conflict-free writes to the underlying database.
    
    Read methods are called directly from the client to avoid RPC overhead for reads.
    
    The server spawns multiple sub-processes to handle independent requests concurrently.

See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/499/ for more details.

Build is green

Patch application report for D6165 (id=24366)

Rebasing onto 579c3bd35e...

Current branch diff-target is up to date.
Changes applied before test
commit a6cc3e4daf228ce9c124712b93c4749b16e65ce1
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Fri Aug 20 12:21:27 2021 +0200

    Add new RabbitMQ-based client/server API
    
    Write methods in the `ProvenanceStorageInterface` are called through a server that
    guarantees conflict-free writes to the underlying database.
    
    Read methods are called directly from the client to avoid RPC overhead for reads.
    
    The server spawns multiple sub-processes to handle independent requests concurrently.

See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/504/ for more details.

This revision was automatically updated to reflect the committed changes.