All the current swh.provenance changes that are running in production...
Closed, Public

Authored by olasd on Aug 12 2022, 5:51 PM.

Details

Reviewers
douardda
Group Reviewers
Reviewers
Commits
rDPROV8f476d494b4a: swhgraph: handle empty responses
rDPROVedf00f88894f: Use proper signatures in journal_client
rDPROV68e1907e7f37: Appease pyright by ensuring target_type is bound
rDPROV08de80b680bd: origin layer: retrieve multiple levels of revision history at once
rDPROVd935abf431df: Rename origin.proceed_origin to origin.process_origin
rDPROV2ac46f58346f: multiplexer: add endpoint counts per backend
rDPROVf5f8555f8e3d: provenance: lower the cache thresholds
rDPROV8d323c322df2: journal client: only use the provenance context manager once
rDPROV4b3de6177b4f: revision: only trigger partial flushes when necessary
rDPROV9c936c39779c: revision: sort batches by date, improve logging, add incremental flushing
rDPROV5b66b98e62c5: revision: capture datetime exceptions with sentry
rDPROVaf09058f0a80: revision: don't process revisions before the epoch
rDPROV3473d4af62d8: revision: don't process revisions with unknown dates
rDPROV34a9a1ac220b: Remove sneaky caches in the postgresql archive implementation
rDPROVd7d0c3d87605: postgresql archive: add support for partially copied databases
rDPROV95eb9622a00c: postgresql archive: don't use custom types
rDPROVbae8f4afda45: rabbitmq: Extend timeouts for reception of acks
rDPROV1efc40c7917f: rabbitmq: close the consumer only after all acks are received
rDPROVef7cd991712e: Improve logging in the API client and the revision layer
rDPROV3edf3690258b: Add systemd notification support
rDPROV5cadb13de9eb: Try to avoid some circular imports
rDPROV98254d2e930f: blacken swhgraph/archive.py
Summary

This is the accumulation of changes that have been tested in production
on mmca and met over the past weeks. A lot of this has been pair-programmed, and
the tests still pass, so we're probably in good shape.

git log origin/master.. says:

commit 8f476d494b4aeab6e0cd6a7adb5f2bce095e8c60
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Fri Aug 12 13:37:12 2022 +0200

    swhgraph: handle empty responses

    When the visit_edges response is empty, swh.graph.client generates an
    empty tuple, which can't be unpacked. Work around the issue.

 swh/provenance/swhgraph/archive.py | 11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)
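
The kind of workaround described in this commit message can be sketched as follows. This is only an illustration, not the actual patch: the helper name `iter_edges` is hypothetical, and the response shape (an iterable of `(src, dst)` pairs, with an empty tuple standing in for an empty response) is assumed from the commit message.

```python
from typing import Iterable, Iterator, Tuple


def iter_edges(response: Iterable[tuple]) -> Iterator[Tuple[str, str]]:
    """Yield (src, dst) pairs, skipping empty tuples (hypothetical helper)."""
    for edge in response:
        if not edge:
            # An empty tuple cannot be unpacked into (src, dst), so skip it
            # instead of letting the unpacking raise ValueError.
            continue
        src, dst = edge
        yield src, dst
```

With this guard, a response consisting of a single empty tuple produces no edges instead of an unpacking error.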

commit edf00f88894fb9cf407017944dc5cd751b012357
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Fri Aug 12 13:36:39 2022 +0200

    Use proper signatures in journal_client

    We're always passing the provenance-internal object types, not those of
    swh.storage.

 swh/provenance/journal_client.py | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

commit 08de80b680bdf008f9a1f45805f2d54a7a397549
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Fri Aug 12 11:26:01 2022 +0200

    origin layer: retrieve multiple levels of revision history at once

    Replace `revision_get_parents` with `revision_get_some_outbound_edges`,
    which can optionally retrieve more levels of history than just a single
    one. This allows us to do way fewer queries on the swh.graph or
    swh.storage backend if the revision exists there.

    The swh.storage backend does limited recursion, so we still process the
    origin in multiple steps to fetch the whole history.

 swh/provenance/archive.py                      | 15 +++++----
 swh/provenance/graph.py                        | 43 +++++++++++++-------------
 swh/provenance/interface.py                    | 10 +++---
 swh/provenance/journal_client.py               |  1 -
 swh/provenance/model.py                        | 20 +-----------
 swh/provenance/multiplexer/archive.py          | 28 ++++++++++-------
 swh/provenance/origin.py                       | 26 ++++++----------
 swh/provenance/postgresql/archive.py           | 27 +++++++++-------
 swh/provenance/provenance.py                   | 28 ++++++++---------
 swh/provenance/storage/archive.py              | 12 ++++---
 swh/provenance/swhgraph/archive.py             | 23 ++++++++------
 swh/provenance/tests/test_archive_interface.py | 29 ++++++++++-------
 12 files changed, 130 insertions(+), 132 deletions(-)
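
The idea of fetching several levels of history per query, then repeating the query until the whole history is covered, can be sketched as below. This is a simplified model under stated assumptions: the in-memory `parents` map stands in for an archive backend, and `some_outbound_edges` and `full_history` are hypothetical names that do not reproduce the signature of `revision_get_some_outbound_edges`.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical in-memory parent map standing in for an archive backend.
Parents = Dict[str, List[str]]
Edge = Tuple[str, str]


def some_outbound_edges(parents: Parents, start: str, depth: int) -> List[Edge]:
    """Return (revision, parent) edges found within `depth` levels of `start`."""
    edges: List[Edge] = []
    seen = {start}
    frontier = [start]
    for _ in range(depth):
        next_frontier = []
        for rev in frontier:
            for parent in parents.get(rev, []):
                edges.append((rev, parent))
                if parent not in seen:
                    seen.add(parent)
                    next_frontier.append(parent)
        frontier = next_frontier
    return edges


def full_history(parents: Parents, head: str, depth: int = 2) -> Tuple[Set[Edge], int]:
    """Collect every edge by repeating the bounded query; also count the queries."""
    known_edges: Set[Edge] = set()
    expanded: Set[str] = set()  # revisions whose parents are already known
    pending = [head]
    queries = 0
    while pending:
        rev = pending.pop()
        if rev in expanded:
            continue
        queries += 1
        for src, dst in some_outbound_edges(parents, rev, depth):
            known_edges.add((src, dst))
            expanded.add(src)  # all parents of src were listed by this query
            if dst not in expanded:
                pending.append(dst)
        expanded.add(rev)
    return known_edges, queries
```

In this model, a larger `depth` yields the same set of edges with fewer backend queries, which is the effect the commit message describes.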

commit 68e1907e7f37863d732edcb6211be893df94b9c7
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Fri Aug 12 11:03:12 2022 +0200

    Appease pyright by ensuring target_type is bound

 swh/provenance/tests/test_archive_interface.py | 2 ++
 1 file changed, 2 insertions(+)

commit d935abf431df5105fec8422e87eb5ee47d3c177a
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Fri Aug 12 11:01:37 2022 +0200

    Rename origin.proceed_origin to origin.process_origin

 swh/provenance/origin.py | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

commit 2ac46f58346f7c3763f1263109885fea6797e155
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Wed Aug 3 18:21:32 2022 +0200

    multiplexer: add endpoint counts per backend

 swh/provenance/__init__.py                     |  6 ++-
 swh/provenance/multiplexer/archive.py          | 61 +++++++++++++++++++-------
 swh/provenance/tests/test_archive_interface.py |  4 +-
 swh/provenance/tests/test_init.py              |  6 ++-
 4 files changed, 57 insertions(+), 20 deletions(-)

commit 8d323c322df2bf9a429a1329de6c87636927df19
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Fri Aug 12 17:46:20 2022 +0200

    journal client: only use the provenance context manager once

    The context manager for the provenance storage rabbitmq client doesn't
    like being used multiple times over the lifetime of a process. Only use
    it once in the cli of the journal client.

 swh/provenance/cli.py            | 6 ++++--
 swh/provenance/journal_client.py | 6 ++----
 2 files changed, 6 insertions(+), 6 deletions(-)
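
A rough sketch of the pattern described in this commit message: enter the storage context manager once at the outermost level and pass the open object down to the per-batch code. `FakeStorage`, `process_batch` and `run` are hypothetical stand-ins and do not reflect the real rabbitmq client or CLI code.

```python
from typing import Iterable, List


class FakeStorage:
    """Hypothetical client whose context manager must only be entered once."""

    def __init__(self) -> None:
        self.entered = 0
        self.received: List[str] = []

    def __enter__(self) -> "FakeStorage":
        self.entered += 1
        if self.entered > 1:
            raise RuntimeError("context manager entered more than once")
        return self

    def __exit__(self, *exc_info) -> None:
        return None

    def add(self, message: str) -> None:
        self.received.append(message)


def process_batch(storage: FakeStorage, batch: Iterable[str]) -> None:
    # The per-batch code uses the already-open storage; it never enters it.
    for message in batch:
        storage.add(message)


def run(storage: FakeStorage, batches: Iterable[Iterable[str]]) -> None:
    # Enter the context manager a single time, at the CLI-like entry point.
    with storage:
        for batch in batches:
            process_batch(storage, batch)
```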

commit f5f8555f8e3d8c72a5d51f4a10d0b761e74c97fe
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Fri Aug 12 17:45:27 2022 +0200

    provenance: lower the cache thresholds

    Instead of flushing if any entry is over the threshold, flush when the
    cumulative count goes over.

 swh/provenance/provenance.py | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)
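
The difference between the two flushing policies mentioned in this commit message can be illustrated as follows. The cache layout (a dict of sets keyed by entity type) and both function names are assumptions made for the example, not the structure used in `provenance.py`.

```python
from typing import Dict, Set


def should_flush_per_entry(cache: Dict[str, Set[int]], threshold: int) -> bool:
    """Old policy (as described): flush if any single entry exceeds the threshold."""
    return any(len(items) > threshold for items in cache.values())


def should_flush_cumulative(cache: Dict[str, Set[int]], threshold: int) -> bool:
    """New policy (as described): flush if the total across entries exceeds it."""
    return sum(len(items) for items in cache.values()) > threshold
```

With several moderately filled entries, the cumulative check triggers a flush earlier than the per-entry check for the same threshold value.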

commit 4b3de6177b4f2c5b45dede931004c719fdfb0f7d
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Fri Aug 12 17:44:47 2022 +0200

    revision: only trigger partial flushes when necessary

 swh/provenance/revision.py | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

commit 9c936c39779cdb42b0f8f1a40df23d2de3032dfb
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Fri Aug 12 17:43:13 2022 +0200

    revision: sort batches by date, improve logging, add incremental flushing

 swh/provenance/revision.py | 18 +++++++++++++++++-
 1 file changed, 17 insertions(+), 1 deletion(-)

commit 5b66b98e62c50c5958936adcc3b0ab651fb2d279
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Fri Aug 12 17:40:59 2022 +0200

    revision: capture datetime exceptions with sentry

 swh/provenance/journal_client.py | 3 +++
 1 file changed, 3 insertions(+)

commit af09058f0a80aac79a4e477fb2f7bd9800e3603f
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Fri Aug 12 17:40:26 2022 +0200

    revision: don't process revisions before the epoch

 swh/provenance/journal_client.py | 7 +++++++
 1 file changed, 7 insertions(+)

commit 3473d4af62d85255845aafc1def6c591090062e7
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Fri Aug 12 17:39:30 2022 +0200

    revision: don't process revisions with unknown dates

 swh/provenance/journal_client.py | 25 ++++++++++++++++---------
 1 file changed, 16 insertions(+), 9 deletions(-)
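
The date handling from the commits above (skipping revisions with unknown dates, skipping revisions dated before the epoch, and sorting batches by date) could look roughly like this. The tuple representation of a revision and the name `processable` are hypothetical, dates are assumed to be timezone-aware, and whether the real code also excludes a date exactly equal to the epoch is not stated in the commit messages.

```python
from datetime import datetime, timezone
from typing import Iterable, List, Optional, Tuple

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

# Hypothetical minimal revision representation: (identifier, date or None).
Revision = Tuple[str, Optional[datetime]]


def processable(revisions: Iterable[Revision]) -> List[Tuple[str, datetime]]:
    """Drop revisions with unknown or pre-epoch dates, then sort by date."""
    kept: List[Tuple[str, datetime]] = []
    for rev_id, date in revisions:
        if date is None:
            continue  # unknown date: skip
        if date < EPOCH:
            continue  # dated before the epoch: skip
        kept.append((rev_id, date))
    # Oldest first, so a batch is processed in chronological order.
    return sorted(kept, key=lambda item: item[1])
```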

commit d7d0c3d876059abe6a1d60a6c38ed4245e1b58c9
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Fri Aug 12 17:34:21 2022 +0200

    postgresql archive: add support for partially copied databases

    The incremental copy of the archive to mmca is not atomic: the directory
    table needs to be copied first, then the directory_entry_* tables need
    to be updated. This means that the client can view inconsistent entries,
    where the directory has been synced but not all the entry rows.

    We return an empty list when one of these bogus entries is detected.
    This allows smooth fallback to the main database through the
    multiplexer.

 swh/provenance/postgresql/archive.py | 12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)
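
The fallback behaviour this commit message relies on can be sketched as below: a backend that returns an empty list lets the multiplexer move on to the next backend. The callables and the function name `directory_ls` are simplified assumptions for illustration; how the real code detects an inconsistent entry is not shown here.

```python
from typing import Callable, List, Sequence

# Hypothetical backend: maps a directory identifier to its entry names.
Backend = Callable[[str], List[str]]


def directory_ls(backends: Sequence[Backend], directory_id: str) -> List[str]:
    """Return the first non-empty listing, trying the backends in order."""
    for backend in backends:
        entries = backend(directory_id)
        if entries:
            return entries
        # Empty result: either unknown or only partially copied, try the next one.
    return []
```

A usage example with a partial local copy listed before the main database:

```python
local_copy = {"d1": ["a"], "d2": []}  # "d2" synced without its entry rows
main_db = {"d1": ["a"], "d2": ["x", "y"]}
backends = [lambda d: local_copy.get(d, []), lambda d: main_db.get(d, [])]
entries = directory_ls(backends, "d2")  # served by the main database
```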

commit 95eb9622a00ce99d089bb9accdaed0bdbf1bdc37
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Fri Aug 12 17:33:28 2022 +0200

    postgresql archive: don't use custom types

    The partial copy of the archive on mmca doesn't have them anyway.

 swh/provenance/postgresql/archive.py | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

commit 34a9a1ac220bfabdda26b243c79742bdab090d76
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Fri Aug 12 17:32:09 2022 +0200

    Remove sneaky caches in the postgresql archive implementation

 mypy.ini                             | 3 ---
 requirements.txt                     | 1 -
 swh/provenance/postgresql/archive.py | 3 ---
 3 files changed, 7 deletions(-)

commit bae8f4afda455ca28e64e54f1c9c37c6af2214b6
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Fri Aug 12 17:29:45 2022 +0200

    rabbitmq: Extend timeouts for reception of acks

    The retry logic is not very refined, extending the timeouts makes more
    sense.

 swh/provenance/api/client.py | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

commit 1efc40c7917feaedfa1204b6e4e395d41530d14c
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Fri Aug 12 17:28:31 2022 +0200

    rabbitmq: close the consumer only after all acks are received

    This is not quite working but it seems to reduce issues on worker
    termination a bit.

 swh/provenance/api/client.py | 63 ++++++++++++++++++++++++++++----------------
 1 file changed, 41 insertions(+), 22 deletions(-)

commit ef7cd991712e47a14d7877f726f427a9de22e545
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Fri Aug 12 17:14:58 2022 +0200

    Improve logging in the API client and the revision layer

 swh/provenance/api/client.py | 39 +++++++++++++++++++++++----------------
 swh/provenance/provenance.py |  2 +-
 swh/provenance/revision.py   | 12 ++++++++++++
 3 files changed, 36 insertions(+), 17 deletions(-)

commit 3edf3690258b9e61de5452967c6ee178120276e7
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Fri Aug 12 16:53:11 2022 +0200

    Add systemd notification support

 mypy.ini                         |  3 +++
 swh/provenance/cli.py            | 15 +++++++++++++++
 swh/provenance/journal_client.py |  9 +++++++++
 3 files changed, 27 insertions(+)
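
For readers unfamiliar with systemd notifications, the underlying protocol can be sketched with the standard library alone: the service manager exposes a datagram socket through the `NOTIFY_SOCKET` environment variable, and the service sends it state strings such as `READY=1`. The helper name `sd_notify` below is chosen for the example, and the actual patch may well use a dedicated library instead of this hand-rolled version.

```python
import os
import socket


def sd_notify(state: str) -> bool:
    """Send a state string to systemd; return False when not run under systemd."""
    path = os.environ.get("NOTIFY_SOCKET")
    if not path:
        return False  # no notification socket: nothing to do
    if path.startswith("@"):
        # A leading "@" denotes a Linux abstract socket (NUL-prefixed name).
        path = "\0" + path[1:]
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.connect(path)
        sock.sendall(state.encode())
    return True
```

A long-running client would typically call `sd_notify("READY=1")` once it has finished starting up.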

commit 5cadb13de9eb27b309d2ada3df54dc86452785b3
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Fri Aug 12 16:54:27 2022 +0200

    Try to avoid some circular imports

 swh/provenance/__init__.py   | 2 +-
 swh/provenance/api/server.py | 3 ++-
 2 files changed, 3 insertions(+), 2 deletions(-)

commit 98254d2e930f639c7b1fdb3c27f5eb2a668b857d
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Fri Aug 12 17:17:11 2022 +0200

    blacken swhgraph/archive.py

 swh/provenance/swhgraph/archive.py | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)
Test Plan

Tests pass, and the journal clients seem happy enough...

Diff Detail

Repository
rDPROV Provenance database
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

Build is green

Patch application report for D8243 (id=29726)

Rebasing onto 804c3a371e...

Current branch diff-target is up to date.

See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/650/ for more details.

olasd requested review of this revision. Aug 12 2022, 6:00 PM
This revision is now accepted and ready to land. Aug 12 2022, 6:32 PM
This revision was automatically updated to reflect the committed changes.