
Make the indexer storage write to the journal.
ClosedPublic

Authored by vlorentz on Sep 29 2020, 4:21 PM.

Diff Detail

Repository
rDCIDX Metadata indexer
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

Build has FAILED

Patch application report for D4083 (id=14402)

Rebasing onto e92b931e47...

Current branch diff-target is up to date.
Changes applied before test
commit 842798f886f54f64a7dc6a0bb092a6fcc0f04b63
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Tue Sep 29 16:21:13 2020 +0200

    [WIP] start writing to the journal from the idx-storage

Link to build: https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/51/
See console output for more information: https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/51/console

Build has FAILED

Patch application report for D4083 (id=15212)

Rebasing onto 82d935733b...

Current branch diff-target is up to date.
Changes applied before test
commit 27ca432c1a67d63d2a99c160d16ea1602ae8b8d0
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Tue Sep 29 16:21:13 2020 +0200

    [WIP] start writing to the journal from the idx-storage

Link to build: https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/96/
See console output for more information: https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/96/console

Build has FAILED

Patch application report for D4083 (id=15566)

Could not rebase; Attempt merge onto 300b307394...

Updating 300b307..c644012
Fast-forward
 swh/indexer/metadata.py                      | 41 +-----------
 swh/indexer/storage/__init__.py              | 24 +------
 swh/indexer/storage/db.py                    | 22 -------
 swh/indexer/storage/in_memory.py             | 54 ++++++----------
 swh/indexer/storage/interface.py             | 32 ----------
 swh/indexer/storage/writer.py                | 57 +++++++++++++++++
 swh/indexer/tests/storage/conftest.py        |  6 +-
 swh/indexer/tests/storage/test_api_client.py |  6 +-
 swh/indexer/tests/storage/test_in_memory.py  |  2 +-
 swh/indexer/tests/storage/test_storage.py    | 95 +---------------------------
 swh/indexer/tests/test_origin_metadata.py    | 26 --------
 11 files changed, 94 insertions(+), 271 deletions(-)
 create mode 100644 swh/indexer/storage/writer.py
Changes applied before test
commit c644012f1a653f6d3b1d4a2f53e66eca54bacc6c
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Tue Sep 29 16:21:13 2020 +0200

    [WIP] start writing to the journal from the idx-storage

commit 94c825919320bf3d3e2608b823dc887ed6122413
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Mon Nov 2 13:47:51 2020 +0100

    Remove metadata deletion endpoints and algorithms
    
    This was expected to be used in these two cases:
    
    1. if we remove mappings or file detection from a metadata indexer
    2. if an origin removes all its metadata files
    
    but:
    
    1. if we do so, then we should bump the indexer version, so the
       old metadata will be preserved anyway, as different indexer
       versions get different indexer_configuration_ids
    2. this should be a rather rare event, and even if it happens, we
       might want to keep the old metadata anyway rather than
       nothing (even if it's outdated), for search purposes.
    
    Additionally, this commit is motivated by:
    
    * fewer issues to deal with when writing to Kafka (the journal
      writer currently doesn't support suppression; and we would also have
      to add support for deletion in all consumers)
    * less code (~250 lines)

Link to build: https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/101/
See console output for more information: https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/101/console

vlorentz retitled this revision from "[WIP] start writing to the journal from the idx-storage" to "Make the indexer storage write to the journal." on Nov 5 2020, 3:03 PM.

Build has FAILED

Patch application report for D4083 (id=15659)

Could not rebase; Attempt merge onto e2835bfff6...

Updating e2835bf..e59868e
Fast-forward
 swh/indexer/storage/__init__.py              | 26 ++++++++++++-
 swh/indexer/storage/db.py                    | 12 ++++++
 swh/indexer/storage/in_memory.py             | 39 +++++++++++++------
 swh/indexer/storage/model.py                 | 13 +++++++
 swh/indexer/storage/writer.py                | 57 ++++++++++++++++++++++++++++
 swh/indexer/tests/storage/conftest.py        |  6 ++-
 swh/indexer/tests/storage/test_api_client.py | 29 +++++++++++---
 swh/indexer/tests/storage/test_in_memory.py  |  2 +-
 swh/indexer/tests/storage/test_storage.py    | 41 ++++++++++++++------
 9 files changed, 193 insertions(+), 32 deletions(-)
 create mode 100644 swh/indexer/storage/writer.py
Changes applied before test
commit e59868eb41f72fe3568de477fe4a0711269b375d
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Tue Sep 29 16:21:13 2020 +0200

    Make the indexer storage write to the journal.

commit 8272bc90a367f2d9a9eb231505ad3ccc126c714f
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Nov 5 13:36:44 2020 +0100

    test_origin_intrinsic_metadata_add__deadlock: use more values, to make the test less likely to unexpectedly pass.

commit 5a5af91ac5aee172ff58e68f8b2121de179635e9
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Nov 5 13:37:43 2020 +0100

    test_origin_intrinsic_metadata_add__deadlock: Fix failure on nondeterministic order
    
    postgresql kindly returns the results in the order the test expected... most of the time.

Link to build: https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/114/
See console output for more information: https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/114/console

Build is green

Patch application report for D4083 (id=15659)

See https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/115/ for more details.

Build is green

Patch application report for D4083 (id=15661)

Could not rebase; Attempt merge onto e2835bfff6...

Updating e2835bf..1fd7ae9
Fast-forward
 swh/indexer/storage/__init__.py              | 26 ++++++++++++-
 swh/indexer/storage/db.py                    | 12 ++++++
 swh/indexer/storage/in_memory.py             | 39 +++++++++++++------
 swh/indexer/storage/model.py                 | 13 +++++++
 swh/indexer/storage/writer.py                | 56 ++++++++++++++++++++++++++++
 swh/indexer/tests/storage/conftest.py        |  6 ++-
 swh/indexer/tests/storage/test_api_client.py | 29 +++++++++++---
 swh/indexer/tests/storage/test_in_memory.py  |  2 +-
 swh/indexer/tests/storage/test_storage.py    | 41 ++++++++++++++------
 9 files changed, 192 insertions(+), 32 deletions(-)
 create mode 100644 swh/indexer/storage/writer.py
Changes applied before test
commit 1fd7ae9261e8c6dce5a9cfbfa6c2bc758b833b22
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Tue Sep 29 16:21:13 2020 +0200

    Make the indexer storage write to the journal.

commit 8272bc90a367f2d9a9eb231505ad3ccc126c714f
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Nov 5 13:36:44 2020 +0100

    test_origin_intrinsic_metadata_add__deadlock: use more values, to make the test less likely to unexpectedly pass.

commit 5a5af91ac5aee172ff58e68f8b2121de179635e9
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Nov 5 13:37:43 2020 +0100

    test_origin_intrinsic_metadata_add__deadlock: Fix failure on nondeterministic order
    
    postgresql kindly returns the results in the order the test expected... most of the time.

See https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/116/ for more details.

douardda added a subscriber: douardda.

Overall I'm OK with this, but it really lacks documentation explaining how it works, especially the JournalWriter collaborator object.
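
To illustrate the kind of explanation being asked for, here is a minimal sketch of a journal-writer collaborator. It is not the code from this diff: `JournalWriterSketch`, `InMemoryJournal`, and the message layout are hypothetical, and the step that replaces `indexer_configuration_id` with a tool description via `tool_getter` is an assumption about what such a collaborator might do.

```python
from typing import Any, Callable, Dict, Iterable, List, Optional, Tuple


class InMemoryJournal:
    """Hypothetical journal backend that just records what it is given."""

    def __init__(self) -> None:
        self.messages: List[Tuple[str, Dict[str, Any]]] = []

    def write_additions(self, object_type: str, entries: Iterable[Dict[str, Any]]) -> None:
        self.messages.extend((object_type, entry) for entry in entries)


class JournalWriterSketch:
    """Collaborator the storage delegates to whenever it adds rows.

    If no journal is configured, every call is a no-op, so the storage
    code does not need to check whether journaling is enabled.
    """

    def __init__(
        self,
        tool_getter: Callable[[int], Dict[str, Any]],
        journal: Optional[InMemoryJournal],
    ) -> None:
        self._tool_getter = tool_getter
        self._journal = journal

    def write_additions(self, object_type: str, entries: Iterable[Dict[str, Any]]) -> None:
        if self._journal is None:
            return
        translated = []
        for entry in entries:
            entry = dict(entry)  # copy, so the caller's dict is left untouched
            # Assumption: consumers want the tool description, not a
            # database-local configuration id.
            tool_id = entry.pop("indexer_configuration_id")
            entry["tool"] = self._tool_getter(tool_id)
            translated.append(entry)
        self._journal.write_additions(object_type, translated)
```

In this sketch the storage would build one such object at start-up and call `write_additions` from each of its `*_add` methods.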

swh/indexer/storage/__init__.py
128–130

I know this docstring was outdated before this diff, but maybe it could be updated as part of it.

This revision is now accepted and ready to land. Nov 10 2020, 3:12 PM

update docstring + add type to tool_getter.

Build is green

Patch application report for D4083 (id=15797)

Rebasing onto 8272bc90a3...

Current branch diff-target is up to date.
Changes applied before test
commit 340c73a19604273467804c7c54b217b780a1677a
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Tue Sep 29 16:21:13 2020 +0200

    Make the indexer storage write to the journal.

See https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/117/ for more details.

olasd added inline comments.
swh/indexer/storage/writer.py
64

This will flush the journal writer on every message. Please avoid doing that.
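
As a rough illustration of the concern, the sketch below contrasts flushing after every message with flushing once per batch. `CountingJournal` and both helper functions are hypothetical stand-ins, not the swh.journal API; the point is only that the number of flushes grows with the number of messages in the first variant.

```python
from typing import Any, Dict, Iterable, List


class CountingJournal:
    """Hypothetical journal backend that buffers messages and counts flushes."""

    def __init__(self) -> None:
        self.pending: List[Dict[str, Any]] = []
        self.delivered: List[Dict[str, Any]] = []
        self.flush_count = 0

    def send(self, message: Dict[str, Any]) -> None:
        self.pending.append(message)

    def flush(self) -> None:
        # With a real broker this would block until delivery is acknowledged.
        self.delivered.extend(self.pending)
        self.pending.clear()
        self.flush_count += 1


def write_flushing_each(journal: CountingJournal, messages: Iterable[Dict[str, Any]]) -> None:
    """Pattern the comment objects to: one flush per message."""
    for message in messages:
        journal.send(message)
        journal.flush()


def write_flushing_once(journal: CountingJournal, messages: Iterable[Dict[str, Any]]) -> None:
    """Alternative: queue the whole batch, then flush a single time."""
    for message in messages:
        journal.send(message)
    journal.flush()
```

Both variants deliver the same messages in the same order; they differ only in how often the writer waits on the backend.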