Page MenuHomeSoftware Heritage

Add support for ExtID in the storage
ClosedPublic

Authored by douardda on Dec 9 2020, 5:12 PM.

Details

Summary

This endpoint allows to query the storage for known external identifier (typ. revisions from
their original VCS type intrinsic identifier.)

The underlying data structure is filled via the extid_add() endpoint (which will be the responsibility of loaders).

Related to T2849.

Depends on D4807.

Missing:

  • cassandra implementation,
  • inmemory implementations,
  • feed kafka,
  • consistent names,
  • more tests,
  • migration scripts,
  • other stuff

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
swh/storage/postgresql/db.py
827–841

I haven't tested this but I think it should work.

swh/storage/sql/30-schema.sql
282–283

We'll want to add an index (unique) on these columns.

swh/storage/sql/40-funcs.sql
561

And we'll want to add an "on conflict (extid_pkey) do nothing" to this insert to avoid bombing out on already existing entries.

swh/storage/postgresql/db.py
827–841

alternatively, LEFT JOIN extid USING (extid) WHERE extid_type = '%s'. both are equivalent.

swh/storage/postgresql/db.py
827–841

that's what I tried but it breaks my tests (because [] != [None, None, None, None])

swh/storage/postgresql/db.py
827–841

right. you need an inner join.

(as a good rule of thumb, always use inner joins in SQL, one rarely ever needs the other types of joins)

swh/storage/postgresql/db.py
827–841

I finally have it working (with a left join). My main problem was to figure how to make it happy with the USING(target_type) (in extid_get_from_swhid_list)...

Rewrite SQL queries a bit better (thx olasd and vlorentz)

Build has FAILED

Patch application report for D4698 (id=17061)

Rebasing onto 2b35198d30...

Current branch diff-target is up to date.
Changes applied before test
commit 3e690230053d09aa7f25906126a316442a4df058
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Dec 9 16:57:44 2020 +0100

    Add ExtID query support to the Storage
    
    These endpoints allow to add and query the storage for known ExtID from SWHID
    (typically get original VCS' revision intrinsic identifier from SWHID).
    
    The underlying data structure is to be filled typically by loaders using
    the `extid_add()` endpoint.
    
    This only provides the Postgresql implementation.
    
    Related to T2849.

commit 2a5bb7994f283ef036c4b9a1a051b020c298ed38
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Dec 9 16:54:25 2020 +0100

    Add hg revisions to the test data set

Link to build: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1079/
See console output for more information: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1079/console

Add unicity index and corresponding test

Build has FAILED

Patch application report for D4698 (id=17064)

Rebasing onto 2b35198d30...

Current branch diff-target is up to date.
Changes applied before test
commit e1bdbf89deab8e5a80608cf0345c3d91a5e3c3a3
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Dec 9 16:57:44 2020 +0100

    Add ExtID query support to the Storage
    
    These endpoints allow to add and query the storage for known ExtID from SWHID
    (typically get original VCS' revision intrinsic identifier from SWHID).
    
    The underlying data structure is to be filled typically by loaders using
    the `extid_add()` endpoint.
    
    This only provides the Postgresql implementation.
    
    Related to T2849.

commit 2a5bb7994f283ef036c4b9a1a051b020c298ed38
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Dec 9 16:54:25 2020 +0100

    Add hg revisions to the test data set

Link to build: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1080/
See console output for more information: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1080/console

Build has FAILED

Patch application report for D4698 (id=17169)

Could not rebase; Attempt merge onto 30945a5890...

Updating 30945a58..632b0666
Fast-forward
 README.md                           |  11 +++
 swh/storage/cassandra/cql.py        | 127 ++++++++++++++++++++++++++
 swh/storage/cassandra/model.py      |  11 +++
 swh/storage/cassandra/schema.py     |  25 ++++++
 swh/storage/cassandra/storage.py    |  82 +++++++++++++++++
 swh/storage/in_memory.py            |  42 +++++++++
 swh/storage/interface.py            |  48 ++++++++++
 swh/storage/postgresql/db.py        |  48 ++++++++++
 swh/storage/postgresql/storage.py   |  73 +++++++++++++++
 swh/storage/sql/30-schema.sql       |  16 ++++
 swh/storage/sql/40-funcs.sql        |  17 ++++
 swh/storage/sql/60-indexes.sql      |   4 +
 swh/storage/tests/storage_data.py   | 136 +++++++++++++++++++++++++++-
 swh/storage/tests/storage_tests.py  | 174 ++++++++++++++++++++++++++++++++++++
 swh/storage/tests/test_cassandra.py |  16 ++--
 swh/storage/writer.py               |   4 +
 tox.ini                             |   1 +
 17 files changed, 827 insertions(+), 8 deletions(-)
Changes applied before test
commit 632b0666fc612e63c55da976ea4fd88ed1c7a065
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Dec 9 16:57:44 2020 +0100

    Add ExtID query support to the Storage
    
    These endpoints allow to add and query the storage for known ExtID from SWHID
    (typically get original VCS' revision intrinsic identifier from SWHID).
    
    The underlying data structure is to be filled typically by loaders using
    the `extid_add()` endpoint.
    
    This only provides the Postgresql implementation.
    
    Related to T2849.

commit a10abe86e06fa10e00f323df491f85daa6b667d5
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Dec 9 16:54:25 2020 +0100

    Add hg revisions to the test data set

commit 728c3eeaae84a7e249a712a5545cc9ba682403c3
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Jan 8 10:39:42 2021 +0100

    Allow to use the JAVA_HOME environment for cassandra tests
    
    This allows to enforce a specific version of java to be used. For
    example, since cassandra seems not to support java 14 yet, this allows
    to run tests on bullseye:
    
      JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64/ pytest swh

Link to build: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1096/
See console output for more information: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1096/console

rebase, rework to use new SWHID classes/model

there are still some tests to make pass.

Build has FAILED

Patch application report for D4698 (id=18681)

Rebasing onto 88ff2c2fa0...

Current branch diff-target is up to date.
Changes applied before test
commit 8d4878812302bc49b68c206adf5eca1fd009b315
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Dec 9 16:57:44 2020 +0100

    Add ExtID query support to the Storage
    
    These endpoints allow to add and query the storage for known ExtID from SWHID
    (typically get original VCS' revision intrinsic identifier from SWHID).
    
    The underlying data structure is to be filled typically by loaders using
    the `extid_add()` endpoint.
    
    This only provides the Postgresql implementation.
    
    Related to T2849.

commit 6f3b728d436658a0960f0a74504af5c1d96452dd
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Dec 9 16:54:25 2020 +0100

    Add hg revisions to the test data set

Link to build: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1195/
See console output for more information: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1195/console

All tests should now pass (with D4807)

douardda retitled this revision from [WIP][RFC] Add support for ExtID in the storage to Add support for ExtID in the storage.Mar 9 2021, 10:21 AM
douardda edited the summary of this revision. (Show Details)

Build has FAILED

Patch application report for D4698 (id=18696)

Rebasing onto 88ff2c2fa0...

Current branch diff-target is up to date.
Changes applied before test
commit 7645145413976cad9d93d50a0b248e1b3e54b2a9
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Dec 9 16:57:44 2020 +0100

    Add ExtID query support to the Storage
    
    These endpoints allow to add and query the storage for known ExtID from SWHID
    (typically get original VCS' revision intrinsic identifier from SWHID).
    
    The underlying data structure is to be filled typically by loaders using
    the `extid_add()` endpoint.
    
    This only provides the Postgresql implementation.
    
    Related to T2849.

commit 6f3b728d436658a0960f0a74504af5c1d96452dd
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Dec 9 16:54:25 2020 +0100

    Add hg revisions to the test data set

Link to build: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1196/
See console output for more information: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1196/console

ExtID.target is now a CoreSWHID

also fix tests , journal writer and backfiller.

forgot sql migration script

Build has FAILED

Patch application report for D4698 (id=18716)

Rebasing onto 88ff2c2fa0...

Current branch diff-target is up to date.
Changes applied before test
commit 64b03901568f604cefb740bec33c7bbe97e08919
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Dec 9 16:57:44 2020 +0100

    Add ExtID query support to the Storage
    
    These endpoints allow to add and query the storage for known ExtID from SWHID
    (typically get original VCS' revision intrinsic identifier from SWHID).
    
    The underlying data structure is to be filled typically by loaders using
    the `extid_add()` endpoint.
    
    This only provides the Postgresql implementation.
    
    Related to T2849.

commit 6f3b728d436658a0960f0a74504af5c1d96452dd
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Dec 9 16:54:25 2020 +0100

    Add hg revisions to the test data set

Link to build: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1201/
See console output for more information: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1201/console

Build has FAILED

Patch application report for D4698 (id=18718)

Rebasing onto 88ff2c2fa0...

Current branch diff-target is up to date.
Changes applied before test
commit f21645fcc48f97d9ffde9dc0105d2a910c7188a3
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Dec 9 16:57:44 2020 +0100

    Add ExtID query support to the Storage
    
    These endpoints allow to add and query the storage for known ExtID from SWHID
    (typically get original VCS' revision intrinsic identifier from SWHID).
    
    The underlying data structure is to be filled typically by loaders using
    the `extid_add()` endpoint.
    
    This only provides the Postgresql implementation.
    
    Related to T2849.

commit 6f3b728d436658a0960f0a74504af5c1d96452dd
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Dec 9 16:54:25 2020 +0100

    Add hg revisions to the test data set

Link to build: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1202/
See console output for more information: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1202/console

Do we really need the two "index tables" in Cassandra? Tweaking the partition key of the main table should allow using it to query one of target -> extid or the reverse?

swh/storage/cassandra/schema.py
229

If you switch this to ((target, target_type), extid, extid_type)

239–245

Then I don't think this index is needed?

LGTM overall


I agree with @olasd. It's sad to break the symmetry, but it spares an indirection and some storage space.

to query one of target -> extid or the reverse

If the main table does "target -> extid", it reduces the storage space if some extids are large.

On the other hand, if the main table does "extid -> target", it spares the a query when getting targets from extids, which is probably by far what we will need the most, right?


Also, remove the docstrings in postgresql/storage.py, they are already in interface.py.

swh/storage/interface.py
510
523
swh/storage/postgresql/converters.py
42–48

dupe

swh/storage/postgresql/db.py
21–27

plz no, I removed it from swh.model.identifiers because we don't need it anymore.

OBJECT_TYPE_MAP[key] -> ObjectType(key)

823–825

why mangle_query_key?

This revision now requires changes to proceed.Mar 10 2021, 10:35 AM

LGTM overall


I agree with @olasd. It's sad to break the symmetry, but it spares an indirection and some storage space.

to query one of target -> extid or the reverse

If the main table does "target -> extid", it reduces the storage space if some extids are large.

On the other hand, if the main table does "extid -> target", it spares the a query when getting targets from extids, which is probably by far what we will need the most, right?

Yeah, you're right. So, flip my suggestion around :-)

swh/storage/postgresql/converters.py
42–48

so what's the "proper" way of doing this mapping?

swh/storage/postgresql/db.py
21–27

you mean ObjectType(key).name.lower() ?
It's hideous in any case, and that's exactly why I want to get rid of this naming/constant hell.

swh/storage/postgresql/converters.py
42–48

OBJECT_TYPE_MAP_SHORT[key] -> ObjectType[key.upper()]

swh/storage/postgresql/db.py
21–27

yes. but if you really want a map, let's add it back to swh-model, it doesn't belong in swh-storage

Remove extid_by_extid in cassandre

also:

  • rebase
  • fix test_backfiller (in a dedicated revision)
  • apply other suggestions made by olasd and vlorentz (all? most?)

Build has FAILED

Patch application report for D4698 (id=18731)

Rebasing onto c4fdd6dbd0...

Current branch diff-target is up to date.
Changes applied before test
commit c8e59c111d0f55a93bfacd01bd4b914308ea1290
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Dec 9 16:57:44 2020 +0100

    Add ExtID query support to the Storage
    
    These endpoints allow to add and query the storage for known ExtID from SWHID
    (typically get original VCS' revision intrinsic identifier from SWHID).
    
    The underlying data structure is to be filled typically by loaders using
    the `extid_add()` endpoint.
    
    This only provides the Postgresql implementation.
    
    Related to T2849.

commit 15e0f388214de8d85ae68be636234e0908cfbe1c
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Dec 9 16:54:25 2020 +0100

    Add hg revisions to the test data set

commit 82ce7bf07ce24f6cafe84171f1fb71222e89906d
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Mar 10 15:46:29 2021 +0100

    Make sure test_backfill does not depend on 2 dict keys
    
    being miraculously listed the same.

Link to build: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1207/
See console output for more information: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1207/console

Add missing support for extid in the replayer

also add a revision to import TEST_OBJECTS from the swh.model

Build has FAILED

Patch application report for D4698 (id=18732)

Rebasing onto c4fdd6dbd0...

Current branch diff-target is up to date.
Changes applied before test
commit f34717e2061978d90e0b40941930e0a438a9f27a
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Dec 9 16:57:44 2020 +0100

    Add ExtID query support to the Storage
    
    These endpoints allow to add and query the storage for known ExtID from SWHID
    (typically get original VCS' revision intrinsic identifier from SWHID).
    
    The underlying data structure is to be filled typically by loaders using
    the `extid_add()` endpoint.
    
    This only provides the Postgresql implementation.
    
    Related to T2849.

commit 6a7773225d41995eb320aa98126e41ddce3cb24b
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Dec 9 16:54:25 2020 +0100

    Add hg revisions to the test data set

commit e83452b3577a0985189c4be52735b34c274563e8
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Mar 10 16:21:26 2021 +0100

    Import TEST_OBJECTS from swh.model instead of swh.journal
    
    this later has been deprecated for a while now.

commit 82ce7bf07ce24f6cafe84171f1fb71222e89906d
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Mar 10 15:46:29 2021 +0100

    Make sure test_backfill does not depend on 2 dict keys
    
    being miraculously listed the same.

Link to build: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1208/
See console output for more information: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1208/console

get rid of the OBJECT_TYPE_MAP_SHORT map, as requested

Build has FAILED

Patch application report for D4698 (id=18735)

Rebasing onto c4fdd6dbd0...

Current branch diff-target is up to date.
Changes applied before test
commit aa9b8f62f419be035c3ce85c866f7db2d06155f1
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Dec 9 16:57:44 2020 +0100

    Add ExtID query support to the Storage
    
    These endpoints allow to add and query the storage for known ExtID from SWHID
    (typically get original VCS' revision intrinsic identifier from SWHID).
    
    The underlying data structure is to be filled typically by loaders using
    the `extid_add()` endpoint.
    
    This only provides the Postgresql implementation.
    
    Related to T2849.

commit 6a7773225d41995eb320aa98126e41ddce3cb24b
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Dec 9 16:54:25 2020 +0100

    Add hg revisions to the test data set

commit e83452b3577a0985189c4be52735b34c274563e8
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Mar 10 16:21:26 2021 +0100

    Import TEST_OBJECTS from swh.model instead of swh.journal
    
    this later has been deprecated for a while now.

commit 82ce7bf07ce24f6cafe84171f1fb71222e89906d
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Mar 10 15:46:29 2021 +0100

    Make sure test_backfill does not depend on 2 dict keys
    
    being miraculously listed the same.

Link to build: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1209/
See console output for more information: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1209/console

swh/storage/cassandra/storage.py
1389–1391

you must re-filter the rows here, because extid_get_from_target will return unrelated rows in case of murmur3 collisions (which are likely to happen, the hash space is 32 bits). See _content_get_from_hash.

See also the *_murmur3_collision tests to test it.

This revision now requires changes to proceed.Mar 10 2021, 5:37 PM

Make sure extid_get_from_target() is immune to murmur3 collisions

as suggested by vlorentz.

Build has FAILED

Patch application report for D4698 (id=18741)

Rebasing onto c4fdd6dbd0...

Current branch diff-target is up to date.
Changes applied before test
commit dc5af6fcb1e46ee4576da192245992b815a405d8
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Dec 9 16:57:44 2020 +0100

    Add ExtID query support to the Storage
    
    These endpoints allow to add and query the storage for known ExtID from SWHID
    (typically get original VCS' revision intrinsic identifier from SWHID).
    
    The underlying data structure is to be filled typically by loaders using
    the `extid_add()` endpoint.
    
    This only provides the Postgresql implementation.
    
    Related to T2849.

commit 6a7773225d41995eb320aa98126e41ddce3cb24b
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Dec 9 16:54:25 2020 +0100

    Add hg revisions to the test data set

commit e83452b3577a0985189c4be52735b34c274563e8
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Mar 10 16:21:26 2021 +0100

    Import TEST_OBJECTS from swh.model instead of swh.journal
    
    this later has been deprecated for a while now.

commit 82ce7bf07ce24f6cafe84171f1fb71222e89906d
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Mar 10 15:46:29 2021 +0100

    Make sure test_backfill does not depend on 2 dict keys
    
    being miraculously listed the same.

Link to build: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1211/
See console output for more information: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1211/console

vlorentz added inline comments.
swh/storage/cassandra/cql.py
1033–1038

Please add a short comment explaining why this is needed

This revision is now accepted and ready to land.Mar 11 2021, 12:18 PM

bump dep on swh.model + add a comment (as requested)

Build has FAILED

Patch application report for D4698 (id=18743)

Rebasing onto c4fdd6dbd0...

Current branch diff-target is up to date.
Changes applied before test
commit b8e10f00cfbc49bafb11104c282bd2871a14f14f
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Dec 9 16:57:44 2020 +0100

    Add ExtID query support to the Storage
    
    These endpoints allow to add and query the storage for known ExtID from SWHID
    (typically get original VCS' revision intrinsic identifier from SWHID).
    
    The underlying data structure is to be filled typically by loaders using
    the `extid_add()` endpoint.
    
    This only provides the Postgresql implementation.
    
    Related to T2849.

commit 6a7773225d41995eb320aa98126e41ddce3cb24b
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Dec 9 16:54:25 2020 +0100

    Add hg revisions to the test data set

commit e83452b3577a0985189c4be52735b34c274563e8
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Mar 10 16:21:26 2021 +0100

    Import TEST_OBJECTS from swh.model instead of swh.journal
    
    this later has been deprecated for a while now.

commit 82ce7bf07ce24f6cafe84171f1fb71222e89906d
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Mar 10 15:46:29 2021 +0100

    Make sure test_backfill does not depend on 2 dict keys
    
    being miraculously listed the same.

Link to build: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1212/
See console output for more information: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1212/console

Build is green

Patch application report for D4698 (id=18743)

Rebasing onto c4fdd6dbd0...

Current branch diff-target is up to date.
Changes applied before test
commit b8e10f00cfbc49bafb11104c282bd2871a14f14f
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Dec 9 16:57:44 2020 +0100

    Add ExtID query support to the Storage
    
    These endpoints allow to add and query the storage for known ExtID from SWHID
    (typically get original VCS' revision intrinsic identifier from SWHID).
    
    The underlying data structure is to be filled typically by loaders using
    the `extid_add()` endpoint.
    
    This only provides the Postgresql implementation.
    
    Related to T2849.

commit 6a7773225d41995eb320aa98126e41ddce3cb24b
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Dec 9 16:54:25 2020 +0100

    Add hg revisions to the test data set

commit e83452b3577a0985189c4be52735b34c274563e8
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Mar 10 16:21:26 2021 +0100

    Import TEST_OBJECTS from swh.model instead of swh.journal
    
    this later has been deprecated for a while now.

commit 82ce7bf07ce24f6cafe84171f1fb71222e89906d
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Mar 10 15:46:29 2021 +0100

    Make sure test_backfill does not depend on 2 dict keys
    
    being miraculously listed the same.

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1213/ for more details.