Page MenuHomeSoftware Heritage

[WIP][RFC] Add support for ExtID in the storage
Needs ReviewPublic

Authored by douardda on Dec 9 2020, 5:12 PM.

Details

Summary

This endpoint allows to query the storage for known external identifier (typ. revisions from
their original VCS type intrinsic identifier.)

The underlying data structure is filled via the extid_add() endpoint (which will be the responsibility of loaders).

Related to T2849.

Depends on D4807.

Missing:

  • cassandra implementation,
  • inmemory implementations,
  • feed kafka,
  • consistent names,
  • more tests,
  • migration scripts,
  • other stuff

Diff Detail

Event Timeline

douardda retitled this revision from [WIP][RFC] Add hg revisions to the test data set to [WIP][RFC] Add a revision_id_from_vcs() method to the Storage.Dec 9 2020, 5:17 PM
douardda edited the summary of this revision. (Show Details)

Thanks for having a shot at this.

Rather than fishing the original VCS info out of the Revision object (introducing tight coupling between loaders and storage for that aspect of the storage of metadata), I'd prefer having an explicit (opt-in) API for loaders to record that information: as this is likely to be used for releases or other types of objects as well, I'm not sure we'll want to have hard-coded quirks for a bunch of loaders and need a new release of swh.storage every time we want to enable this API for them.

Build was aborted

Patch application report for D4698 (id=16661)

Rebasing onto 04ae89f4b4...

Current branch diff-target is up to date.
Changes applied before test
commit a38ffa19dedd933dbabc903496a32ad4b6999586
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Dec 9 16:57:44 2020 +0100

    Add a revision_id_from_vcs() method to the Storage
    
    This endpoint allows to query the storage for known revisions from
    their original VCS type intrinsic identifier.
    
    The underlying data structure is filled in `revision_add()`.
    
    Related to T2849.

commit be2ebf46b27b637c04954ca57542bfbe0d9fccdb
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Dec 9 16:54:25 2020 +0100

    Add hg revisions to the test data set

Link to build: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1073/
See console output for more information: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1073/console

zack added inline comments.
swh/storage/common.py
17 ↗(On Diff #16661)

Naming nitpick, as we discussed (at least tentatively) in T2849, this is not going to be VCS-only, we might have external IDs also for stuff like tarballs and other archives.

So, should we call these things "external IDs", rather than "VCS IDs"?
And, if so, rename revision_to_vcsid to revision_to_extid.
Or something.

vlorentz added inline comments.
swh/storage/common.py
17 ↗(On Diff #16661)

Already bikeshedded (via Jitsi), and we agreed on using the name "vcsid" anyway.

I would be fine with "extid", though.

I completely agree with @olasd 's comment.

And why is it restricted to revisions? Why not have an endpoint that returns an arbitrary SWHID?

I completely agree with @olasd 's comment.

And why is it restricted to revisions? Why not have an endpoint that returns an arbitrary SWHID?

Not sure what would be the use case, but the fact @olasd proposed to have a swhid_type column shows a pattern there.

Then, also not sure what column type we should use for the vcs_type / extid_type column. enum or just str?

Not sure what would be the use case, but the fact @olasd proposed to have a swhid_type column shows a pattern there.

For example, the git loader could use it when loading a sha256 repo, to see if a directory was already ingested.

Then, also not sure what column type we should use for the vcs_type / extid_type column. enum or just str?

good question. It could be either just the name of the loader, or a pair (vcs name, name of type in the vcs)

Refactor the API and implementation so ExtID are not created as part of revision_add()

but instead are the responsibility of loaders via an extid_add() method.

Build has FAILED

Patch application report for D4698 (id=17037)

Rebasing onto 2b35198d30...

Current branch diff-target is up to date.
Changes applied before test
commit 733108cea78438c4233f123dba6b860ec61ad031
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Dec 9 16:57:44 2020 +0100

    Add ExtID query support to the Storage
    
    These endpoints allow to add and query the storage for known ExtID from SWHID
    (typically get original VCS' revision intrinsic identifier from SWHID).
    
    The underlying data structure is to be filled typically by loaders using
    the `extid_add()` endpoint.
    
    This only provides the Postgresql implementation.
    
    Related to T2849.

commit 2a5bb7994f283ef036c4b9a1a051b020c298ed38
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Dec 9 16:54:25 2020 +0100

    Add hg revisions to the test data set

Link to build: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1078/
See console output for more information: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1078/console

douardda retitled this revision from [WIP][RFC] Add a revision_id_from_vcs() method to the Storage to [WIP][RFC] Add support for ExtID in the storage.Jan 5 2021, 5:18 PM
douardda edited the summary of this revision. (Show Details)

Don't forget publishing to Kafka (but it should probably go in a future diff instead of bloating this one)

swh/storage/common.py
19–22 ↗(On Diff #17037)
swh/storage/postgresql/db.py
828–832

Doesn't this load *all* the extids of a given VCS in RAM? This could get quite big for Mercurial...

You could join with the extid table directly.

851–855

same

swh/storage/common.py
19–22 ↗(On Diff #17037)

actually this function is not used any more, forgot to remove it...

swh/storage/postgresql/db.py
828–832

yeah I was pretty sure while writing this that it was a naive approach... as you can see, my SQL skills are pretty narrow...

You should probably add a test to insert duplicate entries in separate calls (and make sure you only return a single row)

swh/storage/postgresql/db.py
827–841

I haven't tested this but I think it should work.

swh/storage/sql/30-schema.sql
282–283

We'll want to add an index (unique) on these columns.

swh/storage/sql/40-funcs.sql
561

And we'll want to add an "on conflict (extid_pkey) do nothing" to this insert to avoid bombing out on already existing entries.

swh/storage/postgresql/db.py
827–841

alternatively, LEFT JOIN extid USING (extid) WHERE extid_type = '%s'. both are equivalent.

swh/storage/postgresql/db.py
827–841

that's what I tried but it breaks my tests (because [] != [None, None, None, None])

swh/storage/postgresql/db.py
827–841

right. you need an inner join.

(as a good rule of thumb, always use inner joins in SQL, one rarely ever needs the other types of joins)

swh/storage/postgresql/db.py
827–841

I finally have it working (with a left join). My main problem was to figure how to make it happy with the USING(target_type) (in extid_get_from_swhid_list)...

Rewrite SQL queries a bit better (thx olasd and vlorentz)

Build has FAILED

Patch application report for D4698 (id=17061)

Rebasing onto 2b35198d30...

Current branch diff-target is up to date.
Changes applied before test
commit 3e690230053d09aa7f25906126a316442a4df058
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Dec 9 16:57:44 2020 +0100

    Add ExtID query support to the Storage
    
    These endpoints allow to add and query the storage for known ExtID from SWHID
    (typically get original VCS' revision intrinsic identifier from SWHID).
    
    The underlying data structure is to be filled typically by loaders using
    the `extid_add()` endpoint.
    
    This only provides the Postgresql implementation.
    
    Related to T2849.

commit 2a5bb7994f283ef036c4b9a1a051b020c298ed38
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Dec 9 16:54:25 2020 +0100

    Add hg revisions to the test data set

Link to build: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1079/
See console output for more information: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1079/console

Add unicity index and corresponding test

Build has FAILED

Patch application report for D4698 (id=17064)

Rebasing onto 2b35198d30...

Current branch diff-target is up to date.
Changes applied before test
commit e1bdbf89deab8e5a80608cf0345c3d91a5e3c3a3
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Dec 9 16:57:44 2020 +0100

    Add ExtID query support to the Storage
    
    These endpoints allow to add and query the storage for known ExtID from SWHID
    (typically get original VCS' revision intrinsic identifier from SWHID).
    
    The underlying data structure is to be filled typically by loaders using
    the `extid_add()` endpoint.
    
    This only provides the Postgresql implementation.
    
    Related to T2849.

commit 2a5bb7994f283ef036c4b9a1a051b020c298ed38
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Dec 9 16:54:25 2020 +0100

    Add hg revisions to the test data set

Link to build: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1080/
See console output for more information: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1080/console

Build has FAILED

Patch application report for D4698 (id=17169)

Could not rebase; Attempt merge onto 30945a5890...

Updating 30945a58..632b0666
Fast-forward
 README.md                           |  11 +++
 swh/storage/cassandra/cql.py        | 127 ++++++++++++++++++++++++++
 swh/storage/cassandra/model.py      |  11 +++
 swh/storage/cassandra/schema.py     |  25 ++++++
 swh/storage/cassandra/storage.py    |  82 +++++++++++++++++
 swh/storage/in_memory.py            |  42 +++++++++
 swh/storage/interface.py            |  48 ++++++++++
 swh/storage/postgresql/db.py        |  48 ++++++++++
 swh/storage/postgresql/storage.py   |  73 +++++++++++++++
 swh/storage/sql/30-schema.sql       |  16 ++++
 swh/storage/sql/40-funcs.sql        |  17 ++++
 swh/storage/sql/60-indexes.sql      |   4 +
 swh/storage/tests/storage_data.py   | 136 +++++++++++++++++++++++++++-
 swh/storage/tests/storage_tests.py  | 174 ++++++++++++++++++++++++++++++++++++
 swh/storage/tests/test_cassandra.py |  16 ++--
 swh/storage/writer.py               |   4 +
 tox.ini                             |   1 +
 17 files changed, 827 insertions(+), 8 deletions(-)
Changes applied before test
commit 632b0666fc612e63c55da976ea4fd88ed1c7a065
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Dec 9 16:57:44 2020 +0100

    Add ExtID query support to the Storage
    
    These endpoints allow to add and query the storage for known ExtID from SWHID
    (typically get original VCS' revision intrinsic identifier from SWHID).
    
    The underlying data structure is to be filled typically by loaders using
    the `extid_add()` endpoint.
    
    This only provides the Postgresql implementation.
    
    Related to T2849.

commit a10abe86e06fa10e00f323df491f85daa6b667d5
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Dec 9 16:54:25 2020 +0100

    Add hg revisions to the test data set

commit 728c3eeaae84a7e249a712a5545cc9ba682403c3
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Jan 8 10:39:42 2021 +0100

    Allow to use the JAVA_HOME environment for cassandra tests
    
    This allows to enforce a specific version of java to be used. For
    example, since cassandra seems not to support java 14 yet, this allows
    to run tests on bullseye:
    
      JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64/ pytest swh

Link to build: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1096/
See console output for more information: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1096/console