Page MenuHomeSoftware Heritage

Fix database queries related to the origin-revision layer
ClosedPublic

Authored by aeviso on Jun 29 2021, 12:48 PM.

Details

Summary

This required allowing null dates in the revision table so that revision can be added
by the origin-revision layer algorithm but not recognized as already processed by the
revision-content layer. Revision and origin entries are now inserted in the database
prior to inserting rows to revision_in_origin and revision_before_revision relations,
so that internal ids are properly resolved.

Depends on D5925

Diff Detail

Repository
rDPROV Provenance database
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

Build is green

Patch application report for D5943 (id=21338)

Could not rebase; Attempt merge onto d892b29e40...

Updating d892b29..37ac81f
Fast-forward
 swh/provenance/archive.py                          |  24 ++---
 swh/provenance/graph.py                            |   4 +-
 swh/provenance/model.py                            |  42 +++-----
 swh/provenance/origin.py                           |  19 ++--
 swh/provenance/postgresql/archive.py               | 118 ++++++++------------
 swh/provenance/postgresql/provenancedb_base.py     | 120 +++++++++++++++------
 .../postgresql/provenancedb_with_path.py           |  16 +--
 .../postgresql/provenancedb_without_path.py        |  20 ++--
 swh/provenance/provenance.py                       | 103 +++++++++++-------
 swh/provenance/revision.py                         |   5 +-
 swh/provenance/sql/30-schema.sql                   |   4 +-
 swh/provenance/storage/archive.py                  |  30 +++---
 swh/provenance/tests/conftest.py                   |  22 ++--
 .../tests/data/generate_storage_from_git.py        |   3 +-
 swh/provenance/tests/data/with-merges.msgpack      | Bin 0 -> 7501 bytes
 ...repo_with_merges.yaml => with-merges_repo.yaml} |   0
 ...s-visits-01.yaml => with-merges_visits-01.yaml} |   0
 swh/provenance/tests/test_archive_interface.py     |  51 +++++++++
 swh/provenance/tests/test_origin_iterator.py       |   8 +-
 swh/provenance/tests/test_provenance_db.py         |   4 +-
 swh/provenance/tests/test_provenance_heuristics.py |   7 +-
 21 files changed, 347 insertions(+), 253 deletions(-)
 create mode 100644 swh/provenance/tests/data/with-merges.msgpack
 rename swh/provenance/tests/data/{repo_with_merges.yaml => with-merges_repo.yaml} (100%)
 rename swh/provenance/tests/data/{repo_with_merges-visits-01.yaml => with-merges_visits-01.yaml} (100%)
 create mode 100644 swh/provenance/tests/test_archive_interface.py
Changes applied before test
commit 37ac81faf15a32c4471a3c4ee5140bcb9bf57178
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Thu Jun 24 16:10:38 2021 +0200

    Fix database queries related to the origin-revision layer
    
    This required allowing null dates in the `revision` table so that revision can be added
    by the origin-revision layer algorithm but not recognized as already processed by the
    revision-content layer. Revision and origin entries are now inserted in the database
    prior to inserting rows to revision_in_origin and revision_before_revision relations,
    so that internal ids are properly resolved.

commit 4eb166cc4f2aa036c932b9a5eb462454a70ee0d9
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Fri Jun 25 13:38:26 2021 +0200

    Add test to compare both ArchiveInterface implementations
    
    Improve documentation of the interface and complete pending TODO's.

commit 01ac9eea375258ac1e000389d3fd286d0dbae458
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Thu Jun 24 16:25:15 2021 +0200

    Rename test files to keep naming convension
    
    Also added missing .msgpack file dump for new with-merges repository.

commit 76d1560924251396c1ac63c286d8612ce0f7e9d9
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Thu Jun 24 16:05:24 2021 +0200

    Refactor ArchiveInterface to fit origin-revision layer needs
    
    Replace `revision_get` method by `revision_get_parents` returning an iterable of
    parents' ids only, instead of a swh.model.model.Revision object.

commit df69a9e57692ed9d4d870c295a21b3ac187d7b9c
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Wed Jun 23 20:00:40 2021 +0200

    Use `Sha1Git` type to explicitly state the kind of identifiers
    
    Previous occurrences of `bytes` and `Sha1` are not correctly using `Sha1Git`.
    Also, some bytes conversion methods were replaced by their counterparts in
    the swh.model.hashutil module.

commit fa22dc902781e30e46823030681f003983cc6d6e
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Wed Jun 23 19:12:06 2021 +0200

    Add support for sha1 identifiers for origins

See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/193/ for more details.

Build is green

Patch application report for D5943 (id=21367)

Could not rebase; Attempt merge onto d892b29e40...

Updating d892b29..3171ae2
Fast-forward
 swh/provenance/archive.py                          |  24 ++---
 swh/provenance/graph.py                            |   4 +-
 swh/provenance/model.py                            |  42 +++-----
 swh/provenance/origin.py                           |  19 ++--
 swh/provenance/postgresql/archive.py               | 118 ++++++++------------
 swh/provenance/postgresql/provenancedb_base.py     | 120 +++++++++++++++------
 .../postgresql/provenancedb_with_path.py           |  16 +--
 .../postgresql/provenancedb_without_path.py        |  20 ++--
 swh/provenance/provenance.py                       | 103 +++++++++++-------
 swh/provenance/revision.py                         |   5 +-
 swh/provenance/sql/30-schema.sql                   |   4 +-
 swh/provenance/storage/archive.py                  |  30 +++---
 swh/provenance/tests/conftest.py                   |  22 ++--
 .../tests/data/generate_storage_from_git.py        |   3 +-
 swh/provenance/tests/data/with-merges.msgpack      | Bin 0 -> 7501 bytes
 ...repo_with_merges.yaml => with-merges_repo.yaml} |   0
 ...s-visits-01.yaml => with-merges_visits-01.yaml} |   0
 swh/provenance/tests/test_archive_interface.py     |  51 +++++++++
 swh/provenance/tests/test_origin_iterator.py       |   8 +-
 swh/provenance/tests/test_provenance_db.py         |   4 +-
 swh/provenance/tests/test_provenance_heuristics.py |   7 +-
 21 files changed, 347 insertions(+), 253 deletions(-)
 create mode 100644 swh/provenance/tests/data/with-merges.msgpack
 rename swh/provenance/tests/data/{repo_with_merges.yaml => with-merges_repo.yaml} (100%)
 rename swh/provenance/tests/data/{repo_with_merges-visits-01.yaml => with-merges_visits-01.yaml} (100%)
 create mode 100644 swh/provenance/tests/test_archive_interface.py
Changes applied before test
commit 3171ae2f129df433689fd22e32c8eeebf7af4171
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Thu Jun 24 16:10:38 2021 +0200

    Fix database queries related to the origin-revision layer
    
    This required allowing null dates in the `revision` table so that revision can be added
    by the origin-revision layer algorithm but not recognized as already processed by the
    revision-content layer. Revision and origin entries are now inserted in the database
    prior to inserting rows to revision_in_origin and revision_before_revision relations,
    so that internal ids are properly resolved.

commit 6736f6068280f167df5616681dee9ad67b2b7dbd
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Fri Jun 25 13:38:26 2021 +0200

    Add test to compare both `ArchiveInterface` implementations
    
    Improve documentation of the interface and complete pending TODO's.

commit dde867254e51dd87f4aba3cdea59da8bffc2d160
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Thu Jun 24 16:25:15 2021 +0200

    Rename test files to keep naming convension
    
    Also added missing .msgpack file dump for new with-merges repository.

commit 14001c1844598a3d4ebd1b5f609070f9c85dcaa9
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Thu Jun 24 16:05:24 2021 +0200

    Refactor `ArchiveInterface` to fit origin-revision layer needs
    
    Replace `revision_get` method by `revision_get_parents` returning an iterable of
    parents' ids only, instead of a swh.model.model.Revision object.

commit df69a9e57692ed9d4d870c295a21b3ac187d7b9c
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Wed Jun 23 20:00:40 2021 +0200

    Use `Sha1Git` type to explicitly state the kind of identifiers
    
    Previous occurrences of `bytes` and `Sha1` are not correctly using `Sha1Git`.
    Also, some bytes conversion methods were replaced by their counterparts in
    the swh.model.hashutil module.

commit fa22dc902781e30e46823030681f003983cc6d6e
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Wed Jun 23 19:12:06 2021 +0200

    Add support for sha1 identifiers for origins

See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/204/ for more details.

Why do all these queries use LOCK TABLE?

swh/provenance/postgresql/provenancedb_base.py
152

nitpick

swh/provenance/postgresql/provenancedb_base.py
152

Agreed. I missed this one. Thanks!

Why do all these queries use LOCK TABLE?

This locking is currently needed to make the ingestion process not explode when you have many concurrent workers: without them, a lot of conflicts (and rollback) happen which makes performances even worse than with the locking.
It is something that is currently being worked, as part of an important refactoring in progress (with the help of olasd) to attempt to make the provenance db more concurrent friendly.

douardda added inline comments.
swh/provenance/postgresql/provenancedb_base.py
148

unless I'm wrong, the locking of the table in this very case is not needed, thanks to the on conflict do nothing part.

If true, some other table lockings can be removed safely.

@olasd am I right?

swh/provenance/postgresql/provenancedb_base.py
148

Not really sure, but this locks will be reworked soon so I don't think is worth worrying about them right now

152

I've realized this set is actually there to avoid trying to insert twice the same revision (the input is a set of tuples, not a dictionary in this case)

Build is green

Patch application report for D5943 (id=21393)

Could not rebase; Attempt merge onto d892b29e40...

Updating d892b29..d45d6ff
Fast-forward
 swh/provenance/archive.py                          |  24 ++---
 swh/provenance/graph.py                            |   4 +-
 swh/provenance/model.py                            |  42 +++-----
 swh/provenance/origin.py                           |  19 ++--
 swh/provenance/postgresql/archive.py               | 115 +++++++-------------
 swh/provenance/postgresql/provenancedb_base.py     | 120 +++++++++++++++------
 .../postgresql/provenancedb_with_path.py           |  16 +--
 .../postgresql/provenancedb_without_path.py        |  20 ++--
 swh/provenance/provenance.py                       | 103 +++++++++++-------
 swh/provenance/revision.py                         |   5 +-
 swh/provenance/sql/30-schema.sql                   |   4 +-
 swh/provenance/storage/archive.py                  |  30 +++---
 swh/provenance/tests/conftest.py                   |  22 ++--
 .../tests/data/generate_storage_from_git.py        |   3 +-
 swh/provenance/tests/data/with-merges.msgpack      | Bin 0 -> 7501 bytes
 ...repo_with_merges.yaml => with-merges_repo.yaml} |   0
 ...s-visits-01.yaml => with-merges_visits-01.yaml} |   0
 swh/provenance/tests/test_archive_interface.py     |  51 +++++++++
 swh/provenance/tests/test_origin_iterator.py       |   8 +-
 swh/provenance/tests/test_provenance_db.py         |   4 +-
 swh/provenance/tests/test_provenance_heuristics.py |   7 +-
 21 files changed, 342 insertions(+), 255 deletions(-)
 create mode 100644 swh/provenance/tests/data/with-merges.msgpack
 rename swh/provenance/tests/data/{repo_with_merges.yaml => with-merges_repo.yaml} (100%)
 rename swh/provenance/tests/data/{repo_with_merges-visits-01.yaml => with-merges_visits-01.yaml} (100%)
 create mode 100644 swh/provenance/tests/test_archive_interface.py
Changes applied before test
commit d45d6ff9e9317ecfe38d584df7297c548b654d28
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Thu Jun 24 16:10:38 2021 +0200

    Fix database queries related to the origin-revision layer
    
    This required allowing null dates in the `revision` table so that revision can be added
    by the origin-revision layer algorithm but not recognized as already processed by the
    revision-content layer. Revision and origin entries are now inserted in the database
    prior to inserting rows to revision_in_origin and revision_before_revision relations,
    so that internal ids are properly resolved.

commit 0e2a3c64ce3c368b53c101c541e8aebcde789477
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Fri Jun 25 13:38:26 2021 +0200

    Add test to compare both `ArchiveInterface` implementations
    
    Improve documentation of the interface and complete pending TODO's.

commit 98bba93cccece2b47ec4cd5887997cb5bede1e87
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Thu Jun 24 16:25:15 2021 +0200

    Rename test files to keep naming convension
    
    Also added missing .msgpack file dump for new with-merges repository.

commit fa9198afb71bcf3b8abea07d88d763a430f7358e
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Thu Jun 24 16:05:24 2021 +0200

    Refactor `ArchiveInterface` to fit origin-revision layer needs
    
    Replace `revision_get` method by `revision_get_parents` returning an iterable of
    parents' ids only, instead of a swh.model.model.Revision object.

commit 9e0c1aa099073887206c9334e17b49ee31bbef9a
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Wed Jun 23 20:00:40 2021 +0200

    Use `Sha1Git` type to explicitly state the kind of identifiers
    
    Previous occurrences of `bytes` and `Sha1` are now correctly using `Sha1Git`.
    Also, some bytes conversion methods were replaced by their counterparts in
    the swh.model.hashutil module.

commit a27ffff67b6b14bf37d153bb9b1d1c2ae63773fc
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Wed Jun 23 19:12:06 2021 +0200

    Add support for sha1 identifiers for origins

See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/215/ for more details.

swh/provenance/postgresql/provenancedb_base.py
152

beware: the form

{x for x in stuff}

is a set (built by comprehension). The dict would look like

{x: "foo" for x in stuff}

swh/provenance/postgresql/provenancedb_base.py
152

I don't follow. There is no dict here

swh/provenance/postgresql/provenancedb_base.py
152

set((rev,) for rev, _ in data) and {(rev,) for rev, _ in data} do exactly the same thing: building a set; which does always deduplicate.

douardda requested changes to this revision.Jul 2 2021, 3:40 PM
douardda added inline comments.
swh/provenance/provenance.py
259

I don't see the point of looping on the whole cache here rather than the input list of sha1s.

why not keep the original implemantation adapted to check for None dates?

like

{sha1: cache["data"][sha1] for sha1 in ids if cache["data"].get(sha1) is not None }
This revision now requires changes to proceed.Jul 2 2021, 3:40 PM
aeviso added inline comments.
swh/provenance/postgresql/provenancedb_base.py
152

Sure, but this code is replaced in the next few commits, so no point rebasing for such a minor detail.

swh/provenance/provenance.py
259

Because cache now hold Optional[datetime] as values and this method return datetime. So this is mainly to please mypy

Build is green

Patch application report for D5943 (id=21443)

Could not rebase; Attempt merge onto d892b29e40...

Updating d892b29..b7fdcde
Fast-forward
 swh/provenance/archive.py                          |  24 ++---
 swh/provenance/graph.py                            |   4 +-
 swh/provenance/model.py                            |  42 +++-----
 swh/provenance/origin.py                           |  19 ++--
 swh/provenance/postgresql/archive.py               | 115 +++++++-------------
 swh/provenance/postgresql/provenancedb_base.py     | 120 +++++++++++++++------
 .../postgresql/provenancedb_with_path.py           |  16 +--
 .../postgresql/provenancedb_without_path.py        |  20 ++--
 swh/provenance/provenance.py                       | 112 +++++++++++--------
 swh/provenance/revision.py                         |   5 +-
 swh/provenance/sql/30-schema.sql                   |   4 +-
 swh/provenance/storage/archive.py                  |  30 +++---
 swh/provenance/tests/conftest.py                   |  22 ++--
 .../tests/data/generate_storage_from_git.py        |   3 +-
 swh/provenance/tests/data/with-merges.msgpack      | Bin 0 -> 7501 bytes
 ...repo_with_merges.yaml => with-merges_repo.yaml} |   0
 ...s-visits-01.yaml => with-merges_visits-01.yaml} |   0
 swh/provenance/tests/test_archive_interface.py     |  51 +++++++++
 swh/provenance/tests/test_origin_iterator.py       |   8 +-
 swh/provenance/tests/test_provenance_db.py         |   4 +-
 swh/provenance/tests/test_provenance_heuristics.py |   7 +-
 21 files changed, 347 insertions(+), 259 deletions(-)
 create mode 100644 swh/provenance/tests/data/with-merges.msgpack
 rename swh/provenance/tests/data/{repo_with_merges.yaml => with-merges_repo.yaml} (100%)
 rename swh/provenance/tests/data/{repo_with_merges-visits-01.yaml => with-merges_visits-01.yaml} (100%)
 create mode 100644 swh/provenance/tests/test_archive_interface.py
Changes applied before test
commit b7fdcdec7ea96101d62a57d9aeed114c897df961
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Thu Jun 24 16:10:38 2021 +0200

    Fix database queries related to the origin-revision layer
    
    This required allowing null dates in the `revision` table so that revision can be added
    by the origin-revision layer algorithm but not recognized as already processed by the
    revision-content layer. Revision and origin entries are now inserted in the database
    prior to inserting rows to revision_in_origin and revision_before_revision relations,
    so that internal ids are properly resolved.

commit 0e2a3c64ce3c368b53c101c541e8aebcde789477
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Fri Jun 25 13:38:26 2021 +0200

    Add test to compare both `ArchiveInterface` implementations
    
    Improve documentation of the interface and complete pending TODO's.

commit 98bba93cccece2b47ec4cd5887997cb5bede1e87
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Thu Jun 24 16:25:15 2021 +0200

    Rename test files to keep naming convension
    
    Also added missing .msgpack file dump for new with-merges repository.

commit fa9198afb71bcf3b8abea07d88d763a430f7358e
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Thu Jun 24 16:05:24 2021 +0200

    Refactor `ArchiveInterface` to fit origin-revision layer needs
    
    Replace `revision_get` method by `revision_get_parents` returning an iterable of
    parents' ids only, instead of a swh.model.model.Revision object.

commit 9e0c1aa099073887206c9334e17b49ee31bbef9a
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Wed Jun 23 20:00:40 2021 +0200

    Use `Sha1Git` type to explicitly state the kind of identifiers
    
    Previous occurrences of `bytes` and `Sha1` are now correctly using `Sha1Git`.
    Also, some bytes conversion methods were replaced by their counterparts in
    the swh.model.hashutil module.

commit a27ffff67b6b14bf37d153bb9b1d1c2ae63773fc
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Wed Jun 23 19:12:06 2021 +0200

    Add support for sha1 identifiers for origins

See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/241/ for more details.

swh/provenance/provenance.py
259

But does my proposal code makes mypy unhappy?

swh/provenance/provenance.py
259

my suggested code

I still disagree with the implementation of get_dates() but meh

This revision is now accepted and ready to land.Jul 2 2021, 4:38 PM
aeviso added inline comments.
swh/provenance/provenance.py
259

Yes, you are making the check against a call to .get(sha1) but adding to the dict the of indexing ([sha1]) operator... mypy won't acknowledge that this results are equal