Page MenuHomeSoftware Heritage

Refactor ArchiveInterface to fit origin-revision layer needs
ClosedPublic

Authored by aeviso on Jun 25 2021, 1:46 PM.

Details

Summary

Replace revision_get method by revision_get_parents returning an iterable of
parents' ids only, instead of a swh.model.model.Revision object.

Add test to compare both ArchiveInterface implementations

Improve documentation of the interface and complete pending TODO's.

Depends on D5924

Diff Detail

Repository
rDPROV Provenance database
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

Build was aborted

Patch application report for D5925 (id=21266)

Could not rebase; Attempt merge onto d892b29e40...

Updating d892b29..c4a11c6
Fast-forward
 swh/provenance/archive.py                          |  24 ++---
 swh/provenance/graph.py                            |   4 +-
 swh/provenance/model.py                            |  42 +++-----
 swh/provenance/origin.py                           |  19 ++--
 swh/provenance/postgresql/archive.py               | 115 ++++++++-------------
 swh/provenance/postgresql/provenancedb_base.py     |  73 ++++++++-----
 .../postgresql/provenancedb_with_path.py           |  16 +--
 .../postgresql/provenancedb_without_path.py        |  20 ++--
 swh/provenance/provenance.py                       |  95 ++++++++++-------
 swh/provenance/revision.py                         |   5 +-
 swh/provenance/sql/30-schema.sql                   |   2 +
 swh/provenance/storage/archive.py                  |  30 +++---
 swh/provenance/tests/conftest.py                   |  22 ++--
 .../tests/data/generate_storage_from_git.py        |   3 +-
 swh/provenance/tests/test_archive_interface.py     |  40 +++++++
 swh/provenance/tests/test_origin_iterator.py       |   8 +-
 swh/provenance/tests/test_provenance_db.py         |   4 +-
 swh/provenance/tests/test_provenance_heuristics.py |   7 +-
 18 files changed, 289 insertions(+), 240 deletions(-)
 create mode 100644 swh/provenance/tests/test_archive_interface.py
Changes applied before test
commit c4a11c68b06b5f803473f663131b73f067e1d46a
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Fri Jun 25 13:38:26 2021 +0200

    Add test to compare both ArchiveInterface implementations
    
    Improve documentation of the interface and complete pending TODO's.

commit 85e0be2f0755a8b956b0c8a0f91cc9a9af8c57ab
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Thu Jun 24 16:05:24 2021 +0200

    Refactor ArchiveInterface to fit origin-revision layer needs
    
    Replace `revision_get` method by `revision_get_parents` returning an iterable of
    parents' ids only, instead of a swh.model.model.Revision object.

commit c27beb2c29cca48435b02a3764c39efcfe04d6d4
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Wed Jun 23 20:00:40 2021 +0200

    Use `Sha1Git` type to explicitly state the kind of identifiers
    
    Previous occurrences of `bytes` and `Sha1` are not correctly using `Sha1Git`.
    Also, some bytes conversion methods were replaced by their counterparts in
    the swh.model.hashutil module.

commit 0271c1e41e9b01879e5fddf1b8103fe67451654d
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Wed Jun 23 19:12:06 2021 +0200

    Add support for sha1 identifiers for origins

Link to build: https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/190/
See console output for more information: https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/190/console

Harbormaster returned this revision to the author for changes because remote builds failed.Jun 25 2021, 2:09 PM
Harbormaster failed remote builds in B22247: Diff 21266!

Build is green

Patch application report for D5925 (id=21286)

Could not rebase; Attempt merge onto d892b29e40...

Updating d892b29..4eb166c
Fast-forward
 swh/provenance/archive.py                          |  24 ++---
 swh/provenance/graph.py                            |   4 +-
 swh/provenance/model.py                            |  42 +++-----
 swh/provenance/origin.py                           |  19 ++--
 swh/provenance/postgresql/archive.py               | 118 ++++++++-------------
 swh/provenance/postgresql/provenancedb_base.py     |  73 +++++++++----
 .../postgresql/provenancedb_with_path.py           |  16 +--
 .../postgresql/provenancedb_without_path.py        |  20 ++--
 swh/provenance/provenance.py                       |  95 ++++++++++-------
 swh/provenance/revision.py                         |   5 +-
 swh/provenance/sql/30-schema.sql                   |   2 +
 swh/provenance/storage/archive.py                  |  30 +++---
 swh/provenance/tests/conftest.py                   |  22 ++--
 .../tests/data/generate_storage_from_git.py        |   3 +-
 swh/provenance/tests/data/with-merges.msgpack      | Bin 0 -> 7501 bytes
 ...repo_with_merges.yaml => with-merges_repo.yaml} |   0
 ...s-visits-01.yaml => with-merges_visits-01.yaml} |   0
 swh/provenance/tests/test_archive_interface.py     |  51 +++++++++
 swh/provenance/tests/test_origin_iterator.py       |   8 +-
 swh/provenance/tests/test_provenance_db.py         |   4 +-
 swh/provenance/tests/test_provenance_heuristics.py |   7 +-
 21 files changed, 303 insertions(+), 240 deletions(-)
 create mode 100644 swh/provenance/tests/data/with-merges.msgpack
 rename swh/provenance/tests/data/{repo_with_merges.yaml => with-merges_repo.yaml} (100%)
 rename swh/provenance/tests/data/{repo_with_merges-visits-01.yaml => with-merges_visits-01.yaml} (100%)
 create mode 100644 swh/provenance/tests/test_archive_interface.py
Changes applied before test
commit 4eb166cc4f2aa036c932b9a5eb462454a70ee0d9
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Fri Jun 25 13:38:26 2021 +0200

    Add test to compare both ArchiveInterface implementations
    
    Improve documentation of the interface and complete pending TODO's.

commit 01ac9eea375258ac1e000389d3fd286d0dbae458
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Thu Jun 24 16:25:15 2021 +0200

    Rename test files to keep naming convension
    
    Also added missing .msgpack file dump for new with-merges repository.

commit 76d1560924251396c1ac63c286d8612ce0f7e9d9
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Thu Jun 24 16:05:24 2021 +0200

    Refactor ArchiveInterface to fit origin-revision layer needs
    
    Replace `revision_get` method by `revision_get_parents` returning an iterable of
    parents' ids only, instead of a swh.model.model.Revision object.

commit df69a9e57692ed9d4d870c295a21b3ac187d7b9c
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Wed Jun 23 20:00:40 2021 +0200

    Use `Sha1Git` type to explicitly state the kind of identifiers
    
    Previous occurrences of `bytes` and `Sha1` are not correctly using `Sha1Git`.
    Also, some bytes conversion methods were replaced by their counterparts in
    the swh.model.hashutil module.

commit fa22dc902781e30e46823030681f003983cc6d6e
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Wed Jun 23 19:12:06 2021 +0200

    Add support for sha1 identifiers for origins

See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/192/ for more details.

Build is green

Patch application report for D5925 (id=21366)

Could not rebase; Attempt merge onto d892b29e40...

Updating d892b29..6736f60
Fast-forward
 swh/provenance/archive.py                          |  24 ++---
 swh/provenance/graph.py                            |   4 +-
 swh/provenance/model.py                            |  42 +++-----
 swh/provenance/origin.py                           |  19 ++--
 swh/provenance/postgresql/archive.py               | 118 ++++++++-------------
 swh/provenance/postgresql/provenancedb_base.py     |  73 +++++++++----
 .../postgresql/provenancedb_with_path.py           |  16 +--
 .../postgresql/provenancedb_without_path.py        |  20 ++--
 swh/provenance/provenance.py                       |  95 ++++++++++-------
 swh/provenance/revision.py                         |   5 +-
 swh/provenance/sql/30-schema.sql                   |   2 +
 swh/provenance/storage/archive.py                  |  30 +++---
 swh/provenance/tests/conftest.py                   |  22 ++--
 .../tests/data/generate_storage_from_git.py        |   3 +-
 swh/provenance/tests/data/with-merges.msgpack      | Bin 0 -> 7501 bytes
 ...repo_with_merges.yaml => with-merges_repo.yaml} |   0
 ...s-visits-01.yaml => with-merges_visits-01.yaml} |   0
 swh/provenance/tests/test_archive_interface.py     |  51 +++++++++
 swh/provenance/tests/test_origin_iterator.py       |   8 +-
 swh/provenance/tests/test_provenance_db.py         |   4 +-
 swh/provenance/tests/test_provenance_heuristics.py |   7 +-
 21 files changed, 303 insertions(+), 240 deletions(-)
 create mode 100644 swh/provenance/tests/data/with-merges.msgpack
 rename swh/provenance/tests/data/{repo_with_merges.yaml => with-merges_repo.yaml} (100%)
 rename swh/provenance/tests/data/{repo_with_merges-visits-01.yaml => with-merges_visits-01.yaml} (100%)
 create mode 100644 swh/provenance/tests/test_archive_interface.py
Changes applied before test
commit 6736f6068280f167df5616681dee9ad67b2b7dbd
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Fri Jun 25 13:38:26 2021 +0200

    Add test to compare both `ArchiveInterface` implementations
    
    Improve documentation of the interface and complete pending TODO's.

commit dde867254e51dd87f4aba3cdea59da8bffc2d160
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Thu Jun 24 16:25:15 2021 +0200

    Rename test files to keep naming convension
    
    Also added missing .msgpack file dump for new with-merges repository.

commit 14001c1844598a3d4ebd1b5f609070f9c85dcaa9
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Thu Jun 24 16:05:24 2021 +0200

    Refactor `ArchiveInterface` to fit origin-revision layer needs
    
    Replace `revision_get` method by `revision_get_parents` returning an iterable of
    parents' ids only, instead of a swh.model.model.Revision object.

commit df69a9e57692ed9d4d870c295a21b3ac187d7b9c
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Wed Jun 23 20:00:40 2021 +0200

    Use `Sha1Git` type to explicitly state the kind of identifiers
    
    Previous occurrences of `bytes` and `Sha1` are not correctly using `Sha1Git`.
    Also, some bytes conversion methods were replaced by their counterparts in
    the swh.model.hashutil module.

commit fa22dc902781e30e46823030681f003983cc6d6e
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Wed Jun 23 19:12:06 2021 +0200

    Add support for sha1 identifiers for origins

See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/203/ for more details.

vlorentz added inline comments.
swh/provenance/storage/archive.py
11–19

why not yield entries directly?

swh/provenance/storage/archive.py
11–19

Well, the documentation (in the definition of the ArchiveInterface) states that each Dict should have only these three keys, and we use that to test that both implementation (ArchiveStorage and ArchivePostgreSQL) behave in the same way.

douardda added a subscriber: douardda.
douardda added inline comments.
swh/provenance/postgresql/archive.py
56

why is this removed?

68–72

since you do only return elements from this array for one revision only, I don't think you need to recreate the ARRAY structure.

If using suggested SQL code, then just yield from returned rows.

69–72

since you do only return elements from this array for one revision only, I don't think you need to recreate the ARRAY structure.

113–126

An OR should do the trick here

114

print statement

This revision now requires changes to proceed.Jul 1 2021, 12:07 PM
aeviso added inline comments.
swh/provenance/postgresql/archive.py
56

there's no need to guaranty any order when iterating the contents of a directory

68–72

I agree the query can be further simplified, but wouldn't row be a tuple with a single Sha1Git element? I guess I still need to yield row[0]

113–126

OK, I'll try it

aeviso added inline comments.
swh/provenance/postgresql/archive.py
113–126

Sorry, I don't see how these queries are equivalent. There is a missing JOIN in the proposed one

aeviso marked an inline comment as not done.Jul 1 2021, 2:46 PM

Build is green

Patch application report for D5925 (id=21392)

Could not rebase; Attempt merge onto d892b29e40...

Updating d892b29..0e2a3c6
Fast-forward
 swh/provenance/archive.py                          |  24 ++---
 swh/provenance/graph.py                            |   4 +-
 swh/provenance/model.py                            |  42 +++-----
 swh/provenance/origin.py                           |  19 ++--
 swh/provenance/postgresql/archive.py               | 115 +++++++--------------
 swh/provenance/postgresql/provenancedb_base.py     |  73 ++++++++-----
 .../postgresql/provenancedb_with_path.py           |  16 +--
 .../postgresql/provenancedb_without_path.py        |  20 ++--
 swh/provenance/provenance.py                       |  95 ++++++++++-------
 swh/provenance/revision.py                         |   5 +-
 swh/provenance/sql/30-schema.sql                   |   2 +
 swh/provenance/storage/archive.py                  |  30 +++---
 swh/provenance/tests/conftest.py                   |  22 ++--
 .../tests/data/generate_storage_from_git.py        |   3 +-
 swh/provenance/tests/data/with-merges.msgpack      | Bin 0 -> 7501 bytes
 ...repo_with_merges.yaml => with-merges_repo.yaml} |   0
 ...s-visits-01.yaml => with-merges_visits-01.yaml} |   0
 swh/provenance/tests/test_archive_interface.py     |  51 +++++++++
 swh/provenance/tests/test_origin_iterator.py       |   8 +-
 swh/provenance/tests/test_provenance_db.py         |   4 +-
 swh/provenance/tests/test_provenance_heuristics.py |   7 +-
 21 files changed, 298 insertions(+), 242 deletions(-)
 create mode 100644 swh/provenance/tests/data/with-merges.msgpack
 rename swh/provenance/tests/data/{repo_with_merges.yaml => with-merges_repo.yaml} (100%)
 rename swh/provenance/tests/data/{repo_with_merges-visits-01.yaml => with-merges_visits-01.yaml} (100%)
 create mode 100644 swh/provenance/tests/test_archive_interface.py
Changes applied before test
commit 0e2a3c64ce3c368b53c101c541e8aebcde789477
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Fri Jun 25 13:38:26 2021 +0200

    Add test to compare both `ArchiveInterface` implementations
    
    Improve documentation of the interface and complete pending TODO's.

commit 98bba93cccece2b47ec4cd5887997cb5bede1e87
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Thu Jun 24 16:25:15 2021 +0200

    Rename test files to keep naming convension
    
    Also added missing .msgpack file dump for new with-merges repository.

commit fa9198afb71bcf3b8abea07d88d763a430f7358e
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Thu Jun 24 16:05:24 2021 +0200

    Refactor `ArchiveInterface` to fit origin-revision layer needs
    
    Replace `revision_get` method by `revision_get_parents` returning an iterable of
    parents' ids only, instead of a swh.model.model.Revision object.

commit 9e0c1aa099073887206c9334e17b49ee31bbef9a
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Wed Jun 23 20:00:40 2021 +0200

    Use `Sha1Git` type to explicitly state the kind of identifiers
    
    Previous occurrences of `bytes` and `Sha1` are now correctly using `Sha1Git`.
    Also, some bytes conversion methods were replaced by their counterparts in
    the swh.model.hashutil module.

commit a27ffff67b6b14bf37d153bb9b1d1c2ae63773fc
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Wed Jun 23 19:12:06 2021 +0200

    Add support for sha1 identifiers for origins

See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/214/ for more details.

swh/provenance/postgresql/archive.py
68–72

yes but the idea is to prevent sending unused data on the wire (pg connection).

113–126

you're right, I missed that. You can probably just pack the missing join in there, I don't think it would hurt the query performances (to be checked), so something like:

WITH S AS 
(SELECT object_id FROM snapshot WHERE snapshot.id=%s )

SELECT B.target AS head
  FROM S
  JOIN snapshot_branches AS BS ON (S.object_id=BS.snapshot_id)
  JOIN snapshot_branch AS B ON (BS.branch_id=B.object_id)
  LEFT JOIN release AS R ON (B.target=R.id)
WHERE B.target_type='release'::snapshot_target AND R.target_type='revision'::object_type;
aeviso added inline comments.
swh/provenance/postgresql/archive.py
113–126

I rather keep the current query for now

This revision is now accepted and ready to land.Jul 2 2021, 3:33 PM