Page MenuHomeSoftware Heritage

Update methods associated to the origin-revision layer
ClosedPublic

Authored by aeviso on Jun 16 2021, 1:56 PM.

Details

Summary

Update methods associated to the origin-revision layer

Also fix outdated comments and code styling.

Depends on D5863

Diff Detail

Repository
rDPROV Provenance database
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

Build is green

Patch application report for D5880 (id=21066)

Could not rebase; Attempt merge onto c9d1369ba1...

Updating c9d1369..3161678
Fast-forward
 swh/provenance/archive.py                          |  21 ++---
 swh/provenance/model.py                            |  73 ++++++++++-----
 swh/provenance/origin.py                           |   3 +-
 swh/provenance/postgresql/archive.py               |  35 ++-----
 swh/provenance/postgresql/provenancedb_base.py     | 102 +++++++++++++--------
 .../postgresql/provenancedb_with_path.py           |  26 +++---
 .../postgresql/provenancedb_without_path.py        |  22 +++--
 swh/provenance/provenance.py                       |  42 +++++----
 swh/provenance/sql/30-schema.sql                   |  10 +-
 swh/provenance/storage/archive.py                  |  42 +++------
 10 files changed, 205 insertions(+), 171 deletions(-)
Changes applied before test
commit 316167843f015e10ec876a47bde84e03bbbdfbbd
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Wed Jun 16 13:53:25 2021 +0200

    Update methods associated to the origin-revision layer

commit 6accfd7e4e38fae66317859e59f2876f586e16a7
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Wed Jun 16 11:11:23 2021 +0200

    Fix outdated comments and code styling.

commit 6c148a713f2d29d97095fdf75fa379c02847c0c0
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Mon Jun 14 14:03:09 2021 +0200

    Refactor RevisionEntry's parents iterator
    
    Make parents a class property and create a separate method to retrieve information
    from the archive, just as it is done for the other model classes

commit 8aa8e5ed543e1ca20f867f0f7886e7e30502d5e8
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Mon Jun 14 13:59:30 2021 +0200

    Rework ArchiveInterface
    
    Unused methods were removed and type annotations fixed.

See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/143/ for more details.

Build is green

Patch application report for D5880 (id=21087)

Could not rebase; Attempt merge onto c9d1369ba1...

Updating c9d1369..af163b0
Fast-forward
 swh/provenance/archive.py                          |  42 +++++---
 swh/provenance/model.py                            | 111 ++++++++++-----------
 swh/provenance/origin.py                           |   3 +-
 swh/provenance/postgresql/archive.py               |  67 +++++++------
 swh/provenance/postgresql/provenancedb_base.py     | 102 ++++++++++++-------
 .../postgresql/provenancedb_with_path.py           |  26 ++---
 .../postgresql/provenancedb_without_path.py        |  22 ++--
 swh/provenance/provenance.py                       |  42 +++++---
 swh/provenance/sql/30-schema.sql                   |  10 +-
 swh/provenance/storage/archive.py                  |  72 +++++++------
 10 files changed, 286 insertions(+), 211 deletions(-)
Changes applied before test
commit af163b0c683efb263e6225e8951cff356da593a2
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Wed Jun 16 13:53:25 2021 +0200

    Update backend methods associated to the origin-revision layer

commit 9d5c0ceba2908ca8cf8243517b3cc8569117af50
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Wed Jun 16 11:11:23 2021 +0200

    Fix outdated comments and code styling.

commit 9d081d6acee99f93ddc40a55f87839c32580ba6c
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Mon Jun 14 14:03:09 2021 +0200

    Refactor RevisionEntry's parents iterator
    
    Make parents a class property and create a separate method to retrieve information
    from the archive, just as it is done for the other model classes

commit f309f67a524cc29d12ae4cfb65c482f5aa6aaac9
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Thu Jun 17 13:51:54 2021 +0200

    Rework ArchiveInterface
    
    Unused methods were removed and type annotations fixed.
    Other methos in OriginEntry and RevisionEntry were updated accordingly

commit 12edf4a3b7992575f5f24cd8a648c803dbda142a
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Thu Jun 17 13:48:24 2021 +0200

    Fix bugs when retrieven parents in RevisionEntry

See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/147/ for more details.

Build is green

Patch application report for D5880 (id=21091)

Could not rebase; Attempt merge onto c9d1369ba1...

Updating c9d1369..a8c2c44
Fast-forward
 swh/provenance/archive.py                          |  42 +++++---
 swh/provenance/model.py                            | 111 ++++++++++-----------
 swh/provenance/origin.py                           |   3 +-
 swh/provenance/postgresql/archive.py               |  67 +++++++------
 swh/provenance/postgresql/provenancedb_base.py     | 102 ++++++++++++-------
 .../postgresql/provenancedb_with_path.py           |  26 ++---
 .../postgresql/provenancedb_without_path.py        |  22 ++--
 swh/provenance/provenance.py                       |  42 +++++---
 swh/provenance/sql/30-schema.sql                   |  10 +-
 swh/provenance/storage/archive.py                  |  72 +++++++------
 10 files changed, 286 insertions(+), 211 deletions(-)
Changes applied before test
commit a8c2c44810277dba0532c5cd3dabc0f14d82bc45
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Wed Jun 16 13:53:25 2021 +0200

    Update backend methods associated to the origin-revision layer

commit cf9136d10a143c962b5b37baad9321d7ff4959d1
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Wed Jun 16 11:11:23 2021 +0200

    Fix outdated comments and code styling.

commit efb4c361330a1402ec25ecb0b62b8ab41382c740
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Mon Jun 14 14:03:09 2021 +0200

    Refactor RevisionEntry's parents iterator
    
    Make parents a class property and create a separate method to retrieve information
    from the archive, just as it is done for the other model classes

commit a12961f7536eb2e25c81b4f106c028dc6c655b29
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Thu Jun 17 13:51:54 2021 +0200

    Rework ArchiveInterface
    
    Unused methods were removed and type annotations fixed.
    Other methos in OriginEntry and RevisionEntry were updated accordingly

commit 12edf4a3b7992575f5f24cd8a648c803dbda142a
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Thu Jun 17 13:48:24 2021 +0200

    Fix bugs when retrieven parents in RevisionEntry

See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/151/ for more details.

Build is green

Patch application report for D5880 (id=21092)

Could not rebase; Attempt merge onto c9d1369ba1...

Updating c9d1369..4c20456
Fast-forward
 swh/provenance/archive.py                          |  42 +++++---
 swh/provenance/model.py                            | 111 ++++++++++-----------
 swh/provenance/origin.py                           |   3 +-
 swh/provenance/postgresql/archive.py               |  67 +++++++------
 swh/provenance/postgresql/provenancedb_base.py     | 102 ++++++++++++-------
 .../postgresql/provenancedb_with_path.py           |  26 ++---
 .../postgresql/provenancedb_without_path.py        |  22 ++--
 swh/provenance/provenance.py                       |  48 +++++----
 swh/provenance/sql/30-schema.sql                   |  10 +-
 swh/provenance/storage/archive.py                  |  72 +++++++------
 10 files changed, 289 insertions(+), 214 deletions(-)
Changes applied before test
commit 4c204561c0d3264e81b3fb1b75cf80e44ccbd9be
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Wed Jun 16 13:53:25 2021 +0200

    Update backend methods associated to the origin-revision layer

commit cf9136d10a143c962b5b37baad9321d7ff4959d1
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Wed Jun 16 11:11:23 2021 +0200

    Fix outdated comments and code styling.

commit efb4c361330a1402ec25ecb0b62b8ab41382c740
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Mon Jun 14 14:03:09 2021 +0200

    Refactor RevisionEntry's parents iterator
    
    Make parents a class property and create a separate method to retrieve information
    from the archive, just as it is done for the other model classes

commit a12961f7536eb2e25c81b4f106c028dc6c655b29
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Thu Jun 17 13:51:54 2021 +0200

    Rework ArchiveInterface
    
    Unused methods were removed and type annotations fixed.
    Other methos in OriginEntry and RevisionEntry were updated accordingly

commit 12edf4a3b7992575f5f24cd8a648c803dbda142a
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Thu Jun 17 13:48:24 2021 +0200

    Fix bugs when retrieven parents in RevisionEntry

See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/152/ for more details.

douardda added a subscriber: douardda.

Mostly nitpicking comments, but I'd really prefer that:

  • the cache is kept properly typed
  • the cache clearing thing gets its own git revision
swh/provenance/postgresql/provenancedb_base.py
42

why clearing the cache? (It somewhat defeats the purpose of a cache).

43

I did not do this partial clear because of the global clear_cache() being called from ProvenanceBackend.commit.
I guess if we do handle (write) cache clearings at this level, then there is no need for this global clear_cache in ProvenanceBackend.

But in any case, this is unrelated with any of the 2 git revisions in this diff (it's not a origin-revision layer thing, neither it's a code styling issue).

76

nitpick: I'd rather put these insertions snippets in dedicated methods instead of having a "giant" commit() method (aka one for each type of table/relation, as it's now the case for the content-to-revision layer).

swh/provenance/postgresql/provenancedb_with_path.py
92–104

FTR, I generally use an intermediate sql variable to make it easier to debug: typically in a pytest --pdb session, if the query fails, it's easier to have this variable to see exactly what was the query sent to postgresql (it's then just a patter of (pdb) p sql).

swh/provenance/provenance.py
109

I'd prefer to keep this properly typed, so please add a new cache class dedicated to the revision_preferred_origin.

This revision now requires changes to proceed.Jun 18 2021, 11:06 AM
swh/provenance/postgresql/provenancedb_base.py
42

This was originally done to avoid trying to insert again was has already succeeded in the case of a commit fail. It doesn't defeat the purpose of the cache since we are precisely in the commit method. This cache was designed to be cleared after every commit, so that the worker gets the same value for a given entry if it is queried more than once during the same batch of revisions (even if another worker has modified that entry in parallel). But after that, we want the worker to see the updated value for the next batch of revisions.

43

If everything goes well during a commit, the final call to clear_cache is indeed not required, it was done just for sanity (ie. to be sure that all cache were cleared after a successful commit).

Also, I agree it was not supposed to be part of this commit. All this cache logic will be actually reworked as part of the backend refactoring, since we need to find a way to guarantee that a revision can be safely reprocessed in case of a critical db error. Anyway, I also think this shouldn't have been changed in the first place as, IMHO, it is better to keep the same semantics as before for now.

76

Yeah, I just got the feeling that the queries for the origin-revision layer are still too ad-hoc to be properly abstracted. All this class definitely requires some reworking. Will do anyway

swh/provenance/postgresql/provenancedb_with_path.py
92–104

Sounds like a good idea, but then we should be consistent and do that with every query.

Build is green

Patch application report for D5880 (id=21133)

Could not rebase; Attempt merge onto 8ff1ab5860...

Updating 8ff1ab5..e89b9a7
Fast-forward
 requirements-swh.txt                               |   2 +-
 swh/provenance/archive.py                          |  42 +++++--
 swh/provenance/model.py                            |  90 +++++++-------
 swh/provenance/origin.py                           |   3 +-
 swh/provenance/postgresql/archive.py               |  67 ++++++-----
 swh/provenance/postgresql/provenancedb_base.py     | 133 +++++++++++++--------
 .../postgresql/provenancedb_with_path.py           |  26 ++--
 .../postgresql/provenancedb_without_path.py        |  22 ++--
 swh/provenance/provenance.py                       |  48 +++++---
 swh/provenance/sql/30-schema.sql                   |  10 +-
 swh/provenance/storage/archive.py                  |  72 ++++++-----
 11 files changed, 295 insertions(+), 220 deletions(-)
Changes applied before test
commit e89b9a7ccfccd526c09f66c292664a7b5c7e4aa9
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Wed Jun 16 13:53:25 2021 +0200

    Update backend methods associated to the origin-revision layer

commit ce6a2a4dedc80f6a19918b6b927b4d35462fc484
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Wed Jun 16 11:11:23 2021 +0200

    Fix outdated comments and code styling.

commit a7e977c8af91a7b7a40655a30fe19211b6d794da
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Mon Jun 14 14:03:09 2021 +0200

    Refactor RevisionEntry's parents iterator
    
    Make parents a class property and create a separate method to retrieve information
    from the archive, just as it is done for the other model classes

commit 969e3359123b781ae1101e2fb62b9934b0988478
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Thu Jun 17 13:51:54 2021 +0200

    Rework ArchiveInterface
    
    Remove Unused methods and fix type annotations. Update Other methods in
    OriginEntry and RevisionEntry accordingly.

commit d28d3eed7049e36c5841405021fd0a0e78c11cb7
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Thu Jun 17 13:48:24 2021 +0200

    Fix bugs when retrieving parents in RevisionEntry
    
    Convert `Revision.date` from` TimestampWithTimezone` to `datetime` as expected by` RevisionEntry`.
    
    Create a list with the iterator returned by `ArchiveInterface.revision_get ()` before comparison.

See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/164/ for more details.

swh/provenance/postgresql/provenancedb_base.py
42

I do discuss the reason for clearing the write cache (now, the cache["added"] only), but I don't see the purpose of clearing the data cache, ie to )potentially) enforce querying the same data from the db over and over.

I know that the big clear_cache called from provenance.py will also clear all these caches, so it actually does not make a difference, but what I don't understand is why we don't keep these caches as long as possible (obvisoulsy, at some point we may need a proper lru handling).

43

If everything goes well during a commit, the final call to clear_cache is indeed not required, it was done just for sanity (ie. to be sure that all cache were cleared after a successful commit).

ok, but I'd much better like a code that does not need a "sanity cache clearing", because if it's needed, then there is something wrong we don't detect at runtime.
If any, we should replace this "sanity final clear cache" by a "sanity final cache check" that ensures that no write cache contains an entry, and raise an exception if not true.

Also, I agree it was not supposed to be part of this commit. All this cache logic will be actually reworked as part of the backend refactoring, since we need to find a way to guarantee that a revision can be safely reprocessed in case of a critical db error. Anyway, I also think this shouldn't have been changed in the first place as, IMHO, it is better to keep the same semantics as before for now.

ok

swh/provenance/postgresql/provenancedb_with_path.py
92–104

I agree, no rush, but yes

swh/provenance/postgresql/provenancedb_base.py
42

We don't keep then as long as possible because this will be shadowing the worker from modifications made concurrently by other workers. Thus the decision of making cache only last for the current processing batch. It makes the processing of the batch consistent if some data is queried several times and it improves performance, but you don't want to keep outdated data forever

43

It's not really needed, but it was there before this other cleanings were introduced. What's the harm of keeping it? I don't get it

swh/provenance/postgresql/provenancedb_base.py
43

Also, this clearing of the structure is to somehow keep track of what was already processed, the fact that this is clearing the outer cache is simply because we forward to this methods the actual cache of the upper class. But we may want to forward a deep copy of it instead, to make things clearer. Hence, this class just clears what was already processed and the upper class clears it cache independently

Build is green

Patch application report for D5880 (id=21141)

Could not rebase; Attempt merge onto 8ff1ab5860...

Updating 8ff1ab5..ef13a5a
Fast-forward
 requirements-swh.txt                               |   2 +-
 swh/provenance/archive.py                          |  42 +++++--
 swh/provenance/model.py                            |  90 +++++++-------
 swh/provenance/origin.py                           |   3 +-
 swh/provenance/postgresql/archive.py               |  70 ++++++-----
 swh/provenance/postgresql/provenancedb_base.py     | 133 +++++++++++++--------
 .../postgresql/provenancedb_with_path.py           |  26 ++--
 .../postgresql/provenancedb_without_path.py        |  22 ++--
 swh/provenance/provenance.py                       |  48 +++++---
 swh/provenance/sql/30-schema.sql                   |  10 +-
 swh/provenance/storage/archive.py                  |  72 ++++++-----
 11 files changed, 298 insertions(+), 220 deletions(-)
Changes applied before test
commit ef13a5a5072cd3f0a68367571613ba45ff8e3c84
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Wed Jun 16 13:53:25 2021 +0200

    Update backend methods associated to the origin-revision layer

commit 4953a1688a19b956a33164b4e5540edb3b8e2c78
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Wed Jun 16 11:11:23 2021 +0200

    Fix outdated comments and code styling.

commit e610883f5d89c58e409cc7ddf490a4a778b99dcd
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Mon Jun 14 14:03:09 2021 +0200

    Refactor RevisionEntry's parents iterator
    
    Make parents a class property and create a separate method to retrieve information
    from the archive, just as it is done for the other model classes

commit 01bbb0ce850a5cfafaa1837fa346a19b2a851fc9
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Thu Jun 17 13:51:54 2021 +0200

    Rework ArchiveInterface
    
    Remove Unused methods and fix type annotations. Update Other methods in
    OriginEntry and RevisionEntry accordingly.

commit cd48f9d71efeaf73cfcc24f397d22589a5f7987b
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Thu Jun 17 13:48:24 2021 +0200

    Fix bugs when retrieving parents in RevisionEntry
    
    Convert `Revision.date` from` TimestampWithTimezone` to `datetime` as expected by` RevisionEntry`.
    Create a list with the iterator returned by `ArchiveInterface.revision_get()` before comparison.

See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/169/ for more details.

swh/provenance/postgresql/provenancedb_base.py
42

oh yes you are right. Now I wonder how much, practically, these worker-local caches hurt the algorithm when running with massive concurrency... but that's another story for sure.

43

It's not really needed, but it was there before this other cleanings were introduced. What's the harm of keeping it? I don't get it

The harm will not be at processing/execution time, it's more of a human-reader concern: it's a bit confusing; what's the intention of the developer when doing this global clear cache in addition to step-by-step clearing? what does it say about how the code is supposed to work? etc.
If you think a sanity check is needed there, then make it a real sanity check (i.e. check that the cache is empty at the end of the commit/flush), not hide under the carpet any discrepancy ;-)

Also, this clearing of the structure is to somehow keep track of what was already processed,

After this discussion, I'm more i favor of keeping the cache clearings piece by piece as we wove along the insertion process and kill the global clear_cache (that I find confusing on the intentions of the code).

But we may want to forward a deep copy of it instead,

I don't think we should copy the structures at all actually; it makes a lot of copy but I believe is in fact not needed.

I see no harm in having these caches being filled in provenance.py and consumed by the low level (db) part of the code.
As long as it's clearly documented.

swh/provenance/postgresql/provenancedb_base.py
43

Well, I think the proper thing to do is forward only the write cache to the lower class, have it cleared as inserts succeed but let the upper class decide whether to clear the read cache or not.

What I don't like about the current state of the solution is that we are forwarding the complete cache to the db class, instead of having a clear separation between write and read caches. This need to be reworked as part of the backend refactoring.

Build is green

Patch application report for D5880 (id=21154)

Could not rebase; Attempt merge onto 8ff1ab5860...

Updating 8ff1ab5..31f9aad
Fast-forward
 requirements-swh.txt                               |   2 +-
 swh/provenance/archive.py                          |  42 +++++--
 swh/provenance/model.py                            |  90 +++++++-------
 swh/provenance/origin.py                           |   3 +-
 swh/provenance/postgresql/archive.py               |  70 ++++++-----
 swh/provenance/postgresql/provenancedb_base.py     | 133 +++++++++++++--------
 .../postgresql/provenancedb_with_path.py           |  26 ++--
 .../postgresql/provenancedb_without_path.py        |  22 ++--
 swh/provenance/provenance.py                       |  63 ++++++----
 swh/provenance/sql/30-schema.sql                   |  10 +-
 swh/provenance/storage/archive.py                  |  72 ++++++-----
 11 files changed, 308 insertions(+), 225 deletions(-)
Changes applied before test
commit 31f9aad728a70b7bfe9f065b86bf78ab1126bc99
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Wed Jun 16 13:53:25 2021 +0200

    Update backend methods associated to the origin-revision layer

commit 4953a1688a19b956a33164b4e5540edb3b8e2c78
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Wed Jun 16 11:11:23 2021 +0200

    Fix outdated comments and code styling.

commit e610883f5d89c58e409cc7ddf490a4a778b99dcd
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Mon Jun 14 14:03:09 2021 +0200

    Refactor RevisionEntry's parents iterator
    
    Make parents a class property and create a separate method to retrieve information
    from the archive, just as it is done for the other model classes

commit 01bbb0ce850a5cfafaa1837fa346a19b2a851fc9
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Thu Jun 17 13:51:54 2021 +0200

    Rework ArchiveInterface
    
    Remove Unused methods and fix type annotations. Update Other methods in
    OriginEntry and RevisionEntry accordingly.

commit cd48f9d71efeaf73cfcc24f397d22589a5f7987b
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Thu Jun 17 13:48:24 2021 +0200

    Fix bugs when retrieving parents in RevisionEntry
    
    Convert `Revision.date` from` TimestampWithTimezone` to `datetime` as expected by` RevisionEntry`.
    Create a list with the iterator returned by `ArchiveInterface.revision_get()` before comparison.

See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/171/ for more details.

Thanks for the DatetimeCache & co.

I'm fine with this, but note the last comments I've added, I'd prefer these to be addressed here, but no big deal if not.

swh/provenance/postgresql/provenancedb_base.py
43

We currently have only one cache, the "write cache" is not a cache per se, it's the list (well the set) of elements in the cache that needs to be written in the db.

Having 2 caches gives no real advantage but instead oblige to look in 2 caches (read cache then write cache) when looking for an element.

I don't see the drawback of the current solution (keeping only one cache + a list of cache entries that needs to be written in the db).

swh/provenance/provenance.py
252

could use dict.setdefault here to make it a one-liner:

self.cache["revision_before_revision"].setdefault(revision.id, set()).add(relative.id)
263–268

I prefer the form

if id not in cache:
  look for id in db and put it in cache
return cache.get(id)

see suggested code.

278

why the dict() here? It should not be needed.

This revision is now accepted and ready to land.Jun 21 2021, 4:32 PM
swh/provenance/provenance.py
278

self.cache["revision_in_origin"] is a set of tuples and revision.id should be the first item of at least one of those tuples. I don't think doing revision.id in self.cache["revision_in_origin"]actually works. Does it?

Build is green

Patch application report for D5880 (id=21173)

Could not rebase; Attempt merge onto 011645221c...

Updating 0116452..f2ec1e5
Fast-forward
 requirements-swh.txt                               |   2 +-
 swh/provenance/archive.py                          |  42 +++++--
 swh/provenance/model.py                            |  90 +++++++-------
 swh/provenance/origin.py                           |   3 +-
 swh/provenance/postgresql/archive.py               |  70 ++++++-----
 swh/provenance/postgresql/provenancedb_base.py     | 133 +++++++++++++--------
 .../postgresql/provenancedb_with_path.py           |  26 ++--
 .../postgresql/provenancedb_without_path.py        |  22 ++--
 swh/provenance/provenance.py                       |  62 ++++++----
 swh/provenance/sql/30-schema.sql                   |  10 +-
 swh/provenance/storage/archive.py                  |  72 ++++++-----
 11 files changed, 307 insertions(+), 225 deletions(-)
Changes applied before test
commit f2ec1e58c91f1fda1121c6d7c1e29dede6431d45
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Wed Jun 16 13:53:25 2021 +0200

    Update backend methods associated to the origin-revision layer

commit fd66d83c119d8f8b283098f6320bf6f68cc7114f
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Wed Jun 16 11:11:23 2021 +0200

    Fix outdated comments and code styling.

commit 417fd014d96f09b8a20831a6a060cfb408376d5c
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Mon Jun 14 14:03:09 2021 +0200

    Refactor RevisionEntry's parents iterator
    
    Make parents a class property and create a separate method to retrieve information
    from the archive, just as it is done for the other model classes

commit f354b65e52ed78b3a637b2632e1b74bb920669be
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Thu Jun 17 13:51:54 2021 +0200

    Rework ArchiveInterface
    
    Remove Unused methods and fix type annotations. Update Other methods in
    OriginEntry and RevisionEntry accordingly.

commit e6f39d0244b10b49942b0ab93d4628828e343642
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Thu Jun 17 13:48:24 2021 +0200

    Fix bugs when retrieving parents in RevisionEntry
    
    Convert `Revision.date` from` TimestampWithTimezone` to `datetime` as expected by` RevisionEntry`.
    Create a list with the iterator returned by `ArchiveInterface.revision_get()` before comparison.

See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/179/ for more details.

Build is green

Patch application report for D5880 (id=21184)

Could not rebase; Attempt merge onto 011645221c...

Updating 0116452..a0e6dff
Fast-forward
 requirements-swh.txt                               |   2 +-
 swh/provenance/archive.py                          |  42 +++++--
 swh/provenance/model.py                            |  90 +++++++-------
 swh/provenance/origin.py                           |   3 +-
 swh/provenance/postgresql/archive.py               |  70 ++++++-----
 swh/provenance/postgresql/provenancedb_base.py     | 133 +++++++++++++--------
 .../postgresql/provenancedb_with_path.py           |  26 ++--
 .../postgresql/provenancedb_without_path.py        |  22 ++--
 swh/provenance/provenance.py                       |  62 ++++++----
 swh/provenance/sql/30-schema.sql                   |  10 +-
 swh/provenance/storage/archive.py                  |  72 ++++++-----
 11 files changed, 307 insertions(+), 225 deletions(-)
Changes applied before test
commit a0e6dffc572d46a8b0b4407319eeb5b189cf9dc8
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Wed Jun 16 13:53:25 2021 +0200

    Update backend methods associated to the origin-revision layer

commit fd66d83c119d8f8b283098f6320bf6f68cc7114f
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Wed Jun 16 11:11:23 2021 +0200

    Fix outdated comments and code styling.

commit 417fd014d96f09b8a20831a6a060cfb408376d5c
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Mon Jun 14 14:03:09 2021 +0200

    Refactor RevisionEntry's parents iterator
    
    Make parents a class property and create a separate method to retrieve information
    from the archive, just as it is done for the other model classes

commit f354b65e52ed78b3a637b2632e1b74bb920669be
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Thu Jun 17 13:51:54 2021 +0200

    Rework ArchiveInterface
    
    Remove Unused methods and fix type annotations. Update Other methods in
    OriginEntry and RevisionEntry accordingly.

commit e6f39d0244b10b49942b0ab93d4628828e343642
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Thu Jun 17 13:48:24 2021 +0200

    Fix bugs when retrieving parents in RevisionEntry
    
    Convert `Revision.date` from` TimestampWithTimezone` to `datetime` as expected by` RevisionEntry`.
    Create a list with the iterator returned by `ArchiveInterface.revision_get()` before comparison.

See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/184/ for more details.