
First stage of refactoring for the Provenance backend
Closed, Public

Authored by aeviso on Jun 10 2021, 12:46 PM.

Details

Summary

Simplify cache usage in the Provenance backend

Refactor insertion methods in the Provenance backend

Split Provenance backend in two layers

The first layer (temporarily called ProvenanceBackend) is responsible for
handling read/write caches and should ideally be DB-agnostic (it is not
yet, though).
The second layer is responsible for all DB interaction. In upcoming revisions
it will be further refactored into several workers to guarantee no
collisions when writing to the DB.
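
As a rough illustration of the two-layer split described above, here is a minimal sketch. It is not the code from this diff: the class names `StorageLayer` and `CacheLayer`, the methods `get_date`, `set_date` and `flush`, and the in-memory dictionary standing in for the database are all assumptions made for the example.

```python
from datetime import datetime, timezone
from typing import Dict, Optional


class StorageLayer:
    """Hypothetical second layer: the only place that talks to the 'DB'.

    A plain dict stands in for the database here.
    """

    def __init__(self) -> None:
        self.rows: Dict[bytes, datetime] = {}
        self.reads = 0  # counts DB reads, only to make the caching visible

    def get_date(self, sha1: bytes) -> Optional[datetime]:
        self.reads += 1
        return self.rows.get(sha1)

    def set_dates(self, dates: Dict[bytes, datetime]) -> None:
        # Keep the earliest known date for each id.
        for sha1, date in dates.items():
            current = self.rows.get(sha1)
            if current is None or date < current:
                self.rows[sha1] = date


class CacheLayer:
    """Hypothetical first layer: read/write caches, no DB-specific code."""

    def __init__(self, storage: StorageLayer) -> None:
        self.storage = storage
        self.read_cache: Dict[bytes, Optional[datetime]] = {}
        self.write_cache: Dict[bytes, datetime] = {}

    def get_date(self, sha1: bytes) -> Optional[datetime]:
        # Pending writes take precedence over what was read from storage.
        if sha1 in self.write_cache:
            return self.write_cache[sha1]
        if sha1 not in self.read_cache:
            self.read_cache[sha1] = self.storage.get_date(sha1)
        return self.read_cache[sha1]

    def set_date(self, sha1: bytes, date: datetime) -> None:
        self.write_cache[sha1] = date

    def flush(self) -> None:
        # Push all pending writes to storage in one go, then reset caches.
        self.storage.set_dates(self.write_cache)
        self.read_cache.clear()
        self.write_cache.clear()


storage = StorageLayer()
cache = CacheLayer(storage)
date = datetime(2021, 6, 9, tzinfo=timezone.utc)
cache.set_date(b"\x01", date)   # stays in the write cache
cache.flush()                   # now reaches the storage layer
```

In this sketch the storage layer is only touched on cache misses and on `flush()`, which is the property that would let it be swapped out or split into workers later.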

Depends on D5847

Diff Detail

Event Timeline

Build is green

Patch application report for D5848 (id=20910)

Could not rebase; Attempt merge onto 6cdd424eba...

Updating 6cdd424..f500096
Fast-forward
 swh/provenance/__init__.py                         |  16 +-
 swh/provenance/cli.py                              |  12 +-
 swh/provenance/model.py                            |  69 ++-
 swh/provenance/origin.py                           | 107 ++---
 swh/provenance/postgresql/provenancedb_base.py     | 341 ++++-----------
 .../postgresql/provenancedb_with_path.py           | 155 +++----
 .../postgresql/provenancedb_without_path.py        | 104 ++---
 swh/provenance/provenance.py                       | 298 +++++++++++--
 swh/provenance/revision.py                         |  24 +-
 swh/provenance/tests/conftest.py                   |   6 +-
 .../tests/data/graphs_cmdbts2_lower_1.yaml         | 476 +++++++++++++++++++++
 .../tests/data/graphs_cmdbts2_lower_2.yaml         | 476 +++++++++++++++++++++
 .../tests/data/graphs_cmdbts2_upper_1.yaml         | 444 +++++++++++++++++++
 .../tests/data/graphs_cmdbts2_upper_2.yaml         | 436 +++++++++++++++++++
 .../tests/data/graphs_out-of-order_lower_1.yaml    | 223 ++++++++++
 .../tests/data/synthetic_out-of-order_lower_1.txt  |   2 +-
 swh/provenance/tests/test_conftest.py              |   2 +-
 swh/provenance/tests/test_isochrone_graph.py       | 104 +++++
 swh/provenance/tests/test_origin_iterator.py       |  43 +-
 swh/provenance/tests/test_provenance_db.py         |  12 +-
 swh/provenance/tests/test_provenance_heuristics.py |  40 +-
 swh/provenance/tests/test_revision_iterator.py     |   6 +-
 22 files changed, 2787 insertions(+), 609 deletions(-)
 create mode 100644 swh/provenance/tests/data/graphs_cmdbts2_lower_1.yaml
 create mode 100644 swh/provenance/tests/data/graphs_cmdbts2_lower_2.yaml
 create mode 100644 swh/provenance/tests/data/graphs_cmdbts2_upper_1.yaml
 create mode 100644 swh/provenance/tests/data/graphs_cmdbts2_upper_2.yaml
 create mode 100644 swh/provenance/tests/data/graphs_out-of-order_lower_1.yaml
 create mode 100644 swh/provenance/tests/test_isochrone_graph.py
Changes applied before test
commit f5000961116c3ab720c682155d27e678eaf3ff73
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Wed Jun 9 16:27:51 2021 +0200

    Split Provenance backend in two layers
    
    The first layer (temporarily called `ProvenanceBackend`) is responsible for
    handling read/write caches and should ideally be DB-agnostic (it is not
    yet, though).
    The second layer is responsible for all DB interaction. In upcoming revisions
    it will be further refactored into several workers to guarantee no
    collisions when writing to the DB.

commit add73300f8054eeca73f816867a14ae1d8420190
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Wed Jun 9 12:41:54 2021 +0200

    Refactor insertion methods in the Provenance backend

commit 2a8e113d2407e1d11df7d0d2f4116967c92d7e57
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Wed Jun 9 12:24:31 2021 +0200

    Simplify cache usage in the Provenance backend

commit a5b7bd73c0ec5fc7cf2b2c7e93c00b40d147ca84
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Wed Jun 9 11:42:15 2021 +0200

    Remove `directory_invalidate_in_isochrone_frontier` method from provenance interface
    
    It was meant to be used in a multi-thread scenario which is not possible
    due to Python's lack of actual parallelism. This way the
    `build_isochrone_graph` function is guaranteed not to modify the DB (it
    performs only reads now). Also the isochrone graph test was updated to
    use `revision_add` with a new flag to avoid commits, hence emulating the
    batch processing behaviour.

commit b24bc279c19e346a77d233fa7d24f148f52c5d89
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Tue Jun 8 18:08:25 2021 +0200

    Improve out-of-order revision processing
    
    Added a flag to the `IsochroneNode` to identify invalidated frontiers
    and force its update later when processing the graph. This should
    guarantee the same results when processing revision one-by-one vs. in
    batches (in terms of db rows).

commit 1146a9b9203557195da47df2b76ba1603aa4ca31
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Tue Jun 8 16:20:21 2021 +0200

    Refine maxdate calculation

commit 18063809ccc0b4f7cbfcf00fc95b26ba297c99ab
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Tue Jun 8 16:17:49 2021 +0200

    Fix issue when processing revision in batch
    
    If any revision in the batch was invalidating a frontier, the commit of
    the complete batch failed. This is now fixed.

commit 52de7a0c11057ec80743807350f4a625efab11ba
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Tue Jun 8 11:08:10 2021 +0200

    Add isochrone graph tests for the remaining heuristics

commit a5e8234b9f43ce02144ff9ff37a2caa00ebf608a
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Mon Jun 7 17:09:43 2021 +0200

    Add test for isochrone graph topology
    
    The expected isochrone graphs for each revision in the test should be
    provided as a dictionary in an associated yaml file.
    Currently only heuristic lower with depth=1 is being tested.
    
    Also, model classes DirectoryEntry, FileEntry and IsochroneNode were
    modified so that they can be compared for equality and hashed.

commit 59c0f1bf49617824feae7ad08ce1b5f46b7a70cd
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Mon Jun 7 11:45:25 2021 +0200

    Add equality check functions to model classes

commit 4ebab8d2ce933637c85bf456a796b6da8d12b513
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Fri Jun 4 15:01:38 2021 +0200

    Refactor OriginEntry to include info about visit date and snapshot
    
    Revisions reachable from an OriginEntry are now queried separately and returned in an iterable.
    Also `origin_add` function was updated accordingly, and CLI command now uses a CSVOriginIterator
    similar to that previously developed for revisions. Updated tests as well to ensure nothing was
    broken during the refactoring.

commit 6ea9313800b86e996783f0bf5e37cc8c34f3627e
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Fri Jun 4 14:54:56 2021 +0200

    Remove archive parameter from RevisionEntry

See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/111/ for more details.

douardda added a subscriber: douardda.

I may have other comments to make, but I'd definitely prefer having 3 separate diffs here as well...

swh/provenance/postgresql/provenancedb_with_path.py
76–132

It would be best if the data passed as an argument were only the needed content, i.e. data[relation]. This method needs no access to any other part of the data dictionary, so only give it what it needs.
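
To make the suggestion concrete, here is a minimal sketch of an insertion helper that receives only the rows for one relation instead of the whole dictionary. The signature, the `flush` caller, the relation names and the in-memory `db` dictionary are assumptions for illustration, not the actual code under review.

```python
from typing import Dict, Set, Tuple

# Simplified row shape (source id, destination id); an assumption for the example.
Row = Tuple[bytes, bytes]


def insert_relation(db: Dict[str, Set[Row]], relation: str, rows: Set[Row]) -> None:
    """Insert the rows of a single relation.

    The function only sees the rows it needs, not the full data dictionary.
    `db` is an in-memory stand-in for the database.
    """
    db.setdefault(relation, set()).update(rows)


def flush(db: Dict[str, Set[Row]], data: Dict[str, Set[Row]]) -> None:
    """Hypothetical caller: it owns the dictionary and selects data[relation]."""
    for relation, rows in data.items():
        if rows:  # skip relations with nothing to write
            insert_relation(db, relation, rows)


db: Dict[str, Set[Row]] = {}
data = {
    "content_in_directory": {(b"c1", b"d1")},  # hypothetical relation names
    "directory_in_revision": set(),
}
flush(db, data)
```

With this shape the helper cannot accidentally depend on unrelated entries of the dictionary, and it can be tested with a plain set of rows.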

swh/provenance/provenance.py
33

Not sure I like this renaming of the blob argument. In content_add_to_revision() it's a FileEntry; here it's a sha1 (bytes). Using the same argument name for both makes the API confusing.

143

There is no need for both a read and a write cache. Just keep one cache plus a set of ids that need to be added to the backend storage. See D5829, in which I do this (that diff would obviously need to be reworked to fit your current stack, but its approach and goal are still valid).
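
One possible reading of this proposal, sketched under assumptions: a single cache of known values plus a set recording which ids still have to be written. This is not taken from D5829; the class `SingleCache`, its method names and the dict used as backend storage are hypothetical.

```python
from datetime import datetime, timezone
from typing import Dict, Optional, Set


class SingleCache:
    """One cache for reads and writes, plus a set of ids pending insertion."""

    def __init__(self, storage: Dict[bytes, datetime]) -> None:
        self.storage = storage  # in-memory stand-in for the backend storage
        self.cache: Dict[bytes, Optional[datetime]] = {}
        self.added: Set[bytes] = set()  # ids whose cached value must be written

    def get(self, sha1: bytes) -> Optional[datetime]:
        # Reads and pending writes are served from the same dictionary.
        if sha1 not in self.cache:
            self.cache[sha1] = self.storage.get(sha1)
        return self.cache[sha1]

    def set(self, sha1: bytes, date: datetime) -> None:
        self.cache[sha1] = date
        self.added.add(sha1)

    def flush(self) -> None:
        # Only ids recorded in `added` are written back.
        for sha1 in self.added:
            date = self.cache[sha1]
            if date is not None:
                self.storage[sha1] = date
        self.added.clear()


backend: Dict[bytes, datetime] = {}
single = SingleCache(backend)
single.set(b"\x01", datetime(2021, 6, 10, tzinfo=timezone.utc))
single.flush()
```

Reads and writes are still tracked separately here (through `added`), but without keeping two dictionaries in sync.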

This revision now requires changes to proceed. Jun 10 2021, 4:11 PM
swh/provenance/provenance.py
143

I'd rather keep separate track of reads and writes for the refactoring to come. We can always merge them in the future if that is still valid, but merging now only to split them again later doesn't make much sense.

swh/provenance/provenance.py
33

This was done to be consistent with how we name parameters all over the module. There are similar parameters for directories and revisions, and we don't use dirid/revid (or similar) there.

Build is green

Patch application report for D5848 (id=20941)

Could not rebase; Attempt merge onto 075b0d6cd6...

Updating 075b0d6..6a8d341
Fast-forward
 swh/provenance/__init__.py                         |  16 +-
 swh/provenance/cli.py                              |  12 +-
 swh/provenance/model.py                            |  76 +++-
 swh/provenance/origin.py                           | 107 ++---
 swh/provenance/postgresql/provenancedb_base.py     | 352 +++++----------
 .../postgresql/provenancedb_with_path.py           | 155 +++----
 .../postgresql/provenancedb_without_path.py        | 104 ++---
 swh/provenance/provenance.py                       | 300 +++++++++++--
 swh/provenance/revision.py                         |  24 +-
 swh/provenance/tests/conftest.py                   |   6 +-
 .../tests/data/graphs_cmdbts2_lower_1.yaml         | 476 +++++++++++++++++++++
 .../tests/data/graphs_cmdbts2_lower_2.yaml         | 476 +++++++++++++++++++++
 .../tests/data/graphs_cmdbts2_upper_1.yaml         | 444 +++++++++++++++++++
 .../tests/data/graphs_cmdbts2_upper_2.yaml         | 436 +++++++++++++++++++
 .../tests/data/graphs_out-of-order_lower_1.yaml    | 223 ++++++++++
 .../tests/data/synthetic_out-of-order_lower_1.txt  |   6 +-
 swh/provenance/tests/test_conftest.py              |   2 +-
 swh/provenance/tests/test_isochrone_graph.py       | 106 +++++
 swh/provenance/tests/test_origin_iterator.py       |  43 +-
 swh/provenance/tests/test_provenance_db.py         |  14 +-
 swh/provenance/tests/test_provenance_heuristics.py |  40 +-
 swh/provenance/tests/test_revision_iterator.py     |   4 +-
 22 files changed, 2810 insertions(+), 612 deletions(-)
 create mode 100644 swh/provenance/tests/data/graphs_cmdbts2_lower_1.yaml
 create mode 100644 swh/provenance/tests/data/graphs_cmdbts2_lower_2.yaml
 create mode 100644 swh/provenance/tests/data/graphs_cmdbts2_upper_1.yaml
 create mode 100644 swh/provenance/tests/data/graphs_cmdbts2_upper_2.yaml
 create mode 100644 swh/provenance/tests/data/graphs_out-of-order_lower_1.yaml
 create mode 100644 swh/provenance/tests/test_isochrone_graph.py
Changes applied before test
commit 6a8d34145b7b113d8ca62cf134d50ab69c491ec7
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Wed Jun 9 16:27:51 2021 +0200

    Split Provenance backend in two layers
    
    The first layer (temporarily called `ProvenanceBackend`) is responsible for
    handling read/write caches and should ideally be DB-agnostic (it is not
    yet, though).
    The second layer is responsible for all DB interaction. In upcoming revisions
    it will be further refactored into several workers to guarantee no
    collisions when writing to the DB.

commit 4a8964d25ff8490b8bf33d8480f6db1b97a0af22
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Wed Jun 9 12:41:54 2021 +0200

    Refactor insertion methods in the Provenance backend

commit 4296febd8fbe3b0c8dc5a3650cbbd4ecf29713cf
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Wed Jun 9 12:24:31 2021 +0200

    Simplify cache usage in the Provenance backend

commit af41748ef54dedf87f8304bb457b028b2de6369f
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Wed Jun 9 11:42:15 2021 +0200

    Remove `directory_invalidate_in_isochrone_frontier` method from provenance interface
    
    It was meant to be used in a multi-thread scenario which is not possible
    due to Python's lack of actual parallelism. This way the
    `build_isochrone_graph` function is guaranteed not to modify the DB (it
    performs only reads now). Also the isochrone graph test was updated to
    use `revision_add` with a new flag to avoid commits, hence emulating the
    batch processing behaviour.

commit c20aeb432e831e412c13033c4e7a3d0ee6553e82
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Tue Jun 8 18:08:25 2021 +0200

    Improve out-of-order revision processing
    
    Added a flag to the `IsochroneNode` to identify invalidated frontiers
    and force its update later when processing the graph. This should
    guarantee the same results when processing revision one-by-one vs. in
    batches (in terms of db rows).

commit 65226455d522f5156ed8d7e37d2b7546d0d010f1
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Tue Jun 8 16:20:21 2021 +0200

    Refine maxdate calculation

commit d4ab6857f6a74e181316bf90db008b51d4b81085
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Tue Jun 8 16:17:49 2021 +0200

    Fix issue when processing revision in batch
    
    If any revision in the batch was invalidating a frontier, the commit of
    the complete batch failed. This is now fixed.

commit d14247403019bd34e1e430c71e074574c89e3e57
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Tue Jun 8 11:08:10 2021 +0200

    Add isochrone graph tests for the remaining heuristics

commit 594e5a83b38ceb99a46520e9d835b14074caed70
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Mon Jun 7 17:09:43 2021 +0200

    Add test for isochrone graph topology
    
    The expected isochrone graphs for each revision in the test should be
    provided as a dictionary in an associated yaml file.
    Currently only heuristic lower with depth=1 is being tested.
    
    Also, model classes DirectoryEntry, FileEntry and IsochroneNode were
    modified so that they can be compared for equality and hashed.

commit 244b08b4b51c8f0891301e4495f05ba8368e156c
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Mon Jun 7 11:45:25 2021 +0200

    Add equality check functions to model classes

commit 5a9fb987c9aa169095185b1559a87bce536776b7
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Fri Jun 4 15:01:38 2021 +0200

    Refactor OriginEntry to include info about visit date and snapshot
    
    Revisions reachable from an OriginEntry are now queried separately and returned in an iterable.
    Also `origin_add` function was updated accordingly, and CLI command now uses a CSVOriginIterator
    similar to that previously developed for revisions. Updated tests as well to ensure nothing was
    broken during the refactoring.

commit fa4942ddff353c4d1d46c7f61ec570c9a28bc648
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Fri Jun 4 14:54:56 2021 +0200

    Remove archive parameter from RevisionEntry

See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/120/ for more details.

I still disagree with some parts of this (the blob vs. blobid naming stuff, and more importantly the whole data dict being given as argument of insert_relation() when only one element of this dict is actually needed), but meh.

swh/provenance/provenance.py
33

Then name it blobid (or any other name) every time it's a sha1, and blob every time it's a FileEntry, rather than having a common "blob" name which can be either bytes or a FileEntry depending on the method, for the sake of both consistency and clarity (same for other entity types).

But in any case, renaming parameters of methods of an Interface (thus changing an API) should preferably go in a dedicated revision. It's not that critical in this case, but generally you would do it in a dedicated one.
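
For illustration only, the naming convention proposed in this comment could look like the sketch below. The `FileEntry` dataclass is a simplified stand-in rather than the real model class, and the `Provenance` class and its method signatures are reduced, hypothetical versions written for the example.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, Optional, Set, Tuple


@dataclass(frozen=True)
class FileEntry:
    """Simplified stand-in for the model class of the same name."""

    id: bytes
    name: bytes


class Provenance:
    """Hypothetical interface showing the naming rule: `*id` means bytes."""

    def __init__(self) -> None:
        self.dates: Dict[bytes, datetime] = {}
        self.in_revision: Set[Tuple[bytes, bytes]] = set()

    def content_get_early_date(self, blobid: bytes) -> Optional[datetime]:
        # `blobid` is a sha1, so the name says "identifier".
        return self.dates.get(blobid)

    def content_add_to_revision(self, revid: bytes, blob: FileEntry) -> None:
        # `blob` is a FileEntry, so the name says "object".
        self.in_revision.add((blob.id, revid))


prov = Provenance()
entry = FileEntry(id=b"\x01", name=b"README")
prov.content_add_to_revision(b"\x02", entry)
```

With this rule a reader can tell from the parameter name alone whether a method expects a raw identifier or a model object.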

143

Note that I don't propose to stop keeping track of reads and writes in the cache, just do it differently.

This revision is now accepted and ready to land. Jun 11 2021, 12:37 PM
aeviso marked an inline comment as done.

Rebase

Build is green

Patch application report for D5848 (id=20962)

Could not rebase; Attempt merge onto 075b0d6cd6...

Updating 075b0d6..b15aa2b
Fast-forward
 swh/provenance/__init__.py                         |  16 +-
 swh/provenance/cli.py                              |  18 +-
 swh/provenance/graph.py                            | 223 ++++++++
 swh/provenance/model.py                            |  76 ++-
 swh/provenance/origin.py                           | 183 ++++---
 swh/provenance/postgresql/provenancedb_base.py     | 352 ++++--------
 .../postgresql/provenancedb_with_path.py           | 155 +++---
 .../postgresql/provenancedb_without_path.py        | 104 ++--
 swh/provenance/provenance.py                       | 593 ++++++---------------
 swh/provenance/revision.py                         | 237 +++++++-
 swh/provenance/tests/conftest.py                   |   6 +-
 .../tests/data/graphs_cmdbts2_lower_1.yaml         | 401 ++++++++++++++
 .../tests/data/graphs_cmdbts2_lower_2.yaml         | 401 ++++++++++++++
 .../tests/data/graphs_cmdbts2_upper_1.yaml         | 371 +++++++++++++
 .../tests/data/graphs_cmdbts2_upper_2.yaml         | 365 +++++++++++++
 .../tests/data/graphs_out-of-order_lower_1.yaml    | 185 +++++++
 .../tests/data/synthetic_out-of-order_lower_1.txt  |   6 +-
 swh/provenance/tests/test_conftest.py              |   2 +-
 swh/provenance/tests/test_isochrone_graph.py       | 101 ++++
 swh/provenance/tests/test_origin_iterator.py       |  43 +-
 swh/provenance/tests/test_provenance_db.py         |  16 +-
 swh/provenance/tests/test_provenance_heuristics.py |  51 +-
 swh/provenance/tests/test_revision_iterator.py     |   4 +-
 23 files changed, 2895 insertions(+), 1014 deletions(-)
 create mode 100644 swh/provenance/graph.py
 create mode 100644 swh/provenance/tests/data/graphs_cmdbts2_lower_1.yaml
 create mode 100644 swh/provenance/tests/data/graphs_cmdbts2_lower_2.yaml
 create mode 100644 swh/provenance/tests/data/graphs_cmdbts2_upper_1.yaml
 create mode 100644 swh/provenance/tests/data/graphs_cmdbts2_upper_2.yaml
 create mode 100644 swh/provenance/tests/data/graphs_out-of-order_lower_1.yaml
 create mode 100644 swh/provenance/tests/test_isochrone_graph.py
Changes applied before test
commit b15aa2bd3ab7ef8b832bd2fb63f6b5d4f43ba287
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Wed Jun 9 16:27:51 2021 +0200

    Split Provenance backend in two layers
    
    The first layer (temporarily called `ProvenanceBackend`) is responsible for
    handling read/write caches and should ideally be DB-agnostic (it is not
    yet, though).
    The second layer is responsible for all DB interaction. In upcoming revisions
    it will be further refactored into several workers to guarantee no
    collisions when writing to the DB.

commit 0306231f80c56076ea4f917f2acf619f18727dfb
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Wed Jun 9 12:41:54 2021 +0200

    Refactor insertion methods in the Provenance backend

commit 8a92cabc4140dc0cbf6a83901d372f640f444a2d
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Wed Jun 9 12:24:31 2021 +0200

    Simplify cache usage in the Provenance backend

commit 3c8ff220ae4c3d5375fbd5e0981835a67f11f911
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Tue Jun 8 16:17:49 2021 +0200

    Fix issue when processing revision in batch
    
    If any revision in the batch was invalidating a frontier, the commit of
    the complete batch failed. This is now fixed.

commit 30bff867e97f37849d960fdc284513844fae2a34
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Tue Jun 8 11:08:10 2021 +0200

    Add isochrone graph tests for the remaining heuristics

commit c2843ae5ba47bfb03d0fa10ce45ad274061097df
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Mon Jun 7 17:09:43 2021 +0200

    Add test for isochrone graph topology
    
    The expected isochrone graphs for each revision in the test should be
    provided as a dictionary in an associated yaml file.
    Currently only heuristic lower with depth=1 is being tested.
    
    Also, model classes DirectoryEntry, FileEntry and IsochroneNode were
    modified so that they can be compared for equality and hashed.

commit 1dd14205ba60d02e14f2c352113871c1025b8e7f
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Mon Jun 7 11:45:25 2021 +0200

    Add equality check functions to model classes

commit 9aaaedb3ebc981555276e99616a0c4fc837b78e9
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Fri Jun 4 15:01:38 2021 +0200

    Refactor OriginEntry to include info about visit date and snapshot
    
    Revisions reachable from an OriginEntry are now queried separately and returned in an iterable.
    Also `origin_add` function was updated accordingly, and CLI command now uses a CSVOriginIterator
    similar to that previously developed for revisions. Updated tests as well to ensure nothing was
    broken during the refactoring.

commit fa4942ddff353c4d1d46c7f61ec570c9a28bc648
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Fri Jun 4 14:54:56 2021 +0200

    Remove archive parameter from RevisionEntry

See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/125/ for more details.

Build is green

Patch application report for D5848 (id=20971)

Could not rebase; Attempt merge onto 075b0d6cd6...

Updating 075b0d6..c4b1f31
Fast-forward
 swh/provenance/__init__.py                         |  16 +-
 swh/provenance/cli.py                              |  12 +-
 swh/provenance/model.py                            |  76 +++-
 swh/provenance/origin.py                           | 106 ++----
 swh/provenance/postgresql/provenancedb_base.py     | 352 +++++-------------
 .../postgresql/provenancedb_with_path.py           | 155 ++++----
 .../postgresql/provenancedb_without_path.py        | 104 +++---
 swh/provenance/provenance.py                       | 309 +++++++++++++---
 swh/provenance/revision.py                         |  24 +-
 swh/provenance/tests/conftest.py                   |   6 +-
 .../tests/data/graphs_cmdbts2_lower_1.yaml         | 401 +++++++++++++++++++++
 .../tests/data/graphs_cmdbts2_lower_2.yaml         | 401 +++++++++++++++++++++
 .../tests/data/graphs_cmdbts2_upper_1.yaml         | 371 +++++++++++++++++++
 .../tests/data/graphs_cmdbts2_upper_2.yaml         | 365 +++++++++++++++++++
 .../tests/data/graphs_out-of-order_lower_1.yaml    | 185 ++++++++++
 .../tests/data/synthetic_out-of-order_lower_1.txt  |   6 +-
 swh/provenance/tests/test_conftest.py              |   2 +-
 swh/provenance/tests/test_isochrone_graph.py       | 100 +++++
 swh/provenance/tests/test_origin_iterator.py       |  43 ++-
 swh/provenance/tests/test_provenance_db.py         |  12 +-
 swh/provenance/tests/test_provenance_heuristics.py |  49 +--
 swh/provenance/tests/test_revision_iterator.py     |   4 +-
 22 files changed, 2484 insertions(+), 615 deletions(-)
 create mode 100644 swh/provenance/tests/data/graphs_cmdbts2_lower_1.yaml
 create mode 100644 swh/provenance/tests/data/graphs_cmdbts2_lower_2.yaml
 create mode 100644 swh/provenance/tests/data/graphs_cmdbts2_upper_1.yaml
 create mode 100644 swh/provenance/tests/data/graphs_cmdbts2_upper_2.yaml
 create mode 100644 swh/provenance/tests/data/graphs_out-of-order_lower_1.yaml
 create mode 100644 swh/provenance/tests/test_isochrone_graph.py
Changes applied before test
commit c4b1f31640b1263e8afb7c4c71a8ca3d984b3fd2
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Fri Jun 11 17:52:00 2021 +0200

    Split Provenance backend in two layers
    
    The first layer (temporarily called `ProvenanceBackend`) is responsible for
    handling read/write caches and should ideally be DB-agnostic (it is not
    yet, though).
    The second layer is responsible for all DB interaction. In upcoming revisions
    it will be further refactored into several workers to guarantee no
    collisions when writing to the DB.

commit f1a9fe8182a3a6a8a47d6093197ee6b800fce95b
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Wed Jun 9 12:41:54 2021 +0200

    Refactor insertion methods in the Provenance backend

commit 3f99025d6d45287ba7ce97db39eef3f9c5acb78c
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Wed Jun 9 12:24:31 2021 +0200

    Simplify cache usage in the Provenance backend

commit d1b476b27ac4e7f355468a0514f6a9850dbf1143
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Tue Jun 8 16:17:49 2021 +0200

    Improve out-of-order revision processing
    
    Fix issue when processing revision in batch
    
    If any revision in the batch was invalidating a frontier, the commit of
    the complete batch failed. This is now fixed.
    
    Refine maxdate calculation
    
    Added a flag to the IsochroneNode to identify invalidated frontiers
    and force its update later when processing the graph. This should
    guarantee the same results when processing revision one-by-one vs. in
    batches (in terms of db rows).
    
    Remove directory_invalidate_in_isochrone_frontier method from provenance interface
    
    It was meant to be used in a multi-thread scenario which is not possible
    due to Python's lack of actual parallelism. This way the
    build_isochrone_graph function is guaranteed not to modify the DB (it
    performs only reads now). Also the isochrone graph test was updated to
    use revision_add with a new flag to avoid commits, hence emulating the
    batch processing behaviour.

commit 30bff867e97f37849d960fdc284513844fae2a34
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Tue Jun 8 11:08:10 2021 +0200

    Add isochrone graph tests for the remaining heuristics

commit c2843ae5ba47bfb03d0fa10ce45ad274061097df
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Mon Jun 7 17:09:43 2021 +0200

    Add test for isochrone graph topology
    
    The expected isochrone graphs for each revision in the test should be
    provided as a dictionary in an associated yaml file.
    Currently only heuristic lower with depth=1 is being tested.
    
    Also, model classes DirectoryEntry, FileEntry and IsochroneNode were
    modified so that they can be compared for equality and hashed.

commit 1dd14205ba60d02e14f2c352113871c1025b8e7f
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Mon Jun 7 11:45:25 2021 +0200

    Add equality check functions to model classes

commit 9aaaedb3ebc981555276e99616a0c4fc837b78e9
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Fri Jun 4 15:01:38 2021 +0200

    Refactor OriginEntry to include info about visit date and snapshot
    
    Revisions reachable from an OriginEntry are now queried separately and returned in an iterable.
    Also `origin_add` function was updated accordingly, and CLI command now uses a CSVOriginIterator
    similar to that previously developed for revisions. Updated tests as well to ensure nothing was
    broken during the refactoring.

commit fa4942ddff353c4d1d46c7f61ec570c9a28bc648
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Fri Jun 4 14:54:56 2021 +0200

    Remove archive parameter from RevisionEntry

See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/127/ for more details.