Page MenuHomeSoftware Heritage

Add tests for isochrone graph topology
ClosedPublic

Authored by aeviso on Jun 10 2021, 11:59 AM.

Details

Summary

Added equality check functions to model classes.

Added test for isochrone graph topology.

The expected isochrone graphs for each revision in the test should be
provided as a dictionary in an associated yaml file.
Currently only heuristic lower with depth=1 is being tested.

Also, model clases DirectoryEntry, FileEntry and IsochroneNode were
modified so that they can be compared by equlity and hashed.

Added isochrone graph tests for the remaining heuristics.

Depends on D5822

Diff Detail

Repository
rDPROV Provenance database
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

Build is green

Patch application report for D5845 (id=20902)

Could not rebase; Attempt merge onto 6cdd424eba...

Updating 6cdd424..db73e48
Fast-forward
 swh/provenance/cli.py                              |  12 +-
 swh/provenance/model.py                            |  69 ++-
 swh/provenance/origin.py                           | 107 ++---
 swh/provenance/provenance.py                       |  62 ++-
 swh/provenance/revision.py                         |  24 +-
 .../tests/data/graphs_cmdbts2_lower_1.yaml         | 476 +++++++++++++++++++++
 .../tests/data/graphs_cmdbts2_lower_2.yaml         | 476 +++++++++++++++++++++
 .../tests/data/graphs_cmdbts2_upper_1.yaml         | 444 +++++++++++++++++++
 .../tests/data/graphs_cmdbts2_upper_2.yaml         | 436 +++++++++++++++++++
 .../tests/data/graphs_out-of-order_lower_1.yaml    | 222 ++++++++++
 swh/provenance/tests/test_isochrone_graph.py       |  88 ++++
 swh/provenance/tests/test_origin_iterator.py       |  43 +-
 swh/provenance/tests/test_provenance_db.py         |  12 +-
 swh/provenance/tests/test_revision_iterator.py     |   6 +-
 14 files changed, 2340 insertions(+), 137 deletions(-)
 create mode 100644 swh/provenance/tests/data/graphs_cmdbts2_lower_1.yaml
 create mode 100644 swh/provenance/tests/data/graphs_cmdbts2_lower_2.yaml
 create mode 100644 swh/provenance/tests/data/graphs_cmdbts2_upper_1.yaml
 create mode 100644 swh/provenance/tests/data/graphs_cmdbts2_upper_2.yaml
 create mode 100644 swh/provenance/tests/data/graphs_out-of-order_lower_1.yaml
 create mode 100644 swh/provenance/tests/test_isochrone_graph.py
Changes applied before test
commit db73e486e78405e8e26a0378d0f27d8cdda260fc
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Tue Jun 8 11:08:10 2021 +0200

    Added isochrone graph tests for the remaining heuristics.

commit 54ab245f205ad683876a447c8961a82d8e961f90
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Mon Jun 7 17:09:43 2021 +0200

    Added test for isochrone graph topology.
    
    The expected isochrone graphs for each revision in the test should be
    provided as a dictionary in an associated yaml file.
    Currently onle heuristic lower with depth=1 is being tested.
    
    Also, model clases DirectoryEntry, FileEntry and IsochroneNode were
    modified so that they can be compared by equlity and hashed.

commit 1d6b6ec3ebe9a6a6284ca6898e678abcb9203336
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Mon Jun 7 11:45:25 2021 +0200

    Added equality check functions to model classes.

commit 134ae3c784326111de4775a9f5da9ac7d9c34cf7
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Fri Jun 4 15:01:38 2021 +0200

    Refactored OriginEntry to include info about visit date and snapshot
    
    Revisions reachable from an OriginEntry are now queried separately and returned in an iterable.
    Also `origin_add` function was updated accordingly, and CLI command now uses a CSVOriginIterator
    similar to that previously developed for revisions. Updated tests as well to ensure nothing was
    broken during the refactoring.

commit 436730ceb8c384141686ee262dc1e87d7f0a5f48
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Fri Jun 4 14:54:56 2021 +0200

    Removed archive parameter from RevisionEntry
    
    since is was no longer required.

See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/107/ for more details.

aeviso retitled this revision from Isochrone graph testing to Add isochrone graph tests.

Build is green

Patch application report for D5845 (id=20904)

Could not rebase; Attempt merge onto 6cdd424eba...

Updating 6cdd424..52de7a0
Fast-forward
 swh/provenance/cli.py                              |  12 +-
 swh/provenance/model.py                            |  69 ++-
 swh/provenance/origin.py                           | 107 ++---
 swh/provenance/provenance.py                       |  62 ++-
 swh/provenance/revision.py                         |  24 +-
 .../tests/data/graphs_cmdbts2_lower_1.yaml         | 476 +++++++++++++++++++++
 .../tests/data/graphs_cmdbts2_lower_2.yaml         | 476 +++++++++++++++++++++
 .../tests/data/graphs_cmdbts2_upper_1.yaml         | 444 +++++++++++++++++++
 .../tests/data/graphs_cmdbts2_upper_2.yaml         | 436 +++++++++++++++++++
 .../tests/data/graphs_out-of-order_lower_1.yaml    | 222 ++++++++++
 swh/provenance/tests/test_isochrone_graph.py       |  88 ++++
 swh/provenance/tests/test_origin_iterator.py       |  43 +-
 swh/provenance/tests/test_provenance_db.py         |  12 +-
 swh/provenance/tests/test_revision_iterator.py     |   6 +-
 14 files changed, 2340 insertions(+), 137 deletions(-)
 create mode 100644 swh/provenance/tests/data/graphs_cmdbts2_lower_1.yaml
 create mode 100644 swh/provenance/tests/data/graphs_cmdbts2_lower_2.yaml
 create mode 100644 swh/provenance/tests/data/graphs_cmdbts2_upper_1.yaml
 create mode 100644 swh/provenance/tests/data/graphs_cmdbts2_upper_2.yaml
 create mode 100644 swh/provenance/tests/data/graphs_out-of-order_lower_1.yaml
 create mode 100644 swh/provenance/tests/test_isochrone_graph.py
Changes applied before test
commit 52de7a0c11057ec80743807350f4a625efab11ba
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Tue Jun 8 11:08:10 2021 +0200

    Add isochrone graph tests for the remaining heuristics

commit a5e8234b9f43ce02144ff9ff37a2caa00ebf608a
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Mon Jun 7 17:09:43 2021 +0200

    Add test for isochrone graph topology
    
    The expected isochrone graphs for each revision in the test should be
    provided as a dictionary in an associated yaml file.
    Currently only heuristic lower with depth=1 is being tested.
    
    Also, model clases DirectoryEntry, FileEntry and IsochroneNode were
    modified so that they can be compared by equlity and hashed.

commit 59c0f1bf49617824feae7ad08ce1b5f46b7a70cd
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Mon Jun 7 11:45:25 2021 +0200

    Add equality check functions to model classes

commit 4ebab8d2ce933637c85bf456a796b6da8d12b513
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Fri Jun 4 15:01:38 2021 +0200

    Refactor OriginEntry to include info about visit date and snapshot
    
    Revisions reachable from an OriginEntry are now queried separately and returned in an iterable.
    Also `origin_add` function was updated accordingly, and CLI command now uses a CSVOriginIterator
    similar to that previously developed for revisions. Updated tests as well to ensure nothing was
    broken during the refactoring.

commit 6ea9313800b86e996783f0bf5e37cc8c34f3627e
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Fri Jun 4 14:54:56 2021 +0200

    Remove archive parameter from RevisionEntry

See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/109/ for more details.

aeviso retitled this revision from Add isochrone graph tests to Add tests for isochrone graph topology.
douardda added inline comments.
swh/provenance/provenance.py
325

what's the point of this special case for None values? self.dbdate == other.dbdate shoud work just fine (same for maxdate).

339

I would have simply done it like:

def __eq__(self, other):
    if isinstance(other, IsochroneNode):
        return hash(self) == hash(other)
    return False

(but then the question of the absence of self.children in the __hash__() method arises)

344

see above, why children is not part of this hash? BTW, why is the __hash__ method needed?

swh/provenance/tests/test_isochrone_graph.py
56 ↗(On Diff #20904)

maybe the README file should be updated in this diff to make this true ;-)

This new file format needs to be documented.

Also, to make it easier to read without having to refer to the documentation, it would be best to put the revision id as a field rather than the key, like:

# Isochrone graph for R00
- rev: c0d8929936631ecbcf9147be6b8aa13b13b014e4:
  entry:
    id: "a4cb5e6b2831f7e8eef0e6e08e43d642c97303a1"
    name: ""
  [...]
# Isochrone graph for R01
- rev: 1444db96cbd8cd791abe83527becee73d3c64e86:
  entry:
    id: "b3cf11b22c9f93c3c494cf90ab072f394155072d"
    name: ""
  [...]
66 ↗(On Diff #20904)

then this code becomes:

for graph_as_dict in expected:
  rev = graph_as_dict["rev"]
  [...]

Overall ok, but a few things:

  • I don't think we need the __hash__ methods at all (unless you want to rewrite the __eq__ ones using hash comparison)
  • we usually don't put print statement in tests
  • I'd prefer the more explicit file format (see inline comments), then it does not really need to be documented in the README file (but the comment in the code needs to be removed)
swh/provenance/tests/test_isochrone_graph.py
23 ↗(On Diff #20904)
WARNING: the d.copy() you made just above does not "protect" you here! It's only a "shadow copy": it will only create a new directory with the same keys and values, but the values themselves are NOT copies!

So here, you are actually modifying the original dictionary!

What you want here is a deepcopy (from copy import deepcopy)

This revision now requires changes to proceed.Jun 10 2021, 5:35 PM
swh/provenance/tests/test_isochrone_graph.py
56 ↗(On Diff #20904)

Yeah, but the idea is that the dictionary is a representation of the graph. The revision id is not part of the graph, and it seems confusing to have extra information in the same structure.

Build is green

Patch application report for D5845 (id=20939)

Could not rebase; Attempt merge onto 075b0d6cd6...

Updating 075b0d6..d142474
Fast-forward
 swh/provenance/cli.py                              |  12 +-
 swh/provenance/model.py                            |  76 +++-
 swh/provenance/origin.py                           | 107 ++---
 swh/provenance/provenance.py                       |  63 ++-
 swh/provenance/revision.py                         |  24 +-
 .../tests/data/graphs_cmdbts2_lower_1.yaml         | 476 +++++++++++++++++++++
 .../tests/data/graphs_cmdbts2_lower_2.yaml         | 476 +++++++++++++++++++++
 .../tests/data/graphs_cmdbts2_upper_1.yaml         | 444 +++++++++++++++++++
 .../tests/data/graphs_cmdbts2_upper_2.yaml         | 436 +++++++++++++++++++
 .../tests/data/graphs_out-of-order_lower_1.yaml    | 222 ++++++++++
 swh/provenance/tests/test_isochrone_graph.py       |  90 ++++
 swh/provenance/tests/test_origin_iterator.py       |  43 +-
 swh/provenance/tests/test_provenance_db.py         |  12 +-
 swh/provenance/tests/test_revision_iterator.py     |   4 +-
 14 files changed, 2348 insertions(+), 137 deletions(-)
 create mode 100644 swh/provenance/tests/data/graphs_cmdbts2_lower_1.yaml
 create mode 100644 swh/provenance/tests/data/graphs_cmdbts2_lower_2.yaml
 create mode 100644 swh/provenance/tests/data/graphs_cmdbts2_upper_1.yaml
 create mode 100644 swh/provenance/tests/data/graphs_cmdbts2_upper_2.yaml
 create mode 100644 swh/provenance/tests/data/graphs_out-of-order_lower_1.yaml
 create mode 100644 swh/provenance/tests/test_isochrone_graph.py
Changes applied before test
commit d14247403019bd34e1e430c71e074574c89e3e57
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Tue Jun 8 11:08:10 2021 +0200

    Add isochrone graph tests for the remaining heuristics

commit 594e5a83b38ceb99a46520e9d835b14074caed70
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Mon Jun 7 17:09:43 2021 +0200

    Add test for isochrone graph topology
    
    The expected isochrone graphs for each revision in the test should be
    provided as a dictionary in an associated yaml file.
    Currently only heuristic lower with depth=1 is being tested.
    
    Also, model clases DirectoryEntry, FileEntry and IsochroneNode were
    modified so that they can be compared by equlity and hashed.

commit 244b08b4b51c8f0891301e4495f05ba8368e156c
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Mon Jun 7 11:45:25 2021 +0200

    Add equality check functions to model classes

commit 5a9fb987c9aa169095185b1559a87bce536776b7
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Fri Jun 4 15:01:38 2021 +0200

    Refactor OriginEntry to include info about visit date and snapshot
    
    Revisions reachable from an OriginEntry are now queried separately and returned in an iterable.
    Also `origin_add` function was updated accordingly, and CLI command now uses a CSVOriginIterator
    similar to that previously developed for revisions. Updated tests as well to ensure nothing was
    broken during the refactoring.

commit fa4942ddff353c4d1d46c7f61ec570c9a28bc648
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Fri Jun 4 14:54:56 2021 +0200

    Remove archive parameter from RevisionEntry

See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/118/ for more details.

swh/provenance/tests/test_isochrone_graph.py
56 ↗(On Diff #20904)

I'd much rather have an extra line of code to deal with this "inconsistency" in the graph representation in favor of an explicit and readable file format than the other way around.

You already have a transformation function to do that (isochrone_graph_from_dict) so it's already not a direct representation.

Better readability is far more important than "purity" of some sort.
Code (and test data sets in this case) are primarily made to be read by humans.

aeviso added inline comments.
swh/provenance/tests/test_isochrone_graph.py
56 ↗(On Diff #20904)

That's exactly the point: readability. This has nothing to do with "purity". It is actually related to the other comment about default values. For this to be really easy to follow for a human being, the real information should be easy to find in the file. If you look at the files now you'll see that 90% of the dbdate fields are null, most of the known fields are False and I haven't even written all invalid fields because only 1 out of 150+ is True. This values are actually the expected one for most internal nodes in the graph, so what's the point on putting them explicitly in the file. The real information gets lost among so much noise.

I won't discuss more the file format for the tests, it's not that critical, but please remove the implementation of __hash__ methods if they are actually not needed, and simplify the code the the __eq__ as suggested.

swh/provenance/tests/test_isochrone_graph.py
56 ↗(On Diff #20904)

I don't understand how

53519b5a5e8cf12a4f81f82e489f95c1d04d5314:
    entry:
        id: "195601c98c28f04e0d19c218434738006990db72"
        name: ""

where you need to refer to a documentation (assuming it exists) or the code to know what this key refers to, is more readable than:

- rev: c0d8929936631ecbcf9147be6b8aa13b13b014e4:
  entry:
    id: "a4cb5e6b2831f7e8eef0e6e08e43d642c97303a1"
    name: ""

where you can read the meaning of this field directly in the data structure.

The other part is something else: we are dealing with data used for tests. I prefer not to have implicit behavior for this, even if it makes the dataset more verbose.

But ok, do as you prefer.

This revision now requires changes to proceed.Jun 11 2021, 12:22 PM
swh/provenance/provenance.py
325

OK, I didn't know how equality handled None values but if a simple == comparison works then it's clearly better.

339

I actually followed the way it's done in other swh modules (or instance: swh.core, swh.loader and swh.model)

344

The method is needed so we can create a set of IsochroneNode items, to compare self.children regardless of the order of the lists. I believe is actually better to do muliset comparison though.

swh/provenance/provenance.py
344

About the first question (why self.children is not included), I believe __hash__ will yield a different results if elements in the list are in different order, but we actually want two items to have the same hash is that's the only difference between them. So I guess we need to add set(self.children) to the tuple. Right?

swh/provenance/provenance.py
339

well we do compute hashes in these modules, but do we need to do so here?

344

The method is needed so we can create a set of IsochroneNode items, to compare self.children regardless of the order of the lists. I believe is actually better to do muliset comparison though.

I don't see anywhere in the code we actually hash these objects. Did I miss something?

swh/provenance/provenance.py
339

I meant that I followed the way other modules implement __eq__, not __hash__

344

the object needs to be hashable to insert it into a set. I'm moving to Counter but it requires this anyway

Build is green

Patch application report for D5845 (id=20959)

Could not rebase; Attempt merge onto 075b0d6cd6...

Updating 075b0d6..30bff86
Fast-forward
 swh/provenance/cli.py                              |  12 +-
 swh/provenance/model.py                            |  76 +++-
 swh/provenance/origin.py                           | 106 ++----
 swh/provenance/provenance.py                       |  70 +++-
 swh/provenance/revision.py                         |  24 +-
 .../tests/data/graphs_cmdbts2_lower_1.yaml         | 401 +++++++++++++++++++++
 .../tests/data/graphs_cmdbts2_lower_2.yaml         | 401 +++++++++++++++++++++
 .../tests/data/graphs_cmdbts2_upper_1.yaml         | 371 +++++++++++++++++++
 .../tests/data/graphs_cmdbts2_upper_2.yaml         | 365 +++++++++++++++++++
 .../tests/data/graphs_out-of-order_lower_1.yaml    | 188 ++++++++++
 swh/provenance/tests/test_isochrone_graph.py       |  88 +++++
 swh/provenance/tests/test_origin_iterator.py       |  43 ++-
 swh/provenance/tests/test_provenance_db.py         |  12 +-
 swh/provenance/tests/test_revision_iterator.py     |   4 +-
 14 files changed, 2023 insertions(+), 138 deletions(-)
 create mode 100644 swh/provenance/tests/data/graphs_cmdbts2_lower_1.yaml
 create mode 100644 swh/provenance/tests/data/graphs_cmdbts2_lower_2.yaml
 create mode 100644 swh/provenance/tests/data/graphs_cmdbts2_upper_1.yaml
 create mode 100644 swh/provenance/tests/data/graphs_cmdbts2_upper_2.yaml
 create mode 100644 swh/provenance/tests/data/graphs_out-of-order_lower_1.yaml
 create mode 100644 swh/provenance/tests/test_isochrone_graph.py
Changes applied before test
commit 30bff867e97f37849d960fdc284513844fae2a34
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Tue Jun 8 11:08:10 2021 +0200

    Add isochrone graph tests for the remaining heuristics

commit c2843ae5ba47bfb03d0fa10ce45ad274061097df
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Mon Jun 7 17:09:43 2021 +0200

    Add test for isochrone graph topology
    
    The expected isochrone graphs for each revision in the test should be
    provided as a dictionary in an associated yaml file.
    Currently only heuristic lower with depth=1 is being tested.
    
    Also, model clases DirectoryEntry, FileEntry and IsochroneNode were
    modified so that they can be compared by equlity and hashed.

commit 1dd14205ba60d02e14f2c352113871c1025b8e7f
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Mon Jun 7 11:45:25 2021 +0200

    Add equality check functions to model classes

commit 9aaaedb3ebc981555276e99616a0c4fc837b78e9
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Fri Jun 4 15:01:38 2021 +0200

    Refactor OriginEntry to include info about visit date and snapshot
    
    Revisions reachable from an OriginEntry are now queried separately and returned in an iterable.
    Also `origin_add` function was updated accordingly, and CLI command now uses a CSVOriginIterator
    similar to that previously developed for revisions. Updated tests as well to ensure nothing was
    broken during the refactoring.

commit fa4942ddff353c4d1d46c7f61ec570c9a28bc648
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Fri Jun 4 14:54:56 2021 +0200

    Remove archive parameter from RevisionEntry

See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/123/ for more details.

This revision is now accepted and ready to land.Jun 14 2021, 10:41 AM