
Refactor the isochrone graph computation
Closed, Public

Authored by douardda on May 25 2021, 10:17 AM.

Details

Summary

Attempt to simplify this part of the code a bit:

  • IsochroneNode objects are now used only for directories
  • FileEntry objects are stored in a new IsochroneNode.files attribute, so
  • IsochroneNode.children only stores IsochroneNode (thus DirectoryEntry) objects,
  • rename IsochroneNode.date to 'dbdate' and clarify its semantics
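
As a rough illustration of the layout described in the bullets above, here is a minimal sketch. Only the names IsochroneNode, FileEntry, DirectoryEntry, files, children and dbdate come from the summary; the field types, the add_directory() helper and the sample values are assumptions made for the example, not the actual swh.provenance code.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class FileEntry:
    """A file found in a directory (hypothetical minimal shape)."""

    id: bytes
    name: bytes


@dataclass
class DirectoryEntry:
    """A directory (hypothetical minimal shape)."""

    id: bytes
    name: bytes


@dataclass
class IsochroneNode:
    """One node per directory; files are never wrapped in a node."""

    entry: DirectoryEntry
    # Date known for this directory in the provenance DB, if any (assumed meaning).
    dbdate: Optional[datetime] = None
    # Files directly contained in the directory.
    files: List[FileEntry] = field(default_factory=list)
    # Sub-directories only, each wrapped in its own IsochroneNode.
    children: List["IsochroneNode"] = field(default_factory=list)

    def add_directory(
        self, child: DirectoryEntry, dbdate: Optional[datetime] = None
    ) -> "IsochroneNode":
        """Hypothetical helper: wrap a sub-directory in a node and attach it."""
        node = IsochroneNode(child, dbdate=dbdate)
        self.children.append(node)
        return node


root = IsochroneNode(DirectoryEntry(b"\x01", b""))
root.files.append(FileEntry(b"\x02", b"README"))
sub = root.add_directory(
    DirectoryEntry(b"\x03", b"src"),
    dbdate=datetime(2021, 5, 12, tzinfo=timezone.utc),
)
```

With this shape, walking the graph only ever follows children, and anything that needs the files of a directory reads them from the node itself.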

Also (in dedicated revisions):

  • Slightly improve the code of ProvenanceDBBase
  • Add __str__ methods to RevisionEntry, DirectoryEntry and FileEntry to ease logging and debugging.
  • Add 'ls_files()' and 'ls_dirs()' methods to the DirectoryEntry class to make it a bit easier to compute the isochrone graph (see following revisions).
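
A small sketch of what such helpers could look like. The method names ls_files(), ls_dirs() and __str__ come from the list above; the entries attribute, the constructor signature and the exact string format are assumptions for illustration only.

```python
from typing import Iterator, List, Union


class FileEntry:
    def __init__(self, id: bytes, name: bytes) -> None:
        self.id = id
        self.name = name

    def __str__(self) -> str:
        # Short, log-friendly representation (format is an assumption).
        return f"<FileEntry[{self.id.hex()}] {self.name!r}>"


class DirectoryEntry:
    def __init__(self, id: bytes, name: bytes) -> None:
        self.id = id
        self.name = name
        # Hypothetical storage for the direct children of this directory.
        self.entries: List[Union["DirectoryEntry", FileEntry]] = []

    def ls_files(self) -> Iterator[FileEntry]:
        """Yield only the files directly contained in this directory."""
        return (e for e in self.entries if isinstance(e, FileEntry))

    def ls_dirs(self) -> Iterator["DirectoryEntry"]:
        """Yield only the sub-directories of this directory."""
        return (e for e in self.entries if isinstance(e, DirectoryEntry))

    def __str__(self) -> str:
        return f"<DirectoryEntry[{self.id.hex()}] {self.name!r}>"


src = DirectoryEntry(b"\x0a", b"src")
src.entries = [FileEntry(b"\x0b", b"a.py"), DirectoryEntry(b"\x0c", b"pkg")]
print(src, [str(f) for f in src.ls_files()])
```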

Depends on D5772

Diff Detail

Repository
rDPROV Provenance database
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline


Build is green

Patch application report for D5773 (id=20671)

Could not rebase; Attempt merge onto 77fce4e59d...

Updating 77fce4e..2860fe6
Fast-forward
 swh/provenance/model.py                        |  21 ++-
 swh/provenance/postgresql/provenancedb_base.py |  12 +-
 swh/provenance/provenance.py                   | 202 +++++++++++++++----------
 swh/provenance/tests/test_provenance_db.py     |  96 +++++++++++-
 4 files changed, 238 insertions(+), 93 deletions(-)
Changes applied before test
commit 2860fe6126de1804ab9190026545c935bbbbd99a
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed May 12 09:47:03 2021 +0200

    Refactor the isochrone graph computation
    
    attempt to simplify a bit this part of the code:
    
    - IsochroneNode are now only used for directories
    - FileEntry are stored in a new IsochroneNode.files attribute, so
    - IsochroneNode.children only stores IsochroneNode (thus DirectoryEntry)
      objects,
    - rename IsochroneNode.date as 'dbdate' and clarify its semantics

commit 8e4a2f69b53fd8ef74613509eb7a5f6707855a7a
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed May 19 16:16:53 2021 +0200

    Add 'ls_files()' and 'ls_dirs()' methods to the DirectoryEntry class
    
    to make it a bit easier to compute the isochrone graph (see following
    revisions).

commit 4ba05e92698d73a6a05078a969eea85d08cd5dca
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed May 19 16:14:41 2021 +0200

    Add __str__ methods to RevisionEntry, DirectoryEntry and FileEntry
    
    to ease logging and debugging.

commit fb0ef598657a9810deac43d495ab882718265543
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed May 12 12:44:07 2021 +0200

    Improve a bit the code of ProvenanceDBBase

commit edff905d0df269cb90246ef77b554b1ca58cbeef
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed May 12 10:30:23 2021 +0200

    Add a test for the build_isochrone_graph() function
    
    this test is far from ideal, since it's mostly the record of what happen
    during a "known good" session of revision insertions, but at least it
    should allow to refactor code related to the isochrone graph computation
    with a bit more confidence...

commit 113e11031aa5365a9244409a0cfe646061cb94f9
Author: David Douard <david.douard@sdfa3.org>
Date:   Tue May 11 15:33:57 2021 +0200

    Replace the 'dates' argument of IsochroneNode() by a simple 'date' one
    
    there is no need for passing a dict here, we only care about the date
    for the node being instanciated.

See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/34/ for more details.

aeviso requested changes to this revision. May 26 2021, 12:50 PM

I'm not really sure this new algorithm does the same as the previous one. Some subtle things were changed and I have the feeling the semantics are different now. Also, I found the previous version to be clearer; I'd rather stay with it.

swh/provenance/provenance.py
265 ↗(On Diff #20671)

maxdate is actually calculated when building the isochrone graph since it may depend on other nodes. I think it's clearer to keep that logic together
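
To make the dependency on other nodes concrete, here is a hedged sketch of a bottom-up computation. It assumes that maxdate is the latest date among a directory's own files and its sub-directories' maxdate values, and that it is only defined once every date underneath is known; the Node class, its fields and compute_maxdate() are hypothetical and do not reproduce the code under review.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional, Tuple


@dataclass
class Node:
    name: str
    # One date per file in the directory; None means the date is unknown.
    file_dates: List[Optional[datetime]] = field(default_factory=list)
    children: List["Node"] = field(default_factory=list)
    maxdate: Optional[datetime] = None
    known: bool = False


def compute_maxdate(root: Node) -> None:
    """Fill maxdate/known for every node, children before parents."""
    stack: List[Tuple[Node, bool]] = [(root, False)]
    while stack:
        node, children_done = stack.pop()
        if not children_done:
            # Revisit this node once all of its children have been processed.
            stack.append((node, True))
            stack.extend((child, False) for child in node.children)
            continue
        node.known = all(d is not None for d in node.file_dates) and all(
            child.known for child in node.children
        )
        dates = [d for d in node.file_dates if d is not None]
        dates += [c.maxdate for c in node.children if c.maxdate is not None]
        node.maxdate = max(dates) if node.known and dates else None


def day(n: int) -> datetime:
    return datetime(2021, 5, n, tzinfo=timezone.utc)


lib = Node("lib", file_dates=[day(10), day(12)])
root = Node("", file_dates=[day(11)], children=[lib])
compute_maxdate(root)
```

In this sketch the root's maxdate cannot be set when the node is created, because it is only available after lib has been processed, which is the point made in the comment above.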

268 ↗(On Diff #20671)

This is confusing, files are children as well. Maybe change this name to dirs or subdirs?

282 ↗(On Diff #20671)

I don't think this respects the previous logic. Why are you resetting maxdate and not known?

290 ↗(On Diff #20671)

byt -> by

328 ↗(On Diff #20671)

What's the need of keeping this global? I believe it would be better to reset it for each node. If dates are to be cached, the DB module is the one that should do it, taking care of frontier invalidation, etc. If the revision has thousands of files, this pays the cost of adding dates to a big dictionary unnecessarily, since the old ones won't be reused.

343 ↗(On Diff #20671)

This is what I meant in my previous comment. known is not reset by this function, and that will break the logic below for calculating maxdate.

345 ↗(On Diff #20671)

I don't really follow the changes to the logic below, and this is quite crucial for the algorithm to work properly.

This revision now requires changes to proceed. May 26 2021, 12:50 PM

I'm not really sure this new algorithm does the same as the previous one. Some subtle things were changed and I have the feeling the semantics are different now. Also, I found the previous version to be clearer; I'd rather stay with it.

no, no no! rather have tests to guarantee we do not hit undocumented "subtle things" or any other corner case! please!

You'll notice it took me some time and effort to add tests before this refactoring, and that these tests pass before AND after it, without any modification to the tests. If you are not confident in these tests, then let's add more and better ones, but we cannot rely on the "feeling" of someone, nor on the fact that this code seems to work when applied to a big dataset after manual verification.
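
The approach described here (record what a "known good" implementation produces, then require the refactored code to reproduce it unchanged) can be sketched generically as follows. The two count_dirs functions, the CASES data and the GOLDEN string are made up for the illustration and are unrelated to the actual build_isochrone_graph() test.

```python
import json
from typing import Callable, Dict

# A directory tree as nested dicts: keys are names, values are sub-trees.
Tree = Dict[str, dict]


def count_dirs_v1(tree: Tree) -> int:
    """'Before' implementation: recursive."""
    return 1 + sum(count_dirs_v1(sub) for sub in tree.values())


def count_dirs_v2(tree: Tree) -> int:
    """'After' implementation: same result, iterative."""
    total, stack = 0, [tree]
    while stack:
        current = stack.pop()
        total += 1
        stack.extend(current.values())
    return total


CASES: Dict[str, Tree] = {
    "empty": {},
    "nested": {"a": {"b": {}}, "c": {}},
}

# Output recorded once from the "known good" implementation.
GOLDEN = '{"empty": 1, "nested": 4}'


def characterize(impl: Callable[[Tree], int]) -> str:
    """Serialize the results of impl on all cases in a stable form."""
    return json.dumps({name: impl(tree) for name, tree in CASES.items()}, sort_keys=True)


def test_refactoring_preserves_behaviour() -> None:
    assert characterize(count_dirs_v1) == GOLDEN
    assert characterize(count_dirs_v2) == GOLDEN


test_refactoring_preserves_behaviour()
```

Such a test says nothing about whether the recorded behaviour is right; it only guards against unintended changes, which is why untested execution paths need their own scenarios.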

I found the previous version to be clearer

possibly, but you wrote it...

I'll comment more later.

swh/provenance/provenance.py
268 ↗(On Diff #20671)

yes I did consider renaming children to directories or similar, but in fact it makes sense as is: files stores only the files (which are not IsochroneNode but FileEntry) in the directory entry of the current node, and children are the IsochroneNode objects: the graph structure consists only of IsochroneNode objects linked through the children attribute.

What would possibly make sense here is to only use files (FileEntry) from the DirectoryEntry, we already have them there, no need to keep another self.files attribute in the IsochroneNode.

But for this, it would be better to make the filling of the DirectoryEntry object explicit, i.e. add an explicit method (that takes the archive argument) to fill DirectoryEntry._children rather than "hijacking" DirectoryEntry.ls()

I'll give this a try.
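
A minimal sketch of that idea, assuming an archive object exposing a directory_ls(id) call that returns dicts with "type", "target" and "name" keys. InMemoryArchive, that dict layout, and the choice to raise when children were not retrieved are assumptions for the example; the comment above only says the filling should be an explicit method taking the archive.

```python
from typing import Dict, Iterable, List, Optional


class InMemoryArchive:
    """Hypothetical stand-in for the real archive backend."""

    def __init__(self, listings: Dict[bytes, List[dict]]) -> None:
        self._listings = listings

    def directory_ls(self, id: bytes) -> Iterable[dict]:
        return list(self._listings.get(id, []))


class FileEntry:
    def __init__(self, id: bytes, name: bytes) -> None:
        self.id = id
        self.name = name


class DirectoryEntry:
    def __init__(self, id: bytes, name: bytes) -> None:
        self.id = id
        self.name = name
        self._files: Optional[List[FileEntry]] = None
        self._dirs: Optional[List["DirectoryEntry"]] = None

    def retrieve_children(self, archive: InMemoryArchive) -> None:
        """Explicitly load the direct children; a second call is a no-op."""
        if self._files is not None:
            return
        self._files, self._dirs = [], []
        for child in archive.directory_ls(self.id):
            if child["type"] == "dir":
                self._dirs.append(DirectoryEntry(child["target"], child["name"]))
            elif child["type"] == "file":
                self._files.append(FileEntry(child["target"], child["name"]))

    @property
    def files(self) -> List[FileEntry]:
        if self._files is None:
            raise RuntimeError("call retrieve_children() first")
        return self._files

    @property
    def dirs(self) -> List["DirectoryEntry"]:
        if self._dirs is None:
            raise RuntimeError("call retrieve_children() first")
        return self._dirs


archive = InMemoryArchive(
    {
        b"root": [
            {"type": "file", "target": b"f1", "name": b"README"},
            {"type": "dir", "target": b"d1", "name": b"src"},
        ]
    }
)
root_dir = DirectoryEntry(b"root", b"")
root_dir.retrieve_children(archive)
```

Keeping the archive access in one named method makes it visible at the call site when a round trip to the archive happens, instead of hiding it behind a listing accessor.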

343 ↗(On Diff #20671)

I need to think more about this (I did ask myself whether known also needs to be reset), but note that the tests pass with this change, so if there is a case where this breaks the algorithm, we need a test for it.

douardda added inline comments.
swh/provenance/provenance.py
343 ↗(On Diff #20671)

The thing is this execution path is currently untested. Do you have simple scenarios (typically with a synthetic file) somewhere to add as a test scenario?

Build is green

Patch application report for D5773 (id=20734)

Could not rebase; Attempt merge onto 5aa0314dd7...

Updating 5aa0314..d85f2b0
Fast-forward
 swh/provenance/model.py                        |  21 ++-
 swh/provenance/postgresql/provenancedb_base.py |  12 +-
 swh/provenance/provenance.py                   | 201 ++++++++++++++-----------
 swh/provenance/tests/test_provenance_db.py     |  96 +++++++++++-
 4 files changed, 235 insertions(+), 95 deletions(-)
Changes applied before test
commit d85f2b0ee48aefe03ad32311623e5390f43d7261
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed May 12 09:47:03 2021 +0200

    Refactor the isochrone graph computation
    
    attempt to simplify a bit this part of the code:
    
    - IsochroneNode are now only used for directories
    - FileEntry are stored in a new IsochroneNode.files attribute, so
    - IsochroneNode.children only stores IsochroneNode (thus DirectoryEntry)
      objects,
    - rename IsochroneNode.date as 'dbdate' and clarify its semantics

commit 31d833ec86bf041e100795e7796ce832d00450ef
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed May 19 16:16:53 2021 +0200

    Add 'ls_files()' and 'ls_dirs()' methods to the DirectoryEntry class
    
    to make it a bit easier to compute the isochrone graph (see following
    revisions).

commit 72644b98a218132c0b173f360c503438688ecebb
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed May 19 16:14:41 2021 +0200

    Add __str__ methods to RevisionEntry, DirectoryEntry and FileEntry
    
    to ease logging and debugging.

commit a71041fbaf3f0d7ec3ea944cbbf04286c57d8b7e
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed May 12 12:44:07 2021 +0200

    Improve a bit the code of ProvenanceDBBase

commit defcb388ffba0869edb1a126b6626710c396c2ac
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed May 12 10:30:23 2021 +0200

    Add a test for the build_isochrone_graph() function
    
    this test is far from ideal, since it's mostly the record of what happen
    during a "known good" session of revision insertions, but at least it
    should allow to refactor code related to the isochrone graph computation
    with a bit more confidence...

See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/41/ for more details.

I'm not really sure this new algorithm does the same as the previous one. Some subtle things were changed and I have the feeling the semantics are different now. Also, I found the previous version to be clearer; I'd rather stay with it.

Also forgot to mention it earlier, but I am still not satisfied by this code: it remains too complicated and too difficult to comprehend IMHO, but I believe it's a bit easier to follow with this diff.

swh/provenance/provenance.py
343 ↗(On Diff #20671)

FTR D5811 adds a simple test that at least passes through this part of the code.

swh/provenance/provenance.py
343 ↗(On Diff #20671)

This part of the code only executes when processing out-of-order revisions. A frontier defined with a later date is found and needs to be ignored by the current (earlier) revision.
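
The out-of-order case can be illustrated with a small sketch: a directory already carries a date in the DB because a later revision was processed first, and an earlier revision must not treat it as a frontier. The function name usable_dbdate() and the strict ">" comparison are assumptions made for this example, not the actual check in provenance.py.

```python
from datetime import datetime, timezone
from typing import Optional


def usable_dbdate(
    dbdate: Optional[datetime], revision_date: datetime
) -> Optional[datetime]:
    """Return the DB date only if the current revision can rely on it.

    A date recorded by a later revision is ignored (treated as unknown)
    when an earlier revision is processed afterwards.
    """
    if dbdate is not None and dbdate > revision_date:
        return None
    return dbdate


later = datetime(2021, 6, 1, tzinfo=timezone.utc)
earlier = datetime(2021, 5, 1, tzinfo=timezone.utc)

# The frontier was recorded while processing the later revision;
# the earlier revision, processed out of order, has to ignore it.
print(usable_dbdate(later, revision_date=earlier))
```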

swh/provenance/provenance.py
343 ↗(On Diff #20671)

I know, this is why I've added this stupid simple test in D5811...

fix typos, add more comments/docstrings, and completely remove the IsochroneNode.files attribute

just use self.entry.files, no need for duplicating FileEntry objects, as discussed in
https://forge.softwareheritage.org/D5773#147407

Build is green

Patch application report for D5773 (id=20750)

Could not rebase; Attempt merge onto 49e47c3ea7...

Merge made by the 'recursive' strategy.
 swh/provenance/model.py                        |  31 +++-
 swh/provenance/postgresql/provenancedb_base.py |  12 +-
 swh/provenance/provenance.py                   | 220 ++++++++++++++-----------
 swh/provenance/tests/test_provenance_db.py     |  96 ++++++++++-
 4 files changed, 256 insertions(+), 103 deletions(-)
Changes applied before test
commit 92cd3c1ab52bf4c77bf5abe3b402776f8119b299
Merge: 49e47c3 af15ad6
Author: Jenkins user <jenkins@localhost>
Date:   Wed Jun 2 15:33:33 2021 +0000

    Merge branch 'diff-target' into HEAD

commit af15ad65f4a34e7703bfec80666102a6403cb505
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed May 12 09:47:03 2021 +0200

    Refactor the isochrone graph computation
    
    attempt to simplify a bit this part of the code:
    
    - IsochroneNode are now only used for directories
    - FileEntry are used directly from IsochroneNode.entry.files (no need
      for creating new FileEntry instances), so
    - IsochroneNode.children only stores IsochroneNode (thus DirectoryEntry)
      objects,
    - rename IsochroneNode.date as 'dbdate' and clarify its semantics,
    - attempt to document (comments) a bit more the algorithm and semantics
      of several attributes/variables used in there.

commit 1f49fdc967a2854d3a68dec34886b824fdf045f6
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed May 19 16:16:53 2021 +0200

    Replace 'DirectoryEntry.ls()' method by 'files' and 'dirs' properties
    
    and make the retrieval of children from the archive explicit in a
    dedicated retrieve_children() method.

commit 72644b98a218132c0b173f360c503438688ecebb
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed May 19 16:14:41 2021 +0200

    Add __str__ methods to RevisionEntry, DirectoryEntry and FileEntry
    
    to ease logging and debugging.

commit a71041fbaf3f0d7ec3ea944cbbf04286c57d8b7e
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed May 12 12:44:07 2021 +0200

    Improve a bit the code of ProvenanceDBBase

commit defcb388ffba0869edb1a126b6626710c396c2ac
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed May 12 10:30:23 2021 +0200

    Add a test for the build_isochrone_graph() function
    
    this test is far from ideal, since it's mostly the record of what happen
    during a "known good" session of revision insertions, but at least it
    should allow to refactor code related to the isochrone graph computation
    with a bit more confidence...

See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/50/ for more details.

Build is green

Patch application report for D5773 (id=20765)

Could not rebase; Attempt merge onto 49e47c3ea7...

Updating 49e47c3..94598b3
Fast-forward
 swh/provenance/model.py                        |  31 +++-
 swh/provenance/postgresql/provenancedb_base.py |  12 +-
 swh/provenance/provenance.py                   | 220 ++++++++++++++-----------
 swh/provenance/tests/test_provenance_db.py     |  96 ++++++++++-
 4 files changed, 256 insertions(+), 103 deletions(-)
Changes applied before test
commit 94598b3ce8c49eb6dfe5308b47b74271a7f9d625
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed May 12 09:47:03 2021 +0200

    Refactor the isochrone graph computation
    
    attempt to simplify a bit this part of the code:
    
    - IsochroneNode are now only used for directories
    - FileEntry are used directly from IsochroneNode.entry.files (no need
      for creating new FileEntry instances), so
    - IsochroneNode.children only stores IsochroneNode (thus DirectoryEntry)
      objects,
    - rename IsochroneNode.date as 'dbdate' and clarify its semantics,
    - attempt to document (comments) a bit more the algorithm and semantics
      of several attributes/variables used in there.

commit 9d110b93e9c39d65bf2986b148c4bf3467b0efa3
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed May 19 16:16:53 2021 +0200

    Replace 'DirectoryEntry.ls()' method by 'files' and 'dirs' properties
    
    and make the retrieval of children from the archive explicit in a
    dedicated retrieve_children() method.

commit fcfbb250e688a4ade6849522714832ec49238a8d
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed May 19 16:14:41 2021 +0200

    Add __str__ methods to RevisionEntry, DirectoryEntry and FileEntry
    
    to ease logging and debugging.

commit 1f823ac01491ee0f27eac685d32322f8558c26bc
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed May 12 12:44:07 2021 +0200

    Improve a bit the code of ProvenanceDBBase

commit cb623cb0e7dd9a2a568b6d2645e89c4d86ba0a66
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed May 12 10:30:23 2021 +0200

    Add a test for the build_isochrone_graph() function
    
    this test is far from ideal, since it's mostly the record of what happen
    during a "known good" session of revision insertions, but at least it
    should allow to refactor code related to the isochrone graph computation
    with a bit more confidence...

See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/59/ for more details.

Build has FAILED

Patch application report for D5773 (id=20790)

Could not rebase; Attempt merge onto 08344d3f76...

Updating 08344d3..0c23b9d
Fast-forward
 requirements-test.txt                              |   1 +
 swh/provenance/model.py                            |  31 ++-
 swh/provenance/postgresql/provenancedb_base.py     |  12 +-
 swh/provenance/provenance.py                       | 220 ++++++++++--------
 swh/provenance/tests/conftest.py                   |  38 +++-
 swh/provenance/tests/data/README.md                | 138 ++++++++++++
 swh/provenance/tests/data/cmdbts2.msgpack          | Bin 0 -> 17734 bytes
 swh/provenance/tests/data/cmdbts2_repo.yaml        |  80 +++++++
 .../tests/data/generate_storage_from_git.py        | 115 ++++++++++
 swh/provenance/tests/data/out-of-order.msgpack     | Bin 0 -> 6653 bytes
 swh/provenance/tests/data/out-of-order_repo.yaml   |  35 +++
 .../tests/data/synthetic_cmdbts2_lower_1.txt       |  91 ++++++++
 .../tests/data/synthetic_cmdbts2_lower_2.txt       |  91 ++++++++
 .../tests/data/synthetic_cmdbts2_upper_1.txt       |  91 ++++++++
 .../tests/data/synthetic_cmdbts2_upper_2.txt       |  91 ++++++++
 swh/provenance/tests/data/synthetic_lower_1.txt    |  91 --------
 swh/provenance/tests/data/synthetic_lower_2.txt    |  91 --------
 .../tests/data/synthetic_out-of-order_lower_1.txt  |  38 ++++
 swh/provenance/tests/data/synthetic_upper_1.txt    |  92 --------
 swh/provenance/tests/data/synthetic_upper_2.txt    |  91 --------
 swh/provenance/tests/test_provenance_db.py         | 132 -----------
 swh/provenance/tests/test_provenance_db_storage.py |   2 +-
 swh/provenance/tests/test_provenance_heuristics.py | 247 +++++++++++++++++++++
 23 files changed, 1208 insertions(+), 610 deletions(-)
 create mode 100644 swh/provenance/tests/data/README.md
 create mode 100644 swh/provenance/tests/data/cmdbts2.msgpack
 create mode 100644 swh/provenance/tests/data/cmdbts2_repo.yaml
 create mode 100644 swh/provenance/tests/data/generate_storage_from_git.py
 create mode 100644 swh/provenance/tests/data/out-of-order.msgpack
 create mode 100644 swh/provenance/tests/data/out-of-order_repo.yaml
 create mode 100644 swh/provenance/tests/data/synthetic_cmdbts2_lower_1.txt
 create mode 100644 swh/provenance/tests/data/synthetic_cmdbts2_lower_2.txt
 create mode 100644 swh/provenance/tests/data/synthetic_cmdbts2_upper_1.txt
 create mode 100644 swh/provenance/tests/data/synthetic_cmdbts2_upper_2.txt
 delete mode 100644 swh/provenance/tests/data/synthetic_lower_1.txt
 delete mode 100644 swh/provenance/tests/data/synthetic_lower_2.txt
 create mode 100644 swh/provenance/tests/data/synthetic_out-of-order_lower_1.txt
 delete mode 100644 swh/provenance/tests/data/synthetic_upper_1.txt
 delete mode 100644 swh/provenance/tests/data/synthetic_upper_2.txt
 create mode 100644 swh/provenance/tests/test_provenance_heuristics.py
Changes applied before test
commit 0c23b9dc87e3b66714dd81187339db946be81eec
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed May 12 09:47:03 2021 +0200

    Refactor the isochrone graph computation
    
    attempt to simplify a bit this part of the code:
    
    - IsochroneNode are now only used for directories
    - FileEntry are used directly from IsochroneNode.entry.files (no need
      for creating new FileEntry instances), so
    - IsochroneNode.children only stores IsochroneNode (thus DirectoryEntry)
      objects,
    - rename IsochroneNode.date as 'dbdate' and clarify its semantics,
    - attempt to document (comments) a bit more the algorithm and semantics
      of several attributes/variables used in there.

commit 264e5c10b36462fe394b919631deb4e14e6a220f
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed May 19 16:16:53 2021 +0200

    Replace 'DirectoryEntry.ls()' method by 'files' and 'dirs' properties
    
    and make the retrieval of children from the archive explicit in a
    dedicated retrieve_children() method.

commit 0df706e024fc638c53415338e5136342fc5e0700
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed May 19 16:14:41 2021 +0200

    Add __str__ methods to RevisionEntry, DirectoryEntry and FileEntry
    
    to ease logging and debugging.

commit d61f7450dd358df41eaf25af029bb5fada188580
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed May 12 12:44:07 2021 +0200

    Improve a bit the code of ProvenanceDBBase

commit a4da95075032acf4f88fac738b3ff5b46ceb94c5
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Jun 2 12:10:50 2021 +0200

    Add a test_provenance_heuristics_content_find_all() test
    
    test that ProvenanceDB.find_all() behaves as expected for all test
    datasets.

commit 71d39de612ef5b156887dfca9bf491649e17bdde
Author: David Douard <david.douard@sdfa3.org>
Date:   Tue Jun 1 16:34:57 2021 +0200

    Add a simple out-of-order dataset

commit 9854a75c8f5426836c561bd9c1b9bad7c85494e0
Author: David Douard <david.douard@sdfa3.org>
Date:   Tue Jun 1 11:47:09 2021 +0200

    Remove test_provenance_heuristics_CMDBTS test
    
    since it's redundant with the cmdbts2 test, now generated from a simple
    yaml file rather than depending on the original CMDBTS git repo on
    github.
    
    The CMDBTS dataset (CMDBTS.msgpack) is kept for now since it's still
    used for other tests (e.g. test_provenance_db).

commit 4f7b0eadd10c55318f64688abfe391ead4bcc3af
Author: David Douard <david.douard@sdfa3.org>
Date:   Mon May 31 16:51:01 2021 +0200

    Add a new (git) dataset generation scaffolding for tests
    
    and use it to the generate a 'cmdbts2' test case strictly equivalent
    to the CMDBTS repo.
    
    See the swh/provenance/tests/data/README.md file for more details.
    
    Note: this aims at making easy to write more test cases than depending
    on the CMDBTS git repo on github. For example, a new test case should
    come soon for situations like 'out-of-order' revisions.

commit fd373add1762c515070d39dc1cc1b58c09d3e8e4
Author: David Douard <david.douard@sdfa3.org>
Date:   Mon May 31 16:45:48 2021 +0200

    Remove test_provenance_heuristics from tests from ArchvieStorage tests
    
    because it's not that meaningful.

Link to build: https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/72/
See console output for more information: https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/72/console

Build is green

Patch application report for D5773 (id=20797)

Could not rebase; Attempt merge onto 08344d3f76...

Updating 08344d3..d3692f6
Fast-forward
 requirements-test.txt                              |   1 +
 swh/provenance/model.py                            |  31 ++-
 swh/provenance/postgresql/provenancedb_base.py     |  12 +-
 swh/provenance/provenance.py                       | 220 ++++++++++--------
 swh/provenance/tests/conftest.py                   |  38 +++-
 swh/provenance/tests/data/README.md                | 138 ++++++++++++
 swh/provenance/tests/data/cmdbts2.msgpack          | Bin 0 -> 17734 bytes
 swh/provenance/tests/data/cmdbts2_repo.yaml        |  80 +++++++
 .../tests/data/generate_storage_from_git.py        | 115 ++++++++++
 swh/provenance/tests/data/out-of-order.msgpack     | Bin 0 -> 6653 bytes
 swh/provenance/tests/data/out-of-order_repo.yaml   |  35 +++
 .../tests/data/synthetic_cmdbts2_lower_1.txt       |  91 ++++++++
 .../tests/data/synthetic_cmdbts2_lower_2.txt       |  91 ++++++++
 .../tests/data/synthetic_cmdbts2_upper_1.txt       |  91 ++++++++
 .../tests/data/synthetic_cmdbts2_upper_2.txt       |  91 ++++++++
 swh/provenance/tests/data/synthetic_lower_1.txt    |  91 --------
 swh/provenance/tests/data/synthetic_lower_2.txt    |  91 --------
 .../tests/data/synthetic_out-of-order_lower_1.txt  |  38 ++++
 swh/provenance/tests/data/synthetic_upper_1.txt    |  92 --------
 swh/provenance/tests/data/synthetic_upper_2.txt    |  91 --------
 swh/provenance/tests/test_provenance_db.py         | 132 -----------
 swh/provenance/tests/test_provenance_db_storage.py |   2 +-
 swh/provenance/tests/test_provenance_heuristics.py | 247 +++++++++++++++++++++
 23 files changed, 1208 insertions(+), 610 deletions(-)
 create mode 100644 swh/provenance/tests/data/README.md
 create mode 100644 swh/provenance/tests/data/cmdbts2.msgpack
 create mode 100644 swh/provenance/tests/data/cmdbts2_repo.yaml
 create mode 100644 swh/provenance/tests/data/generate_storage_from_git.py
 create mode 100644 swh/provenance/tests/data/out-of-order.msgpack
 create mode 100644 swh/provenance/tests/data/out-of-order_repo.yaml
 create mode 100644 swh/provenance/tests/data/synthetic_cmdbts2_lower_1.txt
 create mode 100644 swh/provenance/tests/data/synthetic_cmdbts2_lower_2.txt
 create mode 100644 swh/provenance/tests/data/synthetic_cmdbts2_upper_1.txt
 create mode 100644 swh/provenance/tests/data/synthetic_cmdbts2_upper_2.txt
 delete mode 100644 swh/provenance/tests/data/synthetic_lower_1.txt
 delete mode 100644 swh/provenance/tests/data/synthetic_lower_2.txt
 create mode 100644 swh/provenance/tests/data/synthetic_out-of-order_lower_1.txt
 delete mode 100644 swh/provenance/tests/data/synthetic_upper_1.txt
 delete mode 100644 swh/provenance/tests/data/synthetic_upper_2.txt
 create mode 100644 swh/provenance/tests/test_provenance_heuristics.py
Changes applied before test
commit d3692f6793ecb461b7178e14ba7466f78d5e8502
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed May 12 09:47:03 2021 +0200

    Refactor the isochrone graph computation
    
    attempt to simplify a bit this part of the code:
    
    - IsochroneNode are now only used for directories
    - FileEntry are used directly from IsochroneNode.entry.files (no need
      for creating new FileEntry instances), so
    - IsochroneNode.children only stores IsochroneNode (thus DirectoryEntry)
      objects,
    - rename IsochroneNode.date as 'dbdate' and clarify its semantics,
    - attempt to document (comments) a bit more the algorithm and semantics
      of several attributes/variables used in there.

commit 84c67b0f906c2100dc45b6b745972aad3e07f08f
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed May 19 16:16:53 2021 +0200

    Replace 'DirectoryEntry.ls()' method by 'files' and 'dirs' properties
    
    and make the retrieval of children from the archive explicit in a
    dedicated retrieve_children() method.

commit 6b820b9acd94a185d70d883c2aa99cd393dd8a75
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed May 19 16:14:41 2021 +0200

    Add __str__ methods to RevisionEntry, DirectoryEntry and FileEntry
    
    to ease logging and debugging.

commit 1fc98e60e63967534430e09137ae3f0b20bcaf88
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed May 12 12:44:07 2021 +0200

    Improve a bit the code of ProvenanceDBBase

commit a9d5543d6701f2cd79800611d2e4a79b3a0b3686
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Jun 2 12:10:50 2021 +0200

    Add a test_provenance_heuristics_content_find_all() test
    
    test that ProvenanceDB.find_all() behaves as expected for all test
    datasets.

commit 242726f6980b6c98c7cd9942fd0b1e1ee21e034f
Author: David Douard <david.douard@sdfa3.org>
Date:   Tue Jun 1 16:34:57 2021 +0200

    Add a simple out-of-order dataset

commit 9854a75c8f5426836c561bd9c1b9bad7c85494e0
Author: David Douard <david.douard@sdfa3.org>
Date:   Tue Jun 1 11:47:09 2021 +0200

    Remove test_provenance_heuristics_CMDBTS test
    
    since it's redundant with the cmdbts2 test, now generated from a simple
    yaml file rather than depending on the original CMDBTS git repo on
    github.
    
    The CMDBTS dataset (CMDBTS.msgpack) is kept for now since it's still
    used for other tests (e.g. test_provenance_db).

commit 4f7b0eadd10c55318f64688abfe391ead4bcc3af
Author: David Douard <david.douard@sdfa3.org>
Date:   Mon May 31 16:51:01 2021 +0200

    Add a new (git) dataset generation scaffolding for tests
    
    and use it to the generate a 'cmdbts2' test case strictly equivalent
    to the CMDBTS repo.
    
    See the swh/provenance/tests/data/README.md file for more details.
    
    Note: this aims at making easy to write more test cases than depending
    on the CMDBTS git repo on github. For example, a new test case should
    come soon for situations like 'out-of-order' revisions.

commit fd373add1762c515070d39dc1cc1b58c09d3e8e4
Author: David Douard <david.douard@sdfa3.org>
Date:   Mon May 31 16:45:48 2021 +0200

    Remove test_provenance_heuristics from tests from ArchvieStorage tests
    
    because it's not that meaningful.

See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/74/ for more details.

Build is green

Patch application report for D5773 (id=20799)

Could not rebase; Attempt merge onto 08344d3f76...

Updating 08344d3..d3692f6
Fast-forward
 requirements-test.txt                              |   1 +
 swh/provenance/model.py                            |  31 ++-
 swh/provenance/postgresql/provenancedb_base.py     |  12 +-
 swh/provenance/provenance.py                       | 220 ++++++++++--------
 swh/provenance/tests/conftest.py                   |  38 +++-
 swh/provenance/tests/data/README.md                | 138 ++++++++++++
 swh/provenance/tests/data/cmdbts2.msgpack          | Bin 0 -> 17734 bytes
 swh/provenance/tests/data/cmdbts2_repo.yaml        |  80 +++++++
 .../tests/data/generate_storage_from_git.py        | 115 ++++++++++
 swh/provenance/tests/data/out-of-order.msgpack     | Bin 0 -> 6653 bytes
 swh/provenance/tests/data/out-of-order_repo.yaml   |  35 +++
 .../tests/data/synthetic_cmdbts2_lower_1.txt       |  91 ++++++++
 .../tests/data/synthetic_cmdbts2_lower_2.txt       |  91 ++++++++
 .../tests/data/synthetic_cmdbts2_upper_1.txt       |  91 ++++++++
 .../tests/data/synthetic_cmdbts2_upper_2.txt       |  91 ++++++++
 swh/provenance/tests/data/synthetic_lower_1.txt    |  91 --------
 swh/provenance/tests/data/synthetic_lower_2.txt    |  91 --------
 .../tests/data/synthetic_out-of-order_lower_1.txt  |  38 ++++
 swh/provenance/tests/data/synthetic_upper_1.txt    |  92 --------
 swh/provenance/tests/data/synthetic_upper_2.txt    |  91 --------
 swh/provenance/tests/test_provenance_db.py         | 132 -----------
 swh/provenance/tests/test_provenance_db_storage.py |   2 +-
 swh/provenance/tests/test_provenance_heuristics.py | 247 +++++++++++++++++++++
 23 files changed, 1208 insertions(+), 610 deletions(-)
 create mode 100644 swh/provenance/tests/data/README.md
 create mode 100644 swh/provenance/tests/data/cmdbts2.msgpack
 create mode 100644 swh/provenance/tests/data/cmdbts2_repo.yaml
 create mode 100644 swh/provenance/tests/data/generate_storage_from_git.py
 create mode 100644 swh/provenance/tests/data/out-of-order.msgpack
 create mode 100644 swh/provenance/tests/data/out-of-order_repo.yaml
 create mode 100644 swh/provenance/tests/data/synthetic_cmdbts2_lower_1.txt
 create mode 100644 swh/provenance/tests/data/synthetic_cmdbts2_lower_2.txt
 create mode 100644 swh/provenance/tests/data/synthetic_cmdbts2_upper_1.txt
 create mode 100644 swh/provenance/tests/data/synthetic_cmdbts2_upper_2.txt
 delete mode 100644 swh/provenance/tests/data/synthetic_lower_1.txt
 delete mode 100644 swh/provenance/tests/data/synthetic_lower_2.txt
 create mode 100644 swh/provenance/tests/data/synthetic_out-of-order_lower_1.txt
 delete mode 100644 swh/provenance/tests/data/synthetic_upper_1.txt
 delete mode 100644 swh/provenance/tests/data/synthetic_upper_2.txt
 create mode 100644 swh/provenance/tests/test_provenance_heuristics.py
Changes applied before test
commit d3692f6793ecb461b7178e14ba7466f78d5e8502
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed May 12 09:47:03 2021 +0200

    Refactor the isochrone graph computation
    
    attempt to simplify a bit this part of the code:
    
    - IsochroneNode are now only used for directories
    - FileEntry are used directly from IsochroneNode.entry.files (no need
      for creating new FileEntry instances), so
    - IsochroneNode.children only stores IsochroneNode (thus DirectoryEntry)
      objects,
    - rename IsochroneNode.date as 'dbdate' and clarify its semantics,
    - attempt to document (comments) a bit more the algorithm and semantics
      of several attributes/variables used in there.

commit 84c67b0f906c2100dc45b6b745972aad3e07f08f
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed May 19 16:16:53 2021 +0200

    Replace 'DirectoryEntry.ls()' method by 'files' and 'dirs' properties
    
    and make the retrieval of children from the archive explicit in a
    dedicated retrieve_children() method.

commit 6b820b9acd94a185d70d883c2aa99cd393dd8a75
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed May 19 16:14:41 2021 +0200

    Add __str__ methods to RevisionEntry, DirectoryEntry and FileEntry
    
    to ease logging and debugging.

commit 1fc98e60e63967534430e09137ae3f0b20bcaf88
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed May 12 12:44:07 2021 +0200

    Improve a bit the code of ProvenanceDBBase

commit a9d5543d6701f2cd79800611d2e4a79b3a0b3686
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Jun 2 12:10:50 2021 +0200

    Add a test_provenance_heuristics_content_find_all() test
    
    test that ProvenanceDB.find_all() behaves as expected for all test
    datasets.

commit 242726f6980b6c98c7cd9942fd0b1e1ee21e034f
Author: David Douard <david.douard@sdfa3.org>
Date:   Tue Jun 1 16:34:57 2021 +0200

    Add a simple out-of-order dataset

commit 9854a75c8f5426836c561bd9c1b9bad7c85494e0
Author: David Douard <david.douard@sdfa3.org>
Date:   Tue Jun 1 11:47:09 2021 +0200

    Remove test_provenance_heuristics_CMDBTS test
    
    since it's redundant with the cmdbts2 test, now generated from a simple
    yaml file rather than depending on the original CMDBTS git repo on
    github.
    
    The CMDBTS dataset (CMDBTS.msgpack) is kept for now since it's still
    used for other tests (e.g. test_provenance_db).

commit 4f7b0eadd10c55318f64688abfe391ead4bcc3af
Author: David Douard <david.douard@sdfa3.org>
Date:   Mon May 31 16:51:01 2021 +0200

    Add a new (git) dataset generation scaffolding for tests
    
    and use it to the generate a 'cmdbts2' test case strictly equivalent
    to the CMDBTS repo.
    
    See the swh/provenance/tests/data/README.md file for more details.
    
    Note: this aims at making easy to write more test cases than depending
    on the CMDBTS git repo on github. For example, a new test case should
    come soon for situations like 'out-of-order' revisions.

commit fd373add1762c515070d39dc1cc1b58c09d3e8e4
Author: David Douard <david.douard@sdfa3.org>
Date:   Mon May 31 16:45:48 2021 +0200

    Remove test_provenance_heuristics from tests from ArchvieStorage tests
    
    because it's not that meaningful.

See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/76/ for more details.

Build is green

Patch application report for D5773 (id=20805)

Could not rebase; Attempt merge onto 08344d3f76...

Updating 08344d3..dfd793b
Fast-forward
 requirements-test.txt                              |   1 +
 swh/provenance/model.py                            |  31 ++-
 swh/provenance/postgresql/provenancedb_base.py     |  12 +-
 swh/provenance/provenance.py                       | 220 ++++++++++--------
 swh/provenance/tests/conftest.py                   |  40 +++-
 swh/provenance/tests/data/README.md                | 138 ++++++++++++
 swh/provenance/tests/data/cmdbts2.msgpack          | Bin 0 -> 17734 bytes
 swh/provenance/tests/data/cmdbts2_repo.yaml        |  80 +++++++
 .../tests/data/generate_storage_from_git.py        | 115 ++++++++++
 swh/provenance/tests/data/out-of-order.msgpack     | Bin 0 -> 6653 bytes
 swh/provenance/tests/data/out-of-order_repo.yaml   |  35 +++
 .../tests/data/synthetic_cmdbts2_lower_1.txt       |  91 ++++++++
 .../tests/data/synthetic_cmdbts2_lower_2.txt       |  91 ++++++++
 .../tests/data/synthetic_cmdbts2_upper_1.txt       |  91 ++++++++
 .../tests/data/synthetic_cmdbts2_upper_2.txt       |  91 ++++++++
 swh/provenance/tests/data/synthetic_lower_1.txt    |  91 --------
 swh/provenance/tests/data/synthetic_lower_2.txt    |  91 --------
 .../tests/data/synthetic_out-of-order_lower_1.txt  |  42 ++++
 swh/provenance/tests/data/synthetic_upper_1.txt    |  92 --------
 swh/provenance/tests/data/synthetic_upper_2.txt    |  91 --------
 swh/provenance/tests/test_provenance_db.py         | 132 -----------
 swh/provenance/tests/test_provenance_db_storage.py |   2 +-
 swh/provenance/tests/test_provenance_heuristics.py | 247 +++++++++++++++++++++
 23 files changed, 1213 insertions(+), 611 deletions(-)
 create mode 100644 swh/provenance/tests/data/README.md
 create mode 100644 swh/provenance/tests/data/cmdbts2.msgpack
 create mode 100644 swh/provenance/tests/data/cmdbts2_repo.yaml
 create mode 100644 swh/provenance/tests/data/generate_storage_from_git.py
 create mode 100644 swh/provenance/tests/data/out-of-order.msgpack
 create mode 100644 swh/provenance/tests/data/out-of-order_repo.yaml
 create mode 100644 swh/provenance/tests/data/synthetic_cmdbts2_lower_1.txt
 create mode 100644 swh/provenance/tests/data/synthetic_cmdbts2_lower_2.txt
 create mode 100644 swh/provenance/tests/data/synthetic_cmdbts2_upper_1.txt
 create mode 100644 swh/provenance/tests/data/synthetic_cmdbts2_upper_2.txt
 delete mode 100644 swh/provenance/tests/data/synthetic_lower_1.txt
 delete mode 100644 swh/provenance/tests/data/synthetic_lower_2.txt
 create mode 100644 swh/provenance/tests/data/synthetic_out-of-order_lower_1.txt
 delete mode 100644 swh/provenance/tests/data/synthetic_upper_1.txt
 delete mode 100644 swh/provenance/tests/data/synthetic_upper_2.txt
 create mode 100644 swh/provenance/tests/test_provenance_heuristics.py
Changes applied before test
commit dfd793b083ee2e966dcede458c012441aec606dd
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed May 12 09:47:03 2021 +0200

    Refactor the isochrone graph computation
    
    attempt to simplify a bit this part of the code:
    
    - IsochroneNode are now only used for directories
    - FileEntry are used directly from IsochroneNode.entry.files (no need
      for creating new FileEntry instances), so
    - IsochroneNode.children only stores IsochroneNode (thus DirectoryEntry)
      objects,
    - rename IsochroneNode.date as 'dbdate' and clarify its semantics,
    - attempt to document (comments) a bit more the algorithm and semantics
      of several attributes/variables used in there.

commit ef57446be82a51b8230ffb098bb056094eaa30da
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed May 19 16:16:53 2021 +0200

    Replace 'DirectoryEntry.ls()' method by 'files' and 'dirs' properties
    
    and make the retrieval of children from the archive explicit in a
    dedicated retrieve_children() method.

commit a1a049e8e79fc5745f1e140971166359a3ecec09
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed May 19 16:14:41 2021 +0200

    Add __str__ methods to RevisionEntry, DirectoryEntry and FileEntry
    
    to ease logging and debugging.

commit 6a6ce44cf1e358862cc3d361b890f29b68f6fa97
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed May 12 12:44:07 2021 +0200

    Improve a bit the code of ProvenanceDBBase

commit b3436ed828b6849f03bb0a363177f1d3a1643ed1
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Jun 2 12:10:50 2021 +0200

    Add a test_provenance_heuristics_content_find_all() test
    
    test that ProvenanceDB.find_all() behaves as expected for all test
    datasets.

commit 70924164c5b251f6e8b3e23f691bd77d723b843e
Author: David Douard <david.douard@sdfa3.org>
Date:   Tue Jun 1 16:34:57 2021 +0200

    Add a simple out-of-order dataset

commit 19ef0ba9f5c86d26b595aaa5dc64390994551b64
Author: David Douard <david.douard@sdfa3.org>
Date:   Tue Jun 1 11:47:09 2021 +0200

    Remove test_provenance_heuristics_CMDBTS test
    
    since it's redundant with the cmdbts2 test, now generated from a simple
    yaml file rather than depending on the original CMDBTS git repo on
    github.
    
    The CMDBTS dataset (CMDBTS.msgpack) is kept for now since it's still
    used for other tests (e.g. test_provenance_db).

commit f534e558645ecc9384dfb1781e94266feac683f1
Author: David Douard <david.douard@sdfa3.org>
Date:   Mon May 31 16:51:01 2021 +0200

    Add a new (git) dataset generation scaffolding for tests
    
    and use it to the generate a 'cmdbts2' test case strictly equivalent
    to the CMDBTS repo.
    
    See the swh/provenance/tests/data/README.md file for more details.
    
    Note: this aims at making easy to write more test cases than depending
    on the CMDBTS git repo on github. For example, a new test case should
    come soon for situations like 'out-of-order' revisions.

commit fd373add1762c515070d39dc1cc1b58c09d3e8e4
Author: David Douard <david.douard@sdfa3.org>
Date:   Mon May 31 16:45:48 2021 +0200

    Remove test_provenance_heuristics from tests from ArchvieStorage tests
    
    because it's not that meaningful.

See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/80/ for more details.

Build is green

Patch application report for D5773 (id=20809)

Could not rebase; Attempt merge onto 08344d3f76...

Updating 08344d3..c0ff345
Fast-forward
 requirements-test.txt                              |   1 +
 swh/provenance/model.py                            |  31 ++-
 swh/provenance/postgresql/provenancedb_base.py     |  12 +-
 swh/provenance/provenance.py                       | 220 ++++++++++--------
 swh/provenance/tests/conftest.py                   |  40 +++-
 swh/provenance/tests/data/README.md                | 138 ++++++++++++
 swh/provenance/tests/data/cmdbts2.msgpack          | Bin 0 -> 17734 bytes
 swh/provenance/tests/data/cmdbts2_repo.yaml        |  80 +++++++
 .../tests/data/generate_storage_from_git.py        | 115 ++++++++++
 swh/provenance/tests/data/out-of-order.msgpack     | Bin 0 -> 6653 bytes
 swh/provenance/tests/data/out-of-order_repo.yaml   |  35 +++
 .../tests/data/synthetic_cmdbts2_lower_1.txt       |  91 ++++++++
 .../tests/data/synthetic_cmdbts2_lower_2.txt       |  91 ++++++++
 .../tests/data/synthetic_cmdbts2_upper_1.txt       |  91 ++++++++
 .../tests/data/synthetic_cmdbts2_upper_2.txt       |  91 ++++++++
 swh/provenance/tests/data/synthetic_lower_1.txt    |  91 --------
 swh/provenance/tests/data/synthetic_lower_2.txt    |  91 --------
 .../tests/data/synthetic_out-of-order_lower_1.txt  |  42 ++++
 swh/provenance/tests/data/synthetic_upper_1.txt    |  92 --------
 swh/provenance/tests/data/synthetic_upper_2.txt    |  91 --------
 swh/provenance/tests/test_provenance_db.py         | 132 -----------
 swh/provenance/tests/test_provenance_db_storage.py |   2 +-
 swh/provenance/tests/test_provenance_heuristics.py | 247 +++++++++++++++++++++
 23 files changed, 1213 insertions(+), 611 deletions(-)
 create mode 100644 swh/provenance/tests/data/README.md
 create mode 100644 swh/provenance/tests/data/cmdbts2.msgpack
 create mode 100644 swh/provenance/tests/data/cmdbts2_repo.yaml
 create mode 100644 swh/provenance/tests/data/generate_storage_from_git.py
 create mode 100644 swh/provenance/tests/data/out-of-order.msgpack
 create mode 100644 swh/provenance/tests/data/out-of-order_repo.yaml
 create mode 100644 swh/provenance/tests/data/synthetic_cmdbts2_lower_1.txt
 create mode 100644 swh/provenance/tests/data/synthetic_cmdbts2_lower_2.txt
 create mode 100644 swh/provenance/tests/data/synthetic_cmdbts2_upper_1.txt
 create mode 100644 swh/provenance/tests/data/synthetic_cmdbts2_upper_2.txt
 delete mode 100644 swh/provenance/tests/data/synthetic_lower_1.txt
 delete mode 100644 swh/provenance/tests/data/synthetic_lower_2.txt
 create mode 100644 swh/provenance/tests/data/synthetic_out-of-order_lower_1.txt
 delete mode 100644 swh/provenance/tests/data/synthetic_upper_1.txt
 delete mode 100644 swh/provenance/tests/data/synthetic_upper_2.txt
 create mode 100644 swh/provenance/tests/test_provenance_heuristics.py
Changes applied before test
commit c0ff3458afb5af787496d4234e39babb1e334658
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed May 12 09:47:03 2021 +0200

    Refactor the isochrone graph computation
    
    attempt to simplify a bit this part of the code:
    
    - IsochroneNode are now only used for directories
    - FileEntry are used directly from IsochroneNode.entry.files (no need
      for creating new FileEntry instances), so
    - IsochroneNode.children only stores IsochroneNode (thus DirectoryEntry)
      objects,
    - rename IsochroneNode.date as 'dbdate' and clarify its semantics,
    - attempt to document (comments) a bit more the algorithm and semantics
      of several attributes/variables used in there.

commit d2d5be1ee3d6155a6382b318b875482161cbea8f
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed May 19 16:16:53 2021 +0200

    Replace 'DirectoryEntry.ls()' method by 'files' and 'dirs' properties
    
    and make the retrieval of children from the archive explicit in a
    dedicated retrieve_children() method.

commit 47c7327b7d61eddd1aad4c63196d684b29288f2b
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed May 19 16:14:41 2021 +0200

    Add __str__ methods to RevisionEntry, DirectoryEntry and FileEntry
    
    to ease logging and debugging.

commit 874b176bd6ef674e235469dc80b245f88fde45c1
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed May 12 12:44:07 2021 +0200

    Improve a bit the code of ProvenanceDBBase

commit 3c9f7bd77a2babc5ca4509878fa7e1f1f9136591
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Jun 2 12:10:50 2021 +0200

    Add a test_provenance_heuristics_content_find_all() test
    
    test that ProvenanceDB.find_all() behaves as expected for all test
    datasets.

commit f3cd239bf8c3241b297c6481beca266bfd47eb25
Author: David Douard <david.douard@sdfa3.org>
Date:   Tue Jun 1 16:34:57 2021 +0200

    Add a simple out-of-order dataset

commit 5f9e5b53a5b0547cfdbe1676e2b09648f4f359f1
Author: David Douard <david.douard@sdfa3.org>
Date:   Tue Jun 1 11:47:09 2021 +0200

    Remove test_provenance_heuristics_CMDBTS test
    
    since it's redundant with the cmdbts2 test, now generated from a simple
    yaml file rather than depending on the original CMDBTS git repo on
    github.
    
    The CMDBTS dataset (CMDBTS.msgpack) is kept for now since it's still
    used for other tests (e.g. test_provenance_db).

commit 9fe096b3905495bff534649f2e5e0ecb8802217d
Author: David Douard <david.douard@sdfa3.org>
Date:   Mon May 31 16:51:01 2021 +0200

    Add a new (git) dataset generation scaffolding for tests
    
    and use it to the generate a 'cmdbts2' test case strictly equivalent
    to the CMDBTS repo.
    
    See the swh/provenance/tests/data/README.md file for more details.
    
    Note: this aims at making easy to write more test cases than depending
    on the CMDBTS git repo on github. For example, a new test case should
    come soon for situations like 'out-of-order' revisions.

commit fd373add1762c515070d39dc1cc1b58c09d3e8e4
Author: David Douard <david.douard@sdfa3.org>
Date:   Mon May 31 16:45:48 2021 +0200

    Remove test_provenance_heuristics from tests from ArchvieStorage tests
    
    because it's not that meaningful.

See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/84/ for more details.

do not initialize IsochroneNode.maxdate at instantiation time

make it clear it's the responsibility of the maxdate computation "routine" to set it, as asked by aeviso.

Also rename clear_dbdate() as invalidate().
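
A minimal sketch of the intent described above, not the code from this diff: the class is a simplified stand-in, and the file_dates attribute and the compute_maxdate() helper are hypothetical names used only for illustration. It assumes maxdate stays unset until a dedicated routine fills it in, and that invalidate() (formerly clear_dbdate()) drops the date coming from the DB.

```python
from datetime import datetime, timezone
from typing import List, Optional


class IsochroneNode:
    """Simplified, hypothetical stand-in for the real class."""

    def __init__(
        self,
        name: str,
        dbdate: Optional[datetime] = None,
        file_dates: Optional[List[datetime]] = None,
    ) -> None:
        self.name = name
        # Date already stored in the provenance DB for this directory, if any.
        self.dbdate = dbdate
        # Assumption: dates of the files directly under this directory.
        self.file_dates = file_dates or []
        # Not computed here on purpose: compute_maxdate() is responsible for it.
        self.maxdate: Optional[datetime] = None
        self.children: List["IsochroneNode"] = []

    def invalidate(self) -> None:
        # Formerly clear_dbdate(): forget the DB date and any computed maxdate.
        self.dbdate = None
        self.maxdate = None


def compute_maxdate(node: IsochroneNode) -> None:
    """Set node.maxdate bottom-up, children first."""
    for child in node.children:
        compute_maxdate(child)
    if node.dbdate is not None:
        # Assumption: a date known to the DB is used as is.
        node.maxdate = node.dbdate
        return
    candidates = list(node.file_dates)
    candidates += [c.maxdate for c in node.children if c.maxdate is not None]
    node.maxdate = max(candidates, default=None)


root = IsochroneNode("root", file_dates=[datetime(2021, 5, 12, tzinfo=timezone.utc)])
sub = IsochroneNode("sub", dbdate=datetime(2021, 6, 2, tzinfo=timezone.utc))
root.children.append(sub)
assert root.maxdate is None  # nothing is set at instantiation time
compute_maxdate(root)
```

With this split, a node whose maxdate is still None has simply not been visited by the computation yet.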

Build is green

Patch application report for D5773 (id=20812)

Could not rebase; Attempt merge onto 08344d3f76...

Updating 08344d3..18ae4c6
Fast-forward
 requirements-test.txt                              |   1 +
 swh/provenance/model.py                            |  31 ++-
 swh/provenance/postgresql/provenancedb_base.py     |  12 +-
 swh/provenance/provenance.py                       | 222 ++++++++++--------
 swh/provenance/tests/conftest.py                   |  40 +++-
 swh/provenance/tests/data/README.md                | 138 ++++++++++++
 swh/provenance/tests/data/cmdbts2.msgpack          | Bin 0 -> 17734 bytes
 swh/provenance/tests/data/cmdbts2_repo.yaml        |  80 +++++++
 .../tests/data/generate_storage_from_git.py        | 115 ++++++++++
 swh/provenance/tests/data/out-of-order.msgpack     | Bin 0 -> 6653 bytes
 swh/provenance/tests/data/out-of-order_repo.yaml   |  35 +++
 .../tests/data/synthetic_cmdbts2_lower_1.txt       |  91 ++++++++
 .../tests/data/synthetic_cmdbts2_lower_2.txt       |  91 ++++++++
 .../tests/data/synthetic_cmdbts2_upper_1.txt       |  91 ++++++++
 .../tests/data/synthetic_cmdbts2_upper_2.txt       |  91 ++++++++
 swh/provenance/tests/data/synthetic_lower_1.txt    |  91 --------
 swh/provenance/tests/data/synthetic_lower_2.txt    |  91 --------
 .../tests/data/synthetic_out-of-order_lower_1.txt  |  42 ++++
 swh/provenance/tests/data/synthetic_upper_1.txt    |  92 --------
 swh/provenance/tests/data/synthetic_upper_2.txt    |  91 --------
 swh/provenance/tests/test_provenance_db.py         | 132 -----------
 swh/provenance/tests/test_provenance_db_storage.py |   2 +-
 swh/provenance/tests/test_provenance_heuristics.py | 247 +++++++++++++++++++++
 23 files changed, 1215 insertions(+), 611 deletions(-)
 create mode 100644 swh/provenance/tests/data/README.md
 create mode 100644 swh/provenance/tests/data/cmdbts2.msgpack
 create mode 100644 swh/provenance/tests/data/cmdbts2_repo.yaml
 create mode 100644 swh/provenance/tests/data/generate_storage_from_git.py
 create mode 100644 swh/provenance/tests/data/out-of-order.msgpack
 create mode 100644 swh/provenance/tests/data/out-of-order_repo.yaml
 create mode 100644 swh/provenance/tests/data/synthetic_cmdbts2_lower_1.txt
 create mode 100644 swh/provenance/tests/data/synthetic_cmdbts2_lower_2.txt
 create mode 100644 swh/provenance/tests/data/synthetic_cmdbts2_upper_1.txt
 create mode 100644 swh/provenance/tests/data/synthetic_cmdbts2_upper_2.txt
 delete mode 100644 swh/provenance/tests/data/synthetic_lower_1.txt
 delete mode 100644 swh/provenance/tests/data/synthetic_lower_2.txt
 create mode 100644 swh/provenance/tests/data/synthetic_out-of-order_lower_1.txt
 delete mode 100644 swh/provenance/tests/data/synthetic_upper_1.txt
 delete mode 100644 swh/provenance/tests/data/synthetic_upper_2.txt
 create mode 100644 swh/provenance/tests/test_provenance_heuristics.py
Changes applied before test
commit 18ae4c6bcd5bea0aac996f4ccbb9c3e3b16f75bc
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed May 12 09:47:03 2021 +0200

    Refactor the isochrone graph computation
    
    attempt to simplify a bit this part of the code:
    
    - IsochroneNode are now only used for directories
    - FileEntry are used directly from IsochroneNode.entry.files (no need
      for creating new FileEntry instances), so
    - IsochroneNode.children only stores IsochroneNode (thus DirectoryEntry)
      objects,
    - rename IsochroneNode.date as 'dbdate' and clarify its semantics,
    - attempt to document (comments) a bit more the algorithm and semantics
      of several attributes/variables used in there.

commit d2d5be1ee3d6155a6382b318b875482161cbea8f
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed May 19 16:16:53 2021 +0200

    Replace 'DirectoryEntry.ls()' method by 'files' and 'dirs' properties
    
    and make the retrieval of children from the archive explicit in a
    dedicated retrieve_children() method.

commit 47c7327b7d61eddd1aad4c63196d684b29288f2b
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed May 19 16:14:41 2021 +0200

    Add __str__ methods to RevisionEntry, DirectoryEntry and FileEntry
    
    to ease logging and debugging.

commit 874b176bd6ef674e235469dc80b245f88fde45c1
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed May 12 12:44:07 2021 +0200

    Improve a bit the code of ProvenanceDBBase

commit 3c9f7bd77a2babc5ca4509878fa7e1f1f9136591
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Jun 2 12:10:50 2021 +0200

    Add a test_provenance_heuristics_content_find_all() test
    
    test that ProvenanceDB.find_all() behaves as expected for all test
    datasets.

commit f3cd239bf8c3241b297c6481beca266bfd47eb25
Author: David Douard <david.douard@sdfa3.org>
Date:   Tue Jun 1 16:34:57 2021 +0200

    Add a simple out-of-order dataset

commit 5f9e5b53a5b0547cfdbe1676e2b09648f4f359f1
Author: David Douard <david.douard@sdfa3.org>
Date:   Tue Jun 1 11:47:09 2021 +0200

    Remove test_provenance_heuristics_CMDBTS test
    
    since it's redundant with the cmdbts2 test, now generated from a simple
    yaml file rather than depending on the original CMDBTS git repo on
    github.
    
    The CMDBTS dataset (CMDBTS.msgpack) is kept for now since it's still
    used for other tests (e.g. test_provenance_db).

commit 9fe096b3905495bff534649f2e5e0ecb8802217d
Author: David Douard <david.douard@sdfa3.org>
Date:   Mon May 31 16:51:01 2021 +0200

    Add a new (git) dataset generation scaffolding for tests
    
    and use it to the generate a 'cmdbts2' test case strictly equivalent
    to the CMDBTS repo.
    
    See the swh/provenance/tests/data/README.md file for more details.
    
    Note: this aims at making easy to write more test cases than depending
    on the CMDBTS git repo on github. For example, a new test case should
    come soon for situations like 'out-of-order' revisions.

commit fd373add1762c515070d39dc1cc1b58c09d3e8e4
Author: David Douard <david.douard@sdfa3.org>
Date:   Mon May 31 16:45:48 2021 +0200

    Remove test_provenance_heuristics from tests from ArchvieStorage tests
    
    because it's not that meaningful.

See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/85/ for more details.

swh/provenance/model.py
76

why not call retrieve_children instead? then we won't need to be aware of calling it prior to calling this method.

77

net -> not

86

net -> not

swh/provenance/model.py
76

because I don't want to have to pass the archive argument each time I want to iterate the children. I want to make it clear when we 'build' the structure vs when we use it.
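
To illustrate the 'build' vs 'use' split argued for here, a rough sketch under stated assumptions: FakeArchive, its directory_ls() method and the (kind, id, name) entry shape are hypothetical, and the DirectoryEntry below is a simplified stand-in rather than the class from model.py.

```python
from typing import Dict, Iterator, List, Optional, Tuple

# Hypothetical entry shape: (kind, identifier, name), kind being "file" or "dir".
Entry = Tuple[str, str, str]


class FakeArchive:
    """In-memory stand-in for the archive; counts how often it is queried."""

    def __init__(self, listing: Dict[str, List[Entry]]) -> None:
        self.listing = listing
        self.calls = 0

    def directory_ls(self, dir_id: str) -> List[Entry]:
        self.calls += 1
        return self.listing.get(dir_id, [])


class DirectoryEntry:
    def __init__(self, id: str, name: str) -> None:
        self.id = id
        self.name = name
        self._children: Optional[List[Entry]] = None

    def retrieve_children(self, archive: FakeArchive) -> None:
        # "Build" step: the only method that needs the archive.
        if self._children is None:
            self._children = archive.directory_ls(self.id)

    def _loaded(self) -> List[Entry]:
        if self._children is None:
            raise RuntimeError("retrieve_children() has not been called")
        return self._children

    @property
    def files(self) -> Iterator[str]:
        # "Use" step: no archive argument, just iterate what was fetched.
        return (name for kind, _, name in self._loaded() if kind == "file")

    @property
    def dirs(self) -> Iterator[str]:
        return (name for kind, _, name in self._loaded() if kind == "dir")


archive = FakeArchive({"d1": [("file", "c1", "README.md"), ("dir", "d2", "src")]})
root = DirectoryEntry("d1", "")
root.retrieve_children(archive)
file_names = list(root.files)
dir_names = list(root.dirs)
```

The trade-off raised in the comment above still applies: callers must remember to call retrieve_children() first, which this sketch surfaces as an explicit error.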

fix typos (thx aeviso) and prevent the creation of a few lists

when consuming DirectoryEntry.files or .dirs.
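
As a small illustration of why accepting an Iterable avoids temporary lists: get_early_dates() and file_ids() below are hypothetical helpers, not the ProvenanceInterface methods, and the dictionary stands in for whatever the database would return.

```python
from datetime import datetime, timezone
from typing import Dict, Iterable, Iterator


def get_early_dates(
    blob_ids: Iterable[str], known: Dict[str, datetime]
) -> Dict[str, datetime]:
    # A single pass over the input is enough, so any iterable will do.
    return {blob: known[blob] for blob in blob_ids if blob in known}


def file_ids() -> Iterator[str]:
    # Stand-in for a generator such as the one behind DirectoryEntry.files.
    yield from ("c1", "c2", "c3")


known_dates = {"c1": datetime(2021, 5, 12, tzinfo=timezone.utc)}
dates = get_early_dates(file_ids(), known_dates)  # no list(...) needed
```

A caller that already holds a list can still pass it unchanged, since a list is also an Iterable.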

Build is green

Patch application report for D5773 (id=20814)

Could not rebase; Attempt merge onto 08344d3f76...

Updating 08344d3..9eaeb6c
Fast-forward
 requirements-test.txt                              |   1 +
 swh/provenance/model.py                            |  33 ++-
 swh/provenance/postgresql/provenancedb_base.py     |  20 +-
 swh/provenance/provenance.py                       | 231 +++++++++++--------
 swh/provenance/tests/conftest.py                   |  40 +++-
 swh/provenance/tests/data/README.md                | 138 ++++++++++++
 swh/provenance/tests/data/cmdbts2.msgpack          | Bin 0 -> 17734 bytes
 swh/provenance/tests/data/cmdbts2_repo.yaml        |  80 +++++++
 .../tests/data/generate_storage_from_git.py        | 115 ++++++++++
 swh/provenance/tests/data/out-of-order.msgpack     | Bin 0 -> 6653 bytes
 swh/provenance/tests/data/out-of-order_repo.yaml   |  35 +++
 .../tests/data/synthetic_cmdbts2_lower_1.txt       |  91 ++++++++
 .../tests/data/synthetic_cmdbts2_lower_2.txt       |  91 ++++++++
 .../tests/data/synthetic_cmdbts2_upper_1.txt       |  91 ++++++++
 .../tests/data/synthetic_cmdbts2_upper_2.txt       |  91 ++++++++
 swh/provenance/tests/data/synthetic_lower_1.txt    |  91 --------
 swh/provenance/tests/data/synthetic_lower_2.txt    |  91 --------
 .../tests/data/synthetic_out-of-order_lower_1.txt  |  42 ++++
 swh/provenance/tests/data/synthetic_upper_1.txt    |  92 --------
 swh/provenance/tests/data/synthetic_upper_2.txt    |  91 --------
 swh/provenance/tests/test_provenance_db.py         | 132 -----------
 swh/provenance/tests/test_provenance_db_storage.py |   2 +-
 swh/provenance/tests/test_provenance_heuristics.py | 247 +++++++++++++++++++++
 23 files changed, 1227 insertions(+), 618 deletions(-)
 create mode 100644 swh/provenance/tests/data/README.md
 create mode 100644 swh/provenance/tests/data/cmdbts2.msgpack
 create mode 100644 swh/provenance/tests/data/cmdbts2_repo.yaml
 create mode 100644 swh/provenance/tests/data/generate_storage_from_git.py
 create mode 100644 swh/provenance/tests/data/out-of-order.msgpack
 create mode 100644 swh/provenance/tests/data/out-of-order_repo.yaml
 create mode 100644 swh/provenance/tests/data/synthetic_cmdbts2_lower_1.txt
 create mode 100644 swh/provenance/tests/data/synthetic_cmdbts2_lower_2.txt
 create mode 100644 swh/provenance/tests/data/synthetic_cmdbts2_upper_1.txt
 create mode 100644 swh/provenance/tests/data/synthetic_cmdbts2_upper_2.txt
 delete mode 100644 swh/provenance/tests/data/synthetic_lower_1.txt
 delete mode 100644 swh/provenance/tests/data/synthetic_lower_2.txt
 create mode 100644 swh/provenance/tests/data/synthetic_out-of-order_lower_1.txt
 delete mode 100644 swh/provenance/tests/data/synthetic_upper_1.txt
 delete mode 100644 swh/provenance/tests/data/synthetic_upper_2.txt
 create mode 100644 swh/provenance/tests/test_provenance_heuristics.py
Changes applied before test
commit 9eaeb6ce050b51e0e04af431b0031df11f402432
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed May 12 09:47:03 2021 +0200

    Refactor the isochrone graph computation
    
    attempt to simplify a bit this part of the code:
    
    - IsochroneNode are now only used for directories
    - FileEntry are used directly from IsochroneNode.entry.files (no need
      for creating new FileEntry instances), so
    - IsochroneNode.children only stores IsochroneNode (thus DirectoryEntry)
      objects,
    - rename IsochroneNode.date as 'dbdate' and clarify its semantics,
    - attempt to document (comments) a bit more the algorithm and semantics
      of several attributes/variables used in there.

commit 613be250db4de0e5af78e58c693c97d894ae5034
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed May 19 16:16:53 2021 +0200

    Replace 'DirectoryEntry.ls()' method by 'files' and 'dirs' properties
    
    and make the retrieval of children from the archive explicit in a
    dedicated retrieve_children() method.
    
    Change the API of ProvenanceInteface's content_get_early_dates() and
    directory_get_dates_in_isochrone_frontier to expect Iterable instead of
    List to prevent having to create unneeded temporary lists from
    generators retrieved from 'Directory.files' and '.dirs'.

commit 47c7327b7d61eddd1aad4c63196d684b29288f2b
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed May 19 16:14:41 2021 +0200

    Add __str__ methods to RevisionEntry, DirectoryEntry and FileEntry
    
    to ease logging and debugging.

commit 874b176bd6ef674e235469dc80b245f88fde45c1
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed May 12 12:44:07 2021 +0200

    Improve a bit the code of ProvenanceDBBase

commit 3c9f7bd77a2babc5ca4509878fa7e1f1f9136591
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Jun 2 12:10:50 2021 +0200

    Add a test_provenance_heuristics_content_find_all() test
    
    test that ProvenanceDB.find_all() behaves as expected for all test
    datasets.

commit f3cd239bf8c3241b297c6481beca266bfd47eb25
Author: David Douard <david.douard@sdfa3.org>
Date:   Tue Jun 1 16:34:57 2021 +0200

    Add a simple out-of-order dataset

commit 5f9e5b53a5b0547cfdbe1676e2b09648f4f359f1
Author: David Douard <david.douard@sdfa3.org>
Date:   Tue Jun 1 11:47:09 2021 +0200

    Remove test_provenance_heuristics_CMDBTS test
    
    since it's redundant with the cmdbts2 test, now generated from a simple
    yaml file rather than depending on the original CMDBTS git repo on
    github.
    
    The CMDBTS dataset (CMDBTS.msgpack) is kept for now since it's still
    used for other tests (e.g. test_provenance_db).

commit 9fe096b3905495bff534649f2e5e0ecb8802217d
Author: David Douard <david.douard@sdfa3.org>
Date:   Mon May 31 16:51:01 2021 +0200

    Add a new (git) dataset generation scaffolding for tests
    
    and use it to the generate a 'cmdbts2' test case strictly equivalent
    to the CMDBTS repo.
    
    See the swh/provenance/tests/data/README.md file for more details.
    
    Note: this aims at making easy to write more test cases than depending
    on the CMDBTS git repo on github. For example, a new test case should
    come soon for situations like 'out-of-order' revisions.

commit fd373add1762c515070d39dc1cc1b58c09d3e8e4
Author: David Douard <david.douard@sdfa3.org>
Date:   Mon May 31 16:45:48 2021 +0200

    Remove test_provenance_heuristics from tests from ArchvieStorage tests
    
    because it's not that meaningful.

See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/86/ for more details.

swh/provenance/model.py
71

I believe we should append to different lists here, so that we don't need to filter later when querying DirectoryEntry.files and DirectoryEntry.dirs

swh/provenance/model.py
71

I was reluctant to do this because I wondered if we would need this children list sorted the way the archive sent us the results, but it looks like we don't care at all about it, so yeah, we can...

Store the files and subdirs in dedicated lists in DirectoryEntry

but keep the files() and dirs() properties as generators to prevent returning a (mutable) list (this makes it a bit safer: no one should modify these lists after retrieve_children has been called).
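
A possible shape for this change, sketched under assumptions: the class is a simplified stand-in, and retrieve_children() here takes already-fetched (kind, name) pairs instead of querying the archive, purely to keep the example self-contained.

```python
from typing import Iterable, Iterator, List, Optional, Tuple


class DirectoryEntry:
    def __init__(self, name: str) -> None:
        self.name = name
        self._files: Optional[List[str]] = None
        self._dirs: Optional[List[str]] = None

    def retrieve_children(self, entries: Iterable[Tuple[str, str]]) -> None:
        # Assumption: entries are (kind, name) pairs already fetched elsewhere.
        self._files, self._dirs = [], []
        for kind, name in entries:
            # Sort each entry into its dedicated list once, at build time.
            (self._files if kind == "file" else self._dirs).append(name)

    @property
    def files(self) -> Iterator[str]:
        # Yield items instead of returning the list so callers cannot mutate it.
        yield from self._files or []

    @property
    def dirs(self) -> Iterator[str]:
        yield from self._dirs or []


entry = DirectoryEntry("root")
entry.retrieve_children([("file", "a.txt"), ("dir", "src"), ("file", "b.txt")])
files = list(entry.files)
dirs = list(entry.dirs)
```

No filtering happens when the properties are read, and a caller that wants a list has to build its own copy, leaving the internal ones untouched.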

Build is green

Patch application report for D5773 (id=20816)

Could not rebase; Attempt merge onto 08344d3f76...

Updating 08344d3..f64a41b
Fast-forward
 requirements-test.txt                              |   1 +
 swh/provenance/model.py                            |  44 +++-
 swh/provenance/postgresql/provenancedb_base.py     |  20 +-
 swh/provenance/provenance.py                       | 231 +++++++++++--------
 swh/provenance/tests/conftest.py                   |  40 +++-
 swh/provenance/tests/data/README.md                | 138 ++++++++++++
 swh/provenance/tests/data/cmdbts2.msgpack          | Bin 0 -> 17734 bytes
 swh/provenance/tests/data/cmdbts2_repo.yaml        |  80 +++++++
 .../tests/data/generate_storage_from_git.py        | 115 ++++++++++
 swh/provenance/tests/data/out-of-order.msgpack     | Bin 0 -> 6653 bytes
 swh/provenance/tests/data/out-of-order_repo.yaml   |  35 +++
 .../tests/data/synthetic_cmdbts2_lower_1.txt       |  91 ++++++++
 .../tests/data/synthetic_cmdbts2_lower_2.txt       |  91 ++++++++
 .../tests/data/synthetic_cmdbts2_upper_1.txt       |  91 ++++++++
 .../tests/data/synthetic_cmdbts2_upper_2.txt       |  91 ++++++++
 swh/provenance/tests/data/synthetic_lower_1.txt    |  91 --------
 swh/provenance/tests/data/synthetic_lower_2.txt    |  91 --------
 .../tests/data/synthetic_out-of-order_lower_1.txt  |  42 ++++
 swh/provenance/tests/data/synthetic_upper_1.txt    |  92 --------
 swh/provenance/tests/data/synthetic_upper_2.txt    |  91 --------
 swh/provenance/tests/test_provenance_db.py         | 132 -----------
 swh/provenance/tests/test_provenance_db_storage.py |   2 +-
 swh/provenance/tests/test_provenance_heuristics.py | 247 +++++++++++++++++++++
 23 files changed, 1233 insertions(+), 623 deletions(-)
 create mode 100644 swh/provenance/tests/data/README.md
 create mode 100644 swh/provenance/tests/data/cmdbts2.msgpack
 create mode 100644 swh/provenance/tests/data/cmdbts2_repo.yaml
 create mode 100644 swh/provenance/tests/data/generate_storage_from_git.py
 create mode 100644 swh/provenance/tests/data/out-of-order.msgpack
 create mode 100644 swh/provenance/tests/data/out-of-order_repo.yaml
 create mode 100644 swh/provenance/tests/data/synthetic_cmdbts2_lower_1.txt
 create mode 100644 swh/provenance/tests/data/synthetic_cmdbts2_lower_2.txt
 create mode 100644 swh/provenance/tests/data/synthetic_cmdbts2_upper_1.txt
 create mode 100644 swh/provenance/tests/data/synthetic_cmdbts2_upper_2.txt
 delete mode 100644 swh/provenance/tests/data/synthetic_lower_1.txt
 delete mode 100644 swh/provenance/tests/data/synthetic_lower_2.txt
 create mode 100644 swh/provenance/tests/data/synthetic_out-of-order_lower_1.txt
 delete mode 100644 swh/provenance/tests/data/synthetic_upper_1.txt
 delete mode 100644 swh/provenance/tests/data/synthetic_upper_2.txt
 create mode 100644 swh/provenance/tests/test_provenance_heuristics.py
Changes applied before test
commit f64a41b6ee169959a0059a2b4e19f06a8bed75a2
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed May 12 09:47:03 2021 +0200

    Refactor the isochrone graph computation
    
    attempt to simplify a bit this part of the code:
    
    - IsochroneNode are now only used for directories
    - FileEntry are used directly from IsochroneNode.entry.files (no need
      for creating new FileEntry instances), so
    - IsochroneNode.children only stores IsochroneNode (thus DirectoryEntry)
      objects,
    - rename IsochroneNode.date as 'dbdate' and clarify its semantics,
    - attempt to document (comments) a bit more the algorithm and semantics
      of several attributes/variables used in there.

commit b90c8625c040fc38de21b0e0e38ef4f2a8d175e1
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed May 19 16:16:53 2021 +0200

    Replace 'DirectoryEntry.ls()' method by 'files' and 'dirs' properties
    
    and make the retrieval of children from the archive explicit in a
    dedicated retrieve_children() method.
    
    Change the API of ProvenanceInterface's content_get_early_dates() and
    directory_get_dates_in_isochrone_frontier() to expect Iterable instead
    of List, so that callers do not have to build temporary lists from the
    generators returned by 'DirectoryEntry.files' and '.dirs'.

commit 47c7327b7d61eddd1aad4c63196d684b29288f2b
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed May 19 16:14:41 2021 +0200

    Add __str__ methods to RevisionEntry, DirectoryEntry and FileEntry
    
    to ease logging and debugging.

commit 874b176bd6ef674e235469dc80b245f88fde45c1
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed May 12 12:44:07 2021 +0200

    Improve a bit the code of ProvenanceDBBase

commit 3c9f7bd77a2babc5ca4509878fa7e1f1f9136591
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Jun 2 12:10:50 2021 +0200

    Add a test_provenance_heuristics_content_find_all() test
    
    test that ProvenanceDB.find_all() behaves as expected for all test
    datasets.

commit f3cd239bf8c3241b297c6481beca266bfd47eb25
Author: David Douard <david.douard@sdfa3.org>
Date:   Tue Jun 1 16:34:57 2021 +0200

    Add a simple out-of-order dataset

commit 5f9e5b53a5b0547cfdbe1676e2b09648f4f359f1
Author: David Douard <david.douard@sdfa3.org>
Date:   Tue Jun 1 11:47:09 2021 +0200

    Remove test_provenance_heuristics_CMDBTS test
    
    since it's redundant with the cmdbts2 test, now generated from a simple
    yaml file rather than depending on the original CMDBTS git repo on
    github.
    
    The CMDBTS dataset (CMDBTS.msgpack) is kept for now since it's still
    used for other tests (e.g. test_provenance_db).

commit 9fe096b3905495bff534649f2e5e0ecb8802217d
Author: David Douard <david.douard@sdfa3.org>
Date:   Mon May 31 16:51:01 2021 +0200

    Add a new (git) dataset generation scaffolding for tests
    
    and use it to generate a 'cmdbts2' test case strictly equivalent
    to the CMDBTS repo.
    
    See the swh/provenance/tests/data/README.md file for more details.
    
    Note: this aims at making it easier to write more test cases than
    when depending on the CMDBTS git repo on github. For example, a new
    test case should come soon for situations like 'out-of-order' revisions.

commit fd373add1762c515070d39dc1cc1b58c09d3e8e4
Author: David Douard <david.douard@sdfa3.org>
Date:   Mon May 31 16:45:48 2021 +0200

    Remove test_provenance_heuristics from the ArchiveStorage tests
    
    because it's not that meaningful.

See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/87/ for more details.

This revision is now accepted and ready to land.
Jun 4 2021, 1:51 PM
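
For readers skimming the commit log above, the change "Replace 'DirectoryEntry.ls()' method by 'files' and 'dirs' properties" can be pictured roughly as follows. This is only a minimal sketch under stated assumptions, not the code from this diff: the constructor signatures, the `Listing` callable standing in for the archive, the `early_dates()` helper (a stand-in for an Iterable-accepting method such as content_get_early_dates()) and the exact `__str__` format are all hypothetical.

```python
from typing import Callable, Dict, Iterable, Iterator, List, Optional, Tuple

# Hypothetical stand-in for the archive: given a directory id, it yields
# (kind, id, name) tuples, where kind is "file" or "dir".
Listing = Callable[[bytes], Iterable[Tuple[str, bytes, bytes]]]


class FileEntry:
    def __init__(self, id: bytes, name: bytes) -> None:
        self.id = id
        self.name = name

    def __str__(self) -> str:
        # Assumed format, only meant to make log lines readable.
        return f"<FileEntry[{self.id.hex()}] {self.name!r}>"


class DirectoryEntry:
    def __init__(self, id: bytes, name: bytes = b"") -> None:
        self.id = id
        self.name = name
        self._files: Optional[List[FileEntry]] = None
        self._dirs: Optional[List["DirectoryEntry"]] = None

    def retrieve_children(self, ls: Listing) -> None:
        """Explicitly fetch children from the archive, at most once."""
        if self._files is None and self._dirs is None:
            self._files, self._dirs = [], []
            for kind, child_id, child_name in ls(self.id):
                if kind == "dir":
                    self._dirs.append(DirectoryEntry(child_id, child_name))
                elif kind == "file":
                    self._files.append(FileEntry(child_id, child_name))

    @property
    def files(self) -> Iterator[FileEntry]:
        if self._files is None:
            raise RuntimeError("call retrieve_children() first")
        return (f for f in self._files)

    @property
    def dirs(self) -> Iterator["DirectoryEntry"]:
        if self._dirs is None:
            raise RuntimeError("call retrieve_children() first")
        return (d for d in self._dirs)

    def __str__(self) -> str:
        return f"<DirectoryEntry[{self.id.hex()}] {self.name!r}>"


def early_dates(files: Iterable[FileEntry], known: Dict[bytes, int]) -> Dict[bytes, int]:
    """Hypothetical consumer: accepts any Iterable, so a generator is enough."""
    return {f.id: known[f.id] for f in files if f.id in known}


# Usage with a tiny in-memory "archive" (made-up ids and names).
archive = {
    b"\x01": [("file", b"\x0a", b"README"), ("dir", b"\x02", b"src")],
    b"\x02": [],
}
root = DirectoryEntry(b"\x01")
root.retrieve_children(lambda dir_id: archive[dir_id])
```

Because `early_dates()` only iterates over its argument, `root.files` can be passed directly without first wrapping it in `list(...)`, which is the motivation the commit message gives for moving from List to Iterable.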