
Add a new (git) dataset generation scaffolding for tests
Closed, Public

Authored by douardda on Jun 1 2021, 11:53 AM.

Details

Summary

and use it to generate a 'cmdbts2' test case strictly equivalent
to the CMDBTS repo.
Replace the previous CMDBTS dataset by this generated 'cmdbts2' dataset.

See the swh/provenance/tests/data/README.md file for more details.

Note: this revision could be split in 2 or 3, but...

Also remove test_provenance_heuristics_CMDBTS test

since it's redundant with the cmdbts2 test, now generated from a simple
yaml file rather than depending on the original CMDBTS git repo on
github.

The CMDBTS dataset (CMDBTS.msgpack) is kept for now since it's still
used for other tests (e.g. test_provenance_db).
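
As an illustration of the idea of describing a test repository as plain data, here is a minimal sketch in Python. It is not the format used by cmdbts2_repo.yaml nor the code of generate_storage_from_git.py: the keys `msg`, `date` and `content`, the `REPO` list and the `materialize()` helper are all hypothetical names chosen for this example. It only shows how one revision of such a description could be written out as a file tree, which a tool could then commit.

```python
import tempfile
from pathlib import Path

# Hypothetical description of a tiny repository. The keys used here
# ("msg", "date", "content") are assumptions, not the real schema.
REPO = [
    {"msg": "R00", "date": 1000000000,
     "content": {"README.md": "hello\n"}},
    {"msg": "R01", "date": 1000000010,
     "content": {"README.md": "hello\n", "src/a.txt": "a\n"}},
]


def materialize(revision, root):
    """Write the files of one revision under root.

    Returns the relative paths written, in sorted order.
    """
    root = Path(root)
    written = []
    for relpath, text in sorted(revision["content"].items()):
        target = root / relpath
        # Create intermediate directories such as "src/" on demand.
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(text)
        written.append(relpath)
    return written


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        for rev in REPO:
            print(rev["msg"], materialize(rev, tmp))
```

Keeping the description as data means a new test case is a new list of revisions rather than a new external repository to clone.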

Depends on D5781

Diff Detail

Repository
rDPROV Provenance database
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

Harbormaster returned this revision to the author for changes because remote builds failed. Jun 1 2021, 11:57 AM
Harbormaster failed remote builds in B21698: Diff 20737!

improve the commit message a bit

Build has FAILED

Patch application report for D5805 (id=20738)

Could not rebase; Attempt merge onto 5aa0314dd7...

Updating 5aa0314..ba686ca
Fast-forward
 swh/provenance/model.py                            |  21 +-
 swh/provenance/postgresql/provenancedb_base.py     |  12 +-
 .../postgresql/provenancedb_with_path.py           | 115 +++------
 swh/provenance/provenance.py                       | 201 ++++++++-------
 swh/provenance/tests/conftest.py                   |  26 +-
 swh/provenance/tests/data/README.md                | 138 ++++++++++
 swh/provenance/tests/data/cmdbts2.msgpack          | Bin 0 -> 17734 bytes
 swh/provenance/tests/data/cmdbts2_repo.yaml        |  80 ++++++
 .../tests/data/generate_storage_from_git.py        | 115 +++++++++
 .../tests/data/synthetic_cmdbts2_lower_1.txt       |  91 +++++++
 .../tests/data/synthetic_cmdbts2_lower_2.txt       |  91 +++++++
 .../tests/data/synthetic_cmdbts2_upper_1.txt       |  91 +++++++
 .../tests/data/synthetic_cmdbts2_upper_2.txt       |  91 +++++++
 swh/provenance/tests/data/synthetic_lower_1.txt    |  91 -------
 swh/provenance/tests/data/synthetic_lower_2.txt    |  91 -------
 swh/provenance/tests/data/synthetic_upper_1.txt    |  92 -------
 swh/provenance/tests/data/synthetic_upper_2.txt    |  91 -------
 swh/provenance/tests/test_provenance_db.py         | 281 ++++++++++++---------
 swh/provenance/tests/test_provenance_db_storage.py |   2 +-
 swh/provenance/tests/test_provenance_heuristics.py | 148 +++++++++++
 20 files changed, 1195 insertions(+), 673 deletions(-)
 create mode 100644 swh/provenance/tests/data/README.md
 create mode 100644 swh/provenance/tests/data/cmdbts2.msgpack
 create mode 100644 swh/provenance/tests/data/cmdbts2_repo.yaml
 create mode 100644 swh/provenance/tests/data/generate_storage_from_git.py
 create mode 100644 swh/provenance/tests/data/synthetic_cmdbts2_lower_1.txt
 create mode 100644 swh/provenance/tests/data/synthetic_cmdbts2_lower_2.txt
 create mode 100644 swh/provenance/tests/data/synthetic_cmdbts2_upper_1.txt
 create mode 100644 swh/provenance/tests/data/synthetic_cmdbts2_upper_2.txt
 delete mode 100644 swh/provenance/tests/data/synthetic_lower_1.txt
 delete mode 100644 swh/provenance/tests/data/synthetic_lower_2.txt
 delete mode 100644 swh/provenance/tests/data/synthetic_upper_1.txt
 delete mode 100644 swh/provenance/tests/data/synthetic_upper_2.txt
 create mode 100644 swh/provenance/tests/test_provenance_heuristics.py
Changes applied before test
commit ba686ca5b3decec9de1512604b0c97c5c4324b10
Author: David Douard <david.douard@sdfa3.org>
Date:   Tue Jun 1 11:47:09 2021 +0200

    Remove test_provenance_heuristics_CMDBTS test
    
    since it's redundant with the cmdbts2 test, now generated from a simple
    yaml file rather than depending on the original CMDBTS git repo on
    github.
    
    The CMDBTS dataset (CMDBTS.msgpack) is kept for now since it's still
    used for other tests (e.g. test_provenance_db).

commit b5b199bea61ff03a537267e693945c1071668e4e
Author: David Douard <david.douard@sdfa3.org>
Date:   Mon May 31 16:51:01 2021 +0200

    Add a new (git) dataset generation scaffolding for tests
    
    and use it to the generate a 'cmdbts2' test case strictly equivalent
    to the CMDBTS repo.
    
    See the swh/provenance/tests/data/README.md file for more details.
    
    Note: this aims at making easy to write more test cases than depending
    on the CMDBTS git repo on github. For example, a new test case should
    come soon for situations like 'out-of-order' revisions.

commit 19f8bd8f5d7476339d8e7eabd0ba2a8aa800251f
Author: David Douard <david.douard@sdfa3.org>
Date:   Mon May 31 16:45:48 2021 +0200

    Remove test_provenance_heuristics from tests from ArchvieStorage tests
    
    because it's not that meaningful.

commit 08d8dd6478836ff4ab1c00c67f553b6d705b5a9c
Author: David Douard <david.douard@sdfa3.org>
Date:   Tue May 25 14:31:51 2021 +0200

    Simplify DB queries in ProvenanceWithPathDB.content_find_(first|all)
    
    the queries should be exactly the same as before (query plans are the
    same); just written (hopefully) in a bit more readable manner.

commit fd43523fd594e70ccd002827d379321f52c2b6da
Author: David Douard <david.douard@sdfa3.org>
Date:   Tue May 25 14:59:16 2021 +0200

    Add a test for content_find_all()

commit d85f2b0ee48aefe03ad32311623e5390f43d7261
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed May 12 09:47:03 2021 +0200

    Refactor the isochrone graph computation
    
    attempt to simplify a bit this part of the code:
    
    - IsochroneNode are now only used for directories
    - FileEntry are stored in a new IsochroneNode.files attribute, so
    - IsochroneNode.children only stores IsochroneNode (thus DirectoryEntry)
      objects,
    - rename IsochroneNode.date as 'dbdate' and clarify its semantics

commit 31d833ec86bf041e100795e7796ce832d00450ef
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed May 19 16:16:53 2021 +0200

    Add 'ls_files()' and 'ls_dirs()' methods to the DirectoryEntry class
    
    to make it a bit easier to compute the isochrone graph (see following
    revisions).

commit 72644b98a218132c0b173f360c503438688ecebb
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed May 19 16:14:41 2021 +0200

    Add __str__ methods to RevisionEntry, DirectoryEntry and FileEntry
    
    to ease logging and debugging.

commit a71041fbaf3f0d7ec3ea944cbbf04286c57d8b7e
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed May 12 12:44:07 2021 +0200

    Improve a bit the code of ProvenanceDBBase

commit defcb388ffba0869edb1a126b6626710c396c2ac
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed May 12 10:30:23 2021 +0200

    Add a test for the build_isochrone_graph() function
    
    this test is far from ideal, since it's mostly the record of what happen
    during a "known good" session of revision insertions, but at least it
    should allow to refactor code related to the isochrone graph computation
    with a bit more confidence...

Link to build: https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/45/
See console output for more information: https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/45/console

Harbormaster returned this revision to the author for changes because remote builds failed. Jun 1 2021, 12:01 PM
Harbormaster failed remote builds in B21699: Diff 20738!

add missing dependency on swh.loader.git for tests

Build is green

Patch application report for D5805 (id=20742)

Could not rebase; Attempt merge onto 5aa0314dd7...

Updating 5aa0314..e3d6e0b
Fast-forward
 requirements-test.txt                              |   1 +
 swh/provenance/model.py                            |  21 +-
 swh/provenance/postgresql/provenancedb_base.py     |  12 +-
 .../postgresql/provenancedb_with_path.py           | 115 +++------
 swh/provenance/provenance.py                       | 201 ++++++++-------
 swh/provenance/tests/conftest.py                   |  26 +-
 swh/provenance/tests/data/README.md                | 138 ++++++++++
 swh/provenance/tests/data/cmdbts2.msgpack          | Bin 0 -> 17734 bytes
 swh/provenance/tests/data/cmdbts2_repo.yaml        |  80 ++++++
 .../tests/data/generate_storage_from_git.py        | 115 +++++++++
 .../tests/data/synthetic_cmdbts2_lower_1.txt       |  91 +++++++
 .../tests/data/synthetic_cmdbts2_lower_2.txt       |  91 +++++++
 .../tests/data/synthetic_cmdbts2_upper_1.txt       |  91 +++++++
 .../tests/data/synthetic_cmdbts2_upper_2.txt       |  91 +++++++
 swh/provenance/tests/data/synthetic_lower_1.txt    |  91 -------
 swh/provenance/tests/data/synthetic_lower_2.txt    |  91 -------
 swh/provenance/tests/data/synthetic_upper_1.txt    |  92 -------
 swh/provenance/tests/data/synthetic_upper_2.txt    |  91 -------
 swh/provenance/tests/test_provenance_db.py         | 281 ++++++++++++---------
 swh/provenance/tests/test_provenance_db_storage.py |   2 +-
 swh/provenance/tests/test_provenance_heuristics.py | 185 ++++++++++++++
 21 files changed, 1233 insertions(+), 673 deletions(-)
 create mode 100644 swh/provenance/tests/data/README.md
 create mode 100644 swh/provenance/tests/data/cmdbts2.msgpack
 create mode 100644 swh/provenance/tests/data/cmdbts2_repo.yaml
 create mode 100644 swh/provenance/tests/data/generate_storage_from_git.py
 create mode 100644 swh/provenance/tests/data/synthetic_cmdbts2_lower_1.txt
 create mode 100644 swh/provenance/tests/data/synthetic_cmdbts2_lower_2.txt
 create mode 100644 swh/provenance/tests/data/synthetic_cmdbts2_upper_1.txt
 create mode 100644 swh/provenance/tests/data/synthetic_cmdbts2_upper_2.txt
 delete mode 100644 swh/provenance/tests/data/synthetic_lower_1.txt
 delete mode 100644 swh/provenance/tests/data/synthetic_lower_2.txt
 delete mode 100644 swh/provenance/tests/data/synthetic_upper_1.txt
 delete mode 100644 swh/provenance/tests/data/synthetic_upper_2.txt
 create mode 100644 swh/provenance/tests/test_provenance_heuristics.py
Changes applied before test
commit e3d6e0b3c4708d31e9f26c0dc8b415f00bb219aa
Author: David Douard <david.douard@sdfa3.org>
Date:   Tue Jun 1 11:47:09 2021 +0200

    Remove test_provenance_heuristics_CMDBTS test
    
    since it's redundant with the cmdbts2 test, now generated from a simple
    yaml file rather than depending on the original CMDBTS git repo on
    github.
    
    The CMDBTS dataset (CMDBTS.msgpack) is kept for now since it's still
    used for other tests (e.g. test_provenance_db).

commit f55368813234fd693ad3bb98ce0ed83e53e0ce22
Author: David Douard <david.douard@sdfa3.org>
Date:   Mon May 31 16:51:01 2021 +0200

    Add a new (git) dataset generation scaffolding for tests
    
    and use it to the generate a 'cmdbts2' test case strictly equivalent
    to the CMDBTS repo.
    
    See the swh/provenance/tests/data/README.md file for more details.
    
    Note: this aims at making easy to write more test cases than depending
    on the CMDBTS git repo on github. For example, a new test case should
    come soon for situations like 'out-of-order' revisions.

commit 19f8bd8f5d7476339d8e7eabd0ba2a8aa800251f
Author: David Douard <david.douard@sdfa3.org>
Date:   Mon May 31 16:45:48 2021 +0200

    Remove test_provenance_heuristics from tests from ArchvieStorage tests
    
    because it's not that meaningful.

commit 08d8dd6478836ff4ab1c00c67f553b6d705b5a9c
Author: David Douard <david.douard@sdfa3.org>
Date:   Tue May 25 14:31:51 2021 +0200

    Simplify DB queries in ProvenanceWithPathDB.content_find_(first|all)
    
    the queries should be exactly the same as before (query plans are the
    same); just written (hopefully) in a bit more readable manner.

commit fd43523fd594e70ccd002827d379321f52c2b6da
Author: David Douard <david.douard@sdfa3.org>
Date:   Tue May 25 14:59:16 2021 +0200

    Add a test for content_find_all()

commit d85f2b0ee48aefe03ad32311623e5390f43d7261
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed May 12 09:47:03 2021 +0200

    Refactor the isochrone graph computation
    
    attempt to simplify a bit this part of the code:
    
    - IsochroneNode are now only used for directories
    - FileEntry are stored in a new IsochroneNode.files attribute, so
    - IsochroneNode.children only stores IsochroneNode (thus DirectoryEntry)
      objects,
    - rename IsochroneNode.date as 'dbdate' and clarify its semantics

commit 31d833ec86bf041e100795e7796ce832d00450ef
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed May 19 16:16:53 2021 +0200

    Add 'ls_files()' and 'ls_dirs()' methods to the DirectoryEntry class
    
    to make it a bit easier to compute the isochrone graph (see following
    revisions).

commit 72644b98a218132c0b173f360c503438688ecebb
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed May 19 16:14:41 2021 +0200

    Add __str__ methods to RevisionEntry, DirectoryEntry and FileEntry
    
    to ease logging and debugging.

commit a71041fbaf3f0d7ec3ea944cbbf04286c57d8b7e
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed May 12 12:44:07 2021 +0200

    Improve a bit the code of ProvenanceDBBase

commit defcb388ffba0869edb1a126b6626710c396c2ac
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed May 12 10:30:23 2021 +0200

    Add a test for the build_isochrone_graph() function
    
    this test is far from ideal, since it's mostly the record of what happen
    during a "known good" session of revision insertions, but at least it
    should allow to refactor code related to the isochrone graph computation
    with a bit more confidence...

See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/46/ for more details.

actually pick the correct revisions for this diff...

Build is green

Patch application report for D5805 (id=20753)

Could not rebase; Attempt merge onto 49e47c3ea7...

Merge made by the 'recursive' strategy.
 requirements-test.txt                              |   1 +
 swh/provenance/model.py                            |  31 ++-
 swh/provenance/postgresql/provenancedb_base.py     |  12 +-
 .../postgresql/provenancedb_with_path.py           | 115 +++------
 swh/provenance/provenance.py                       | 220 +++++++++-------
 swh/provenance/tests/conftest.py                   |  26 +-
 swh/provenance/tests/data/README.md                | 138 ++++++++++
 swh/provenance/tests/data/cmdbts2.msgpack          | Bin 0 -> 17734 bytes
 swh/provenance/tests/data/cmdbts2_repo.yaml        |  80 ++++++
 .../tests/data/generate_storage_from_git.py        | 115 +++++++++
 .../tests/data/synthetic_cmdbts2_lower_1.txt       |  91 +++++++
 .../tests/data/synthetic_cmdbts2_lower_2.txt       |  91 +++++++
 .../tests/data/synthetic_cmdbts2_upper_1.txt       |  91 +++++++
 .../tests/data/synthetic_cmdbts2_upper_2.txt       |  91 +++++++
 swh/provenance/tests/test_provenance_db.py         | 281 ++++++++++++---------
 swh/provenance/tests/test_provenance_db_storage.py |   2 +-
 swh/provenance/tests/test_provenance_heuristics.py | 268 ++++++++++++++++++++
 17 files changed, 1337 insertions(+), 316 deletions(-)
 create mode 100644 swh/provenance/tests/data/README.md
 create mode 100644 swh/provenance/tests/data/cmdbts2.msgpack
 create mode 100644 swh/provenance/tests/data/cmdbts2_repo.yaml
 create mode 100644 swh/provenance/tests/data/generate_storage_from_git.py
 create mode 100644 swh/provenance/tests/data/synthetic_cmdbts2_lower_1.txt
 create mode 100644 swh/provenance/tests/data/synthetic_cmdbts2_lower_2.txt
 create mode 100644 swh/provenance/tests/data/synthetic_cmdbts2_upper_1.txt
 create mode 100644 swh/provenance/tests/data/synthetic_cmdbts2_upper_2.txt
 create mode 100644 swh/provenance/tests/test_provenance_heuristics.py
Changes applied before test
commit caad5ae31f187d4cecbef70d813a9ad9d11cdd20
Merge: 49e47c3 10a1662
Author: Jenkins user <jenkins@localhost>
Date:   Wed Jun 2 15:38:00 2021 +0000

    Merge branch 'diff-target' into HEAD

commit 10a166272c262f7725d041fc0b15219868eedacb
Author: David Douard <david.douard@sdfa3.org>
Date:   Mon May 31 16:51:01 2021 +0200

    Add a new (git) dataset generation scaffolding for tests
    
    and use it to the generate a 'cmdbts2' test case strictly equivalent
    to the CMDBTS repo.
    
    See the swh/provenance/tests/data/README.md file for more details.
    
    Note: this aims at making easy to write more test cases than depending
    on the CMDBTS git repo on github. For example, a new test case should
    come soon for situations like 'out-of-order' revisions.

commit ddc2ae0583db4b317c04d97386d18d2c17ae00d7
Author: David Douard <david.douard@sdfa3.org>
Date:   Mon May 31 16:45:48 2021 +0200

    Remove test_provenance_heuristics from tests from ArchvieStorage tests
    
    because it's not that meaningful.

commit 16bab3c60c2a3a80782273f1aaff796826e7dc2c
Author: David Douard <david.douard@sdfa3.org>
Date:   Tue May 25 14:31:51 2021 +0200

    Simplify DB queries in ProvenanceWithPathDB.content_find_(first|all)
    
    the queries should be exactly the same as before (query plans are the
    same); just written (hopefully) in a bit more readable manner.

commit 024cc9ce93e545782a980f8e81d5d09651b8231b
Author: David Douard <david.douard@sdfa3.org>
Date:   Tue May 25 14:59:16 2021 +0200

    Add a test for content_find_all()

commit af15ad65f4a34e7703bfec80666102a6403cb505
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed May 12 09:47:03 2021 +0200

    Refactor the isochrone graph computation
    
    attempt to simplify a bit this part of the code:
    
    - IsochroneNode are now only used for directories
    - FileEntry are used directly from IsochroneNode.entry.files (no need
      for creating new FileEntry instances), so
    - IsochroneNode.children only stores IsochroneNode (thus DirectoryEntry)
      objects,
    - rename IsochroneNode.date as 'dbdate' and clarify its semantics,
    - attempt to document (comments) a bit more the algorithm and semantics
      of several attributes/variables used in there.

commit 1f49fdc967a2854d3a68dec34886b824fdf045f6
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed May 19 16:16:53 2021 +0200

    Replace 'DirectoryEntry.ls()' method by 'files' and 'dirs' properties
    
    and make the retrieval of children from the archive explicit in a
    dedicated retrieve_children() method.

commit 72644b98a218132c0b173f360c503438688ecebb
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed May 19 16:14:41 2021 +0200

    Add __str__ methods to RevisionEntry, DirectoryEntry and FileEntry
    
    to ease logging and debugging.

commit a71041fbaf3f0d7ec3ea944cbbf04286c57d8b7e
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed May 12 12:44:07 2021 +0200

    Improve a bit the code of ProvenanceDBBase

commit defcb388ffba0869edb1a126b6626710c396c2ac
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed May 12 10:30:23 2021 +0200

    Add a test for the build_isochrone_graph() function
    
    this test is far from ideal, since it's mostly the record of what happen
    during a "known good" session of revision insertions, but at least it
    should allow to refactor code related to the isochrone graph computation
    with a bit more confidence...

See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/53/ for more details.

Build is green

Patch application report for D5805 (id=20755)

Could not rebase; Attempt merge onto 49e47c3ea7...

Removing swh/provenance/tests/data/synthetic_upper_2.txt
Removing swh/provenance/tests/data/synthetic_upper_1.txt
Removing swh/provenance/tests/data/synthetic_lower_2.txt
Removing swh/provenance/tests/data/synthetic_lower_1.txt
Merge made by the 'recursive' strategy.
 requirements-test.txt                              |   1 +
 swh/provenance/model.py                            |  31 ++-
 swh/provenance/postgresql/provenancedb_base.py     |  12 +-
 .../postgresql/provenancedb_with_path.py           | 115 +++------
 swh/provenance/provenance.py                       | 220 +++++++++-------
 swh/provenance/tests/conftest.py                   |  26 +-
 swh/provenance/tests/data/README.md                | 138 ++++++++++
 swh/provenance/tests/data/cmdbts2.msgpack          | Bin 0 -> 17734 bytes
 swh/provenance/tests/data/cmdbts2_repo.yaml        |  80 ++++++
 .../tests/data/generate_storage_from_git.py        | 115 +++++++++
 .../tests/data/synthetic_cmdbts2_lower_1.txt       |  91 +++++++
 .../tests/data/synthetic_cmdbts2_lower_2.txt       |  91 +++++++
 .../tests/data/synthetic_cmdbts2_upper_1.txt       |  91 +++++++
 .../tests/data/synthetic_cmdbts2_upper_2.txt       |  91 +++++++
 swh/provenance/tests/data/synthetic_lower_1.txt    |  91 -------
 swh/provenance/tests/data/synthetic_lower_2.txt    |  91 -------
 swh/provenance/tests/data/synthetic_upper_1.txt    |  92 -------
 swh/provenance/tests/data/synthetic_upper_2.txt    |  91 -------
 swh/provenance/tests/test_provenance_db.py         | 281 ++++++++++++---------
 swh/provenance/tests/test_provenance_db_storage.py |   2 +-
 swh/provenance/tests/test_provenance_heuristics.py | 185 ++++++++++++++
 21 files changed, 1254 insertions(+), 681 deletions(-)
 create mode 100644 swh/provenance/tests/data/README.md
 create mode 100644 swh/provenance/tests/data/cmdbts2.msgpack
 create mode 100644 swh/provenance/tests/data/cmdbts2_repo.yaml
 create mode 100644 swh/provenance/tests/data/generate_storage_from_git.py
 create mode 100644 swh/provenance/tests/data/synthetic_cmdbts2_lower_1.txt
 create mode 100644 swh/provenance/tests/data/synthetic_cmdbts2_lower_2.txt
 create mode 100644 swh/provenance/tests/data/synthetic_cmdbts2_upper_1.txt
 create mode 100644 swh/provenance/tests/data/synthetic_cmdbts2_upper_2.txt
 delete mode 100644 swh/provenance/tests/data/synthetic_lower_1.txt
 delete mode 100644 swh/provenance/tests/data/synthetic_lower_2.txt
 delete mode 100644 swh/provenance/tests/data/synthetic_upper_1.txt
 delete mode 100644 swh/provenance/tests/data/synthetic_upper_2.txt
 create mode 100644 swh/provenance/tests/test_provenance_heuristics.py
Changes applied before test
commit 69ac7ace1f4906ad1cdb3a82165aa5d93634d9fc
Merge: 49e47c3 421b2c8
Author: Jenkins user <jenkins@localhost>
Date:   Wed Jun 2 15:39:45 2021 +0000

    Merge branch 'diff-target' into HEAD

commit 421b2c832a9d37ee6de8b29eebf7c1f65ed01d5a
Author: David Douard <david.douard@sdfa3.org>
Date:   Tue Jun 1 11:47:09 2021 +0200

    Remove test_provenance_heuristics_CMDBTS test
    
    since it's redundant with the cmdbts2 test, now generated from a simple
    yaml file rather than depending on the original CMDBTS git repo on
    github.
    
    The CMDBTS dataset (CMDBTS.msgpack) is kept for now since it's still
    used for other tests (e.g. test_provenance_db).

commit 10a166272c262f7725d041fc0b15219868eedacb
Author: David Douard <david.douard@sdfa3.org>
Date:   Mon May 31 16:51:01 2021 +0200

    Add a new (git) dataset generation scaffolding for tests
    
    and use it to the generate a 'cmdbts2' test case strictly equivalent
    to the CMDBTS repo.
    
    See the swh/provenance/tests/data/README.md file for more details.
    
    Note: this aims at making easy to write more test cases than depending
    on the CMDBTS git repo on github. For example, a new test case should
    come soon for situations like 'out-of-order' revisions.

commit ddc2ae0583db4b317c04d97386d18d2c17ae00d7
Author: David Douard <david.douard@sdfa3.org>
Date:   Mon May 31 16:45:48 2021 +0200

    Remove test_provenance_heuristics from tests from ArchvieStorage tests
    
    because it's not that meaningful.

commit 16bab3c60c2a3a80782273f1aaff796826e7dc2c
Author: David Douard <david.douard@sdfa3.org>
Date:   Tue May 25 14:31:51 2021 +0200

    Simplify DB queries in ProvenanceWithPathDB.content_find_(first|all)
    
    the queries should be exactly the same as before (query plans are the
    same); just written (hopefully) in a bit more readable manner.

commit 024cc9ce93e545782a980f8e81d5d09651b8231b
Author: David Douard <david.douard@sdfa3.org>
Date:   Tue May 25 14:59:16 2021 +0200

    Add a test for content_find_all()

commit af15ad65f4a34e7703bfec80666102a6403cb505
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed May 12 09:47:03 2021 +0200

    Refactor the isochrone graph computation
    
    attempt to simplify a bit this part of the code:
    
    - IsochroneNode are now only used for directories
    - FileEntry are used directly from IsochroneNode.entry.files (no need
      for creating new FileEntry instances), so
    - IsochroneNode.children only stores IsochroneNode (thus DirectoryEntry)
      objects,
    - rename IsochroneNode.date as 'dbdate' and clarify its semantics,
    - attempt to document (comments) a bit more the algorithm and semantics
      of several attributes/variables used in there.

commit 1f49fdc967a2854d3a68dec34886b824fdf045f6
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed May 19 16:16:53 2021 +0200

    Replace 'DirectoryEntry.ls()' method by 'files' and 'dirs' properties
    
    and make the retrieval of children from the archive explicit in a
    dedicated retrieve_children() method.

commit 72644b98a218132c0b173f360c503438688ecebb
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed May 19 16:14:41 2021 +0200

    Add __str__ methods to RevisionEntry, DirectoryEntry and FileEntry
    
    to ease logging and debugging.

commit a71041fbaf3f0d7ec3ea944cbbf04286c57d8b7e
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed May 12 12:44:07 2021 +0200

    Improve a bit the code of ProvenanceDBBase

commit defcb388ffba0869edb1a126b6626710c396c2ac
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed May 12 10:30:23 2021 +0200

    Add a test for the build_isochrone_graph() function
    
    this test is far from ideal, since it's mostly the record of what happen
    during a "known good" session of revision insertions, but at least it
    should allow to refactor code related to the isochrone graph computation
    with a bit more confidence...

See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/54/ for more details.

Build is green

Patch application report for D5805 (id=20768)

Could not rebase; Attempt merge onto 49e47c3ea7...

Updating 49e47c3..46c7c9d
Fast-forward
 requirements-test.txt                              |   1 +
 swh/provenance/model.py                            |  31 ++-
 swh/provenance/postgresql/provenancedb_base.py     |  12 +-
 .../postgresql/provenancedb_with_path.py           | 115 +++------
 swh/provenance/provenance.py                       | 220 +++++++++-------
 swh/provenance/tests/conftest.py                   |  26 +-
 swh/provenance/tests/data/README.md                | 138 ++++++++++
 swh/provenance/tests/data/cmdbts2.msgpack          | Bin 0 -> 17734 bytes
 swh/provenance/tests/data/cmdbts2_repo.yaml        |  80 ++++++
 .../tests/data/generate_storage_from_git.py        | 115 +++++++++
 .../tests/data/synthetic_cmdbts2_lower_1.txt       |  91 +++++++
 .../tests/data/synthetic_cmdbts2_lower_2.txt       |  91 +++++++
 .../tests/data/synthetic_cmdbts2_upper_1.txt       |  91 +++++++
 .../tests/data/synthetic_cmdbts2_upper_2.txt       |  91 +++++++
 swh/provenance/tests/data/synthetic_lower_1.txt    |  91 -------
 swh/provenance/tests/data/synthetic_lower_2.txt    |  91 -------
 swh/provenance/tests/data/synthetic_upper_1.txt    |  92 -------
 swh/provenance/tests/data/synthetic_upper_2.txt    |  91 -------
 swh/provenance/tests/test_provenance_db.py         | 281 ++++++++++++---------
 swh/provenance/tests/test_provenance_db_storage.py |   2 +-
 swh/provenance/tests/test_provenance_heuristics.py | 185 ++++++++++++++
 21 files changed, 1254 insertions(+), 681 deletions(-)
 create mode 100644 swh/provenance/tests/data/README.md
 create mode 100644 swh/provenance/tests/data/cmdbts2.msgpack
 create mode 100644 swh/provenance/tests/data/cmdbts2_repo.yaml
 create mode 100644 swh/provenance/tests/data/generate_storage_from_git.py
 create mode 100644 swh/provenance/tests/data/synthetic_cmdbts2_lower_1.txt
 create mode 100644 swh/provenance/tests/data/synthetic_cmdbts2_lower_2.txt
 create mode 100644 swh/provenance/tests/data/synthetic_cmdbts2_upper_1.txt
 create mode 100644 swh/provenance/tests/data/synthetic_cmdbts2_upper_2.txt
 delete mode 100644 swh/provenance/tests/data/synthetic_lower_1.txt
 delete mode 100644 swh/provenance/tests/data/synthetic_lower_2.txt
 delete mode 100644 swh/provenance/tests/data/synthetic_upper_1.txt
 delete mode 100644 swh/provenance/tests/data/synthetic_upper_2.txt
 create mode 100644 swh/provenance/tests/test_provenance_heuristics.py
Changes applied before test
commit 46c7c9df7beaae63f4dc1089498c64c5658d3bf5
Author: David Douard <david.douard@sdfa3.org>
Date:   Tue Jun 1 11:47:09 2021 +0200

    Remove test_provenance_heuristics_CMDBTS test
    
    since it's redundant with the cmdbts2 test, now generated from a simple
    yaml file rather than depending on the original CMDBTS git repo on
    github.
    
    The CMDBTS dataset (CMDBTS.msgpack) is kept for now since it's still
    used for other tests (e.g. test_provenance_db).

commit b3279560fea0c3a84002516cd25d3c3ce86491c6
Author: David Douard <david.douard@sdfa3.org>
Date:   Mon May 31 16:51:01 2021 +0200

    Add a new (git) dataset generation scaffolding for tests
    
    and use it to the generate a 'cmdbts2' test case strictly equivalent
    to the CMDBTS repo.
    
    See the swh/provenance/tests/data/README.md file for more details.
    
    Note: this aims at making easy to write more test cases than depending
    on the CMDBTS git repo on github. For example, a new test case should
    come soon for situations like 'out-of-order' revisions.

commit 56f0ae8e12990006b1faec62bb8f61b9eed84955
Author: David Douard <david.douard@sdfa3.org>
Date:   Mon May 31 16:45:48 2021 +0200

    Remove test_provenance_heuristics from tests from ArchvieStorage tests
    
    because it's not that meaningful.

commit de30f332f219e4edb299bb50a0b808a779c57d85
Author: David Douard <david.douard@sdfa3.org>
Date:   Tue May 25 14:31:51 2021 +0200

    Simplify DB queries in ProvenanceWithPathDB.content_find_(first|all)
    
    the queries should be exactly the same as before (query plans are the
    same); just written (hopefully) in a bit more readable manner.

commit ee8e4b0b7ce6a85eac0665a916a37b2d63e3bb4d
Author: David Douard <david.douard@sdfa3.org>
Date:   Tue May 25 14:59:16 2021 +0200

    Add a test for content_find_all()

commit 94598b3ce8c49eb6dfe5308b47b74271a7f9d625
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed May 12 09:47:03 2021 +0200

    Refactor the isochrone graph computation
    
    attempt to simplify a bit this part of the code:
    
    - IsochroneNode are now only used for directories
    - FileEntry are used directly from IsochroneNode.entry.files (no need
      for creating new FileEntry instances), so
    - IsochroneNode.children only stores IsochroneNode (thus DirectoryEntry)
      objects,
    - rename IsochroneNode.date as 'dbdate' and clarify its semantics,
    - attempt to document (comments) a bit more the algorithm and semantics
      of several attributes/variables used in there.

commit 9d110b93e9c39d65bf2986b148c4bf3467b0efa3
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed May 19 16:16:53 2021 +0200

    Replace 'DirectoryEntry.ls()' method by 'files' and 'dirs' properties
    
    and make the retrieval of children from the archive explicit in a
    dedicated retrieve_children() method.

commit fcfbb250e688a4ade6849522714832ec49238a8d
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed May 19 16:14:41 2021 +0200

    Add __str__ methods to RevisionEntry, DirectoryEntry and FileEntry
    
    to ease logging and debugging.

commit 1f823ac01491ee0f27eac685d32322f8558c26bc
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed May 12 12:44:07 2021 +0200

    Improve a bit the code of ProvenanceDBBase

commit cb623cb0e7dd9a2a568b6d2645e89c4d86ba0a66
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed May 12 10:30:23 2021 +0200

    Add a test for the build_isochrone_graph() function
    
    this test is far from ideal, since it's mostly the record of what happens
    during a "known good" session of revision insertions, but at least it
    should allow refactoring code related to the isochrone graph computation
    with a bit more confidence...

See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/62/ for more details.

Why did you commit cmdbts2.msgpack instead of regenerating it? Is it too slow?

swh/provenance/tests/data/README.md
8–14 ↗(On Diff #20768)
19–37 ↗(On Diff #20768)

I like this, but shouldn't this new DSL be defined outside swh-provenance and be used to generate model objects directly?

We could use it to replace the dependency on swh-loader-git in swh-web and swh-vault's tests.

Build is green

Patch application report for D5805 (id=20787)

Rebasing onto 08344d3f76...

Current branch diff-target is up to date.
Changes applied before test
commit 9854a75c8f5426836c561bd9c1b9bad7c85494e0
Author: David Douard <david.douard@sdfa3.org>
Date:   Tue Jun 1 11:47:09 2021 +0200

    Remove test_provenance_heuristics_CMDBTS test
    
    since it's redundant with the cmdbts2 test, now generated from a simple
    yaml file rather than depending on the original CMDBTS git repo on
    github.
    
    The CMDBTS dataset (CMDBTS.msgpack) is kept for now since it's still
    used for other tests (e.g. test_provenance_db).

commit 4f7b0eadd10c55318f64688abfe391ead4bcc3af
Author: David Douard <david.douard@sdfa3.org>
Date:   Mon May 31 16:51:01 2021 +0200

    Add a new (git) dataset generation scaffolding for tests
    
    and use it to generate a 'cmdbts2' test case strictly equivalent
    to the CMDBTS repo.
    
    See the swh/provenance/tests/data/README.md file for more details.
    
    Note: this aims at making it easier to write more test cases than when
    depending on the CMDBTS git repo on github. For example, a new test case
    should come soon for situations like 'out-of-order' revisions.

commit fd373add1762c515070d39dc1cc1b58c09d3e8e4
Author: David Douard <david.douard@sdfa3.org>
Date:   Mon May 31 16:45:48 2021 +0200

    Remove test_provenance_heuristics from the ArchiveStorage tests
    
    because it's not that meaningful.

See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/69/ for more details.

Why did you commit cmdbts2.msgpack instead of regenerating it? Is it too slow?

It comes from earlier versions of the testing scaffolding, where I added this msgpack file to prevent the test from having to git clone the CMDBTS git repo from github. I am thinking about automating the whole process described in this README file a bit more, but well, I'd like this to land first...

swh/provenance/tests/data/README.md
19–37 ↗(On Diff #20768)

good thinking... can we add a task and do it later ? :-)

Concerning the dependencies, well I do use swh-loader-git in generate_storage_from_git.py (so it's a dependency in requirements-tests.txt), so...

This revision is now accepted and ready to land. Jun 3 2021, 4:00 PM

allow inline comments in a synth file

Build is green

Patch application report for D5805 (id=20801)

Rebasing onto 08344d3f76...

Current branch diff-target is up to date.
Changes applied before test
commit 19ef0ba9f5c86d26b595aaa5dc64390994551b64
Author: David Douard <david.douard@sdfa3.org>
Date:   Tue Jun 1 11:47:09 2021 +0200

    Remove test_provenance_heuristics_CMDBTS test
    
    since it's redundant with the cmdbts2 test, now generated from a simple
    yaml file rather than depending on the original CMDBTS git repo on
    github.
    
    The CMDBTS dataset (CMDBTS.msgpack) is kept for now since it's still
    used for other tests (e.g. test_provenance_db).

commit f534e558645ecc9384dfb1781e94266feac683f1
Author: David Douard <david.douard@sdfa3.org>
Date:   Mon May 31 16:51:01 2021 +0200

    Add a new (git) dataset generation scaffolding for tests
    
    and use it to generate a 'cmdbts2' test case strictly equivalent
    to the CMDBTS repo.
    
    See the swh/provenance/tests/data/README.md file for more details.
    
    Note: this aims at making it easier to write more test cases than when
    depending on the CMDBTS git repo on github. For example, a new test case
    should come soon for situations like 'out-of-order' revisions.

commit fd373add1762c515070d39dc1cc1b58c09d3e8e4
Author: David Douard <david.douard@sdfa3.org>
Date:   Mon May 31 16:45:48 2021 +0200

    Remove test_provenance_heuristics from the ArchiveStorage tests
    
    because it's not that meaningful.

See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/77/ for more details.

fix a "typo" reported by vlorentz (I missed before)

Build is green

Patch application report for D5805 (id=20806)

Rebasing onto 08344d3f76...

Current branch diff-target is up to date.
Changes applied before test
commit 5f9e5b53a5b0547cfdbe1676e2b09648f4f359f1
Author: David Douard <david.douard@sdfa3.org>
Date:   Tue Jun 1 11:47:09 2021 +0200

    Remove test_provenance_heuristics_CMDBTS test
    
    since it's redundant with the cmdbts2 test, now generated from a simple
    yaml file rather than depending on the original CMDBTS git repo on
    github.
    
    The CMDBTS dataset (CMDBTS.msgpack) is kept for now since it's still
    used for other tests (e.g. test_provenance_db).

commit 9fe096b3905495bff534649f2e5e0ecb8802217d
Author: David Douard <david.douard@sdfa3.org>
Date:   Mon May 31 16:51:01 2021 +0200

    Add a new (git) dataset generation scaffolding for tests
    
    and use it to generate a 'cmdbts2' test case strictly equivalent
    to the CMDBTS repo.
    
    See the swh/provenance/tests/data/README.md file for more details.
    
    Note: this aims at making it easier to write more test cases than when
    depending on the CMDBTS git repo on github. For example, a new test case
    should come soon for situations like 'out-of-order' revisions.

commit fd373add1762c515070d39dc1cc1b58c09d3e8e4
Author: David Douard <david.douard@sdfa3.org>
Date:   Mon May 31 16:45:48 2021 +0200

    Remove test_provenance_heuristics from the ArchiveStorage tests
    
    because it's not that meaningful.

See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/81/ for more details.

aeviso added a subscriber: aeviso.
aeviso added inline comments.
swh/provenance/tests/test_provenance_heuristics.py
61 ↗(On Diff #20806)

is this SQL syntax actually joining the tables?

swh/provenance/tests/test_provenance_heuristics.py
61 ↗(On Diff #20806)

Should do:

fullset=# EXPLAIN 
SELECT encode(src.sha1::bytea, 'hex'),encode(dst.sha1::bytea, 'hex'),encode(location.path::bytea, 'escape') 
FROM content_early_in_rev as rel, content as src, revision as dst, location 
WHERE rel.blob=src.id AND rel.rev=dst.id AND rel.loc=location.id;

 Hash Join  (cost=699125.84..9540426.01 rows=110655704 width=96)
   Hash Cond: (rel.loc = location.id)
   ->  Hash Join  (cost=438373.68..5934494.63 rows=110655704 width=50)
         Hash Cond: (rel.rev = dst.id)
         ->  Hash Join  (cost=160071.36..3586227.92 rows=110655704 width=37)
               Hash Cond: (rel.blob = src.id)
               ->  Seq Scan on content_early_in_rev rel  (cost=0.00..1811401.04 rows=110655704 width=24)
               ->  Hash  (cost=82185.16..82185.16 rows=4028016 width=29)
                     ->  Seq Scan on content src  (cost=0.00..82185.16 rows=4028016 width=29)
         ->  Hash  (cost=135469.03..135469.03 rows=7386903 width=29)
               ->  Seq Scan on revision dst  (cost=0.00..135469.03 rows=7386903 width=29)
   ->  Hash  (cost=123266.96..123266.96 rows=5915296 width=59)
         ->  Seq Scan on location  (cost=0.00..123266.96 rows=5915296 width=59)
 JIT:
   Functions: 27
   Options: Inlining true, Optimization true, Expressions true, Deforming true
(16 rows)

fullset=# EXPLAIN 
SELECT encode(src.sha1::bytea, 'hex'), encode(dst.sha1::bytea, 'hex'), encode(location.path::bytea, 'escape') 
FROM content_early_in_rev as rel 
INNER JOIN content as src on (rel.blob=src.id) 
INNER JOIN revision as dst ON (rel.rev=dst.id) 
INNER JOIN location ON (rel.loc=location.id);

 Hash Join  (cost=699125.84..9540426.01 rows=110655704 width=96)
   Hash Cond: (rel.loc = location.id)
   ->  Hash Join  (cost=438373.68..5934494.63 rows=110655704 width=50)
         Hash Cond: (rel.rev = dst.id)
         ->  Hash Join  (cost=160071.36..3586227.92 rows=110655704 width=37)
               Hash Cond: (rel.blob = src.id)
               ->  Seq Scan on content_early_in_rev rel  (cost=0.00..1811401.04 rows=110655704 width=24)
               ->  Hash  (cost=82185.16..82185.16 rows=4028016 width=29)
                     ->  Seq Scan on content src  (cost=0.00..82185.16 rows=4028016 width=29)
         ->  Hash  (cost=135469.03..135469.03 rows=7386903 width=29)
               ->  Seq Scan on revision dst  (cost=0.00..135469.03 rows=7386903 width=29)
   ->  Hash  (cost=123266.96..123266.96 rows=5915296 width=59)
         ->  Seq Scan on location  (cost=0.00..123266.96 rows=5915296 width=59)
 JIT:
   Functions: 27
   Options: Inlining true, Optimization true, Expressions true, Deforming true
(16 rows)

swh/provenance/tests/test_provenance_heuristics.py
61 ↗(On Diff #20806)

Note that it's an optimization of Postgres' query planner. More naive SQL engines would do a cartesian product and filter.
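
To make the difference between the two execution strategies concrete, here is a minimal sketch in plain Python. It is not how Postgres is implemented; the tiny tables and the `naive_join` / `hash_join` helpers are hypothetical and only illustrate "cartesian product then filter" versus "hash the join keys, then probe".

```python
from itertools import product

# Hypothetical miniature tables: (id, value) rows for content and revision,
# and (blob, rev) rows for the relation table.
content = [(1, "c1"), (2, "c2")]
revision = [(10, "r1"), (20, "r2")]
rel = [(1, 10), (2, 10), (2, 20)]


def naive_join(rel, content, revision):
    """Cartesian product of the three tables, then filter on the join keys."""
    return sorted(
        (c_val, r_val)
        for (blob, rev), (c_id, c_val), (r_id, r_val) in product(
            rel, content, revision
        )
        if blob == c_id and rev == r_id
    )


def hash_join(rel, content, revision):
    """Build hash tables on the join keys, then probe them once per rel row."""
    content_by_id = dict(content)
    revision_by_id = dict(revision)
    return sorted(
        (content_by_id[blob], revision_by_id[rev])
        for blob, rev in rel
        if blob in content_by_id and rev in revision_by_id
    )
```

Both helpers return the same rows; the naive one examines len(rel) * len(content) * len(revision) combinations, while the hash-based one does a single pass over `rel` after building the two lookup tables.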

swh/provenance/tests/test_provenance_heuristics.py
61 ↗(On Diff #20806)

exactly, but thank g*d we only use postgres :-)

swh/provenance/tests/test_provenance_heuristics.py
61 ↗(On Diff #20806)

Also, since it's used for tests only, we really don't care if the engine actually does the cartesian product here, but yeah, thx postgres anyway :-)

swh/provenance/tests/test_provenance_heuristics.py
61 ↗(On Diff #20806)

shouldn't we use it for the first/all occurrence(s) query as well? I mean in the provenance backend

swh/provenance/tests/test_provenance_heuristics.py
61 ↗(On Diff #20806)

possibly, I'll have a look

swh/provenance/tests/test_provenance_heuristics.py
61 ↗(On Diff #20806)

actually the way it's now written in the provenance backend (using inner join) is better since it does not depend on the sql backend being smart. So let's stick to the way it is there.
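
For reference, a small self-contained check that the two spellings discussed in this thread return the same rows. This is only a sketch: it uses an in-memory SQLite database rather than Postgres, and the simplified schema and sample rows are assumptions made up for the example (only the table and column names are taken from the queries above).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified, hypothetical schema reusing the table names from the thread.
conn.executescript(
    """
    CREATE TABLE content (id INTEGER PRIMARY KEY, sha1 TEXT);
    CREATE TABLE revision (id INTEGER PRIMARY KEY, sha1 TEXT);
    CREATE TABLE location (id INTEGER PRIMARY KEY, path TEXT);
    CREATE TABLE content_early_in_rev (blob INTEGER, rev INTEGER, loc INTEGER);
    INSERT INTO content VALUES (1, 'c1'), (2, 'c2');
    INSERT INTO revision VALUES (10, 'r1'), (20, 'r2');
    INSERT INTO location VALUES (100, 'a/x'), (200, 'b/y');
    INSERT INTO content_early_in_rev
        VALUES (1, 10, 100), (2, 10, 200), (2, 20, 100);
    """
)

# Comma-separated FROM list, join conditions in the WHERE clause.
IMPLICIT = """
    SELECT src.sha1, dst.sha1, location.path
    FROM content_early_in_rev AS rel, content AS src, revision AS dst, location
    WHERE rel.blob = src.id AND rel.rev = dst.id AND rel.loc = location.id
    ORDER BY 1, 2, 3
"""

# Same query written with explicit INNER JOIN ... ON clauses.
EXPLICIT = """
    SELECT src.sha1, dst.sha1, location.path
    FROM content_early_in_rev AS rel
    INNER JOIN content AS src ON (rel.blob = src.id)
    INNER JOIN revision AS dst ON (rel.rev = dst.id)
    INNER JOIN location ON (rel.loc = location.id)
    ORDER BY 1, 2, 3
"""

implicit_rows = conn.execute(IMPLICIT).fetchall()
explicit_rows = conn.execute(EXPLICIT).fetchall()
```

Comparing `implicit_rows` with `explicit_rows` confirms that both forms describe the same join; the explicit form simply states the join conditions next to the tables they apply to.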