Page MenuHomeSoftware Heritage

Simplify DB queries in ProvenanceWithPathDB.content_find_(first|all)
ClosedPublic

Authored by douardda on May 25 2021, 3:21 PM.

Details

Summary

the queries should be exactly the same as before (query plans are the
same); just written (hopefully) in a bit more readable manner.

Depends on D5780.

Diff Detail

Repository
rDPROV Provenance database
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

Harbormaster returned this revision to the author for changes because remote builds failed.May 25 2021, 3:22 PM
Harbormaster failed remote builds in B21622: Diff 20658!
Harbormaster returned this revision to the author for changes because remote builds failed.May 25 2021, 3:28 PM
Harbormaster failed remote builds in B21627: Diff 20663!

Build is green

Patch application report for D5781 (id=20675)

Could not rebase; Attempt merge onto 77fce4e59d...

Updating 77fce4e..15301e7
Fast-forward
 swh/provenance/model.py                            |  21 ++-
 swh/provenance/postgresql/provenancedb_base.py     |  22 +--
 .../postgresql/provenancedb_with_path.py           | 115 ++++--------
 swh/provenance/provenance.py                       | 202 ++++++++++++---------
 swh/provenance/tests/test_provenance_db.py         | 169 ++++++++++++++++-
 5 files changed, 351 insertions(+), 178 deletions(-)
Changes applied before test
commit 15301e78ef2f3b305791d83b120f02a5d6192904
Author: David Douard <david.douard@sdfa3.org>
Date:   Tue May 25 14:31:51 2021 +0200

    Simplify DB queries in ProvenanceWithPathDB.content_find_(first|all)
    
    the queries should be exactly the same as before (query plans are the
    same); just written (hopefully) in a bit more readable manner.

commit a2fe6386596e89f3af8c2d5b00a0b8d8d20d82ce
Author: David Douard <david.douard@sdfa3.org>
Date:   Tue May 25 14:59:16 2021 +0200

    Add a test for content_find_all()

commit 489197686cfa447dc4ee0a827ef578cdc041b1d4
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri May 21 19:34:07 2021 +0200

    Replace ProvenanceDB.remove_cache by a dict of Set
    
    instead of a dict of dict, which was actually used as a set.

commit 2860fe6126de1804ab9190026545c935bbbbd99a
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed May 12 09:47:03 2021 +0200

    Refactor the isochrone graph computation
    
    attempt to simplify a bit this part of the code:
    
    - IsochroneNode are now only used for directories
    - FileEntry are stored in a new IsochroneNode.files attribute, so
    - IsochroneNode.children only stores IsochroneNode (thus DirectoryEntry)
      objects,
    - rename IsochroneNode.date as 'dbdate' and clarify its semantics

commit 8e4a2f69b53fd8ef74613509eb7a5f6707855a7a
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed May 19 16:16:53 2021 +0200

    Add 'ls_files()' and 'ls_dirs()' methods to the DirectoryEntry class
    
    to make it a bit easier to compute the isochrone graph (see following
    revisions).

commit 4ba05e92698d73a6a05078a969eea85d08cd5dca
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed May 19 16:14:41 2021 +0200

    Add __str__ methods to RevisionEntry, DirectoryEntry and FileEntry
    
    to ease logging and debugging.

commit fb0ef598657a9810deac43d495ab882718265543
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed May 12 12:44:07 2021 +0200

    Improve a bit the code of ProvenanceDBBase

commit edff905d0df269cb90246ef77b554b1ca58cbeef
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed May 12 10:30:23 2021 +0200

    Add a test for the build_isochrone_graph() function
    
    this test is far from ideal, since it's mostly the record of what happen
    during a "known good" session of revision insertions, but at least it
    should allow to refactor code related to the isochrone graph computation
    with a bit more confidence...

commit 113e11031aa5365a9244409a0cfe646061cb94f9
Author: David Douard <david.douard@sdfa3.org>
Date:   Tue May 11 15:33:57 2021 +0200

    Replace the 'dates' argument of IsochroneNode() by a simple 'date' one
    
    there is no need for passing a dict here, we only care about the date
    for the node being instanciated.

See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/37/ for more details.

ardumont added inline comments.
swh/provenance/postgresql/provenancedb_with_path.py
44

i'd align the sql keywords on the same line but yes, it's more readable ;)
(i found the left part hard to read, especially the last query...)

44

"left part" as in my diff view setup which shows stuff side-by-side (that is)

78

lgtm

swh/provenance/postgresql/provenancedb_with_path.py
76

don't know if it's intended or not but the name before was in_rev not in_dir (that may be fine).

This revision is now accepted and ready to land.May 25 2021, 5:40 PM
swh/provenance/postgresql/provenancedb_with_path.py
37–47

You mean something like that?

76

content_in_rev is the alias name for (on of the) subquery; but the inner query does indeed join the content_in_dir table, so I think this is correct (which I tend to believe, since the tests I've added in D5780 should test this part of the code stays green).

78

thx

Build is green

Patch application report for D5781 (id=20736)

Could not rebase; Attempt merge onto 5aa0314dd7...

Updating 5aa0314..08d8dd6
Fast-forward
 swh/provenance/model.py                            |  21 ++-
 swh/provenance/postgresql/provenancedb_base.py     |  12 +-
 .../postgresql/provenancedb_with_path.py           | 115 ++++--------
 swh/provenance/provenance.py                       | 201 ++++++++++++---------
 swh/provenance/tests/test_provenance_db.py         | 171 +++++++++++++++++-
 5 files changed, 344 insertions(+), 176 deletions(-)
Changes applied before test
commit 08d8dd6478836ff4ab1c00c67f553b6d705b5a9c
Author: David Douard <david.douard@sdfa3.org>
Date:   Tue May 25 14:31:51 2021 +0200

    Simplify DB queries in ProvenanceWithPathDB.content_find_(first|all)
    
    the queries should be exactly the same as before (query plans are the
    same); just written (hopefully) in a bit more readable manner.

commit fd43523fd594e70ccd002827d379321f52c2b6da
Author: David Douard <david.douard@sdfa3.org>
Date:   Tue May 25 14:59:16 2021 +0200

    Add a test for content_find_all()

commit d85f2b0ee48aefe03ad32311623e5390f43d7261
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed May 12 09:47:03 2021 +0200

    Refactor the isochrone graph computation
    
    attempt to simplify a bit this part of the code:
    
    - IsochroneNode are now only used for directories
    - FileEntry are stored in a new IsochroneNode.files attribute, so
    - IsochroneNode.children only stores IsochroneNode (thus DirectoryEntry)
      objects,
    - rename IsochroneNode.date as 'dbdate' and clarify its semantics

commit 31d833ec86bf041e100795e7796ce832d00450ef
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed May 19 16:16:53 2021 +0200

    Add 'ls_files()' and 'ls_dirs()' methods to the DirectoryEntry class
    
    to make it a bit easier to compute the isochrone graph (see following
    revisions).

commit 72644b98a218132c0b173f360c503438688ecebb
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed May 19 16:14:41 2021 +0200

    Add __str__ methods to RevisionEntry, DirectoryEntry and FileEntry
    
    to ease logging and debugging.

commit a71041fbaf3f0d7ec3ea944cbbf04286c57d8b7e
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed May 12 12:44:07 2021 +0200

    Improve a bit the code of ProvenanceDBBase

commit defcb388ffba0869edb1a126b6626710c396c2ac
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed May 12 10:30:23 2021 +0200

    Add a test for the build_isochrone_graph() function
    
    this test is far from ideal, since it's mostly the record of what happen
    during a "known good" session of revision insertions, but at least it
    should allow to refactor code related to the isochrone graph computation
    with a bit more confidence...

See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/43/ for more details.

We are about to start refactoring the provenance backend. It would be nice to have this changes pushed, since re-basing them might by complicated after refactoring.

We are about to start refactoring the provenance backend. It would be nice to have this changes pushed, since re-basing them might by complicated after refactoring.

I'd be delighted, but I need you to accept the diff in the stack...

This "{merge} might by complicated after refactoring" is true for the entire stack of diff, so yes, I'd like very much to push this stack.

This diff is already accepted by ardumont, who knows SQL better than I do. There is no much for me to review here

This diff is already accepted by ardumont, who knows SQL better than I do. There is no much for me to review here

Yes but it is part of a diff stack. I cannot push it without reworking my stack of git revisions, which can be quite a painfull process; in this very case, it should be rather painless, BUT the ides was also to have this refactoring coming AFTER better testing has been added to the code so we can validate that this refactoring does not break the code.

This diff is already accepted by ardumont, who knows SQL better than I do. There is no much for me to review here

Also, allow me to emphasis that even is @ardumont already gave his "ok" on this, you can also give your own green or red light: you may see something neither I nor @ardumont saw. If you are ok with the diff, you are not obliged to
"accept" if it's already accepted. If you are not OK with it, them you have to "ask for rework" it. Nothing wrong with that.

Now the reason it's still not pushed in the main git repo is because it sits in the middle of a stack of diffs, some of which have not been accepted (especially some on which this diff depends). See my previous comment for the reason I have not "moved" it at the base of the stack of diff.

Build is green

Patch application report for D5781 (id=20752)

Could not rebase; Attempt merge onto 49e47c3ea7...

Merge made by the 'recursive' strategy.
 swh/provenance/model.py                            |  31 ++-
 swh/provenance/postgresql/provenancedb_base.py     |  12 +-
 .../postgresql/provenancedb_with_path.py           | 115 ++++-------
 swh/provenance/provenance.py                       | 220 ++++++++++++---------
 swh/provenance/tests/test_provenance_db.py         | 171 +++++++++++++++-
 5 files changed, 365 insertions(+), 184 deletions(-)
Changes applied before test
commit e242a69878f57c8b48227681dae76bee3ee01cf2
Merge: 49e47c3 16bab3c
Author: Jenkins user <jenkins@localhost>
Date:   Wed Jun 2 15:36:30 2021 +0000

    Merge branch 'diff-target' into HEAD

commit 16bab3c60c2a3a80782273f1aaff796826e7dc2c
Author: David Douard <david.douard@sdfa3.org>
Date:   Tue May 25 14:31:51 2021 +0200

    Simplify DB queries in ProvenanceWithPathDB.content_find_(first|all)
    
    the queries should be exactly the same as before (query plans are the
    same); just written (hopefully) in a bit more readable manner.

commit 024cc9ce93e545782a980f8e81d5d09651b8231b
Author: David Douard <david.douard@sdfa3.org>
Date:   Tue May 25 14:59:16 2021 +0200

    Add a test for content_find_all()

commit af15ad65f4a34e7703bfec80666102a6403cb505
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed May 12 09:47:03 2021 +0200

    Refactor the isochrone graph computation
    
    attempt to simplify a bit this part of the code:
    
    - IsochroneNode are now only used for directories
    - FileEntry are used directly from IsochroneNode.entry.files (no need
      for creating new FileEntry instances), so
    - IsochroneNode.children only stores IsochroneNode (thus DirectoryEntry)
      objects,
    - rename IsochroneNode.date as 'dbdate' and clarify its semantics,
    - attempt to document (comments) a bit more the algorithm and semantics
      of several attributes/variables used in there.

commit 1f49fdc967a2854d3a68dec34886b824fdf045f6
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed May 19 16:16:53 2021 +0200

    Replace 'DirectoryEntry.ls()' method by 'files' and 'dirs' properties
    
    and make the retrieval of children from the archive explicit in a
    dedicated retrieve_children() method.

commit 72644b98a218132c0b173f360c503438688ecebb
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed May 19 16:14:41 2021 +0200

    Add __str__ methods to RevisionEntry, DirectoryEntry and FileEntry
    
    to ease logging and debugging.

commit a71041fbaf3f0d7ec3ea944cbbf04286c57d8b7e
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed May 12 12:44:07 2021 +0200

    Improve a bit the code of ProvenanceDBBase

commit defcb388ffba0869edb1a126b6626710c396c2ac
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed May 12 10:30:23 2021 +0200

    Add a test for the build_isochrone_graph() function
    
    this test is far from ideal, since it's mostly the record of what happen
    during a "known good" session of revision insertions, but at least it
    should allow to refactor code related to the isochrone graph computation
    with a bit more confidence...

See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/52/ for more details.

Build is green

Patch application report for D5781 (id=20767)

Could not rebase; Attempt merge onto 49e47c3ea7...

Updating 49e47c3..de30f33
Fast-forward
 swh/provenance/model.py                            |  31 ++-
 swh/provenance/postgresql/provenancedb_base.py     |  12 +-
 .../postgresql/provenancedb_with_path.py           | 115 ++++-------
 swh/provenance/provenance.py                       | 220 ++++++++++++---------
 swh/provenance/tests/test_provenance_db.py         | 171 +++++++++++++++-
 5 files changed, 365 insertions(+), 184 deletions(-)
Changes applied before test
commit de30f332f219e4edb299bb50a0b808a779c57d85
Author: David Douard <david.douard@sdfa3.org>
Date:   Tue May 25 14:31:51 2021 +0200

    Simplify DB queries in ProvenanceWithPathDB.content_find_(first|all)
    
    the queries should be exactly the same as before (query plans are the
    same); just written (hopefully) in a bit more readable manner.

commit ee8e4b0b7ce6a85eac0665a916a37b2d63e3bb4d
Author: David Douard <david.douard@sdfa3.org>
Date:   Tue May 25 14:59:16 2021 +0200

    Add a test for content_find_all()

commit 94598b3ce8c49eb6dfe5308b47b74271a7f9d625
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed May 12 09:47:03 2021 +0200

    Refactor the isochrone graph computation
    
    attempt to simplify a bit this part of the code:
    
    - IsochroneNode are now only used for directories
    - FileEntry are used directly from IsochroneNode.entry.files (no need
      for creating new FileEntry instances), so
    - IsochroneNode.children only stores IsochroneNode (thus DirectoryEntry)
      objects,
    - rename IsochroneNode.date as 'dbdate' and clarify its semantics,
    - attempt to document (comments) a bit more the algorithm and semantics
      of several attributes/variables used in there.

commit 9d110b93e9c39d65bf2986b148c4bf3467b0efa3
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed May 19 16:16:53 2021 +0200

    Replace 'DirectoryEntry.ls()' method by 'files' and 'dirs' properties
    
    and make the retrieval of children from the archive explicit in a
    dedicated retrieve_children() method.

commit fcfbb250e688a4ade6849522714832ec49238a8d
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed May 19 16:14:41 2021 +0200

    Add __str__ methods to RevisionEntry, DirectoryEntry and FileEntry
    
    to ease logging and debugging.

commit 1f823ac01491ee0f27eac685d32322f8558c26bc
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed May 12 12:44:07 2021 +0200

    Improve a bit the code of ProvenanceDBBase

commit cb623cb0e7dd9a2a568b6d2645e89c4d86ba0a66
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed May 12 10:30:23 2021 +0200

    Add a test for the build_isochrone_graph() function
    
    this test is far from ideal, since it's mostly the record of what happen
    during a "known good" session of revision insertions, but at least it
    should allow to refactor code related to the isochrone graph computation
    with a bit more confidence...

See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/61/ for more details.

Build is green

Patch application report for D5781 (id=20784)

Could not rebase; Attempt merge onto 33eada55f0...

Updating 33eada5..08344d3
Fast-forward
 .../postgresql/provenancedb_with_path.py           | 115 ++++++---------------
 swh/provenance/tests/test_provenance_db.py         |  75 ++++++++++++++
 2 files changed, 109 insertions(+), 81 deletions(-)
Changes applied before test
commit 08344d3f76c000b2fc30452c271a8ce3caf95cdd
Author: David Douard <david.douard@sdfa3.org>
Date:   Tue May 25 14:31:51 2021 +0200

    Simplify DB queries in ProvenanceWithPathDB.content_find_(first|all)
    
    the queries should be exactly the same as before (query plans are the
    same); just written (hopefully) in a bit more readable manner.

commit ce700ffb9810d3607090d123b3b627e1f0cfd271
Author: David Douard <david.douard@sdfa3.org>
Date:   Tue May 25 14:59:16 2021 +0200

    Add a test for content_find_all()

See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/68/ for more details.