⚙ D5843 Add support for a denormalized version of the provenance DB

douardda created this revision.Jun 10 2021, 10:38 AM

Herald added a reviewer: Reviewers. · View Herald TranscriptJun 10 2021, 10:38 AM

Build is green

Patch application report for D5843 (id=20896)

Could not rebase; Attempt merge onto 6cdd424eba...

Updating 6cdd424..a3d3aa5
Fast-forward
 swh/provenance/__init__.py                         |  18 +-
 swh/provenance/postgresql/provenancedb.py          | 455 +++++++++++++++++++++
 swh/provenance/postgresql/provenancedb_base.py     | 325 ---------------
 .../postgresql/provenancedb_with_path.py           | 157 -------
 .../postgresql/provenancedb_without_path.py        | 140 -------
 swh/provenance/provenance.py                       |   3 +-
 swh/provenance/sql/15-flavor.sql                   |   6 +-
 swh/provenance/sql/30-schema.sql                   |  32 +-
 swh/provenance/sql/60-indexes.sql                  |  13 +-
 swh/provenance/tests/conftest.py                   |  25 +-
 swh/provenance/tests/test_cli.py                   |  32 +-
 11 files changed, 500 insertions(+), 706 deletions(-)
 create mode 100644 swh/provenance/postgresql/provenancedb.py
 delete mode 100644 swh/provenance/postgresql/provenancedb_base.py
 delete mode 100644 swh/provenance/postgresql/provenancedb_with_path.py
 delete mode 100644 swh/provenance/postgresql/provenancedb_without_path.py

Changes applied before test

commit a3d3aa5ff90e081c85aeb3408959cf97f595e665
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Jun 9 16:59:39 2021 +0200

    Add support for a denormalized version of the provenance DB

commit ac1b33b66ebccff3d5e2a2280e1c7446e8fa087a
Author: David Douard <david.douard@sdfa3.org>
Date:   Thu Jun 10 10:26:38 2021 +0200

    Simplify the ProvenanceDB.insert_all() method
    
    factorize insertions in content, revision and directory tables.

commit 6b2b97ac23fe43146d4964a56806d5ce9f726c06
Author: David Douard <david.douard@sdfa3.org>
Date:   Thu Jun 10 09:13:59 2021 +0200

    Refactor the provenanceDB.insert_location() method
    
    simplify the code and reduce it to a couple of INSERT queries (one for
    locations, one for the dst_table).

commit fe35120741d76ff4d91d82bd1db029ff90ce8d60
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Jun 9 14:55:54 2021 +0200

    Remove the without-path flavor of ProvenanceDB

commit e23832b21ad4ee7afcb56f98147e51f633b6c2d7
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Jun 9 10:27:32 2021 +0200

    Refactor the cache handling in ProvenanceDB
    
    - use TypedDict structures to properly type the caches needed by the
      ProvenanceDB objects,
    - use only one cache plus a set of added (and eventually removed) ids of
      objects (within the cache) for revisisons, contents and directories.

See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/104/ for more details.

Harbormaster completed remote builds in B21879: Diff 20896.Jun 10 2021, 10:41 AM

douardda requested review of this revision.Jun 10 2021, 10:41 AM

rebase

Build is green

Patch application report for D5843 (id=20924)

Could not rebase; Attempt merge onto 075b0d6cd6...

Updating 075b0d6..b692ca1
Fast-forward
 swh/provenance/__init__.py                         |  18 +-
 swh/provenance/postgresql/provenancedb.py          | 455 +++++++++++++++++++++
 swh/provenance/postgresql/provenancedb_base.py     | 325 ---------------
 .../postgresql/provenancedb_with_path.py           | 157 -------
 .../postgresql/provenancedb_without_path.py        | 140 -------
 swh/provenance/provenance.py                       |   3 +-
 swh/provenance/sql/15-flavor.sql                   |   6 +-
 swh/provenance/sql/30-schema.sql                   |  32 +-
 swh/provenance/sql/60-indexes.sql                  |  13 +-
 swh/provenance/tests/conftest.py                   |  25 +-
 swh/provenance/tests/test_cli.py                   |  32 +-
 11 files changed, 500 insertions(+), 706 deletions(-)
 create mode 100644 swh/provenance/postgresql/provenancedb.py
 delete mode 100644 swh/provenance/postgresql/provenancedb_base.py
 delete mode 100644 swh/provenance/postgresql/provenancedb_with_path.py
 delete mode 100644 swh/provenance/postgresql/provenancedb_without_path.py

Changes applied before test

commit b692ca10b6fd7d0aeded8a3717416c7d2b496ae1
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Jun 9 16:59:39 2021 +0200

    Add support for a denormalized version of the provenance DB

commit aa3b89d7e98462d14d5ce3b13a79c90bb2398adf
Author: David Douard <david.douard@sdfa3.org>
Date:   Thu Jun 10 10:26:38 2021 +0200

    Simplify the ProvenanceDB.insert_all() method
    
    factorize insertions in content, revision and directory tables.

commit 3e424af6b3d65daedf9f1923d2b214cb57676abe
Author: David Douard <david.douard@sdfa3.org>
Date:   Thu Jun 10 09:13:59 2021 +0200

    Refactor the provenanceDB.insert_location() method
    
    simplify the code and reduce it to a couple of INSERT queries (one for
    locations, one for the dst_table).

commit 4c50588e85be58c0d17d0e55d3ebb0facc3ee173
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Jun 9 14:55:54 2021 +0200

    Remove the without-path flavor of ProvenanceDB

commit 8aff35d251db39537a3a4bd14f98783dc06ebdc9
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Jun 9 10:27:32 2021 +0200

    Refactor the cache handling in ProvenanceDB
    
    - use TypedDict structures to properly type the caches needed by the
      ProvenanceDB objects,
    - use only one cache plus a set of added (and eventually removed) ids of
      objects (within the cache) for revisisons, contents and directories.

See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/116/ for more details.

Harbormaster completed remote builds in B21906: Diff 20924.Jun 10 2021, 2:59 PM

rebase, adapt and implement denormalization for content_in_dir and dir_in_rev

douardda retitled this revision from [WIP] Add support for a denormalized version of the provenance DB to Add support for a denormalized version of the provenance DB.Jun 23 2021, 11:04 AM

yes I know, names for the subqueries are horrible...

In D5843#151350, @douardda wrote:

yes I know, names for the subqueries are horrible...

also it might be possible to make them simpler...

Build is green

Patch application report for D5843 (id=21215)

Rebasing onto d892b29e40...

Current branch diff-target is up to date.

Changes applied before test

commit 91e902ca6935d3ce99d6815c5a60afd7c746a3ee
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Jun 16 14:35:50 2021 +0200

    Add support for a denormalized version of the provenance DB
    
    in db schema, relation tables (xxx_in_yyy) are denormalized, meaning the
    yyy relation (and the location, if any) are stored as arrays.

See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/187/ for more details.

Harbormaster completed remote builds in B22202: Diff 21215.Jun 23 2021, 11:08 AM

reword a bit the ci message and kill a few tabs in 30-schema.sql

Build is green

Patch application report for D5843 (id=21216)

Rebasing onto d892b29e40...

Current branch diff-target is up to date.

Changes applied before test

commit b2807328360f37d899c425e95d281e8af0a61098
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Jun 16 14:35:50 2021 +0200

    Add support for a denormalized version of the provenance DB
    
    in db schema, relation tables (xxx_in_yyy) are denormalized, meaning the
    yyy relation (and the location, if any) are stored as arrays.
    
    Denormalized schema is chosen at db creation time using one of the 2
    "-denormalized" flavors (aka "with-path-denormalized" or
    "without-path-denormalized").

See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/188/ for more details.

Harbormaster completed remote builds in B22203: Diff 21216.Jun 23 2021, 11:16 AM

Also, at some point we might want to use better templating to write these SQL queries, or use stored procedures (with the proper "variation" being chosen at db creation time on the selected flavor; would simplify the python code a lot.

Should this be documented somewhere? (How to use it / why)

vlorentz added inline comments.Jun 28 2021, 1:22 PM

swh/provenance/postgresql/provenancedb_without_path.py
56–113	The 1-space indentation isn't great for readability. Can you make it use 4-spaces, and also parenthesize like Black would?
120–121	caps

In D5843#151992, @vlorentz wrote:

Should this be documented somewhere? (How to use it / why)

You are right, a bit of doc somewhere would not be superfluous...

swh/provenance/postgresql/provenancedb_without_path.py
56–113	Actually my idea is to move all these SQL code the DB as (PL/)SQL functions...

rebase

douardda added a child revision: D6015: Use stored SQL functions for content_find_{all,one}() and merge Provenance*DB classes in a single ProvenanceDB.Jul 21 2021, 3:10 PM

Build is green

Patch application report for D5843 (id=21726)

Rebasing onto 1ae32c0a61...

Current branch diff-target is up to date.

Changes applied before test

commit 2f454a94987419e52f595bf5ff9ddf4a15b0cf9c
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Jun 16 14:35:50 2021 +0200

    Add support for a denormalized version of the provenance DB
    
    in db schema, relation tables (xxx_in_yyy) are denormalized, meaning the
    yyy relation (and the location, if any) are stored as arrays.
    
    Denormalized schema is chosen at db creation time using one of the 2
    "-denormalized" flavors (aka "with-path-denormalized" or
    "without-path-denormalized").

See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/263/ for more details.

Harbormaster completed remote builds in B22694: Diff 21726.Jul 21 2021, 3:22 PM

It's not clear to me how the denormalized version handles the insertion of duplicated entries.

swh/provenance/postgresql/provenancedb_without_path.py
114	Capitalization: as -> AS (in other places as well)
swh/provenance/sql/15-flavor.sql
4	I don't really like having this a flavors. It makes the code more cryptic and difficult to follow. I would rather have different backends if necessary, with their own schema.

aeviso requested changes to this revision.Jul 27 2021, 12:53 PM

This revision now requires changes to proceed.Jul 27 2021, 12:53 PM

capitalize sql querie

In D5843#155564, @aeviso wrote:

It's not clear to me how the denormalized version handles the insertion of duplicated entries.

swh/provenance/postgresql/provenancedb_without_path.py
114	Capitalization: as -> AS (in other places as well) ok but you did notice that the next revision moves all this in sql functions so it disappear from here, right?
swh/provenance/sql/15-flavor.sql
4	I don't really like having this a flavors. It makes the code more cryptic and difficult to follow. I would rather have different backends if necessary, with their own schema. Tthe code was cryptic before, it's slightly more so now, agreed, that's why in the end I'd like to remove this in favor of stored SQL functions. There is no need to have different backends, we just need to just store as much specificity of a flavor in the database itself as possible; the python logic if then the exact (well mostly) same.

Build is green

Patch application report for D5843 (id=21795)

Rebasing onto 1ae32c0a61...

Current branch diff-target is up to date.

Changes applied before test

commit 0d3aaffcc4b38d2aca65f85fed4d9b95f119a47a
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Jun 16 14:35:50 2021 +0200

    Add support for a denormalized version of the provenance DB
    
    in db schema, relation tables (xxx_in_yyy) are denormalized, meaning the
    yyy relation (and the location, if any) are stored as arrays.
    
    Denormalized schema is chosen at db creation time using one of the 2
    "-denormalized" flavors (aka "with-path-denormalized" or
    "without-path-denormalized").

See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/268/ for more details.

Harbormaster completed remote builds in B22763: Diff 21795.Jul 27 2021, 4:38 PM

In D5843#155564, @aeviso wrote:

It's not clear to me how the denormalized version handles the insertion of duplicated entries.

It's something I am still trying to figure also (whether this code performs as expected under heavy concurrent workload). I want to make more tests (by hand, this is hard to implement as a "unit" test) ASAP.

In D5843#155651, @douardda wrote:

In D5843#155564, @aeviso wrote:

It's not clear to me how the denormalized version handles the insertion of duplicated entries.

It's something I am still trying to figure also (whether this code performs as expected under heavy concurrent workload). I want to make more tests (by hand, this is hard to implement as a "unit" test) ASAP.

Just run a series of tests: I'm importing the sample_10k dataset using 16 concurrent workers in 2 databases (with and without denormalization), then I compare the content of the databases by comparing the result of content_find_all for all content objects. The results are exactly the same.
So it looks to be ok.

(I've also played a bit more with locking, I'll this in a future diff).

In D5843#155897, @douardda wrote:

In D5843#155651, @douardda wrote:

In D5843#155564, @aeviso wrote:

It's not clear to me how the denormalized version handles the insertion of duplicated entries.

It's something I am still trying to figure also (whether this code performs as expected under heavy concurrent workload). I want to make more tests (by hand, this is hard to implement as a "unit" test) ASAP.

Just run a series of tests: I'm importing the sample_10k dataset using 16 concurrent workers in 2 databases (with and without denormalization), then I compare the content of the databases by comparing the result of content_find_all for all content objects. The results are exactly the same.
So it looks to be ok.

(I've also played a bit more with locking, I'll this in a future diff).

I'm not sure we are talking about the same here...

Comparing via content_find_all will check that the semantics of the db reaming the same (at least from the final user point of view), but will guarantee there are no (unnecessary) duplicated entries in the db. content_find_all will filter duplicated results in the end.

With a 10k revisions dataset, inserting duplicated entries might not make a big difference, but when processing millions of revision with several clients this can make the size of the db explode. That's why it is important not to store duplicated information in the first place (potential consistency issues aside).

rebase

Build is green

Patch application report for D5843 (id=21831)

Rebasing onto 509280132c...

Current branch diff-target is up to date.

Changes applied before test

commit 1c3d6426ebd2d1e4b00a50888b2b3eead5b8eab3
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Jun 16 14:35:50 2021 +0200

    Add support for a denormalized version of the provenance DB
    
    in db schema, relation tables (xxx_in_yyy) are denormalized, meaning the
    yyy relation (and the location, if any) are stored as arrays.
    
    Denormalized schema is chosen at db creation time using one of the 2
    "-denormalized" flavors (aka "with-path-denormalized" or
    "without-path-denormalized").

See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/275/ for more details.

Harbormaster completed remote builds in B22794: Diff 21831.Jul 28 2021, 3:16 PM

I agree some more tests and validations needs to be done on this storage schema, but can we please land it for now as is? I've put a warning in the documentation (in D6031) to point the fact this flavor is not "production ready". cc @aeviso

This revision was not accepted when it landed; it landed in state Needs Review.Aug 6 2021, 10:58 AM

Closed by commit rDPROV1c3d6426ebd2: Add support for a denormalized version of the provenance DB (authored by douardda). · Explain Why

This revision was automatically updated to reflect the committed changes.

douardda added a commit: rDPROV1c3d6426ebd2: Add support for a denormalized version of the provenance DB.

Add support for a denormalized version of the provenance DB
ClosedPublic
Actions

Details

Diff Detail

Event Timeline

Patch application report for D5843 (id=20896)

Changes applied before test

Patch application report for D5843 (id=20924)

Changes applied before test

Patch application report for D5843 (id=21215)

Changes applied before test

Patch application report for D5843 (id=21216)

Changes applied before test

Patch application report for D5843 (id=21726)

Changes applied before test

Patch application report for D5843 (id=21795)

Changes applied before test

Patch application report for D5843 (id=21831)

Changes applied before test

Revision Contents
Changeset List

Diff 21950

swh/provenance/init.py

swh/provenance/postgresql/provenancedb_base.py

swh/provenance/postgresql/provenancedb_with_path.py

swh/provenance/postgresql/provenancedb_without_path.py

swh/provenance/sql/15-flavor.sql

swh/provenance/sql/30-schema.sql

swh/provenance/sql/60-indexes.sql

swh/provenance/tests/conftest.py

swh/provenance/tests/test_provenance_db.py

Add support for a denormalized version of the provenance DBClosedPublicActions

Details

Diff Detail

Event Timeline

Patch application report for D5843 (id=20896)

Changes applied before test

Patch application report for D5843 (id=20924)

Changes applied before test

Patch application report for D5843 (id=21215)

Changes applied before test

Patch application report for D5843 (id=21216)

Changes applied before test

Patch application report for D5843 (id=21726)

Changes applied before test

Patch application report for D5843 (id=21795)

Changes applied before test

Patch application report for D5843 (id=21831)

Changes applied before test

Revision ContentsChangeset List

Diff 21950

swh/provenance/__init__.py

swh/provenance/postgresql/provenancedb_base.py

swh/provenance/postgresql/provenancedb_with_path.py

swh/provenance/postgresql/provenancedb_without_path.py

swh/provenance/sql/15-flavor.sql

swh/provenance/sql/30-schema.sql

swh/provenance/sql/60-indexes.sql

swh/provenance/tests/conftest.py

swh/provenance/tests/test_provenance_db.py

Add support for a denormalized version of the provenance DB
ClosedPublic
Actions

Revision Contents
Changeset List

swh/provenance/init.py