Page MenuHomeSoftware Heritage

Add support for a denormalized version of the provenance DB
ClosedPublic

Authored by douardda on Jun 10 2021, 10:38 AM.

Details

Summary

only the content_early_in_revision table is implemented so far.

Depends on D5842

Diff Detail

Repository
rDPROV Provenance database
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

Build is green

Patch application report for D5843 (id=20896)

Could not rebase; Attempt merge onto 6cdd424eba...

Updating 6cdd424..a3d3aa5
Fast-forward
 swh/provenance/__init__.py                         |  18 +-
 swh/provenance/postgresql/provenancedb.py          | 455 +++++++++++++++++++++
 swh/provenance/postgresql/provenancedb_base.py     | 325 ---------------
 .../postgresql/provenancedb_with_path.py           | 157 -------
 .../postgresql/provenancedb_without_path.py        | 140 -------
 swh/provenance/provenance.py                       |   3 +-
 swh/provenance/sql/15-flavor.sql                   |   6 +-
 swh/provenance/sql/30-schema.sql                   |  32 +-
 swh/provenance/sql/60-indexes.sql                  |  13 +-
 swh/provenance/tests/conftest.py                   |  25 +-
 swh/provenance/tests/test_cli.py                   |  32 +-
 11 files changed, 500 insertions(+), 706 deletions(-)
 create mode 100644 swh/provenance/postgresql/provenancedb.py
 delete mode 100644 swh/provenance/postgresql/provenancedb_base.py
 delete mode 100644 swh/provenance/postgresql/provenancedb_with_path.py
 delete mode 100644 swh/provenance/postgresql/provenancedb_without_path.py
Changes applied before test
commit a3d3aa5ff90e081c85aeb3408959cf97f595e665
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Jun 9 16:59:39 2021 +0200

    Add support for a denormalized version of the provenance DB

commit ac1b33b66ebccff3d5e2a2280e1c7446e8fa087a
Author: David Douard <david.douard@sdfa3.org>
Date:   Thu Jun 10 10:26:38 2021 +0200

    Simplify the ProvenanceDB.insert_all() method
    
    factorize insertions in content, revision and directory tables.

commit 6b2b97ac23fe43146d4964a56806d5ce9f726c06
Author: David Douard <david.douard@sdfa3.org>
Date:   Thu Jun 10 09:13:59 2021 +0200

    Refactor the provenanceDB.insert_location() method
    
    simplify the code and reduce it to a couple of INSERT queries (one for
    locations, one for the dst_table).

commit fe35120741d76ff4d91d82bd1db029ff90ce8d60
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Jun 9 14:55:54 2021 +0200

    Remove the without-path flavor of ProvenanceDB

commit e23832b21ad4ee7afcb56f98147e51f633b6c2d7
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Jun 9 10:27:32 2021 +0200

    Refactor the cache handling in ProvenanceDB
    
    - use TypedDict structures to properly type the caches needed by the
      ProvenanceDB objects,
    - use only one cache plus a set of added (and eventually removed) ids of
      objects (within the cache) for revisisons, contents and directories.

See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/104/ for more details.

Build is green

Patch application report for D5843 (id=20924)

Could not rebase; Attempt merge onto 075b0d6cd6...

Updating 075b0d6..b692ca1
Fast-forward
 swh/provenance/__init__.py                         |  18 +-
 swh/provenance/postgresql/provenancedb.py          | 455 +++++++++++++++++++++
 swh/provenance/postgresql/provenancedb_base.py     | 325 ---------------
 .../postgresql/provenancedb_with_path.py           | 157 -------
 .../postgresql/provenancedb_without_path.py        | 140 -------
 swh/provenance/provenance.py                       |   3 +-
 swh/provenance/sql/15-flavor.sql                   |   6 +-
 swh/provenance/sql/30-schema.sql                   |  32 +-
 swh/provenance/sql/60-indexes.sql                  |  13 +-
 swh/provenance/tests/conftest.py                   |  25 +-
 swh/provenance/tests/test_cli.py                   |  32 +-
 11 files changed, 500 insertions(+), 706 deletions(-)
 create mode 100644 swh/provenance/postgresql/provenancedb.py
 delete mode 100644 swh/provenance/postgresql/provenancedb_base.py
 delete mode 100644 swh/provenance/postgresql/provenancedb_with_path.py
 delete mode 100644 swh/provenance/postgresql/provenancedb_without_path.py
Changes applied before test
commit b692ca10b6fd7d0aeded8a3717416c7d2b496ae1
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Jun 9 16:59:39 2021 +0200

    Add support for a denormalized version of the provenance DB

commit aa3b89d7e98462d14d5ce3b13a79c90bb2398adf
Author: David Douard <david.douard@sdfa3.org>
Date:   Thu Jun 10 10:26:38 2021 +0200

    Simplify the ProvenanceDB.insert_all() method
    
    factorize insertions in content, revision and directory tables.

commit 3e424af6b3d65daedf9f1923d2b214cb57676abe
Author: David Douard <david.douard@sdfa3.org>
Date:   Thu Jun 10 09:13:59 2021 +0200

    Refactor the provenanceDB.insert_location() method
    
    simplify the code and reduce it to a couple of INSERT queries (one for
    locations, one for the dst_table).

commit 4c50588e85be58c0d17d0e55d3ebb0facc3ee173
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Jun 9 14:55:54 2021 +0200

    Remove the without-path flavor of ProvenanceDB

commit 8aff35d251db39537a3a4bd14f98783dc06ebdc9
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Jun 9 10:27:32 2021 +0200

    Refactor the cache handling in ProvenanceDB
    
    - use TypedDict structures to properly type the caches needed by the
      ProvenanceDB objects,
    - use only one cache plus a set of added (and eventually removed) ids of
      objects (within the cache) for revisisons, contents and directories.

See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/116/ for more details.

rebase, adapt and implement denormalization for content_in_dir and dir_in_rev

douardda retitled this revision from [WIP] Add support for a denormalized version of the provenance DB to Add support for a denormalized version of the provenance DB.Jun 23 2021, 11:04 AM

yes I know, names for the subqueries are horrible...

yes I know, names for the subqueries are horrible...

also it might be possible to make them simpler...

Build is green

Patch application report for D5843 (id=21215)

Rebasing onto d892b29e40...

Current branch diff-target is up to date.
Changes applied before test
commit 91e902ca6935d3ce99d6815c5a60afd7c746a3ee
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Jun 16 14:35:50 2021 +0200

    Add support for a denormalized version of the provenance DB
    
    in db schema, relation tables (xxx_in_yyy) are denormalized, meaning the
    yyy relation (and the location, if any) are stored as arrays.

See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/187/ for more details.

reword a bit the ci message and kill a few tabs in 30-schema.sql

Build is green

Patch application report for D5843 (id=21216)

Rebasing onto d892b29e40...

Current branch diff-target is up to date.
Changes applied before test
commit b2807328360f37d899c425e95d281e8af0a61098
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Jun 16 14:35:50 2021 +0200

    Add support for a denormalized version of the provenance DB
    
    in db schema, relation tables (xxx_in_yyy) are denormalized, meaning the
    yyy relation (and the location, if any) are stored as arrays.
    
    Denormalized schema is chosen at db creation time using one of the 2
    "-denormalized" flavors (aka "with-path-denormalized" or
    "without-path-denormalized").

See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/188/ for more details.

Also, at some point we might want to use better templating to write these SQL queries, or use stored procedures (with the proper "variation" being chosen at db creation time on the selected flavor; would simplify the python code a lot.

Should this be documented somewhere? (How to use it / why)

swh/provenance/postgresql/provenancedb_without_path.py
56โ€“113

The 1-space indentation isn't great for readability.

Can you make it use 4-spaces, and also parenthesize like Black would?

120โ€“121

caps

Should this be documented somewhere? (How to use it / why)

You are right, a bit of doc somewhere would not be superfluous...

swh/provenance/postgresql/provenancedb_without_path.py
56โ€“113

Actually my idea is to move all these SQL code the DB as (PL/)SQL functions...

Build is green

Patch application report for D5843 (id=21726)

Rebasing onto 1ae32c0a61...

Current branch diff-target is up to date.
Changes applied before test
commit 2f454a94987419e52f595bf5ff9ddf4a15b0cf9c
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Jun 16 14:35:50 2021 +0200

    Add support for a denormalized version of the provenance DB
    
    in db schema, relation tables (xxx_in_yyy) are denormalized, meaning the
    yyy relation (and the location, if any) are stored as arrays.
    
    Denormalized schema is chosen at db creation time using one of the 2
    "-denormalized" flavors (aka "with-path-denormalized" or
    "without-path-denormalized").

See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/263/ for more details.

It's not clear to me how the denormalized version handles the insertion of duplicated entries.

swh/provenance/postgresql/provenancedb_without_path.py
114

Capitalization: as -> AS (in other places as well)

swh/provenance/sql/15-flavor.sql
4

I don't really like having this a flavors. It makes the code more cryptic and difficult to follow. I would rather have different backends if necessary, with their own schema.

aeviso requested changes to this revision.Jul 27 2021, 12:53 PM
This revision now requires changes to proceed.Jul 27 2021, 12:53 PM

It's not clear to me how the denormalized version handles the insertion of duplicated entries.

swh/provenance/postgresql/provenancedb_without_path.py
114

Capitalization: as -> AS (in other places as well)

ok but you did notice that the next revision moves all this in sql functions so it disappear from here, right?

swh/provenance/sql/15-flavor.sql
4

I don't really like having this a flavors. It makes the code more cryptic and difficult to follow. I would rather have different backends if necessary, with their own schema.

Tthe code was cryptic before, it's slightly more so now, agreed, that's why in the end I'd like to remove this in favor of stored SQL functions.

There is no need to have different backends, we just need to just store as much specificity of a flavor in the database itself as possible; the python logic if then the exact (well mostly) same.

Build is green

Patch application report for D5843 (id=21795)

Rebasing onto 1ae32c0a61...

Current branch diff-target is up to date.
Changes applied before test
commit 0d3aaffcc4b38d2aca65f85fed4d9b95f119a47a
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Jun 16 14:35:50 2021 +0200

    Add support for a denormalized version of the provenance DB
    
    in db schema, relation tables (xxx_in_yyy) are denormalized, meaning the
    yyy relation (and the location, if any) are stored as arrays.
    
    Denormalized schema is chosen at db creation time using one of the 2
    "-denormalized" flavors (aka "with-path-denormalized" or
    "without-path-denormalized").

See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/268/ for more details.

It's not clear to me how the denormalized version handles the insertion of duplicated entries.

It's something I am still trying to figure also (whether this code performs as expected under heavy concurrent workload). I want to make more tests (by hand, this is hard to implement as a "unit" test) ASAP.

It's not clear to me how the denormalized version handles the insertion of duplicated entries.

It's something I am still trying to figure also (whether this code performs as expected under heavy concurrent workload). I want to make more tests (by hand, this is hard to implement as a "unit" test) ASAP.

Just run a series of tests: I'm importing the sample_10k dataset using 16 concurrent workers in 2 databases (with and without denormalization), then I compare the content of the databases by comparing the result of content_find_all for all content objects. The results are exactly the same.
So it looks to be ok.

(I've also played a bit more with locking, I'll this in a future diff).

It's not clear to me how the denormalized version handles the insertion of duplicated entries.

It's something I am still trying to figure also (whether this code performs as expected under heavy concurrent workload). I want to make more tests (by hand, this is hard to implement as a "unit" test) ASAP.

Just run a series of tests: I'm importing the sample_10k dataset using 16 concurrent workers in 2 databases (with and without denormalization), then I compare the content of the databases by comparing the result of content_find_all for all content objects. The results are exactly the same.
So it looks to be ok.

(I've also played a bit more with locking, I'll this in a future diff).

I'm not sure we are talking about the same here...

Comparing via content_find_all will check that the semantics of the db reaming the same (at least from the final user point of view), but will guarantee there are no (unnecessary) duplicated entries in the db. content_find_all will filter duplicated results in the end.

With a 10k revisions dataset, inserting duplicated entries might not make a big difference, but when processing millions of revision with several clients this can make the size of the db explode. That's why it is important not to store duplicated information in the first place (potential consistency issues aside).

Build is green

Patch application report for D5843 (id=21831)

Rebasing onto 509280132c...

Current branch diff-target is up to date.
Changes applied before test
commit 1c3d6426ebd2d1e4b00a50888b2b3eead5b8eab3
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Jun 16 14:35:50 2021 +0200

    Add support for a denormalized version of the provenance DB
    
    in db schema, relation tables (xxx_in_yyy) are denormalized, meaning the
    yyy relation (and the location, if any) are stored as arrays.
    
    Denormalized schema is chosen at db creation time using one of the 2
    "-denormalized" flavors (aka "with-path-denormalized" or
    "without-path-denormalized").

See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/275/ for more details.

I agree some more tests and validations needs to be done on this storage schema, but can we please land it for now as is? I've put a warning in the documentation (in D6031) to point the fact this flavor is not "production ready". cc @aeviso

This revision was not accepted when it landed; it landed in state Needs Review.Aug 6 2021, 10:58 AM
This revision was automatically updated to reflect the committed changes.