Only the content_early_in_revision table is implemented so far.
Depends on D5842
Differential D5843
Add support for a denormalized version of the provenance DB
Authored by douardda on Jun 10 2021, 10:38 AM.
Event Timeline

Comment Actions
Build is green
Patch application report for D5843 (id=20896)
Could not rebase; Attempt merge onto 6cdd424eba...
Updating 6cdd424..a3d3aa5
Fast-forward
 swh/provenance/__init__.py | 18 +-
 swh/provenance/postgresql/provenancedb.py | 455 +++++++++++++++++++++
 swh/provenance/postgresql/provenancedb_base.py | 325 ---------------
 .../postgresql/provenancedb_with_path.py | 157 -------
 .../postgresql/provenancedb_without_path.py | 140 -------
 swh/provenance/provenance.py | 3 +-
 swh/provenance/sql/15-flavor.sql | 6 +-
 swh/provenance/sql/30-schema.sql | 32 +-
 swh/provenance/sql/60-indexes.sql | 13 +-
 swh/provenance/tests/conftest.py | 25 +-
 swh/provenance/tests/test_cli.py | 32 +-
 11 files changed, 500 insertions(+), 706 deletions(-)
 create mode 100644 swh/provenance/postgresql/provenancedb.py
 delete mode 100644 swh/provenance/postgresql/provenancedb_base.py
 delete mode 100644 swh/provenance/postgresql/provenancedb_with_path.py
 delete mode 100644 swh/provenance/postgresql/provenancedb_without_path.py
Changes applied before test
commit a3d3aa5ff90e081c85aeb3408959cf97f595e665
Author: David Douard <david.douard@sdfa3.org>
Date: Wed Jun 9 16:59:39 2021 +0200
Add support for a denormalized version of the provenance DB
commit ac1b33b66ebccff3d5e2a2280e1c7446e8fa087a
Author: David Douard <david.douard@sdfa3.org>
Date: Thu Jun 10 10:26:38 2021 +0200
Simplify the ProvenanceDB.insert_all() method
factorize insertions in content, revision and directory tables.
commit 6b2b97ac23fe43146d4964a56806d5ce9f726c06
Author: David Douard <david.douard@sdfa3.org>
Date: Thu Jun 10 09:13:59 2021 +0200
Refactor the provenanceDB.insert_location() method
simplify the code and reduce it to a couple of INSERT queries (one for
locations, one for the dst_table).
commit fe35120741d76ff4d91d82bd1db029ff90ce8d60
Author: David Douard <david.douard@sdfa3.org>
Date: Wed Jun 9 14:55:54 2021 +0200
Remove the without-path flavor of ProvenanceDB
commit e23832b21ad4ee7afcb56f98147e51f633b6c2d7
Author: David Douard <david.douard@sdfa3.org>
Date: Wed Jun 9 10:27:32 2021 +0200
Refactor the cache handling in ProvenanceDB
- use TypedDict structures to properly type the caches needed by the
ProvenanceDB objects,
- use only one cache plus a set of added (and eventually removed) ids of
objects (within the cache) for revisions, contents and directories.
See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/104/ for more details.

Comment Actions
Build is green
Patch application report for D5843 (id=20924)
Could not rebase; Attempt merge onto 075b0d6cd6...
Updating 075b0d6..b692ca1
Fast-forward
 swh/provenance/__init__.py | 18 +-
 swh/provenance/postgresql/provenancedb.py | 455 +++++++++++++++++++++
 swh/provenance/postgresql/provenancedb_base.py | 325 ---------------
 .../postgresql/provenancedb_with_path.py | 157 -------
 .../postgresql/provenancedb_without_path.py | 140 -------
 swh/provenance/provenance.py | 3 +-
 swh/provenance/sql/15-flavor.sql | 6 +-
 swh/provenance/sql/30-schema.sql | 32 +-
 swh/provenance/sql/60-indexes.sql | 13 +-
 swh/provenance/tests/conftest.py | 25 +-
 swh/provenance/tests/test_cli.py | 32 +-
 11 files changed, 500 insertions(+), 706 deletions(-)
 create mode 100644 swh/provenance/postgresql/provenancedb.py
 delete mode 100644 swh/provenance/postgresql/provenancedb_base.py
 delete mode 100644 swh/provenance/postgresql/provenancedb_with_path.py
 delete mode 100644 swh/provenance/postgresql/provenancedb_without_path.py
Changes applied before test
commit b692ca10b6fd7d0aeded8a3717416c7d2b496ae1
Author: David Douard <david.douard@sdfa3.org>
Date: Wed Jun 9 16:59:39 2021 +0200
Add support for a denormalized version of the provenance DB
commit aa3b89d7e98462d14d5ce3b13a79c90bb2398adf
Author: David Douard <david.douard@sdfa3.org>
Date: Thu Jun 10 10:26:38 2021 +0200
Simplify the ProvenanceDB.insert_all() method
factorize insertions in content, revision and directory tables.
commit 3e424af6b3d65daedf9f1923d2b214cb57676abe
Author: David Douard <david.douard@sdfa3.org>
Date: Thu Jun 10 09:13:59 2021 +0200
Refactor the provenanceDB.insert_location() method
simplify the code and reduce it to a couple of INSERT queries (one for
locations, one for the dst_table).
commit 4c50588e85be58c0d17d0e55d3ebb0facc3ee173
Author: David Douard <david.douard@sdfa3.org>
Date: Wed Jun 9 14:55:54 2021 +0200
Remove the without-path flavor of ProvenanceDB
commit 8aff35d251db39537a3a4bd14f98783dc06ebdc9
Author: David Douard <david.douard@sdfa3.org>
Date: Wed Jun 9 10:27:32 2021 +0200
Refactor the cache handling in ProvenanceDB
- use TypedDict structures to properly type the caches needed by the
ProvenanceDB objects,
- use only one cache plus a set of added (and eventually removed) ids of
objects (within the cache) for revisions, contents and directories.
See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/116/ for more details.

Comment Actions
Build is green
Patch application report for D5843 (id=21215)
Rebasing onto d892b29e40...
Current branch diff-target is up to date.
Changes applied before test
commit 91e902ca6935d3ce99d6815c5a60afd7c746a3ee
Author: David Douard <david.douard@sdfa3.org>
Date: Wed Jun 16 14:35:50 2021 +0200
Add support for a denormalized version of the provenance DB
in db schema, relation tables (xxx_in_yyy) are denormalized, meaning the
yyy relation (and the location, if any) are stored as arrays.
See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/187/ for more details.

Comment Actions
Build is green
Patch application report for D5843 (id=21216)
Rebasing onto d892b29e40...
Current branch diff-target is up to date.
Changes applied before test
commit b2807328360f37d899c425e95d281e8af0a61098
Author: David Douard <david.douard@sdfa3.org>
Date: Wed Jun 16 14:35:50 2021 +0200
Add support for a denormalized version of the provenance DB
in db schema, relation tables (xxx_in_yyy) are denormalized, meaning the
yyy relation (and the location, if any) are stored as arrays.
Denormalized schema is chosen at db creation time using one of the 2
"-denormalized" flavors (aka "with-path-denormalized" or
"without-path-denormalized").
See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/188/ for more details.

Comment Actions
Also, at some point we might want to use better templating to write these SQL queries, or use stored procedures (with the proper "variation" being chosen at db creation time based on the selected flavor); this would simplify the Python code a lot.

Comment Actions
You are right, a bit of doc somewhere would not be superfluous...
Comment Actions
Build is green
Patch application report for D5843 (id=21726)
Rebasing onto 1ae32c0a61...
Current branch diff-target is up to date.
Changes applied before test
commit 2f454a94987419e52f595bf5ff9ddf4a15b0cf9c
Author: David Douard <david.douard@sdfa3.org>
Date: Wed Jun 16 14:35:50 2021 +0200
Add support for a denormalized version of the provenance DB
in db schema, relation tables (xxx_in_yyy) are denormalized, meaning the
yyy relation (and the location, if any) are stored as arrays.
Denormalized schema is chosen at db creation time using one of the 2
"-denormalized" flavors (aka "with-path-denormalized" or
"without-path-denormalized").
See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/263/ for more details.

Comment Actions
It's not clear to me how the denormalized version handles the insertion of duplicated entries.
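To make the duplicate-entry question concrete, here is a minimal sketch of a duplicate-safe append on an array-based entry. It does not describe what the code in this diff does: the parallel-array layout and the add_relation helper are assumptions made for illustration, and the sketch is single-threaded, whereas in a real database the check and the append would have to happen atomically (in one statement or under a lock).

```python
from typing import Dict, List, Tuple

# Hypothetical in-memory stand-in for a denormalized relation table:
# content -> (revisions, locations), stored as parallel arrays.
DenormalizedTable = Dict[str, Tuple[List[str], List[str]]]


def add_relation(
    table: DenormalizedTable, content: str, revision: str, location: str
) -> bool:
    """Append (revision, location) to the content's arrays unless already present.

    Returns True if a new entry was stored, False if it was a duplicate.
    """
    revisions, locations = table.setdefault(content, ([], []))
    if (revision, location) in zip(revisions, locations):
        return False  # already recorded, do not grow the arrays
    revisions.append(revision)
    locations.append(location)
    return True
```

Inserting the same relation twice leaves the arrays unchanged, so repeated imports of the same revision would not grow the stored data.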
Comment Actions
Build is green
Patch application report for D5843 (id=21795)
Rebasing onto 1ae32c0a61...
Current branch diff-target is up to date.
Changes applied before test
commit 0d3aaffcc4b38d2aca65f85fed4d9b95f119a47a
Author: David Douard <david.douard@sdfa3.org>
Date: Wed Jun 16 14:35:50 2021 +0200
Add support for a denormalized version of the provenance DB
in db schema, relation tables (xxx_in_yyy) are denormalized, meaning the
yyy relation (and the location, if any) are stored as arrays.
Denormalized schema is chosen at db creation time using one of the 2
"-denormalized" flavors (aka "with-path-denormalized" or
"without-path-denormalized").
See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/268/ for more details.

Comment Actions
It's something I am still trying to figure out as well (whether this code performs as expected under heavy concurrent workload). I want to run more tests (by hand, since this is hard to implement as a "unit" test) ASAP.

Comment Actions
Just ran a series of tests: I imported the sample_10k dataset using 16 concurrent workers into 2 databases (with and without denormalization), then compared the content of the databases by comparing the result of content_find_all for all content objects. The results are exactly the same. (I've also played a bit more with locking; I'll submit this in a future diff.)

Comment Actions
I'm not sure we are talking about the same thing here... Comparing via content_find_all will check that the semantics of the db remain the same (at least from the final user's point of view), but will not guarantee that there are no (unnecessary) duplicated entries in the db. content_find_all will filter duplicated results in the end. With a 10k-revision dataset, inserting duplicated entries might not make a big difference, but when processing millions of revisions with several clients this can make the size of the db explode. That's why it is important not to store duplicated information in the first place (potential consistency issues aside).

Comment Actions
Build is green
Patch application report for D5843 (id=21831)
Rebasing onto 509280132c...
Current branch diff-target is up to date.
Changes applied before test
commit 1c3d6426ebd2d1e4b00a50888b2b3eead5b8eab3
Author: David Douard <david.douard@sdfa3.org>
Date: Wed Jun 16 14:35:50 2021 +0200
Add support for a denormalized version of the provenance DB
in db schema, relation tables (xxx_in_yyy) are denormalized, meaning the
yyy relation (and the location, if any) are stored as arrays.
Denormalized schema is chosen at db creation time using one of the 2
"-denormalized" flavors (aka "with-path-denormalized" or
"without-path-denormalized").
See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/275/ for more details.