Differential D5843
Add support for a denormalized version of the provenance DB
Authored by douardda on Jun 10 2021, 10:38 AM.
Details
Only the content_early_in_revision table is implemented so far.
Depends on D5842
Event Timeline

Build is green

Patch application report for D5843 (id=20896)
Could not rebase; Attempt merge onto 6cdd424eba...
Updating 6cdd424..a3d3aa5
Fast-forward
 swh/provenance/__init__.py                         |  18 +-
 swh/provenance/postgresql/provenancedb.py          | 455 +++++++++++++++++++++
 swh/provenance/postgresql/provenancedb_base.py     | 325 ---------------
 .../postgresql/provenancedb_with_path.py           | 157 -------
 .../postgresql/provenancedb_without_path.py        | 140 -------
 swh/provenance/provenance.py                       |   3 +-
 swh/provenance/sql/15-flavor.sql                   |   6 +-
 swh/provenance/sql/30-schema.sql                   |  32 +-
 swh/provenance/sql/60-indexes.sql                  |  13 +-
 swh/provenance/tests/conftest.py                   |  25 +-
 swh/provenance/tests/test_cli.py                   |  32 +-
 11 files changed, 500 insertions(+), 706 deletions(-)
 create mode 100644 swh/provenance/postgresql/provenancedb.py
 delete mode 100644 swh/provenance/postgresql/provenancedb_base.py
 delete mode 100644 swh/provenance/postgresql/provenancedb_with_path.py
 delete mode 100644 swh/provenance/postgresql/provenancedb_without_path.py

Changes applied before test

commit a3d3aa5ff90e081c85aeb3408959cf97f595e665
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Jun 9 16:59:39 2021 +0200

    Add support for a denormalized version of the provenance DB

commit ac1b33b66ebccff3d5e2a2280e1c7446e8fa087a
Author: David Douard <david.douard@sdfa3.org>
Date:   Thu Jun 10 10:26:38 2021 +0200

    Simplify the ProvenanceDB.insert_all() method

    factorize insertions in content, revision and directory tables.

commit 6b2b97ac23fe43146d4964a56806d5ce9f726c06
Author: David Douard <david.douard@sdfa3.org>
Date:   Thu Jun 10 09:13:59 2021 +0200

    Refactor the provenanceDB.insert_location() method

    simplify the code and reduce it to a couple of INSERT queries (one for
    locations, one for the dst_table).

commit fe35120741d76ff4d91d82bd1db029ff90ce8d60
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Jun 9 14:55:54 2021 +0200

    Remove the without-path flavor of ProvenanceDB

commit e23832b21ad4ee7afcb56f98147e51f633b6c2d7
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Jun 9 10:27:32 2021 +0200

    Refactor the cache handling in ProvenanceDB

    - use TypedDict structures to properly type the caches needed by the
      ProvenanceDB objects,
    - use only one cache plus a set of added (and eventually removed) ids of
      objects (within the cache) for revisisons, contents and directories.

See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/104/ for more details.

Build is green

Patch application report for D5843 (id=20924)
Could not rebase; Attempt merge onto 075b0d6cd6...
Updating 075b0d6..b692ca1
Fast-forward
 swh/provenance/__init__.py                         |  18 +-
 swh/provenance/postgresql/provenancedb.py          | 455 +++++++++++++++++++++
 swh/provenance/postgresql/provenancedb_base.py     | 325 ---------------
 .../postgresql/provenancedb_with_path.py           | 157 -------
 .../postgresql/provenancedb_without_path.py        | 140 -------
 swh/provenance/provenance.py                       |   3 +-
 swh/provenance/sql/15-flavor.sql                   |   6 +-
 swh/provenance/sql/30-schema.sql                   |  32 +-
 swh/provenance/sql/60-indexes.sql                  |  13 +-
 swh/provenance/tests/conftest.py                   |  25 +-
 swh/provenance/tests/test_cli.py                   |  32 +-
 11 files changed, 500 insertions(+), 706 deletions(-)
 create mode 100644 swh/provenance/postgresql/provenancedb.py
 delete mode 100644 swh/provenance/postgresql/provenancedb_base.py
 delete mode 100644 swh/provenance/postgresql/provenancedb_with_path.py
 delete mode 100644 swh/provenance/postgresql/provenancedb_without_path.py

Changes applied before test

commit b692ca10b6fd7d0aeded8a3717416c7d2b496ae1
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Jun 9 16:59:39 2021 +0200

    Add support for a denormalized version of the provenance DB

commit aa3b89d7e98462d14d5ce3b13a79c90bb2398adf
Author: David Douard <david.douard@sdfa3.org>
Date:   Thu Jun 10 10:26:38 2021 +0200

    Simplify the ProvenanceDB.insert_all() method

    factorize insertions in content, revision and directory tables.

commit 3e424af6b3d65daedf9f1923d2b214cb57676abe
Author: David Douard <david.douard@sdfa3.org>
Date:   Thu Jun 10 09:13:59 2021 +0200

    Refactor the provenanceDB.insert_location() method

    simplify the code and reduce it to a couple of INSERT queries (one for
    locations, one for the dst_table).

commit 4c50588e85be58c0d17d0e55d3ebb0facc3ee173
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Jun 9 14:55:54 2021 +0200

    Remove the without-path flavor of ProvenanceDB

commit 8aff35d251db39537a3a4bd14f98783dc06ebdc9
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Jun 9 10:27:32 2021 +0200

    Refactor the cache handling in ProvenanceDB

    - use TypedDict structures to properly type the caches needed by the
      ProvenanceDB objects,
    - use only one cache plus a set of added (and eventually removed) ids of
      objects (within the cache) for revisisons, contents and directories.

See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/116/ for more details.

Build is green

Patch application report for D5843 (id=21215)
Rebasing onto d892b29e40...
Current branch diff-target is up to date.

Changes applied before test

commit 91e902ca6935d3ce99d6815c5a60afd7c746a3ee
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Jun 16 14:35:50 2021 +0200

    Add support for a denormalized version of the provenance DB

    in db schema, relation tables (xxx_in_yyy) are denormalized, meaning the
    yyy relation (and the location, if any) are stored as arrays.

See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/187/ for more details.

Build is green

Patch application report for D5843 (id=21216)
Rebasing onto d892b29e40...
Current branch diff-target is up to date.
Changes applied before test

commit b2807328360f37d899c425e95d281e8af0a61098
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Jun 16 14:35:50 2021 +0200

    Add support for a denormalized version of the provenance DB

    in db schema, relation tables (xxx_in_yyy) are denormalized, meaning the
    yyy relation (and the location, if any) are stored as arrays.

    Denormalized schema is chosen at db creation time using one of the 2
    "-denormalized" flavors (aka "with-path-denormalized" or
    "without-path-denormalized").

See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/188/ for more details.

Also, at some point we might want to use better templating to write these SQL queries, or use stored procedures (with the proper "variation" being chosen at db creation time based on the selected flavor); that would simplify the Python code a lot.

You are right, a bit of doc somewhere would not be superfluous...
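
As a rough illustration of the denormalized layout described in the commit message above, here is a minimal pure-Python sketch. It is not the swh.provenance implementation: the row shape `(content, revision, location)` and the helper names `denormalize` and `normalize` are assumptions made for this example only. It shows one row per relation being regrouped so that, per content, the revisions and locations are kept as parallel arrays.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical normalized row of an "xxx_in_yyy"-style relation:
# one (content, revision, location) tuple per row.
Row = Tuple[str, str, str]
# Hypothetical denormalized form: content -> (revisions array, locations array).
Denormalized = Dict[str, Tuple[List[str], List[str]]]


def denormalize(rows: List[Row]) -> Denormalized:
    """Group rows by content; revisions and locations become parallel arrays."""
    grouped: Denormalized = defaultdict(lambda: ([], []))
    for content, revision, location in rows:
        revisions, locations = grouped[content]
        revisions.append(revision)
        locations.append(location)
    return dict(grouped)


def normalize(table: Denormalized) -> List[Row]:
    """Expand the parallel arrays back into one row per relation."""
    return [
        (content, revision, location)
        for content, (revisions, locations) in table.items()
        for revision, location in zip(revisions, locations)
    ]
```

In this sketch the two arrays must stay aligned by index, which is one reason a denormalized layout needs care on insertion.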
Build is green

Patch application report for D5843 (id=21726)
Rebasing onto 1ae32c0a61...
Current branch diff-target is up to date.

Changes applied before test

commit 2f454a94987419e52f595bf5ff9ddf4a15b0cf9c
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Jun 16 14:35:50 2021 +0200

    Add support for a denormalized version of the provenance DB

    in db schema, relation tables (xxx_in_yyy) are denormalized, meaning the
    yyy relation (and the location, if any) are stored as arrays.

    Denormalized schema is chosen at db creation time using one of the 2
    "-denormalized" flavors (aka "with-path-denormalized" or
    "without-path-denormalized").

See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/263/ for more details.

It's not clear to me how the denormalized version handles the insertion of duplicated entries.
Build is green

Patch application report for D5843 (id=21795)
Rebasing onto 1ae32c0a61...
Current branch diff-target is up to date.

Changes applied before test

commit 0d3aaffcc4b38d2aca65f85fed4d9b95f119a47a
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Jun 16 14:35:50 2021 +0200

    Add support for a denormalized version of the provenance DB

    in db schema, relation tables (xxx_in_yyy) are denormalized, meaning the
    yyy relation (and the location, if any) are stored as arrays.

    Denormalized schema is chosen at db creation time using one of the 2
    "-denormalized" flavors (aka "with-path-denormalized" or
    "without-path-denormalized").

See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/268/ for more details.

It's something I am still trying to figure out as well (whether this code performs as expected under a heavy concurrent workload). I want to run more tests (by hand, since this is hard to implement as a "unit" test) ASAP.

I just ran a series of tests: I imported the sample_10k dataset using 16 concurrent workers into 2 databases (with and without denormalization), then compared the content of the databases by comparing the result of content_find_all for all content objects. The results are exactly the same. (I've also played a bit more with locking; I'll put this in a future diff.)

I'm not sure we are talking about the same thing here... Comparing via content_find_all will check that the semantics of the db remain the same (at least from the final user's point of view), but will not guarantee that there are no (unnecessary) duplicated entries in the db: content_find_all filters duplicated results in the end. With a 10k-revision dataset, inserting duplicated entries might not make a big difference, but when processing millions of revisions with several clients this can make the size of the db explode. That's why it is important not to store duplicated information in the first place (potential consistency issues aside).

Build is green

Patch application report for D5843 (id=21831)
Rebasing onto 509280132c...
Current branch diff-target is up to date.

Changes applied before test

commit 1c3d6426ebd2d1e4b00a50888b2b3eead5b8eab3
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Jun 16 14:35:50 2021 +0200

    Add support for a denormalized version of the provenance DB

    in db schema, relation tables (xxx_in_yyy) are denormalized, meaning the
    yyy relation (and the location, if any) are stored as arrays.

    Denormalized schema is chosen at db creation time using one of the 2
    "-denormalized" flavors (aka "with-path-denormalized" or
    "without-path-denormalized").

See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/275/ for more details.
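
The point raised in the discussion above, that comparing deduplicated lookup results cannot reveal duplicated stored entries, can be sketched as follows. This is only an illustration under stated assumptions: `find_all` is a hypothetical stand-in for a lookup that filters duplicates (it is not the real content_find_all), and the store is modelled as a plain dict mapping each content id to a list of `(revision, location)` entries.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple

# Hypothetical shape of a stored entry: (revision, location).
Entry = Tuple[str, str]
Store = Dict[str, List[Entry]]


def find_all(store: Store, content: str) -> Set[Entry]:
    """Stand-in for a lookup that filters duplicated results before returning."""
    return set(store.get(content, []))


def duplicated_entries(store: Store) -> Dict[str, List[Entry]]:
    """Return, per content, the entries that are physically stored more than once."""
    report: Dict[str, List[Entry]] = {}
    for content, entries in store.items():
        duplicates = [entry for entry, count in Counter(entries).items() if count > 1]
        if duplicates:
            report[content] = duplicates
    return report


# Two stores with the same semantics; the second holds a redundant entry.
clean: Store = {"c1": [("r1", "a/b")]}
bloated: Store = {"c1": [("r1", "a/b"), ("r1", "a/b")]}

# The lookup-based comparison sees no difference between the two stores...
same_semantics = find_all(clean, "c1") == find_all(bloated, "c1")
# ...whereas inspecting the stored entries directly exposes the duplicate.
hidden_duplicates = duplicated_entries(bloated)
```

A check along these lines, run against the raw tables rather than through the lookup API, would be one way to tell whether concurrent workers inserted redundant entries.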