The previous version stored arrays of strings representing tuples for the
denormalized relations (`dst` and `loc` of the relation, respectively). While
that simplified the check for duplicates, it turned out to be very inefficient
in terms of disk usage. The new version has two distinct lists of `bigint`
(i.e. internal ids) for `dst` and `loc`, respectively. To check for duplicates,
the lists should be zipped and repeated tuples filtered out.
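To illustrate the duplicate check described above, here is a minimal Python sketch. It is not the code from this diff (the real logic lives in the SQL files of the repository, which are not shown here); the function name `dedup_pairs` and the sample ids are hypothetical and only show the idea of zipping two parallel id lists and dropping repeated pairs.

```python
from typing import List, Tuple


def dedup_pairs(dst: List[int], loc: List[int]) -> Tuple[List[int], List[int]]:
    """Zip two parallel id lists and drop repeated (dst, loc) pairs.

    Hypothetical helper for illustration only; the first occurrence of
    each pair is kept and the original order is preserved.
    """
    if len(dst) != len(loc):
        raise ValueError("dst and loc must have the same length")
    # dict.fromkeys keeps insertion order, so first occurrences win.
    unique = list(dict.fromkeys(zip(dst, loc)))
    # Unzip back into two parallel lists.
    return [d for d, _ in unique], [l for _, l in unique]


if __name__ == "__main__":
    # The pair (1, 10) appears twice and is kept only once.
    print(dedup_pairs([1, 2, 1, 3], [10, 20, 10, 30]))
```

Note that two entries count as duplicates only when both ids match: the same `dst` with a different `loc` is a distinct pair and is kept.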
Details
- Reviewers: olasd
- Group Reviewers: Reviewers
- Commits: rDPROV579c3bd35e56: Improve PostgreSQL storage scheme for the `with-path-denormalized` flavor
Diff Detail
- Repository: rDPROV Provenance database
- Lint: Automatic diff as part of commit; lint not applicable.
- Unit: Automatic diff as part of commit; unit tests not applicable.
Event Timeline
Build is green
Patch application report for D6473 (id=23513)
Rebasing onto 3e87301a28...
Current branch diff-target is up to date.
Changes applied before test
commit eca0242e0e00093e12ba45134bbaedc0f85da39a
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Thu Oct 14 12:03:47 2021 +0200

    Improve PostgreSQL storage scheme for the `with-path-denormalized` flavor

    Previous version was storing arrays of strings representing tuples for the
    denormalized relations (`dst` and `loc` of the relation resp.). While that
    simplified the check for duplicates, it turned out to be very inefficient
    in terms of disk usage. The new version has two distinct lists if `bigint`
    (ie. internal ids) for `dst` and `loc` resp. To check for duplicates the
    lists should be zipped, and repeated tuples filtered.
See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/442/ for more details.
Build is green
Patch application report for D6473 (id=23516)
Rebasing onto 3e87301a28...
Current branch diff-target is up to date.
Changes applied before test
commit 37da3774d8dc34365b7b1cbed469d970c51ecc58
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Thu Oct 14 12:03:47 2021 +0200

    Improve PostgreSQL storage scheme for the `with-path-denormalized` flavor
See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/443/ for more details.
Looks sensible to me, thanks.
Maybe add a couple of comments to explain what the conflict clause does (it concatenates the loc and $entity lists and deduplicates their entries together)?
Well, that's exactly what I tried to say in the comment. The lists are zipped together, so you get a list of the form [(entity, location), ...], and then duplicate pairs are removed. I'll try to rephrase it to make it clearer.
Anyway, I'm not sure I will land this diff after all. In my first experiments it actually seems to perform worse than before in terms of space usage. It needs more experimentation, but the results so far are not promising at all.
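The behaviour of the conflict clause discussed in the comments above can be sketched in Python as follows. This is an assumption-laden illustration, not the SQL from `40-funcs.sql`: the function name `merge_on_conflict` and its parameters are hypothetical, and it only models the described steps (concatenate the stored and incoming lists, zip them into pairs, remove duplicate pairs).

```python
from typing import List, Tuple


def merge_on_conflict(
    old_entity: List[int],
    old_loc: List[int],
    new_entity: List[int],
    new_loc: List[int],
) -> Tuple[List[int], List[int]]:
    """Merge incoming (entity, location) ids into an already stored row.

    Hypothetical model of the conflict clause: both pairs of parallel
    lists are zipped, concatenated, and deduplicated pairwise.
    """
    if len(old_entity) != len(old_loc) or len(new_entity) != len(new_loc):
        raise ValueError("entity and loc lists must have matching lengths")
    # Build [(entity, location), ...] from stored and incoming values.
    pairs = list(zip(old_entity, old_loc)) + list(zip(new_entity, new_loc))
    # Drop duplicate pairs, keeping the first occurrence of each.
    unique = list(dict.fromkeys(pairs))
    return [e for e, _ in unique], [l for _, l in unique]


if __name__ == "__main__":
    # The incoming pair (2, 20) is already stored, so only (3, 30) is added.
    print(merge_on_conflict([1, 2], [10, 20], [2, 3], [20, 30]))
```

Deduplicating on the zipped pairs rather than on each list separately matters here: filtering the two lists independently would break the positional correspondence between an entity id and its location id.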
Build is green
Patch application report for D6473 (id=23634)
Could not rebase; Attempt merge onto 3e87301a28...
Updating 3e87301..8168ab4
Fast-forward
 swh/provenance/graph.py                 |  2 +-
 swh/provenance/postgresql/provenance.py | 29 +++++++++++----
 swh/provenance/provenance.py            | 63 +++++++++++++++++++++++++++++++++
 swh/provenance/sql/30-schema.sql        | 20 +++++------
 swh/provenance/sql/40-funcs.sql         | 50 +++++++++++++++-----------
 5 files changed, 124 insertions(+), 40 deletions(-)
Changes applied before test
commit 8168ab4fc3f0fc3556623dd3de854f222ffe5d7e
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Thu Oct 14 12:03:47 2021 +0200

    Improve PostgreSQL storage scheme for the `with-path-denormalized` flavor

commit c7ae90e08b39919da9d67ad3436a71d47a6ad5e7
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Mon Oct 18 12:10:10 2021 +0200

    Add metrics on retries when flushing cache on the provenance backend

commit bfea53a97c588aa85ddd2ea93fa3dcf17b34a6a4
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Tue Oct 19 16:12:23 2021 +0200

    Export page size as a parameter for postgresql storage
See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/458/ for more details.
Build is green
Patch application report for D6473 (id=23655)
Could not rebase; Attempt merge onto 3e87301a28...
Updating 3e87301..62884e2
Fast-forward
 swh/provenance/graph.py                 |  2 +-
 swh/provenance/postgresql/provenance.py | 29 +++++++++++----
 swh/provenance/provenance.py            | 63 +++++++++++++++++++++++++++++++++
 swh/provenance/sql/30-schema.sql        | 20 +++++------
 swh/provenance/sql/40-funcs.sql         | 50 +++++++++++++++-----------
 5 files changed, 124 insertions(+), 40 deletions(-)
Changes applied before test
commit 62884e23dd1164274fd89a09acedae8977a8e0f3
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Thu Oct 14 12:03:47 2021 +0200

    Improve PostgreSQL storage scheme for the `with-path-denormalized` flavor

commit ef49e3100cf40fe7427855cd7f893e59f6c07379
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Mon Oct 18 12:10:10 2021 +0200

    Add metrics on retries when flushing cache on the provenance backend

commit 665b8a4430cf2a74cced46c31e744efa4efe5662
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Tue Oct 19 16:12:23 2021 +0200

    Export page size as a parameter for postgresql storage
See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/463/ for more details.
Build is green
Patch application report for D6473 (id=23904)
Could not rebase; Attempt merge onto ef49e3100c...
Updating ef49e31..743b595
Fast-forward
 swh/provenance/sql/30-schema.sql           | 20 +++++-------
 swh/provenance/sql/40-funcs.sql            | 50 +++++++++++++++++-------------
 swh/provenance/tests/data/generate_repo.py |  2 +-
 3 files changed, 38 insertions(+), 34 deletions(-)
Changes applied before test
commit 743b5954068fcc98203d9d254c53c076856e3426
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Thu Oct 14 12:03:47 2021 +0200

    Improve PostgreSQL storage scheme for the `with-path-denormalized` flavor

commit 30d8899bcfd60019b84064eba6916af0b2b5173e
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Thu Oct 28 13:58:32 2021 +0200

    Fix `yaml.load` deprecated warning
See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/473/ for more details.
Build is green
Patch application report for D6473 (id=24270)
Could not rebase; Attempt merge onto 94baaab052...
Updating 94baaab..579c3bd
Fast-forward
 swh/provenance/archive.py            |  2 +-
 swh/provenance/cli.py                |  4 +-
 swh/provenance/graph.py              |  3 +-
 swh/provenance/model.py              |  4 +-
 swh/provenance/postgresql/archive.py | 15 +++----
 swh/provenance/provenance.py         | 77 +++++++++++++-----------------------
 swh/provenance/revision.py           | 12 ++++--
 swh/provenance/sql/30-schema.sql     | 20 ++++------
 swh/provenance/sql/40-funcs.sql      | 50 +++++++++++++----------
 swh/provenance/storage/archive.py    | 16 ++++----
 swh/provenance/tests/conftest.py     | 34 +++++++++-------
 11 files changed, 118 insertions(+), 119 deletions(-)
Changes applied before test
commit 579c3bd35e5668ad9ef5fea58c20d5c66e5699f2
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Thu Oct 14 12:03:47 2021 +0200

    Improve PostgreSQL storage scheme for the `with-path-denormalized` flavor

commit 584845d3715ea6c536e7cf5f697cac628032416f
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Thu Oct 28 14:21:52 2021 +0200

    Add support to filter files a minimum size

    The idea is to be able to filter files that are not meaningful from the
    provenance point of view. For instance, the empty file. This modification
    allows to define a minimum size for files to be considered for the
    provenance index.

commit 966fe3e8d506ce8b4fddf6e9ad29db4dae9943ab
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Tue Nov 23 16:11:09 2021 +0100

    Reorder flushing operations to avoid unnecessary updated in the storage

commit 62a31f6f986bb38ced99331ab66eb0717600ea5b
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Wed Nov 24 11:10:40 2021 +0100

    Rework conftest and improve type annotations
See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/488/ for more details.