
Improve PostgreSQL storage scheme for the `with-path-denormalized` flavor
Closed, Public

Authored by aeviso on Oct 14 2021, 12:17 PM.

Details

Summary

The previous version stored arrays of strings representing tuples for the
denormalized relations (the dst and loc of the relation, respectively). While
that simplified the check for duplicates, it turned out to be very inefficient
in terms of disk usage. The new version has two distinct lists of bigint
(i.e. internal ids), one for dst and one for loc. To check for duplicates, the
lists must be zipped and repeated tuples filtered out.
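
The zip-and-filter step can be illustrated outside the database. The following is a minimal Python sketch, not the SQL from this diff: the function name `merge_denormalized` and its parameters are hypothetical, and it assumes that the two lists are parallel (entry i of dst pairs with entry i of loc) and that first-seen order may be kept.

```python
from typing import List, Tuple


def merge_denormalized(
    old_dst: List[int],
    old_loc: List[int],
    new_dst: List[int],
    new_loc: List[int],
) -> Tuple[List[int], List[int]]:
    """Merge two pairs of parallel id lists, dropping repeated (dst, loc) tuples.

    Hypothetical helper for illustration only; the real logic lives in SQL.
    """
    if len(old_dst) != len(old_loc) or len(new_dst) != len(new_loc):
        raise ValueError("dst and loc lists must have the same length")

    # Concatenate the stored and incoming lists, then zip them into tuples.
    pairs = zip(old_dst + new_dst, old_loc + new_loc)

    # dict.fromkeys keeps the first occurrence of each tuple, in order.
    unique = list(dict.fromkeys(pairs))

    # Unzip back into two distinct lists of ids.
    dst = [d for d, _ in unique]
    loc = [l for _, l in unique]
    return dst, loc
```

Note that a pair counts as a duplicate only when both ids match: the same dst with a different loc is kept as a separate entry.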

Diff Detail

Event Timeline

Build is green

Patch application report for D6473 (id=23513)

Rebasing onto 3e87301a28...

Current branch diff-target is up to date.
Changes applied before test
commit eca0242e0e00093e12ba45134bbaedc0f85da39a
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Thu Oct 14 12:03:47 2021 +0200

    Improve PostgreSQL storage scheme for the `with-path-denormalized` flavor
    
    Previous version was storing arrays of strings representing tuples for the
    denormalized relations (`dst` and `loc` of the relation resp.). While that
    simplified the check for duplicates, it turned out to be very inefficient
    in terms of disk usage. The new version has two distinct lists if `bigint`
    (ie. internal ids) for `dst` and `loc` resp. To check for duplicates the
    lists should be zipped, and repeated tuples filtered.

See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/442/ for more details.

Build is green

Patch application report for D6473 (id=23516)

Rebasing onto 3e87301a28...

Current branch diff-target is up to date.
Changes applied before test
commit 37da3774d8dc34365b7b1cbed469d970c51ecc58
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Thu Oct 14 12:03:47 2021 +0200

    Improve PostgreSQL storage scheme for the `with-path-denormalized` flavor
    
    Previous version was storing arrays of strings representing tuples for the
    denormalized relations (`dst` and `loc` of the relation resp.). While that
    simplified the check for duplicates, it turned out to be very inefficient
    in terms of disk usage. The new version has two distinct lists if `bigint`
    (ie. internal ids) for `dst` and `loc` resp. To check for duplicates the
    lists should be zipped, and repeated tuples filtered.

See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/443/ for more details.

olasd added a subscriber: olasd.

Looks sensible to me, thanks.

Maybe add a couple comments to explain what the conflict clause does (it concatenates the loc and $entity lists and deduplicates their entries together)?

This revision is now accepted and ready to land. Oct 15 2021, 3:58 PM

In D6473#168303, @olasd wrote:

Looks sensible to me, thanks.

Maybe add a couple comments to explain what the conflict clause does (it concatenates the loc and $entity lists and deduplicates their entries together)?

Well, that's exactly what I've tried to say in the comment. The lists are zipped together, so you get a list of the form [(entity, location), ...], then duplicated pairs are removed. I'll try to rephrase it to make it clearer.

Anyway, I'm not sure I will land this diff after all. In the first experiments I've done, it actually seems to perform worse than before in terms of space usage. It needs more experimentation, but the results so far are not promising at all.

Build is green

Patch application report for D6473 (id=23634)

Could not rebase; Attempt merge onto 3e87301a28...

Updating 3e87301..8168ab4
Fast-forward
 swh/provenance/graph.py                 |  2 +-
 swh/provenance/postgresql/provenance.py | 29 +++++++++++----
 swh/provenance/provenance.py            | 63 +++++++++++++++++++++++++++++++++
 swh/provenance/sql/30-schema.sql        | 20 +++++------
 swh/provenance/sql/40-funcs.sql         | 50 +++++++++++++++-----------
 5 files changed, 124 insertions(+), 40 deletions(-)
Changes applied before test
commit 8168ab4fc3f0fc3556623dd3de854f222ffe5d7e
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Thu Oct 14 12:03:47 2021 +0200

    Improve PostgreSQL storage scheme for the `with-path-denormalized` flavor
    
    Previous version was storing arrays of strings representing tuples for the
    denormalized relations (`dst` and `loc` of the relation resp.). While that
    simplified the check for duplicates, it turned out to be very inefficient
    in terms of disk usage. The new version has two distinct lists if `bigint`
    (ie. internal ids) for `dst` and `loc` resp. To check for duplicates the
    lists should be zipped, and repeated tuples filtered.

commit c7ae90e08b39919da9d67ad3436a71d47a6ad5e7
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Mon Oct 18 12:10:10 2021 +0200

    Add metrics on retries when flushing cache on the provenance backend

commit bfea53a97c588aa85ddd2ea93fa3dcf17b34a6a4
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Tue Oct 19 16:12:23 2021 +0200

    Export page size as a parameter for postgresql storage

See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/458/ for more details.

Build is green

Patch application report for D6473 (id=23655)

Could not rebase; Attempt merge onto 3e87301a28...

Updating 3e87301..62884e2
Fast-forward
 swh/provenance/graph.py                 |  2 +-
 swh/provenance/postgresql/provenance.py | 29 +++++++++++----
 swh/provenance/provenance.py            | 63 +++++++++++++++++++++++++++++++++
 swh/provenance/sql/30-schema.sql        | 20 +++++------
 swh/provenance/sql/40-funcs.sql         | 50 +++++++++++++++-----------
 5 files changed, 124 insertions(+), 40 deletions(-)
Changes applied before test
commit 62884e23dd1164274fd89a09acedae8977a8e0f3
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Thu Oct 14 12:03:47 2021 +0200

    Improve PostgreSQL storage scheme for the `with-path-denormalized` flavor
    
    Previous version was storing arrays of strings representing tuples for the
    denormalized relations (`dst` and `loc` of the relation resp.). While that
    simplified the check for duplicates, it turned out to be very inefficient
    in terms of disk usage. The new version has two distinct lists if `bigint`
    (ie. internal ids) for `dst` and `loc` resp. To check for duplicates the
    lists should be zipped, and repeated tuples filtered.

commit ef49e3100cf40fe7427855cd7f893e59f6c07379
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Mon Oct 18 12:10:10 2021 +0200

    Add metrics on retries when flushing cache on the provenance backend

commit 665b8a4430cf2a74cced46c31e744efa4efe5662
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Tue Oct 19 16:12:23 2021 +0200

    Export page size as a parameter for postgresql storage

See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/463/ for more details.

Build is green

Patch application report for D6473 (id=23904)

Could not rebase; Attempt merge onto ef49e3100c...

Updating ef49e31..743b595
Fast-forward
 swh/provenance/sql/30-schema.sql           | 20 +++++-------
 swh/provenance/sql/40-funcs.sql            | 50 +++++++++++++++++-------------
 swh/provenance/tests/data/generate_repo.py |  2 +-
 3 files changed, 38 insertions(+), 34 deletions(-)
Changes applied before test
commit 743b5954068fcc98203d9d254c53c076856e3426
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Thu Oct 14 12:03:47 2021 +0200

    Improve PostgreSQL storage scheme for the `with-path-denormalized` flavor
    
    Previous version was storing arrays of strings representing tuples for the
    denormalized relations (`dst` and `loc` of the relation resp.). While that
    simplified the check for duplicates, it turned out to be very inefficient
    in terms of disk usage. The new version has two distinct lists if `bigint`
    (ie. internal ids) for `dst` and `loc` resp. To check for duplicates the
    lists should be zipped, and repeated tuples filtered.

commit 30d8899bcfd60019b84064eba6916af0b2b5173e
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Thu Oct 28 13:58:32 2021 +0200

    Fix `yaml.load` deprecated warning

See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/473/ for more details.

Build is green

Patch application report for D6473 (id=24270)

Could not rebase; Attempt merge onto 94baaab052...

Updating 94baaab..579c3bd
Fast-forward
 swh/provenance/archive.py            |  2 +-
 swh/provenance/cli.py                |  4 +-
 swh/provenance/graph.py              |  3 +-
 swh/provenance/model.py              |  4 +-
 swh/provenance/postgresql/archive.py | 15 +++----
 swh/provenance/provenance.py         | 77 +++++++++++++-----------------------
 swh/provenance/revision.py           | 12 ++++--
 swh/provenance/sql/30-schema.sql     | 20 ++++------
 swh/provenance/sql/40-funcs.sql      | 50 +++++++++++++----------
 swh/provenance/storage/archive.py    | 16 ++++----
 swh/provenance/tests/conftest.py     | 34 +++++++++-------
 11 files changed, 118 insertions(+), 119 deletions(-)
Changes applied before test
commit 579c3bd35e5668ad9ef5fea58c20d5c66e5699f2
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Thu Oct 14 12:03:47 2021 +0200

    Improve PostgreSQL storage scheme for the `with-path-denormalized` flavor
    
    Previous version was storing arrays of strings representing tuples for the
    denormalized relations (`dst` and `loc` of the relation resp.). While that
    simplified the check for duplicates, it turned out to be very inefficient
    in terms of disk usage. The new version has two distinct lists if `bigint`
    (ie. internal ids) for `dst` and `loc` resp. To check for duplicates the
    lists should be zipped, and repeated tuples filtered.

commit 584845d3715ea6c536e7cf5f697cac628032416f
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Thu Oct 28 14:21:52 2021 +0200

    Add support to filter files a minimum size
    
    The idea is to be able to filter files that are not meaningful from the
    provenance point of view. For instance, the empty file. This modification
    allows to define a minimum size for files to be considered for the
    provenance index.

commit 966fe3e8d506ce8b4fddf6e9ad29db4dae9943ab
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Tue Nov 23 16:11:09 2021 +0100

    Reorder flushing operations to avoid unnecessary updated in the storage

commit 62a31f6f986bb38ced99331ab66eb0717600ea5b
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Wed Nov 24 11:10:40 2021 +0100

    Rework conftest and improve type annotations

See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/488/ for more details.