Improve PostgreSQL storage scheme for the `with-path-denormalized` flavor
Accepted, Public

Authored by aeviso on Thu, Oct 14, 12:17 PM.

Details

Reviewers
olasd
Group Reviewers
Reviewers
Summary

The previous version stored arrays of strings representing tuples for the
denormalized relations (the dst and loc of the relation, respectively). While
that simplified the check for duplicates, it turned out to be very inefficient
in terms of disk usage. The new version stores two distinct lists of bigint
(i.e. internal ids), one for dst and one for loc. To check for duplicates, the
lists must be zipped and the repeated tuples filtered out.
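
To illustrate the duplicate check on two parallel lists, here is a minimal Python sketch. It is not the SQL from this diff: the function name `dedupe_pairs` is hypothetical, and the sketch assumes that dst and loc are equal-length lists of integer ids where position i of each list forms one tuple.

```python
from typing import List, Tuple


def dedupe_pairs(dst: List[int], loc: List[int]) -> Tuple[List[int], List[int]]:
    """Zip two parallel id lists and drop repeated (dst, loc) tuples.

    Hypothetical helper for illustration only; the first occurrence of
    each tuple is kept and the original order is preserved.
    """
    if len(dst) != len(loc):
        raise ValueError("dst and loc must have the same length")
    # dict.fromkeys keeps insertion order, so the first copy of each pair wins.
    unique = list(dict.fromkeys(zip(dst, loc)))
    # Unzip back into the two-list layout.
    return [d for d, _ in unique], [l for _, l in unique]
```

Note that two entries sharing the same dst but a different loc are distinct tuples and are both kept, which is why neither list can be deduplicated on its own.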

Event Timeline

Build is green

Patch application report for D6473 (id=23513)

Rebasing onto 3e87301a28...

Current branch diff-target is up to date.
Changes applied before test
commit eca0242e0e00093e12ba45134bbaedc0f85da39a
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Thu Oct 14 12:03:47 2021 +0200

    Improve PostgreSQL storage scheme for the `with-path-denormalized` flavor
    
    Previous version was storing arrays of strings representing tuples for the
    denormalized relations (`dst` and `loc` of the relation resp.). While that
    simplified the check for duplicates, it turned out to be very inefficient
    in terms of disk usage. The new version has two distinct lists if `bigint`
    (ie. internal ids) for `dst` and `loc` resp. To check for duplicates the
    lists should be zipped, and repeated tuples filtered.

See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/442/ for more details.

Build is green

Patch application report for D6473 (id=23516)

Rebasing onto 3e87301a28...

Current branch diff-target is up to date.
Changes applied before test
commit 37da3774d8dc34365b7b1cbed469d970c51ecc58
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Thu Oct 14 12:03:47 2021 +0200

    Improve PostgreSQL storage scheme for the `with-path-denormalized` flavor
    
    Previous version was storing arrays of strings representing tuples for the
    denormalized relations (`dst` and `loc` of the relation resp.). While that
    simplified the check for duplicates, it turned out to be very inefficient
    in terms of disk usage. The new version has two distinct lists if `bigint`
    (ie. internal ids) for `dst` and `loc` resp. To check for duplicates the
    lists should be zipped, and repeated tuples filtered.

See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/443/ for more details.

olasd added a subscriber: olasd.

Looks sensible to me, thanks.

Maybe add a couple of comments to explain what the conflict clause does (it concatenates the loc and $entity lists, and deduplicates their entries together)?

This revision is now accepted and ready to land. Fri, Oct 15, 3:58 PM
In D6473#168303, @olasd wrote:

Looks sensible to me, thanks.

Maybe add a couple of comments to explain what the conflict clause does (it concatenates the loc and $entity lists, and deduplicates their entries together)?

Well, that's exactly what I've tried to say in the comment. The lists are zipped together, so you get a list of the form [(entity, location), ...], then duplicated pairs are removed. I'll try to rephrase it to make it clearer.
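
The behaviour described here (concatenate the stored and incoming lists, zip them into (entity, location) pairs, remove duplicated pairs) could be sketched in Python roughly as follows. This is an illustration under assumptions, not the conflict clause from the diff: `merge_on_conflict` and its parameter names are hypothetical, and the lists are assumed to hold integer ids.

```python
from typing import List, Sequence, Tuple


def merge_on_conflict(
    stored_entity: Sequence[int],
    stored_loc: Sequence[int],
    new_entity: Sequence[int],
    new_loc: Sequence[int],
) -> Tuple[List[int], List[int]]:
    """Merge incoming ids into an existing row (hypothetical sketch)."""
    if len(stored_entity) != len(stored_loc) or len(new_entity) != len(new_loc):
        raise ValueError("entity and loc lists must be pairwise the same length")
    # Concatenate the stored values with the incoming ones.
    entity = list(stored_entity) + list(new_entity)
    loc = list(stored_loc) + list(new_loc)
    # Zip into [(entity, location), ...] and keep only the first copy of each pair.
    seen = set()
    merged = []
    for pair in zip(entity, loc):
        if pair not in seen:
            seen.add(pair)
            merged.append(pair)
    # Split the pairs back into two parallel lists.
    return [e for e, _ in merged], [l for _, l in merged]
```

For example, merging stored pairs (1, 10), (2, 20) with incoming pairs (2, 20), (3, 30) keeps a single copy of (2, 20) in the result.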

Anyway, I'm not sure I will land this diff after all. In the first experiments I've done, it actually seems to perform worse than before in terms of space usage. It needs more experimentation, but the results so far are not promising at all.