Details

Reviewers

Group Reviewers

Commits

rDPROV579c3bd35e56: Improve PostgreSQL storage scheme for the `with-path-denormalized` flavor

Summary

The previous version stored arrays of strings representing tuples for the
denormalized relations (the dst and loc of each relation, respectively).
While that simplified the check for duplicates, it turned out to be very
inefficient in terms of disk usage. The new version keeps two distinct lists
of bigint (i.e. internal ids), one for dst and one for loc. To check for
duplicates, the two lists must be zipped and repeated tuples filtered out.
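
The duplicate check described above can be sketched in Python as follows. This is only an illustration of the idea, not the code from this diff (the change itself touches the SQL files listed under Revision Contents); the function name `dedup_pairs` and its signature are hypothetical.

```python
from typing import List, Tuple


def dedup_pairs(dst: List[int], loc: List[int]) -> Tuple[List[int], List[int]]:
    """Zip two parallel id lists and drop repeated (dst, loc) pairs.

    Hypothetical helper: the first occurrence of each pair is kept and
    the result is returned as two parallel lists again.
    """
    if len(dst) != len(loc):
        raise ValueError("dst and loc must have the same length")
    seen = set()
    out_dst: List[int] = []
    out_loc: List[int] = []
    for pair in zip(dst, loc):
        if pair not in seen:
            seen.add(pair)
            out_dst.append(pair[0])
            out_loc.append(pair[1])
    return out_dst, out_loc
```

Note that the two lists only make sense together: the same dst id may legitimately appear several times as long as it is paired with different loc ids.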

Diff Detail

Repository

rDPROV Provenance database

Lint

Automatic diff as part of commit; lint not applicable.

Unit

Automatic diff as part of commit; unit tests not applicable.

Event Timeline

aeviso created this revision. · Oct 14 2021, 12:17 PM

Herald added a reviewer: Reviewers. · Oct 14 2021, 12:17 PM

Build is green

Patch application report for D6473 (id=23513)

Rebasing onto 3e87301a28...

Current branch diff-target is up to date.

Changes applied before test

commit eca0242e0e00093e12ba45134bbaedc0f85da39a
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Thu Oct 14 12:03:47 2021 +0200

    Improve PostgreSQL storage scheme for the `with-path-denormalized` flavor
    
    Previous version was storing arrays of strings representing tuples for the
    denormalized relations (`dst` and `loc` of the relation resp.). While that
    simplified the check for duplicates, it turned out to be very inefficient
    in terms of disk usage. The new version has two distinct lists if `bigint`
    (ie. internal ids) for `dst` and `loc` resp. To check for duplicates the
    lists should be zipped, and repeated tuples filtered.

See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/442/ for more details.

Harbormaster completed remote builds in B24424: Diff 23513. · Oct 14 2021, 12:20 PM

aeviso requested review of this revision. · Oct 14 2021, 12:20 PM

rebase

aeviso added a child revision: D6165: Add new RabbitMQ-based client/server API. · Oct 14 2021, 1:40 PM

Build is green

Patch application report for D6473 (id=23516)

Rebasing onto 3e87301a28...

Current branch diff-target is up to date.

Changes applied before test

commit 37da3774d8dc34365b7b1cbed469d970c51ecc58
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Thu Oct 14 12:03:47 2021 +0200

    Improve PostgreSQL storage scheme for the `with-path-denormalized` flavor
    
    Previous version was storing arrays of strings representing tuples for the
    denormalized relations (`dst` and `loc` of the relation resp.). While that
    simplified the check for duplicates, it turned out to be very inefficient
    in terms of disk usage. The new version has two distinct lists if `bigint`
    (ie. internal ids) for `dst` and `loc` resp. To check for duplicates the
    lists should be zipped, and repeated tuples filtered.

See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/443/ for more details.

Harbormaster completed remote builds in B24426: Diff 23516. · Oct 14 2021, 1:44 PM

Looks sensible to me, thanks.

Maybe add a couple of comments to explain what the conflict clause does (it concatenates the loc and $entity lists and deduplicates their entries together)?

This revision is now accepted and ready to land. · Oct 15 2021, 3:58 PM

In D6473#168303, @olasd wrote:

Looks sensible to me, thanks.

Maybe add a couple of comments to explain what the conflict clause does (it concatenates the loc and $entity lists and deduplicates their entries together)?

Well, that's exactly what I've tried to say in the comment. The lists are zipped together, so you get a list of the form [(entity, location), ...], then duplicated pairs are removed. I'll try to rephrase it to make it clearer.
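
For readers following the thread, the behaviour being discussed can be sketched in Python roughly as below. This assumes that, on conflict, the stored lists are concatenated with the incoming ones before the zipped pairs are deduplicated; `merge_on_conflict` and its argument names are hypothetical and are not taken from the diff.

```python
from typing import List, Tuple

IdLists = Tuple[List[int], List[int]]  # (entity ids, location ids), kept in parallel


def merge_on_conflict(existing: IdLists, incoming: IdLists) -> IdLists:
    """Merge incoming (entity, location) lists into the existing ones.

    Hypothetical sketch: concatenate both sides, zip them into pairs of
    the form [(entity, location), ...], and remove duplicated pairs.
    """
    pairs = zip(existing[0] + incoming[0], existing[1] + incoming[1])
    # dict.fromkeys drops repeated pairs while preserving first-seen order
    unique = list(dict.fromkeys(pairs))
    entities = [entity for entity, _ in unique]
    locations = [location for _, location in unique]
    return entities, locations
```

Deduplicating each list on its own would break the pairing, which is why the pairs are formed before any filtering happens.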

Anyway, I'm not sure I will land this diff after all. In the first experiments I've done, it actually seems to perform worse than before in terms of space usage. It needs more experimentation, but the results so far are not promising at all.

rebase

aeviso added a parent revision: D6507: Add metrics on retries when flushing cache on the provenance backend. · Oct 19 2021, 4:29 PM

Build is green

Patch application report for D6473 (id=23634)

Could not rebase; Attempt merge onto 3e87301a28...

Updating 3e87301..8168ab4
Fast-forward
 swh/provenance/graph.py                 |  2 +-
 swh/provenance/postgresql/provenance.py | 29 +++++++++++----
 swh/provenance/provenance.py            | 63 +++++++++++++++++++++++++++++++++
 swh/provenance/sql/30-schema.sql        | 20 +++++------
 swh/provenance/sql/40-funcs.sql         | 50 +++++++++++++++-----------
 5 files changed, 124 insertions(+), 40 deletions(-)

Changes applied before test

commit 8168ab4fc3f0fc3556623dd3de854f222ffe5d7e
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Thu Oct 14 12:03:47 2021 +0200

    Improve PostgreSQL storage scheme for the `with-path-denormalized` flavor
    
    Previous version was storing arrays of strings representing tuples for the
    denormalized relations (`dst` and `loc` of the relation resp.). While that
    simplified the check for duplicates, it turned out to be very inefficient
    in terms of disk usage. The new version has two distinct lists if `bigint`
    (ie. internal ids) for `dst` and `loc` resp. To check for duplicates the
    lists should be zipped, and repeated tuples filtered.

commit c7ae90e08b39919da9d67ad3436a71d47a6ad5e7
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Mon Oct 18 12:10:10 2021 +0200

    Add metrics on retries when flushing cache on the provenance backend

commit bfea53a97c588aa85ddd2ea93fa3dcf17b34a6a4
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Tue Oct 19 16:12:23 2021 +0200

    Export page size as a parameter for postgresql storage

See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/458/ for more details.

Harbormaster completed remote builds in B24525: Diff 23634. · Oct 19 2021, 4:32 PM

rebase

aeviso removed a parent revision: D6507: Add metrics on retries when flushing cache on the provenance backend. · Oct 20 2021, 10:48 AM

Build is green

Patch application report for D6473 (id=23655)

Could not rebase; Attempt merge onto 3e87301a28...

Updating 3e87301..62884e2
Fast-forward
 swh/provenance/graph.py                 |  2 +-
 swh/provenance/postgresql/provenance.py | 29 +++++++++++----
 swh/provenance/provenance.py            | 63 +++++++++++++++++++++++++++++++++
 swh/provenance/sql/30-schema.sql        | 20 +++++------
 swh/provenance/sql/40-funcs.sql         | 50 +++++++++++++++-----------
 5 files changed, 124 insertions(+), 40 deletions(-)

Changes applied before test

commit 62884e23dd1164274fd89a09acedae8977a8e0f3
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Thu Oct 14 12:03:47 2021 +0200

    Improve PostgreSQL storage scheme for the `with-path-denormalized` flavor
    
    Previous version was storing arrays of strings representing tuples for the
    denormalized relations (`dst` and `loc` of the relation resp.). While that
    simplified the check for duplicates, it turned out to be very inefficient
    in terms of disk usage. The new version has two distinct lists if `bigint`
    (ie. internal ids) for `dst` and `loc` resp. To check for duplicates the
    lists should be zipped, and repeated tuples filtered.

commit ef49e3100cf40fe7427855cd7f893e59f6c07379
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Mon Oct 18 12:10:10 2021 +0200

    Add metrics on retries when flushing cache on the provenance backend

commit 665b8a4430cf2a74cced46c31e744efa4efe5662
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Tue Oct 19 16:12:23 2021 +0200

    Export page size as a parameter for postgresql storage

See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/463/ for more details.

Harbormaster completed remote builds in B24545: Diff 23655. · Oct 20 2021, 10:49 AM

rebase

aeviso added a parent revision: D6577: Fix `yaml.load` deprecated warning. · Oct 28 2021, 2:24 PM

Build is green

Patch application report for D6473 (id=23904)

Could not rebase; Attempt merge onto ef49e3100c...

Updating ef49e31..743b595
Fast-forward
 swh/provenance/sql/30-schema.sql           | 20 +++++-------
 swh/provenance/sql/40-funcs.sql            | 50 +++++++++++++++++-------------
 swh/provenance/tests/data/generate_repo.py |  2 +-
 3 files changed, 38 insertions(+), 34 deletions(-)

Changes applied before test

commit 743b5954068fcc98203d9d254c53c076856e3426
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Thu Oct 14 12:03:47 2021 +0200

    Improve PostgreSQL storage scheme for the `with-path-denormalized` flavor
    
    Previous version was storing arrays of strings representing tuples for the
    denormalized relations (`dst` and `loc` of the relation resp.). While that
    simplified the check for duplicates, it turned out to be very inefficient
    in terms of disk usage. The new version has two distinct lists if `bigint`
    (ie. internal ids) for `dst` and `loc` resp. To check for duplicates the
    lists should be zipped, and repeated tuples filtered.

commit 30d8899bcfd60019b84064eba6916af0b2b5173e
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Thu Oct 28 13:58:32 2021 +0200

    Fix `yaml.load` deprecated warning

See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/473/ for more details.

Harbormaster completed remote builds in B24783: Diff 23904. · Oct 28 2021, 2:27 PM

aeviso removed a parent revision: D6577: Fix `yaml.load` deprecated warning. · Oct 28 2021, 2:48 PM

rebase

aeviso added a parent revision: D6578: Add support to filter files a minimum size. · Nov 24 2021, 11:22 AM

Build is green

Patch application report for D6473 (id=24270)

Could not rebase; Attempt merge onto 94baaab052...

Updating 94baaab..579c3bd
Fast-forward
 swh/provenance/archive.py            |  2 +-
 swh/provenance/cli.py                |  4 +-
 swh/provenance/graph.py              |  3 +-
 swh/provenance/model.py              |  4 +-
 swh/provenance/postgresql/archive.py | 15 +++----
 swh/provenance/provenance.py         | 77 +++++++++++++-----------------------
 swh/provenance/revision.py           | 12 ++++--
 swh/provenance/sql/30-schema.sql     | 20 ++++------
 swh/provenance/sql/40-funcs.sql      | 50 +++++++++++++----------
 swh/provenance/storage/archive.py    | 16 ++++----
 swh/provenance/tests/conftest.py     | 34 +++++++++-------
 11 files changed, 118 insertions(+), 119 deletions(-)

Changes applied before test

commit 579c3bd35e5668ad9ef5fea58c20d5c66e5699f2
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Thu Oct 14 12:03:47 2021 +0200

    Improve PostgreSQL storage scheme for the `with-path-denormalized` flavor
    
    Previous version was storing arrays of strings representing tuples for the
    denormalized relations (`dst` and `loc` of the relation resp.). While that
    simplified the check for duplicates, it turned out to be very inefficient
    in terms of disk usage. The new version has two distinct lists if `bigint`
    (ie. internal ids) for `dst` and `loc` resp. To check for duplicates the
    lists should be zipped, and repeated tuples filtered.

commit 584845d3715ea6c536e7cf5f697cac628032416f
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Thu Oct 28 14:21:52 2021 +0200

    Add support to filter files a minimum size
    
    The idea is to be able to filter files that are not meaningful from the
    provenance point of view. For instance, the empty file. This modification
    allows to define a minimum size for files to be considered for the
    provenance index.

commit 966fe3e8d506ce8b4fddf6e9ad29db4dae9943ab
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Tue Nov 23 16:11:09 2021 +0100

    Reorder flushing operations to avoid unnecessary updated in the storage

commit 62a31f6f986bb38ced99331ab66eb0717600ea5b
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Wed Nov 24 11:10:40 2021 +0100

    Rework conftest and improve type annotations

See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/488/ for more details.

Harbormaster completed remote builds in B25145: Diff 24270. · Nov 24 2021, 11:24 AM

Closed by commit rDPROV579c3bd35e56: Improve PostgreSQL storage scheme for the `with-path-denormalized` flavor (authored by aeviso). · Nov 24 2021, 1:45 PM

This revision was automatically updated to reflect the committed changes.

aeviso added a commit: rDPROV579c3bd35e56: Improve PostgreSQL storage scheme for the `with-path-denormalized` flavor.

aeviso removed a child revision: D6165: Add new RabbitMQ-based client/server API. · Nov 24 2021, 1:51 PM

Improve PostgreSQL storage scheme for the `with-path-denormalized` flavor
Closed · Public

Revision Contents
Changeset List

Diff 24280

swh/provenance/sql/30-schema.sql

swh/provenance/sql/40-funcs.sql
