
content_get: Add support for queries by sha1_git
Closed, Public

Authored by vlorentz on May 10 2021, 9:47 PM.

Details

Summary

Before this commit, the only way to get Content objects from their sha1_git
was to call content_find for each object.
This was obviously neither convenient nor efficient.

Using this endpoint to batch calls reduces the runtime of the git-bare
vault cooker by 30%.

I only implemented sha1_git because it is the only hash I need for now
and it keeps the change simple; support for the other hash algorithms
can easily be added later.
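
To illustrate why batching helps, here is a minimal sketch that contrasts one lookup per object with a single batched lookup. It is not the swh.storage implementation: ToyStorage, find_one and get_many are hypothetical names that stand in for a storage backend, and the call counter only models round trips. The sha1_git helper uses git's blob hashing (SHA-1 over a "blob <length>\0" header followed by the data).

```python
import hashlib
from typing import Dict, List, Optional


def git_blob_sha1(data: bytes) -> bytes:
    """Return sha1_git: SHA-1 of the data prefixed with git's blob header."""
    return hashlib.sha1(b"blob %d\0" % len(data) + data).digest()


class ToyStorage:
    """Hypothetical in-memory backend that counts round trips (assumption)."""

    def __init__(self, blobs: List[bytes]) -> None:
        self._by_sha1_git: Dict[bytes, bytes] = {
            git_blob_sha1(blob): blob for blob in blobs
        }
        self.calls = 0

    def find_one(self, sha1_git: bytes) -> Optional[bytes]:
        # One round trip per object, like calling a "find" endpoint in a loop.
        self.calls += 1
        return self._by_sha1_git.get(sha1_git)

    def get_many(self, sha1_gits: List[bytes]) -> List[Optional[bytes]]:
        # One round trip for the whole batch; None marks a missing object,
        # and results keep the order of the requested hashes.
        self.calls += 1
        return [self._by_sha1_git.get(h) for h in sha1_gits]


blobs = [b"hello\n", b"world\n"]
storage = ToyStorage(blobs)
wanted = [git_blob_sha1(blob) for blob in blobs] + [git_blob_sha1(b"missing")]

one_by_one = [storage.find_one(h) for h in wanted]  # 3 round trips
batched = storage.get_many(wanted)                  # 1 round trip
assert one_by_one == batched
```

Both approaches return the same objects; the difference is the number of round trips, which is what matters for a caller such as the vault cooker that resolves many hashes at once.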

Diff Detail

Repository
rDSTO Storage manager
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

Build is green

Patch application report for D5729 (id=20462)

Could not rebase; Attempt merge onto b487a21f27...

Merge made by the 'recursive' strategy.
 swh/storage/cassandra/cql.py         | 11 +++++
 swh/storage/cassandra/storage.py     | 48 ++++++++++++++++++----
 swh/storage/in_memory.py             | 11 +++++
 swh/storage/interface.py             | 33 ++++++++++++++-
 swh/storage/postgresql/db.py         | 23 ++++++++++-
 swh/storage/postgresql/storage.py    | 48 +++++++++++++++++++---
 swh/storage/proxies/retry.py         |  3 +-
 swh/storage/sql/40-funcs.sql         | 28 +++++++++++++
 swh/storage/tests/storage_tests.py   | 54 ++++++++++++++++++++++++
 swh/storage/tests/test_cassandra.py  | 80 +++++++++++++++++++++++++++++++++++-
 swh/storage/tests/test_postgresql.py | 10 ++++-
 11 files changed, 331 insertions(+), 18 deletions(-)
Changes applied before test
commit 9d01c638ac090b0eccb99b3a38e66f150209a831
Merge: b487a21f 5f69f028
Author: Jenkins user <jenkins@localhost>
Date:   Mon May 10 19:47:17 2021 +0000

    Merge branch 'diff-target' into HEAD

commit 5f69f0280f08d11591b09e2284b6f6074e7b13d8
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Mon May 10 21:46:50 2021 +0200

    content_get: Add support for queries by sha1_git
    
    Before this commit, the only way to get Content objects from their sha1_git
    was to call content_find for each object.
    This was obviously neither convenient nor efficient.
    
    Using this endpoint to batch calls reduces the runtime of the git-bare
    vault cooker by 30%.
    
    I only implemented sha1_git because it is the only one I need for now
    and it is simpler this way, but support for other algos can easily
    be added later.

commit a6a283195782b7f8d2c33b24e8328b1cbbdd599b
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Mon May 10 16:12:05 2021 +0200

    Add endpoint directory_get_entries, to quickly list a directory's entries
    
    It spares a join with the content table, which should hopefully make
    the vault (and possibly other users) faster when they don't need this
    join.

commit 4d3eeb2edd5b1413a968a30b1b0f585be4dcf4e0
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Mon May 10 14:13:20 2021 +0200

    cassandra: Add tests checking directory_add and snapshot_add are atomic.

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1322/ for more details.

Build is green

Patch application report for D5729 (id=20493)

Could not rebase; Attempt merge onto f140f634b6...

Updating f140f634..168d4864
Fast-forward
 swh/storage/cassandra/cql.py         | 11 +++++++++
 swh/storage/cassandra/storage.py     | 48 ++++++++++++++++++++++++++++++------
 swh/storage/in_memory.py             | 11 +++++++++
 swh/storage/interface.py             | 31 ++++++++++++++++++++++-
 swh/storage/postgresql/db.py         | 23 ++++++++++++++++-
 swh/storage/postgresql/storage.py    | 48 ++++++++++++++++++++++++++++++++----
 swh/storage/sql/40-funcs.sql         | 28 +++++++++++++++++++++
 swh/storage/tests/storage_tests.py   | 48 ++++++++++++++++++++++++++++++++++++
 swh/storage/tests/test_cassandra.py  |  1 +
 swh/storage/tests/test_postgresql.py | 10 +++++++-
 10 files changed, 243 insertions(+), 16 deletions(-)
Changes applied before test
commit 168d48648549a575fb5f8f3c88931e32344e583d
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Mon May 10 21:46:50 2021 +0200

    content_get: Add support for queries by sha1_git
    
    Before this commit, the only way to get Content objects from their sha1_git
    was to call content_find for each object.
    This was obviously neither convenient nor efficient.
    
    Using this endpoint to batch calls reduces the runtime of the git-bare
    vault cooker by 30%.
    
    I only implemented sha1_git because it is the only one I need for now
    and it is simpler this way, but support for other algos can easily
    be added later.

commit e3cbd5ee425cefa1e290a34cd889256036a06db0
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Mon May 10 16:12:05 2021 +0200

    Add endpoint directory_get_entries, to quickly list a directory's entries
    
    It spares a join with the content table, which should hopefully make
    the vault (and possibly other users) faster when they don't need this
    join.

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1325/ for more details.

olasd requested changes to this revision. May 11 2021, 2:19 PM
olasd added a subscriber: olasd.
olasd added inline comments.
swh/storage/cassandra/storage.py
299

Why not all available hashes?

swh/storage/postgresql/db.py
130–141

Surely there is a way to properly parametrize the previous function on the hash type.

swh/storage/tests/storage_tests.py
672–687

This could probably use pytest.mark.parametrize
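
As a rough sketch of what the db.py comment above asks for, a single lookup function can take the hash type as an argument instead of having one function per hash. This is an illustration only, not the code from the diff: it uses the standard-library sqlite3 module in place of PostgreSQL so that it is self-contained, the function name content_get_by and the simplified table schema are assumptions, and the column name is checked against an allowlist because identifiers cannot be passed as bound query parameters.

```python
import sqlite3
from typing import List, Optional, Tuple

# Hash columns of the (simplified, assumed) content table. Only these
# names may be interpolated into the SQL text.
HASH_COLUMNS = ("sha1", "sha1_git", "sha256", "blake2s256")


def content_get_by(
    conn: sqlite3.Connection, algo: str, hashes: List[bytes]
) -> List[Optional[Tuple[bytes, int]]]:
    """Look up (hash, length) rows by any supported hash type.

    Returns one entry per requested hash, in order, with None for misses.
    """
    if algo not in HASH_COLUMNS:
        raise ValueError(f"unsupported hash algorithm: {algo!r}")
    if not hashes:
        return []
    # The hash values themselves are always bound as parameters.
    placeholders = ", ".join("?" for _ in hashes)
    query = f"SELECT {algo}, length FROM content WHERE {algo} IN ({placeholders})"
    found = {row[0]: row for row in conn.execute(query, hashes)}
    return [found.get(h) for h in hashes]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE content "
    "(sha1 BLOB, sha1_git BLOB, sha256 BLOB, blake2s256 BLOB, length INTEGER)"
)
conn.execute(
    "INSERT INTO content VALUES (?, ?, ?, ?, ?)",
    (b"\x01" * 20, b"\x02" * 20, b"\x03" * 32, b"\x04" * 32, 42),
)

# Second hash is unknown, so its slot in the result is None.
result = content_get_by(conn, "sha1_git", [b"\x02" * 20, b"\xff" * 20])
```

The same function then serves every hash type, and a test over it can iterate over HASH_COLUMNS rather than repeating one test body per hash.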

This revision now requires changes to proceed. May 11 2021, 2:19 PM

generalize to all hash types

Build is green

Patch application report for D5729 (id=20500)

Rebasing onto e3cbd5ee42...

Current branch diff-target is up to date.
Changes applied before test
commit f328367979b98ab121cbd2ddff36c2e56c193923
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Mon May 10 21:46:50 2021 +0200

    content_get: Add support for queries by sha1_git
    
    Before this commit, the only way to get Content objects from their sha1_git
    was to call content_find for each object.
    This was obviously neither convenient nor efficient.
    
    Using this endpoint to batch calls reduces the runtime of the git-bare
    vault cooker by 30%.

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1326/ for more details.

This revision is now accepted and ready to land. May 11 2021, 2:45 PM