content_get: Add support for queries by sha1_git
Closed · Public

Authored by vlorentz on May 10 2021, 9:47 PM.

Details

Summary

Before this commit, the only way to get Content objects from their sha1_git
was to call content_find for each object.
This was obviously neither convenient nor efficient.

Using this endpoint to batch calls reduces the runtime of the git-bare
vault cooker by 30%.

I only implemented sha1_git because it is the only one I need for now
and it is simpler this way, but support for other algos can easily
be added later.
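
To illustrate why batching helps, here is a minimal sketch comparing one lookup per object with a single batched lookup. `ToyStorage` is a hypothetical in-memory stand-in written for this example, not the swh.storage implementation; its method signatures and the `round_trips` counter are assumptions made for illustration only.

```python
from typing import Dict, List, Optional


class ToyStorage:
    """Hypothetical in-memory stand-in; NOT the real swh.storage API."""

    def __init__(self, contents: List[dict]) -> None:
        self._by_sha1_git = {c["sha1_git"]: c for c in contents}
        self.round_trips = 0  # counts simulated calls to the backend

    def content_find(self, query: Dict[str, bytes]) -> List[dict]:
        """Look up a single object: one round trip per call."""
        self.round_trips += 1
        found = self._by_sha1_git.get(query["sha1_git"])
        return [found] if found is not None else []

    def content_get(
        self, hashes: List[bytes], algo: str = "sha1_git"
    ) -> List[Optional[dict]]:
        """Look up many objects at once: one round trip for the whole batch."""
        if algo != "sha1_git":
            raise ValueError(f"unsupported algo in this sketch: {algo!r}")
        self.round_trips += 1
        # Results are aligned with the input; missing objects become None.
        return [self._by_sha1_git.get(h) for h in hashes]


contents = [{"sha1_git": bytes([i]) * 20, "length": i} for i in range(5)]
wanted = [c["sha1_git"] for c in contents] + [b"\xff" * 20]  # last one is absent

# One call per object.
slow = ToyStorage(contents)
one_by_one = []
for h in wanted:
    matches = slow.content_find({"sha1_git": h})
    one_by_one.append(matches[0] if matches else None)

# One call for the whole batch.
fast = ToyStorage(contents)
batched = fast.content_get(wanted, algo="sha1_git")
```

Both approaches return the same objects, but the loop costs one simulated round trip per hash while the batched call costs one in total.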

Diff Detail

Repository
rDSTO Storage manager
Branch
content_get
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 21466
Build 33348: Phabricator diff pipeline on jenkins (Jenkins console)
Build 33347: arc lint + arc unit

Event Timeline

Build is green

Patch application report for D5729 (id=20462)

Could not rebase; Attempt merge onto b487a21f27...

Merge made by the 'recursive' strategy.
 swh/storage/cassandra/cql.py         | 11 +++++
 swh/storage/cassandra/storage.py     | 48 ++++++++++++++++++----
 swh/storage/in_memory.py             | 11 +++++
 swh/storage/interface.py             | 33 ++++++++++++++-
 swh/storage/postgresql/db.py         | 23 ++++++++++-
 swh/storage/postgresql/storage.py    | 48 +++++++++++++++++++---
 swh/storage/proxies/retry.py         |  3 +-
 swh/storage/sql/40-funcs.sql         | 28 +++++++++++++
 swh/storage/tests/storage_tests.py   | 54 ++++++++++++++++++++++++
 swh/storage/tests/test_cassandra.py  | 80 +++++++++++++++++++++++++++++++++++-
 swh/storage/tests/test_postgresql.py | 10 ++++-
 11 files changed, 331 insertions(+), 18 deletions(-)
Changes applied before test
commit 9d01c638ac090b0eccb99b3a38e66f150209a831
Merge: b487a21f 5f69f028
Author: Jenkins user <jenkins@localhost>
Date:   Mon May 10 19:47:17 2021 +0000

    Merge branch 'diff-target' into HEAD

commit 5f69f0280f08d11591b09e2284b6f6074e7b13d8
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Mon May 10 21:46:50 2021 +0200

    content_get: Add support for queries by sha1_git
    
    Before this commit, the only way to get Content objects from their sha1_git
    was to call content_find for each object.
    This was obviously neither convenient nor efficient.
    
    Using this endpoint to batch calls reduces the runtime of the git-bare
    vault cooker by 30%.
    
    I only implemented sha1_git because it is the only one I need for now
    and it is simpler this way, but support for other algos can easily
    be added later.

commit a6a283195782b7f8d2c33b24e8328b1cbbdd599b
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Mon May 10 16:12:05 2021 +0200

    Add endpoint directory_get_entries, to quickly list a directory's entries
    
    It spares a join with the content table, which should hopefully make
    the vault (and possibly other users) faster when they don't need this
    join.

commit 4d3eeb2edd5b1413a968a30b1b0f585be4dcf4e0
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Mon May 10 14:13:20 2021 +0200

    cassandra: Add tests checking directory_add and snapshot_add are atomic.

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1322/ for more details.

Build is green

Patch application report for D5729 (id=20493)

Could not rebase; Attempt merge onto f140f634b6...

Updating f140f634..168d4864
Fast-forward
 swh/storage/cassandra/cql.py         | 11 +++++++++
 swh/storage/cassandra/storage.py     | 48 ++++++++++++++++++++++++++++++------
 swh/storage/in_memory.py             | 11 +++++++++
 swh/storage/interface.py             | 31 ++++++++++++++++++++++-
 swh/storage/postgresql/db.py         | 23 ++++++++++++++++-
 swh/storage/postgresql/storage.py    | 48 ++++++++++++++++++++++++++++++++----
 swh/storage/sql/40-funcs.sql         | 28 +++++++++++++++++++++
 swh/storage/tests/storage_tests.py   | 48 ++++++++++++++++++++++++++++++++++++
 swh/storage/tests/test_cassandra.py  |  1 +
 swh/storage/tests/test_postgresql.py | 10 +++++++-
 10 files changed, 243 insertions(+), 16 deletions(-)
Changes applied before test
commit 168d48648549a575fb5f8f3c88931e32344e583d
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Mon May 10 21:46:50 2021 +0200

    content_get: Add support for queries by sha1_git
    
    Before this commit, the only way to get Content objects from their sha1_git
    was to call content_find for each object.
    This was obviously neither convenient nor efficient.
    
    Using this endpoint to batch calls reduces the runtime of the git-bare
    vault cooker by 30%.
    
    I only implemented sha1_git because it is the only one I need for now
    and it is simpler this way, but support for other algos can easily
    be added later.

commit e3cbd5ee425cefa1e290a34cd889256036a06db0
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Mon May 10 16:12:05 2021 +0200

    Add endpoint directory_get_entries, to quickly list a directory's entries
    
    It spares a join with the content table, which should hopefully make
    the vault (and possibly other users) faster when they don't need this
    join.

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1325/ for more details.

olasd requested changes to this revision. May 11 2021, 2:19 PM
olasd added a subscriber: olasd.
olasd added inline comments.
swh/storage/cassandra/storage.py
300

Why not all available hashes?

swh/storage/postgresql/db.py
128–139

Surely there is a way to properly parametrize the previous function on the hash type.
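
One way to read this suggestion is a single query function that takes the hash type as an argument and validates it against a fixed list of column names before building the query. The sketch below uses `sqlite3` and a simplified `content` table purely so that it runs on its own; the function name `content_get_by_hash`, the table layout, and the return shape are assumptions, not the code under review.

```python
import sqlite3
from typing import List, Optional, Sequence, Tuple

# Column names double as the accepted hash types. Checking against this
# fixed tuple is what makes interpolating the name into the query safe.
HASH_COLUMNS = ("sha1", "sha1_git", "sha256", "blake2s256")


def content_get_by_hash(
    conn: sqlite3.Connection, algo: str, hashes: Sequence[bytes]
) -> List[Optional[Tuple]]:
    """Return one row per requested hash (None when absent), in input order."""
    if algo not in HASH_COLUMNS:
        raise ValueError(f"unsupported hash algorithm: {algo!r}")
    if not hashes:
        return []
    placeholders = ", ".join("?" for _ in hashes)
    query = (
        f"SELECT {', '.join(HASH_COLUMNS)}, length "
        f"FROM content WHERE {algo} IN ({placeholders})"
    )
    rows = conn.execute(query, list(hashes)).fetchall()
    key_index = HASH_COLUMNS.index(algo)
    by_hash = {row[key_index]: row for row in rows}
    return [by_hash.get(h) for h in hashes]


# Simplified stand-in schema and data, for demonstration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE content "
    "(sha1 BLOB, sha1_git BLOB, sha256 BLOB, blake2s256 BLOB, length INTEGER)"
)
row1 = (b"a1", b"g1", b"s1", b"b1", 10)
row2 = (b"a2", b"g2", b"s2", b"b2", 20)
conn.executemany("INSERT INTO content VALUES (?, ?, ?, ?, ?)", [row1, row2])

by_git = content_get_by_hash(conn, "sha1_git", [b"g2", b"missing", b"g1"])
by_sha256 = content_get_by_hash(conn, "sha256", [b"s1"])
```

Hash values are still passed as bound parameters; only the column name, which cannot be bound, is interpolated after validation.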

swh/storage/tests/storage_tests.py
668–683

This could probably use pytest.mark.parametrize
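
As a rough sketch of what that could look like, the test below runs the same assertions once per hash algorithm through `pytest.mark.parametrize`. The helpers `index_by` and `make_contents`, and the choice of `hashlib` algorithms, are hypothetical and only serve to keep the example self-contained; they are not taken from storage_tests.py.

```python
import hashlib
from typing import Dict, List

import pytest


def make_contents(payloads: List[bytes], algos: List[str]) -> List[dict]:
    """Build toy content records carrying one digest per algorithm."""
    return [
        {"data": p, **{a: hashlib.new(a, p).digest() for a in algos}}
        for p in payloads
    ]


def index_by(contents: List[dict], algo: str) -> Dict[bytes, dict]:
    """Index toy content records by the digest of the given algorithm."""
    return {c[algo]: c for c in contents}


ALGOS = ["sha1", "sha256", "blake2s"]


@pytest.mark.parametrize("algo", ALGOS)
def test_lookup_by_hash(algo: str) -> None:
    contents = make_contents([b"foo", b"bar"], ALGOS)
    index = index_by(contents, algo)

    # A known digest resolves to its record, whatever the algorithm.
    wanted = hashlib.new(algo, b"foo").digest()
    assert index[wanted]["data"] == b"foo"

    # An unknown digest is absent from the index.
    missing = hashlib.new(algo, b"absent").digest()
    assert missing not in index
```

pytest reports each algorithm as a separate test case, so a failure for one hash type does not hide the results for the others.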

This revision now requires changes to proceed. May 11 2021, 2:19 PM

generalize to all hash types

Build is green

Patch application report for D5729 (id=20500)

Rebasing onto e3cbd5ee42...

Current branch diff-target is up to date.
Changes applied before test
commit f328367979b98ab121cbd2ddff36c2e56c193923
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Mon May 10 21:46:50 2021 +0200

    content_get: Add support for queries by sha1_git
    
    Before this commit, the only way to get Content objects from their sha1_git
    was to call content_find for each object.
    This was obviously neither convenient nor efficient.
    
    Using this endpoint to batch calls reduces the runtime of the git-bare
    vault cooker by 30%.

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1326/ for more details.

This revision is now accepted and ready to land. May 11 2021, 2:45 PM