Paths

Table of Contentst

Storage: Enable to paginate, filter and count snapshot content
ClosedPublic
Actions

Authored by anlambert on Oct 4 2018, 6:00 PM.

Details

Reviewers

olasd
ardumont

Group Reviewers

Reviewers

Commits

rDSTOCe9e5e73fe421: storage: Enable to paginate, filter and count snapshot content
R65:e9e5e73fe421: storage: Enable to paginate, filter and count snapshot content
rDSTOe9e5e73fe421: storage: Enable to paginate, filter and count snapshot content

Summary

First draft in order to enable pagination and filtering when one wants
to query the content of a snapshot.

That diff adds new optional parameters to the Storage methods returning
the content of a snapshot (snapshot_get, snapshot_get_by_origin_visit,
snapshot_get_latest):

branches_offset: enable to skip a given amount of branches before returning the snapshot content
branches_limit: enable to restrain the amount of branches returned in the snapsphot content
branches_targets: enable to filter the target type of branches returned in the snapshot branches

I have also added a new method snapshot_branches_count in order to
only return the amount of branches according to their target type.

Plugin these new features in swh-web effectively remove the slowdowns
observed when browsing an origin with a large amount of branches like
for instance https://github.com/v8/v8

Related T1207

Test Plan

Tests showing how to use these new parameters have been added.

Diff Detail

Repository

rDSTO Storage manager

Lint

Automatic diff as part of commit; lint not applicable.

Unit

Automatic diff as part of commit; unit tests not applicable.

Event Timeline

anlambert created this revision.Oct 4 2018, 6:00 PM

Herald added a reviewer: Reviewers. · View Herald TranscriptOct 4 2018, 6:00 PM

Harbormaster completed remote builds in B1524: Diff 1464.Oct 4 2018, 6:00 PM

anlambert marked an inline comment as done.Oct 4 2018, 6:04 PM

anlambert added inline comments.

sql/swh-func.sql
841	For the record, this is the only way I found to query dangling branches (when target_type is NULL) as 'target_type = any(branches_targets)' returns false when target_type is NULL and NULL is contained in the branches_targets array (sql standard).

Rebase to master and fix a couple of typos

Harbormaster completed remote builds in B1559: Diff 1502.Oct 8 2018, 11:00 AM

Thanks!

I have made a few comments inline.

I also have a more generic doubt about the interface : I don't like multiplying the arguments to all functions that access snapshots; I'd rather we limit the results to a sensible value by default (e.g. the first 100 branches), and give callers the information that more data is available;

The current functions would be unchanged, they would return a snapshot with its id and a list of branches. We would just add one field to the return value, next_branch, defaulting to None, that the caller would have to check to see if the list of branches was complete or not.

The full scrolling list of branches would then be available through a single (new) endpoint referencing the snapshot id and the first branch to fetch:

def snapshot_get_branches(id, first_branch=None, count=None, target_types=None):
   ...

This keeps the API more regular and avoids us doing more joins than necessary on the backend when a client wants the full list of branches for a given snapshot.

sql/swh-func.sql
829–830	I'm not sure about the argument names here; maybe : `branches_from bytea default ''` `limit bigint default null` `target_types snapshot_target[] default null`
841	Mixing significant nulls with other data is fairly painful; I think we should add an explicit `'dangling'` type for branches.
842	Ah :) there's a rule of thumb when doing performance-sensitive things in SQL : Every time you use offset and limit, the database has to fetch `offset + limit` rows, discarding the first `offset` rows (more details). While the client won't get flooded by useless rows, the server will still have to fetch them (which means you end up having the server fetch O(n²) rows instead of O(n)). In this case it'd be better to have the query scroll from the last branch name seen (so, adding `where name >= next_name` to the query). This would also potentially help filter using patterns, e.g. getting all branches of a git snapshot would be scrolling from `refs/heads/` onwards. The limit clause is fine.
856	We need to make sure that this function call really gets inlined by the optimizer (it should...), else we should inline the query ourselves.
swh/storage/db.py
376–379	These argument names should be kept matching the SQL-side ones.
swh/storage/storage.py
759	We use the `object_verb[_object]` for other function names, so that would be `snapshot_count_branches` (same comment for the `db` function name).

This revision now requires changes to proceed.Oct 8 2018, 2:42 PM

First diff update adressing @olasd comments:

remove the use of offset in the sql query used to return snapshot branches
do not modify parameters list of method snapshot_get, snapshot_get_by_origin_visit, snapshot_get_latest
snapshot content returned by these methods only contains the first 1000 branches (this seems a good default value imho, as I did not notice any performance issues and it should returned the whole set of branches for a majority of snapshots), a new field called next_branch is now present in the returned dict possibly containing the name of the first branch not returned.
add method snapshot_get_branches with the following optional parameters:
- branches_from: skip branches to return whose name is lesser than this value
- branches_count: maximum amount of branches to return
- target_types: list of target types to return while filtering out the others
rename method snapshot_branches_count to snapshot_count_branches

Harbormaster completed remote builds in B1600: Diff 1567.Oct 9 2018, 3:41 PM

anlambert marked 3 inline comments as done.Oct 9 2018, 3:52 PM

anlambert added inline comments.

sql/swh-func.sql

829–830

So I ended up with:

branches_from bytea default ''
branches_count bigint default null
target_types snapshot_target[] default null

This naming is also reflected in the Python API.

841

I agree on that, querying dangling branches with None is not really straightforward. I removed the test case on this until the dangling target type is put in place.

856

Using postgres explain gives me the following:

explain select * from swh_snapshot_get_by_id(decode('4eea8dff10495777fdd7db0c790c1ce8506e2a32', 'hex'));

                                             QUERY PLAN                                             
----------------------------------------------------------------------------------------------------
 Limit  (cost=104.23..106.44 rows=885 width=76)
   InitPlan 1 (returns $0)
     ->  Index Scan using snapshot_id_idx on snapshot  (cost=0.15..8.17 rows=1 width=8)
           Index Cond: ((id)::bytea = ('\x4eea8dff10495777fdd7db0c790c1ce8506e2a32'::bytea)::bytea)
   ->  Sort  (cost=96.05..98.27 rows=885 width=76)
         Sort Key: snapshot_branch.name
         ->  Hash Join  (cost=27.12..52.73 rows=885 width=76)
               Hash Cond: (snapshot_branch.object_id = snapshot_branches.branch_id)
               ->  Seq Scan on snapshot_branch  (cost=0.00..21.06 rows=885 width=52)
                     Filter: (name >= '\x'::bytea)
               ->  Hash  (cost=16.06..16.06 rows=885 width=8)
                     ->  Seq Scan on snapshot_branches  (cost=0.00..16.06 rows=885 width=8)
                           Filter: (snapshot_id = $0)

explain select * from swh_snapshot_branches_count(decode('4eea8dff10495777fdd7db0c790c1ce8506e2a32', 'hex'));

                                                QUERY PLAN                                                
----------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=117.50..117.52 rows=2 width=12)
   Group Key: snapshot_branch.target_type
   ->  Limit  (cost=102.01..104.23 rows=885 width=76)
         InitPlan 1 (returns $0)
           ->  Index Scan using snapshot_id_idx on snapshot  (cost=0.15..8.17 rows=1 width=8)
                 Index Cond: ((id)::bytea = ('\x4eea8dff10495777fdd7db0c790c1ce8506e2a32'::bytea)::bytea)
         ->  Sort  (cost=93.84..96.05 rows=885 width=76)
               Sort Key: snapshot_branch.name
               ->  Hash Join  (cost=27.12..50.52 rows=885 width=76)
                     Hash Cond: (snapshot_branch.object_id = snapshot_branches.branch_id)
                     ->  Seq Scan on snapshot_branch  (cost=0.00..21.06 rows=885 width=52)
                           Filter: (name >= '\x'::bytea)
                     ->  Hash  (cost=16.06..16.06 rows=885 width=8)
                           ->  Seq Scan on snapshot_branches  (cost=0.00..16.06 rows=885 width=8)
                                 Filter: (snapshot_id = $0)

It looks like the call to swh_snapshot_get_by_id in swh_snapshot_count_branches gets correctly inlined.

olasd accepted this revision.Oct 10 2018, 2:29 PM

This revision is now accepted and ready to land.Oct 10 2018, 2:29 PM

Closed by commit rDSTOe9e5e73fe421: storage: Enable to paginate, filter and count snapshot content (authored by anlambert). · Explain WhyOct 10 2018, 2:43 PM

This revision was automatically updated to reflect the committed changes.