Page MenuHomeSoftware Heritage

Storage: Enable to paginate, filter and count snapshot content
ClosedPublic

Authored by anlambert on Oct 4 2018, 6:00 PM.

Details

Summary

First draft in order to enable pagination and filtering when one wants
to query the content of a snapshot.

That diff adds new optional parameters to the Storage methods returning
the content of a snapshot (snapshot_get, snapshot_get_by_origin_visit,
snapshot_get_latest):

  • branches_offset: enable to skip a given amount of branches before returning the snapshot content
  • branches_limit: enable to restrain the amount of branches returned in the snapsphot content
  • branches_targets: enable to filter the target type of branches returned in the snapshot branches

I have also added a new method snapshot_branches_count in order to
only return the amount of branches according to their target type.

Plugin these new features in swh-web effectively remove the slowdowns
observed when browsing an origin with a large amount of branches like
for instance https://github.com/v8/v8

Related T1207

Test Plan

Tests showing how to use these new parameters have been added.

Diff Detail

Repository
rDSTO Storage manager
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

anlambert added inline comments.
sql/swh-func.sql
841

For the record, this is the only way I found to query dangling branches (when target_type is NULL)
as 'target_type = any(branches_targets)' returns false when target_type is NULL and NULL is
contained in the branches_targets array (sql standard).

Rebase to master and fix a couple of typos

olasd requested changes to this revision.Oct 8 2018, 2:42 PM

Thanks!

I have made a few comments inline.

I also have a more generic doubt about the interface : I don't like multiplying the arguments to all functions that access snapshots; I'd rather we limit the results to a sensible value by default (e.g. the first 100 branches), and give callers the information that more data is available;

The current functions would be unchanged, they would return a snapshot with its id and a list of branches. We would just add one field to the return value, next_branch, defaulting to None, that the caller would have to check to see if the list of branches was complete or not.

The full scrolling list of branches would then be available through a single (new) endpoint referencing the snapshot id and the first branch to fetch:

def snapshot_get_branches(id, first_branch=None, count=None, target_types=None):
   ...

This keeps the API more regular and avoids us doing more joins than necessary on the backend when a client wants the full list of branches for a given snapshot.

sql/swh-func.sql
829–830

I'm not sure about the argument names here; maybe :

  • branches_from bytea default ''
  • limit bigint default null
  • target_types snapshot_target[] default null
841

Mixing significant nulls with other data is fairly painful; I think we should add an explicit 'dangling' type for branches.

842

Ah :) there's a rule of thumb when doing performance-sensitive things in SQL :

Every time you use offset and limit, the database has to fetch offset + limit rows, discarding the first offset rows (more details). While the client won't get flooded by useless rows, the server will still have to fetch them (which means you end up having the server fetch O(n²) rows instead of O(n)).

In this case it'd be better to have the query scroll from the last branch name seen (so, adding where name >= next_name to the query). This would also potentially help filter using patterns, e.g. getting all branches of a git snapshot would be scrolling from refs/heads/ onwards.

The limit clause is fine.

856

We need to make sure that this function call really gets inlined by the optimizer (it should...), else we should inline the query ourselves.

swh/storage/db.py
376–379

These argument names should be kept matching the SQL-side ones.

swh/storage/storage.py
759

We use the object_verb[_object] for other function names, so that would be snapshot_count_branches (same comment for the db function name).

This revision now requires changes to proceed.Oct 8 2018, 2:42 PM

First diff update adressing @olasd comments:

  • remove the use of offset in the sql query used to return snapshot branches
  • do not modify parameters list of method snapshot_get, snapshot_get_by_origin_visit, snapshot_get_latest
  • snapshot content returned by these methods only contains the first 1000 branches (this seems a good default value imho, as I did not notice any performance issues and it should returned the whole set of branches for a majority of snapshots), a new field called next_branch is now present in the returned dict possibly containing the name of the first branch not returned.
  • add method snapshot_get_branches with the following optional parameters:
    • branches_from: skip branches to return whose name is lesser than this value
    • branches_count: maximum amount of branches to return
    • target_types: list of target types to return while filtering out the others
  • rename method snapshot_branches_count to snapshot_count_branches
anlambert added inline comments.
sql/swh-func.sql
829–830

So I ended up with:

  • branches_from bytea default ''
  • branches_count bigint default null
  • target_types snapshot_target[] default null

This naming is also reflected in the Python API.

841

I agree on that, querying dangling branches with None is not really straightforward. I removed the test case on this until the dangling target type is put in place.

856

Using postgres explain gives me the following:

explain select * from swh_snapshot_get_by_id(decode('4eea8dff10495777fdd7db0c790c1ce8506e2a32', 'hex'));

                                             QUERY PLAN                                             
----------------------------------------------------------------------------------------------------
 Limit  (cost=104.23..106.44 rows=885 width=76)
   InitPlan 1 (returns $0)
     ->  Index Scan using snapshot_id_idx on snapshot  (cost=0.15..8.17 rows=1 width=8)
           Index Cond: ((id)::bytea = ('\x4eea8dff10495777fdd7db0c790c1ce8506e2a32'::bytea)::bytea)
   ->  Sort  (cost=96.05..98.27 rows=885 width=76)
         Sort Key: snapshot_branch.name
         ->  Hash Join  (cost=27.12..52.73 rows=885 width=76)
               Hash Cond: (snapshot_branch.object_id = snapshot_branches.branch_id)
               ->  Seq Scan on snapshot_branch  (cost=0.00..21.06 rows=885 width=52)
                     Filter: (name >= '\x'::bytea)
               ->  Hash  (cost=16.06..16.06 rows=885 width=8)
                     ->  Seq Scan on snapshot_branches  (cost=0.00..16.06 rows=885 width=8)
                           Filter: (snapshot_id = $0)

explain select * from swh_snapshot_branches_count(decode('4eea8dff10495777fdd7db0c790c1ce8506e2a32', 'hex'));

                                                QUERY PLAN                                                
----------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=117.50..117.52 rows=2 width=12)
   Group Key: snapshot_branch.target_type
   ->  Limit  (cost=102.01..104.23 rows=885 width=76)
         InitPlan 1 (returns $0)
           ->  Index Scan using snapshot_id_idx on snapshot  (cost=0.15..8.17 rows=1 width=8)
                 Index Cond: ((id)::bytea = ('\x4eea8dff10495777fdd7db0c790c1ce8506e2a32'::bytea)::bytea)
         ->  Sort  (cost=93.84..96.05 rows=885 width=76)
               Sort Key: snapshot_branch.name
               ->  Hash Join  (cost=27.12..50.52 rows=885 width=76)
                     Hash Cond: (snapshot_branch.object_id = snapshot_branches.branch_id)
                     ->  Seq Scan on snapshot_branch  (cost=0.00..21.06 rows=885 width=52)
                           Filter: (name >= '\x'::bytea)
                     ->  Hash  (cost=16.06..16.06 rows=885 width=8)
                           ->  Seq Scan on snapshot_branches  (cost=0.00..16.06 rows=885 width=8)
                                 Filter: (snapshot_id = $0)

It looks like the call to swh_snapshot_get_by_id in swh_snapshot_count_branches gets correctly inlined.

This revision is now accepted and ready to land.Oct 10 2018, 2:29 PM
This revision was automatically updated to reflect the committed changes.