Page MenuHomeSoftware Heritage

storage: Add branch names filtering support in snapshot_get_branches
Changes PlannedPublic

Authored by anlambert on Nov 26 2020, 6:16 PM.

Details

Summary

Add optional branches_name_pattern parameter to snapshot_get_branches
enabling to filter the branches to return according to their names.

If provided, only the branches whose names match the provided pattern
will be returned.

The purpose is to add branches filtering in swh-web. As that feature
could be useful elsewhere, I thought adding it at swh-storage level
was better than implementing it client side each time it is required.

Related to T2782

Diff Detail

Repository
rDSTO Storage manager
Branch
snapshot-branch-names-filtering
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 17555
Build 27135: Phabricator diff pipeline on jenkinsJenkins console · Jenkins
Build 27134: arc lint + arc unit

Event Timeline

Build is green

Patch application report for D4615 (id=16363)

Rebasing onto 04ae89f4b4...

First, rewinding head to replay your work on top of it...
Applying: storage: Add branch names filtering support in snapshot_get_branches
Changes applied before test
commit 27b557fe444e54226060b792e08152aeaaf49de1
Author: Antoine Lambert <antoine.lambert@inria.fr>
Date:   Thu Nov 26 01:03:43 2020 +0100

    storage: Add branch names filtering support in snapshot_get_branches
    
    Add optional branches_name_pattern parameter to snapshot_get_branches
    enabling  to filter the branches to return according to their names.
    
    If provided, only the branches whose names match the provided pattern
    will be returned.
    
    Related to T2782

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1072/ for more details.

Thanks for tackling this issue!

I'm not keen on the idea of passing (more) regular expressions to the database backend. I would much prefer adding a filter for included, or excluded, branch name prefixes only, rather than allowing arbitrary regular expressions which can become expensive quite quickly. This gives a chance for the queries to be indexed too.

I also think that it may be time to bite the bullet and flatten the snapshot_branch/snapshot_branches tables in the postgresql backend, this would make the query behavior much more predictable there.

The total storage for both tables is 28 GB (snapshot_branch, 340 million unique branches) + 107 GB (snapshot_branches mapping, 2.54 billion branch mappings); I expect a flattened storage would take at most around 300 GB which is not that far off from the combined storage of both tables, and would save us quite some headaches.

I'm not keen on the idea of passing (more) regular expressions to the database backend. I would much prefer adding a filter for included, or excluded, branch name prefixes only, rather than allowing arbitrary regular expressions which can become expensive quite quickly. This gives a chance for the queries to be indexed too.

Ack, I understand the concerns. In that case, we could perform the name filtering in the Python code right after the branches have been returned by the database query. This is currently what is done in the cassandra storage implementation.
But this has the limitations to execute multiple queries when there is a lot of branches to filter (for instance when a snapshot contains several thousands of PR branches to filter out).
Anyway, most of the snapshots have reasonable sizes so the example mentioned above is an edge case.

I also think we should not restrict name filtering on prefixes only in order to offer a search feature on branch names.

I also think that it may be time to bite the bullet and flatten the snapshot_branch/snapshot_branches tables in the postgresql backend, this would make the query behavior much more predictable there.
The total storage for both tables is 28 GB (snapshot_branch, 340 million unique branches) + 107 GB (snapshot_branches mapping, 2.54 billion branch mappings); I expect a flattened storage would take at most around 300 GB which is not that far off from the combined storage of both tables, and > would save us quite some headaches.

Seems reasonable to flatten those tables indeed. What would be the migration plan here ? Create a new table for flatten snapshot, execute a migration script then modify the snapshot related methods in storage to use the new table ?