Page MenuHomeSoftware Heritage

storage: Add branch names filtering support in snapshot_get_branches
ClosedPublic

Authored by anlambert on Nov 26 2020, 6:16 PM.

Details

Summary

Add optional branch_name_include_substring parameter to snapshot_get_branches,
if provided only branches whose name contains the given substring will be
returned.

Add optional branch_name_exclude_prefix parameter to snapshot_get_branches,
if provided branches whose name starts with the given pprefix will not be
returned.

The purpose is to add branches filtering in swh-web. As that feature
could be useful elsewhere, I thought adding it at swh-storage level
was better than implementing it client side each time it is required.

Related to T2782

Diff Detail

Repository
rDSTO Storage manager
Branch
snapshot-branch-names-filtering
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 19844
Build 30816: Phabricator diff pipeline on jenkinsJenkins console · Jenkins
Build 30815: arc lint + arc unit

Unit TestsFailed

TimeTest
2,066 msJenkins > .tox.py3.lib.python3.7.site-packages.swh.storage.tests.test_backfill::test_backfiller
swh_storage_backend_config = {'check_config': {'check_write': True}, 'cls': 'local', 'db': "dbname=storage user=postgres host=127.0.0.1 port=16873 ...riter': {'brokers': ['127.0.0.1:41371'], 'client_id': 'kafka_writer-1', 'cls': 'kafka', 'prefix': 'muhacqhoos-1'}, ...} kafka_prefix = 'muhacqhoos', kafka_consumer_group = 'test-consumer-muhacqhoos' kafka_server = '127.0.0.1:41371'
3,006 msJenkins > .tox.py3.lib.python3.7.site-packages.swh.storage.tests.test_kafka_writer::test_storage_direct_writer
kafka_prefix = 'spgjmbwrev', kafka_server = '127.0.0.1:41371' consumer = <cimpl.Consumer object at 0x7f56dfc32400>
2,009 msJenkins > .tox.py3.lib.python3.7.site-packages.swh.storage.tests.test_kafka_writer::test_storage_direct_writer_anonymized
kafka_prefix = 'zojblblzpy', kafka_server = '127.0.0.1:41371' consumer = <cimpl.Consumer object at 0x7f56dfc3f9d8>
2,009 msJenkins > .tox.py3.lib.python3.7.site-packages.swh.storage.tests.test_replay::test_storage_play_anonymized[False]
kafka_prefix = 'begfbpyhzi', kafka_consumer_group = 'test-consumer-begfbpyhzi' kafka_server = '127.0.0.1:41371', privileged = False
2,010 msJenkins > .tox.py3.lib.python3.7.site-packages.swh.storage.tests.test_replay::test_storage_play_anonymized[True]
kafka_prefix = 'hsztdsxihy', kafka_consumer_group = 'test-consumer-hsztdsxihy' kafka_server = '127.0.0.1:41371', privileged = True
View Full Test Results (7 Failed · 753 Passed · 23 Skipped)

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

hmm, indeed, it looks like we don't have any index on (object_id, branch_name).

Build is green

Patch application report for D4615 (id=18592)

Rebasing onto 88ff2c2fa0...

Current branch diff-target is up to date.
Changes applied before test
commit c7b6ca35f37f6634f50bd6c481ecf7d8d3ee0941
Author: Antoine Lambert <antoine.lambert@inria.fr>
Date:   Tue Mar 2 14:42:57 2021 +0100

    storage: Add branch names filtering support in snapshot_get_branches
    
    Add optional branch_name_include_pattern parameter to snapshot_get_branches,
    if provided only branches whose name contains the given pattern will be
    returned.
    
    Add optional branch_name_exclude_pattern parameter to snapshot_get_branches,
    if provided branches whose name contains the given pattern will not be
    returned.
    
    Related to T2782

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1176/ for more details.

hmm, indeed, it looks like we don't have any index on (object_id, branch_name).

Indeed.

softwareheritage=> \d snapshot_branch
                                      Table "public.snapshot_branch"
   Column    |      Type       | Collation | Nullable |                      Default                       
-------------+-----------------+-----------+----------+----------------------------------------------------
 object_id   | bigint          |           | not null | nextval('snapshot_branch_object_id_seq'::regclass)
 name        | bytea           |           | not null | 
 target      | bytea           |           |          | 
 target_type | snapshot_target |           |          | 
Indexes:
    "snapshot_branch_pkey" PRIMARY KEY, btree (object_id)
    "snapshot_branch_name_idx" UNIQUE, btree (name) WHERE target_type IS NULL AND target IS NULL
    "snapshot_branch_name_target_target_type_idx" btree (name, target, target_type)
    "snapshot_branch_target_type_target_name_idx" UNIQUE, btree (target_type, target, name)
Check constraints:
    "snapshot_branch_target_check" CHECK ((target_type IS NULL) = (target IS NULL))
    "snapshot_target_check" CHECK ((target_type <> ALL (ARRAY['content'::snapshot_target, 'directory'::snapshot_target, 'revision'::snapshot_target, 'release'::snapshot_target, 'snapshot'::snapshot_target])) OR length(target) = 20)
Referenced by:
    TABLE "snapshot_branches" CONSTRAINT "snapshot_branches_branch_id_fkey" FOREIGN KEY (branch_id) REFERENCES snapshot_branch(object_id)
Publications:
    "softwareheritage"

Prefer to filter out branches by prefix only as the main use case of that
feature will be to filter out pull request branches (whose names start with
"/refs/pull/") in swh-web views.

Also it enables to count branches not starting with prefix in an efficient
way when using the cassandra backend (diff incoming).

Build is green

Patch application report for D4615 (id=18650)

Rebasing onto 88ff2c2fa0...

Current branch diff-target is up to date.
Changes applied before test
commit 2f1fdfe322435b59d08bd56e4a521d01caddacd6
Author: Antoine Lambert <antoine.lambert@inria.fr>
Date:   Tue Mar 2 14:42:57 2021 +0100

    storage: Add branch names filtering support in snapshot_get_branches
    
    Add optional branch_name_include_pattern parameter to snapshot_get_branches,
    if provided only branches whose name contains the given pattern will be
    returned.
    
    Add optional branch_name_exclude_prefix parameter to snapshot_get_branches,
    if provided branches whose name starts with the given prefix will not be
    returned.
    
    Purpose of these new features: add a search form in the branches view
    of swh-web and filter out pull request branches (whose names start with
    "refs/pull/") by default.
    
    Related to T2782

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1187/ for more details.

Change branch_name_include_pattern and branch_name_exclude_prefix parameters type from str to bytes.

Build was aborted

Patch application report for D4615 (id=18676)

Rebasing onto 88ff2c2fa0...

Current branch diff-target is up to date.
Changes applied before test
commit b1f0caaee70b5291fbe6b90f7ae1eea434ae821b
Author: Antoine Lambert <antoine.lambert@inria.fr>
Date:   Tue Mar 2 14:42:57 2021 +0100

    storage: Add branch names filtering support in snapshot_get_branches
    
    Add optional branch_name_include_pattern parameter to snapshot_get_branches,
    if provided only branches whose name contains the given pattern will be
    returned.
    
    Add optional branch_name_exclude_prefix parameter to snapshot_get_branches,
    if provided branches whose name starts with the given prefix will not be
    returned.
    
    Purpose of these new features: add a search form in the branches view
    of swh-web and filter out pull request branches (whose names start with
    "refs/pull/") by default.
    
    Related to T2782

Link to build: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1192/
See console output for more information: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1192/console

Build is green

Patch application report for D4615 (id=18676)

Rebasing onto 88ff2c2fa0...

Current branch diff-target is up to date.
Changes applied before test
commit b1f0caaee70b5291fbe6b90f7ae1eea434ae821b
Author: Antoine Lambert <antoine.lambert@inria.fr>
Date:   Tue Mar 2 14:42:57 2021 +0100

    storage: Add branch names filtering support in snapshot_get_branches
    
    Add optional branch_name_include_pattern parameter to snapshot_get_branches,
    if provided only branches whose name contains the given pattern will be
    returned.
    
    Add optional branch_name_exclude_prefix parameter to snapshot_get_branches,
    if provided branches whose name starts with the given prefix will not be
    returned.
    
    Purpose of these new features: add a search form in the branches view
    of swh-web and filter out pull request branches (whose names start with
    "refs/pull/") by default.
    
    Related to T2782

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1194/ for more details.

This revision is now accepted and ready to land.Mar 10 2021, 11:02 AM

do we really need branch_name_include_pattern, though?

It should at least be discouraged in the documentation

do we really need branch_name_include_pattern, though?

It should at least be discouraged in the documentation

I intend to use it in swh-web by adding a search form in the branches view. But indeed performances are not great
with the cassandra backend as the filtering is done client side. With postgresql I did not observe any kind of performance
issues though.

EDIT: disregard this comment, I can't read

You should be able to implement branch_name_exclude_prefix server-side on cassandra, by using two queries (one for the branches before, and one for the branches after the prefix)

This revision now requires changes to proceed.Mar 10 2021, 1:24 PM

You should be able to implement branch_name_exclude_prefix server-side on cassandra, by using two queries (one for the branches before, and one for the branches after the prefix)

Ack, got it.

Filter out branches whose name starts with branch_name_exclude_prefix
server-side in the cassandra backend by adapting the query to scan branches:

  • branches in a [from_name, to_name[ range are now considered when scanning rows
  • when one wants to filter out branches whose name starts with a given prefix, the from_name and to_name values will be updated to not scan the rows whose name starts with the prefix

Build is green

Patch application report for D4615 (id=18738)

Rebasing onto c4fdd6dbd0...

First, rewinding head to replay your work on top of it...
Applying: storage: Add branch names filtering support in snapshot_get_branches
Changes applied before test
commit 23380fe65e326ffbeaa4d9b56de88787e8ab4dd4
Author: Antoine Lambert <antoine.lambert@inria.fr>
Date:   Tue Mar 2 14:42:57 2021 +0100

    storage: Add branch names filtering support in snapshot_get_branches
    
    Add optional branch_name_include_pattern parameter to snapshot_get_branches,
    if provided only branches whose name contains the given pattern will be
    returned.
    
    Add optional branch_name_exclude_prefix parameter to snapshot_get_branches,
    if provided branches whose name starts with the given prefix will not be
    returned.
    
    Purpose of these new features: add a search form in the branches view
    of swh-web and filter out pull request branches (whose names start with
    "refs/pull/") by default.
    
    Related to T2782

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1210/ for more details.

vlorentz added inline comments.
swh/storage/cassandra/storage.py
689–699

I'm confused. How does that return branches that are after the prefix?

693

b"\xff" is not the last possible branch name, there could be branches named b"\xff\xff"

729–730

would be easier by setting a variable before filtering on branch_name_include_pattern

731–739

what does this do?

This revision now requires changes to proceed.Mar 11 2021, 1:27 PM

Current cassandra implementation for filtering out branches whose name starts with a prefix will not work for prefix starting with \xff so two different CQL queries are needed after all, same for the branches count in D5208.

swh/storage/cassandra/storage.py
689–699

The snapshot branches CQL query have been updated to scan branches inside a range [from_branch_name, to_branch_name[.
The loop will start scanning rows inside the [branches_from, exclude_prefix[ then it will update the range to scan to `[next_exclude_prefix, b"\xff"[ when all branches before the prefix have been returned.

693

I guess I am forced to use two different CQL queries then, one with WHERE name < prefix and the other with WHERE name > prefix. I wanted to modify the less code as possible but looks like it is not possible to use a single query.

729–730

you mean for the right part of the if condition ?

731–739

It updates the names range to scan snapshot branches:

  • when all branches before the prefix have been scanned, the range is set to branches after the prefix
  • otherwise the lower bound of the range is updated

It looks like you are using one loop to do one thing, then a different one; which complicates all the conditional inside because they have to check for two different things.

Maybe move the loop to a function and call it twice with different arguments?

It looks like you are using one loop to do one thing, then a different one; which complicates all the conditional inside because they have to check for two different things.

Maybe move the loop to a function and call it twice with different arguments?

I took a wrong direction, the only solution is the one that you suggested at the beginning, split the CQL query into two when ones wants to filter out a branch name prefix.

I just finish to implement it at the cql module level which leads to few changes at the storage module level.

Update: Properly implement branch name prefix exclude filter for the cassandra backend.

Build has FAILED

Patch application report for D4615 (id=18748)

Rebasing onto b8e10f00cf...

First, rewinding head to replay your work on top of it...
Applying: storage: Add branch names filtering support in snapshot_get_branches
Using index info to reconstruct a base tree...
M	swh/storage/cassandra/cql.py
M	swh/storage/cassandra/storage.py
M	swh/storage/in_memory.py
M	swh/storage/interface.py
M	swh/storage/postgresql/db.py
M	swh/storage/postgresql/storage.py
M	swh/storage/sql/30-schema.sql
M	swh/storage/sql/40-funcs.sql
M	swh/storage/tests/storage_tests.py
Falling back to patching base and 3-way merge...
Auto-merging swh/storage/tests/storage_tests.py
Auto-merging swh/storage/sql/40-funcs.sql
Auto-merging swh/storage/sql/30-schema.sql
Auto-merging swh/storage/postgresql/storage.py
Auto-merging swh/storage/postgresql/db.py
Auto-merging swh/storage/interface.py
Auto-merging swh/storage/in_memory.py
Auto-merging swh/storage/cassandra/storage.py
Auto-merging swh/storage/cassandra/cql.py
Auto-merging sql/upgrades/168.sql
CONFLICT (add/add): Merge conflict in sql/upgrades/168.sql
Patch failed at 0001 storage: Add branch names filtering support in snapshot_get_branches

Resolve all conflicts manually, mark them as resolved with
"git add/rm <conflicted_files>", then run "git rebase --continue".
You can instead skip this commit: run "git rebase --skip".
To abort and get back to the state before "git rebase", run "git rebase --abort".

Rebase failed (ret=1)!

Could not rebase; Attempt merge onto b8e10f00cf...

Already up to date.
Changes applied before test
commit df867d3c7c07f9f5944016a26da95c4b2dbb6698
Author: Antoine Lambert <antoine.lambert@inria.fr>
Date:   Tue Mar 2 14:42:57 2021 +0100

    storage: Add branch names filtering support in snapshot_get_branches
    
    Add optional branch_name_include_pattern parameter to snapshot_get_branches,
    if provided only branches whose name contains the given pattern will be
    returned.
    
    Add optional branch_name_exclude_prefix parameter to snapshot_get_branches,
    if provided branches whose name starts with the given prefix will not be
    returned.
    
    Purpose of these new features: add a search form in the branches view
    of swh-web and filter out pull request branches (whose names start with
    "refs/pull/") by default.
    
    Related to T2782

Link to build: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1214/
See console output for more information: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1214/console

Build was aborted

Patch application report for D4615 (id=18749)

Rebasing onto b8e10f00cf...

Current branch diff-target is up to date.
Changes applied before test
commit 017055d705b3bd7b5cda2548d17e3afb2d9a75cd
Author: Antoine Lambert <antoine.lambert@inria.fr>
Date:   Tue Mar 2 14:42:57 2021 +0100

    storage: Add branch names filtering support in snapshot_get_branches
    
    Add optional branch_name_include_pattern parameter to snapshot_get_branches,
    if provided only branches whose name contains the given pattern will be
    returned.
    
    Add optional branch_name_exclude_prefix parameter to snapshot_get_branches,
    if provided branches whose name starts with the given prefix will not be
    returned.
    
    Purpose of these new features: add a search form in the branches view
    of swh-web and filter out pull request branches (whose names start with
    "refs/pull/") by default.
    
    Related to T2782

Link to build: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1215/
See console output for more information: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1215/console

Build is green

Patch application report for D4615 (id=18749)

Rebasing onto b8e10f00cf...

Current branch diff-target is up to date.
Changes applied before test
commit 017055d705b3bd7b5cda2548d17e3afb2d9a75cd
Author: Antoine Lambert <antoine.lambert@inria.fr>
Date:   Tue Mar 2 14:42:57 2021 +0100

    storage: Add branch names filtering support in snapshot_get_branches
    
    Add optional branch_name_include_pattern parameter to snapshot_get_branches,
    if provided only branches whose name contains the given pattern will be
    returned.
    
    Add optional branch_name_exclude_prefix parameter to snapshot_get_branches,
    if provided branches whose name starts with the given prefix will not be
    returned.
    
    Purpose of these new features: add a search form in the branches view
    of swh-web and filter out pull request branches (whose names start with
    "refs/pull/") by default.
    
    Related to T2782

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1217/ for more details.

Update: Fix edge cases for branch name exclude filter in cassandra backend and update tests.

Build is green

Patch application report for D4615 (id=18754)

Rebasing onto b8e10f00cf...

Current branch diff-target is up to date.
Changes applied before test
commit 09a017827eb31cbec05252a7b200fd6af245a0a8
Author: Antoine Lambert <antoine.lambert@inria.fr>
Date:   Tue Mar 2 14:42:57 2021 +0100

    storage: Add branch names filtering support in snapshot_get_branches
    
    Add optional branch_name_include_pattern parameter to snapshot_get_branches,
    if provided only branches whose name contains the given pattern will be
    returned.
    
    Add optional branch_name_exclude_prefix parameter to snapshot_get_branches,
    if provided branches whose name starts with the given prefix will not be
    returned.
    
    Purpose of these new features: add a search form in the branches view
    of swh-web and filter out pull request branches (whose names start with
    "refs/pull/") by default.
    
    Related to T2782

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1218/ for more details.

vlorentz added inline comments.
swh/storage/cassandra/cql.py
641
670–674

can't you reuse the function from the other diff?

675–679
swh/storage/interface.py
703

s/pattern/substring/ ?

swh/storage/tests/storage_tests.py
3044–3049

could be improved; it doesn't check that the same branch isn't returned multiple times, or bogus branches are returned.

What about something like this?

expected_branches = [
    branch_name for branch_name in snapshot.branches
    if include_pattern is None or include_pattern in branch_name
    and exclude_prefix is None or branch_name.startswith(exclude_prefix)
]
assert sorted(branches) == sorted(expected_branches)

It also makes pytest give better errors, showing what branches are missing/too much instead of just a number inequality

3074

why a regexp instead of in?

3090–3095

same comment as above on the assertions, same code should work if you add a slicing

3109–3114

ditto

This revision now requires changes to proceed.Mar 12 2021, 9:46 AM
swh/storage/cassandra/cql.py
641

yes, better name indeed

670–674

The other diff depends on that diff and put that code in the _next_prefix function.
As it is only use here in that diff, I did not wrap it in a function yet.

swh/storage/interface.py
703

ack, will rename

swh/storage/tests/storage_tests.py
3044–3049

ack

3074

oh right, I initially used regexps when I started that diff, this must have remain through the numerous code changes.

3090–3095

ack

3109–3114

ack

swh/storage/cassandra/cql.py
670–674

ok

Update: Address latest @vlorentz comments.

Build is green

Patch application report for D4615 (id=18780)

Rebasing onto b8e10f00cf...

Current branch diff-target is up to date.
Changes applied before test
commit d080085e7ff75d41972e4106b76c02183fcce987
Author: Antoine Lambert <antoine.lambert@inria.fr>
Date:   Tue Mar 2 14:42:57 2021 +0100

    storage: Add branch names filtering support in snapshot_get_branches
    
    Add optional branch_name_include_pattern parameter to snapshot_get_branches,
    if provided only branches whose name contains the given pattern will be
    returned.
    
    Add optional branch_name_exclude_prefix parameter to snapshot_get_branches,
    if provided branches whose name starts with the given prefix will not be
    returned.
    
    Purpose of these new features: add a search form in the branches view
    of swh-web and filter out pull request branches (whose names start with
    "refs/pull/") by default.
    
    Related to T2782

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1220/ for more details.

Build was aborted

Patch application report for D4615 (id=18782)

Rebasing onto b8e10f00cf...

Current branch diff-target is up to date.
Changes applied before test
commit 93301a1f67acba349319c383dac9031a132a7470
Author: Antoine Lambert <antoine.lambert@inria.fr>
Date:   Tue Mar 2 14:42:57 2021 +0100

    storage: Add branch names filtering support in snapshot_get_branches
    
    Add optional branch_name_include_substring parameter to snapshot_get_branches,
    if provided only branches whose name contains the given substring will be
    returned.
    
    Add optional branch_name_exclude_prefix parameter to snapshot_get_branches,
    if provided branches whose name starts with the given prefix will not be
    returned.
    
    Purpose of these new features: add a search form in the branches view
    of swh-web and filter out pull request branches (whose names start with
    "refs/pull/") by default.
    
    Related to T2782

Link to build: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1222/
See console output for more information: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1222/console

Build is green

Patch application report for D4615 (id=18782)

Rebasing onto b8e10f00cf...

Current branch diff-target is up to date.
Changes applied before test
commit 93301a1f67acba349319c383dac9031a132a7470
Author: Antoine Lambert <antoine.lambert@inria.fr>
Date:   Tue Mar 2 14:42:57 2021 +0100

    storage: Add branch names filtering support in snapshot_get_branches
    
    Add optional branch_name_include_substring parameter to snapshot_get_branches,
    if provided only branches whose name contains the given substring will be
    returned.
    
    Add optional branch_name_exclude_prefix parameter to snapshot_get_branches,
    if provided branches whose name starts with the given prefix will not be
    returned.
    
    Purpose of these new features: add a search form in the branches view
    of swh-web and filter out pull request branches (whose names start with
    "refs/pull/") by default.
    
    Related to T2782

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1224/ for more details.

This revision is now accepted and ready to land.Mar 15 2021, 10:19 AM