
storage: Allow to filter out branches by prefix when counting them
Closed, Public

Authored by anlambert on Mar 5 2021, 5:27 PM.

Details

Summary

Add an optional branch_name_exclude_prefix parameter to the
snapshot_count_branches method of the Storage interface.

It makes it possible to exclude branches whose names start with a
given prefix when counting.

The purpose is to get accurate counters in swh-web, where pull
request branches will be filtered out by default.

Related to T2782

Depends on D4615
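
As a rough illustration of the behavior described in the summary, the standalone sketch below counts branches per target type while skipping names that start with a given prefix. The `count_branches` helper and the plain-dict snapshot layout are assumptions made for this example; they are not the swh-storage implementation.

```python
from collections import Counter
from typing import Dict, Optional


def count_branches(
    branches: Dict[bytes, Optional[str]],
    branch_name_exclude_prefix: Optional[bytes] = None,
) -> Dict[Optional[str], int]:
    """Count branches per target type, skipping excluded names.

    `branches` maps a branch name to its target type (hypothetical layout,
    chosen for this example only; None stands for a dangling branch).
    """
    counts: Counter = Counter()
    for name, target_type in branches.items():
        if branch_name_exclude_prefix is not None and name.startswith(
            branch_name_exclude_prefix
        ):
            # Branch name starts with the excluded prefix: do not count it.
            continue
        counts[target_type] += 1
    return dict(counts)


snapshot = {
    b"refs/heads/master": "revision",
    b"refs/pull/1/head": "revision",
    b"refs/pull/2/head": "revision",
    b"refs/tags/v1.0": "release",
}

# Without the filter, pull request branches inflate the "revision" counter.
print(count_branches(snapshot))
# With the filter, only the regular branch and the tag are counted.
print(count_branches(snapshot, branch_name_exclude_prefix=b"refs/pull/"))
```

Leaving the parameter at its default of None keeps the previous behavior, which is why it can be added as an optional argument.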

Diff Detail

Repository
rDSTO Storage manager
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

Build is green

Patch application report for D5208 (id=18658)

Could not rebase; Attempt merge onto 88ff2c2fa0...

Updating 88ff2c2f..047061b1
Fast-forward
 sql/upgrades/168.sql               |  39 +++++++++++
 swh/storage/cassandra/cql.py       |  31 ++++++++-
 swh/storage/cassandra/storage.py   |  27 +++++++-
 swh/storage/in_memory.py           |   9 ++-
 swh/storage/interface.py           |  10 ++-
 swh/storage/postgresql/db.py       |  28 ++++++--
 swh/storage/postgresql/storage.py  |  19 +++++-
 swh/storage/sql/30-schema.sql      |   2 +-
 swh/storage/sql/40-funcs.sql       |  12 +++-
 swh/storage/tests/storage_tests.py | 128 +++++++++++++++++++++++++++++++++++++
 10 files changed, 285 insertions(+), 20 deletions(-)
 create mode 100644 sql/upgrades/168.sql
Changes applied before test
commit 047061b14e37f4674394a24fc06a32c847b476be
Author: Antoine Lambert <antoine.lambert@inria.fr>
Date:   Fri Mar 5 16:33:29 2021 +0100

    storage: Allow to filter out branches by prefix when counting them
    
    Add an optional branch_name_exclude_prefix parameter to the
    snapshot_count_branches method of the Storage interface.
    
    It enables to filter out branches whose name starts with a
    given prefix when counting.
    
    The purpose is to get accurate counters in swh-web as pull
    request branches will be filtered out by default.
    
    Related to T2782

commit 2f1fdfe322435b59d08bd56e4a521d01caddacd6
Author: Antoine Lambert <antoine.lambert@inria.fr>
Date:   Tue Mar 2 14:42:57 2021 +0100

    storage: Add branch names filtering support in snapshot_get_branches
    
    Add optional branch_name_include_pattern parameter to snapshot_get_branches,
    if provided only branches whose name contains the given pattern will be
    returned.
    
    Add optional branch_name_exclude_prefix parameter to snapshot_get_branches,
    if provided branches whose name starts with the given prefix will not be
    returned.
    
    Purpose of these new features: add a search form in the branches view
    of swh-web and filter out pull request branches (whose names start with
    "refs/pull/") by default.
    
    Related to T2782

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1188/ for more details.

Simplify in memory implementation.

Build is green

Patch application report for D5208 (id=18660)

Could not rebase; Attempt merge onto 88ff2c2fa0...

Updating 88ff2c2f..2d8a6484
Fast-forward
 sql/upgrades/168.sql               |  39 +++++++++++
 swh/storage/cassandra/cql.py       |  31 ++++++++-
 swh/storage/cassandra/storage.py   |  27 +++++++-
 swh/storage/in_memory.py           |   8 ++-
 swh/storage/interface.py           |  10 ++-
 swh/storage/postgresql/db.py       |  28 ++++++--
 swh/storage/postgresql/storage.py  |  19 +++++-
 swh/storage/sql/30-schema.sql      |   2 +-
 swh/storage/sql/40-funcs.sql       |  12 +++-
 swh/storage/tests/storage_tests.py | 128 +++++++++++++++++++++++++++++++++++++
 10 files changed, 284 insertions(+), 20 deletions(-)
 create mode 100644 sql/upgrades/168.sql
Changes applied before test
commit 2d8a648438a9540e91ac7b637f7eb7a26c0fd091
Author: Antoine Lambert <antoine.lambert@inria.fr>
Date:   Fri Mar 5 16:33:29 2021 +0100

    storage: Allow to filter out branches by prefix when counting them
    
    Add an optional branch_name_exclude_prefix parameter to the
    snapshot_count_branches method of the Storage interface.
    
    It enables to filter out branches whose name starts with a
    given prefix when counting.
    
    The purpose is to get accurate counters in swh-web as pull
    request branches will be filtered out by default.
    
    Related to T2782

commit 2f1fdfe322435b59d08bd56e4a521d01caddacd6
Author: Antoine Lambert <antoine.lambert@inria.fr>
Date:   Tue Mar 2 14:42:57 2021 +0100

    storage: Add branch names filtering support in snapshot_get_branches
    
    Add optional branch_name_include_pattern parameter to snapshot_get_branches,
    if provided only branches whose name contains the given pattern will be
    returned.
    
    Add optional branch_name_exclude_prefix parameter to snapshot_get_branches,
    if provided branches whose name starts with the given prefix will not be
    returned.
    
    Purpose of these new features: add a search form in the branches view
    of swh-web and filter out pull request branches (whose names start with
    "refs/pull/") by default.
    
    Related to T2782

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1190/ for more details.

vlorentz added a subscriber: vlorentz.

Thanks.

We could probably improve performance by restricting which rows are scanned (in both the PostgreSQL and Cassandra backends), but we can do that in a future diff without changing this interface.

Just one comment though:

swh/storage/interface.py
730

Should be bytes, not str.
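
A likely reason for this request, not spelled out in the comment, is that branch names are handled as raw bytes and are not guaranteed to be valid UTF-8. The small sketch below uses made-up branch names and shows that a bytes prefix works on any name, while a str prefix cannot be compared with a bytes name at all.

```python
# Made-up branch names; the last one is not valid UTF-8.
names = [b"refs/heads/master", b"refs/pull/1/head", b"refs/heads/\xe9t\xe9"]

# A bytes prefix works for every name, whatever its encoding.
kept = [n for n in names if not n.startswith(b"refs/pull/")]
print(kept)

# A str prefix is rejected: Python refuses to mix str and bytes.
raised = False
try:
    names[0].startswith("refs/pull/")
except TypeError as exc:
    raised = True
    print("TypeError:", exc)
```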

This revision now requires changes to proceed. Mar 8 2021, 10:01 AM

Bump db version and add missing migration file

Build is green

Patch application report for D5208 (id=18674)

Could not rebase; Attempt merge onto 88ff2c2fa0...

Updating 88ff2c2f..c8a1c643
Fast-forward
 sql/upgrades/168.sql               |  39 +++++++++++
 sql/upgrades/169.sql               |  19 ++++++
 swh/storage/cassandra/cql.py       |  31 ++++++++-
 swh/storage/cassandra/storage.py   |  27 +++++++-
 swh/storage/in_memory.py           |   8 ++-
 swh/storage/interface.py           |  10 ++-
 swh/storage/postgresql/db.py       |  28 ++++++--
 swh/storage/postgresql/storage.py  |  19 +++++-
 swh/storage/sql/30-schema.sql      |   2 +-
 swh/storage/sql/40-funcs.sql       |  12 +++-
 swh/storage/tests/storage_tests.py | 128 +++++++++++++++++++++++++++++++++++++
 11 files changed, 303 insertions(+), 20 deletions(-)
 create mode 100644 sql/upgrades/168.sql
 create mode 100644 sql/upgrades/169.sql
Changes applied before test
commit c8a1c643f9460d1dd5f97e47828c2d26f2ea6dff
Author: Antoine Lambert <antoine.lambert@inria.fr>
Date:   Fri Mar 5 16:33:29 2021 +0100

    storage: Allow to filter out branches by prefix when counting them
    
    Add an optional branch_name_exclude_prefix parameter to the
    snapshot_count_branches method of the Storage interface.
    
    It enables to filter out branches whose name starts with a
    given prefix when counting.
    
    The purpose is to get accurate counters in swh-web as pull
    request branches will be filtered out by default.
    
    Related to T2782

commit 2f1fdfe322435b59d08bd56e4a521d01caddacd6
Author: Antoine Lambert <antoine.lambert@inria.fr>
Date:   Tue Mar 2 14:42:57 2021 +0100

    storage: Add branch names filtering support in snapshot_get_branches
    
    Add optional branch_name_include_pattern parameter to snapshot_get_branches,
    if provided only branches whose name contains the given pattern will be
    returned.
    
    Add optional branch_name_exclude_prefix parameter to snapshot_get_branches,
    if provided branches whose name starts with the given prefix will not be
    returned.
    
    Purpose of these new features: add a search form in the branches view
    of swh-web and filter out pull request branches (whose names start with
    "refs/pull/") by default.
    
    Related to T2782

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1191/ for more details.

Change branch_name_exclude_prefix parameter type from str to bytes.

Build is green

Patch application report for D5208 (id=18677)

Could not rebase; Attempt merge onto 88ff2c2fa0...

Updating 88ff2c2f..97edab89
Fast-forward
 sql/upgrades/168.sql               |  39 ++++++++++++
 sql/upgrades/169.sql               |  19 ++++++
 swh/storage/cassandra/cql.py       |  31 ++++++++-
 swh/storage/cassandra/storage.py   |  25 +++++++-
 swh/storage/in_memory.py           |   8 ++-
 swh/storage/interface.py           |  10 ++-
 swh/storage/postgresql/db.py       |  28 ++++++--
 swh/storage/postgresql/storage.py  |  19 +++++-
 swh/storage/sql/30-schema.sql      |   2 +-
 swh/storage/sql/40-funcs.sql       |  12 +++-
 swh/storage/tests/storage_tests.py | 127 +++++++++++++++++++++++++++++++++++++
 11 files changed, 300 insertions(+), 20 deletions(-)
 create mode 100644 sql/upgrades/168.sql
 create mode 100644 sql/upgrades/169.sql
Changes applied before test
commit 97edab890047ae614950b1febbd17ac92d92be2b
Author: Antoine Lambert <antoine.lambert@inria.fr>
Date:   Fri Mar 5 16:33:29 2021 +0100

    storage: Allow to filter out branches by prefix when counting them
    
    Add an optional branch_name_exclude_prefix parameter to the
    snapshot_count_branches method of the Storage interface.
    
    It enables to filter out branches whose name starts with a
    given prefix when counting.
    
    The purpose is to get accurate counters in swh-web as pull
    request branches will be filtered out by default.
    
    Related to T2782

commit b1f0caaee70b5291fbe6b90f7ae1eea434ae821b
Author: Antoine Lambert <antoine.lambert@inria.fr>
Date:   Tue Mar 2 14:42:57 2021 +0100

    storage: Add branch names filtering support in snapshot_get_branches
    
    Add optional branch_name_include_pattern parameter to snapshot_get_branches,
    if provided only branches whose name contains the given pattern will be
    returned.
    
    Add optional branch_name_exclude_prefix parameter to snapshot_get_branches,
    if provided branches whose name starts with the given prefix will not be
    returned.
    
    Purpose of these new features: add a search form in the branches view
    of swh-web and filter out pull request branches (whose names start with
    "refs/pull/") by default.
    
    Related to T2782

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1193/ for more details.

This revision is now accepted and ready to land. Mar 8 2021, 12:45 PM

Update:

  • Rebase
  • Properly implement branch counting with branch name prefix exclusion in the Cassandra backend

This revision is now accepted and ready to land. Mar 11 2021, 3:42 PM

Build is green

Patch application report for D5208 (id=18750)

Could not rebase; Attempt merge onto b8e10f00cf...

Updating b8e10f00..b41e3f20
Fast-forward
 sql/upgrades/169.sql               |  39 ++++++++++++
 sql/upgrades/170.sql               |  19 ++++++
 swh/storage/cassandra/cql.py       |  87 +++++++++++++++++++++++--
 swh/storage/cassandra/storage.py   |  26 +++++++-
 swh/storage/in_memory.py           |  19 +++++-
 swh/storage/interface.py           |  10 ++-
 swh/storage/postgresql/db.py       |  28 ++++++--
 swh/storage/postgresql/storage.py  |  19 +++++-
 swh/storage/sql/30-schema.sql      |   2 +-
 swh/storage/sql/40-funcs.sql       |  12 +++-
 swh/storage/tests/storage_tests.py | 127 +++++++++++++++++++++++++++++++++++++
 11 files changed, 362 insertions(+), 26 deletions(-)
 create mode 100644 sql/upgrades/169.sql
 create mode 100644 sql/upgrades/170.sql
Changes applied before test
commit b41e3f200b6362ad34caa68a4dd13ddff11b4188
Author: Antoine Lambert <antoine.lambert@inria.fr>
Date:   Fri Mar 5 16:33:29 2021 +0100

    storage: Allow to filter out branches by prefix when counting them
    
    Add an optional branch_name_exclude_prefix parameter to the
    snapshot_count_branches method of the Storage interface.
    
    It enables to filter out branches whose name starts with a
    given prefix when counting.
    
    The purpose is to get accurate counters in swh-web as pull
    request branches will be filtered out by default.
    
    Related to T2782

commit 017055d705b3bd7b5cda2548d17e3afb2d9a75cd
Author: Antoine Lambert <antoine.lambert@inria.fr>
Date:   Tue Mar 2 14:42:57 2021 +0100

    storage: Add branch names filtering support in snapshot_get_branches
    
    Add optional branch_name_include_pattern parameter to snapshot_get_branches,
    if provided only branches whose name contains the given pattern will be
    returned.
    
    Add optional branch_name_exclude_prefix parameter to snapshot_get_branches,
    if provided branches whose name starts with the given prefix will not be
    returned.
    
    Purpose of these new features: add a search form in the branches view
    of swh-web and filter out pull request branches (whose names start with
    "refs/pull/") by default.
    
    Related to T2782

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1216/ for more details.

vlorentz added inline comments.
swh/storage/cassandra/cql.py
668

doesn't work if the last byte of the prefix is \xff. You'll need to either implement the carry, or convert to an int

This revision now requires changes to proceed. Mar 11 2021, 5:20 PM
swh/storage/cassandra/cql.py
668

and you'll also need to handle the issue of prefixes that are made entirely of \xff :/
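
The two edge cases raised in these comments can be sketched as follows. This is only an illustration of the carry idea, not the code in cql.py: the names `next_bytes_value` and `count_excluding_prefix` are made up for this example, and a sorted Python list stands in for rows ordered by branch name.

```python
import bisect
from typing import List, Optional


def next_bytes_value(value: bytes) -> Optional[bytes]:
    """Return the smallest byte string greater than every string that
    starts with `value`, or None if no such bound exists.
    """
    # Trailing 0xff bytes cannot be incremented, so drop them (the carry).
    stripped = value.rstrip(b"\xff")
    if not stripped:
        # Empty value, or value made entirely of 0xff: no upper bound.
        return None
    # Increment the last remaining byte.
    return stripped[:-1] + bytes([stripped[-1] + 1])


def count_excluding_prefix(sorted_names: List[bytes], prefix: bytes) -> int:
    """Count names that do not start with `prefix`, using range lookups
    on a sorted list instead of testing every name.
    """
    low = bisect.bisect_left(sorted_names, prefix)
    upper = next_bytes_value(prefix)
    if upper is None:
        # No upper bound: every name >= prefix starts with it.
        high = len(sorted_names)
    else:
        high = bisect.bisect_left(sorted_names, upper)
    return len(sorted_names) - (high - low)


names = sorted(
    [
        b"refs/heads/main",
        b"refs/pull/1/head",
        b"refs/pull/2/head",
        b"refs/tags/v1",
        b"\xff\xffx",
    ]
)

print(next_bytes_value(b"refs/pull/"))   # simple increment of the last byte
print(next_bytes_value(b"a\xff"))        # carry into the previous byte
print(next_bytes_value(b"\xff\xff"))     # no bound exists
print(count_excluding_prefix(names, b"refs/pull/"))
```

Returning None for an all-\xff prefix lets the caller fall back to an open-ended range rather than constructing an invalid upper bound.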

Update: Fix edge cases for branch name exclude filter in cassandra backend and add test.

Build is green

Patch application report for D5208 (id=18755)

Could not rebase; Attempt merge onto b8e10f00cf...

Updating b8e10f00..83d8630c
Fast-forward
 sql/upgrades/169.sql               |  39 +++++++++
 sql/upgrades/170.sql               |  19 ++++
 swh/storage/cassandra/cql.py       |  92 ++++++++++++++++++--
 swh/storage/cassandra/storage.py   |  26 +++++-
 swh/storage/in_memory.py           |  19 +++-
 swh/storage/interface.py           |  10 ++-
 swh/storage/postgresql/db.py       |  28 ++++--
 swh/storage/postgresql/storage.py  |  19 +++-
 swh/storage/sql/30-schema.sql      |   2 +-
 swh/storage/sql/40-funcs.sql       |  12 ++-
 swh/storage/tests/storage_tests.py | 174 +++++++++++++++++++++++++++++++++++++
 11 files changed, 414 insertions(+), 26 deletions(-)
 create mode 100644 sql/upgrades/169.sql
 create mode 100644 sql/upgrades/170.sql
Changes applied before test
commit 83d8630cf299ed4188c3b4011ae58be3a153863a
Author: Antoine Lambert <antoine.lambert@inria.fr>
Date:   Fri Mar 5 16:33:29 2021 +0100

    storage: Allow to filter out branches by prefix when counting them
    
    Add an optional branch_name_exclude_prefix parameter to the
    snapshot_count_branches method of the Storage interface.
    
    It enables to filter out branches whose name starts with a
    given prefix when counting.
    
    The purpose is to get accurate counters in swh-web as pull
    request branches will be filtered out by default.
    
    Related to T2782

commit 09a017827eb31cbec05252a7b200fd6af245a0a8
Author: Antoine Lambert <antoine.lambert@inria.fr>
Date:   Tue Mar 2 14:42:57 2021 +0100

    storage: Add branch names filtering support in snapshot_get_branches
    
    Add optional branch_name_include_pattern parameter to snapshot_get_branches,
    if provided only branches whose name contains the given pattern will be
    returned.
    
    Add optional branch_name_exclude_prefix parameter to snapshot_get_branches,
    if provided branches whose name starts with the given prefix will not be
    returned.
    
    Purpose of these new features: add a search form in the branches view
    of swh-web and filter out pull request branches (whose names start with
    "refs/pull/") by default.
    
    Related to T2782

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1219/ for more details.

Looks correct now!

Could you just add a short comment in snapshot_count_branches and a docstring to _next_prefix to explain what they do?

This revision is now accepted and ready to land. Mar 11 2021, 9:30 PM

Update: Add comments and rename _next_prefix to _next_bytes_value.

Build was aborted

Patch application report for D5208 (id=18781)

Could not rebase; Attempt merge onto b8e10f00cf...

Updating b8e10f00..61fc4bfc
Fast-forward
 sql/upgrades/169.sql               |  39 ++++++++
 sql/upgrades/170.sql               |  19 ++++
 swh/storage/cassandra/cql.py       | 104 ++++++++++++++++++++--
 swh/storage/cassandra/storage.py   |  26 +++++-
 swh/storage/in_memory.py           |  19 +++-
 swh/storage/interface.py           |  10 ++-
 swh/storage/postgresql/db.py       |  28 ++++--
 swh/storage/postgresql/storage.py  |  19 +++-
 swh/storage/sql/30-schema.sql      |   2 +-
 swh/storage/sql/40-funcs.sql       |  12 ++-
 swh/storage/tests/storage_tests.py | 178 +++++++++++++++++++++++++++++++++++++
 11 files changed, 430 insertions(+), 26 deletions(-)
 create mode 100644 sql/upgrades/169.sql
 create mode 100644 sql/upgrades/170.sql
Changes applied before test
commit 61fc4bfc75ca07dab08c893b1d72e3c8fac87f95
Author: Antoine Lambert <antoine.lambert@inria.fr>
Date:   Fri Mar 5 16:33:29 2021 +0100

    storage: Allow to filter out branches by prefix when counting them
    
    Add an optional branch_name_exclude_prefix parameter to the
    snapshot_count_branches method of the Storage interface.
    
    It enables to filter out branches whose name starts with a
    given prefix when counting.
    
    The purpose is to get accurate counters in swh-web as pull
    request branches will be filtered out by default.
    
    Related to T2782

commit d080085e7ff75d41972e4106b76c02183fcce987
Author: Antoine Lambert <antoine.lambert@inria.fr>
Date:   Tue Mar 2 14:42:57 2021 +0100

    storage: Add branch names filtering support in snapshot_get_branches
    
    Add optional branch_name_include_pattern parameter to snapshot_get_branches,
    if provided only branches whose name contains the given pattern will be
    returned.
    
    Add optional branch_name_exclude_prefix parameter to snapshot_get_branches,
    if provided branches whose name starts with the given prefix will not be
    returned.
    
    Purpose of these new features: add a search form in the branches view
    of swh-web and filter out pull request branches (whose names start with
    "refs/pull/") by default.
    
    Related to T2782

Link to build: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1221/
See console output for more information: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1221/console

Build is green

Patch application report for D5208 (id=18783)

Could not rebase; Attempt merge onto b8e10f00cf...

Updating b8e10f00..b565201d
Fast-forward
 sql/upgrades/169.sql               |  39 ++++++++
 sql/upgrades/170.sql               |  19 ++++
 swh/storage/cassandra/cql.py       | 104 ++++++++++++++++++++--
 swh/storage/cassandra/storage.py   |  26 +++++-
 swh/storage/in_memory.py           |  19 +++-
 swh/storage/interface.py           |  10 ++-
 swh/storage/postgresql/db.py       |  28 ++++--
 swh/storage/postgresql/storage.py  |  19 +++-
 swh/storage/sql/30-schema.sql      |   2 +-
 swh/storage/sql/40-funcs.sql       |  12 ++-
 swh/storage/tests/storage_tests.py | 178 +++++++++++++++++++++++++++++++++++++
 11 files changed, 430 insertions(+), 26 deletions(-)
 create mode 100644 sql/upgrades/169.sql
 create mode 100644 sql/upgrades/170.sql
Changes applied before test
commit b565201dcfe2a007a628de94323b84c1045b7b0c
Author: Antoine Lambert <antoine.lambert@inria.fr>
Date:   Fri Mar 5 16:33:29 2021 +0100

    storage: Allow to filter out branches by prefix when counting them
    
    Add an optional branch_name_exclude_prefix parameter to the
    snapshot_count_branches method of the Storage interface.
    
    It enables to filter out branches whose name starts with a
    given prefix when counting.
    
    The purpose is to get accurate counters in swh-web as pull
    request branches will be filtered out by default.
    
    Related to T2782

commit 93301a1f67acba349319c383dac9031a132a7470
Author: Antoine Lambert <antoine.lambert@inria.fr>
Date:   Tue Mar 2 14:42:57 2021 +0100

    storage: Add branch names filtering support in snapshot_get_branches
    
    Add optional branch_name_include_substring parameter to snapshot_get_branches,
    if provided only branches whose name contains the given substring will be
    returned.
    
    Add optional branch_name_exclude_prefix parameter to snapshot_get_branches,
    if provided branches whose name starts with the given prefix will not be
    returned.
    
    Purpose of these new features: add a search form in the branches view
    of swh-web and filter out pull request branches (whose names start with
    "refs/pull/") by default.
    
    Related to T2782

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1223/ for more details.