Page MenuHomeSoftware Heritage

search*: Type origin_search(...) -> PagedResult[Dict]
ClosedPublic

Authored by ardumont on Jul 30 2020, 7:33 PM.

Details

Summary

Returned paginated result with PagedResult object.

The result per page is kept as a Dict because it could be enriched.
It'd be less effort to enrich it with Dict instead of a model object Origin (that has very little chance to grow new interesting fields).

Impacts swh-web (D3661)

grep -r 'origin_search(' */swh/** | grep -v "swh-storage"
swh-search/swh/search/elasticsearch.py:    def origin_search(
swh-search/swh/search/tests/test_search.py:        actual_page = self.search.origin_search(url_pattern="foobar")
swh-search/swh/search/tests/test_search.py:        actual_page = self.search.origin_search(url_pattern="barb")
swh-search/swh/search/tests/test_search.py:        actual_page = self.search.origin_search(url_pattern="bar")
swh-search/swh/search/tests/test_search.py:        actual_page = self.search.origin_search(url_pattern="barbaz")
swh-search/swh/search/tests/test_search.py:        actual_page = self.search.origin_search(url_pattern="qu")
swh-search/swh/search/tests/test_search.py:        actual_page = self.search.origin_search(url_pattern="qux")
swh-search/swh/search/tests/test_search.py:        actual_page = self.search.origin_search(url_pattern="foo bar baz")
swh-search/swh/search/tests/test_search.py:        actual_page = self.search.origin_search(url_pattern="foobar", with_visit=True)
swh-search/swh/search/tests/test_search.py:        actual_page = self.search.origin_search(url_pattern="foobar", with_visit=True)
swh-search/swh/search/tests/test_search.py:        actual_page = self.search.origin_search(url_pattern="foobar", with_visit=True)
swh-search/swh/search/tests/test_search.py:        actual_page = self.search.origin_search(metadata_pattern="foo")
swh-search/swh/search/tests/test_search.py:        actual_page = self.search.origin_search(metadata_pattern="foo bar")
swh-search/swh/search/tests/test_search.py:        actual_page = self.search.origin_search(metadata_pattern="bar baz")
swh-search/swh/search/tests/test_search.py:        actual_page = self.search.origin_search(metadata_pattern="foo bar baz")
swh-search/swh/search/tests/test_search.py:        actual_page = self.search.origin_search(metadata_pattern="foo")
swh-search/swh/search/tests/test_search.py:        actual_page = self.search.origin_search(metadata_pattern="foo bar")
swh-search/swh/search/tests/test_search.py:        actual_page = self.search.origin_search(metadata_pattern="bar baz")
swh-search/swh/search/tests/test_cli.py:    actual_page = swh_search.origin_search(url_pattern="foobar")
swh-search/swh/search/tests/test_cli.py:    actual_page = swh_search.origin_search(url_pattern="foobar", with_visit=True)
swh-search/swh/search/tests/test_cli.py:    actual_page = swh_search.origin_search(url_pattern="foobar", with_visit=False)
swh-search/swh/search/tests/test_cli.py:    actual_page = swh_search.origin_search(url_pattern="foobar", with_visit=True)
swh-search/swh/search/in_memory.py:    def origin_search(
swh-web/swh/web/api/views/origin.py:def api_origin_search(request, url_pattern):
swh-web/swh/web/common/service.py:        results = search.origin_search(
swh-web/swh/web/common/service.py:        origins_raw = storage.origin_search(
swh-web/swh/web/assets/src/bundles/browse/origin-search.js:    baseSearchUrl = new URL(Urls.api_1_origin_search(searchQueryText), window.location);
swh-web/swh/web/tests/api/views/test_origin.py:def test_api_origin_search(api_client, mocker, backend):

Related to T645

Test Plan

tox

Diff Detail

Repository
rDSEA Archive search
Branch
master
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 14120
Build 21689: Phabricator diff pipeline on jenkinsJenkins console · Jenkins
Build 21688: arc lint + arc unit

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

Adapt according to review:

  • Rework commit message
  • Fix missing client part (fix the builds \m/)

Thanks

ardumont retitled this revision from wip: search*: Type origin_search(...) -> PagedResult[Origin] to search*: Type origin_search(...) -> PagedResult[Origin].Jul 30 2020, 7:51 PM
ardumont edited the test plan for this revision. (Show Details)
ardumont added inline comments.
swh/search/tests/test_search.py
223–227

I tried to replace this with pytest.mark.parametrize but that does not work... [1]

I think it's not working directly on method classes so later i guess.

[1] https://docs.pytest.org/en/stable/parametrize.html

Build is green

Patch application report for D3657 (id=12879)

Rebasing onto 586b7b96bc...

Current branch diff-target is up to date.
Changes applied before test
commit 0542b743f6cecdfae7d987fd2e76beeacd285021
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Thu Jul 30 19:32:38 2020 +0200

    search*: Type origin_search(...) -> PagedResult[Origin]
    
    Related to T645

See https://jenkins.softwareheritage.org/job/DSEA/job/tests-on-diff/14/ for more details.

ardumont added inline comments.
swh/search/utils.py
19

Maybe that can go one module down in swh.core now.
It's shared behavior with storage now (swh.storage.algos.origin.iter_origin_visit and iter_origin_visit_status comes to mind...)

swh/search/utils.py
19

Yeah, I have this on my mental todo-list

I don't think the return type should be Origin, because ES won't contain full objects so it won't work for other types. It really just returns an ID (so only str for now, but possibly Sha1Git and tuples for other search types in the future).

And I used a dict rather than just IDs, to allow returning more information in the future, which I'm pretty sure we'll want (eg. add a score, add information on what matched, etc.)

swh/search/elasticsearch.py
109

Why doesn't mypy detect this? -_-

swh/search/in_memory.py
105–106

you should rename this to hits while you're at it, to match ES' terminology

swh/search/interface.py
12

switch to strings for consistency with swh-storage, and you can drop the base64encoding

swh/search/tests/test_search.py
15–21

I think you should include the TLD in the variable name (eg. origin_foobar_baz), so it's clearer below why it matches.

If you disagree, at least rename origin_quux to origin_qux

This revision now requires changes to proceed.Jul 30 2020, 9:39 PM
swh/search/elasticsearch.py
109

I asked myself the same question ¯\_(ツ)_/¯

swh/search/interface.py
12

yes, ok, it's currently annoying me in the webapp ;)

Adapt according to review:

  • PagedResult[origin, str] (next-page-token is a string now)
  • rename origin variables with tld to clarify reading
  • rename matched to hits

Still returning PagedResult[Origin] though

Build is green

Patch application report for D3657 (id=12880)

Rebasing onto 586b7b96bc...

Current branch diff-target is up to date.
Changes applied before test
commit e0731fd30688ac8ed45f961a120f10bc022581a4
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Thu Jul 30 19:32:38 2020 +0200

    search*: Type origin_search(...) -> PagedResult[Origin]
    
    Related to T645

See https://jenkins.softwareheritage.org/job/DSEA/job/tests-on-diff/15/ for more details.

Simplify in-memory implementation

Build is green

Patch application report for D3657 (id=12881)

Rebasing onto 586b7b96bc...

Current branch diff-target is up to date.
Changes applied before test
commit 4eb62fb38826947722713f0ffa459d673045d8ea
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Thu Jul 30 19:32:38 2020 +0200

    search*: Type origin_search(...) -> PagedResult[Origin]
    
    Related to T645

See https://jenkins.softwareheritage.org/job/DSEA/job/tests-on-diff/16/ for more details.

swh/search/in_memory.py
24

dead code ^ (unrelated to this diff though)

I don't think the return type should be Origin, because ES won't contain full objects so it won't work for other types. It really just returns an ID (so only str for now, but possibly Sha1Git and tuples for other search types in the future).
And I used a dict rather than just IDs, to allow returning more information in the future, which I'm pretty sure we'll want (eg. add a score, add information on what matched, etc.)

ok, makes sense.
I'll revert that then (i'll keep the changes in the tests though)

  • Returned PagedResult[Dict[str, Any]] instead of origin model object
ardumont retitled this revision from search*: Type origin_search(...) -> PagedResult[Origin] to search*: Type origin_search(...) -> PagedResult[Dict].Jul 31 2020, 9:48 AM

Build is green

Patch application report for D3657 (id=12883)

Rebasing onto 586b7b96bc...

Current branch diff-target is up to date.
Changes applied before test
commit 568416e53803135fee612150645e1a69da1d1788
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Fri Jul 31 09:48:13 2020 +0200

    Returned PagedResult[Dict[str, Any]] instead of origin model object
    
    This is a commit to fixup into its parent
    
    Related to T645

commit 4eb62fb38826947722713f0ffa459d673045d8ea
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Thu Jul 30 19:32:38 2020 +0200

    search*: Type origin_search(...) -> PagedResult[Origin]
    
    Related to T645

See https://jenkins.softwareheritage.org/job/DSEA/job/tests-on-diff/17/ for more details.

ardumont marked an inline comment as done.

Drop no longer needed code

Build is green

Patch application report for D3657 (id=12884)

Rebasing onto 586b7b96bc...

Current branch diff-target is up to date.
Changes applied before test
commit ee4a42e319cc0378031b457f74a91063959d224b
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Fri Jul 31 09:48:13 2020 +0200

    Returned PagedResult[Dict[str, Any]] instead of origin model object
    
    This is a commit to fixup into its parent
    
    Related to T645

commit 4eb62fb38826947722713f0ffa459d673045d8ea
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Thu Jul 30 19:32:38 2020 +0200

    search*: Type origin_search(...) -> PagedResult[Origin]
    
    Related to T645

See https://jenkins.softwareheritage.org/job/DSEA/job/tests-on-diff/18/ for more details.

ardumont added inline comments.
swh/search/elasticsearch.py
41

I extracted those out because i kept on modifying that part which is "buried" inside the module.
Now it's at the top, no longer need to check back and forth where it's at.

swh/search/elasticsearch.py
177

This does not affect the results.
It does affect the readability because it confuses the read, the actual limit used is at line 203 on the right.

Add tests around the tests.utils module

Drop spurious blank in copyright header on no longer touched module ¯\_(ツ)_/¯

Build is green

Patch application report for D3657 (id=12885)

Rebasing onto 586b7b96bc...

Current branch diff-target is up to date.
Changes applied before test
commit d5d70c5d36873f9697c185aed04953d16d785bf9
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Fri Jul 31 09:48:13 2020 +0200

    Returned PagedResult[Dict[str, Any]] instead of origin model object
    
    This is a commit to fixup into its parent
    
    Related to T645

commit 4eb62fb38826947722713f0ffa459d673045d8ea
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Thu Jul 30 19:32:38 2020 +0200

    search*: Type origin_search(...) -> PagedResult[Origin]
    
    Related to T645

See https://jenkins.softwareheritage.org/job/DSEA/job/tests-on-diff/19/ for more details.

Build is green

Patch application report for D3657 (id=12886)

Rebasing onto 586b7b96bc...

Current branch diff-target is up to date.
Changes applied before test
commit bb8db005e066aa46e483103923024011b819ab70
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Fri Jul 31 09:48:13 2020 +0200

    Returned PagedResult[Dict[str, Any]] instead of origin model object
    
    This is a commit to fixup into its parent
    
    Related to T645

commit 4eb62fb38826947722713f0ffa459d673045d8ea
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Thu Jul 30 19:32:38 2020 +0200

    search*: Type origin_search(...) -> PagedResult[Origin]
    
    Related to T645

See https://jenkins.softwareheritage.org/job/DSEA/job/tests-on-diff/20/ for more details.

Build is green

Patch application report for D3657 (id=12887)

Rebasing onto 586b7b96bc...

Current branch diff-target is up to date.
Changes applied before test
commit cc3cd65a44dcd1a3e13d144f5786997b2338f914
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Fri Jul 31 09:48:13 2020 +0200

    Returned PagedResult[Dict[str, Any]] instead of origin model object
    
    This is a commit to fixup into its parent
    
    Related to T645

commit 4eb62fb38826947722713f0ffa459d673045d8ea
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Thu Jul 30 19:32:38 2020 +0200

    search*: Type origin_search(...) -> PagedResult[Origin]
    
    Related to T645

See https://jenkins.softwareheritage.org/job/DSEA/job/tests-on-diff/21/ for more details.

I don't understand the point of assert_page_match

This code:

assert_page_match(actual_page, [{"url": "origin1"}], expected_page_token=True)

can be simply rewritten as:

assert actual_page.next_page_token is not None
assert actual_page.results == [{"url": "origin1"}]

and I find the latter more readable.

And you wouldn't need test_utils.py

I don't understand the point of assert_page_match

This code:

assert_page_match(actual_page, [{"url": "origin1"}], expected_page_token=True)

can be simply rewritten as:

assert actual_page.next_page_token is not None
assert actual_page.results == [{"url": "origin1"}]

and I find the latter more readable.

And you wouldn't need test_utils.py

when you have 1 result it's fine.
when you have multiple ones, the code is no longer what you exposed, we need to wrap checks with set on the urls...

Right.

Then you could write an assert_count_equal function, and use it like this: assert_count_equal(actual_page.results, [{"url": "origin1"}], {"url": "origin2"})

IMO it's much clearer what assert_count_equal does than assert_results_match without looking at the code.

Errrr, actually, no. The order of results matters, and tests must check it

swh/search/tests/test_search.py
269

Errrr, actually, no. The order of results matters, and tests must check it

In this test method, this does not look like the order matters.
What did I miss?

IMO it's much clearer what assert_count_equal does than assert_results_match without looking at the code.

What's the contract you foresee for assert_count_equal?

What's the contract you foresee for assert_count_equal?

https://docs.python.org/3/library/unittest.html#unittest.TestCase.assertCountEqual

swh/search/tests/test_search.py
223–227

finding a word in a short URL is a better word than finding the same word in a long URL.

223–227

better match *

Use the same test logic as before sorted the actual result and expected
result before comparing those. Drop the wrapping set.

swh/search/tests/test_search.py
223–227

in what context are you telling me this?

I don't get it.

(My initial parametrize comment was about the todo to drop hypothesis and try to use pytest's parametrize instead).

Build is green

Patch application report for D3657 (id=12892)

Rebasing onto 586b7b96bc...

Current branch diff-target is up to date.
Changes applied before test
commit 500a67eae7dead1152c7789069b0c5ed4b69254b
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Fri Jul 31 09:48:13 2020 +0200

    Returned PagedResult[Dict[str, Any]] instead of origin model object
    
    This is a commit to fixup into its parent
    
    Related to T645

commit 4eb62fb38826947722713f0ffa459d673045d8ea
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Thu Jul 30 19:32:38 2020 +0200

    search*: Type origin_search(...) -> PagedResult[Origin]
    
    Related to T645

See https://jenkins.softwareheritage.org/job/DSEA/job/tests-on-diff/22/ for more details.

swh/search/tests/test_search.py
223–227

sorry, I confused this comment with "In this test method, this does not look like the order matters.
What did I miss?"

swh/search/tests/test_search.py
269

What did I miss?

vlorentz answered elsewhere, here goes.

vlorentz: finding a word in a short URL is a better word than finding the same word in a long URL.

Adapt according to discussion

One test fails about the order, I don't get why yet.

Build has FAILED

Patch application report for D3657 (id=12893)

Rebasing onto 586b7b96bc...

Current branch diff-target is up to date.
Changes applied before test
commit 5f57714c1a7aac08f9b3efe10040f18612e09bca
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Thu Jul 30 19:32:38 2020 +0200

    search*: Type origin_search(...) -> PagedResult[Dict]
    
    Related to T645

Link to build: https://jenkins.softwareheritage.org/job/DSEA/job/tests-on-diff/23/
See console output for more information: https://jenkins.softwareheritage.org/job/DSEA/job/tests-on-diff/23/console

Fix data test issue, build is back to green

Build is green

Patch application report for D3657 (id=12894)

Rebasing onto 586b7b96bc...

Current branch diff-target is up to date.
Changes applied before test
commit 0912f950416179740b961b538d7a874a965be474
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Thu Jul 30 19:32:38 2020 +0200

    search*: Type origin_search(...) -> PagedResult[Dict]
    
    Related to T645

See https://jenkins.softwareheritage.org/job/DSEA/job/tests-on-diff/24/ for more details.

This revision is now accepted and ready to land.Jul 31 2020, 12:41 PM
swh/search/utils.py
19