Page MenuHomeSoftware Heritage

Enable to filter searched origins by visit types
ClosedPublic

Authored by anlambert on Thu, Feb 11, 2:12 PM.

Details

Summary

That diff adds support to filter origins by visit types in swh-search.

The following changes have been made:

  • Add a new visit_types field to elasticsearch document for origin.
  • Add a new optional visit_types parameter to origin_search method in SearchInterface.
  • Implement visit types filtering in search backends.
  • Send origin visit types to elasticsearch when processing origin visits in journal client.

I have tested that I could populate a local elasticsearch instance with those new data
using the following command.

$ swh --log-level DEBUG search -C ~/.config/swh/search.yml journal-client objects -o origin_visit

I used the following configuration file:

search.yml
search:
  cls: elasticsearch
  hosts:
    - http://localhost:9200

journal:
  brokers:
    - kafka1.internal.softwareheritage.org
    - kafka2.internal.softwareheritage.org
    - kafka3.internal.softwareheritage.org
    - kafka4.internal.softwareheritage.org
    
  prefix: swh.journal.objects
  group_id: anlambert.search

Related to T2869

Diff Detail

Repository
rDSEA Archive search
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
swh/search/tests/test_cli.py
22–25

I put that change in a separate commit.

Build is green

Patch application report for D5064 (id=18060)

Rebasing onto 47db624364...

Current branch diff-target is up to date.
Changes applied before test
commit 47d4600e5b0bed96a6f5027e15c5bd92324f5842
Author: Antoine Lambert <antoine.lambert@inria.fr>
Date:   Thu Feb 11 13:48:53 2021 +0100

    Enable to filter searched origins by visit type
    
    Add a new visit_type field to elasticsearch document for origin.
    
    Add a new optional visit_type parameter to origin_search method in
    SearchInterface.
    
    Implement visit type filtering in search backends.
    
    Send origin visit type to elasticsearch when processing origin visits
    in journal client.
    
    Related to T2869

commit b7837c4112d5d29c0908c354023702e0ecada0c3
Author: Antoine Lambert <antoine.lambert@inria.fr>
Date:   Thu Feb 11 13:59:38 2021 +0100

    test_cli: Fix deprecation warnings

See https://jenkins.softwareheritage.org/job/DSEA/job/tests-on-diff/74/ for more details.

swh/search/interface.py
45 ↗(On Diff #18060)

Maybe it will be better to allow the filtering on multiple visit types ?

swh/search/elasticsearch.py
104 ↗(On Diff #18060)

should be keyword, it has a very small set of possible values and doesn't need full-text search.

swh/search/interface.py
45 ↗(On Diff #18060)

I was about to suggest that too, if it's not too much trouble.

(If it it, accept a List[str] and error on len != 1, so we can easily add it in the future)

It's missing support for origins with multiple visit types. It may be enough to just replace "visit_type": visit["type"], with "visit_type": [visit["type"]], in the journal client

It's missing support for origins with multiple visit types. It may be enough to just replace "visit_type": visit["type"], with "visit_type": [visit["type"]], in the journal client

Indeed, this should work. TIL that in ES any field can contain zero or more values by default (https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-types.html#types-array-handling).

swh/search/elasticsearch.py
104 ↗(On Diff #18060)

ack

swh/search/interface.py
45 ↗(On Diff #18060)

This seems pretty easy to implement so go for it.

This revision now requires changes to proceed.Thu, Feb 11, 5:37 PM

Handling multiple visit types was a little bit more subtle that I thought
but I managed to find a good solution I think.

Now a list of visit types for an origin will be maintained without duplicates
in elasticsearch.

When searching origins with multiple visit types, an origin will be returned if
it has any of the requested visit types.

anlambert retitled this revision from Enable to filter searched origins by visit type to Enable to filter searched origins by visit types.Thu, Feb 11, 6:14 PM
anlambert edited the summary of this revision. (Show Details)

Build is green

Patch application report for D5064 (id=18075)

Rebasing onto 47db624364...

Current branch diff-target is up to date.
Changes applied before test
commit 391fb080311998a230ec53e465341c768a1e76e8
Author: Antoine Lambert <antoine.lambert@inria.fr>
Date:   Thu Feb 11 13:48:53 2021 +0100

    Enable to filter searched origins by visit types
    
    Add a new visit_types field to elasticsearch document for origin.
    
    Add a new optional visit_types parameter to origin_search method in
    SearchInterface.
    
    Implement visit types filtering in search backends, an origin wil be
    returned if it has any of the requested visit types.
    
    Send origin visit types to elasticsearch when processing origin visits
    in journal client.
    
    Related to T2869

commit b7837c4112d5d29c0908c354023702e0ecada0c3
Author: Antoine Lambert <antoine.lambert@inria.fr>
Date:   Thu Feb 11 13:59:38 2021 +0100

    test_cli: Fix deprecation warnings

See https://jenkins.softwareheritage.org/job/DSEA/job/tests-on-diff/75/ for more details.

Remove not needed line in painless script code.

Build is green

Patch application report for D5064 (id=18076)

Rebasing onto 47db624364...

Current branch diff-target is up to date.
Changes applied before test
commit 0147ee7eeee8270bb549730bd70e968ba0568817
Author: Antoine Lambert <antoine.lambert@inria.fr>
Date:   Thu Feb 11 13:48:53 2021 +0100

    Enable to filter searched origins by visit types
    
    Add a new visit_types field to elasticsearch document for origin.
    
    Add a new optional visit_types parameter to origin_search method in
    SearchInterface.
    
    Implement visit types filtering in search backends, an origin wil be
    returned if it has any of the requested visit types.
    
    Send origin visit types to elasticsearch when processing origin visits
    in journal client.
    
    Related to T2869

commit b7837c4112d5d29c0908c354023702e0ecada0c3
Author: Antoine Lambert <antoine.lambert@inria.fr>
Date:   Thu Feb 11 13:59:38 2021 +0100

    test_cli: Fix deprecation warnings

See https://jenkins.softwareheritage.org/job/DSEA/job/tests-on-diff/76/ for more details.

It's lame that we need to rewrite the handling of has_visits just so that we can update visit_types, but I guess we don't have a choice :(

swh/search/elasticsearch.py
135–139 ↗(On Diff #18076)

IMO that's easier to read

swh/search/tests/test_search.py
131–138 ↗(On Diff #18076)

The function name should mention updates (as it's what it's really testing); and there should be a separate test for updating the origins in a separate step (ie. self.search.origin_update([{"url": "http://foobar.baz", "visit_types": ["git", "svn"]}]) + self.search.origin_update([{"url": "http://foobar.baz", "visit_types": ["hg"]}]))

149–169 ↗(On Diff #18076)

What does it add to test_origin_with_multiple_visit_types_search?

This revision now requires changes to proceed.Fri, Feb 12, 10:56 AM

It also misses a test where an origin is initially inserted with visit_types, then visit_types is added

swh/search/elasticsearch.py
135–139 ↗(On Diff #18076)

ack, been a a while since I did not write java-ish code.

swh/search/tests/test_search.py
131–138 ↗(On Diff #18076)

ack

149–169 ↗(On Diff #18076)

It tests combination of visit types when searching but I guess I could merge that test code in the one above.

It also misses a test where an origin is initially inserted with visit_types, then visit_types is added

and perhaps also one with an origin without a visit types at all, because we will have the case in production where the origin_visit_status contains former visits without the type.

The case can also exists if an origin message is indexed before its visits or during the migration

Update: Address review comments.

Build is green

Patch application report for D5064 (id=18088)

Rebasing onto 47db624364...

Current branch diff-target is up to date.
Changes applied before test
commit efdcf8aca2178f549f1caa575026e816f8c1b609
Author: Antoine Lambert <antoine.lambert@inria.fr>
Date:   Thu Feb 11 13:48:53 2021 +0100

    Enable to filter searched origins by visit types
    
    Add a new visit_types field to elasticsearch document for origin.
    
    Add a new optional visit_types parameter to origin_search method in
    SearchInterface.
    
    Implement visit types filtering in search backends, an origin wil be
    returned if it has any of the requested visit types.
    
    Send origin visit types to elasticsearch when processing origin visits
    in journal client.
    
    Related to T2869

commit b7837c4112d5d29c0908c354023702e0ecada0c3
Author: Antoine Lambert <antoine.lambert@inria.fr>
Date:   Thu Feb 11 13:59:38 2021 +0100

    test_cli: Fix deprecation warnings

See https://jenkins.softwareheritage.org/job/DSEA/job/tests-on-diff/77/ for more details.

It also misses a test where an origin is initially inserted with visit_types, then visit_types is added

and perhaps also one with an origin without a visit types at all

Oh yes indeed, I meant "inserted WITHOUT visit_types" ;)

Intrinsic metadata update must also be handled by the painless script and tested.

If that's too much trouble, we could do it in two steps. One "painless" for the visit types, and one "doc_as_upsert" for everything else.

It nearly doubles the query time, but it would keep the code simple and less error prone

Update:

  • Simplify painless script for updating an origin document
  • Update tests
  • Add a new test for intrinsic metadata update in a separate commit

Build is green

Patch application report for D5064 (id=18091)

Rebasing onto 47db624364...

Current branch diff-target is up to date.
Changes applied before test
commit 132552f278a149f06ff30071a46fd4f26b88348d
Author: Antoine Lambert <antoine.lambert@inria.fr>
Date:   Fri Feb 12 15:19:14 2021 +0100

    test_search: Add test for intrinsic metadata update

commit 234f9943a29f3aa712a858859a7a9cd3fe12b650
Author: Antoine Lambert <antoine.lambert@inria.fr>
Date:   Thu Feb 11 13:48:53 2021 +0100

    Enable to filter searched origins by visit types
    
    Add a new visit_types field to elasticsearch document for origin.
    
    Add a new optional visit_types parameter to origin_search method in
    SearchInterface.
    
    Implement visit types filtering in search backends, an origin wil be
    returned if it has any of the requested visit types.
    
    Send origin visit types to elasticsearch when processing origin visits
    in journal client.
    
    Related to T2869

commit b7837c4112d5d29c0908c354023702e0ecada0c3
Author: Antoine Lambert <antoine.lambert@inria.fr>
Date:   Thu Feb 11 13:59:38 2021 +0100

    test_cli: Fix deprecation warnings

See https://jenkins.softwareheritage.org/job/DSEA/job/tests-on-diff/79/ for more details.

Build is green

Patch application report for D5064 (id=18092)

Rebasing onto 47db624364...

Current branch diff-target is up to date.
Changes applied before test
commit cc72df622f46fb8e07b9c3e1091f452b8d4b33fe
Author: Antoine Lambert <antoine.lambert@inria.fr>
Date:   Thu Feb 11 13:48:53 2021 +0100

    Enable to filter searched origins by visit types
    
    Add a new visit_types field to elasticsearch document for origin.
    
    Add a new optional visit_types parameter to origin_search method in
    SearchInterface.
    
    Implement visit types filtering in search backends, an origin wil be
    returned if it has any of the requested visit types.
    
    Send origin visit types to elasticsearch when processing origin visits
    in journal client.
    
    Related to T2869

commit ebd19ea7a298094c6b1d58381921eaca0ba98ccb
Author: Antoine Lambert <antoine.lambert@inria.fr>
Date:   Fri Feb 12 15:19:14 2021 +0100

    test_search: Add test for intrinsic metadata update

commit b7837c4112d5d29c0908c354023702e0ecada0c3
Author: Antoine Lambert <antoine.lambert@inria.fr>
Date:   Thu Feb 11 13:59:38 2021 +0100

    test_cli: Fix deprecation warnings

See https://jenkins.softwareheritage.org/job/DSEA/job/tests-on-diff/80/ for more details.

If that's too much trouble, we could do it in two steps. One "painless" for the visit types, and one "doc_as_upsert" for everything else.

It nearly doubles the query time, but it would keep the code simple and less error prone

I managed to update the painless script to avoid sending two queries when visit types need to be updated.

I added a new test updating intrinsic metadata in a separate commit to ensure no regression.

It is missing a test of calling origin_search with a superset of the visits an origin has (ie. what happens when we do origin_update({"url": "foo", "visit_types": ["git"]}) then origin_search(url_pattern="foo", visit_types=["git", "hg"]).

This raises a bigger issue about the semantics of that operation; I can't guess what ES does, but for the in-mem it would return nothing. So the current semantics would be "return origins that have all the given visit types". Is this why we want? If yes, it should be clearly documented in the docstring.

swh/search/elasticsearch.py
135–139 ↗(On Diff #18076)

Could you add comments? It's not a bit hard to follow.

swh/search/in_memory.py
44–49 ↗(On Diff #18092)

store a set in document["visit_types"], it will make origin_search more readable

105–107 ↗(On Diff #18092)

Looks like this is equivalent.

(but still probably not what we want, see my comment about the semantics of origin_search)

swh/search/tests/test_search.py
160–173 ↗(On Diff #18092)

I find this to be harder to read than the code being tested.

Could you keep it simpler, by removing all the loops and writing the origin types explicitly every time?

166 ↗(On Diff #18092)

IMO this is clearer

389–413 ↗(On Diff #18092)

why this new test?

This revision now requires changes to proceed.Mon, Feb 15, 10:24 AM

This raises a bigger issue about the semantics of that operation; I can't guess what ES does, but for the in-mem it would return nothing.
So the current semantics would be "return origins that have all the given visit types". Is this why we want? If yes, it should be clearly documented in the docstring.

ES will return an origin if it has any of the provided visit types (terms query). That's also the behavior I originally wanted to implement
in the memory backend but I made a mistake: I should have use a set intersection instead of a set difference.

Still some tests to update and I will push a new diff revision.

swh/search/tests/test_search.py
160–173 ↗(On Diff #18092)

Ack, will simplify a bit

389–413 ↗(On Diff #18092)

As I have modified the way document gets updated in elasticsearch, I added a new commit with that test to ensure no update regression for metadata.

ES will return an origin if it has any of the provided visit types (terms query). That's also the behavior I originally wanted to implement
in the memory backend

Cool!

anlambert marked an inline comment as done.

Rebase and address latest @vlorentz comments.

Build is green

Patch application report for D5064 (id=18242)

Rebasing onto 589ef4d8c4...

Current branch diff-target is up to date.
Changes applied before test
commit bac5fd87f612647174beedbee5965b554466dd94
Author: Antoine Lambert <antoine.lambert@inria.fr>
Date:   Thu Feb 11 13:48:53 2021 +0100

    Enable to filter searched origins by visit types
    
    Add a new visit_types field to elasticsearch document for origin.
    
    Add a new optional visit_types parameter to origin_search method in
    SearchInterface.
    
    Implement visit types filtering in search backends, an origin wil be
    returned if it has any of the requested visit types.
    
    Send origin visit types to elasticsearch when processing origin visits
    in journal client.
    
    Related to T2869

commit d1b86802952bc35e795e7e3bb4f8866b7fd0c61c
Author: Antoine Lambert <antoine.lambert@inria.fr>
Date:   Fri Feb 12 15:19:14 2021 +0100

    test_search: Add test for intrinsic metadata update

commit 91612192bca214fb83c52b9c99f7a1beadcd1c91
Author: Antoine Lambert <antoine.lambert@inria.fr>
Date:   Thu Feb 11 13:59:38 2021 +0100

    test_cli: Fix deprecation warnings

See https://jenkins.softwareheritage.org/job/DSEA/job/tests-on-diff/87/ for more details.

Looks good

Two more nitpicks:

swh/search/elasticsearch.py
135–141 ↗(On Diff #18242)

for readability

swh/search/tests/test_search.py
152–165 ↗(On Diff #18242)

could you split this into two functions? eg. add_visit_type and check_visit_types.

This revision now requires changes to proceed.Thu, Feb 18, 2:10 PM

Build is green

Patch application report for D5064 (id=18244)

Rebasing onto 589ef4d8c4...

Current branch diff-target is up to date.
Changes applied before test
commit 29c634c69e37539776e347c4edf9efaf121544ba
Author: Antoine Lambert <antoine.lambert@inria.fr>
Date:   Thu Feb 11 13:48:53 2021 +0100

    Enable to filter searched origins by visit types
    
    Add a new visit_types field to elasticsearch document for origin.
    
    Add a new optional visit_types parameter to origin_search method in
    SearchInterface.
    
    Implement visit types filtering in search backends, an origin wil be
    returned if it has any of the requested visit types.
    
    Send origin visit types to elasticsearch when processing origin visits
    in journal client.
    
    Related to T2869

commit d1b86802952bc35e795e7e3bb4f8866b7fd0c61c
Author: Antoine Lambert <antoine.lambert@inria.fr>
Date:   Fri Feb 12 15:19:14 2021 +0100

    test_search: Add test for intrinsic metadata update

commit 91612192bca214fb83c52b9c99f7a1beadcd1c91
Author: Antoine Lambert <antoine.lambert@inria.fr>
Date:   Thu Feb 11 13:59:38 2021 +0100

    test_cli: Fix deprecation warnings

See https://jenkins.softwareheritage.org/job/DSEA/job/tests-on-diff/88/ for more details.

This revision is now accepted and ready to land.Thu, Feb 18, 2:58 PM