Page MenuHomeSoftware Heritage

origin_search: Filter for instrinsic_metadata language and license
ClosedPublic

Authored by KShivendu on Jun 29 2021, 7:58 PM.

Details

Summary

Instrinsic_metadata often contains programmingLanguage and license fields
which are very useful as search filters. These values can be used until
license and language mined by swh.indexer aren't ready to use.

Diff Detail

Repository
rDSEA Archive search
Branch
intrinsic-metadata
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 22355
Build 34807: Phabricator diff pipeline on jenkinsJenkins console · Jenkins
Build 34806: arc lint + arc unit

Event Timeline

Build is green

Patch application report for D5949 (id=21359)

Rebasing onto 6b1d563d42...

Current branch diff-target is up to date.
Changes applied before test
commit 50567315ecd92439477b18875d2845a73b6147e3
Author: KShivendu <shivendu@iitbhilai.ac.in>
Date:   Tue Jun 29 17:53:20 2021 +0000

    journal_client: Store language and license from instrinsic_metadata

See https://jenkins.softwareheritage.org/job/DSEA/job/tests-on-diff/173/ for more details.

swh/search/elasticsearch.py
434

I'll add the tests for license and programming_language based on origin_search in the next draft.

437

should it be "term" instead of "match" ?

swh/search/journal_client.py
19 ↗(On Diff #21359)

I'll add documentation for this in the next draft.

You should use the fields in intrinsic_metadata instead of duplicating with new fields

@vlorentz, just in case you missed it, the values don't get duplicated. I'm popping them out of instrinsic_metdadata.
I was trying to avoid nested documents as I've read at many places that they slow down searches. (not sure by how much)

Think/research about it once. If you still want me to remove it, I'm totally fine with it. Just comment again.

But they are nested for a reason, being in the intrinsic_metadata field indicates their origin. Otherwise we would just move *all* the fields at the root instead of moving a couple fields and reindexing every time we want to use one.

If the only issue is the slowdown, can we keep them nested for now, and benchmark later to see if un-nesting is worth it?

If the only issue is the slowdown, can we keep them nested for now, and benchmark later to see if un-nesting is worth it?

Perfect ! I was thinking about asking for something like this :)

If the only issue is the slowdown, can we keep them nested for now, and benchmark later to see if un-nesting is worth it?

Perfect ! I was thinking about asking for something like this :)

+1.

More generally, don't try to compensate/optimize for something that could be.
Just implement the simplest approach and voice the concern about
possible caveats like the slowness here.

Then new tasks should eventually pop up about measuring performance.

  • origin_search: Expose language and license from instrinsic_metadata
  • test_search: Add test for language and license

Build is green

Patch application report for D5949 (id=21372)

Rebasing onto 6b1d563d42...

Current branch diff-target is up to date.
Changes applied before test
commit 0ca2683a9504b8b3c6fe595148bb63280d09bcd4
Author: KShivendu <shivendu@iitbhilai.ac.in>
Date:   Wed Jun 30 10:32:17 2021 +0000

    test_search: Add test for language and license

commit 0c4bcca6f3439527eb2a71776701a4c9989644cd
Author: KShivendu <shivendu@iitbhilai.ac.in>
Date:   Tue Jun 29 17:53:20 2021 +0000

    origin_search: Expose language and license from instrinsic_metadata

See https://jenkins.softwareheritage.org/job/DSEA/job/tests-on-diff/174/ for more details.

Thanks. A few more comments

swh/search/elasticsearch.py
433–465

Can you deduplicate these two?

437

What are the differences?

swh/search/in_memory.py
271–281

for consistent typing

swh/search/tests/test_search.py
452–471

license should be an URI

475–485

This tests only checks either programming_language or license filtering works; because it would still succeed if either didn't filter anything.

You should write separate tests for each.

This revision now requires changes to proceed.Jun 30 2021, 12:53 PM
  • origin_search: Allow search for multiple licenses and languages at once
  • test_search: Separate tests for programming_language and license

Build has FAILED

Patch application report for D5949 (id=21373)

Rebasing onto 6b1d563d42...

Current branch diff-target is up to date.
Changes applied before test
commit 2e2aefee0cbfbdcf67c199a66d7b324f05698314
Author: KShivendu <shivendu@iitbhilai.ac.in>
Date:   Wed Jun 30 12:57:56 2021 +0000

    origin_search: Allow search for multiple licenses and languages at once

commit 0ca2683a9504b8b3c6fe595148bb63280d09bcd4
Author: KShivendu <shivendu@iitbhilai.ac.in>
Date:   Wed Jun 30 10:32:17 2021 +0000

    test_search: Add test for language and license

commit 0c4bcca6f3439527eb2a71776701a4c9989644cd
Author: KShivendu <shivendu@iitbhilai.ac.in>
Date:   Tue Jun 29 17:53:20 2021 +0000

    origin_search: Expose language and license from instrinsic_metadata

Link to build: https://jenkins.softwareheritage.org/job/DSEA/job/tests-on-diff/175/
See console output for more information: https://jenkins.softwareheritage.org/job/DSEA/job/tests-on-diff/175/console

swh/search/elasticsearch.py
437

Afaik,

"term" = Exact match
"match" = Applies some preprocessing(analyzer) steps before matching

For example, "term" should be case-sensitive but not "match".

Also, "terms" = Exact match with elements of the given list

swh/search/in_memory.py
271–281

I didn't understand. Why [] ?

I believe programmingLanguage and license are string, not lists ( atleast when taken from instrinsic_metadata )

Can you please elaborate ?

swh/search/tests/test_search.py
452–471

Okay. I'll change that.

But, I "saved" https://pypi.org/project/Django/ in my self hosted swh instance. Its license field is "BSD-3-Clause" which isn't a URI.

While https://github.com/Edyta2801/Ex_2_Webpage has license "https://spdx.org/licenses/ISC"

So I think license isn't always a URI.

473–484

This test passes for Elasticsearch if I use

page = self.search.origin_search(url_pattern="foobar", license=["mit"])
....
page = self.search.origin_search(url_pattern="foobar", license=["bsd"])
...
page = self.search.origin_search(url_pattern="foobar", license=["mit", "clause"])

Because of this

I'll have to apply the analyzer which is used for license field (which is of type text).
But I couldn't find anything that directly applies the analyzer on the query term for nested fields when using terms.

This is one such way to apply the analyzer on the query but it's only available for terms

Any suggestions ?

473–484

This is one such way to apply the analyzer on the query but it's only available for terms

*only available for match

  • Use analyzer on the list of licenses and langauges

Build is green

Patch application report for D5949 (id=21374)

Rebasing onto 6b1d563d42...

Current branch diff-target is up to date.
Changes applied before test
commit feb9effff6fb359c86e590e6e984637bbe13de61
Author: KShivendu <shivendu@iitbhilai.ac.in>
Date:   Wed Jun 30 14:19:52 2021 +0000

    Use analyzer on the list of licenses and langauges

commit 2e2aefee0cbfbdcf67c199a66d7b324f05698314
Author: KShivendu <shivendu@iitbhilai.ac.in>
Date:   Wed Jun 30 12:57:56 2021 +0000

    origin_search: Allow search for multiple licenses and languages at once

commit 0ca2683a9504b8b3c6fe595148bb63280d09bcd4
Author: KShivendu <shivendu@iitbhilai.ac.in>
Date:   Wed Jun 30 10:32:17 2021 +0000

    test_search: Add test for language and license

commit 0c4bcca6f3439527eb2a71776701a4c9989644cd
Author: KShivendu <shivendu@iitbhilai.ac.in>
Date:   Tue Jun 29 17:53:20 2021 +0000

    origin_search: Expose language and license from instrinsic_metadata

See https://jenkins.softwareheritage.org/job/DSEA/job/tests-on-diff/176/ for more details.

KShivendu marked an inline comment as not done.Jun 30 2021, 4:28 PM
KShivendu added inline comments.
swh/search/tests/test_search.py
473–484

I fixed this with "should" and "match". You may ignore this thread.

Can you add a test with both programming_languages and licenses (and possibly an extra field that we already support), to make sure the search actually uses AND operators and not OR (ie. that it doesn't return results that match one one of the criteria)?

swh/search/elasticsearch.py
434–435

do you need to make this conditional?

437–462

no need for if here

464–471

Is this an OR or an AND operator used here? (please write it in a comment for readers unfamiliar with ES)

swh/search/in_memory.py
271–281

When unnormalized, they can be either a string, a list of strings, an object with an @id attribute, or a list of such objects.

swh/search/elasticsearch.py normalizes them to always be the latter; and swh/search/in_memory.py should do the same, but it looks like it does not. Could you fix this at the same time?

swh/search/tests/test_search.py
452–471

Then it's a bug in the indexer. According to https://www.w3.org/TR/json-ld/#node-objects :

If the node object contains the @id key, its value MUST be an IRI reference, or a compact IRI (including blank node identifiers).

KShivendu marked an inline comment as done.
  • test_search: Test for search on multiple instrinsic_metadata ields

Build is green

Patch application report for D5949 (id=21376)

Rebasing onto 6b1d563d42...

Current branch diff-target is up to date.
Changes applied before test
commit 3831143886ca282e1498de849a6f801b00090bd4
Author: KShivendu <shivendu@iitbhilai.ac.in>
Date:   Wed Jun 30 19:01:52 2021 +0000

    test_search: Test for search on multiple instrinsic_metadata ields

commit feb9effff6fb359c86e590e6e984637bbe13de61
Author: KShivendu <shivendu@iitbhilai.ac.in>
Date:   Wed Jun 30 14:19:52 2021 +0000

    Use analyzer on the list of licenses and langauges

commit 2e2aefee0cbfbdcf67c199a66d7b324f05698314
Author: KShivendu <shivendu@iitbhilai.ac.in>
Date:   Wed Jun 30 12:57:56 2021 +0000

    origin_search: Allow search for multiple licenses and languages at once

commit 0ca2683a9504b8b3c6fe595148bb63280d09bcd4
Author: KShivendu <shivendu@iitbhilai.ac.in>
Date:   Wed Jun 30 10:32:17 2021 +0000

    test_search: Add test for language and license

commit 0c4bcca6f3439527eb2a71776701a4c9989644cd
Author: KShivendu <shivendu@iitbhilai.ac.in>
Date:   Tue Jun 29 17:53:20 2021 +0000

    origin_search: Expose language and license from instrinsic_metadata

See https://jenkins.softwareheritage.org/job/DSEA/job/tests-on-diff/177/ for more details.

  • in_memory: Use expanded instrinsic_metadata

Build is green

Patch application report for D5949 (id=21377)

Rebasing onto 6b1d563d42...

Current branch diff-target is up to date.
Changes applied before test
commit f5bfdb4f3838027b0bfcea347f8d9e6eed9a4fd2
Author: KShivendu <shivendu@iitbhilai.ac.in>
Date:   Wed Jun 30 21:09:21 2021 +0000

    in_memory: Use expanded instrinsic_metadata

commit 3831143886ca282e1498de849a6f801b00090bd4
Author: KShivendu <shivendu@iitbhilai.ac.in>
Date:   Wed Jun 30 19:01:52 2021 +0000

    test_search: Test for search on multiple instrinsic_metadata ields

commit feb9effff6fb359c86e590e6e984637bbe13de61
Author: KShivendu <shivendu@iitbhilai.ac.in>
Date:   Wed Jun 30 14:19:52 2021 +0000

    Use analyzer on the list of licenses and langauges

commit 2e2aefee0cbfbdcf67c199a66d7b324f05698314
Author: KShivendu <shivendu@iitbhilai.ac.in>
Date:   Wed Jun 30 12:57:56 2021 +0000

    origin_search: Allow search for multiple licenses and languages at once

commit 0ca2683a9504b8b3c6fe595148bb63280d09bcd4
Author: KShivendu <shivendu@iitbhilai.ac.in>
Date:   Wed Jun 30 10:32:17 2021 +0000

    test_search: Add test for language and license

commit 0c4bcca6f3439527eb2a71776701a4c9989644cd
Author: KShivendu <shivendu@iitbhilai.ac.in>
Date:   Tue Jun 29 17:53:20 2021 +0000

    origin_search: Expose language and license from instrinsic_metadata

See https://jenkins.softwareheritage.org/job/DSEA/job/tests-on-diff/178/ for more details.

swh/search/elasticsearch.py
434–435

I can remove it. But that would add this (even for requests without license or programmingLanguage):

{
    "nested": {
        "path": "intrinsic_metadata",
        "query": {"bool": {"should": [],}}, 
    }
}

Which is unnecessary code that bloats the body of every request to ES.

swh/search/tests/test_search.py
575

I should move this to test_in_memory.py.

  • test_in_memory: Add test for _nested_get

Build is green

Patch application report for D5949 (id=21390)

Rebasing onto 6b1d563d42...

Current branch diff-target is up to date.
Changes applied before test
commit bfc8a77bdb6ba88c34f7b08161bb319cd5508d0f
Author: KShivendu <shivendu@iitbhilai.ac.in>
Date:   Thu Jul 1 11:46:44 2021 +0000

    test_in_memory: Add test for _nested_get

commit f5bfdb4f3838027b0bfcea347f8d9e6eed9a4fd2
Author: KShivendu <shivendu@iitbhilai.ac.in>
Date:   Wed Jun 30 21:09:21 2021 +0000

    in_memory: Use expanded instrinsic_metadata

commit 3831143886ca282e1498de849a6f801b00090bd4
Author: KShivendu <shivendu@iitbhilai.ac.in>
Date:   Wed Jun 30 19:01:52 2021 +0000

    test_search: Test for search on multiple instrinsic_metadata ields

commit feb9effff6fb359c86e590e6e984637bbe13de61
Author: KShivendu <shivendu@iitbhilai.ac.in>
Date:   Wed Jun 30 14:19:52 2021 +0000

    Use analyzer on the list of licenses and langauges

commit 2e2aefee0cbfbdcf67c199a66d7b324f05698314
Author: KShivendu <shivendu@iitbhilai.ac.in>
Date:   Wed Jun 30 12:57:56 2021 +0000

    origin_search: Allow search for multiple licenses and languages at once

commit 0ca2683a9504b8b3c6fe595148bb63280d09bcd4
Author: KShivendu <shivendu@iitbhilai.ac.in>
Date:   Wed Jun 30 10:32:17 2021 +0000

    test_search: Add test for language and license

commit 0c4bcca6f3439527eb2a71776701a4c9989644cd
Author: KShivendu <shivendu@iitbhilai.ac.in>
Date:   Tue Jun 29 17:53:20 2021 +0000

    origin_search: Expose language and license from instrinsic_metadata

See https://jenkins.softwareheritage.org/job/DSEA/job/tests-on-diff/179/ for more details.

swh/search/in_memory.py
46–59

Use the doctest syntax, not markdown: https://docs.python.org/3/library/doctest.html

157–162

why do you lower them?

And this will crash if metadata["license"] or metadata["programmingLanguage"] is a list or an object. (instead, you can transform *after* using the codemeta.expand() normalization)

Could you also add tests for what they are a list/object?

  • in_memory: Allow list of licenses and programmingLanguages

Build is green

Patch application report for D5949 (id=21414)

Rebasing onto 6b1d563d42...

Current branch diff-target is up to date.
Changes applied before test
commit d6320db10b591d049dc307803612766199ceb6a1
Author: KShivendu <shivendu@iitbhilai.ac.in>
Date:   Thu Jul 1 18:12:26 2021 +0000

    in_memory: Allow list of licenses and programmingLanguages

commit bfc8a77bdb6ba88c34f7b08161bb319cd5508d0f
Author: KShivendu <shivendu@iitbhilai.ac.in>
Date:   Thu Jul 1 11:46:44 2021 +0000

    test_in_memory: Add test for _nested_get

commit f5bfdb4f3838027b0bfcea347f8d9e6eed9a4fd2
Author: KShivendu <shivendu@iitbhilai.ac.in>
Date:   Wed Jun 30 21:09:21 2021 +0000

    in_memory: Use expanded instrinsic_metadata

commit 3831143886ca282e1498de849a6f801b00090bd4
Author: KShivendu <shivendu@iitbhilai.ac.in>
Date:   Wed Jun 30 19:01:52 2021 +0000

    test_search: Test for search on multiple instrinsic_metadata ields

commit feb9effff6fb359c86e590e6e984637bbe13de61
Author: KShivendu <shivendu@iitbhilai.ac.in>
Date:   Wed Jun 30 14:19:52 2021 +0000

    Use analyzer on the list of licenses and langauges

commit 2e2aefee0cbfbdcf67c199a66d7b324f05698314
Author: KShivendu <shivendu@iitbhilai.ac.in>
Date:   Wed Jun 30 12:57:56 2021 +0000

    origin_search: Allow search for multiple licenses and languages at once

commit 0ca2683a9504b8b3c6fe595148bb63280d09bcd4
Author: KShivendu <shivendu@iitbhilai.ac.in>
Date:   Wed Jun 30 10:32:17 2021 +0000

    test_search: Add test for language and license

commit 0c4bcca6f3439527eb2a71776701a4c9989644cd
Author: KShivendu <shivendu@iitbhilai.ac.in>
Date:   Tue Jun 29 17:53:20 2021 +0000

    origin_search: Expose language and license from instrinsic_metadata

See https://jenkins.softwareheritage.org/job/DSEA/job/tests-on-diff/180/ for more details.

swh/search/in_memory.py
157–162

why do you lower them?

To emulate the behaviour of analyzers in Elasticsearch.

you can transform *after* using the codemeta.expand() normalization

Ahh.. Perfect!

Could you also add tests for what they are a list/object?

I didn't understand. Please elaborate.

Thanks, and sorry for being so annoying, but JSON-LD is has a lot of hidden complexity.

swh/search/elasticsearch.py
472–480

perfect!

swh/search/in_memory.py
40–58

You should be able to skip the asserts like this ^

And add --doctest-modules in tox.ini so these tests are run (see other tox.ini as examples)

157–162

I didn't understand. Please elaborate.

origin_update accepts 1 strings, 2 list of strings, 3 objects, and 4 lists of objects, but your tests only check case 1

Well, I just checked again and it looks like you're testing case 2 as well.

I guess that's good enough since we're normalizing anyway; so forget it.

This revision is now accepted and ready to land.Jul 1 2021, 9:34 PM
  • Improve doctest for _nested_get
  • Squash

Build has FAILED

Patch application report for D5949 (id=21417)

Rebasing onto 6b1d563d42...

Current branch diff-target is up to date.
Changes applied before test
commit e134a624cb4be71970951d5536500355e14e0b92
Author: KShivendu <shivendu@iitbhilai.ac.in>
Date:   Tue Jun 29 17:53:20 2021 +0000

    origin_search: Filter for instrinsic_metadata language and license

Link to build: https://jenkins.softwareheritage.org/job/DSEA/job/tests-on-diff/181/
See console output for more information: https://jenkins.softwareheritage.org/job/DSEA/job/tests-on-diff/181/console

  • Fix failing doctest
  • Add commit description

Build is green

Patch application report for D5949 (id=21420)

Rebasing onto 6b1d563d42...

Current branch diff-target is up to date.
Changes applied before test
commit 2e1fb863871c82c5561b6a1cbd94c7064a1a392d
Author: KShivendu <shivendu@iitbhilai.ac.in>
Date:   Tue Jun 29 17:53:20 2021 +0000

    origin_search: Filter for instrinsic_metadata language and license
    
    Instrinsic_metadata often contains programmingLanguage and license fields
    which are very useful as search filters. These values can be used until
    license and language mined by swh.indexer aren't ready to use.

See https://jenkins.softwareheritage.org/job/DSEA/job/tests-on-diff/182/ for more details.

KShivendu retitled this revision from journal_client: Store language and license from instrinsic_metadata to origin_search: Filter for instrinsic_metadata language and license.Jul 2 2021, 10:31 AM