Instrinsic_metadata often contains programmingLanguage and license fields
which are very useful as search filters. These values can be used until
license and language mined by swh.indexer aren't ready to use.
Details
- Reviewers
vlorentz - Group Reviewers
Reviewers - Commits
- rDSEA2e1fb863871c: origin_search: Filter for instrinsic_metadata language and license
Diff Detail
- Repository
- rDSEA Archive search
- Branch
- intrinsic-metadata
- Lint
No Linters Available - Unit
No Unit Test Coverage - Build Status
Buildable 22356 Build 34809: Phabricator diff pipeline on jenkins Jenkins console · Jenkins Build 34808: arc lint + arc unit
Event Timeline
Build is green
Patch application report for D5949 (id=21359)
Rebasing onto 6b1d563d42...
Current branch diff-target is up to date.
Changes applied before test
commit 50567315ecd92439477b18875d2845a73b6147e3 Author: KShivendu <shivendu@iitbhilai.ac.in> Date: Tue Jun 29 17:53:20 2021 +0000 journal_client: Store language and license from instrinsic_metadata
See https://jenkins.softwareheritage.org/job/DSEA/job/tests-on-diff/173/ for more details.
You should use the fields in intrinsic_metadata instead of duplicating with new fields
@vlorentz, just in case you missed it, the values don't get duplicated. I'm popping them out of instrinsic_metdadata.
I was trying to avoid nested documents as I've read at many places that they slow down searches. (not sure by how much)
Think/research about it once. If you still want me to remove it, I'm totally fine with it. Just comment again.
But they are nested for a reason, being in the intrinsic_metadata field indicates their origin. Otherwise we would just move *all* the fields at the root instead of moving a couple fields and reindexing every time we want to use one.
If the only issue is the slowdown, can we keep them nested for now, and benchmark later to see if un-nesting is worth it?
If the only issue is the slowdown, can we keep them nested for now, and benchmark later to see if un-nesting is worth it?
Perfect ! I was thinking about asking for something like this :)
If the only issue is the slowdown, can we keep them nested for now, and benchmark later to see if un-nesting is worth it?
Perfect ! I was thinking about asking for something like this :)
+1.
More generally, don't try to compensate/optimize for something that could be.
Just implement the simplest approach and voice the concern about
possible caveats like the slowness here.
Then new tasks should eventually pop up about measuring performance.
- origin_search: Expose language and license from instrinsic_metadata
- test_search: Add test for language and license
Build is green
Patch application report for D5949 (id=21372)
Rebasing onto 6b1d563d42...
Current branch diff-target is up to date.
Changes applied before test
commit 0ca2683a9504b8b3c6fe595148bb63280d09bcd4 Author: KShivendu <shivendu@iitbhilai.ac.in> Date: Wed Jun 30 10:32:17 2021 +0000 test_search: Add test for language and license commit 0c4bcca6f3439527eb2a71776701a4c9989644cd Author: KShivendu <shivendu@iitbhilai.ac.in> Date: Tue Jun 29 17:53:20 2021 +0000 origin_search: Expose language and license from instrinsic_metadata
See https://jenkins.softwareheritage.org/job/DSEA/job/tests-on-diff/174/ for more details.
Thanks. A few more comments
swh/search/elasticsearch.py | ||
---|---|---|
433–465 | Can you deduplicate these two? | |
437 | What are the differences? | |
swh/search/in_memory.py | ||
325–335 | for consistent typing | |
swh/search/tests/test_search.py | ||
452–471 | license should be an URI | |
475–485 | This tests only checks either programming_language or license filtering works; because it would still succeed if either didn't filter anything. You should write separate tests for each. |
- origin_search: Allow search for multiple licenses and languages at once
- test_search: Separate tests for programming_language and license
Build has FAILED
Patch application report for D5949 (id=21373)
Rebasing onto 6b1d563d42...
Current branch diff-target is up to date.
Changes applied before test
commit 2e2aefee0cbfbdcf67c199a66d7b324f05698314 Author: KShivendu <shivendu@iitbhilai.ac.in> Date: Wed Jun 30 12:57:56 2021 +0000 origin_search: Allow search for multiple licenses and languages at once commit 0ca2683a9504b8b3c6fe595148bb63280d09bcd4 Author: KShivendu <shivendu@iitbhilai.ac.in> Date: Wed Jun 30 10:32:17 2021 +0000 test_search: Add test for language and license commit 0c4bcca6f3439527eb2a71776701a4c9989644cd Author: KShivendu <shivendu@iitbhilai.ac.in> Date: Tue Jun 29 17:53:20 2021 +0000 origin_search: Expose language and license from instrinsic_metadata
Link to build: https://jenkins.softwareheritage.org/job/DSEA/job/tests-on-diff/175/
See console output for more information: https://jenkins.softwareheritage.org/job/DSEA/job/tests-on-diff/175/console
swh/search/elasticsearch.py | ||
---|---|---|
437 | Afaik, "term" = Exact match For example, "term" should be case-sensitive but not "match". Also, "terms" = Exact match with elements of the given list | |
swh/search/in_memory.py | ||
325–335 | I didn't understand. Why [] ? I believe programmingLanguage and license are string, not lists ( atleast when taken from instrinsic_metadata ) Can you please elaborate ? | |
swh/search/tests/test_search.py | ||
452–471 | Okay. I'll change that. But, I "saved" https://pypi.org/project/Django/ in my self hosted swh instance. Its license field is "BSD-3-Clause" which isn't a URI. While https://github.com/Edyta2801/Ex_2_Webpage has license "https://spdx.org/licenses/ISC" So I think license isn't always a URI. | |
473–484 | This test passes for Elasticsearch if I use page = self.search.origin_search(url_pattern="foobar", license=["mit"]) .... page = self.search.origin_search(url_pattern="foobar", license=["bsd"]) ... page = self.search.origin_search(url_pattern="foobar", license=["mit", "clause"]) Because of this I'll have to apply the analyzer which is used for license field (which is of type text). This is one such way to apply the analyzer on the query but it's only available for terms Any suggestions ? | |
473–484 |
*only available for match |
Build is green
Patch application report for D5949 (id=21374)
Rebasing onto 6b1d563d42...
Current branch diff-target is up to date.
Changes applied before test
commit feb9effff6fb359c86e590e6e984637bbe13de61 Author: KShivendu <shivendu@iitbhilai.ac.in> Date: Wed Jun 30 14:19:52 2021 +0000 Use analyzer on the list of licenses and langauges commit 2e2aefee0cbfbdcf67c199a66d7b324f05698314 Author: KShivendu <shivendu@iitbhilai.ac.in> Date: Wed Jun 30 12:57:56 2021 +0000 origin_search: Allow search for multiple licenses and languages at once commit 0ca2683a9504b8b3c6fe595148bb63280d09bcd4 Author: KShivendu <shivendu@iitbhilai.ac.in> Date: Wed Jun 30 10:32:17 2021 +0000 test_search: Add test for language and license commit 0c4bcca6f3439527eb2a71776701a4c9989644cd Author: KShivendu <shivendu@iitbhilai.ac.in> Date: Tue Jun 29 17:53:20 2021 +0000 origin_search: Expose language and license from instrinsic_metadata
See https://jenkins.softwareheritage.org/job/DSEA/job/tests-on-diff/176/ for more details.
swh/search/tests/test_search.py | ||
---|---|---|
473–484 | I fixed this with "should" and "match". You may ignore this thread. |
Can you add a test with both programming_languages and licenses (and possibly an extra field that we already support), to make sure the search actually uses AND operators and not OR (ie. that it doesn't return results that match one one of the criteria)?
swh/search/elasticsearch.py | ||
---|---|---|
434–435 | do you need to make this conditional? | |
437–462 | no need for if here | |
464–471 | Is this an OR or an AND operator used here? (please write it in a comment for readers unfamiliar with ES) | |
swh/search/in_memory.py | ||
325–335 | When unnormalized, they can be either a string, a list of strings, an object with an @id attribute, or a list of such objects. swh/search/elasticsearch.py normalizes them to always be the latter; and swh/search/in_memory.py should do the same, but it looks like it does not. Could you fix this at the same time? | |
swh/search/tests/test_search.py | ||
452–471 | Then it's a bug in the indexer. According to https://www.w3.org/TR/json-ld/#node-objects :
|
Build is green
Patch application report for D5949 (id=21376)
Rebasing onto 6b1d563d42...
Current branch diff-target is up to date.
Changes applied before test
commit 3831143886ca282e1498de849a6f801b00090bd4 Author: KShivendu <shivendu@iitbhilai.ac.in> Date: Wed Jun 30 19:01:52 2021 +0000 test_search: Test for search on multiple instrinsic_metadata ields commit feb9effff6fb359c86e590e6e984637bbe13de61 Author: KShivendu <shivendu@iitbhilai.ac.in> Date: Wed Jun 30 14:19:52 2021 +0000 Use analyzer on the list of licenses and langauges commit 2e2aefee0cbfbdcf67c199a66d7b324f05698314 Author: KShivendu <shivendu@iitbhilai.ac.in> Date: Wed Jun 30 12:57:56 2021 +0000 origin_search: Allow search for multiple licenses and languages at once commit 0ca2683a9504b8b3c6fe595148bb63280d09bcd4 Author: KShivendu <shivendu@iitbhilai.ac.in> Date: Wed Jun 30 10:32:17 2021 +0000 test_search: Add test for language and license commit 0c4bcca6f3439527eb2a71776701a4c9989644cd Author: KShivendu <shivendu@iitbhilai.ac.in> Date: Tue Jun 29 17:53:20 2021 +0000 origin_search: Expose language and license from instrinsic_metadata
See https://jenkins.softwareheritage.org/job/DSEA/job/tests-on-diff/177/ for more details.
Build is green
Patch application report for D5949 (id=21377)
Rebasing onto 6b1d563d42...
Current branch diff-target is up to date.
Changes applied before test
commit f5bfdb4f3838027b0bfcea347f8d9e6eed9a4fd2 Author: KShivendu <shivendu@iitbhilai.ac.in> Date: Wed Jun 30 21:09:21 2021 +0000 in_memory: Use expanded instrinsic_metadata commit 3831143886ca282e1498de849a6f801b00090bd4 Author: KShivendu <shivendu@iitbhilai.ac.in> Date: Wed Jun 30 19:01:52 2021 +0000 test_search: Test for search on multiple instrinsic_metadata ields commit feb9effff6fb359c86e590e6e984637bbe13de61 Author: KShivendu <shivendu@iitbhilai.ac.in> Date: Wed Jun 30 14:19:52 2021 +0000 Use analyzer on the list of licenses and langauges commit 2e2aefee0cbfbdcf67c199a66d7b324f05698314 Author: KShivendu <shivendu@iitbhilai.ac.in> Date: Wed Jun 30 12:57:56 2021 +0000 origin_search: Allow search for multiple licenses and languages at once commit 0ca2683a9504b8b3c6fe595148bb63280d09bcd4 Author: KShivendu <shivendu@iitbhilai.ac.in> Date: Wed Jun 30 10:32:17 2021 +0000 test_search: Add test for language and license commit 0c4bcca6f3439527eb2a71776701a4c9989644cd Author: KShivendu <shivendu@iitbhilai.ac.in> Date: Tue Jun 29 17:53:20 2021 +0000 origin_search: Expose language and license from instrinsic_metadata
See https://jenkins.softwareheritage.org/job/DSEA/job/tests-on-diff/178/ for more details.
swh/search/elasticsearch.py | ||
---|---|---|
434–435 | I can remove it. But that would add this (even for requests without license or programmingLanguage): { "nested": { "path": "intrinsic_metadata", "query": {"bool": {"should": [],}}, } } Which is unnecessary code that bloats the body of every request to ES. | |
swh/search/tests/test_search.py | ||
576 | I should move this to test_in_memory.py. |
Build is green
Patch application report for D5949 (id=21390)
Rebasing onto 6b1d563d42...
Current branch diff-target is up to date.
Changes applied before test
commit bfc8a77bdb6ba88c34f7b08161bb319cd5508d0f Author: KShivendu <shivendu@iitbhilai.ac.in> Date: Thu Jul 1 11:46:44 2021 +0000 test_in_memory: Add test for _nested_get commit f5bfdb4f3838027b0bfcea347f8d9e6eed9a4fd2 Author: KShivendu <shivendu@iitbhilai.ac.in> Date: Wed Jun 30 21:09:21 2021 +0000 in_memory: Use expanded instrinsic_metadata commit 3831143886ca282e1498de849a6f801b00090bd4 Author: KShivendu <shivendu@iitbhilai.ac.in> Date: Wed Jun 30 19:01:52 2021 +0000 test_search: Test for search on multiple instrinsic_metadata ields commit feb9effff6fb359c86e590e6e984637bbe13de61 Author: KShivendu <shivendu@iitbhilai.ac.in> Date: Wed Jun 30 14:19:52 2021 +0000 Use analyzer on the list of licenses and langauges commit 2e2aefee0cbfbdcf67c199a66d7b324f05698314 Author: KShivendu <shivendu@iitbhilai.ac.in> Date: Wed Jun 30 12:57:56 2021 +0000 origin_search: Allow search for multiple licenses and languages at once commit 0ca2683a9504b8b3c6fe595148bb63280d09bcd4 Author: KShivendu <shivendu@iitbhilai.ac.in> Date: Wed Jun 30 10:32:17 2021 +0000 test_search: Add test for language and license commit 0c4bcca6f3439527eb2a71776701a4c9989644cd Author: KShivendu <shivendu@iitbhilai.ac.in> Date: Tue Jun 29 17:53:20 2021 +0000 origin_search: Expose language and license from instrinsic_metadata
See https://jenkins.softwareheritage.org/job/DSEA/job/tests-on-diff/179/ for more details.
swh/search/in_memory.py | ||
---|---|---|
47–60 | Use the doctest syntax, not markdown: https://docs.python.org/3/library/doctest.html | |
201–206 | why do you lower them? And this will crash if metadata["license"] or metadata["programmingLanguage"] is a list or an object. (instead, you can transform *after* using the codemeta.expand() normalization) Could you also add tests for what they are a list/object? |
Build is green
Patch application report for D5949 (id=21414)
Rebasing onto 6b1d563d42...
Current branch diff-target is up to date.
Changes applied before test
commit d6320db10b591d049dc307803612766199ceb6a1 Author: KShivendu <shivendu@iitbhilai.ac.in> Date: Thu Jul 1 18:12:26 2021 +0000 in_memory: Allow list of licenses and programmingLanguages commit bfc8a77bdb6ba88c34f7b08161bb319cd5508d0f Author: KShivendu <shivendu@iitbhilai.ac.in> Date: Thu Jul 1 11:46:44 2021 +0000 test_in_memory: Add test for _nested_get commit f5bfdb4f3838027b0bfcea347f8d9e6eed9a4fd2 Author: KShivendu <shivendu@iitbhilai.ac.in> Date: Wed Jun 30 21:09:21 2021 +0000 in_memory: Use expanded instrinsic_metadata commit 3831143886ca282e1498de849a6f801b00090bd4 Author: KShivendu <shivendu@iitbhilai.ac.in> Date: Wed Jun 30 19:01:52 2021 +0000 test_search: Test for search on multiple instrinsic_metadata ields commit feb9effff6fb359c86e590e6e984637bbe13de61 Author: KShivendu <shivendu@iitbhilai.ac.in> Date: Wed Jun 30 14:19:52 2021 +0000 Use analyzer on the list of licenses and langauges commit 2e2aefee0cbfbdcf67c199a66d7b324f05698314 Author: KShivendu <shivendu@iitbhilai.ac.in> Date: Wed Jun 30 12:57:56 2021 +0000 origin_search: Allow search for multiple licenses and languages at once commit 0ca2683a9504b8b3c6fe595148bb63280d09bcd4 Author: KShivendu <shivendu@iitbhilai.ac.in> Date: Wed Jun 30 10:32:17 2021 +0000 test_search: Add test for language and license commit 0c4bcca6f3439527eb2a71776701a4c9989644cd Author: KShivendu <shivendu@iitbhilai.ac.in> Date: Tue Jun 29 17:53:20 2021 +0000 origin_search: Expose language and license from instrinsic_metadata
See https://jenkins.softwareheritage.org/job/DSEA/job/tests-on-diff/180/ for more details.
swh/search/in_memory.py | ||
---|---|---|
201–206 |
To emulate the behaviour of analyzers in Elasticsearch.
Ahh.. Perfect!
I didn't understand. Please elaborate. |
Thanks, and sorry for being so annoying, but JSON-LD is has a lot of hidden complexity.
swh/search/elasticsearch.py | ||
---|---|---|
472–480 | perfect! | |
swh/search/in_memory.py | ||
41–59 | You should be able to skip the asserts like this ^ And add --doctest-modules in tox.ini so these tests are run (see other tox.ini as examples) | |
201–206 |
origin_update accepts 1 strings, 2 list of strings, 3 objects, and 4 lists of objects, but your tests only check case 1 Well, I just checked again and it looks like you're testing case 2 as well. I guess that's good enough since we're normalizing anyway; so forget it. |
Build has FAILED
Patch application report for D5949 (id=21417)
Rebasing onto 6b1d563d42...
Current branch diff-target is up to date.
Changes applied before test
commit e134a624cb4be71970951d5536500355e14e0b92 Author: KShivendu <shivendu@iitbhilai.ac.in> Date: Tue Jun 29 17:53:20 2021 +0000 origin_search: Filter for instrinsic_metadata language and license
Link to build: https://jenkins.softwareheritage.org/job/DSEA/job/tests-on-diff/181/
See console output for more information: https://jenkins.softwareheritage.org/job/DSEA/job/tests-on-diff/181/console
Build is green
Patch application report for D5949 (id=21420)
Rebasing onto 6b1d563d42...
Current branch diff-target is up to date.
Changes applied before test
commit 2e1fb863871c82c5561b6a1cbd94c7064a1a392d Author: KShivendu <shivendu@iitbhilai.ac.in> Date: Tue Jun 29 17:53:20 2021 +0000 origin_search: Filter for instrinsic_metadata language and license Instrinsic_metadata often contains programmingLanguage and license fields which are very useful as search filters. These values can be used until license and language mined by swh.indexer aren't ready to use.
See https://jenkins.softwareheritage.org/job/DSEA/job/tests-on-diff/182/ for more details.
Why this change? I don't think it will play well with a non-remote storage, because they return iterators.