intrinsic_metadata often contains date_{created,modified,published} which
can be used as sorting options as well as filters.
Details
- Reviewers
vlorentz - Group Reviewers
Reviewers - Commits
- rDSEAfe7640f71024: origin_search: Filters and sorting for date_{created,modified,published}
Diff Detail
- Repository
- rDSEA Archive search
- Branch
- metadata-description
- Lint
No Linters Available - Unit
No Unit Test Coverage - Build Status
Buildable 22478 Build 35014: Phabricator diff pipeline on jenkins Jenkins console · Jenkins Build 35013: arc lint + arc unit
Event Timeline
Build is green
Patch application report for D5964 (id=21453)
Rebasing onto 2e1fb86387...
Current branch diff-target is up to date.
Changes applied before test
commit 85dd6f27068962df6330a87239989e1c82f25d18 Author: KShivendu <shivendu@iitbhilai.ac.in> Date: Fri Jul 2 15:15:04 2021 +0000 origin_search: Filters and sorting for date_{created,modified,published}
See https://jenkins.softwareheritage.org/job/DSEA/job/tests-on-diff/186/ for more details.
Can you either add tests, or deduplicate this code so we don't need to test every field?
swh/search/in_memory.py | ||
---|---|---|
384–386 | Why is this one different from the others? | |
swh/search/interface.py | ||
27–55 | Can you move this to a new module, instead of the "interface"? | |
73–80 | er, sorry for not noticing sooner, but these should be datetime.datetime, not str. Can you change this in the next diff? |
- Move get_expansion to utils.py
- Add tests filters as well as sorting options
- Polish existing code
Build has FAILED
Patch application report for D5964 (id=21499)
Rebasing onto f378a989e9...
Current branch diff-target is up to date.
Changes applied before test
commit 31b0f67cc99e49d0c398960ed93ac4d9c5134931 Author: KShivendu <shivendu@iitbhilai.ac.in> Date: Mon Jul 5 16:17:23 2021 +0000 origin_search: Filters and sorting for date_{created,modified,published}
Link to build: https://jenkins.softwareheritage.org/job/DSEA/job/tests-on-diff/190/
See console output for more information: https://jenkins.softwareheritage.org/job/DSEA/job/tests-on-diff/190/console
By "deduplicate this code so we don't need to test every field" you mean common code and tests for all the date fields (last_visit, last_release, ... datePublished, dateModified, ...) ?
swh/search/elasticsearch.py | ||
---|---|---|
186–198 | Afaik, It's important to define date type for date{Published,Modified,Created} for filters/sorting options to work since we've used "date_detection": False, (Automatic date detection won't work) But, If I use uncomment these lines of code, most of the test_elasticsearch.py tests start to fail. The error that gets thrown in that case looks like : https://forge.softwareheritage.org/P1089. | |
swh/search/in_memory.py | ||
384–386 | Fixed | |
swh/search/interface.py | ||
27–55 | Moved to utils.py ? | |
73–80 | Sure. |
swh/search/elasticsearch.py | ||
---|---|---|
186–198 | Actually, all the fields in the nested document need to be strings. Some tests will need to be adapted (test_origin_intrinsic_metadata_string_mapping for example) as they are trying to set random text on the dateCreated field and it will fail with the new mapping. |
- elasticsearch.py: Use "linient: true"
- origin_search: Validate intrinsic_metadata date field format before storing
- test_search: Fix failing tests
Build is green
Patch application report for D5964 (id=21505)
Rebasing onto f378a989e9...
Current branch diff-target is up to date.
Changes applied before test
commit 976c7229d2221cbdc82517ab7a24d121ad0ced62 Author: KShivendu <shivendu@iitbhilai.ac.in> Date: Tue Jul 6 09:44:27 2021 +0000 origin_search: Validate intrinsic_metadata date field format before storing commit 31b0f67cc99e49d0c398960ed93ac4d9c5134931 Author: KShivendu <shivendu@iitbhilai.ac.in> Date: Mon Jul 5 16:17:23 2021 +0000 origin_search: Filters and sorting for date_{created,modified,published}
See https://jenkins.softwareheritage.org/job/DSEA/job/tests-on-diff/191/ for more details.
swh/search/elasticsearch.py | ||
---|---|---|
69–79 | Can you document this new behavior in a comment? |
Build is green
Patch application report for D5964 (id=21531)
Rebasing onto f378a989e9...
Current branch diff-target is up to date.
Changes applied before test
commit dc52c981fb953add9be87f464a0921ea6601bc02 Author: KShivendu <shivendu@iitbhilai.ac.in> Date: Wed Jul 7 11:27:35 2021 +0000 origin_update: Document rejection of metadata date fields if not parsable commit 976c7229d2221cbdc82517ab7a24d121ad0ced62 Author: KShivendu <shivendu@iitbhilai.ac.in> Date: Tue Jul 6 09:44:27 2021 +0000 origin_search: Validate intrinsic_metadata date field format before storing commit 31b0f67cc99e49d0c398960ed93ac4d9c5134931 Author: KShivendu <shivendu@iitbhilai.ac.in> Date: Mon Jul 5 16:17:23 2021 +0000 origin_search: Filters and sorting for date_{created,modified,published}
See https://jenkins.softwareheritage.org/job/DSEA/job/tests-on-diff/192/ for more details.
swh/search/tests/test_search.py | ||
---|---|---|
989 | I don't understand. Are we rejecting the *entire* document because one of its values was badly formatted? | |
swh/search/utils.py | ||
1 | copyright header | |
52 | this does not validate it is in "the standard ISO format": https://docs.python.org/3/library/datetime.html#datetime.datetime.fromisoformat You should use the iso8601 library instead. |
swh/search/tests/test_search.py | ||
---|---|---|
989 | No. We just reject that field (pop it out of intrinsic_metadata before expanding and storing the intrinsic_metadata) Also, this rejection this is only for date{Created,Modified,Published} | |
swh/search/utils.py | ||
52 | Ohh. I see. Thoughts ? |
swh/search/tests/test_search.py | ||
---|---|---|
989 | nvm, I misunderstood the test |
swh/search/in_memory.py | ||
---|---|---|
128–130 | It looks like you don't have a test for this; can you add one? (also, could you deduplicate it?) | |
swh/search/utils.py | ||
52 | I would rather use a formally defined format, so we don't depend on ElasticSearch-specific format (in case we want to use something else in the future) |
- Add test for sort_by : ["date_created"]
- Deduplicate calculation of some variables in _get_sorting_key
- Use iso8601 library to validate date format in instrinsic_metadata fields
Build is green
Patch application report for D5964 (id=21612)
Rebasing onto f378a989e9...
Current branch diff-target is up to date.
Changes applied before test
commit fe7640f71024084554ab4d36209f6da5d1c76267 Author: KShivendu <shivendu@iitbhilai.ac.in> Date: Tue Jul 13 14:59:53 2021 +0530 origin_search: Filters and sorting for date_{created,modified,published} intrinsic_metadata often contains date_{created,modified,published} which can be used as sorting options as well as filters.
See https://jenkins.softwareheritage.org/job/DSEA/job/tests-on-diff/193/ for more details.