Page MenuHomeSoftware Heritage

origin_search: Filters and sorting for date_{created,modified,published}
ClosedPublic

Authored by KShivendu on Jul 2 2021, 5:17 PM.

Details

Summary

intrinsic_metadata often contains date_{created,modified,published} which
can be used as sorting options as well as filters.

Diff Detail

Repository
rDSEA Archive search
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

Build is green

Patch application report for D5964 (id=21453)

Rebasing onto 2e1fb86387...

Current branch diff-target is up to date.
Changes applied before test
commit 85dd6f27068962df6330a87239989e1c82f25d18
Author: KShivendu <shivendu@iitbhilai.ac.in>
Date:   Fri Jul 2 15:15:04 2021 +0000

    origin_search: Filters and sorting for date_{created,modified,published}

See https://jenkins.softwareheritage.org/job/DSEA/job/tests-on-diff/186/ for more details.

Can you either add tests, or deduplicate this code so we don't need to test every field?

swh/search/in_memory.py
385–387

Why is this one different from the others?

swh/search/interface.py
27–55

Can you move this to a new module, instead of the "interface"?

73–80

er, sorry for not noticing sooner, but these should be datetime.datetime, not str. Can you change this in the next diff?

KShivendu marked an inline comment as done.
  • Move get_expansion to utils.py
  • Add tests filters as well as sorting options
  • Polish existing code

Build has FAILED

Patch application report for D5964 (id=21499)

Rebasing onto f378a989e9...

Current branch diff-target is up to date.
Changes applied before test
commit 31b0f67cc99e49d0c398960ed93ac4d9c5134931
Author: KShivendu <shivendu@iitbhilai.ac.in>
Date:   Mon Jul 5 16:17:23 2021 +0000

    origin_search: Filters and sorting for date_{created,modified,published}

Link to build: https://jenkins.softwareheritage.org/job/DSEA/job/tests-on-diff/190/
See console output for more information: https://jenkins.softwareheritage.org/job/DSEA/job/tests-on-diff/190/console

Can you either add tests, or deduplicate this code so we don't need to test every field?

By "deduplicate this code so we don't need to test every field" you mean common code and tests for all the date fields (last_visit, last_release, ... datePublished, dateModified, ...) ?

swh/search/elasticsearch.py
186–198

Afaik, It's important to define date type for date{Published,Modified,Created} for filters/sorting options to work since we've used "date_detection": False, (Automatic date detection won't work)

But, If I use uncomment these lines of code, most of the test_elasticsearch.py tests start to fail.

The error that gets thrown in that case looks like : https://forge.softwareheritage.org/P1089.

swh/search/in_memory.py
385–387

Fixed

swh/search/interface.py
27–55

Moved to utils.py ?

73–80

Sure.

vsellier added inline comments.
swh/search/elasticsearch.py
186–198

Actually, all the fields in the nested document need to be strings.
It seems it's possible to configure the mapping as you want and to add the lenient[1] parameter to the search query.

Some tests will need to be adapted (test_origin_intrinsic_metadata_string_mapping for example) as they are trying to set random text on the dateCreated field and it will fail with the new mapping.

[1] https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-match-query.html#match-field-params

KShivendu marked an inline comment as done.
  • elasticsearch.py: Use "linient: true"
  • origin_search: Validate intrinsic_metadata date field format before storing
  • test_search: Fix failing tests

Build is green

Patch application report for D5964 (id=21505)

Rebasing onto f378a989e9...

Current branch diff-target is up to date.
Changes applied before test
commit 976c7229d2221cbdc82517ab7a24d121ad0ced62
Author: KShivendu <shivendu@iitbhilai.ac.in>
Date:   Tue Jul 6 09:44:27 2021 +0000

    origin_search: Validate intrinsic_metadata date field format before storing

commit 31b0f67cc99e49d0c398960ed93ac4d9c5134931
Author: KShivendu <shivendu@iitbhilai.ac.in>
Date:   Mon Jul 5 16:17:23 2021 +0000

    origin_search: Filters and sorting for date_{created,modified,published}

See https://jenkins.softwareheritage.org/job/DSEA/job/tests-on-diff/191/ for more details.

swh/search/elasticsearch.py
69–80

Can you document this new behavior in a comment?

  • origin_update: Document rejection of metadata date fields if not parsable

Build is green

Patch application report for D5964 (id=21531)

Rebasing onto f378a989e9...

Current branch diff-target is up to date.
Changes applied before test
commit dc52c981fb953add9be87f464a0921ea6601bc02
Author: KShivendu <shivendu@iitbhilai.ac.in>
Date:   Wed Jul 7 11:27:35 2021 +0000

    origin_update: Document rejection of metadata date fields if not parsable

commit 976c7229d2221cbdc82517ab7a24d121ad0ced62
Author: KShivendu <shivendu@iitbhilai.ac.in>
Date:   Tue Jul 6 09:44:27 2021 +0000

    origin_search: Validate intrinsic_metadata date field format before storing

commit 31b0f67cc99e49d0c398960ed93ac4d9c5134931
Author: KShivendu <shivendu@iitbhilai.ac.in>
Date:   Mon Jul 5 16:17:23 2021 +0000

    origin_search: Filters and sorting for date_{created,modified,published}

See https://jenkins.softwareheritage.org/job/DSEA/job/tests-on-diff/192/ for more details.

swh/search/tests/test_search.py
992

I don't understand. Are we rejecting the *entire* document because one of its values was badly formatted?

swh/search/utils.py
1

copyright header

52

this does not validate it is in "the standard ISO format": https://docs.python.org/3/library/datetime.html#datetime.datetime.fromisoformat

You should use the iso8601 library instead.

swh/search/tests/test_search.py
992

No. We just reject that field (pop it out of intrinsic_metadata before expanding and storing the intrinsic_metadata)

Also, this rejection this is only for date{Created,Modified,Published}

swh/search/utils.py
52

Ohh. I see.
But, now I'm thinking that it's better if it can parse strings which don't exactly follow ISO ( as long as they are valid dates and can be parsed by Elasticsearch without errors )
This will allow more origins to be searchable through date fields.

Thoughts ?

swh/search/tests/test_search.py
992

nvm, I misunderstood the test

swh/search/in_memory.py
129–131

It looks like you don't have a test for this; can you add one?

(also, could you deduplicate it?)

swh/search/utils.py
52

I would rather use a formally defined format, so we don't depend on ElasticSearch-specific format (in case we want to use something else in the future)

  • Add test for sort_by : ["date_created"]
  • Deduplicate calculation of some variables in _get_sorting_key
  • Use iso8601 library to validate date format in instrinsic_metadata fields

Build is green

Patch application report for D5964 (id=21612)

Rebasing onto f378a989e9...

Current branch diff-target is up to date.
Changes applied before test
commit fe7640f71024084554ab4d36209f6da5d1c76267
Author: KShivendu <shivendu@iitbhilai.ac.in>
Date:   Tue Jul 13 14:59:53 2021 +0530

    origin_search: Filters and sorting for date_{created,modified,published}
    
    intrinsic_metadata often contains date_{created,modified,published} which
    can be used as sorting options as well as filters.

See https://jenkins.softwareheritage.org/job/DSEA/job/tests-on-diff/193/ for more details.

This revision is now accepted and ready to land.Jul 13 2021, 12:03 PM