Details
- Reviewers
- anlambert
- Group Reviewers
- Reviewers
- Maniphest Tasks
- T4654: swh-indexer produces dates not supported by swh-search/ElasticSearch
- Commits
- rDCIDX3bad41489c4b: codemeta: Fix malformed dates that used to be allowed by the deposit
Diff Detail
- Repository
- rDCIDX Metadata indexer
- Branch
- namespaces
- Lint
- No Linters Available
- Unit
- No Unit Test Coverage
- Build Status
- Buildable 32598
- Build 51066: Phabricator diff pipeline on jenkins Jenkins console · Jenkins
- Build 51065: arc lint + arc unit
Event Timeline
Build is green
Patch application report for D8779 (id=31645)
Could not rebase; Attempt merge onto a51cbf3965...
Updating a51cbf3..3bad414
Fast-forward
mypy.ini | 3 ++
requirements.txt | 1 +
swh/indexer/metadata_dictionary/base.py | 25 +++++++------
swh/indexer/metadata_dictionary/cff.py | 7 +++-
swh/indexer/metadata_dictionary/codemeta.py | 32 +++++++++++------
swh/indexer/metadata_dictionary/github.py | 13 ++++---
swh/indexer/metadata_dictionary/maven.py | 11 +++---
swh/indexer/metadata_dictionary/npm.py | 16 ++-------
swh/indexer/metadata_dictionary/nuget.py | 4 +--
swh/indexer/metadata_dictionary/utils.py | 42 +++++++++++++++++++++-
.../tests/metadata_dictionary/test_codemeta.py | 33 +++++++++++++++--
swh/indexer/tests/metadata_dictionary/test_npm.py | 11 ++++++
12 files changed, 144 insertions(+), 54 deletions(-)
Changes applied before test
commit 3bad41489c4b5412fbf250d7dd53c3b188956f65
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date: Wed Oct 26 14:19:26 2022 +0200
codemeta: Fix malformed dates that used to be allowed by the deposit

commit c0052f8e48fa4cf2c0034c48d2e66355558af62a
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date: Wed Oct 26 14:08:33 2022 +0200
codemeta: Fix incorrect output namespace for dates and URLs

Codemeta reexports schema:url, schema:dateCreated, ... with `"@type": "@id"` and `"type": "schema:Date"` so that
```
{
    "@context": "https://doi.org/10.5063/schema/codemeta-2.0",
    "url": "http://example.org",
    "dateCreated": "2022-10-26"
}
```
expands to:
```
{
    "http://schema.org/url": {
        "@type": "@id",
        "@value": "http://example.org"
    },
    "dateCreated": {
        "@type": "http://schema.org/Date",
        "@value": "2022-10-26"
    }
}
```
However, our translation tried to translate directly to a partially expanded form, like this:
```
{
    "@context": "https://doi.org/10.5063/schema/codemeta-2.0",
    "url": {
        "@value": "http://example.org"
    },
    "dateCreated": {
        "@value": "2022-10-26"
    }
}
```
which prevents the compaction and expansion algorithms from adding a type themselves, causing the document to be compacted to:
```
{
    "@context": "https://doi.org/10.5063/schema/codemeta-2.0",
    "schema:url": "http://example.org",
    "schema:dateCreated": "2022-10-26"
}
```
or expanded to:
```
{
    "http://schema.org/url": {
        "@value": "http://example.org"
    },
    "http://schema.org/dateCreated": {
        "@value": "2022-10-26"
    }
}
```
which are not what we want.

This commit replaces the hack for `@type` with the right solution that works for all properties.

commit a66d5b240ab77e6d8d1b9accf43d571489a3f7f0
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date: Tue Oct 25 16:02:16 2022 +0200
metadata_dictionary: Systematically check input URLs before adding to graph

This is hopefully the definitive workaround for the PyLD issue.

See https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/521/ for more details.
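
The type-coercion problem described in the second commit message above can be illustrated with a toy model. This is a simplified sketch and not PyLD's implementation or the indexer's code: `expand_value` and `DATE_CREATED_TERM` are hypothetical names, and the model covers only a single date-typed term. It assumes, as the commit message describes, that a term's declared type is applied to plain string values only, and that a value already wrapped in `{"@value": ...}` is left untouched.

```python
from typing import Any, Dict, Union

# Hypothetical term definition, modelled on the commit message's description
# of how the Codemeta context types schema:dateCreated.
DATE_CREATED_TERM: Dict[str, str] = {
    "@id": "http://schema.org/dateCreated",
    "@type": "http://schema.org/Date",
}


def expand_value(
    term_definition: Dict[str, str], value: Union[str, Dict[str, Any]]
) -> Dict[str, Any]:
    """Toy model of value expansion for one term (not a JSON-LD processor)."""
    if isinstance(value, dict):
        # Already a value object: the term's declared type is NOT added.
        return dict(value)
    # Plain string: wrap it and apply the type declared by the context.
    expanded: Dict[str, Any] = {"@value": value}
    if "@type" in term_definition:
        expanded["@type"] = term_definition["@type"]
    return expanded


plain = expand_value(DATE_CREATED_TERM, "2022-10-26")
pre_wrapped = expand_value(DATE_CREATED_TERM, {"@value": "2022-10-26"})
print(plain)        # typed value
print(pre_wrapped)  # type is missing
```

In this model, emitting the plain string lets the context supply the type, while emitting the pre-wrapped form loses it, which matches the untyped output the commit message calls out.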
LGTM
swh/indexer/metadata_dictionary/codemeta.py | ||
---|---|---|
97 | Maybe add an extra condition on string length to avoid useless reformatting? `and len(json_child) < 10` |
swh/indexer/metadata_dictionary/codemeta.py | ||
---|---|---|
97 | I don't think it matters; reformatting is fast. `In [4]: %timeit iso8601.parse_date("2022-10-26").date().isoformat()` reports `4.56 µs ± 47.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)` |
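
For readers following the exchange above, here is a minimal sketch of the kind of date reformatting being discussed. It is not the code under review: the diff uses the third-party `iso8601` package, whereas this sketch uses only the standard library, and `normalize_date` is a hypothetical helper. It also assumes the malformed dates in question are ones with non-zero-padded month or day, such as `2022-1-2`, which is an assumption and not something the thread states.

```python
import datetime
import re
from typing import Optional

# Year, month, day separated by dashes; month and day may lack zero padding.
_LOOSE_DATE_RE = re.compile(r"(\d{4})-(\d{1,2})-(\d{1,2})")


def normalize_date(value: str) -> Optional[str]:
    """Return value as a zero-padded YYYY-MM-DD string, or None if it is
    not a valid calendar date in that loose format (hypothetical helper)."""
    match = _LOOSE_DATE_RE.fullmatch(value.strip())
    if match is None:
        return None
    year, month, day = (int(group) for group in match.groups())
    try:
        # datetime.date validates ranges (e.g. rejects month 13).
        return datetime.date(year, month, day).isoformat()
    except ValueError:
        return None


print(normalize_date("2022-1-2"))
print(normalize_date("2022-10-26"))
print(normalize_date("2022-13-40"))
```

A well-formed date passes through unchanged, so an extra length check like the one suggested would only save the cost of one parse, which the timing above puts at a few microseconds.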