Details
- Reviewers
  - anlambert
- Group Reviewers
  - Reviewers
- Maniphest Tasks
  - T4654: swh-indexer produces dates not supported by swh-search/ElasticSearch
- Commits
  - rDCIDX3bad41489c4b: codemeta: Fix malformed dates that used to be allowed by the deposit
Diff Detail
- Repository
  - rDCIDX Metadata indexer
- Lint
  - Automatic diff as part of commit; lint not applicable.
- Unit
  - Automatic diff as part of commit; unit tests not applicable.
Event Timeline
Build is green
Patch application report for D8779 (id=31645)
Could not rebase; Attempt merge onto a51cbf3965...
Updating a51cbf3..3bad414
Fast-forward
mypy.ini | 3 ++
requirements.txt | 1 +
swh/indexer/metadata_dictionary/base.py | 25 +++++++------
swh/indexer/metadata_dictionary/cff.py | 7 +++-
swh/indexer/metadata_dictionary/codemeta.py | 32 +++++++++++------
swh/indexer/metadata_dictionary/github.py | 13 ++++---
swh/indexer/metadata_dictionary/maven.py | 11 +++---
swh/indexer/metadata_dictionary/npm.py | 16 ++-------
swh/indexer/metadata_dictionary/nuget.py | 4 +--
swh/indexer/metadata_dictionary/utils.py | 42 +++++++++++++++++++++-
.../tests/metadata_dictionary/test_codemeta.py | 33 +++++++++++++++--
swh/indexer/tests/metadata_dictionary/test_npm.py | 11 ++++++
12 files changed, 144 insertions(+), 54 deletions(-)
Changes applied before test
commit 3bad41489c4b5412fbf250d7dd53c3b188956f65
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date: Wed Oct 26 14:19:26 2022 +0200
codemeta: Fix malformed dates that used to be allowed by the deposit
commit c0052f8e48fa4cf2c0034c48d2e66355558af62a
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date: Wed Oct 26 14:08:33 2022 +0200
codemeta: Fix incorrect output namespace for dates and URLs
Codemeta reexports schema:url, schema:dateCreated, ... with
`"@type": "@id"` and `"@type": "schema:Date"` so that:
```
{
"@context": "https://doi.org/10.5063/schema/codemeta-2.0",
"url": "http://example.org",
"dateCreated": "2022-10-26"
}
```
expands to:
```
{
"http://schema.org/url": {
"@type": "@id",
"@value": "http://example.org"
},
"http://schema.org/dateCreated": {
"@type": "http://schema.org/Date",
"@value": "2022-10-26"
}
}
```
However, our translation tried to translate directly to a partially expanded
form, like this:
```
{
"@context": "https://doi.org/10.5063/schema/codemeta-2.0",
"url": {
"@value": "http://example.org"
},
"dateCreated": {
"@value": "2022-10-26"
}
}
```
which prevents the compaction and expansion algorithms from adding a
type themselves, causing the document to be compacted to:
```
{
"@context": "https://doi.org/10.5063/schema/codemeta-2.0",
"schema:url": "http://example.org",
"schema:dateCreated": "2022-10-26"
}
```
or expanded to:
```
{
"http://schema.org/url": {
"@value": "http://example.org"
},
"http://schema.org/dateCreated": {
"@value": "2022-10-26"
}
}
```
which are not what we want.
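The difference between the two input forms above can be reproduced with a toy model of type coercion. This is only a sketch under stated assumptions: `CONTEXT`, `expand_term` and `expand` are hypothetical names, the context is a two-term stand-in for the real Codemeta context, and the code is not the JSON-LD expansion algorithm (no lists, nesting, or remote context loading). It only illustrates that a type declared in the context is applied to plain strings and not to values already wrapped in `{"@value": ...}`.

```python
from typing import Any, Dict, Tuple

# Hypothetical, heavily simplified stand-in for the two term definitions
# discussed above; the real Codemeta context is much larger.
CONTEXT: Dict[str, Dict[str, str]] = {
    "url": {"@id": "http://schema.org/url", "@type": "@id"},
    "dateCreated": {
        "@id": "http://schema.org/dateCreated",
        "@type": "http://schema.org/Date",
    },
}


def expand_term(term: str, value: Any) -> Tuple[str, Any]:
    """Expand one key/value pair, coercing plain (non-dict) values only."""
    definition = CONTEXT[term]
    iri = definition["@id"]
    if isinstance(value, dict):
        # Already a value object such as {"@value": ...}: it is kept as-is,
        # so the type declared in the context is never added.
        return iri, value
    if definition["@type"] == "@id":
        # The string is interpreted as an IRI reference; JSON-LD processors
        # represent such values as {"@id": ...}.
        return iri, {"@id": value}
    return iri, {"@type": definition["@type"], "@value": value}


def expand(document: Dict[str, Any]) -> Dict[str, Any]:
    """Toy expansion of a flat document (no lists, nesting, or @context key)."""
    return dict(expand_term(term, value) for term, value in document.items())


# A plain string picks up the declared type ...
print(expand({"dateCreated": "2022-10-26"}))
# ... while a pre-wrapped value object does not.
print(expand({"dateCreated": {"@value": "2022-10-26"}}))
```

Emitting plain strings and leaving the wrapping to the processor is what lets the declared type be applied.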
This commit replaces the hack for `@type` with the right solution that
works for all properties.
commit a66d5b240ab77e6d8d1b9accf43d571489a3f7f0
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date: Tue Oct 25 16:02:16 2022 +0200
metadata_dictionary: Systematically check input URLs before adding to graph
This is hopefully the definitive workaround for the PyLD issue.
See https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/521/ for more details.
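The oldest commit listed above checks input URLs before they are added to the graph. The actual check is not shown on this page, so the following is only a minimal sketch of what such a guard could look like, using the standard library's `urllib.parse`. The function name `checked_url` and the exact acceptance rules (http/https only, non-empty host, no whitespace) are assumptions made for illustration.

```python
from typing import Optional
from urllib.parse import urlparse


def checked_url(value: object) -> Optional[str]:
    """Return the URL if it looks usable, else None (hypothetical helper)."""
    if not isinstance(value, str):
        return None
    value = value.strip()
    # Assumption: reject empty strings and anything with embedded whitespace.
    if not value or any(char.isspace() for char in value):
        return None
    try:
        parsed = urlparse(value)
    except ValueError:
        # urlparse raises ValueError on some malformed inputs,
        # for example an unterminated IPv6 literal.
        return None
    # Assumption: only absolute http(s) URLs with a host are accepted.
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        return None
    return value


print(checked_url("http://example.org"))
print(checked_url("not a url"))
```

Returning `None` instead of raising lets a caller skip the property and keep translating the rest of the document.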
LGTM
| swh/indexer/metadata_dictionary/codemeta.py | |
|---|---|
| 96 | Maybe add an extra condition on string length to avoid useless reformatting? `and len(json_child) < 10` |
| swh/indexer/metadata_dictionary/codemeta.py | |
|---|---|
| 96 | I don't think it matters, reformatting is fast: `In [4]: %timeit iso8601.parse_date("2022-10-26").date().isoformat()` reports `4.56 µs ± 47.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)` |
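
The reformatting discussed in this thread amounts to parsing a date string and re-emitting only its date part. A minimal sketch follows. It deliberately swaps the `iso8601` package used in the timing above for the standard library's `datetime.datetime.fromisoformat`, so that it runs without extra dependencies; that parser accepts fewer input variants, and the helper name `normalize_date` is hypothetical rather than the indexer's actual code.

```python
import datetime
from typing import Optional


def normalize_date(value: str) -> Optional[str]:
    """Return `value` as YYYY-MM-DD, or None if it cannot be parsed.

    Hypothetical helper: stands in for
    iso8601.parse_date(value).date().isoformat() using only the stdlib.
    """
    try:
        parsed = datetime.datetime.fromisoformat(value.strip())
    except ValueError:
        return None
    # Drop the time and timezone, keeping only the calendar date.
    return parsed.date().isoformat()


print(normalize_date("2022-10-26"))
print(normalize_date("2022-10-26T14:19:26+02:00"))
print(normalize_date("not a date"))
```

An already well-formed date passes through unchanged, which is why reformatting it unconditionally is harmless apart from the few microseconds measured above.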