
codemeta: Fix malformed dates that used to be allowed by the deposit
Closed, Public

Authored by vlorentz on Oct 26 2022, 2:20 PM.

Diff Detail

Repository
rDCIDX Metadata indexer
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

Build is green

Patch application report for D8779 (id=31645)

Could not rebase; Attempt merge onto a51cbf3965...

Updating a51cbf3..3bad414
Fast-forward
 mypy.ini                                           |  3 ++
 requirements.txt                                   |  1 +
 swh/indexer/metadata_dictionary/base.py            | 25 +++++++------
 swh/indexer/metadata_dictionary/cff.py             |  7 +++-
 swh/indexer/metadata_dictionary/codemeta.py        | 32 +++++++++++------
 swh/indexer/metadata_dictionary/github.py          | 13 ++++---
 swh/indexer/metadata_dictionary/maven.py           | 11 +++---
 swh/indexer/metadata_dictionary/npm.py             | 16 ++-------
 swh/indexer/metadata_dictionary/nuget.py           |  4 +--
 swh/indexer/metadata_dictionary/utils.py           | 42 +++++++++++++++++++++-
 .../tests/metadata_dictionary/test_codemeta.py     | 33 +++++++++++++++--
 swh/indexer/tests/metadata_dictionary/test_npm.py  | 11 ++++++
 12 files changed, 144 insertions(+), 54 deletions(-)
Changes applied before test
commit 3bad41489c4b5412fbf250d7dd53c3b188956f65
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Wed Oct 26 14:19:26 2022 +0200

    codemeta: Fix malformed dates that used to be allowed by the deposit

commit c0052f8e48fa4cf2c0034c48d2e66355558af62a
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Wed Oct 26 14:08:33 2022 +0200

    codemeta: Fix incorrect output namespace for dates and URLs
    
    Codemeta reexports schema:url, schema:dateCreated, ... with
    `"@type": "@id"` and `"@type": "schema:Date"` so that

```
{
    "@context": "https://doi.org/10.5063/schema/codemeta-2.0",
    "url": "http://example.org",
    "dateCreated": "2022-10-26"
}
```

expands to:

```
{
    "http://schema.org/url": {
        "@type": "@id",
        "@value": "http://example.org"
    },
    "http://schema.org/dateCreated": {
        "@type": "http://schema.org/Date",
        "@value": "2022-10-26"
    }
}
```

However, our translation tried to translate directly to a partially expanded
form, like this:

```
{
    "@context": "https://doi.org/10.5063/schema/codemeta-2.0",
    "url": {
        "@value": "http://example.org"
    },
    "dateCreated": {
        "@value": "2022-10-26"
    }
}
```

which prevents the compaction and expansion algorithms from adding a
type themselves, causing the document to be compacted to:

```
{
    "@context": "https://doi.org/10.5063/schema/codemeta-2.0",
    "schema:url": "http://example.org",
    "schema:dateCreated": "2022-10-26"
}
```

or expanded to:

```
{
    "http://schema.org/url": {
        "@value": "http://example.org"
    },
    "http://schema.org/dateCreated": {
        "@value": "2022-10-26"
    }
}
```

which are not what we want.

This commit replaces the hack for `@type` with the right solution that
works for all properties.
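
The effect described in this commit message can be illustrated with a toy model of type coercion. This is only a sketch, not the JSON-LD expansion algorithm and not the code in this diff: `CONTEXT` is an inline stand-in for the codemeta context, trimmed to a single term, and `expand_value` and `expand` are hypothetical helper names. It shows that a bare string picks up the declared type, while a value already wrapped in `{"@value": ...}` passes through untyped.

```python
from typing import Any, Dict

# Assumption: inline stand-in for the relevant part of the codemeta context,
# trimmed to one term so the example needs no network access.
CONTEXT: Dict[str, Dict[str, str]] = {
    "dateCreated": {
        "@id": "http://schema.org/dateCreated",
        "@type": "http://schema.org/Date",
    },
}


def expand_value(term: str, value: Any) -> Dict[str, Any]:
    """Toy type coercion: only bare values pick up the term's declared type."""
    if isinstance(value, dict) and "@value" in value:
        # Already a value object: kept as-is, so no type is added.
        return dict(value)
    return {"@type": CONTEXT[term]["@type"], "@value": value}


def expand(doc: Dict[str, Any]) -> Dict[str, Any]:
    """Expand every known term of a flat document to its full IRI."""
    return {
        CONTEXT[term]["@id"]: expand_value(term, value)
        for term, value in doc.items()
        if term in CONTEXT
    }


# A bare string is coerced to the declared type.
typed = expand({"dateCreated": "2022-10-26"})
# A pre-wrapped value object loses the type, as described above.
untyped = expand({"dateCreated": {"@value": "2022-10-26"}})
```

Under these assumptions, `typed` carries `"@type": "http://schema.org/Date"` and `untyped` does not, which matches the two expanded forms quoted in the message.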

commit a66d5b240ab77e6d8d1b9accf43d571489a3f7f0
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date: Tue Oct 25 16:02:16 2022 +0200

metadata_dictionary: Systematically check input URLs before adding to graph

This is hopefully the definitive workaround for the PyLD issue.
See https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/521/ for more details.

anlambert added a subscriber: anlambert.

LGTM

swh/indexer/metadata_dictionary/codemeta.py
96

Maybe add an extra condition on the string length to avoid useless reformatting?

and len(json_child) < 10
This revision is now accepted and ready to land.
Oct 26 2022, 3:22 PM
swh/indexer/metadata_dictionary/codemeta.py
96

I don't think it matters; reformatting is fast:

In [4]: %timeit iso8601.parse_date("2022-10-26").date().isoformat()
4.56 µs ± 47.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
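
For readers without the diff at hand, the kind of reformatting being timed can be sketched with the standard library alone. This is an approximation under stated assumptions, not the code in the patch: the timing snippet above uses the third-party `iso8601` package, whereas `normalize_date` below is a hypothetical helper, and the choice of unpadded `YYYY-M-D` strings as the malformed input is a guess suggested by the length check proposed in the review comment.

```python
import datetime
import re
from typing import Optional

# Assumption: the malformed dates of interest are year-month-day strings
# whose month or day may lack zero padding (e.g. "2022-1-2").
_DATE_RE = re.compile(r"^(\d{4})-(\d{1,2})-(\d{1,2})$")


def normalize_date(value: str) -> Optional[str]:
    """Return the date as zero-padded YYYY-MM-DD, or None if it is not a valid date."""
    match = _DATE_RE.match(value.strip())
    if match is None:
        return None
    year, month, day = (int(group) for group in match.groups())
    try:
        # datetime.date validates ranges (month 1-12, day within the month).
        return datetime.date(year, month, day).isoformat()
    except ValueError:
        return None
```

With this sketch, `normalize_date("2022-1-2")` yields `"2022-01-02"`, an already well-formed date comes back unchanged, and strings that are not valid dates yield `None`.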