codemeta: Fix incorrect output namespace for dates and URLs
Closed, Public

Authored by vlorentz on Oct 26 2022, 2:15 PM.

Details

Summary

Codemeta re-exports schema:url, schema:dateCreated, ... with
"@type": "@id" and "@type": "schema:Date" so that

{
    "@context": "https://doi.org/10.5063/schema/codemeta-2.0",
    "url": "http://example.org",
    "dateCreated": "2022-10-26"
}

expands to:

{
    "http://schema.org/url": {
        "@type": "@id",
        "@value": "http://example.org"
    },
    "http://schema.org/dateCreated": {
        "@type": "http://schema.org/Date",
        "@value": "2022-10-26"
    }
}

However, our translation emitted a partially expanded form directly,
like this:

{
    "@context": "https://doi.org/10.5063/schema/codemeta-2.0",
    "url": {
        "@value": "http://example.org"
    },
    "dateCreated": {
        "@value": "2022-10-26"
    }
}

which prevents the compaction and expansion algorithms from adding a
type themselves, causing the document to be compacted to:

{
    "@context": "https://doi.org/10.5063/schema/codemeta-2.0",
    "schema:url": "http://example.org",
    "schema:dateCreated": "2022-10-26"
}

or expanded to:

{
    "http://schema.org/url": {
        "@value": "http://example.org"
    },
    "http://schema.org/dateCreated": {
        "@value": "2022-10-26"
    }
}

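neither of which is what we want: the values lose their type, and the
compacted keys fall back to the schema: prefix instead of the Codemeta terms.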
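
The behaviour described above can be reproduced with a small toy model. This is only a sketch, not PyLD's algorithm or the indexer's code: `TERMS`, `toy_expand` and `toy_compact` are hypothetical names, the context is reduced to two terms, and the output mirrors the simplified notation of this summary (no arrays, and `"@type": "@id"` kept on a value object, whereas a real JSON-LD processor would emit a node reference).

```python
SCHEMA = "http://schema.org/"

# Hypothetical, trimmed-down stand-in for the Codemeta context: each term maps
# to the IRI it re-exports and the type that plain values are coerced to.
TERMS = {
    "url": {"@id": SCHEMA + "url", "@type": "@id"},
    "dateCreated": {"@id": SCHEMA + "dateCreated", "@type": SCHEMA + "Date"},
}


def toy_expand(doc):
    """Expand term keys to IRIs; coerce only plain (unwrapped) values."""
    expanded = {}
    for key, value in doc.items():
        if key == "@context":
            continue
        term = TERMS[key]
        if isinstance(value, dict) and "@value" in value:
            # Already a value object: taken as-is, so no type is added.
            expanded[term["@id"]] = dict(value)
        else:
            expanded[term["@id"]] = {"@type": term["@type"], "@value": value}
    return expanded


def toy_compact(expanded):
    """Pick a term only if its type coercion matches the value's type."""
    compacted = {}
    for iri, value in expanded.items():
        for name, term in TERMS.items():
            if term["@id"] == iri and term["@type"] == value.get("@type"):
                compacted[name] = value["@value"]
                break
        else:
            # No term matches an untyped value: fall back to a prefixed IRI.
            compacted["schema:" + iri[len(SCHEMA):]] = value["@value"]
    return compacted


good = {"url": "http://example.org", "dateCreated": "2022-10-26"}
bad = {
    "url": {"@value": "http://example.org"},
    "dateCreated": {"@value": "2022-10-26"},
}

print(toy_expand(good))               # typed values
print(toy_expand(bad))                # untyped values
print(toy_compact(toy_expand(good)))  # keys: url, dateCreated
print(toy_compact(toy_expand(bad)))   # keys: schema:url, schema:dateCreated
```

In this model, wrapping a value in `{"@value": ...}` too early is what keeps the type from being attached, and the missing type is in turn what makes compaction skip the Codemeta term.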

This commit replaces the hack for @type with the right solution that
works for all properties.

I noticed this issue while writing tests for the diff that will resolve T4654.

Diff Detail

Repository
rDCIDX Metadata indexer
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

Build is green

Patch application report for D8778 (id=31643)

Could not rebase; Attempt merge onto a51cbf3965...

Updating a51cbf3..5b8b04a
Fast-forward
 swh/indexer/metadata_dictionary/base.py            | 25 +++++++------
 swh/indexer/metadata_dictionary/cff.py             |  7 +++-
 swh/indexer/metadata_dictionary/codemeta.py        | 16 ++++-----
 swh/indexer/metadata_dictionary/github.py          | 13 ++++---
 swh/indexer/metadata_dictionary/maven.py           | 11 +++---
 swh/indexer/metadata_dictionary/npm.py             | 16 ++-------
 swh/indexer/metadata_dictionary/nuget.py           |  4 +--
 swh/indexer/metadata_dictionary/utils.py           | 42 +++++++++++++++++++++-
 .../tests/metadata_dictionary/test_codemeta.py     |  4 ++-
 swh/indexer/tests/metadata_dictionary/test_npm.py  | 11 ++++++
 10 files changed, 96 insertions(+), 53 deletions(-)
Changes applied before test
commit 5b8b04ab55eb73fdd506ae6b00b21f51e183d883
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Wed Oct 26 14:08:33 2022 +0200

    codemeta: Fix incorrect output namespace for dates and URLs
    

commit a66d5b240ab77e6d8d1b9accf43d571489a3f7f0
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Tue Oct 25 16:02:16 2022 +0200

    metadata_dictionary: Systematically check input URLs before adding to graph

    This is hopefully the definitive workaround for the PyLD issue.

See https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/519/ for more details.

Build is green

Patch application report for D8778 (id=31644)

Could not rebase; Attempt merge onto a51cbf3965...

Updating a51cbf3..c0052f8
Fast-forward
 swh/indexer/metadata_dictionary/base.py            | 25 +++++++------
 swh/indexer/metadata_dictionary/cff.py             |  7 +++-
 swh/indexer/metadata_dictionary/codemeta.py        | 16 ++++-----
 swh/indexer/metadata_dictionary/github.py          | 13 ++++---
 swh/indexer/metadata_dictionary/maven.py           | 11 +++---
 swh/indexer/metadata_dictionary/npm.py             | 16 ++-------
 swh/indexer/metadata_dictionary/nuget.py           |  4 +--
 swh/indexer/metadata_dictionary/utils.py           | 42 +++++++++++++++++++++-
 .../tests/metadata_dictionary/test_codemeta.py     | 11 ++++--
 swh/indexer/tests/metadata_dictionary/test_npm.py  | 11 ++++++
 10 files changed, 102 insertions(+), 54 deletions(-)
Changes applied before test
commit c0052f8e48fa4cf2c0034c48d2e66355558af62a
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Wed Oct 26 14:08:33 2022 +0200

    codemeta: Fix incorrect output namespace for dates and URLs
    

commit a66d5b240ab77e6d8d1b9accf43d571489a3f7f0
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Tue Oct 25 16:02:16 2022 +0200

    metadata_dictionary: Systematically check input URLs before adding to graph

    This is hopefully the definitive workaround for the PyLD issue.

See https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/520/ for more details.

This revision is now accepted and ready to land.
Oct 26 2022, 3:15 PM