
codemeta: Fix incorrect output namespace for dates and URLs
Closed · Public

Authored by vlorentz on Oct 26 2022, 2:15 PM.

Details

Summary

Codemeta reexports schema:url, schema:dateCreated, ... with
"@type": "@id" and "@type": "schema:Date" so that

{
    "@context": "https://doi.org/10.5063/schema/codemeta-2.0",
    "url": "http://example.org",
    "dateCreated": "2022-10-26"
}

expands to:

{
    "http://schema.org/url": {
        "@id": "http://example.org"
    },
    "http://schema.org/dateCreated": {
        "@type": "http://schema.org/Date",
        "@value": "2022-10-26"
    }
}

However, our translation produced a partially expanded form directly,
like this:

{
    "@context": "https://doi.org/10.5063/schema/codemeta-2.0",
    "url": {
        "@value": "http://example.org"
    },
    "dateCreated": {
        "@value": "2022-10-26"
    }
}

which prevents the compaction and expansion algorithms from adding a
type themselves, causing the document to be compacted to:

{
    "@context": "https://doi.org/10.5063/schema/codemeta-2.0",
    "schema:url": "http://example.org",
    "schema:dateCreated": "2022-10-26"
}

or expanded to:

{
    "http://schema.org/url": {
        "@value": "http://example.org"
    },
    "http://schema.org/dateCreated": {
        "@value": "2022-10-26"
    }
}

Neither is what we want: both forms lose the "@id" and "schema:Date" types.
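
The difference between the two inputs can be reproduced with a small standalone sketch. This is not PyLD and not the indexer's code: `CONTEXT`, `expand_value` and `expand` are hypothetical names, the context is a two-term excerpt written by hand rather than the real codemeta context, and the value expansion is reduced to the one rule that matters here. Type coercion from the context applies only to bare strings; a value that is already a `{"@value": ...}` object is passed through untouched. The sketch uses the `{"@id": ...}` node-reference form that JSON-LD produces for `@id`-coerced terms.

```python
# Hand-written excerpt of term definitions (assumption: simplified from the
# codemeta context, which defines many more terms).
CONTEXT = {
    "url": {"@id": "http://schema.org/url", "@type": "@id"},
    "dateCreated": {
        "@id": "http://schema.org/dateCreated",
        "@type": "http://schema.org/Date",
    },
}


def expand_value(term, value):
    """Toy value expansion for one term (hypothetical helper)."""
    if isinstance(value, dict):
        # Already a value object: the context's type coercion is NOT applied.
        return value
    coercion = CONTEXT[term].get("@type")
    if coercion == "@id":
        # Terms coerced to @id expand to a node reference.
        return {"@id": value}
    if coercion is not None:
        return {"@type": coercion, "@value": value}
    return {"@value": value}


def expand(document):
    """Expand a flat document whose keys are all terms defined in CONTEXT."""
    return {
        CONTEXT[term]["@id"]: expand_value(term, value)
        for term, value in document.items()
        if term != "@context"
    }


# Bare strings: the context adds the types.
typed = expand({"url": "http://example.org", "dateCreated": "2022-10-26"})

# Pre-wrapped values, as the old translation emitted them: the types are lost.
untyped = expand(
    {
        "url": {"@value": "http://example.org"},
        "dateCreated": {"@value": "2022-10-26"},
    }
)
```

With bare strings, `typed` carries the `@id` reference and the `http://schema.org/Date` type; with pre-wrapped values, `untyped` keeps only `@value`, which matches the unwanted expanded form above.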

This commit replaces the hack for @type with the right solution that
works for all properties.

I noticed this issue while writing tests for the diff that will resolve T4654.

Diff Detail

Repository
rDCIDX Metadata indexer
Branch
namespaces
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 32597
Build 51064: Phabricator diff pipeline on jenkins
Build 51063: arc lint + arc unit

Event Timeline

Build is green

Patch application report for D8778 (id=31643)

Could not rebase; Attempt merge onto a51cbf3965...

Updating a51cbf3..5b8b04a
Fast-forward
 swh/indexer/metadata_dictionary/base.py            | 25 +++++++------
 swh/indexer/metadata_dictionary/cff.py             |  7 +++-
 swh/indexer/metadata_dictionary/codemeta.py        | 16 ++++-----
 swh/indexer/metadata_dictionary/github.py          | 13 ++++---
 swh/indexer/metadata_dictionary/maven.py           | 11 +++---
 swh/indexer/metadata_dictionary/npm.py             | 16 ++-------
 swh/indexer/metadata_dictionary/nuget.py           |  4 +--
 swh/indexer/metadata_dictionary/utils.py           | 42 +++++++++++++++++++++-
 .../tests/metadata_dictionary/test_codemeta.py     |  4 ++-
 swh/indexer/tests/metadata_dictionary/test_npm.py  | 11 ++++++
 10 files changed, 96 insertions(+), 53 deletions(-)
Changes applied before test
commit 5b8b04ab55eb73fdd506ae6b00b21f51e183d883
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Wed Oct 26 14:08:33 2022 +0200

    codemeta: Fix incorrect output namespace for dates and URLs
    
    Codemeta reexports schema:url, schema:dateCreated, ... with
    `"@type": "@id"` and `"@type": "schema:Date"` so that

```
{
    "@context": "https://doi.org/10.5063/schema/codemeta-2.0",
    "url": "http://example.org",
    "dateCreated": "2022-10-26"
}
```

expands to:

```
{
    "http://schema.org/url": {
        "@id": "http://example.org"
    },
    "http://schema.org/dateCreated": {
        "@type": "http://schema.org/Date",
        "@value": "2022-10-26"
    }
}
```

However, our translation tried to translate directly to a partially expanded
form, like this:

```
{
    "@context": "https://doi.org/10.5063/schema/codemeta-2.0",
    "url": {
        "@value": "http://example.org"
    },
    "dateCreated": {
        "@value": "2022-10-26"
    }
}
```

which prevents the compaction and expansion algorithms from adding a
type themselves, causing the document to be compacted to:

```
{
    "@context": "https://doi.org/10.5063/schema/codemeta-2.0",
    "schema:url": "http://example.org",
    "schema:dateCreated": "2022-10-26"
}
```

or expanded to:

```
{
    "http://schema.org/url": {
        "@value": "http://example.org"
    },
    "http://schema.org/dateCreated": {
        "@value": "2022-10-26"
    }
}
```

which are not what we want.

This commit replaces the hack for `@type` with the right solution that
works for all properties.

commit a66d5b240ab77e6d8d1b9accf43d571489a3f7f0
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date: Tue Oct 25 16:02:16 2022 +0200

metadata_dictionary: Systematically check input URLs before adding to graph

This is hopefully the definitive workaround for the PyLD issue.
See https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/519/ for more details.

Build is green

Patch application report for D8778 (id=31644)

Could not rebase; Attempt merge onto a51cbf3965...

Updating a51cbf3..c0052f8
Fast-forward
 swh/indexer/metadata_dictionary/base.py            | 25 +++++++------
 swh/indexer/metadata_dictionary/cff.py             |  7 +++-
 swh/indexer/metadata_dictionary/codemeta.py        | 16 ++++-----
 swh/indexer/metadata_dictionary/github.py          | 13 ++++---
 swh/indexer/metadata_dictionary/maven.py           | 11 +++---
 swh/indexer/metadata_dictionary/npm.py             | 16 ++-------
 swh/indexer/metadata_dictionary/nuget.py           |  4 +--
 swh/indexer/metadata_dictionary/utils.py           | 42 +++++++++++++++++++++-
 .../tests/metadata_dictionary/test_codemeta.py     | 11 ++++--
 swh/indexer/tests/metadata_dictionary/test_npm.py  | 11 ++++++
 10 files changed, 102 insertions(+), 54 deletions(-)
Changes applied before test
commit c0052f8e48fa4cf2c0034c48d2e66355558af62a
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Wed Oct 26 14:08:33 2022 +0200

    codemeta: Fix incorrect output namespace for dates and URLs
    
commit a66d5b240ab77e6d8d1b9accf43d571489a3f7f0
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date: Tue Oct 25 16:02:16 2022 +0200

metadata_dictionary: Systematically check input URLs before adding to graph

This is hopefully the definitive workaround for the PyLD issue.
See https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/520/ for more details.

This revision is now accepted and ready to land.
Oct 26 2022, 3:15 PM