Page MenuHomeSoftware Heritage

swh-indexer : Add mapping for CITATION.cff files
ClosedPublic

Authored by KShivendu on Mar 18 2021, 6:16 AM.

Diff Detail

Repository
rDCIDX Metadata indexer
Branch
index-citation-cff
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 20035
Build 31104: Phabricator diff pipeline on jenkinsJenkins console · Jenkins
Build 31103: arc lint + arc unit

Event Timeline

Build has FAILED

Patch application report for D5273 (id=18890)

Rebasing onto 4831012290...

First, rewinding head to replay your work on top of it...
Applying: Add mapping for CITATION.cff
Changes applied before test
commit 61ea8a46b8c8bf8d3c44e7dfaaf64fd60f8dc4c2
Author: KShivendu <shivendu@iitbhilai.ac.in>
Date:   Thu Mar 18 10:43:51 2021 +0530

    Add mapping for CITATION.cff

Link to build: https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/153/
See console output for more information: https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/153/console

Harbormaster returned this revision to the author for changes because remote builds failed.Mar 18 2021, 6:18 AM
Harbormaster failed remote builds in B19979: Diff 18890!

Updating D5273: Fix test cases according to changes

Build is green

Patch application report for D5273 (id=18891)

Rebasing onto 4831012290...

First, rewinding head to replay your work on top of it...
Applying: Add mapping for CITATION.cff
Changes applied before test
commit 6e88d9b342da3df7f1c78686011a9471056e18dd
Author: KShivendu <shivendu@iitbhilai.ac.in>
Date:   Thu Mar 18 10:43:51 2021 +0530

    Add mapping for CITATION.cff

See https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/154/ for more details.

Please give some more details on what problems this diff solves. While adding the
reference task is good practice (thanks), it's not enough.

Details should go both in the commit message and the diff description [1]

Also, please have a look at the coverage. When the build is done, this displays colors
on the code diff-ed, on the right. Blue means covered, oranage not covered.

You should do your best to cover the most cases. Uncovered should be the exception (it
can "justifiably" happen but not that often).

[1] arcanist (arc cli) should take your commit message and uses as default diff
description when you create the diff so you shall not repeat yourself too much.

vlorentz added a subscriber: vlorentz.

Please don't update crosswalk.csv directly here, it should only be imported from Codemeta. Make sure you read their CONTRIBUTING.md first.

And you also made changes to the NodeJS mapping, these should go their own diff

This revision now requires changes to proceed.Mar 18 2021, 9:45 AM

Please don't update crosswalk.csv directly here, it should only be imported from Codemeta. Make sure you read their CONTRIBUTING.md first.

I have directly downloaded the crosswalk.csv file from the official repository. You can verify sha1 hashes(after appending an empty line in my file).

And you also made changes to the NodeJS mapping, these should go their own diff

I think I should make a new diff that covers (1) updating this crosswalk file and (2) the changes in the NodeJS mapping test and get it accepted first. Also, I am concerned if this change in the crosswalk file will lead to any kind of incompatibility with previously stored data. What do you think about this @vlorentz ?

Please give some more details on what problems this diff solves. While adding the
reference task is good practice (thanks), it's not enough.

Thank you very much for your feedback. I will keep this in mind.

Please don't update crosswalk.csv directly here, it should only be imported from Codemeta. Make sure you read their CONTRIBUTING.md first.

I have directly downloaded the crosswalk.csv file from the official repository. You can verify sha1 hashes(after appending an empty line in my file).

Oh, ok. (Undo the line break change, though)

And you also made changes to the NodeJS mapping, these should go their own diff

I think I should make a new diff that covers (1) updating this crosswalk file and (2) the changes in the NodeJS mapping test and get it accepted first.

Yes please

Also, I am concerned if this change in the crosswalk file will lead to any kind of incompatibility with previously stored data. What do you think about this @vlorentz ?

processorRequirements -> runtimePlatform is a minor change, it will be fine.

contributor -> contributors is a bug fix, so it won't break anything

Updating D5273: Add mapping for CITATION.cff

Build is green

Patch application report for D5273 (id=18955)

Rebasing onto 8e046c1900...

Current branch diff-target is up to date.
Changes applied before test
commit d46c74a80d0048cccab7e8646545b7ba306d817a
Author: KShivendu <shivendu@iitbhilai.ac.in>
Date:   Thu Mar 18 10:43:51 2021 +0530

    Add mapping for CITATION.cff

See https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/161/ for more details.

KShivendu retitled this revision from Add mapping for CITATION.cff to swh-indexer : Add mapping for CITATION.cff files.Mar 20 2021, 10:05 AM
KShivendu edited the test plan for this revision. (Show Details)

Could you add tests?

swh/indexer/metadata_dictionary/citation.py
31–33

You can remove these comments (they are only useful in the example)

37–46

This is technically correct IIRC, but I would rather not have a field than have it with a None value

This revision is now accepted and ready to land.Mar 20 2021, 11:48 AM
This revision now requires changes to proceed.Mar 20 2021, 11:48 AM

Hey @vlorentz

I am writing tests for CITATION.cff metadata mapping and I have 2 doubts :
(1) Are CITATION and CITATION.cff the same? (If so, cff-converter isn't able to parse our swh/indexer/data/codemeta/CITATION file maybe because it doesn't look like a YAML file)
(2) cff-converter's output is :

{
    "@context": [
        "https://doi.org/10.5063/schema/codemeta-2.0",
        "http://schema.org"
    ],
    "@type": "SoftwareSourceCode",
    "author": [
        {
            "@id": "https://orcid.org/0000-0002-7064-4069",
            "@type": "Person",
            "affiliation": {
                "@type": "Organization",
                "legalName": "Netherlands eScience Center"
            },
            "familyName": "Spaaks",
            "givenName": "Jurriaan H."
        },
        {
            "@type": "Person",
            "affiliation": {
                "@type": "Organization",
                "legalName": "Netherlands eScience Center"
            },
            "familyName": "Klaver",
            "givenName": "Tom"
        }
    ],
    "codeRepository": "https://github.com/citation-file-format/cff-converter-python",
    "datePublished": "2019-11-12",
    "identifier": "https://doi.org/10.5281/zenodo.1162057",
    "keywords": [
        "citation",
        "bibliography",
        "cff",
        "CITATION.cff"
    ],
    "license": "http://www.apache.org/licenses/LICENSE-2.0",
    "name": "cffconvert",
    "version": "1.4.0-alpha0"
}

While in all of our translated outputs (described in test_metadata.py for other mappings)

  • I see only one value in @context and it is https://doi.org/10.5063/schema/codemeta-2.0 only (not http://schema.org)
  • '@type' is written as 'type' (in the tests)
  • author has only 'type', 'name' and 'email'
  • 'datePublished' field doesn't exist in any of our tests(but it does in the crosswalk.csv file)
  • We use 'https://spdx.org' in the license but cff-convert uses 'http://www.apache.org' in the 'license' field

I am mentioning these to just confirm that it's okay to deviate from the cff-converter format and adjust it to fit with our existing format.
cff-converter claims to convert the CITATION file into codemeta.json format but this seems slightly different and that's why I need your opinion.

(1) Are CITATION and CITATION.cff the same? (If so, cff-converter isn't able to parse our swh/indexer/data/codemeta/CITATION file maybe because it doesn't look like a YAML file)

No, CITATION is just a common name for plain text files

It's because cff-converter outputs some names that are in the http://schema.org namespace but not in https://doi.org/10.5063/schema/codemeta-2.0, such as legalName.

  • '@type' is written as 'type' (in the tests)

Semantically equivalent because of this: https://github.com/codemeta/codemeta/blob/master/codemeta.jsonld#L3

  • author has only 'type', 'name' and 'email'
  • 'datePublished' field doesn't exist in any of our tests(but it does in the crosswalk.csv file)

doesn't mean new mappings can't use other fields. Any valid Codemeta (or even schema.org) is fine.

  • We use 'https://spdx.org' in the license but cff-convert uses 'http://www.apache.org' in the 'license' field

That's only for Apache licenses, right?

Either way; cff-convert's output is correct, but we should normalize the license to an SPDX URI so that it's actually queryable.

I am mentioning these to just confirm that it's okay to deviate from the cff-converter format and adjust it to fit with our existing format.
cff-converter claims to convert the CITATION file into codemeta.json format but this seems slightly different and that's why I need your opinion.

Other than the license, cff-converter seems fine.

Although, I can already see a mistake in cff-converter's output, they should use name instead of legalName to translate the affiliation attribute, because nothing in the CFF spec requires it to be a legal name

swh/indexer/metadata_dictionary/__init__.py
5

slight comment following your question on citation vs citation.cff:

  • it would be preferable to name the mapping cff, which is a more unambiguous name to this vocabulary
swh/indexer/metadata_dictionary/__init__.py
5

indeed!

KShivendu marked 3 inline comments as done.

Updating D5273: swh-indexer : Add mapping for CITATION.cff files

Any valid Codemeta (or even schema.org) is fine

So how does a parser decide which one to use for interpretation?

Build has FAILED

Patch application report for D5273 (id=19072)

Rebasing onto 8e046c1900...

Current branch diff-target is up to date.
Changes applied before test
commit 727a9f993e22b444a7f4fef105ad099ede76d474
Author: KShivendu <shivendu@iitbhilai.ac.in>
Date:   Thu Mar 18 10:43:51 2021 +0530

    metadata_dictionary: Add mapping for CITATION.cff
    
    The implementation is based on output from cff-converter python library.
    Deviation is made for license and organisation.legalName(replaced with name)

Link to build: https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/163/
See console output for more information: https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/163/console

Updating D5273: swh-indexer : Fix failing tests

Build is green

Patch application report for D5273 (id=19073)

Rebasing onto 8e046c1900...

Current branch diff-target is up to date.
Changes applied before test
commit 2419ee1de062077cdc458616a1c8b63f25b1db62
Author: KShivendu <shivendu@iitbhilai.ac.in>
Date:   Thu Mar 18 10:43:51 2021 +0530

    metadata_dictionary: Add mapping for CITATION.cff
    
    The implementation is based on output from cff-converter python library.
    Deviation is made for license and organisation.legalName(replaced with name)

See https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/164/ for more details.

Any valid Codemeta (or even schema.org) is fine

So how does a parser decide which one to use for interpretation?

So how does a parser decide which one to use for interpretation?

Using @context. See https://codemeta.github.io/jsonld/ or https://www.w3.org/TR/json-ld/#the-context for details

And the expected value in your test is not valid, because it uses the schema: prefix, without defining it.

And please keep the examples in test short; for example we don't need all five authors.

This revision now requires changes to proceed.Mar 25 2021, 9:50 AM

And the expected value in your test is not valid, because it uses the schema: prefix, without defining it.

Please correct me if I am wrong but I see codemeta.jsonld defines 'schema' as well.
Also, I won't have to specifically mention 'http://schema.org/' in the @context array.

I tested it and everything seems correct to me. You can check it out using this Permalink and let me know if it is actually correct or not.

Indeed, you're right.

Nonetheless, the properties are also defined by Codemeta (even if mostly are aliases of schema.org), so we should use them

Updating D5273: Remove 'schema:' and limit test to two authors

Build is green

Patch application report for D5273 (id=19286)

Rebasing onto 8e046c1900...

Current branch diff-target is up to date.
Changes applied before test
commit 01aac465d321505f37b643225547a65057f0786f
Author: KShivendu <shivendu@iitbhilai.ac.in>
Date:   Thu Mar 18 10:43:51 2021 +0530

    metadata_dictionary: Add mapping for CITATION.cff
    
    The implementation is based on output from cff-converter python library.
    Deviation has been made for license and 'organisation.legalName'(replaced with 'organisation.name')

See https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/166/ for more details.

Looking good. Last comments:

swh/indexer/metadata_dictionary/cff.py
34–52 ↗(On Diff #19286)

could you move this part to a method? (normalize_authors iirc)

54–55 ↗(On Diff #19286)

isn't this redundant with the crosswalk?

swh/indexer/tests/test_metadata.py
88 ↗(On Diff #19286)

typo

119–121 ↗(On Diff #19286)

use b""" to start the string so you don't need to .encode() it

141 ↗(On Diff #19286)

you can write the literal ä here

148 ↗(On Diff #19286)

drop the " " (Black can't do it when it folds two lines, unfortunately)

This revision now requires changes to proceed.Mar 30 2021, 4:39 PM
KShivendu added inline comments.
swh/indexer/metadata_dictionary/cff.py
34–52 ↗(On Diff #19286)

I tried but something went wrong. I went in debugging mode to track the flow and found that the normalize_translation function made the output to be :

"author": [
    {
        "type": "Person"
    },
    {
        "id": "https://orcid.org/0000-0003-4925-7248",
        "type": "Person"
    }
],

But the playground gives the correct results in compact form. Have a look at it here

54–55 ↗(On Diff #19286)

Yes, it is. But when I put these two fields in string_fields the output is

'schema:codeRepository' : '....',
'schema:datePublished' : '....'

not

'codeRepository' : '....',
'datePublished' : '....'

But this doesn't happen with the other fields. Why is that?

swh/indexer/tests/test_metadata.py
119–121 ↗(On Diff #19286)

I did that the first time but it throws an error: bytes can only contain ASCII literal characters.

swh/indexer/metadata_dictionary/cff.py
34–52 ↗(On Diff #19286)

It seems to works in npm.py. Try to find why it works for NPM but not CFF.

54–55 ↗(On Diff #19286)

It's a bug :)

I'm working on it

swh/indexer/metadata_dictionary/cff.py
54–55 ↗(On Diff #19286)

Actually it isn't. (Well, it's an arguable design decision in Codemeta, but that's how it is).

These fields are defined as:

"datePublished":  {"@id": "schema:datePublished", "@type": "schema:Date" },

and

"codeRepository": { "@id": "schema:codeRepository", "@type": "@id"},

which means "the codemeta term is like the schema.org term, but with an extra constraint on the type". But the JSON-LD document you create does not match these type definitions. So when compacting, it can't be converted to the codemeta term, because of the constraint.

So the fix should be to make them satisfy the type constraint. (Hint: the license field is defined the same way)

KShivendu marked 5 inline comments as done.

Updating D5273: swh-indexer : Fix issues related to datePublished and codeRepository fields.
Migrate code related to authors into normalize_authors function
Add .vscode/ in .gitignore to avoid tracking launch.json and other VScode config files

Build is green

Patch application report for D5273 (id=19368)

Rebasing onto 8e046c1900...

Current branch diff-target is up to date.
Changes applied before test
commit 8f1fb0f9310df6174823a149e62662881be82f87
Author: KShivendu <shivendu@iitbhilai.ac.in>
Date:   Thu Mar 18 10:43:51 2021 +0530

    metadata_dictionary: Add mapping for CITATION.cff
    
    The implementation is based on output from cff-converter python library.
    Deviation has been made for license and organisation.legalName(replaced with organisation.name)

See https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/167/ for more details.

Great.

If you want to add .vscode to .gitignore, you should do it in its own diff, and for *all* repositories.
Or you can configure it globally so you don't need to commit it, see https://git-scm.com/docs/gitignore

If you want to add .vscode to .gitignore, you should do it in its own diff, and for *all* repositories.

I ran find */.gitignore -type f -exec grep -H '.vscode' {} + in swh-environment and got :

swh-indexer/.gitignore:.vscode/ -> I did this one.
swh-provenance/.gitignore:.vscode
swh-storage/.gitignore:.vscode/
swh-web/.gitignore:.vscode/

So these repositories already have .vscode in their .gitignore. And IMO, creating a different commit just to put .vscode in all the .gitignore repositories doesn't look good. Contributors can add it if required. What's your opinion?
Btw, what editor do you use?

Alright then

Btw, what editor do you use?

Personally, vim. In the rest of the team, it's mostly emacs or VS Code

This revision is now accepted and ready to land.Apr 3 2021, 3:49 PM