⚙ D5273 swh-indexer : Add mapping for CITATION.cff files

KShivendu created this revision.Mar 18 2021, 6:16 AM

Herald added a reviewer: Reviewers. · View Herald TranscriptMar 18 2021, 6:16 AM

Build has FAILED

Patch application report for D5273 (id=18890)

Rebasing onto 4831012290...

First, rewinding head to replay your work on top of it...
Applying: Add mapping for CITATION.cff

Changes applied before test

commit 61ea8a46b8c8bf8d3c44e7dfaaf64fd60f8dc4c2
Author: KShivendu <shivendu@iitbhilai.ac.in>
Date:   Thu Mar 18 10:43:51 2021 +0530

    Add mapping for CITATION.cff

Link to build: https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/153/
See console output for more information: https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/153/console

Harbormaster returned this revision to the author for changes because remote builds failed.Mar 18 2021, 6:18 AM

Harbormaster failed remote builds in B19979: Diff 18890!

Updating D5273: Fix test cases according to changes

Build is green

Patch application report for D5273 (id=18891)

Rebasing onto 4831012290...

First, rewinding head to replay your work on top of it...
Applying: Add mapping for CITATION.cff

Changes applied before test

commit 6e88d9b342da3df7f1c78686011a9471056e18dd
Author: KShivendu <shivendu@iitbhilai.ac.in>
Date:   Thu Mar 18 10:43:51 2021 +0530

    Add mapping for CITATION.cff

See https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/154/ for more details.

Harbormaster completed remote builds in B19980: Diff 18891.Mar 18 2021, 7:18 AM

KShivendu requested review of this revision.Mar 18 2021, 7:18 AM

Please give some more details on what problems this diff solves. While adding the
reference task is good practice (thanks), it's not enough.

Details should go both in the commit message and the diff description [1]

Also, please have a look at the coverage. When the build is done, this displays colors
on the code diff-ed, on the right. Blue means covered, oranage not covered.

You should do your best to cover the most cases. Uncovered should be the exception (it
can "justifiably" happen but not that often).

[1] arcanist (arc cli) should take your commit message and uses as default diff
description when you create the diff so you shall not repeat yourself too much.

Please don't update crosswalk.csv directly here, it should only be imported from Codemeta. Make sure you read their CONTRIBUTING.md first.

And you also made changes to the NodeJS mapping, these should go their own diff

This revision now requires changes to proceed.Mar 18 2021, 9:45 AM

Please don't update crosswalk.csv directly here, it should only be imported from Codemeta. Make sure you read their CONTRIBUTING.md first.

I have directly downloaded the crosswalk.csv file from the official repository. You can verify sha1 hashes(after appending an empty line in my file).

And you also made changes to the NodeJS mapping, these should go their own diff

I think I should make a new diff that covers (1) updating this crosswalk file and (2) the changes in the NodeJS mapping test and get it accepted first. Also, I am concerned if this change in the crosswalk file will lead to any kind of incompatibility with previously stored data. What do you think about this @vlorentz ?

Please give some more details on what problems this diff solves. While adding the
reference task is good practice (thanks), it's not enough.

Thank you very much for your feedback. I will keep this in mind.

In D5273#133952, @KShivendu wrote:

Please don't update crosswalk.csv directly here, it should only be imported from Codemeta. Make sure you read their CONTRIBUTING.md first.

I have directly downloaded the crosswalk.csv file from the official repository. You can verify sha1 hashes(after appending an empty line in my file).

Oh, ok. (Undo the line break change, though)

And you also made changes to the NodeJS mapping, these should go their own diff

I think I should make a new diff that covers (1) updating this crosswalk file and (2) the changes in the NodeJS mapping test and get it accepted first.

Yes please

Also, I am concerned if this change in the crosswalk file will lead to any kind of incompatibility with previously stored data. What do you think about this @vlorentz ?

processorRequirements -> runtimePlatform is a minor change, it will be fine.

contributor -> contributors is a bug fix, so it won't break anything

KShivendu mentioned this in D5282: swh-indexer: sync data/codemeta with official codemeta repo.Mar 18 2021, 7:41 PM

Updating D5273: Add mapping for CITATION.cff

Build is green

Patch application report for D5273 (id=18955)

Rebasing onto 8e046c1900...

Current branch diff-target is up to date.

Changes applied before test

commit d46c74a80d0048cccab7e8646545b7ba306d817a
Author: KShivendu <shivendu@iitbhilai.ac.in>
Date:   Thu Mar 18 10:43:51 2021 +0530

    Add mapping for CITATION.cff

See https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/161/ for more details.

Harbormaster completed remote builds in B20035: Diff 18955.Mar 20 2021, 9:54 AM

KShivendu retitled this revision from Add mapping for CITATION.cff to swh-indexer : Add mapping for CITATION.cff files.Mar 20 2021, 10:05 AM

KShivendu edited the test plan for this revision. (Show Details)

KShivendu edited the test plan for this revision. (Show Details)Mar 20 2021, 10:14 AM

Could you add tests?

swh/indexer/metadata_dictionary/citation.py
31–33 ↗	(On Diff #18955)	You can remove these comments (they are only useful in the example)
37–46 ↗	(On Diff #18955)	This is technically correct IIRC, but I would rather not have a field than have it with a `None` value

This revision is now accepted and ready to land.Mar 20 2021, 11:48 AM

vlorentz requested changes to this revision.Mar 20 2021, 11:48 AM

This revision now requires changes to proceed.Mar 20 2021, 11:48 AM

Hey @vlorentz

I am writing tests for CITATION.cff metadata mapping and I have 2 doubts :
(1) Are CITATION and CITATION.cff the same? (If so, cff-converter isn't able to parse our swh/indexer/data/codemeta/CITATION file maybe because it doesn't look like a YAML file)
(2) cff-converter's output is :

{
    "@context": [
        "https://doi.org/10.5063/schema/codemeta-2.0",
        "http://schema.org"
    ],
    "@type": "SoftwareSourceCode",
    "author": [
        {
            "@id": "https://orcid.org/0000-0002-7064-4069",
            "@type": "Person",
            "affiliation": {
                "@type": "Organization",
                "legalName": "Netherlands eScience Center"
            },
            "familyName": "Spaaks",
            "givenName": "Jurriaan H."
        },
        {
            "@type": "Person",
            "affiliation": {
                "@type": "Organization",
                "legalName": "Netherlands eScience Center"
            },
            "familyName": "Klaver",
            "givenName": "Tom"
        }
    ],
    "codeRepository": "https://github.com/citation-file-format/cff-converter-python",
    "datePublished": "2019-11-12",
    "identifier": "https://doi.org/10.5281/zenodo.1162057",
    "keywords": [
        "citation",
        "bibliography",
        "cff",
        "CITATION.cff"
    ],
    "license": "http://www.apache.org/licenses/LICENSE-2.0",
    "name": "cffconvert",
    "version": "1.4.0-alpha0"
}

While in all of our translated outputs (described in test_metadata.py for other mappings)

I see only one value in @context and it is https://doi.org/10.5063/schema/codemeta-2.0 only (not http://schema.org)
'@type' is written as 'type' (in the tests)
author has only 'type', 'name' and 'email'
'datePublished' field doesn't exist in any of our tests(but it does in the crosswalk.csv file)
We use 'https://spdx.org' in the license but cff-convert uses 'http://www.apache.org' in the 'license' field

I am mentioning these to just confirm that it's okay to deviate from the cff-converter format and adjust it to fit with our existing format.
cff-converter claims to convert the CITATION file into codemeta.json format but this seems slightly different and that's why I need your opinion.

In D5273#134437, @KShivendu wrote:

(1) Are CITATION and CITATION.cff the same? (If so, cff-converter isn't able to parse our swh/indexer/data/codemeta/CITATION file maybe because it doesn't look like a YAML file)

No, CITATION is just a common name for plain text files

I see only one value in @context and it is https://doi.org/10.5063/schema/codemeta-2.0 only (not http://schema.org)

It's because cff-converter outputs some names that are in the http://schema.org namespace but not in https://doi.org/10.5063/schema/codemeta-2.0, such as legalName.

'@type' is written as 'type' (in the tests)

Semantically equivalent because of this: https://github.com/codemeta/codemeta/blob/master/codemeta.jsonld#L3

author has only 'type', 'name' and 'email'

'datePublished' field doesn't exist in any of our tests(but it does in the crosswalk.csv file)

doesn't mean new mappings can't use other fields. Any valid Codemeta (or even schema.org) is fine.

We use 'https://spdx.org' in the license but cff-convert uses 'http://www.apache.org' in the 'license' field

That's only for Apache licenses, right?

Either way; cff-convert's output is correct, but we should normalize the license to an SPDX URI so that it's actually queryable.

I am mentioning these to just confirm that it's okay to deviate from the cff-converter format and adjust it to fit with our existing format.
cff-converter claims to convert the CITATION file into codemeta.json format but this seems slightly different and that's why I need your opinion.

Other than the license, cff-converter seems fine.

Although, I can already see a mistake in cff-converter's output, they should use name instead of legalName to translate the affiliation attribute, because nothing in the CFF spec requires it to be a legal name

moranegg added a subscriber: moranegg.Mar 23 2021, 5:33 PM

moranegg added inline comments.Mar 23 2021, 5:37 PM

swh/indexer/metadata_dictionary/__init__.py
5	slight comment following your question on citation vs citation.cff: it would be preferable to name the mapping cff, which is a more unambiguous name to this vocabulary

vlorentz added inline comments.Mar 23 2021, 5:42 PM

swh/indexer/metadata_dictionary/__init__.py
5	indeed!

Updating D5273: swh-indexer : Add mapping for CITATION.cff files

Any valid Codemeta (or even schema.org) is fine

So how does a parser decide which one to use for interpretation?

Build has FAILED

Patch application report for D5273 (id=19072)

Rebasing onto 8e046c1900...

Current branch diff-target is up to date.

Changes applied before test

commit 727a9f993e22b444a7f4fef105ad099ede76d474
Author: KShivendu <shivendu@iitbhilai.ac.in>
Date:   Thu Mar 18 10:43:51 2021 +0530

    metadata_dictionary: Add mapping for CITATION.cff
    
    The implementation is based on output from cff-converter python library.
    Deviation is made for license and organisation.legalName(replaced with name)

Link to build: https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/163/
See console output for more information: https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/163/console

Harbormaster failed remote builds in B20164: Diff 19072!Mar 24 2021, 10:17 AM

Updating D5273: swh-indexer : Fix failing tests

Build is green

Patch application report for D5273 (id=19073)

Rebasing onto 8e046c1900...

Current branch diff-target is up to date.

Changes applied before test

commit 2419ee1de062077cdc458616a1c8b63f25b1db62
Author: KShivendu <shivendu@iitbhilai.ac.in>
Date:   Thu Mar 18 10:43:51 2021 +0530

    metadata_dictionary: Add mapping for CITATION.cff
    
    The implementation is based on output from cff-converter python library.
    Deviation is made for license and organisation.legalName(replaced with name)

See https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/164/ for more details.

Harbormaster completed remote builds in B20165: Diff 19073.Mar 24 2021, 10:26 AM

Any valid Codemeta (or even schema.org) is fine

So how does a parser decide which one to use for interpretation?

In D5273#135225, @KShivendu wrote:

So how does a parser decide which one to use for interpretation?

Using @context. See https://codemeta.github.io/jsonld/ or https://www.w3.org/TR/json-ld/#the-context for details

And the expected value in your test is not valid, because it uses the schema: prefix, without defining it.

And please keep the examples in test short; for example we don't need all five authors.

This revision now requires changes to proceed.Mar 25 2021, 9:50 AM

And the expected value in your test is not valid, because it uses the schema: prefix, without defining it.

Please correct me if I am wrong but I see codemeta.jsonld defines 'schema' as well.
Also, I won't have to specifically mention 'http://schema.org/' in the @context array.

I tested it and everything seems correct to me. You can check it out using this Permalink and let me know if it is actually correct or not.

Indeed, you're right.

Nonetheless, the properties are also defined by Codemeta (even if mostly are aliases of schema.org), so we should use them

Updating D5273: Remove 'schema:' and limit test to two authors

Build is green

Patch application report for D5273 (id=19286)

Rebasing onto 8e046c1900...

Current branch diff-target is up to date.

Changes applied before test

commit 01aac465d321505f37b643225547a65057f0786f
Author: KShivendu <shivendu@iitbhilai.ac.in>
Date:   Thu Mar 18 10:43:51 2021 +0530

    metadata_dictionary: Add mapping for CITATION.cff
    
    The implementation is based on output from cff-converter python library.
    Deviation has been made for license and 'organisation.legalName'(replaced with 'organisation.name')

See https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/166/ for more details.

Harbormaster completed remote builds in B20355: Diff 19286.Mar 30 2021, 4:29 PM

Looking good. Last comments:

swh/indexer/metadata_dictionary/cff.py
34–52	could you move this part to a method? (`normalize_authors` iirc)
54–55	isn't this redundant with the crosswalk?
swh/indexer/tests/test_metadata.py
87	typo
118–120	use `b"""` to start the string so you don't need to `.encode()` it
140	you can write the literal `ä` here
147	drop the `" "` (Black can't do it when it folds two lines, unfortunately)

This revision now requires changes to proceed.Mar 30 2021, 4:39 PM

KShivendu marked 3 inline comments as done.Apr 1 2021, 2:53 PM

KShivendu added inline comments.

swh/indexer/metadata_dictionary/cff.py
34–52	I tried but something went wrong. I went in debugging mode to track the flow and found that the `normalize_translation` function made the output to be : "author": [ { "type": "Person" }, { "id": "https://orcid.org/0000-0003-4925-7248", "type": "Person" } ], But the playground gives the correct results in compact form. Have a look at it here
54–55	Yes, it is. But when I put these two fields in `string_fields` the output is 'schema:codeRepository' : '....', 'schema:datePublished' : '....' not 'codeRepository' : '....', 'datePublished' : '....' But this doesn't happen with the other fields. Why is that?
swh/indexer/tests/test_metadata.py
118–120	I did that the first time but it throws an error: `bytes can only contain ASCII literal characters.`

vlorentz added inline comments.Apr 1 2021, 3:38 PM

swh/indexer/metadata_dictionary/cff.py
34–52	It seems to works in `npm.py`. Try to find why it works for NPM but not CFF.
54–55	It's a bug :) I'm working on it

vlorentz added inline comments.Apr 1 2021, 3:57 PM

swh/indexer/metadata_dictionary/cff.py
54–55	Actually it isn't. (Well, it's an arguable design decision in Codemeta, but that's how it is). These fields are defined as: "datePublished": {"@id": "schema:datePublished", "@type": "schema:Date" }, and "codeRepository": { "@id": "schema:codeRepository", "@type": "@id"}, which means "the codemeta term is like the schema.org term, but with an extra constraint on the type". But the JSON-LD document you create does not match these type definitions. So when compacting, it can't be converted to the codemeta term, because of the constraint. So the fix should be to make them satisfy the type constraint. (Hint: the license field is defined the same way)

Updating D5273: swh-indexer : Fix issues related to datePublished and codeRepository fields.
Migrate code related to authors into normalize_authors function
Add .vscode/ in .gitignore to avoid tracking launch.json and other VScode config files

Build is green

Patch application report for D5273 (id=19368)

Rebasing onto 8e046c1900...

Current branch diff-target is up to date.

Changes applied before test

commit 8f1fb0f9310df6174823a149e62662881be82f87
Author: KShivendu <shivendu@iitbhilai.ac.in>
Date:   Thu Mar 18 10:43:51 2021 +0530

    metadata_dictionary: Add mapping for CITATION.cff
    
    The implementation is based on output from cff-converter python library.
    Deviation has been made for license and organisation.legalName(replaced with organisation.name)

See https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/167/ for more details.

Harbormaster completed remote builds in B20425: Diff 19368.Apr 2 2021, 10:51 PM

Great.

If you want to add .vscode to .gitignore, you should do it in its own diff, and for *all* repositories.
Or you can configure it globally so you don't need to commit it, see https://git-scm.com/docs/gitignore

If you want to add .vscode to .gitignore, you should do it in its own diff, and for *all* repositories.

I ran find */.gitignore -type f -exec grep -H '.vscode' {} + in swh-environment and got :

swh-indexer/.gitignore:.vscode/ -> I did this one.
swh-provenance/.gitignore:.vscode
swh-storage/.gitignore:.vscode/
swh-web/.gitignore:.vscode/

So these repositories already have .vscode in their .gitignore. And IMO, creating a different commit just to put .vscode in all the .gitignore repositories doesn't look good. Contributors can add it if required. What's your opinion?
Btw, what editor do you use?

Alright then

Btw, what editor do you use?

Personally, vim. In the rest of the team, it's mostly emacs or VS Code

This revision is now accepted and ready to land.Apr 3 2021, 3:49 PM

Closed by commit rDCIDX8f1fb0f9310d: metadata_dictionary: Add mapping for CITATION.cff (authored by KShivendu). · Explain WhyApr 3 2021, 6:26 PM

This revision was automatically updated to reflect the committed changes.

KShivendu added a commit: rDCIDX8f1fb0f9310d: metadata_dictionary: Add mapping for CITATION.cff.

swh-indexer : Add mapping for CITATION.cff files
ClosedPublic
Actions

Details

Diff Detail

Event Timeline

Patch application report for D5273 (id=18890)

Changes applied before test

Patch application report for D5273 (id=18891)

Changes applied before test

Patch application report for D5273 (id=18955)

Changes applied before test

Patch application report for D5273 (id=19072)

Changes applied before test

Patch application report for D5273 (id=19073)

Changes applied before test

Patch application report for D5273 (id=19286)

Changes applied before test

Patch application report for D5273 (id=19368)

Changes applied before test

Revision Contents
Changeset List

Diff 19369

.gitignore

swh/indexer/metadata_dictionary/init.py

swh/indexer/metadata_dictionary/cff.py

swh/indexer/storage/init.py

swh/indexer/tests/storage/test_storage.py

swh/indexer/tests/test_cli.py

swh/indexer/tests/test_metadata.py

swh-indexer : Add mapping for CITATION.cff filesClosedPublicActions

Details

Diff Detail

Event Timeline

Patch application report for D5273 (id=18890)

Changes applied before test

Patch application report for D5273 (id=18891)

Changes applied before test

Patch application report for D5273 (id=18955)

Changes applied before test

Patch application report for D5273 (id=19072)

Changes applied before test

Patch application report for D5273 (id=19073)

Changes applied before test

Patch application report for D5273 (id=19286)

Changes applied before test

Patch application report for D5273 (id=19368)

Changes applied before test

Revision ContentsChangeset List

Diff 19369

.gitignore

swh/indexer/metadata_dictionary/__init__.py

swh/indexer/metadata_dictionary/cff.py

swh/indexer/storage/__init__.py

swh/indexer/tests/storage/test_storage.py

swh/indexer/tests/test_cli.py

swh/indexer/tests/test_metadata.py

swh-indexer : Add mapping for CITATION.cff files
ClosedPublic
Actions

Revision Contents
Changeset List

swh/indexer/metadata_dictionary/init.py

swh/indexer/storage/init.py