Closes T3078
Details
- Reviewers
vlorentz - Group Reviewers
Reviewers - Maniphest Tasks
- T3078: Index CITATION.cff files
- Commits
- rDCIDX8f1fb0f9310d: metadata_dictionary: Add mapping for CITATION.cff
Diff Detail
- Repository
- rDCIDX Metadata indexer
- Branch
- index-citation-cff
- Lint
No Linters Available - Unit
No Unit Test Coverage - Build Status
Buildable 20035 Build 31104: Phabricator diff pipeline on jenkins Jenkins console · Jenkins Build 31103: arc lint + arc unit
Event Timeline
Build has FAILED
Patch application report for D5273 (id=18890)
Rebasing onto 4831012290...
First, rewinding head to replay your work on top of it... Applying: Add mapping for CITATION.cff
Changes applied before test
commit 61ea8a46b8c8bf8d3c44e7dfaaf64fd60f8dc4c2 Author: KShivendu <shivendu@iitbhilai.ac.in> Date: Thu Mar 18 10:43:51 2021 +0530 Add mapping for CITATION.cff
Link to build: https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/153/
See console output for more information: https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/153/console
Build is green
Patch application report for D5273 (id=18891)
Rebasing onto 4831012290...
First, rewinding head to replay your work on top of it... Applying: Add mapping for CITATION.cff
Changes applied before test
commit 6e88d9b342da3df7f1c78686011a9471056e18dd Author: KShivendu <shivendu@iitbhilai.ac.in> Date: Thu Mar 18 10:43:51 2021 +0530 Add mapping for CITATION.cff
See https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/154/ for more details.
Please give some more details on what problems this diff solves. While adding the
reference task is good practice (thanks), it's not enough.
Details should go both in the commit message and the diff description [1]
Also, please have a look at the coverage. When the build is done, this displays colors
on the code diff-ed, on the right. Blue means covered, oranage not covered.
You should do your best to cover the most cases. Uncovered should be the exception (it
can "justifiably" happen but not that often).
[1] arcanist (arc cli) should take your commit message and uses as default diff
description when you create the diff so you shall not repeat yourself too much.
Please don't update crosswalk.csv directly here, it should only be imported from Codemeta. Make sure you read their CONTRIBUTING.md first.
And you also made changes to the NodeJS mapping, these should go their own diff
Please don't update crosswalk.csv directly here, it should only be imported from Codemeta. Make sure you read their CONTRIBUTING.md first.
I have directly downloaded the crosswalk.csv file from the official repository. You can verify sha1 hashes(after appending an empty line in my file).
And you also made changes to the NodeJS mapping, these should go their own diff
I think I should make a new diff that covers (1) updating this crosswalk file and (2) the changes in the NodeJS mapping test and get it accepted first. Also, I am concerned if this change in the crosswalk file will lead to any kind of incompatibility with previously stored data. What do you think about this @vlorentz ?
Please give some more details on what problems this diff solves. While adding the
reference task is good practice (thanks), it's not enough.
Thank you very much for your feedback. I will keep this in mind.
Oh, ok. (Undo the line break change, though)
And you also made changes to the NodeJS mapping, these should go their own diff
I think I should make a new diff that covers (1) updating this crosswalk file and (2) the changes in the NodeJS mapping test and get it accepted first.
Yes please
Also, I am concerned if this change in the crosswalk file will lead to any kind of incompatibility with previously stored data. What do you think about this @vlorentz ?
processorRequirements -> runtimePlatform is a minor change, it will be fine.
contributor -> contributors is a bug fix, so it won't break anything
Build is green
Patch application report for D5273 (id=18955)
Rebasing onto 8e046c1900...
Current branch diff-target is up to date.
Changes applied before test
commit d46c74a80d0048cccab7e8646545b7ba306d817a Author: KShivendu <shivendu@iitbhilai.ac.in> Date: Thu Mar 18 10:43:51 2021 +0530 Add mapping for CITATION.cff
See https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/161/ for more details.
Hey @vlorentz
I am writing tests for CITATION.cff metadata mapping and I have 2 doubts :
(1) Are CITATION and CITATION.cff the same? (If so, cff-converter isn't able to parse our swh/indexer/data/codemeta/CITATION file maybe because it doesn't look like a YAML file)
(2) cff-converter's output is :
{ "@context": [ "https://doi.org/10.5063/schema/codemeta-2.0", "http://schema.org" ], "@type": "SoftwareSourceCode", "author": [ { "@id": "https://orcid.org/0000-0002-7064-4069", "@type": "Person", "affiliation": { "@type": "Organization", "legalName": "Netherlands eScience Center" }, "familyName": "Spaaks", "givenName": "Jurriaan H." }, { "@type": "Person", "affiliation": { "@type": "Organization", "legalName": "Netherlands eScience Center" }, "familyName": "Klaver", "givenName": "Tom" } ], "codeRepository": "https://github.com/citation-file-format/cff-converter-python", "datePublished": "2019-11-12", "identifier": "https://doi.org/10.5281/zenodo.1162057", "keywords": [ "citation", "bibliography", "cff", "CITATION.cff" ], "license": "http://www.apache.org/licenses/LICENSE-2.0", "name": "cffconvert", "version": "1.4.0-alpha0" }
While in all of our translated outputs (described in test_metadata.py for other mappings)
- I see only one value in @context and it is https://doi.org/10.5063/schema/codemeta-2.0 only (not http://schema.org)
- '@type' is written as 'type' (in the tests)
- author has only 'type', 'name' and 'email'
- 'datePublished' field doesn't exist in any of our tests(but it does in the crosswalk.csv file)
- We use 'https://spdx.org' in the license but cff-convert uses 'http://www.apache.org' in the 'license' field
I am mentioning these to just confirm that it's okay to deviate from the cff-converter format and adjust it to fit with our existing format.
cff-converter claims to convert the CITATION file into codemeta.json format but this seems slightly different and that's why I need your opinion.
No, CITATION is just a common name for plain text files
- I see only one value in @context and it is https://doi.org/10.5063/schema/codemeta-2.0 only (not http://schema.org)
It's because cff-converter outputs some names that are in the http://schema.org namespace but not in https://doi.org/10.5063/schema/codemeta-2.0, such as legalName.
- '@type' is written as 'type' (in the tests)
Semantically equivalent because of this: https://github.com/codemeta/codemeta/blob/master/codemeta.jsonld#L3
- author has only 'type', 'name' and 'email'
- 'datePublished' field doesn't exist in any of our tests(but it does in the crosswalk.csv file)
doesn't mean new mappings can't use other fields. Any valid Codemeta (or even schema.org) is fine.
- We use 'https://spdx.org' in the license but cff-convert uses 'http://www.apache.org' in the 'license' field
That's only for Apache licenses, right?
Either way; cff-convert's output is correct, but we should normalize the license to an SPDX URI so that it's actually queryable.
I am mentioning these to just confirm that it's okay to deviate from the cff-converter format and adjust it to fit with our existing format.
cff-converter claims to convert the CITATION file into codemeta.json format but this seems slightly different and that's why I need your opinion.
Other than the license, cff-converter seems fine.
Although, I can already see a mistake in cff-converter's output, they should use name instead of legalName to translate the affiliation attribute, because nothing in the CFF spec requires it to be a legal name
swh/indexer/metadata_dictionary/__init__.py | ||
---|---|---|
5 | slight comment following your question on citation vs citation.cff:
|
swh/indexer/metadata_dictionary/__init__.py | ||
---|---|---|
5 | indeed! |
Any valid Codemeta (or even schema.org) is fine
So how does a parser decide which one to use for interpretation?
Build has FAILED
Patch application report for D5273 (id=19072)
Rebasing onto 8e046c1900...
Current branch diff-target is up to date.
Changes applied before test
commit 727a9f993e22b444a7f4fef105ad099ede76d474 Author: KShivendu <shivendu@iitbhilai.ac.in> Date: Thu Mar 18 10:43:51 2021 +0530 metadata_dictionary: Add mapping for CITATION.cff The implementation is based on output from cff-converter python library. Deviation is made for license and organisation.legalName(replaced with name)
Link to build: https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/163/
See console output for more information: https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/163/console
Build is green
Patch application report for D5273 (id=19073)
Rebasing onto 8e046c1900...
Current branch diff-target is up to date.
Changes applied before test
commit 2419ee1de062077cdc458616a1c8b63f25b1db62 Author: KShivendu <shivendu@iitbhilai.ac.in> Date: Thu Mar 18 10:43:51 2021 +0530 metadata_dictionary: Add mapping for CITATION.cff The implementation is based on output from cff-converter python library. Deviation is made for license and organisation.legalName(replaced with name)
See https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/164/ for more details.
Any valid Codemeta (or even schema.org) is fine
So how does a parser decide which one to use for interpretation?
Using @context. See https://codemeta.github.io/jsonld/ or https://www.w3.org/TR/json-ld/#the-context for details
And the expected value in your test is not valid, because it uses the schema: prefix, without defining it.
And please keep the examples in test short; for example we don't need all five authors.
And the expected value in your test is not valid, because it uses the schema: prefix, without defining it.
Please correct me if I am wrong but I see codemeta.jsonld defines 'schema' as well.
Also, I won't have to specifically mention 'http://schema.org/' in the @context array.
I tested it and everything seems correct to me. You can check it out using this Permalink and let me know if it is actually correct or not.
Indeed, you're right.
Nonetheless, the properties are also defined by Codemeta (even if mostly are aliases of schema.org), so we should use them
Build is green
Patch application report for D5273 (id=19286)
Rebasing onto 8e046c1900...
Current branch diff-target is up to date.
Changes applied before test
commit 01aac465d321505f37b643225547a65057f0786f Author: KShivendu <shivendu@iitbhilai.ac.in> Date: Thu Mar 18 10:43:51 2021 +0530 metadata_dictionary: Add mapping for CITATION.cff The implementation is based on output from cff-converter python library. Deviation has been made for license and 'organisation.legalName'(replaced with 'organisation.name')
See https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/166/ for more details.
Looking good. Last comments:
swh/indexer/metadata_dictionary/cff.py | ||
---|---|---|
34–52 ↗ | (On Diff #19286) | could you move this part to a method? (normalize_authors iirc) |
54–55 ↗ | (On Diff #19286) | isn't this redundant with the crosswalk? |
swh/indexer/tests/test_metadata.py | ||
88 ↗ | (On Diff #19286) | typo |
119–121 ↗ | (On Diff #19286) | use b""" to start the string so you don't need to .encode() it |
141 ↗ | (On Diff #19286) | you can write the literal ä here |
148 ↗ | (On Diff #19286) | drop the " " (Black can't do it when it folds two lines, unfortunately) |
swh/indexer/metadata_dictionary/cff.py | ||
---|---|---|
34–52 ↗ | (On Diff #19286) | I tried but something went wrong. I went in debugging mode to track the flow and found that the normalize_translation function made the output to be : "author": [ { "type": "Person" }, { "id": "https://orcid.org/0000-0003-4925-7248", "type": "Person" } ], But the playground gives the correct results in compact form. Have a look at it here |
54–55 ↗ | (On Diff #19286) | Yes, it is. But when I put these two fields in string_fields the output is 'schema:codeRepository' : '....', 'schema:datePublished' : '....' not 'codeRepository' : '....', 'datePublished' : '....' But this doesn't happen with the other fields. Why is that? |
swh/indexer/tests/test_metadata.py | ||
119–121 ↗ | (On Diff #19286) | I did that the first time but it throws an error: bytes can only contain ASCII literal characters. |
swh/indexer/metadata_dictionary/cff.py | ||
---|---|---|
54–55 ↗ | (On Diff #19286) | Actually it isn't. (Well, it's an arguable design decision in Codemeta, but that's how it is). These fields are defined as: "datePublished": {"@id": "schema:datePublished", "@type": "schema:Date" }, and "codeRepository": { "@id": "schema:codeRepository", "@type": "@id"}, which means "the codemeta term is like the schema.org term, but with an extra constraint on the type". But the JSON-LD document you create does not match these type definitions. So when compacting, it can't be converted to the codemeta term, because of the constraint. So the fix should be to make them satisfy the type constraint. (Hint: the license field is defined the same way) |
Updating D5273: swh-indexer : Fix issues related to datePublished and codeRepository fields.
Migrate code related to authors into normalize_authors function
Add .vscode/ in .gitignore to avoid tracking launch.json and other VScode config files
Build is green
Patch application report for D5273 (id=19368)
Rebasing onto 8e046c1900...
Current branch diff-target is up to date.
Changes applied before test
commit 8f1fb0f9310df6174823a149e62662881be82f87 Author: KShivendu <shivendu@iitbhilai.ac.in> Date: Thu Mar 18 10:43:51 2021 +0530 metadata_dictionary: Add mapping for CITATION.cff The implementation is based on output from cff-converter python library. Deviation has been made for license and organisation.legalName(replaced with organisation.name)
See https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/167/ for more details.
Great.
If you want to add .vscode to .gitignore, you should do it in its own diff, and for *all* repositories.
Or you can configure it globally so you don't need to commit it, see https://git-scm.com/docs/gitignore
If you want to add .vscode to .gitignore, you should do it in its own diff, and for *all* repositories.
I ran find */.gitignore -type f -exec grep -H '.vscode' {} + in swh-environment and got :
swh-indexer/.gitignore:.vscode/ -> I did this one. swh-provenance/.gitignore:.vscode swh-storage/.gitignore:.vscode/ swh-web/.gitignore:.vscode/
So these repositories already have .vscode in their .gitignore. And IMO, creating a different commit just to put .vscode in all the .gitignore repositories doesn't look good. Contributors can add it if required. What's your opinion?
Btw, what editor do you use?
Alright then
Btw, what editor do you use?
Personally, vim. In the rest of the team, it's mostly emacs or VS Code