
Translate from pom.xml and codemeta.json.
Closed, Public

Authored by vlorentz on Oct 30 2018, 12:07 PM.

Details

Summary

Also slightly changes the output format to provide
a @type and make it more compact.

Resolves T1289.

Test Plan

tox

Diff Detail

Repository
rDCIDX Metadata indexer
Branch
maven-codemeta-translation
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 1994
Build 2410: arc lint + arc unit

Event Timeline

Add README or CITATION to data directory with the following:
Matthew B. Jones, Carl Boettiger, Abby Cabunoc Mayes, Arfon Smith, Peter Slaughter, Kyle Niemeyer, Yolanda Gil, Martin Fenner, Krzysztof Nowak, Mark Hahnel, Luke Coy, Alice Allen, Mercè Crosas, Ashley Sands, Neil Chue Hong, Patricia Cruse, Daniel S. Katz, Carole Goble. 2017. CodeMeta: an exchange schema for software metadata. Version 2.0. KNB Data Repository. doi:10.5063/schema/codemeta-2.0
swh:1:dir:39c509fd2002f9e531fb4b3a321ceb5e6994e54a;origin=https://github.com/codemeta/codemeta

swh/indexer/codemeta.py
65

codemeta-V1 is the older version of codemeta.
The difficulty (not here but globally) is that when we find a codemeta.json file, we need to use the @context attribute to see which version is used,
but this could also cause problems because of examples like this:
"@context": "https://raw.githubusercontent.com/codemeta/codemeta/master/codemeta.jsonld",

So here, the column codemeta-V1 shouldn't be the canonical name for codemeta.
The codemeta vocabulary is in the codemeta.csv table, under the property column,
and to simplify things for now, I think that when we encounter a codemeta.json file, the vocabulary to check should be the one in the property column.
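
To illustrate the version problem above, here is a minimal sketch of reading the version from @context. The helper name `guess_codemeta_version` and the regular expression are assumptions for illustration, not code from swh-indexer. It returns None for contexts like the raw.githubusercontent.com one, where the URL carries no version.

```python
import re

# Hypothetical helper: matches URLs ending in "codemeta-<version>",
# such as the DOI form https://doi.org/10.5063/schema/codemeta-2.0
_VERSION_RE = re.compile(r"codemeta-(\d+(?:\.\d+)*)/?$")


def guess_codemeta_version(document):
    """Return the version found in @context, or None if it cannot be told."""
    context = document.get("@context")
    # @context may be a single value or a list of values.
    candidates = context if isinstance(context, list) else [context]
    for candidate in candidates:
        if isinstance(candidate, str):
            match = _VERSION_RE.search(candidate)
            if match:
                return match.group(1)
    # Inline contexts and unversioned URLs end up here.
    return None
```

A None result would then have to fall back on some other rule, for example checking the terms against the property column.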

vlorentz added inline comments.
swh/indexer/codemeta.py
65

Yeah I understood that later, there's a fix in D620. Adding support for v1 remains to be done though

moranegg requested changes to this revision. Nov 5 2018, 1:16 PM

There is no test case for a revision with multiple 'metadata files', which is an intriguing case; this should be tested before accepting this diff.

Global comments that are not parts of this diff:

  1. In the revision_metadata output there isn't a list of the file ids used for the translation; do we want to keep it?
  2. Before continuing to the next vocabulary, I would suggest adding the AUTHOR/AUTHORS/CONTRIBUTORS files to the detected files list, so their content can be kept in the authors property.
  3. Other ideas would be:
    • detecting LICENSE/COPYING/... and using the result of the fossology_license indexer
    • detecting README and using a fixed portion of it in the description property
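
A rough sketch of what the extra detection in points 2 and 3 could look like. The `CANDIDATE_FILES` table and the `detect_auxiliary_files` function are hypothetical names for illustration only; nothing like this exists in the diff.

```python
# Hypothetical mapping from a CodeMeta property to the file names
# (without extension, upper-cased) that could feed it.
CANDIDATE_FILES = {
    "author": {"AUTHOR", "AUTHORS", "CONTRIBUTORS"},
    "license": {"LICENSE", "COPYING"},
    "description": {"README"},
}


def detect_auxiliary_files(filenames):
    """Group the given file names by the property they could populate."""
    detected = {}
    for filename in filenames:
        # "README.md" -> "README"; the comparison ignores case.
        stem = filename.split(".", 1)[0].upper()
        for prop, names in CANDIDATE_FILES.items():
            if stem in names:
                detected.setdefault(prop, []).append(filename)
    return detected
```

This only finds the files; turning their content into property values (for example through the fossology_license indexer) is a separate step.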
swh/indexer/metadata_detector.py
59

I see it does become a list when multiple values are given for the same term.
Would love to see it tested :-)

swh/indexer/tests/test_metadata.py
272

I kept a property called other in the content_metadata to group all the metadata I wasn't able to translate to CodeMeta.
I see that you deleted this property. Was it a problem for the type of output, i.e. keeping a codemeta.json output?

307

When indexing a metadata file with a wrong @context, even one with the same URL but with 1.0 at the end,
it fails with the following error:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/pyld/jsonld.py", line 4308, in _retrieve_context_urls
    remote_doc = load_document(url)
  File "/home/morane/Documents/code/swh-environment/swh-indexer/swh/indexer/codemeta.py", line 108, in _document_loader
    raise Exception(url)
Exception: https://doi.org/10.5063/schema/codemeta-1.0

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/pyld/jsonld.py", line 800, in expand
    input_, {}, options['documentLoader'], options['base'])
  File "/usr/lib/python3/dist-packages/pyld/jsonld.py", line 4315, in _retrieve_context_urls
    code='loading remote context failed', cause=cause)
pyld.jsonld.JsonLdError: <exception str() failed>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "metadata_dictionary.py", line 345, in <module>
    main()
  File "metadata_dictionary.py", line 335, in main
    result = MAPPINGS["CodemetaMapping"].translate(raw_content)
  File "metadata_dictionary.py", line 226, in translate
    return self.normalize_translation(expand(json.loads(content.decode())))
  File "/home/morane/Documents/code/swh-environment/swh-indexer/swh/indexer/codemeta.py", line 120, in expand
    options={'documentLoader': _document_loader})
  File "/usr/lib/python3/dist-packages/pyld/jsonld.py", line 171, in expand
    return JsonLdProcessor().expand(input_, options)
  File "/usr/lib/python3/dist-packages/pyld/jsonld.py", line 804, in expand
    'jsonld.ExpandError', cause=cause)
pyld.jsonld.JsonLdError: <exception str() failed>

This is an observation, I'm not saying we should fix it, but the DOI URL might not be used in all the codemeta.json files we find.

461

I'm not sure where to write this comment, but when you have multiple detected files, what happens with the translated_metadata output?
does each property become a list?

swh/indexer/tests/test_origin_metadata.py
106

Why is the context name here in the property?
It looks like it's only a mock test issue, but if so, you should change the tests to reflect exactly the output you are looking for.
Also, it should be consistent (across all properties).

swh/indexer/tests/test_utils.py
308

Here again the context is in the property, why?
is it because this property is inherited from schema.org?
or is it because it was already there in the content_get from the MockStorage?

This revision now requires changes to proceed. Nov 5 2018, 1:16 PM
vlorentz added inline comments.
swh/indexer/metadata_detector.py
59

That's the same behavior as before; it's already tested in test_extract_minimal_metadata_dict.

swh/indexer/tests/test_metadata.py
272

It's just that it's not defined by codemeta's schema definition, so jsonld.compact drops it. I could add a new property with an absolute URI, though.

307

Unfortunately, it's either that or pulling untrusted schemas from the internet :/
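
For context, a minimal sketch of that trade-off: a document loader that only serves contexts bundled locally and refuses everything else. The names `BUNDLED_CONTEXTS` and `offline_document_loader` are assumptions, and the context body is a tiny stand-in rather than the real CodeMeta context. The returned dict follows the contextUrl/documentUrl/document shape that PyLD document loaders use, as far as I know.

```python
CODEMETA_V2_URL = "https://doi.org/10.5063/schema/codemeta-2.0"

# Hypothetical table of contexts shipped with the code, keyed by URL.
# The body below is a stand-in, not the actual CodeMeta 2.0 context.
BUNDLED_CONTEXTS = {
    CODEMETA_V2_URL: {
        "@context": {"schema": "http://schema.org/", "name": "schema:name"},
    },
}


def offline_document_loader(url, options=None):
    """Serve a bundled context; never fetch anything from the network."""
    try:
        document = BUNDLED_CONTEXTS[url]
    except KeyError:
        # An unknown URL is an error instead of a download.
        raise ValueError("no bundled context for %s" % url) from None
    return {"contextUrl": None, "documentUrl": url, "document": document}
```

Such a function would be passed through the documentLoader option, as in the traceback above. Supporting codemeta-1.0 would then mean bundling that context too, not loosening the loader.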

461

does each property become a list?

Yes. Actually, they are all lists at the beginning, but JSON-LD compaction reduces them to their element.
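
A simplified illustration of that behaviour, using plain dicts and no JSON-LD library. `merge_translations` and `unwrap_singletons` are hypothetical names; real JSON-LD compaction does much more (term shortening, @list and @set containers), so this only mimics the list handling.

```python
from collections import defaultdict


def merge_translations(translations):
    """Merge several translated files; every term maps to a list of values."""
    merged = defaultdict(list)
    for translation in translations:
        for term, value in translation.items():
            values = value if isinstance(value, list) else [value]
            for item in values:
                # Skip values already seen for this term.
                if item not in merged[term]:
                    merged[term].append(item)
    return dict(merged)


def unwrap_singletons(merged):
    """Reduce one-element lists to their element, as compaction would."""
    return {
        term: values[0] if len(values) == 1 else values
        for term, values in merged.items()
    }
```

With two files that both define name but only one that defines license, name stays a list and license is reduced to a single value.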

swh/indexer/tests/test_origin_metadata.py
106

That's because codemeta defined it to be the @id: "issueTracker": { "@id":"codemeta:issueTracker", "@type": "@id" },

swh/indexer/tests/test_utils.py
308

is it because this property is inherited from schema.org?

On the contrary, it's because codemeta:author is not the same as schema:author (because they defined it as "author": { "@id": "schema:author", "@container": "@list" }, instead of "author": { "@id": "schema:author" },), so referring only to author is ambiguous.

This revision is now accepted and ready to land. Nov 5 2018, 5:30 PM
This revision was automatically updated to reflect the committed changes.