
Review the deposit of CodeMeta metadata in xml (following SWORD V2 specs)
Closed, Migrated

Description

At the moment, deposited metadata is in XML format, using a mix of Atom and CodeMeta, as detailed in the docs:
https://docs.softwareheritage.org/devel/swh-deposit/metadata.html

There is a discrepancy, because CodeMeta vocabulary is only defined in JSON-LD and not in an XML schema.

I have discussed with @vlorentz how to improve and simplify the situation.

We have 2 options:

  1. translate the JSON-LD to an XML schema
  2. add a metadata-blob or codemeta property to the Atom XML, in which a codemeta.json file can be transferred as is, in raw form.

The second option is easier to handle and might also be more appropriate to our new scenario with depositing only metadata.
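
To make the second option concrete, here is a minimal sketch of wrapping a codemeta.json document in an Atom entry and reading it back. The `metadata-blob` tag name comes from the proposal above; the namespace URI, the `type` attribute and both function names are assumptions made for illustration, not an agreed format.

```python
import json
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"
# Hypothetical namespace: none has been defined for the blob tag yet.
BLOB_NS = "https://example.org/swh-deposit/metadata-blob"


def wrap_codemeta(codemeta: dict, title: str) -> bytes:
    """Build an Atom entry carrying the CodeMeta document as one text blob."""
    entry = ET.Element(f"{{{ATOM_NS}}}entry")
    ET.SubElement(entry, f"{{{ATOM_NS}}}title").text = title
    blob = ET.SubElement(entry, f"{{{BLOB_NS}}}metadata-blob")
    blob.set("type", "application/ld+json")  # assumed attribute
    # ElementTree escapes '<' and '&' in the JSON text when serializing.
    blob.text = json.dumps(codemeta)
    return ET.tostring(entry, encoding="utf-8")


def unwrap_codemeta(atom_xml: bytes) -> dict:
    """Extract and parse the CodeMeta blob from an Atom entry."""
    entry = ET.fromstring(atom_xml)
    blob = entry.find(f"{{{BLOB_NS}}}metadata-blob")
    if blob is None or blob.text is None:
        raise ValueError("no metadata-blob element in entry")
    return json.loads(blob.text)
```

With this shape the client never has to map CodeMeta terms to XML tags; `@context`, `@id` and `@type` travel unchanged inside the JSON.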

This task could become high priority, because we now have the IPOL journal as a new SWORD client. They are very eager to deposit published software artifacts.
So far I have explained to them how to create the XML, but I'm not satisfied with its unnecessary complexity.

Also, HAL is planning on creating a codemeta.json export during the sprint (if it takes place), which is exactly the metadata-blob we want.

Task goal: choose an option and evaluate next steps for SWH deposit

Event Timeline

moranegg triaged this task as Normal priority. (Mar 12 2020, 3:56 PM)
moranegg created this task.
moranegg renamed this task from "Update the deposit of metadata with a regular zip deposit" to "Review the deposit of CodeMeta metadata in xml (following SWORD V2 specs)". (Apr 21 2020, 11:46 AM)
moranegg claimed this task.
moranegg updated the task description.
moranegg added a project: SWORD deposit.

Options I see:

  1. keep the current format

pros: simple, in the spirit of the SWORDv2 spec (even though we use the schema.org/codemeta vocabulary instead of DublinCore)
cons: informally defined, no obvious way to encode @id, @type, etc., so we have to figure out a way to encode them (probably by defining our own tag or attribute)

  2. add a new tag metadata-blob or codemeta which contains a dump of the JSON, as mentioned above

pros: easy
cons: requires us to define a namespace, and some parsers (e.g. xmltodict) don't preserve whitespace, so it would corrupt the data

  3. embed RDF/XML in the atom entry

pros: "the right way to do it"
cons: complex, no direct way to translate it to JSON-LD afaik

  4. add a multipart item to deposits

currently, deposits can have up to two parts: the content of the deposit (application/zip) and the atom entry which also contains metadata (application/atom+xml). We could add another possible part (application/ld+json) which contains the metadata.

pros: super easy
cons: the atom entry and the json-ld entry now serve roughly the same purpose but in slightly different formats, so it's weird

  5. switch to SWORDv3 and use BagIt

pros: another "right way to do it"
cons: a lot of work
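
For the multipart option (an extra application/ld+json part next to the zip and the atom entry), a rough sketch of the body layout using Python's standard `email` package. It only illustrates the structure of the parts, not how swh-deposit builds or parses requests. The `payload` and `atom` part names follow the SWORDv2 multipart convention; the `metadata` part name and the function name are assumptions.

```python
import json
from email.mime.application import MIMEApplication
from email.mime.multipart import MIMEMultipart


def build_deposit_body(archive: bytes, atom_xml: bytes, codemeta: dict) -> MIMEMultipart:
    """Assemble a multipart/related body with a third, JSON-LD part."""
    body = MIMEMultipart("related")
    parts = [
        ("zip", archive, "payload"),
        ("atom+xml", atom_xml, "atom"),
        # Hypothetical third part carrying the raw CodeMeta document.
        ("ld+json", json.dumps(codemeta).encode("utf-8"), "metadata"),
    ]
    for subtype, data, name in parts:
        part = MIMEApplication(data, subtype)  # base64-encoded by default
        part.add_header("Content-Disposition", "attachment", name=name)
        body.attach(part)
    return body
```

A server would then dispatch on each part's content type, which is also where the overlap between the atom part and the JSON-LD part noted in the cons shows up.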

DublinCore doesn't have enough properties to meet our requirements for describing software.

also pros for 1:
already in production with HAL and IPOL

Concerning @id and @type I was looking at this tool:
http://rdf-translator.appspot.com/
It might give us ideas.

Anyway, here is the link to the AtomPub documentation (to further my investigations):
https://www.ibm.com/developerworks/library/x-atompp1/x-atompp1-pdf.pdf

Yeah, the rdf-translator uses custom attributes (in the XHTML namespace, which I guess is a mistake, but that's fixable by creating our own namespace or finding one that already does it)

I think we can resolve this task as we agreed on staying with the xml format for the metadata-only deposit.

The conclusion is: if we stay with SWORD V2, we should accept metadata in an XML syntax, and if we want CodeMeta properties, they should be sent in the XML.
Which means @vlorentz's option 1.
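
As an illustration of what option 1 means on the receiving side, here is a minimal sketch that reads CodeMeta properties written as namespaced XML tags inside an Atom entry and turns them into a JSON-like dict. The sample entry, the `codemeta` namespace URI and the function names are assumptions for the example; the actual deposit parser may work differently.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URI for CodeMeta terms in the Atom entry.
CODEMETA_NS = "https://doi.org/10.5063/SCHEMA/CODEMETA-2.0"

SAMPLE = """\
<entry xmlns="http://www.w3.org/2005/Atom"
       xmlns:codemeta="https://doi.org/10.5063/SCHEMA/CODEMETA-2.0">
  <title>Example software</title>
  <codemeta:name>Example software</codemeta:name>
  <codemeta:programmingLanguage>Python</codemeta:programmingLanguage>
  <codemeta:author>
    <codemeta:name>Jane Doe</codemeta:name>
  </codemeta:author>
</entry>"""


def element_to_value(el):
    """Leaf elements become strings; elements with children become dicts."""
    children = list(el)
    if not children:
        return (el.text or "").strip()
    result = {}
    prefix = f"{{{CODEMETA_NS}}}"
    for child in children:
        if not child.tag.startswith(prefix):
            continue  # skip Atom and other non-CodeMeta tags
        key = child.tag[len(prefix):]
        value = element_to_value(child)
        if key in result:
            # Repeated tags are collected into a list.
            if not isinstance(result[key], list):
                result[key] = [result[key]]
            result[key].append(value)
        else:
            result[key] = value
    return result


def codemeta_from_entry(atom_xml: str) -> dict:
    return element_to_value(ET.fromstring(atom_xml))
```

Note that this mapping has nowhere to put @id or @type, which is the limitation listed in the cons of option 1; encoding them would need an extra tag or attribute convention.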

moranegg moved this task from In progress to Archived on the SWORD deposit board.