diff --git a/docs/metadata-workflow.rst b/docs/metadata-workflow.rst --- a/docs/metadata-workflow.rst +++ b/docs/metadata-workflow.rst @@ -119,3 +119,86 @@ .. program-output:: python3 -m swh.indexer.cli mapping list-terms --exclude-mapping codemeta :nostderr: + + +Adding support for additional ecosystem-specific metadata +--------------------------------------------------------- + +This section will guide you through adding code to the metadata indexer to +detect and translate new metadata formats. + +First, you should start by picking one of the `CodeMeta crosswalks`_. +Then create a new file in `swh-indexer/swh/indexer/metadata_dictionary/`, that +will contain your code, and create a new class that inherits from helper +classes, with some documentation about your indexer: + +.. code-block:: python + + from .base import DictMapping, SingleFileMapping + from swh.indexer.codemeta import CROSSWALK_TABLE + + class MyMapping(DictMapping, SingleFileMapping): + """Dedicated class for ...""" + name = 'my-mapping' + filename = b'the-filename' + mapping = CROSSWALK_TABLE['Name of the CodeMeta crosswalk'] + +.. _CodeMeta crosswalks: https://github.com/codemeta/codemeta/tree/master/crosswalks + +Then, add a `string_fields` attribute, that is the list of all keys whose +values are simple text values. For instance, to +`translate Python PKG-INFO`_, it's: + +.. code-block:: python + + string_fields = ['name', 'version', 'description', 'summary', + 'author', 'author-email'] + +.. _translate Python PKG-INFO: https://forge.softwareheritage.org/source/swh-indexer/browse/master/swh/indexer/metadata_dictionary/python.py + +Last step to get your code working: add a `translate` method that will +take a single byte string as argument, turn it into a Python dictionary, +whose keys are the ones of the input document, and pass it to +`_translate_dict`. + +For instance, if the input document is in JSON, it can be as simple as: + +.. 
code-block:: python + + def translate(self, raw_content): + raw_content = raw_content.decode() # bytes to str + content_dict = json.loads(raw_content) # str to dict + return self._translate_dict(content_dict) # convert to CodeMeta + +`_translate_dict` will do the heavy work of reading the crosswalk table for +each of `string_fields`, read the corresponding value in the `content_dict`, +and build a CodeMeta dictionary with the corresponding names from the +crosswalk table. + +One last thing to run your code: add it to the list in +`swh-indexer/swh/indexer/metadata_dictionary/__init__.py`, so the rest of the +code is aware of it. + +Now, you can run it: + +.. code-block:: shell + + python3 -m swh.indexer.metadata_dictionary MyMapping path/to/input/file + +and it will (hopefully) return a CodeMeta object. + +If it works, well done! + +You can now improve your translation code further, by adding methods that +will do more advanced conversion. For example, if there is a field named +`license` containing an SPDX identifier, you must convert it to a URI, +like this: + +.. code-block:: python + + def normalize_license(self, s): + if isinstance(s, str): + return {"@id": "https://spdx.org/licenses/" + s} + +This method will automatically get called by `_translate_dict` when it +finds a `license` field in `content_dict`.
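
To make the dispatch described above more concrete, here is a minimal,
self-contained sketch of how a `_translate_dict`-style method could look up
each input key in a crosswalk table and call a matching `normalize_<key>`
method when one exists. It is an illustration under stated assumptions, not
the actual swh-indexer implementation: the `SketchMapping` class, its tiny
`mapping` table, and the sample input are all hypothetical.

```python
import json


class SketchMapping:
    """Hypothetical stand-in for a mapping class; not the swh-indexer code."""

    # Assumed crosswalk: input key -> CodeMeta term (illustrative values only).
    mapping = {"name": "name", "version": "version", "license": "license"}
    # Keys whose values are copied over as plain text.
    string_fields = ["name", "version"]

    def _translate_dict(self, content_dict):
        translated = {}
        for key, value in content_dict.items():
            term = self.mapping.get(key)
            if term is None:
                continue  # no crosswalk entry for this key: skip it
            # Look for an optional normalize_<key> method on the class.
            normalizer = getattr(self, "normalize_" + key.replace("-", "_"), None)
            if normalizer is not None:
                value = normalizer(value)
            elif key not in self.string_fields:
                continue  # neither a plain string field nor a normalized one
            if value is not None:
                translated[term] = value
        return translated

    def normalize_license(self, s):
        # Turn an SPDX identifier into a URI; other types yield None.
        if isinstance(s, str):
            return {"@id": "https://spdx.org/licenses/" + s}

    def translate(self, raw_content):
        # bytes -> str -> dict -> translated dict
        return self._translate_dict(json.loads(raw_content.decode()))


result = SketchMapping().translate(
    b'{"name": "demo", "license": "MIT", "homepage": "x"}'
)
print(result)
```

In this sketch, keys without a crosswalk entry (here `homepage`) are silently
dropped, and a normalizer that returns `None` causes the field to be omitted.
The real helper classes may handle these cases differently.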