.. docs/metadata-workflow.rst

Metadata workflow
=================

Intrinsic metadata
------------------

Indexing :term:`intrinsic metadata` requires extracting information from the
lowest levels of the :ref:`Merkle DAG <swh-merkle-dag>` (directories, files,
and content blobs) and associating it with the highest ones (origins).

In order to deduplicate the work between origins, we split this work between
multiple indexers, which coordinate with each other and save their results
at each step in the indexer storage.

Indexer architecture
^^^^^^^^^^^^^^^^^^^^

.. thumbnail:: images/tasks-metadata-indexers.svg

Origin-Head Indexer
^^^^^^^^^^^^^^^^^^^

First, the Origin-Head indexer gets called externally, with an origin as
argument (or multiple origins, which are handled sequentially).
For now, its tasks are scheduled manually via recurring Scheduler tasks; but
in the near future, the :term:`journal` will be used to do that.

It first looks up the last :term:`snapshot` and determines what the main
branch of the origin is (the "Head branch") and what revision it points to
(the "Head").
Intrinsic metadata for that origin will be extracted from that revision.

It schedules a Directory Metadata Indexer task for the root directory of
that revision.

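
The head lookup described above can be sketched as follows. This is only an
illustration, not the indexer's actual code: the simplified snapshot structure,
the ``find_head`` helper, and the list of candidate branch names are all
assumptions made for the example.

.. code-block:: python

    from typing import Dict, Optional, Tuple

    # Hypothetical, simplified snapshot: branch name -> (target type, target).
    Snapshot = Dict[str, Tuple[str, str]]

    # Assumed candidate names for the "Head branch", tried in order.
    CANDIDATE_BRANCHES = ("HEAD", "refs/heads/main", "refs/heads/master")


    def find_head(snapshot: Snapshot, max_hops: int = 10) -> Optional[str]:
        """Return the revision the head branch points to, or None."""
        for name in CANDIDATE_BRANCHES:
            current, hops = name, 0
            # Follow aliases (e.g. HEAD -> refs/heads/main), with a hop limit
            # so that an alias cycle cannot loop forever.
            while current in snapshot and hops < max_hops:
                target_type, target = snapshot[current]
                if target_type == "revision":
                    return target
                if target_type != "alias":
                    break
                current, hops = target, hops + 1
        return None


    snapshot = {
        "HEAD": ("alias", "refs/heads/main"),
        "refs/heads/main": ("revision", "rev-1234"),
        "refs/tags/v1.0": ("release", "rel-1.0"),
    }
    print(find_head(snapshot))

The revision returned here is the one whose root directory would then be
handed over to the Directory Metadata Indexer.
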
Directory and Content Metadata Indexers
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

These two indexers do the hard part of the work. The Directory Metadata
Indexer fetches the root directory associated with a revision, then extracts
the metadata from that directory.

To do so, it lists files in that directory, and looks for known names, such
as :file:`codemeta.json`, :file:`package.json`, or :file:`pom.xml`. If there are any, it
runs the Content Metadata Indexer on them, which in turn fetches their
contents and runs them through extraction dictionaries/mappings.
See below for details.

Their results are saved in a database (the indexer storage), associated with
the content and directory hashes.

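
As a rough sketch of the file detection step, assuming directory entries are
plain dictionaries with ``name``, ``type`` and ``target`` keys (a
simplification chosen for this example) and a hypothetical ``KNOWN_FILES``
table:

.. code-block:: python

    from typing import Dict, List

    # Hypothetical table: known file name -> name of the mapping handling it.
    KNOWN_FILES = {
        "codemeta.json": "codemeta",
        "package.json": "npm",
        "pom.xml": "maven",
    }


    def detect_metadata_files(entries: List[dict]) -> Dict[str, List[str]]:
        """Group the content ids of recognized files by mapping name."""
        detected: Dict[str, List[str]] = {}
        for entry in entries:
            if entry["type"] != "file":
                continue  # only files of the root directory are considered
            mapping_name = KNOWN_FILES.get(entry["name"])
            if mapping_name is not None:
                detected.setdefault(mapping_name, []).append(entry["target"])
        return detected


    root_entries = [
        {"name": "README.md", "type": "file", "target": "content-1"},
        {"name": "package.json", "type": "file", "target": "content-2"},
        {"name": "src", "type": "dir", "target": "directory-3"},
    ]
    print(detect_metadata_files(root_entries))

Each detected content id would then be passed to the Content Metadata
Indexer together with the mapping to apply.
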
Origin Metadata Indexer
^^^^^^^^^^^^^^^^^^^^^^^

The job of this indexer is very simple: it takes an origin identifier and
uses the Origin-Head and Directory indexers to get metadata from the head
directory of an origin, and copies the metadata of the former to a new table,
to associate it with the latter.

The reason for this is to be able to perform searches on metadata, and
efficiently find out which origins matched the pattern.
Running that search on the ``directory_metadata`` table would require
a reverse lookup from directories to origins, which is costly.

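
The benefit of the copy can be illustrated with in-memory dictionaries
standing in for the tables. The names ``origin_head_directory``,
``build_origin_metadata`` and ``search_origins`` are made up for this sketch
and do not reflect the indexer storage API.

.. code-block:: python

    from typing import Dict, List

    # Stand-in for the directory_metadata table: directory id -> metadata.
    directory_metadata = {"dir-aaa": {"name": "foo", "license": "MIT"}}

    # Hypothetical result of the Origin-Head step: origin URL -> head directory.
    origin_head_directory = {
        "https://example.org/foo.git": "dir-aaa",
        "https://example.org/foo-fork.git": "dir-aaa",
    }


    def build_origin_metadata(
        heads: Dict[str, str], dir_metadata: Dict[str, dict]
    ) -> Dict[str, dict]:
        """Copy each head directory's metadata under the origin's own key."""
        return {
            origin: dir_metadata[directory]
            for origin, directory in heads.items()
            if directory in dir_metadata
        }


    def search_origins(origin_metadata: Dict[str, dict], keyword: str) -> List[str]:
        """Return the origins whose metadata values contain the keyword."""
        keyword = keyword.lower()
        return sorted(
            origin
            for origin, metadata in origin_metadata.items()
            if any(keyword in str(value).lower() for value in metadata.values())
        )


    origin_metadata = build_origin_metadata(origin_head_directory, directory_metadata)
    print(search_origins(origin_metadata, "foo"))

Because the search runs directly on origin-keyed entries, matching origins
are found without walking back from directories to origins.
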
Translation from ecosystem-specific metadata to CodeMeta
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Intrinsic metadata is extracted from files provided with a project's source
code, and translated using `CodeMeta`_'s `crosswalk table`_.

All input formats supported so far are straightforward dictionaries (e.g. JSON)
or can be accessed as such (e.g. XML); and the first part of the translation is
to map their keys to a term in the CodeMeta vocabulary.
This is done by parsing the crosswalk table's `CSV file`_ and using it as a
map between these two vocabularies; and this does not require any
format-specific code in the indexers.

The second part is to normalize values. As ecosystem-specific metadata files
each have their way(s) of formatting these values, we need to turn them into
the data type required by CodeMeta.
This normalization makes up most of the code of
:py:mod:`swh.indexer.metadata_dictionary`.

.. _CodeMeta: https://codemeta.github.io/
.. _crosswalk table: https://codemeta.github.io/crosswalk/
.. _CSV file: https://github.com/codemeta/codemeta/blob/master/crosswalk.csv

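
The first part of the translation (key mapping) can be sketched with the
standard :py:mod:`csv` module. The excerpt below is a tiny hand-written table
in the spirit of the crosswalk CSV, reduced to two columns for illustration;
the real file has many more rows and columns, and the ``load_mapping`` and
``translate`` helpers are hypothetical.

.. code-block:: python

    import csv
    import io
    from typing import Dict

    # Hand-written, illustrative excerpt (not the real crosswalk file).
    CROSSWALK_EXCERPT = "\n".join(
        [
            "Property,NodeJS",
            "name,name",
            "version,version",
            "codeRepository,repository",
            "issueTracker,bugs",
            "softwareHelp,",
        ]
    )


    def load_mapping(csv_text: str, source_column: str) -> Dict[str, str]:
        """Build a map from ecosystem-specific keys to CodeMeta terms."""
        reader = csv.DictReader(io.StringIO(csv_text))
        return {
            row[source_column]: row["Property"]
            for row in reader
            if row[source_column]  # skip terms with no equivalent
        }


    def translate(content: dict, mapping: Dict[str, str]) -> dict:
        """Rename known keys to CodeMeta terms; unknown keys are dropped."""
        return {mapping[key]: value for key, value in content.items() if key in mapping}


    mapping = load_mapping(CROSSWALK_EXCERPT, "NodeJS")
    package_json = {"name": "left-pad", "bugs": "https://example.org/issues", "private": True}
    print(translate(package_json, mapping))

Value normalization, the second part, is deliberately left out of this
sketch since it is specific to each format.
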
Extrinsic metadata
------------------

The :term:`extrinsic metadata` indexer works very differently from
the :term:`intrinsic metadata` indexers we saw above.
While the latter extract metadata from software artefacts (files and directories)
which are already a core part of the archive, the former extracts such data from
API calls pulled from forges and package managers, or pushed via the
:ref:`SWORD deposit <swh-deposit>`.

In order to preserve original information verbatim, Software Heritage itself
stores the result of these calls, independently of indexers, in its own archive
as described in the :ref:`extrinsic-metadata-specification`.
In this section, we assume this information is already present in the archive,
but in the "raw extrinsic metadata" form, which needs to be translated to a common
vocabulary to be useful, as with intrinsic metadata.
The common vocabulary we chose is JSON-LD, with both CodeMeta and
`ForgeFed's vocabulary`_ (including `ActivityStream's vocabulary`_).

.. _ForgeFed's vocabulary: https://forgefed.org/vocabulary.html
.. _ActivityStream's vocabulary: https://www.w3.org/TR/activitystreams-vocabulary/

Instead of the four-step architecture above, the extrinsic-metadata indexer
is standalone: it reads "raw extrinsic metadata" from the :ref:`swh-journal`,
and produces new indexed entries in the database as they come.

The caveat is that, while intrinsic metadata is always unambiguously authoritative
(it is contained in its own origin repository, therefore it was added by
the origin's "owners"), extrinsic metadata can be authored by third parties.
Support for third-party authorities is currently not implemented for this reason;
so extrinsic metadata is only indexed when provided by the same
forge/package-repository as the origin the metadata is about.
Metadata on non-origin objects (typically, directories) is also ignored for
this reason, for now.

Assuming the metadata was provided by such an authority, it is then passed
to metadata mappings, which are identified by the MIME type (or custom format
name) they declare rather than by filename.

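
The two checks described above (authority, then format) might look like the
following sketch. Comparing host names is a simplification of "same forge as
the origin", and the format string, ``translate_github`` and
``index_raw_extrinsic_metadata`` are assumptions made for this example rather
than the indexer's real interface.

.. code-block:: python

    import json
    from typing import Any, Callable, Dict, Optional
    from urllib.parse import urlsplit


    def translate_github(raw: bytes) -> Dict[str, Any]:
        """Hypothetical mapping: keep two fields of a repository API response."""
        data = json.loads(raw)
        return {"name": data.get("name"), "description": data.get("description")}


    # Mappings are looked up by the format they declare, not by file name.
    # The format name used here is an assumption.
    MAPPINGS_BY_FORMAT: Dict[str, Callable[[bytes], Dict[str, Any]]] = {
        "application/vnd.github.v3+json": translate_github,
    }


    def index_raw_extrinsic_metadata(
        origin_url: str, authority_url: str, metadata_format: str, raw: bytes
    ) -> Optional[Dict[str, Any]]:
        """Translate raw metadata, or return None when it must be ignored."""
        # Third-party authorities are not supported: skip anything that was
        # not provided by the forge hosting the origin.
        if urlsplit(origin_url).netloc != urlsplit(authority_url).netloc:
            return None
        mapping = MAPPINGS_BY_FORMAT.get(metadata_format)
        if mapping is None:
            return None  # no mapping declared this format
        return mapping(raw)


    raw = json.dumps({"name": "foo", "description": "A demo", "forks": 3}).encode()
    print(
        index_raw_extrinsic_metadata(
            "https://github.com/example/foo",
            "https://github.com",
            "application/vnd.github.v3+json",
            raw,
        )
    )
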
Implementation status
---------------------

Supported intrinsic metadata
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The following sources of intrinsic metadata are supported:

* CodeMeta's `codemeta.json`_,
* Maven's `pom.xml`_,
* NPM's `package.json`_,
* Python's `PKG-INFO`_,
* Ruby's `.gemspec`_

.. _codemeta.json: https://codemeta.github.io/terms/
.. _pom.xml: https://maven.apache.org/pom.html
.. _package.json: https://docs.npmjs.com/files/package.json
.. _PKG-INFO: https://www.python.org/dev/peps/pep-0314/
.. _.gemspec: https://guides.rubygems.org/specification-reference/

Supported extrinsic metadata
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The following sources of extrinsic metadata are supported:

* GitHub's `"repo" API <https://docs.github.com/en/rest/repos/repos#get-a-repository>`__

Supported CodeMeta terms
^^^^^^^^^^^^^^^^^^^^^^^^

The following terms may be found in the output of the metadata translation
(other than the `codemeta` mapping, which is the identity function, and
therefore supports all terms):

.. program-output:: python3 -m swh.indexer.cli mapping list-terms --exclude-mapping codemeta
    :nostderr:

Tutorials
---------

The rest of this page is made of two tutorials: one to index
:term:`intrinsic metadata` (i.e. from a file in a VCS or in a tarball),
and one to index :term:`extrinsic metadata` (i.e. obtained via external means,
such as GitHub's or GitLab's APIs).

Adding support for additional ecosystem-specific intrinsic metadata
-------------------------------------------------------------------

This section will guide you through adding code to the metadata indexer to
detect and translate new metadata formats.

First, you should start by picking one of the `CodeMeta crosswalks`_.
Then create a new file in :file:`swh-indexer/swh/indexer/metadata_dictionary/`, that
will contain your code, and create a new class that inherits from helper
classes, with some documentation about your indexer:

[...]

.. code-block:: python

    def normalize_license(self, s):
        if isinstance(s, str):
            return {"@id": "https://spdx.org/licenses/" + s}

This method will automatically get called by ``_translate_dict`` when it
finds a ``license`` field in ``content_dict``.

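
To make the dispatch concrete, here is a simplified stand-in for that
mechanism. It is not the actual ``_translate_dict`` implementation: the
``SketchMapping`` class, its ``mapping`` table and the ``translate_dict``
method are hypothetical, and only show how a ``normalize_<key>`` method could
be found by name.

.. code-block:: python

    class SketchMapping:
        """Toy mapping illustrating name-based dispatch to normalizers."""

        # Hypothetical table: source key -> CodeMeta term.
        mapping = {"name": "name", "license": "license"}

        def translate_dict(self, content_dict):
            result = {}
            for key, value in content_dict.items():
                term = self.mapping.get(key)
                if term is None:
                    continue  # key unknown to the mapping table
                # Look for an optional normalize_<key> method on the class.
                normalizer = getattr(self, "normalize_" + key, None)
                if normalizer is not None:
                    value = normalizer(value)
                if value is not None:
                    result[term] = value
            return result

        def normalize_license(self, s):
            if isinstance(s, str):
                return {"@id": "https://spdx.org/licenses/" + s}


    translated = SketchMapping().translate_dict({"name": "foo", "license": "MIT"})
    print(translated)

In this sketch, a normalizer that returns ``None`` (for instance when the
license is not a string) causes the field to be dropped from the output.
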
Adding support for additional ecosystem-specific extrinsic metadata
-------------------------------------------------------------------

[this section is a work in progress]