diff --git a/docs/extrinsic-metadata-specification.rst b/docs/extrinsic-metadata-specification.rst
new file mode 100644
--- /dev/null
+++ b/docs/extrinsic-metadata-specification.rst
@@ -0,0 +1,161 @@
+.. _extrinsic-metadata-specification:
+
+Extrinsic metadata specification
+================================
+
+:term:`Extrinsic metadata` is information about software that is not part
+of the source code itself but still closely related to the software.
+It is usually available from the web view or the API of a repository's forge,
+or from an external registry.
+
+Since it is not part of the source code, we need a separate mechanism
+to fetch and store it.
+
+This specification assumes the reader is familiar with Software Heritage's
+:ref:`architecture` and :ref:`data-model`.
+
+
+Metadata sources
+----------------
+
+Authorities
+^^^^^^^^^^^
+
+Metadata authorities are moral entities that provide metadata about a
+:term:`origin`. Metadata authorities include: code hosts,
+:term:`deposit clients <deposit client>`, and registries (e.g. Wikidata).
+
+An authority is uniquely defined by these properties:
+
+* its name, representing the software/database from which metadata is
+  extracted (e.g. ``gitlab``, ``wikidata``, ``hal``).
+
+* its URL, which unambiguously identifies an instance of the provider.
+
+Examples:
+
+=============== =================================
+name            url
+=============== =================================
+hal             https://hal.archives-ouvertes.fr/
+swh             https://www.softwareheritage.org/
+gitlab          https://gitlab.com/
+gitlab          https://gitlab.com/
+wikidata        https://www.wikidata.org/
+=============== =================================
+
+Tools
+^^^^^
+
+Metadata fetching tools are software components used to fetch metadata from
+a metadata authority, and ingest it into the Software Heritage archive.
+
+A tool is uniquely defined by these properties:
+
+* its name
+* its version
+
+Examples:
+
+* :term:`loaders <loader>`, which may either discover metadata as a
+  side-effect of loading source code, or be dedicated to fetching metadata.
+
+* :term:`listers <lister>`, which may discover metadata as a side-effect
+  of discovering origins.
+
+* :term:`deposit clients <deposit client>`, which push metadata to SWH from a
+  third party, usually at the same time as a :term:`software artifact`.
+
+* gatherers, which fetch metadata from an authority in a way that is
+  none of the above (e.g. by querying a specific API of the origin's forge).
+
+
+Storage API
+~~~~~~~~~~~
+
+Authorities and tools
+^^^^^^^^^^^^^^^^^^^^^
+
+The :term:`storage` API offers these endpoints to manipulate metadata
+authorities and tools:
+
+* ``metadata_authority_add(name, url, type, metadata)``
+  which adds a new metadata authority to the storage.
+
+* ``metadata_authority_get_by(name, url)``
+  which looks up a known authority (there is at most one) and if it is
+  known, returns a dictionary with keys ``name``, ``url``, and ``metadata``.
+
+* ``metadata_tool_add(name, version, metadata)``
+  which adds a new metadata tool to the storage.
+
+* ``metadata_tool_get_by(name, version)``
+  which looks up a known tool (there is at most one) and if it is
+  known, returns a dictionary with keys ``name``, ``version``, and ``metadata``.
+
+These ``metadata`` fields contain arbitrary JSON-encodable dictionaries
+with information about the authority/tool, in a format specific to each
+authority/tool.
+These fields are reserved for future use; currently they should always be
+empty.
+
+Origin metadata storage
+-----------------------
+
+Extrinsic metadata is stored in SWH's :term:`storage database`.
+The storage API offers three endpoints to manipulate origin metadata:
+
+* Adding metadata::
+
+      origin_metadata_add(origin_url, discovery_date,
+                          authority_name, authority_url,
+                          tool_name, tool_version,
+                          metadata)
+
+  which adds a new ``metadata`` byte string obtained from a given authority
+  and associated to the origin.
+  The authority and tool must be known to the storage before using this
+  endpoint.
+
+* Getting latest metadata::
+
+      origin_metadata_get_latest(origin_url,
+                                 authority_name, authority_url)
+
+  which returns a dictionary::
+
+      {
+        'authority': {'name': ..., 'url': ...},
+        'tool': {'name': ..., 'version': ...},
+        'discovery_date': ...,
+        'metadata': b'...'
+      }
+
+  corresponding to the latest entry added for this origin by this authority.
+
+* Getting all metadata::
+
+      origin_metadata_get(origin_url,
+                          authority_name, authority_url,
+                          after, limit)
+
+  which returns a list of dictionaries::
+
+      {
+        'authority': {'name': ..., 'url': ...},
+        'tool': {'name': ..., 'version': ...},
+        'discovery_date': ...,
+        'metadata': b'...'
+      }
+
+  one for each metadata item deposited, corresponding to the given origin
+  and obtained from the specified authority.
+
+The parameters ``after`` and ``limit`` are used for pagination based on the
+order defined by the ``discovery_date``.
+
+``metadata`` is a bytes array (possibly encoded using Base64).
+Its format is specific to each authority, and it is treated as an opaque value
+by the storage.
+Unifying these various formats into a common language is outside the scope
+of this specification.
diff --git a/docs/index.rst b/docs/index.rst
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -30,6 +30,11 @@
 
 * :ref:`archive-copies`
 
+Specifications
+--------------
+
+* :ref:`extrinsic-metadata-specification`
+
 Reference Documentation
 -----------------------
 