diff --git a/docs/extrinsic-metadata-specification.rst b/docs/extrinsic-metadata-specification.rst
new file mode 100644
--- /dev/null
+++ b/docs/extrinsic-metadata-specification.rst
@@ -0,0 +1,178 @@
+.. _extrinsic-metadata-specification:
+
+Extrinsic metadata specification
+================================
+
+:term:`Extrinsic metadata` is information about software that is not part
+of the source code itself but is still closely related to the software.
+Typical sources for extrinsic metadata are: the hosting platform of a
+repository, which can offer metadata via its web view or API; external
+registries like collaborative curation initiatives; and out-of-band
+information available at source code archival time.
+
+Because extrinsic metadata is not part of the source code, a dedicated
+mechanism is needed to fetch and store it.
+
+This specification assumes the reader is familiar with Software Heritage's
+:ref:`architecture` and :ref:`data-model`.
+
+
+Metadata sources
+----------------
+
+Authorities
+^^^^^^^^^^^
+
+Metadata authorities are entities that provide metadata about an
+:term:`origin`. Metadata authorities include: code hosting platforms,
+:term:`deposit` submitters, and registries (e.g. Wikidata).
+
+An authority is uniquely defined by these properties:
+
+* its type, representing the software/database from which metadata is
+  extracted (e.g. ``gitlab``, ``wikidata``, ``hal``).
+
+* its URL, which unambiguously identifies an instance of the authority type.
+
+Examples:
+
+=============== =================================
+type            url
+=============== =================================
+deposit         https://hal.archives-ouvertes.fr/
+deposit         https://hal.inria.fr/
+deposit         https://software.intel.com/
+gitlab          https://gitlab.com/
+gitlab          https://gitlab.inria.fr/
+gitlab          https://0xacab.org/
+github          https://github.com/
+wikidata        https://www.wikidata.org/
+swmath          https://swmath.org/
+ascl.net        http://ascl.net/
+=============== =================================
+
+Metadata fetchers
+^^^^^^^^^^^^^^^^^
+
+Metadata fetchers are software components used to fetch metadata from
+a metadata authority and ingest it into the Software Heritage archive.
+
+A metadata fetcher is uniquely defined by these properties:
+
+* its type
+* its version
+
+Examples:
+
+* :term:`loaders <loader>`, which may either discover metadata as a
+  side effect of loading source code, or be dedicated to fetching metadata.
+
+* :term:`listers <lister>`, which may discover metadata as a side effect
+  of discovering origins.
+
+* :term:`deposit` submitters, which push metadata to SWH from a
+  third party, usually at the same time as a :term:`software artifact`.
+
+* crawlers, which fetch metadata from an authority in a way that is
+  none of the above (e.g. by querying a specific API of the origin's forge).
+
+
+Storage API
+-----------
+
+Authorities and metadata fetchers
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The :term:`storage` API offers these endpoints to manipulate metadata
+authorities and metadata fetchers:
+
+* ``metadata_authority_add(type, url, metadata)``
+  which adds a new metadata authority to the storage.
+
+* ``metadata_authority_get(type, url)``
+  which looks up a known authority (there is at most one) and, if it is
+  known, returns a dictionary with keys ``type``, ``url``, and ``metadata``.
+
+* ``metadata_fetcher_add(name, version, metadata)``
+  which adds a new metadata fetcher to the storage.
+
+* ``metadata_fetcher_get(name, version)``
+  which looks up a known fetcher (there is at most one) and, if it is
+  known, returns a dictionary with keys ``name``, ``version``, and
+  ``metadata``.
+
+These ``metadata`` fields contain JSON-encodable dictionaries
+with information about the authority/fetcher, in a format specific to each
+authority/fetcher.
+For authorities, the ``metadata`` field is reserved for information
+describing and qualifying the authority.
+For fetchers, the ``metadata`` field is reserved for configuration metadata
+and other technical usage.
+
+Origin metadata storage
+-----------------------
+
+Extrinsic metadata are stored in SWH's :term:`storage database`.
+The storage API offers three endpoints to manipulate origin metadata:
+
+* Adding metadata::
+
+      origin_metadata_add(origin_url, discovery_date,
+                          authority, fetcher,
+                          format, metadata)
+
+  which adds a new ``metadata`` byte string obtained from a given authority
+  and associated with the origin.
+  ``authority`` must be a dict containing keys ``type`` and ``url``, and
+  ``fetcher`` a dict containing keys ``name`` and ``version``.
+  The authority and fetcher must be known to the storage before using this
+  endpoint.
+  ``format`` is a text field indicating the format of the content of the
+  ``metadata`` byte string.
+
+* Getting latest metadata::
+
+      origin_metadata_get_latest(origin_url, authority)
+
+  which returns the latest metadata entry added for this origin from the
+  given ``authority`` (a dict containing keys ``type`` and ``url``),
+  as a dictionary in the following format::
+
+      {
+        'authority': {'type': ..., 'url': ...},
+        'fetcher': {'name': ..., 'version': ...},
+        'discovery_date': ...,
+        'format': '...',
+        'metadata': b'...'
+      }
+
+
+* Getting all metadata::
+
+      origin_metadata_get(origin_url,
+                          authority,
+                          after, limit)
+
+  which returns a list of dictionaries, one for each metadata item
+  deposited, corresponding to the given origin and obtained from the
+  specified authority.
+  ``authority`` must be a dict containing keys ``type`` and ``url``.
+
+  Each of these dictionaries is in the following format::
+
+      {
+        'authority': {'type': ..., 'url': ...},
+        'fetcher': {'name': ..., 'version': ...},
+        'discovery_date': ...,
+        'format': '...',
+        'metadata': b'...'
+      }
+
+The parameters ``after`` and ``limit`` are used for pagination based on the
+order defined by the ``discovery_date``.
+
+``metadata`` is a byte string (possibly encoded using Base64).
+Its format is specific to each authority, and it is treated as an opaque
+value by the storage.
+Unifying these various formats into a common language is outside the scope
+of this specification.
diff --git a/docs/index.rst b/docs/index.rst
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -30,6 +30,11 @@
 * :ref:`archive-copies`
 
+Specifications
+--------------
+
+* :ref:`extrinsic-metadata-specification`
+
 Reference Documentation
 -----------------------