diff --git a/docs/extrinsic-metadata-specification.rst b/docs/extrinsic-metadata-specification.rst
new file mode 100644
--- /dev/null
+++ b/docs/extrinsic-metadata-specification.rst
@@ -0,0 +1,116 @@
+.. _extrinsic-metadata-specification:
+
+Extrinsic metadata specification
+================================
+
+:term:`Extrinsic metadata` is information about software that is not part
+of the source code itself but still closely related to the software.
+Usually it is available on the web view of a repository's forge and its API
+or an external registry.
+
+Since it is not part of the source code, we need a separate mechanism
+to fetch and store it.
+
+This specification assumes the reader is familiar with Software Heritage's
+:ref:`architecture` and :ref:`data-model`.
+
+
+Metadata providers
+------------------
+
+Definition
+~~~~~~~~~~
+
+We define five types of metadata providers:
+
+* :term:`loaders <loader>`, which are the components dedicated to fetching
+  the source code from origins (VCS repositories, distribution packages,
+  ...). They may either discover metadata as a side effect of loading
+  source code, or be dedicated to fetching metadata.
+
+* :term:`listers <lister>`, which are the components of SWH dedicated to
+  discovering origins on known websites/forges, and which may discover
+  metadata as a side effect.
+
+* :term:`deposit clients <deposit client>`, which push metadata to SWH from a
+  third party, usually at the same time as a :term:`software artifact`.
+
+* gatherers, which fetch metadata from an authoritative source of the
+  repository (e.g. its website or forge) in a way that is none of the three
+  above (e.g. by querying a specific API of the origin's forge).
+
+* registries, which fetch data from non-authoritative databases, meaning
+  they are not directly referenced by the origin's website/forge/...
+  (e.g. Wikidata).
+
+A provider is uniquely defined by these two properties:
+
+* its name, representing the software/database from which metadata is
+  extracted (e.g. `gitlab`, `wikidata`, `hal`); each provider name
+  matches a component of SWH dedicated to getting data from it.
+
+* its URL, which unambiguously identifies an instance of the provider.
+
+Example providers:
+
+=============== ============== =================================
+type            name           url
+=============== ============== =================================
+deposit_client  hal            https://hal.archives-ouvertes.fr/
+deposit_client  swh            https://www.softwareheritage.org/
+lister          gitlab_lister  https://gitlab.com/
+loader          gitlab_loader  https://gitlab.com/
+registry        wikidata       https://www.wikidata.org/
+=============== ============== =================================
+
+Storage API
+~~~~~~~~~~~
+
+The :term:`storage` API offers two endpoints to manipulate metadata
+providers:
+
+* `metadata_provider_add(name, url, type, metadata)`
+  which adds a new metadata provider to the storage.
+
+* `metadata_provider_get_by(name, url)`
+  which looks up a known provider (there is at most one) and, if it is
+  known, returns a dictionary with keys `name`, `url`, `type`, and `metadata`.
+
+`metadata` is an arbitrary JSON-encodable dictionary with information
+about the provider, in a format specific to each provider name.
+This field is reserved for future use; currently it should always be empty.
+
+Origin metadata storage
+-----------------------
+
+Extrinsic metadata is stored in SWH's :term:`storage database`, alongside
+the :term:`Merkle DAG` containing all known software artifacts.
+The storage API offers three endpoints to manipulate origin metadata:
+
+* `origin_metadata_add(origin_id, discovery_date, provider_name, provider_url, metadata)`
+  which adds a new `metadata` dictionary obtained from a given provider
+  and associated with the origin.
+  The provider must be known to the storage before using this endpoint.
+
+* `origin_metadata_get(origin_id, provider_name, provider_url, after, limit)`
+  which returns a list of dictionaries:
+  `{'provider': {...}, 'discovery_date': ..., 'metadata': {...}}`,
+  one for each metadata item deposited, corresponding to the given origin
+  and obtained from the specified provider.
+
+* `origin_metadata_get_by_provider_type(origin_id, provider_type, after, limit)`
+  which works similarly to `origin_metadata_get`, but returns results for
+  all providers of a given type.
+
+The parameters `after` and `limit` are used for pagination based on the
+order defined by the `discovery_date`.
+
+All results of `origin_metadata_get` and
+`origin_metadata_get_by_provider_type` can be considered authoritative
+for the given origin at the given `discovery_date`, unless the provider type
+is `registry`.
+
+`metadata` is a JSON-encodable dictionary. Its format is
+specific to each provider, and it is treated as an opaque value by the storage.
+Unifying these various formats into a common language is outside the scope
+of this specification.
diff --git a/docs/index.rst b/docs/index.rst
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -30,6 +30,11 @@
 
 * :ref:`archive-copies`
 
+Specifications
+--------------
+
+* :ref:`extrinsic-metadata-specification`
+
 Reference Documentation
 -----------------------
 