
D1509.diff

diff --git a/docs/extrinsic-metadata-specification.rst b/docs/extrinsic-metadata-specification.rst
new file mode 100644
--- /dev/null
+++ b/docs/extrinsic-metadata-specification.rst
@@ -0,0 +1,116 @@
+.. _extrinsic-metadata-specification:
+
+Extrinsic metadata specification
+================================
+
+:term:`Extrinsic metadata` is information about software that is not part
+of the source code itself but still closely related to the software.
+It is usually available through the web interface or API of a repository's
+forge, or through an external registry.
+
+Since it is not part of the source code, a separate mechanism is needed
+to fetch and store it.
+
+This specification assumes the reader is familiar with Software Heritage's
+:ref:`architecture` and :ref:`data-model`.
+
+
+Metadata providers
+------------------
+
+Definition
+~~~~~~~~~~
+
+We define five types of metadata providers:
+
+* :term:`loaders <loader>`, which are the components dedicated to fetching
+  source code from origins (VCS repositories, distribution packages,
+  ...). They may either discover metadata as a side effect of loading
+  source code, or be dedicated to fetching metadata.
+
+* :term:`listers <lister>`, which are the components of SWH dedicated to
+  discovering origins on known websites/forges, and which may discover
+  metadata as a side effect.
+
+* :term:`deposit clients <deposit>`, which push metadata to SWH from a
+  third party, usually at the same time as a :term:`software artifact`.
+
+* gatherers, which fetch metadata from an authoritative source of the
+  repository (e.g. its website or forge) in a way that is none of the three
+  above (e.g. by querying a specific API of the origin's forge).
+
+* registries, which fetch data from non-authoritative databases, meaning
+  databases that are not directly referenced by the origin's
+  website/forge/... (e.g. Wikidata).
+
+A provider is uniquely defined by these two properties:
+
+* its name, representing the software/database from which metadata is
+  extracted (e.g. `gitlab`, `wikidata`, `hal`); each provider name
+  matches a component of SWH dedicated to getting data from it.
+
+* its URL, which unambiguously identifies an instance of the provider.
+
+Example providers:
+
+=============== =============== =================================
+type            name            url
+=============== =============== =================================
+deposit_client  hal             https://hal.archives-ouvertes.fr/
+deposit_client  swh             https://www.softwareheritage.org/
+lister          gitlab_lister   https://gitlab.com/
+loader          gitlab_loader   https://gitlab.com/
+registry        wikidata        https://www.wikidata.org/
+=============== =============== =================================
+
+Storage API
+~~~~~~~~~~~
+
+The :term:`storage` API offers two endpoints to manipulate metadata
+providers:
+
+* `metadata_provider_add(name, url, type, metadata)`
+  which adds a new metadata provider to the storage.
+
+* `metadata_provider_get_by(name, url)`
+  which looks up a known provider (there is at most one) and, if it is
+  known, returns a dictionary with keys `name`, `url`, `type`, and `metadata`.
+
+`metadata` is an arbitrary JSON-encodable dictionary with information
+about the provider, in a format specific to each provider name.
+This field is reserved for future use; currently it should always be empty.
+
+Origin metadata storage
+-----------------------
+
+Extrinsic metadata are stored in SWH's :term:`storage database`, alongside
+the :term:`Merkle DAG` containing all known software artifacts.
+The storage API offers three endpoints to manipulate origin metadata:
+
+* `origin_metadata_add(origin_id, discovery_date, provider_name, provider_url, metadata)`
+  which adds a new `metadata` dictionary obtained from a given provider
+  and associated with the origin.
+  The provider must be known to the storage before using this endpoint.
+
+* `origin_metadata_get(origin_id, provider_name, provider_url, after, limit)`
+  which returns a list of dictionaries:
+  `{'provider': {...}, 'discovery_date': ..., 'metadata': {...}}`,
+  one for each metadata item deposited, corresponding to the given origin
+  and obtained from the specified provider.
+
+* `origin_metadata_get_by_provider_type(origin_id, provider_type, after, limit)`
+  which works similarly to `origin_metadata_get`, but returns results for
+  all providers of a given type.
+
+The parameters `after` and `limit` are used for pagination based on the
+order defined by the `discovery_date`.
+
+All of the results of `origin_metadata_get` and
+`origin_metadata_get_by_provider_type` can be considered authoritative
+for the given origin at the given `discovery_date`, unless the provider type
+is `registry`.
+
+`metadata` is a JSON-encodable dictionary. Its format is specific to each
+provider, and it is treated as an opaque value by the storage.
+Unifying these various formats into a common language is outside the scope
+of this specification.
diff --git a/docs/index.rst b/docs/index.rst
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -30,6 +30,11 @@
* :ref:`archive-copies`
+Specifications
+--------------
+
+* :ref:`extrinsic-metadata-specification`
+
Reference Documentation
-----------------------

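The endpoints described in the specification above could be modelled roughly as follows. This is only an in-memory sketch to illustrate the intended behaviour, not the actual SWH storage implementation. The class name `InMemoryMetadataStorage`, the helper `_page`, the default values for `after` and `limit`, the treatment of `after` as an exclusive lower bound, and the behaviour when a provider is added twice or is unknown are all assumptions that the specification does not state.

```python
class InMemoryMetadataStorage:
    """Toy model of the provider and origin-metadata endpoints (hypothetical)."""

    def __init__(self):
        self._providers = {}  # (name, url) -> provider dictionary
        self._items = []      # (origin_id, item dictionary) pairs

    def metadata_provider_add(self, name, url, type, metadata=None):
        # A provider is uniquely identified by its (name, url) pair.
        # Assumption: re-adding an existing pair is a no-op.
        self._providers.setdefault(
            (name, url),
            {"name": name, "url": url, "type": type, "metadata": metadata or {}},
        )

    def metadata_provider_get_by(self, name, url):
        # Assumption: an unknown provider yields None.
        return self._providers.get((name, url))

    def origin_metadata_add(self, origin_id, discovery_date,
                            provider_name, provider_url, metadata):
        # The provider must be known to the storage beforehand.
        provider = self.metadata_provider_get_by(provider_name, provider_url)
        if provider is None:
            raise ValueError("provider must be added before its metadata")
        self._items.append((origin_id, {
            "provider": provider,
            "discovery_date": discovery_date,
            "metadata": metadata,  # opaque to the storage
        }))

    def _page(self, origin_id, keep_provider, after, limit):
        # Select items for the origin, ordered by discovery_date.
        items = [item for oid, item in self._items
                 if oid == origin_id and keep_provider(item["provider"])]
        items.sort(key=lambda item: item["discovery_date"])
        if after is not None:
            # Assumption: `after` is an exclusive lower bound.
            items = [i for i in items if i["discovery_date"] > after]
        return items if limit is None else items[:limit]

    def origin_metadata_get(self, origin_id, provider_name, provider_url,
                            after=None, limit=None):
        return self._page(
            origin_id,
            lambda p: (p["name"], p["url"]) == (provider_name, provider_url),
            after, limit,
        )

    def origin_metadata_get_by_provider_type(self, origin_id, provider_type,
                                             after=None, limit=None):
        return self._page(
            origin_id, lambda p: p["type"] == provider_type, after, limit,
        )
```

A caller would page through results by passing the `discovery_date` of the last item received as `after` on the next call. Under the exclusive-bound assumption, items sharing that exact date would be skipped, so a real implementation may need a finer-grained cursor.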