D1509.diff
diff --git a/docs/extrinsic-metadata-specification.rst b/docs/extrinsic-metadata-specification.rst
new file mode 100644
--- /dev/null
+++ b/docs/extrinsic-metadata-specification.rst
@@ -0,0 +1,116 @@
+.. _extrinsic-metadata-specification:
+
+Extrinsic metadata specification
+================================
+
+:term:`Extrinsic metadata` is information about software that is not part
+of the source code itself but is still closely related to the software.
+It is usually available through the web interface or API of a repository's
+forge, or from an external registry.
+
+Since it is not part of the source code, a separate mechanism is needed
+to fetch and store it.
+
+This specification assumes the reader is familiar with Software Heritage's
+:ref:`architecture` and :ref:`data-model`.
+
+
+Metadata providers
+------------------
+
+Definition
+~~~~~~~~~~
+
+We define five types of metadata providers:
+
+* :term:`loaders <loader>`, which are the components dedicated to fetching
+  source code from origins (VCS repositories, distribution packages,
+  ...). They may either discover metadata as a side effect of loading
+  source code, or be dedicated to fetching metadata.
+
+* :term:`listers <lister>`, which are the components of SWH dedicated to
+  discovering origins on known websites/forges, and which may discover
+  metadata as a side effect.
+
+* :term:`deposit clients <deposit>`, which push metadata to SWH from a
+  third party, usually at the same time as a :term:`software artifact`.
+
+* gatherers, which fetch metadata from an authoritative source of the
+  repository (e.g. its website or forge) in a way not covered by the three
+  types above (e.g. by querying a specific API of the origin's forge).
+
+* registries, which fetch data from non-authoritative databases, meaning
+  databases not directly referenced by the origin's website/forge/...
+  (e.g. Wikidata).
+
+A provider is uniquely defined by these two properties:
+
+* its name, representing the software/database from which metadata is
+  extracted (e.g. `gitlab`, `wikidata`, `hal`); each provider name
+  matches a component of SWH dedicated to getting data from it.
+
+* its URL, which unambiguously identifies an instance of the provider.
+
+Example providers:
+
+=============== =============== =================================
+type            name            url
+=============== =============== =================================
+deposit_client  hal             https://hal.archives-ouvertes.fr/
+deposit_client  swh             https://www.softwareheritage.org/
+lister          gitlab_lister   https://gitlab.com/
+loader          gitlab_loader   https://gitlab.com/
+registry        wikidata        https://www.wikidata.org/
+=============== =============== =================================
+
+Storage API
+~~~~~~~~~~~
+
+The :term:`storage` API offers two endpoints to manipulate metadata
+providers:
+
+* `metadata_provider_add(name, url, type, metadata)`
+  which adds a new metadata provider to the storage.
+
+* `metadata_provider_get_by(name, url)`
+  which looks up a provider (there is at most one match) and, if it is
+  known, returns a dictionary with keys `name`, `url`, `type`, and `metadata`.
+
+`metadata` is an arbitrary JSON-encodable dictionary with information
+about the provider, in a format specific to each provider name.
+This field is reserved for future use; currently it should always be empty.
+
+Origin metadata storage
+-----------------------
+
+Extrinsic metadata is stored in SWH's :term:`storage database`, alongside
+the :term:`Merkle DAG` containing all known software artifacts.
+The storage API offers three endpoints to manipulate origin metadata:
+
+* `origin_metadata_add(origin_id, discovery_date, provider_name, provider_url, metadata)`
+  which adds a new `metadata` dictionary obtained from a given provider
+  and associated with the origin.
+  The provider must be known to the storage before using this endpoint.
+
+* `origin_metadata_get(origin_id, provider_name, provider_url, after, limit)`
+  which returns a list of dictionaries:
+  `{'provider': {...}, 'discovery_date': ..., 'metadata': {...}}`,
+  one for each metadata item deposited, corresponding to the given origin
+  and obtained from the specified provider.
+
+* `origin_metadata_get_by_provider_type(origin_id, provider_type, after, limit)`
+  which works similarly to `origin_metadata_get`, but returns results for
+  all providers of a given type.
+
+The parameters `after` and `limit` are used for pagination based on the
+order defined by the `discovery_date`.
+
+All of the results of `origin_metadata_get` and
+`origin_metadata_get_by_provider_type` can be considered authoritative
+for the given origin at the given `discovery_date`, unless the provider type
+is `registry`.
+
+`metadata` is a JSON-encodable dictionary. Its format is specific to each
+provider, and it is treated as an opaque value by the storage.
+Unifying these various formats into a common language is outside the scope
+of this specification.
diff --git a/docs/index.rst b/docs/index.rst
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -30,6 +30,11 @@
* :ref:`archive-copies`
+Specifications
+--------------
+
+* :ref:`extrinsic-metadata-specification`
+
Reference Documentation
-----------------------
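
The storage endpoints described in the specification added by this diff could be exercised with a small in-memory stand-in such as the one below. This is only a sketch under stated assumptions, not the actual SWH storage implementation: the class name `InMemoryMetadataStorage`, the `None` defaults for `after` and `limit`, the reading of `after` as an exclusive lower bound on `discovery_date`, the `ValueError` raised for an unknown provider, and the origin id `42` with its metadata payloads are all hypothetical choices made for illustration.

```python
import datetime
from typing import Any, Callable, Dict, List, Optional, Tuple

Provider = Dict[str, Any]
Item = Dict[str, Any]


class InMemoryMetadataStorage:
    """Hypothetical in-memory stand-in for the endpoints named in the spec."""

    def __init__(self) -> None:
        # The spec says a provider is uniquely defined by (name, url).
        self._providers: Dict[Tuple[str, str], Provider] = {}
        # Metadata items recorded for each origin id.
        self._items: Dict[int, List[Item]] = {}

    # `type` shadows the builtin, but matches the parameter name in the spec.
    def metadata_provider_add(self, name: str, url: str, type: str,
                              metadata: Optional[Dict[str, Any]] = None) -> None:
        self._providers[(name, url)] = {
            "name": name, "url": url, "type": type, "metadata": metadata or {},
        }

    def metadata_provider_get_by(self, name: str, url: str) -> Optional[Provider]:
        # Returns None when the provider is unknown (assumed behaviour).
        return self._providers.get((name, url))

    def origin_metadata_add(self, origin_id: int,
                            discovery_date: datetime.datetime,
                            provider_name: str, provider_url: str,
                            metadata: Dict[str, Any]) -> None:
        provider = self.metadata_provider_get_by(provider_name, provider_url)
        if provider is None:
            # The spec requires the provider to be known beforehand;
            # raising ValueError is an assumption of this sketch.
            raise ValueError("provider must be added before its metadata")
        self._items.setdefault(origin_id, []).append(
            {"provider": provider, "discovery_date": discovery_date,
             "metadata": metadata})

    def _page(self, origin_id: int, keep: Callable[[Provider], bool],
              after: Optional[datetime.datetime],
              limit: Optional[int]) -> List[Item]:
        # Pagination follows discovery_date order; `after` is treated as an
        # exclusive lower bound and `limit` as a maximum count (assumptions).
        items = [i for i in self._items.get(origin_id, [])
                 if keep(i["provider"])
                 and (after is None or i["discovery_date"] > after)]
        items.sort(key=lambda i: i["discovery_date"])
        return items if limit is None else items[:limit]

    def origin_metadata_get(self, origin_id: int, provider_name: str,
                            provider_url: str,
                            after: Optional[datetime.datetime] = None,
                            limit: Optional[int] = None) -> List[Item]:
        wanted = (provider_name, provider_url)
        return self._page(origin_id,
                          lambda p: (p["name"], p["url"]) == wanted,
                          after, limit)

    def origin_metadata_get_by_provider_type(
            self, origin_id: int, provider_type: str,
            after: Optional[datetime.datetime] = None,
            limit: Optional[int] = None) -> List[Item]:
        return self._page(origin_id, lambda p: p["type"] == provider_type,
                          after, limit)


# Usage with the `hal` provider from the example table; the origin id and
# the metadata payloads are made up.
HAL_URL = "https://hal.archives-ouvertes.fr/"
storage = InMemoryMetadataStorage()
storage.metadata_provider_add("hal", HAL_URL, "deposit_client")
first = datetime.datetime(2019, 5, 1, tzinfo=datetime.timezone.utc)
second = datetime.datetime(2019, 6, 1, tzinfo=datetime.timezone.utc)
storage.origin_metadata_add(42, second, "hal", HAL_URL, {"title": "v2"})
storage.origin_metadata_add(42, first, "hal", HAL_URL, {"title": "v1"})
page = storage.origin_metadata_get(42, "hal", HAL_URL, after=first, limit=10)
```

Keying providers by `(name, url)` mirrors the uniqueness rule stated in the specification, and storing the full provider dictionary inside each item gives results the `{'provider': ..., 'discovery_date': ..., 'metadata': ...}` shape the specification describes.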
Attached To
D1509: Write a specification of extrinsic origin metadata storage.