D1614.id5619.diff
diff --git a/docs/extrinsic-metadata-specification.rst b/docs/extrinsic-metadata-specification.rst
new file mode 100644
--- /dev/null
+++ b/docs/extrinsic-metadata-specification.rst
@@ -0,0 +1,174 @@
+.. _extrinsic-metadata-specification:
+
+Extrinsic metadata specification
+================================
+
+:term:`Extrinsic metadata` is information about software that is not part
+of the source code itself but still closely related to the software.
+Typical sources for extrinsic metadata are: the hosting place of a
+repository, which can offer metadata via its web view or API; external
+registries like collaborative curation initiatives; and out-of-band
+information available at source code archival time.
+
+Since extrinsic metadata is not part of the source code, a dedicated
+mechanism to fetch and store it is needed.
+
+This specification assumes the reader is familiar with Software Heritage's
+:ref:`architecture` and :ref:`data-model`.
+
+
+Metadata sources
+----------------
+
+Authorities
+^^^^^^^^^^^
+
+Metadata authorities are entities that provide metadata about an
+:term:`origin`. Metadata authorities include: code hosting places,
+:term:`deposit` submitters, and registries (e.g. Wikidata).
+
+An authority is uniquely defined by these properties:
+
+* its type, representing the software/database from which metadata is
+  extracted (e.g. ``gitlab``, ``wikidata``, ``hal``).
+
+* its URL, which unambiguously identifies an instance of the authority type.
+
+Examples:
+
+=============== =================================
+type            url
+=============== =================================
+deposit         https://hal.archives-ouvertes.fr/
+deposit         https://hal.inria.fr/
+deposit         https://software.intel.com/
+gitlab          https://gitlab.com/
+gitlab          https://gitlab.inria.fr/
+gitlab          https://0xacab.org/
+github          https://github.com/
+wikidata        https://www.wikidata.org/
+swmath          https://swmath.org/
+ascl.net        http://ascl.net/
+=============== =================================
+
+Metadata fetchers
+^^^^^^^^^^^^^^^^^
+
+Metadata fetchers are software components used to fetch metadata from
+a metadata authority, and ingest it into the Software Heritage archive.
+
+A metadata fetcher is uniquely defined by these properties:
+
+* its type (passed as ``name`` to the storage API described below)
+* its version
+
+Examples:
+
+* :term:`loaders <loader>`, which may either discover metadata as a
+  side-effect of loading source code, or be dedicated to fetching metadata.
+
+* :term:`listers <lister>`, which may discover metadata as a side-effect
+  of discovering origins.
+
+* :term:`deposit` submitters, which push metadata to SWH from a
+  third party, usually at the same time as a :term:`software artifact`.
+
+* crawlers, which fetch metadata from an authority in a way that is
+  none of the above (e.g. by querying a specific API of the origin's forge).
+
+
+Storage API
+-----------
+
+Authorities and metadata fetchers
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The :term:`storage` API offers these endpoints to manipulate metadata
+authorities and metadata fetchers:
+
+* ``metadata_authority_add(type, url, metadata)``
+  which adds a new metadata authority to the storage.
+
+* ``metadata_authority_get(type, url)``
+  which looks up an authority (there is at most one) and, if it is
+  known, returns a dictionary with keys ``type``, ``url``, and ``metadata``.
+
+* ``metadata_fetcher_add(name, version, metadata)``
+  which adds a new metadata fetcher to the storage.
+
+* ``metadata_fetcher_get(name, version)``
+  which looks up a fetcher (there is at most one) and, if it is
+  known, returns a dictionary with keys ``name``, ``version``, and
+  ``metadata``.
+
+These ``metadata`` fields contain JSON-encodable dictionaries
+with information about the authority/fetcher, in a format specific to each
+authority/fetcher.
+For authorities, the ``metadata`` field is reserved for information
+describing and qualifying the authority.
+For fetchers, the ``metadata`` field is reserved for configuration metadata
+and other technical usage.
+
+Origin metadata storage
+-----------------------
+
+Extrinsic metadata is stored in SWH's :term:`storage database`.
+The storage API offers three endpoints to manipulate origin metadata:
+
+* Adding metadata::
+
+    origin_metadata_add(origin_url, discovery_date,
+                        authority, fetcher,
+                        metadata)
+
+  which adds a new ``metadata`` byte string obtained from a given authority
+  and associated with the origin.
+  ``authority`` must be a dict containing keys ``type`` and ``url``, and
+  ``fetcher`` a dict containing keys ``name`` and ``version``.
+  The authority and fetcher must be known to the storage before using this
+  endpoint.
+
+* Getting latest metadata::
+
+    origin_metadata_get_latest(origin_url, authority)
+
+  where ``authority`` must be a dict containing keys ``type`` and ``url``,
+  which returns a dictionary corresponding to the latest metadata entry
+  added for this origin by the given authority, in the format::
+
+    {
+      'authority': {'type': ..., 'url': ...},
+      'fetcher': {'name': ..., 'version': ...},
+      'discovery_date': ...,
+      'metadata': b'...'
+    }
+
+
+* Getting all metadata::
+
+    origin_metadata_get(origin_url,
+                        authority,
+                        after, limit)
+
+  which returns a list of dictionaries, one for each metadata item
+  deposited, corresponding to the given origin and obtained from the
+  specified authority.
+  ``authority`` must be a dict containing keys ``type`` and ``url``.
+
+  Each of these dictionaries is in the following format::
+
+    {
+      'authority': {'type': ..., 'url': ...},
+      'fetcher': {'name': ..., 'version': ...},
+      'discovery_date': ...,
+      'metadata': b'...'
+    }
+
+The parameters ``after`` and ``limit`` are used for pagination based on the
+order defined by the ``discovery_date``.
+
+``metadata`` is a byte string (possibly encoded using Base64).
+Its format is specific to each authority, and it is treated as an opaque
+value by the storage.
+Unifying these various formats into a common language is outside the scope
+of this specification.
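
Note on the specification above (not part of the patch): the storage endpoints it describes can be illustrated with a small in-memory sketch. This is not the Software Heritage implementation. The class name `InMemoryMetadataStorage`, the fetcher name `example-loader`, and the origin URL are hypothetical. The sketch also makes assumptions the specification leaves open: lookups of unknown authorities or fetchers return `None`, unknown authorities or fetchers passed to `origin_metadata_add` raise `ValueError`, `after` is an exclusive lower bound on `discovery_date`, and `limit` is a maximum number of results.

```python
import datetime
from typing import Any, Dict, List, Optional, Tuple


class InMemoryMetadataStorage:
    """Toy in-memory stand-in for the endpoints described in the specification."""

    def __init__(self) -> None:
        self._authorities: Dict[Tuple[str, str], Dict[str, Any]] = {}
        self._fetchers: Dict[Tuple[str, str], Dict[str, Any]] = {}
        self._origin_metadata: Dict[str, List[Dict[str, Any]]] = {}

    def metadata_authority_add(self, type: str, url: str, metadata: dict) -> None:
        # An authority is uniquely identified by (type, url).
        self._authorities[(type, url)] = {"type": type, "url": url, "metadata": metadata}

    def metadata_authority_get(self, type: str, url: str) -> Optional[dict]:
        # Assumption: None is returned when the authority is unknown.
        return self._authorities.get((type, url))

    def metadata_fetcher_add(self, name: str, version: str, metadata: dict) -> None:
        # A fetcher is uniquely identified by (name, version).
        self._fetchers[(name, version)] = {"name": name, "version": version, "metadata": metadata}

    def metadata_fetcher_get(self, name: str, version: str) -> Optional[dict]:
        return self._fetchers.get((name, version))

    def origin_metadata_add(self, origin_url, discovery_date, authority, fetcher, metadata) -> None:
        # The authority and fetcher must be known to the storage beforehand.
        if (authority["type"], authority["url"]) not in self._authorities:
            raise ValueError("unknown authority")  # assumed error behaviour
        if (fetcher["name"], fetcher["version"]) not in self._fetchers:
            raise ValueError("unknown fetcher")  # assumed error behaviour
        if not isinstance(metadata, bytes):
            raise TypeError("metadata must be a byte string")
        entries = self._origin_metadata.setdefault(origin_url, [])
        entries.append({
            "authority": {"type": authority["type"], "url": authority["url"]},
            "fetcher": {"name": fetcher["name"], "version": fetcher["version"]},
            "discovery_date": discovery_date,
            "metadata": metadata,  # opaque to the storage
        })
        # Keep entries ordered by discovery_date, the pagination order.
        entries.sort(key=lambda entry: entry["discovery_date"])

    def origin_metadata_get(self, origin_url, authority, after=None, limit=None) -> List[dict]:
        wanted = {"type": authority["type"], "url": authority["url"]}
        results = [
            entry for entry in self._origin_metadata.get(origin_url, [])
            if entry["authority"] == wanted
            and (after is None or entry["discovery_date"] > after)  # assumed exclusive
        ]
        return results if limit is None else results[:limit]

    def origin_metadata_get_latest(self, origin_url, authority) -> Optional[dict]:
        entries = self.origin_metadata_get(origin_url, authority)
        return entries[-1] if entries else None


# Usage with hypothetical values.
storage = InMemoryMetadataStorage()
authority = {"type": "gitlab", "url": "https://gitlab.com/"}
fetcher = {"name": "example-loader", "version": "0.0.1"}
origin = "https://gitlab.com/example/project"
storage.metadata_authority_add(authority["type"], authority["url"], {})
storage.metadata_fetcher_add(fetcher["name"], fetcher["version"], {})
for day, payload in [(1, b'{"stars": 1}'), (2, b'{"stars": 2}')]:
    date = datetime.datetime(2020, 1, day, tzinfo=datetime.timezone.utc)
    storage.origin_metadata_add(origin, date, authority, fetcher, payload)
latest = storage.origin_metadata_get_latest(origin, authority)
first_page = storage.origin_metadata_get(origin, authority, limit=1)
```

Keying authorities by `(type, url)` and fetchers by `(name, version)` mirrors the uniqueness rules in the specification. The `metadata` payload is stored as received and never parsed, which matches its treatment as an opaque value.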
diff --git a/docs/index.rst b/docs/index.rst
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -30,6 +30,11 @@
* :ref:`archive-copies`
+Specifications
+--------------
+
+* :ref:`extrinsic-metadata-specification`
+
Reference Documentation
-----------------------
Attached To
D1614: Specification of extrinsic origin metadata and their storage.