Page Menu
Home
Software Heritage
Search
Configure Global Search
Log In
Files
F9336742
No One
Temporary
Actions
View File
Edit File
Delete File
View Transforms
Subscribe
Mute Notifications
Award Token
Flag For Later
Size
6 KB
Subscribers
None
View Options
diff --git a/docs/extrinsic-metadata-specification.rst b/docs/extrinsic-metadata-specification.rst
index 00b5f27b..3550a2d2 100644
--- a/docs/extrinsic-metadata-specification.rst
+++ b/docs/extrinsic-metadata-specification.rst
@@ -1,183 +1,184 @@
:orphan:
.. _extrinsic-metadata-specification:
Extrinsic metadata specification
================================
:term:`Extrinsic metadata` is information about software that is not part
of the source code itself but still closely related to the software.
Typical sources for extrinsic metadata are: the hosting place of a
repository, which can offer metadata via its web view or API; external
registries like collaborative curation initiatives; and out-of-band
information available at source code archival time.
Since they are not part of the source code, a dedicated mechanism to fetch
and store them is needed.
This specification assumes the reader is familiar with Software Heritage's
:ref:`architecture` and :ref:`data-model`.
Metadata sources
----------------
Authorities
^^^^^^^^^^^
Metadata authorities are entities that provide metadata about an
:term:`origin`. Metadata authorities include: code hosting places,
:term:`deposit` submitters, and registries (eg. Wikidata).
An authority is uniquely defined by these properties:
* its type, representing the kind of authority, which is one of these values:
* `deposit`, for metadata pushed to Software Heritage at the same time
as a software artifact
* `forge`, for metadata pulled from the same source as the one hosting
the software artifacts (which includes package managers)
* `registry`, for metadata pulled from a third-party
* its URL, which unambiguously identifies an instance of the authority type.
Examples:
=============== =================================
type url
=============== =================================
deposit https://hal.archives-ouvertes.fr/
deposit https://hal.inria.fr/
deposit https://software.intel.com/
forge https://gitlab.com/
forge https://gitlab.inria.fr/
forge https://0xacab.org/
forge https://github.com/
registry https://www.wikidata.org/
registry https://swmath.org/
registry https://ascl.net/
=============== =================================
Metadata fetchers
^^^^^^^^^^^^^^^^^
Metadata fetchers are software components used to fetch metadata from
a metadata authority, and ingest them into the Software Heritage archive.
A metadata fetcher is uniquely defined by these properties:
* its type
* its version
Examples:
* :term:`loaders <loader>`, which may either discover metadata as a
side-effect of loading source code, or be dedicated to fetching metadata.
* :term:`listers <lister>`, which may discover metadata as a side-effect
of discovering origins.
* :term:`deposit` submitters, which push metadata to SWH from a
third-party; usually at the same time as a :term:`software artifact`
* crawlers, which fetch metadata from an authority in a way that is
none of the above (eg. by querying a specific API of the origin's forge).
Storage API
~~~~~~~~~~~
Authorities and metadata fetchers
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The :term:`storage` API offers these endpoints to manipulate metadata
authorities and metadata fetchers:
* ``metadata_authority_add(type, url, metadata)``
which adds a new metadata authority to the storage.
* ``metadata_authority_get(type, url)``
which looks up a known authority (there is at most one) and if it is
known, returns a dictionary with keys ``type``, ``url``, and ``metadata``.
* ``metadata_fetcher_add(name, version, metadata)``
which adds a new metadata fetcher to the storage.
* ``metadata_fetcher_get(name, version)``
which looks up a known fetcher (there is at most one) and if it is
known, returns a dictionary with keys ``name``, ``version``, and
``metadata``.
These `metadata` fields contain JSON-encodable dictionaries
with information about the authority/fetcher, in a format specific to each
authority/fetcher.
With authority, the `metadata` field is reserved for information describing
and qualifying the authority.
With fetchers, the `metadata` field is reserved for configuration metadata
and other technical usage.
Origin metadata storage
-----------------------
Extrinsic metadata are stored in SWH's :term:`storage database`.
The storage API offers three endpoints to manipulate origin metadata:
* Adding metadata::
origin_metadata_add(origin_url, discovery_date,
authority, fetcher,
format, metadata)
which adds a new `metadata` byte string obtained from a given authority
and associated to the origin.
+ `discovery_date` is an ISO 8601 datetime string.
`authority` must be a dict containing keys `type` and `url`, and
`fetcher` a dict containing keys `name` and `version`.
The authority and fetcher must be known to the storage before using this
endpoint.
`format` is a text field indicating the format of the content of the
`metadata` byte string.
* Getting latest metadata::
origin_metadata_get_latest(origin_url, authority)
where `authority` must be a dict containing keys `type` and `url`,
which returns a dictionary corresponding to the latest metadata entry
added from this origin, in the format::
{
'authority': {'type': ..., 'url': ...},
'fetcher': {'name': ..., 'version': ...},
- 'discovery_date': ...,
+ 'discovery_date': '...',
'format': '...',
'metadata': b'...'
}
* Getting all metadata::
origin_metadata_get(origin_url,
authority,
after, limit)
which returns a list of dictionaries, one for each metadata item
deposited, corresponding to the given origin and obtained from the
specified authority.
`authority` must be a dict containing keys `type` and `url`.
Each of these dictionaries is in the following format::
{
'authority': {'type': ..., 'url': ...},
'fetcher': {'name': ..., 'version': ...},
- 'discovery_date': ...,
+ 'discovery_date': '...',
'format': '...',
'metadata': b'...'
}
The parameters ``after`` and ``limit`` are used for pagination based on the
order defined by the ``discovery_date``.
``metadata`` is a bytes array (eventually encoded using Base64).
Its format is specific to each authority; and is treated as an opaque value
by the storage.
Unifying these various formats into a common language is outside the scope
of this specification.
File Metadata
Details
Attached
Mime Type
text/x-diff
Expires
Jul 4 2025, 7:42 AM (10 w, 5 d ago)
Storage Engine
blob
Storage Format
Raw Data
Storage Handle
3301703
Attached To
rDSTO Storage manager
Event Timeline
Log In to Comment