.. _extrinsic-metadata-specification:

Extrinsic metadata specification
================================

:term:`Extrinsic metadata` is information about software that is not part
of the source code itself but still closely related to the software.
Typical sources for extrinsic metadata are: the hosting place of a
repository, which can offer metadata via its web view or API; external
registries like collaborative curation initiatives; and out-of-band
information available at source code archival time.

Since they are not part of the source code, a dedicated mechanism to fetch
and store them is needed.

This specification assumes the reader is familiar with Software Heritage's
:ref:`architecture` and :ref:`data-model`.

Metadata sources
----------------

Authorities
^^^^^^^^^^^

Metadata authorities are entities that provide metadata about an
:term:`origin`. Metadata authorities include: code hosting places,
:term:`deposit` submitters, and registries (e.g. Wikidata).

An authority is uniquely defined by these properties:

* its type, representing the kind of authority, which is one of these values:

  * `deposit`, for metadata pushed to Software Heritage at the same time
    as a software artifact
  * `forge`, for metadata pulled from the same source as the one hosting
    the software artifacts (which includes package managers)
  * `registry`, for metadata pulled from a third party

* its URL, which unambiguously identifies an instance of the authority type.

Examples:

=============== =================================
type            url
=============== =================================
deposit         https://hal.archives-ouvertes.fr/
deposit         https://hal.inria.fr/
deposit         https://software.intel.com/
forge           https://gitlab.com/
forge           https://gitlab.inria.fr/
forge           https://0xacab.org/
forge           https://github.com/
registry        https://www.wikidata.org/
registry        https://swmath.org/
registry        https://ascl.net/
=============== =================================
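
The identity rule above (an authority is its type plus its URL) can be
illustrated with a small value object. This is only a sketch: the class name
``MetadataAuthority`` and the constant ``AUTHORITY_TYPES`` are hypothetical
and are not taken from the Software Heritage code base.

```python
from dataclasses import dataclass

# Authority types listed in this specification.
AUTHORITY_TYPES = frozenset({"deposit", "forge", "registry"})


@dataclass(frozen=True)
class MetadataAuthority:
    """Hypothetical value object: the (type, url) pair is the identity."""

    type: str
    url: str

    def __post_init__(self) -> None:
        # Reject any type that the specification does not list.
        if self.type not in AUTHORITY_TYPES:
            raise ValueError(f"unknown authority type: {self.type!r}")


hal = MetadataAuthority("deposit", "https://hal.archives-ouvertes.fr/")
inria = MetadataAuthority("deposit", "https://hal.inria.fr/")

# Same type but different URL: two distinct authorities.
assert hal != inria
# Equal (type, url) pairs compare equal and collapse in a set.
assert len({hal, MetadataAuthority("deposit", "https://hal.archives-ouvertes.fr/")}) == 1
```

A frozen dataclass is used here so that instances are hashable and can serve
as dictionary keys; any other immutable pair would work as well.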

Metadata fetchers
^^^^^^^^^^^^^^^^^

Metadata fetchers are software components used to fetch metadata from
a metadata authority, and ingest them into the Software Heritage archive.

A metadata fetcher is uniquely defined by these properties:

* its type
* its version

Examples:

* :term:`loaders <loader>`, which may either discover metadata as a
  side-effect of loading source code, or be dedicated to fetching metadata.

* :term:`listers <lister>`, which may discover metadata as a side-effect
  of discovering origins.

* :term:`deposit` submitters, which push metadata to SWH from a
  third party, usually at the same time as a :term:`software artifact`.

* crawlers, which fetch metadata from an authority in a way that is
  none of the above (e.g. by querying a specific API of the origin's forge).

Storage API
~~~~~~~~~~~

Authorities and metadata fetchers
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The :term:`storage` API offers these endpoints to manipulate metadata
authorities and metadata fetchers:

* ``metadata_authority_add(type, url, metadata)``
  which adds a new metadata authority to the storage.

* ``metadata_authority_get(type, url)``
  which looks up a known authority (there is at most one) and if it is
  known, returns a dictionary with keys ``type``, ``url``, and ``metadata``.

* ``metadata_fetcher_add(name, version, metadata)``
  which adds a new metadata fetcher to the storage.

* ``metadata_fetcher_get(name, version)``
  which looks up a known fetcher (there is at most one) and if it is
  known, returns a dictionary with keys ``name``, ``version``, and
  ``metadata``.

These `metadata` fields contain JSON-encodable dictionaries
with information about the authority/fetcher, in a format specific to each
authority/fetcher.
For authorities, the `metadata` field is reserved for information describing
and qualifying the authority.
For fetchers, the `metadata` field is reserved for configuration metadata
and other technical usage.
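
The four endpoints above can be modelled with two dictionaries keyed by the
identifying pair. The following is a minimal in-memory sketch, not the real
storage backend: the class name ``InMemoryMetadataSources``, the fetcher
``example-crawler`` and the contents of the ``metadata`` dictionaries are
made up for illustration. Two behaviours are assumptions, since this
specification does not define them: re-adding a known entry keeps the first
one, and looking up an unknown entry returns ``None``.

```python
from typing import Any, Dict, Optional, Tuple

Record = Dict[str, Any]


class InMemoryMetadataSources:
    """Toy stand-in for the storage; not the real backend."""

    def __init__(self) -> None:
        # Authorities are keyed by (type, url), fetchers by (name, version).
        self._authorities: Dict[Tuple[str, str], Record] = {}
        self._fetchers: Dict[Tuple[str, str], Record] = {}

    def metadata_authority_add(self, type: str, url: str, metadata: Record) -> None:
        # Assumption: adding an already-known authority keeps the first entry.
        self._authorities.setdefault(
            (type, url), {"type": type, "url": url, "metadata": metadata}
        )

    def metadata_authority_get(self, type: str, url: str) -> Optional[Record]:
        # Assumption: an unknown authority yields None.
        return self._authorities.get((type, url))

    def metadata_fetcher_add(self, name: str, version: str, metadata: Record) -> None:
        # Assumption: adding an already-known fetcher keeps the first entry.
        self._fetchers.setdefault(
            (name, version), {"name": name, "version": version, "metadata": metadata}
        )

    def metadata_fetcher_get(self, name: str, version: str) -> Optional[Record]:
        # Assumption: an unknown fetcher yields None.
        return self._fetchers.get((name, version))


sources = InMemoryMetadataSources()
sources.metadata_authority_add(
    "registry", "https://www.wikidata.org/", {"label": "Wikidata"}
)
sources.metadata_fetcher_add("example-crawler", "0.0.1", {"timeout_seconds": 30})

found = sources.metadata_authority_get("registry", "https://www.wikidata.org/")
missing = sources.metadata_fetcher_get("example-crawler", "9.9.9")
```

Keying on the pair makes the "at most one" guarantee a property of the data
structure rather than something each caller has to check.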

Origin metadata storage
-----------------------

Extrinsic metadata are stored in SWH's :term:`storage database`.
The storage API offers three endpoints to manipulate origin metadata:

* Adding metadata::

     origin_metadata_add(origin_url, discovery_date,
                         authority, fetcher,
                         format, metadata)

  which adds a new `metadata` byte string obtained from a given authority
  and associated with the origin.
  `authority` must be a dict containing keys `type` and `url`, and
  `fetcher` a dict containing keys `name` and `version`.
  The authority and fetcher must be known to the storage before using this
  endpoint.
  `format` is a text field indicating the format of the content of the
  `metadata` byte string.

* Getting latest metadata::

     origin_metadata_get_latest(origin_url, authority)

  where `authority` must be a dict containing keys `type` and `url`,
  which returns a dictionary corresponding to the latest metadata entry
  added for this origin by the given authority, in the format::

     {
       'authority': {'type': ..., 'url': ...},
       'fetcher': {'name': ..., 'version': ...},
       'discovery_date': ...,
       'format': '...',
       'metadata': b'...'
     }

* Getting all metadata::

     origin_metadata_get(origin_url,
                         authority,
                         after, limit)

  which returns a list of dictionaries, one for each metadata item
  deposited, corresponding to the given origin and obtained from the
  specified authority.
  `authority` must be a dict containing keys `type` and `url`.

  Each of these dictionaries is in the following format::

     {
       'authority': {'type': ..., 'url': ...},
       'fetcher': {'name': ..., 'version': ...},
       'discovery_date': ...,
       'format': '...',
       'metadata': b'...'
     }

The parameters ``after`` and ``limit`` are used for pagination based on the
order defined by the ``discovery_date``.
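
To make the three endpoints and their pagination concrete, here is a minimal
in-memory sketch. It is not the real storage backend, and several points are
assumptions that this specification does not settle: ``after`` is treated as
an exclusive lower bound on ``discovery_date``, results are returned in
ascending date order, an unknown authority or fetcher raises ``ValueError``,
and "latest" means the highest ``discovery_date``. The class name, the origin
URL, the fetcher ``example-crawler`` and the format name ``example-json`` are
hypothetical.

```python
import datetime
from typing import Any, Dict, List, Optional, Set, Tuple

Entry = Dict[str, Any]


class InMemoryOriginMetadata:
    """Toy model of the three origin metadata endpoints."""

    def __init__(self, known_authorities: Set[Tuple[str, str]],
                 known_fetchers: Set[Tuple[str, str]]) -> None:
        self._known_authorities = known_authorities
        self._known_fetchers = known_fetchers
        self._entries: Dict[str, List[Entry]] = {}

    def origin_metadata_add(self, origin_url, discovery_date,
                            authority, fetcher, format, metadata) -> None:
        # The authority and fetcher must already be known to the storage.
        if (authority["type"], authority["url"]) not in self._known_authorities:
            raise ValueError("unknown authority")  # error type is an assumption
        if (fetcher["name"], fetcher["version"]) not in self._known_fetchers:
            raise ValueError("unknown fetcher")  # error type is an assumption
        self._entries.setdefault(origin_url, []).append({
            "authority": {"type": authority["type"], "url": authority["url"]},
            "fetcher": {"name": fetcher["name"], "version": fetcher["version"]},
            "discovery_date": discovery_date,
            "format": format,
            "metadata": metadata,  # opaque bytes, never parsed here
        })

    def origin_metadata_get(self, origin_url, authority,
                            after=None, limit=None) -> List[Entry]:
        key = (authority["type"], authority["url"])
        matching = [
            entry for entry in self._entries.get(origin_url, [])
            if (entry["authority"]["type"], entry["authority"]["url"]) == key
            and (after is None or entry["discovery_date"] > after)
        ]
        matching.sort(key=lambda entry: entry["discovery_date"])
        return matching if limit is None else matching[:limit]

    def origin_metadata_get_latest(self, origin_url, authority) -> Optional[Entry]:
        entries = self.origin_metadata_get(origin_url, authority)
        return entries[-1] if entries else None


def day(n: int) -> datetime.datetime:
    return datetime.datetime(2020, 1, n, tzinfo=datetime.timezone.utc)


gitlab = {"type": "forge", "url": "https://gitlab.com/"}
crawler = {"name": "example-crawler", "version": "0.0.1"}
origin = "https://gitlab.com/example/project"

store = InMemoryOriginMetadata(
    {("forge", "https://gitlab.com/")}, {("example-crawler", "0.0.1")}
)
for n in (1, 2, 3):
    store.origin_metadata_add(
        origin, day(n), gitlab, crawler, "example-json", b'{"day": %d}' % n
    )

# First page of two entries, then the rest, resuming after the last date seen.
page = store.origin_metadata_get(origin, gitlab, after=None, limit=2)
next_page = store.origin_metadata_get(
    origin, gitlab, after=page[-1]["discovery_date"], limit=2
)
latest = store.origin_metadata_get_latest(origin, gitlab)
```

Resuming from the last ``discovery_date`` seen, rather than from a numeric
offset, keeps pages stable when new entries are added between two calls.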

``metadata`` is a byte string (possibly encoded using Base64).
Its format is specific to each authority, and it is treated as an opaque value
by the storage.
Unifying these various formats into a common language is outside the scope
of this specification.
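
Base64 matters when an entry has to travel through a text-only
representation such as JSON, which cannot carry raw bytes. The sketch below
shows the round trip with the Python standard library. The byte string and
the format name ``example-format`` are made up, and JSON is only an assumed
transport, not one mandated by this specification.

```python
import base64
import json

# Arbitrary bytes that are not valid UTF-8, as an opaque metadata blob.
raw = b"\x89 opaque metadata \xff"

# Bytes are not JSON-serializable, so they are encoded as ASCII text first.
payload = json.dumps({
    "format": "example-format",
    "metadata": base64.b64encode(raw).decode("ascii"),
})

# The receiving side decodes the text back into the original bytes.
decoded = base64.b64decode(json.loads(payload)["metadata"])
assert decoded == raw
```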