docs/extrinsic-metadata-specification.rst
- This file was added.

.. _extrinsic-metadata-specification:

Extrinsic metadata specification
================================

:term:`Extrinsic metadata` is information about software that is not part
of the source code itself but still closely related to the software.
Typical sources for extrinsic metadata are: the hosting place of a
repository, which can offer metadata via its web view or API; external
registries like collaborative curation initiatives; and out-of-band
information available at source code archival time.

Since they are not part of the source code, a dedicated mechanism to fetch
and store them is needed.

This specification assumes the reader is familiar with Software Heritage's
:ref:`architecture` and :ref:`data-model`.

Metadata sources
----------------

Authorities
^^^^^^^^^^^

Metadata authorities are entities that provide metadata about an
:term:`origin`. Metadata authorities include: code hosting places,
:term:`deposit` submitters, and registries (e.g. Wikidata).

An authority is uniquely defined by these properties:

* its type, representing the software/database from which metadata is
  extracted (e.g. ``gitlab``, ``wikidata``, ``hal``).
* its URL, which unambiguously identifies an instance of the authority type.

Examples:

=============== =================================
type            url
=============== =================================
deposit         https://hal.archives-ouvertes.fr/
deposit         https://hal.inria.fr/
deposit         https://software.intel.com/
gitlab          https://gitlab.com/
gitlab          https://gitlab.inria.fr/
gitlab          https://0xacab.org/
github          https://github.com/
wikidata        https://www.wikidata.org/
swmath          https://swmath.org/
ascl.net        http://ascl.net/
=============== =================================
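
As an illustration of the "uniquely defined by type and URL" rule, the following minimal sketch models an authority as an immutable value whose identity is the ``(type, url)`` pair. The ``MetadataAuthority`` class is hypothetical and is not part of the storage API; it only shows that two instances of the same forge software count as distinct authorities.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetadataAuthority:
    """Hypothetical value type: identity is exactly the (type, url) pair."""

    type: str
    url: str


gitlab_main = MetadataAuthority(type="gitlab", url="https://gitlab.com/")
gitlab_inria = MetadataAuthority(type="gitlab", url="https://gitlab.inria.fr/")

# Same type, different URL: two distinct authorities.
distinct = gitlab_main != gitlab_inria

# Same (type, url) pair: the same authority, so a set keeps a single entry.
authorities = {
    gitlab_main,
    gitlab_inria,
    MetadataAuthority(type="gitlab", url="https://gitlab.com/"),
}
```

A frozen dataclass is used here because it derives equality and hashing from its fields, which matches the idea that no other property takes part in an authority's identity.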

Metadata fetchers
^^^^^^^^^^^^^^^^^

Metadata fetchers are software components used to fetch metadata from
a metadata authority, and ingest them into the Software Heritage archive.

A metadata fetcher is uniquely defined by these properties:

* its type
* its version

Examples:

* :term:`loaders <loader>`, which may either discover metadata as a
  side-effect of loading source code, or be dedicated to fetching metadata.

* :term:`listers <lister>`, which may discover metadata as a side-effect
  of discovering origins.

* :term:`deposit` submitters, which push metadata to SWH from a
  third party, usually at the same time as a :term:`software artifact`.

* crawlers, which fetch metadata from an authority in a way that is
  none of the above (e.g. by querying a specific API of the origin's forge).

Storage API
~~~~~~~~~~~

Authorities and metadata fetchers
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The :term:`storage` API offers these endpoints to manipulate metadata
authorities and metadata fetchers:

* ``metadata_authority_add(type, url, metadata)``
  which adds a new metadata authority to the storage.

* ``metadata_authority_get(type, url)``
  which looks up an authority (there is at most one for a given ``type``
  and ``url``) and, if it is known, returns a dictionary with keys
  ``type``, ``url``, and ``metadata``.

* ``metadata_fetcher_add(name, version, metadata)``
  which adds a new metadata fetcher to the storage.

* ``metadata_fetcher_get(name, version)``
  which looks up a fetcher (there is at most one for a given ``name``
  and ``version``) and, if it is known, returns a dictionary with keys
  ``name``, ``version``, and ``metadata``.

These ``metadata`` fields contain JSON-encodable dictionaries
with information about the authority/fetcher, in a format specific to each
authority/fetcher.
For authorities, the ``metadata`` field is reserved for information describing
and qualifying the authority.
For fetchers, the ``metadata`` field is reserved for configuration metadata
and other technical usage.
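
The four endpoints above can be sketched with a small in-memory stand-in. This is not the Software Heritage storage implementation: the class name, the fetcher name ``example-crawler``, the sample ``metadata`` dictionaries, and two behaviours the specification leaves open (returning ``None`` for an unknown entry, and overwriting on a repeated add) are assumptions made for illustration only.

```python
from typing import Any, Dict, Optional, Tuple

Key = Tuple[str, str]


class InMemoryMetadataRegistry:
    """Hypothetical in-memory stand-in for the authority/fetcher endpoints."""

    def __init__(self) -> None:
        self._authorities: Dict[Key, Dict[str, Any]] = {}
        self._fetchers: Dict[Key, Dict[str, Any]] = {}

    def metadata_authority_add(self, type: str, url: str,
                               metadata: Dict[str, Any]) -> None:
        # Assumption: adding an existing (type, url) pair overwrites it.
        self._authorities[(type, url)] = {
            "type": type, "url": url, "metadata": metadata,
        }

    def metadata_authority_get(self, type: str,
                               url: str) -> Optional[Dict[str, Any]]:
        # Assumption: None signals an unknown authority.
        return self._authorities.get((type, url))

    def metadata_fetcher_add(self, name: str, version: str,
                             metadata: Dict[str, Any]) -> None:
        # Assumption: adding an existing (name, version) pair overwrites it.
        self._fetchers[(name, version)] = {
            "name": name, "version": version, "metadata": metadata,
        }

    def metadata_fetcher_get(self, name: str,
                             version: str) -> Optional[Dict[str, Any]]:
        # Assumption: None signals an unknown fetcher.
        return self._fetchers.get((name, version))


registry = InMemoryMetadataRegistry()
registry.metadata_authority_add(
    "gitlab", "https://gitlab.inria.fr/", {"label": "Inria GitLab"})
registry.metadata_fetcher_add(
    "example-crawler", "0.1.0", {"timeout_seconds": 30})

known = registry.metadata_authority_get("gitlab", "https://gitlab.inria.fr/")
unknown = registry.metadata_authority_get("gitlab", "https://gitlab.example.org/")
fetcher = registry.metadata_fetcher_get("example-crawler", "0.1.0")
```

Keying each table on the identifying pair makes the "at most one" guarantee structural: a lookup can never return more than one entry.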

Origin metadata storage
-----------------------

Extrinsic metadata are stored in SWH's :term:`storage database`.
The storage API offers three endpoints to manipulate origin metadata:

* Adding metadata::

     origin_metadata_add(origin_url, discovery_date,
                         authority, fetcher,
                         format, metadata)

  which adds a new ``metadata`` byte string obtained from a given authority
  and associated with the origin.
  ``authority`` must be a dict containing keys ``type`` and ``url``, and
  ``fetcher`` a dict containing keys ``name`` and ``version``.
  The authority and fetcher must be known to the storage before using this
  endpoint.
  ``format`` is a text field indicating the format of the content of the
  ``metadata`` byte string.

* Getting latest metadata::

     origin_metadata_get_latest(origin_url, authority)

  which returns a dictionary corresponding to the latest metadata entry
  added for this origin from the given authority.
  ``authority`` must be a dict containing keys ``type`` and ``url``.
  The returned dictionary has the following format::

     {
       'authority': {'type': ..., 'url': ...},
       'fetcher': {'name': ..., 'version': ...},
       'discovery_date': ...,
       'format': '...',
       'metadata': b'...'
     }

* Getting all metadata::

     origin_metadata_get(origin_url,
                         authority,
                         after, limit)

  which returns a list of dictionaries, one for each metadata item
  stored, corresponding to the given origin and obtained from the
  specified authority.
  ``authority`` must be a dict containing keys ``type`` and ``url``.

  Each of these dictionaries is in the following format::

     {
       'authority': {'type': ..., 'url': ...},
       'fetcher': {'name': ..., 'version': ...},
       'discovery_date': ...,
       'format': '...',
       'metadata': b'...'
     }

The parameters ``after`` and ``limit`` are used for pagination based on the
order defined by the ``discovery_date``.
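
To make the three origin metadata endpoints and their pagination concrete, here is a minimal in-memory sketch. It is not the actual storage backend. It assumes that ``after`` is a ``discovery_date`` and that only strictly later entries are returned, that ``limit`` caps the page size, and that ``None`` is returned when no entry exists; none of these details are fixed by the text above. The origin URL, fetcher name, and format label are hypothetical, and the check that the authority and fetcher are already registered is omitted.

```python
import datetime
from typing import Any, Dict, List, Optional, Tuple


class InMemoryOriginMetadata:
    """Hypothetical in-memory model of the origin metadata endpoints."""

    def __init__(self) -> None:
        # Each item pairs an origin URL with one stored metadata entry.
        self._entries: List[Tuple[str, Dict[str, Any]]] = []

    def origin_metadata_add(self, origin_url, discovery_date,
                            authority, fetcher, format, metadata) -> None:
        # Sketch only: a real storage would first verify that the
        # authority and fetcher are already known.
        self._entries.append((origin_url, {
            "authority": authority,
            "fetcher": fetcher,
            "discovery_date": discovery_date,
            "format": format,
            "metadata": metadata,
        }))

    def _matching(self, origin_url, authority) -> List[Dict[str, Any]]:
        rows = [entry for url, entry in self._entries
                if url == origin_url and entry["authority"] == authority]
        return sorted(rows, key=lambda entry: entry["discovery_date"])

    def origin_metadata_get_latest(self, origin_url,
                                   authority) -> Optional[Dict[str, Any]]:
        rows = self._matching(origin_url, authority)
        return rows[-1] if rows else None

    def origin_metadata_get(self, origin_url, authority,
                            after=None, limit=100) -> List[Dict[str, Any]]:
        rows = self._matching(origin_url, authority)
        if after is not None:
            # Assumption: "after" is exclusive.
            rows = [e for e in rows if e["discovery_date"] > after]
        return rows[:limit]


authority = {"type": "gitlab", "url": "https://gitlab.com/"}
fetcher = {"name": "example-crawler", "version": "0.1.0"}
origin = "https://gitlab.com/example/project"  # hypothetical origin URL

store = InMemoryOriginMetadata()
for day in (1, 2, 3):
    date = datetime.datetime(2020, 1, day, tzinfo=datetime.timezone.utc)
    store.origin_metadata_add(origin, date, authority, fetcher,
                              "example-format", b"entry-%d" % day)

# Walk the entries two at a time, resuming after the last date seen.
first_page = store.origin_metadata_get(origin, authority, limit=2)
second_page = store.origin_metadata_get(
    origin, authority, after=first_page[-1]["discovery_date"], limit=2)
latest = store.origin_metadata_get_latest(origin, authority)
```

Passing the last ``discovery_date`` of one page as ``after`` for the next call is one plausible way for a client to iterate; with an exclusive bound, two entries sharing the same date could be skipped across a page boundary, so a real implementation may need a tie-breaker.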

``metadata`` is a byte string (possibly encoded using Base64).
Its format is specific to each authority, and it is treated as an opaque value
by the storage.
Unifying these various formats into a common language is outside the scope
of this specification.
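
Since ``metadata`` is an arbitrary byte string, it cannot be placed in a JSON document as-is. The sketch below shows one way a client might apply the Base64 encoding mentioned above to carry the value through a text-only channel and recover it unchanged. The payload and the ``example-format`` label are made up for illustration, and nothing here prescribes how the storage itself serializes the value.

```python
import base64
import json

# Arbitrary bytes, including values that are not valid UTF-8.
raw = b"\x00\xffopaque authority payload"

# Encode to ASCII text so the value fits in a JSON string.
encoded = base64.b64encode(raw).decode("ascii")
wire = json.dumps({"format": "example-format", "metadata": encoded})

# The receiving side reverses the encoding and gets the same bytes back.
decoded = base64.b64decode(json.loads(wire)["metadata"])
round_trip_ok = decoded == raw
```

Because the storage treats the value as opaque, the round trip has to be byte-exact; no interpretation of the content happens at this layer.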