docs/extrinsic-metadata-specification.rst
- This file was added.
.. _extrinsic-metadata-specification:

Extrinsic metadata specification
================================

:term:`Extrinsic metadata` is information about software that is not part
of the source code itself but still closely related to the software.
Usually it is available on the web view of a repository's forge and its API
or an external registry.

moranegg: It can be available also as part of a deposit.

[Done] zack: I propose "Typical sources for extrinsic metadata are: the hosting place of a repository, which can offer metadata via its web view or API; external registries like collaborative curation initiatives; and out-of-band information provided at source code archival time." That should address @moranegg concern and it feels pretty clear to (the biased) me.

Since they are not part of the source code, we need a separate mechanism
to fetch and store them.

[Done] zack: s/provided/available/ (sorry, this issue come from my suggestion, i know, but it didn't make sense that way :))

[Done] zack: "Since they are not part of the source code, a dedicated mechanism to fetch and store them is needed."

This specification assumes the reader is familiar with Software Heritage's
:ref:`architecture` and :ref:`data-model`.

[Done] zack: missing trailing '.'
Metadata sources
----------------

Authorities
^^^^^^^^^^^

Metadata authorities are moral entities that provide metadata about a
:term:`origin`. Metadata authorities include: code hosts,
:term:`deposit clients <deposit>`, and registries (eg. Wikidata).

[Done] moranegg: they can provide metadata about different software artifacts, here we deal with origins, but the authorities aren't specifically related to origins.

[Done] vlorentz: We don't support any object other than origins yet. If you think we should support other objects, I can amend the last part of spec accordingly, but I don't think we need it yet.

[Done] moranegg: ok. change `a origin` to `an origin`.

[Done] zack: s/code hosts/code hosting places/

[Done] zack: It's not the deposit client that has the metadata, that is just a dumb software component; it's the person doing the deposit who has them. Hence, I suggest to use "deposit submitters" here instead.

[Done] zack: "moral entities" is a false friend from french; just use "entities", I guess?

An authority is uniquely defined by these properties:

* its name, representing the software/database from which metadata is
  extracted (eg. `gitlab`, `wikidata`, `hal`).
* its URL, which unambiguously identifies an instance of the provider.

[Not Done] moranegg: provider isn't authority now?

Examples:

=============== =================================
name            url
=============== =================================
hal             https://hal.archives-ouvertes.fr/
swh             https://www.softwareheritage.org/
gitlab          https://gitlab.com/
gitlab          https://gitlab.com/
wikidata        https://www.wikidata.org/
=============== =================================

[Not Done] zack: - the gitlab rows should be about two different instances, e.g. the main one and the inria one…

[Done] vlorentz:
> * i don't understand the swh row
That was a typo
> * we want an example (better: two) of…
Do you have example URLs for the deposit you want to use for the deposit?

[Not Done] zack:
> Do you have example URLs for the deposit you want to use for the deposit?
I personally don't. Maybe @moranegg does? Alternatively, we can just provide a sample deposit URL with '...' where applicable.

[Not Done] moranegg: https://hal.inria.fr/ and https://hal.archives-ouvertes.fr/
I'm not sure if this is what you…

[Done] vlorentz: What is the difference between `(name=hal, url= https://hal.archives-ouvertes.fr/)` and `(name=deposit, url= https://hal.archives-ouvertes.fr/)`?

[Not Done] moranegg: Here is the paste for a full table example:
https://forge.softwareheritage.org/P457

[Not Done] moranegg: The following idea surfaced during discussion:
keeping only the URL and metametadata to…
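
To make the identity rule for authorities concrete, here is a minimal Python sketch. It assumes a hypothetical ``MetadataAuthority`` class (the class name is illustrative and not part of the specification) whose identity is exactly the (name, url) pair. The ``gitlab.example.org`` URL is a made-up second instance used only for the comparison.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetadataAuthority:
    """Hypothetical value object: identity is the (name, url) pair."""

    name: str  # software/database the metadata is extracted from
    url: str   # identifies one instance of that software/database


hal = MetadataAuthority("hal", "https://hal.archives-ouvertes.fr/")
gitlab_main = MetadataAuthority("gitlab", "https://gitlab.com/")
# Made-up URL standing in for a second GitLab instance.
gitlab_other = MetadataAuthority("gitlab", "https://gitlab.example.org/")

# Same name but different URL: these are distinct authorities.
distinct = {hal, gitlab_main, gitlab_other}

# Same name and same URL: this is the same authority.
same = gitlab_main == MetadataAuthority("gitlab", "https://gitlab.com/")
```

A frozen dataclass is used here because it gives equality and hashing over both fields for free, which matches "uniquely defined by these properties".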

Tools
^^^^^

Metadata fetching tools are software components used to fetch metadata from
a metadata authority, and ingest them into the Software Heritage archive.

[Done] zack: Having a non ambiguous name here would be helpful to streamline language. From this text I'm assuming you'd be ok with "metadata fetcher"? (It's OK with me.) Hence, s/Metadata fetching tools/*Metadata fetchers*/ here.

A tool is uniquely defined by these properties:

* its name
* its version

Examples:

* :term:`loaders <loader>`, which may either discover metadata as a
  side-effect of loading source code, or be dedicated to fetching metadata.

[Not Done] moranegg: I don't think a loader will be dedicated to fetching metadata, because it is dedicated to fetch code and if it changes functionality, it should be a different tool.

[Not Done] zack: A loader here is consistent with what we discussed f2f though, at least IIRC. The idea was that you might have a generic "git loader", and sub-class it (or whatever) into a "gitlab loader", a "github loader", etc. While the most generic one will only load source code artifacts, the host-specific instances will also fetch extrinsic metadata. TL;DR: this seems correct to me. (and is also consistent with the lister example just below)

* :term:`listers <lister>`, which may discover metadata as a side-effect
  of discovering origins.
* :term:`deposit clients <deposit>`, which push metadata to SWH from a
  third-party; usually at the same time as a :term:`software artifact`.

[Done] zack: Echoing my previous comment, the authority here is the deposit submitter. As we don't have their identity, for the purpose of the authority table we should probably just use "deposit" here.

* gatherers, which fetch metadata from an authority in a way that is
  none of the above (eg. by querying a specific API of the origin's forge).

[Done] moranegg: is there a reason `gatherers` doesn't have the `:term:` item?

[Done] vlorentz: Because there is no "gatherer" entry in the glossary yet.

[Done] moranegg: ack.

[Done] zack: gatherer v. fetcher starts becoming clumsy.
How about "metadata crawler" here?
I'm open to other suggestions if that doesn't work…

[Done] vlorentz: Much better indeed, thanks!

Storage API
~~~~~~~~~~~

Authorities and tools
^^^^^^^^^^^^^^^^^^^^^

The :term:`storage` API offers these endpoints to manipulate metadata
authorities and tools:

* ``metadata_authority_add(name, url, type, metadata)``
  which adds a new metadata authority to the storage.

* ``metadata_authority_get_by(name, url)``
  which looks up a known authority (there is at most one) and if it is
  known, returns a dictionary with keys ``name``, ``url``, and ``metadata``.

[Done] zack: what does the "_by" adds here? wouldn't `metadata_authority_get` be better/clearer?

[Done] vlorentz: Indeed. It made sense when `(name, url)` was not an intrinsic identifier, but it's no longer true.

* ``metadata_tool_add(name, version, metadata)``
  which adds a new metadata tool to the storage.

* ``metadata_tool_get_by(name, version)``
  which looks up a known tool (there is at most one) and if it is
  known, returns a dictionary with keys ``name``, ``version``, and ``metadata``.

[Done] zack: "_by" → ditto

These ``metadata`` fields contain arbitrary JSON-encodable dictionaries
with information about the authority/tool, in a format specific to each
authority/tool.
These fields are reserved for future uses; currently they should always be
empty.

[Done] moranegg: I don't like `future uses` and `arbitrary`.
I propose delete `arbitrary` and change to:
>…
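
The four endpoints above can be sketched with a small in-memory stand-in. This is an illustration under stated assumptions, not the actual storage implementation: the ``InMemoryStorage`` class, the ``"forge"`` value passed as ``type``, and the ``example-fetcher`` tool are all hypothetical. The text does not say what a lookup returns for an unknown key or what happens when a key is added twice; this sketch returns ``None`` and keeps the first record.

```python
class InMemoryStorage:
    """Hypothetical in-memory stand-in for the authority/tool endpoints."""

    def __init__(self):
        self._authorities = {}  # (name, url) -> {"type": ..., "metadata": ...}
        self._tools = {}        # (name, version) -> {"metadata": ...}

    def metadata_authority_add(self, name, url, type, metadata):
        # Assumption: re-adding an existing (name, url) keeps the first record.
        self._authorities.setdefault(
            (name, url), {"type": type, "metadata": metadata}
        )

    def metadata_authority_get_by(self, name, url):
        record = self._authorities.get((name, url))
        if record is None:
            return None  # assumption: unknown authorities yield None
        return {"name": name, "url": url, "metadata": record["metadata"]}

    def metadata_tool_add(self, name, version, metadata):
        # Assumption: re-adding an existing (name, version) keeps the first record.
        self._tools.setdefault((name, version), {"metadata": metadata})

    def metadata_tool_get_by(self, name, version):
        record = self._tools.get((name, version))
        if record is None:
            return None  # assumption: unknown tools yield None
        return {"name": name, "version": version, "metadata": record["metadata"]}


storage = InMemoryStorage()
# The metadata dictionaries are empty, as the text currently requires.
storage.metadata_authority_add("gitlab", "https://gitlab.com/", "forge", {})
storage.metadata_tool_add("example-fetcher", "0.0.1", {})

found = storage.metadata_authority_get_by("gitlab", "https://gitlab.com/")
missing = storage.metadata_tool_get_by("example-fetcher", "9.9.9")
```

Keying each dictionary on the identifying pair is what enforces "there is at most one" match per lookup.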

Origin metadata storage
-----------------------

Extrinsic metadata are stored in SWH's :term:`storage database`.
The storage API offers three endpoints to manipulate origin metadata:

* Adding metadata::

      origin_metadata_add(origin_url, discovery_date,
                          authority_name, authority_url,
                          tool_name, tool_version,
                          metadata)

  which adds a new ``metadata`` byte string obtained from a given authority
  and associated to the origin.
  The authority and tool must be known to the storage before using this
  endpoint.

[Done] zack: given they're (correctly) grouped together in the result dictionary, maybe you want to also group them together here: authority_name/authority_url (as a pair) and same thing for fetcher_name/fetcher_version (unless, dunno, this is mapped to an HTTP API somewhere and it's easier to avoid the packing). Conceptually they are really two things, a fetcher and an authority, so it'd make sense to have 2 args instead of 4.

[Done] vlorentz: Good point. There is no HTTP API; and I'm using byte strings so it's not possible in JSON anyway.

* Getting latest metadata::

      origin_metadata_get_latest(origin_url,
                                 authority_name, authority_url)

  which returns a dictionary::

      {
          'authority': {'name': ..., 'url': ...},
          'tool': {'name': ..., 'version': ...},
          'discovery_date': ...,
          'metadata': b'...'
      }

  corresponding to the latest metadata entry added for this origin.

* Getting all metadata::

      origin_metadata_get(origin_url,
                          authority_name, authority_url,
                          after, limit)

  which returns a list of dictionaries::

      {
          'authority': {'name': ..., 'url': ...},
          'tool': {'name': ..., 'version': ...},
          'discovery_date': ...,
          'metadata': b'...'
      }

  one for each metadata item deposited, corresponding to the given origin
  and obtained from the specified authority.

[Done] moranegg: I think a `[ ]` and adding a second origin_metadata entry, can clarify. `[{ 'authority': {'name': ..., 'url': ...}, 'tool': {'name': ..., 'version': ...}, 'discovery_date': ..., 'metadata': b'...' }, { 'authority': {'name': ..., 'url': ...}, 'tool': {'name': ..., 'version': ...}, 'discovery_date': ..., 'metadata': b'...' }]`

[Done] moranegg: the term `deposited` is too connected to the deposit and here seems that you talk about all authorities. Did you mean, that a list of the latest origin_metadata entries for a given authority is returned instead of all origin_metadata entries? also, this explanation should come before the example.

[Done] vlorentz:
> Did you mean, that a list of the latest origin_metadata entries for a given authority is…
No, it returns all of them, but paginated
It's not an example, it's the format of the output

[Done] moranegg:
> No, it returns all of them, but paginated
Is that really useful? to have it all?
> It's not…
This explanation should come before the format output :-)

[Done] vlorentz:
> Is that really useful? to have it all?
Yes, for the same reason we can get the list of snapshots of an origins.
* shrug *

The parameters ``after`` and ``limit`` are used for pagination based on the
order defined by the ``discovery_date``.
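
The three origin metadata endpoints and their pagination can be sketched as follows. This is only an illustration: ``OriginMetadataStore`` is a hypothetical in-memory class, the origin URL, tool name and dates are made up, and the check that the authority and tool are already known is left out. The text does not pin down the exact semantics of ``after``; the sketch assumes it is a ``discovery_date`` and that only strictly later entries are returned, in ascending date order.

```python
import datetime


class OriginMetadataStore:
    """Hypothetical in-memory sketch; not the actual storage backend."""

    def __init__(self):
        self._entries = []  # list of (origin_url, entry) pairs

    def origin_metadata_add(self, origin_url, discovery_date,
                            authority_name, authority_url,
                            tool_name, tool_version, metadata):
        entry = {
            "authority": {"name": authority_name, "url": authority_url},
            "tool": {"name": tool_name, "version": tool_version},
            "discovery_date": discovery_date,
            "metadata": metadata,  # opaque byte string
        }
        self._entries.append((origin_url, entry))

    def _sorted_entries(self, origin_url, authority_name, authority_url):
        authority = {"name": authority_name, "url": authority_url}
        matches = [entry for url, entry in self._entries
                   if url == origin_url and entry["authority"] == authority]
        return sorted(matches, key=lambda entry: entry["discovery_date"])

    def origin_metadata_get_latest(self, origin_url,
                                   authority_name, authority_url):
        matches = self._sorted_entries(origin_url, authority_name, authority_url)
        return matches[-1] if matches else None  # assumption: None if empty

    def origin_metadata_get(self, origin_url,
                            authority_name, authority_url,
                            after, limit):
        matches = self._sorted_entries(origin_url, authority_name, authority_url)
        if after is not None:
            # Assumption: `after` is exclusive.
            matches = [e for e in matches if e["discovery_date"] > after]
        return matches[:limit]


AUTHORITY = ("gitlab", "https://gitlab.com/")
ORIGIN = "https://gitlab.com/example/project"  # made-up origin URL

store = OriginMetadataStore()
for day, payload in [(1, b"v1"), (2, b"v2"), (3, b"v3")]:
    date = datetime.datetime(2020, 1, day, tzinfo=datetime.timezone.utc)
    store.origin_metadata_add(ORIGIN, date, *AUTHORITY,
                              "example-fetcher", "0.0.1", payload)

# Walk the results two at a time, resuming from the last date seen.
page1 = store.origin_metadata_get(ORIGIN, *AUTHORITY, after=None, limit=2)
page2 = store.origin_metadata_get(
    ORIGIN, *AUTHORITY, after=page1[-1]["discovery_date"], limit=2)
latest = store.origin_metadata_get_latest(ORIGIN, *AUTHORITY)
```

Resuming from the last ``discovery_date`` of the previous page is one simple way to use ``after``; it would skip entries if two of them shared the same date, which a real implementation would need to handle.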

``metadata`` is a bytes array (possibly encoded using Base64).
Its format is specific to each authority, and it is treated as an opaque value
by the storage.
Unifying these various formats into a common language is outside the scope
of this specification.
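
As one illustration of where Base64 might come into play: raw bytes cannot be embedded directly in a JSON document, so a caller that needs a text-only transport could encode them first. The sketch below uses only the Python standard library; the ``{"metadata": ...}`` envelope and the sample payload are hypothetical and not defined by this specification.

```python
import base64
import json

# Arbitrary bytes; the storage treats them as opaque, so they need not be
# valid UTF-8 (the trailing 0xff byte is not).
raw = b'{"name": "example"}\xff'

# Encode to an ASCII string so the value can travel inside JSON.
encoded = base64.b64encode(raw).decode("ascii")
envelope = json.dumps({"metadata": encoded})

# The receiving side reverses the two steps and gets the same bytes back.
decoded = base64.b64decode(json.loads(envelope)["metadata"])
```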