docs/extrinsic-metadata-specification.rst
- This file was added.
.. _extrinsic-metadata-specification:

Extrinsic metadata specification
================================

:term:`Extrinsic metadata` is information about software that is not part
of the source code itself but is still closely related to the software.
It is usually available through the web view or API of a repository's forge,
or through an external registry.

moranegg: maybe we should add that metadata on registries is 'extrinsic' metadata

moranegg: > is information about a software artifact that is not included
> in the source code itself but…
I changed the text a bit, introducing two new terms:
This new remark, if accepted by @zack, will affect the rest of the document, dividing the specification between these 2 categories.

Since this metadata is not part of the source code, we need a separate
mechanism to fetch and store it.

This specification assumes the reader is familiar with Software Heritage's
:ref:`architecture` and :ref:`data-model`.
Metadata providers
------------------

Definition
~~~~~~~~~~

zack: aside from the various comments on terminology already raised, can't we just refer to the glossary, and avoid re-explaining what origins/loaders/listers are? there are already prerequisites for this doc, so...

We define five types of metadata providers:

zack: Before defining the types, you need to define the notion of metadata provider here, I think, which is just a title in this version of the spec. One phrase would suffice, but is needed.

* :term:`loaders <loader>`, which are the components dedicated to fetching
  the source code from origins (VCS repositories, distribution packages,
  ...). They may either discover metadata as a side-effect of loading
  source code, or be dedicated to fetching metadata.

douardda: > and may discover metadata as a side-effect
this is a bit weird, not sure it is necessary. The list is described as
but today listers are not really metadata providers, unless we consider origins as metadata. Wouldn't it be simpler to not consider a lister as a metadata provider? It looks like it's in fact a 'gatherer', according to the definition below, that is described here. Then (but it is an implementation detail, not an architectural aspect), both the gatherer and the lister could be done at the same time by the same worker.

moranegg: this depends on which component (yes, implementation detail) is in use when retrieving metadata. If we decide that a new component is needed it will be a new provider_type. Maybe keep these potential provider_types to help decide which should be implemented.

vlorentz: > today listers are not really metadata providers,
Indeed they are not. This spec describes what we want to do, not the current state.
"gatherer" is a concept I introduce as a catch-all for everything that is neither a lister, loader, nor deposit. I did not yet check how much metadata listers and loaders can fetch. If it happens they can't get a lot of metadata, I'll drop them from the list and replace them with gatherers. (EDIT: like moranegg just said ^^)

douardda: "origins" here has not been defined, nor has it been said that origins are in fact what a lister produces. Adding a sentence in the 'lister' definition stating this fact.

vlorentz: origins are part of the SWH data model, which I assume known by readers of the spec. I should probably mention it at the beginning.

moranegg: At first I thought that we might see metadata when loading, but due to our recent discussion, maybe this is not the case and only the software artifact itself is fetched without additional information.

douardda: > They may either discover metadata as a side-effect of loading source code, or be dedicated to…
Kinda same as above: we are defining these components (i.e. what they are, not what they can be) so this 'it could be used to do other stuff "as a side effect"' should not be in there. This "prospective" part should be in a dedicated paragraph/list/section.

moranegg: For now the only provider_type that is not a "prospective" is deposit_client
> It isn't a…
* :term:`listers <lister>`, which are the components of SWH dedicated to
  discovering origins on known websites/forges, and which may discover
  metadata as a side-effect

douardda: > meaning
is a bit weird here. It does not appear like we are defining the deposit as a metadata provider.

moranegg: maybe should be:
> :term:`deposit clients <deposit>`, are the clients of the deposit…

moranegg: we will provide in the future the possibility to deposit only metadata about an artifact in the archive
* :term:`deposit clients <deposit>`, which push metadata to SWH from a
  third party, usually at the same time as a :term:`software artifact`
* gatherers, which fetch metadata from an authoritative source of the
  repository (e.g. its website or forge) in a way that is none of the three
  above (e.g. by querying a specific API of the origin's forge)
* registries, which fetch data from non-authoritative databases, meaning
  databases that are not directly referenced by the origin's
  website/forge/... (e.g. Wikidata)

zack: We're inconsistent in the type of entities we mention in this list. Three of them (loaders, listers, gatherers) are active components that do stuff to retrieve metadata, whereas two of them (registries, deposit clients) are passive information stores that need to be actively consulted to retrieve metadata. That should be made uniform. Bonus point: get rid of the fact that it is a deposit client; that's just an implementation detail. The metadata is part of the deposit, no matter what piece of software has been used to transfer it to SWH.
A provider is uniquely defined by these two properties:

* its name, representing the software/database from which metadata is
  extracted (e.g. `gitlab`, `wikidata`, `hal`); each provider name
  matches a component of SWH dedicated to getting data from it.
* its URL, which unambiguously identifies an instance of the provider.

zack: general writing suggestion: to avoid the inconsistencies between preambles/suffixes/lists, just avoid repeating how many properties are in the list; it's annoying to maintain and doesn't add any value

douardda: you describe the list as
> defined by these three properties:
but only list 2 elements. And the 'name' seems to correspond to the 'type' column in the examples below. The type/name/instance/xxx family of terms needs to be clarified IMHO.

moranegg: You are right.
- type and name are mixed up in the table below
- type is 'deposit_client'…
one = name

vlorentz: Yeah, I should say "two properties". I used to count the `type`, but it turns out it's not part of the unique identifier.
Example providers:

=============== =============== =================================
type            name            url
=============== =============== =================================
deposit_client  hal             https://hal.archives-ouvertes.fr/
deposit_client  swh             https://www.softwareheritage.org/
lister          gitlab_lister   https://gitlab.com/
loader          gitlab_loader   https://gitlab.com/
registry        wikidata        https://www.wikidata.org/
=============== =============== =================================

moranegg (on the header row): Still type and name are mixed up...
deposit_client is type
hal is name
when registry is type…

zack: I don't think this ontology based on three properties is working well, but I cannot yet pinpoint what's the key problem. Meanwhile, here are some symptoms which are IMO red flags:
the registry line looks fine, but overall I can't shake the feeling that we haven't found the right general model yet

moranegg (on the `hal` row): switch

moranegg (on the `swh` row): switch

moranegg (on the `gitlab_lister` row): all the rest are in the correct order
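The uniqueness rule above (a provider is identified by its name and URL, with the type left out of the identifier) can be illustrated with a small sketch. The `MetadataProvider` class and its `key` property are hypothetical names chosen for this example, not part of the specification; the rows are copied from the example table.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class MetadataProvider:
    """Hypothetical record for one row of the example table."""

    type: str
    name: str
    url: str

    @property
    def key(self) -> Tuple[str, str]:
        # Only (name, url) identifies a provider; the type is not part
        # of the unique identifier.
        return (self.name, self.url)


providers = [
    MetadataProvider("deposit_client", "hal", "https://hal.archives-ouvertes.fr/"),
    MetadataProvider("deposit_client", "swh", "https://www.softwareheritage.org/"),
    MetadataProvider("lister", "gitlab_lister", "https://gitlab.com/"),
    MetadataProvider("loader", "gitlab_loader", "https://gitlab.com/"),
    MetadataProvider("registry", "wikidata", "https://www.wikidata.org/"),
]

# Index the providers by their unique (name, url) key.
index: Dict[Tuple[str, str], MetadataProvider] = {p.key: p for p in providers}
```

Note that `gitlab_lister` and `gitlab_loader` share a URL but remain distinct providers because their names differ.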
Storage API
~~~~~~~~~~~

The :term:`storage` API offers two endpoints to manipulate metadata
providers:

* `metadata_provider_add(name, url, type, metadata)`
  which adds a new metadata provider to the storage.
* `metadata_provider_get_by(name, url)`
  which looks up a known provider (there is at most one) and, if it is
  known, returns a dictionary with keys `name`, `url`, `type`, and `metadata`.

`metadata` is an arbitrary JSON-encodable dictionary with information
about the provider, in a format specific to each provider name.
This field is reserved for future use; currently it should always be empty.

douardda: Do we really want these tables in the storage?

vlorentz: Yes. They will be part of the archive. (And they already are.)

zack: minor: either "looks up a" or "looks for a"

moranegg: > in a format specific to each provider type
I think we should use the same metadata schema for each provider type

vlorentz: `swh-storage` should store data as close to the original source as possible. If we did the translation before sending to swh-storage, then a bug in the translation means all existing data is corrupted.

moranegg: I now understand why you thought I didn't want the raw metadata, but here we discuss the information about the provider and not the fetched metadata.
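To make the two provider endpoints concrete, here is a minimal in-memory sketch. It is not the swh-storage implementation: the class name `InMemoryProviderStorage` is made up for this example, and the behaviour on duplicates (raising `ValueError`) and on unknown providers (returning `None`) is an assumption, since the specification does not state either.

```python
import json
from typing import Any, Dict, Optional, Tuple


class InMemoryProviderStorage:
    """Toy stand-in for the storage, keyed by (name, url)."""

    def __init__(self) -> None:
        self._providers: Dict[Tuple[str, str], Dict[str, Any]] = {}

    def metadata_provider_add(
        self, name: str, url: str, type: str, metadata: Dict[str, Any]
    ) -> None:
        # json.dumps raises TypeError if the value is not JSON-encodable.
        json.dumps(metadata)
        key = (name, url)
        if key in self._providers:
            # Assumption: adding the same (name, url) twice is an error.
            raise ValueError(f"provider already known: {key}")
        self._providers[key] = {
            "name": name,
            "url": url,
            "type": type,
            "metadata": metadata,
        }

    def metadata_provider_get_by(
        self, name: str, url: str
    ) -> Optional[Dict[str, Any]]:
        # Assumption: an unknown provider yields None.
        provider = self._providers.get((name, url))
        return None if provider is None else dict(provider)


storage = InMemoryProviderStorage()
# The metadata field is reserved for future use, so it is left empty.
storage.metadata_provider_add("wikidata", "https://www.wikidata.org/", "registry", {})
known = storage.metadata_provider_get_by("wikidata", "https://www.wikidata.org/")
unknown = storage.metadata_provider_get_by("hal", "https://hal.archives-ouvertes.fr/")
```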
Origin metadata storage
-----------------------

Extrinsic metadata is stored in SWH's :term:`storage database`, alongside
the :term:`Merkle DAG` containing all known software artifacts.

zack: Not sure that implementation details belong here. For the spec it is enough to know metadata is stored somewhere, and can be cross-referenced with source code artifacts available in the archive.
The storage API offers three endpoints to manipulate origin metadata:

* `origin_metadata_add(origin_id, discovery_date, provider_name, provider_url, metadata)`
  which adds a new `metadata` dictionary obtained from a given provider
  and associated with the origin.
  The provider must be known to the storage before using this endpoint.
* `origin_metadata_get(origin_id, provider_name, provider_url, after, limit)`
  which returns a list of dictionaries:
  `{'provider': {...}, 'discovery_date': ..., 'metadata': {...}}`,
  one for each metadata item deposited, corresponding to the given origin
  and obtained from the specified provider.
* `origin_metadata_get_by_provider_type(origin_id, provider_type, after, limit)`
  which works similarly to `origin_metadata_get`, but returns results for
  all providers of a given type.

zack (on `origin_metadata_get`): A common use case will be retrieving the most recent metadata provided by a given provider; I don't see how this API endpoint will allow that (short of accepting a special value for the "after" parameter which means "most recent", which isn't terribly elegant).

The parameters `after` and `limit` are used for pagination based on the
order defined by the `discovery_date`.
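The three endpoints and their pagination can be sketched in memory as follows. This is an illustration under assumptions, not the swh-storage code: the class name is invented, `after` is treated as an exclusive lower bound on `discovery_date`, results are returned in ascending date order, unknown providers are rejected with `ValueError`, and the content of the `provider` dictionary (which the specification elides) is filled with `name`, `url` and `type`. The `revision` payload is made-up sample data.

```python
import datetime
from typing import Any, Dict, List, Tuple

ProviderKey = Tuple[str, str]  # (name, url)


class InMemoryOriginMetadata:
    """Toy stand-in for the storage; providers map (name, url) -> type."""

    def __init__(self, providers: Dict[ProviderKey, str]) -> None:
        self._providers = dict(providers)
        self._items: List[Dict[str, Any]] = []

    def origin_metadata_add(self, origin_id, discovery_date,
                            provider_name, provider_url, metadata) -> None:
        key = (provider_name, provider_url)
        if key not in self._providers:
            # The provider must be known before metadata can be added.
            raise ValueError(f"unknown provider: {key}")
        self._items.append({"origin_id": origin_id, "provider_key": key,
                            "discovery_date": discovery_date,
                            "metadata": metadata})

    def _page(self, items, after, limit):
        # Assumption: `after` is exclusive, order is ascending by date.
        if after is not None:
            items = [i for i in items if i["discovery_date"] > after]
        items = sorted(items, key=lambda i: i["discovery_date"])
        if limit is not None:
            items = items[:limit]
        return [{"provider": {"name": i["provider_key"][0],
                              "url": i["provider_key"][1],
                              "type": self._providers[i["provider_key"]]},
                 "discovery_date": i["discovery_date"],
                 "metadata": i["metadata"]} for i in items]

    def origin_metadata_get(self, origin_id, provider_name, provider_url,
                            after=None, limit=None):
        key = (provider_name, provider_url)
        items = [i for i in self._items
                 if i["origin_id"] == origin_id and i["provider_key"] == key]
        return self._page(items, after, limit)

    def origin_metadata_get_by_provider_type(self, origin_id, provider_type,
                                             after=None, limit=None):
        items = [i for i in self._items
                 if i["origin_id"] == origin_id
                 and self._providers[i["provider_key"]] == provider_type]
        return self._page(items, after, limit)


WIKIDATA = ("wikidata", "https://www.wikidata.org/")
storage = InMemoryOriginMetadata({WIKIDATA: "registry"})
for day in (1, 2, 3):
    date = datetime.datetime(2020, 1, day, tzinfo=datetime.timezone.utc)
    storage.origin_metadata_add(42, date, *WIKIDATA, metadata={"revision": day})

# Fetch two items, then resume after the last discovery_date seen.
first_page = storage.origin_metadata_get(42, *WIKIDATA, limit=2)
second_page = storage.origin_metadata_get(
    42, *WIKIDATA, after=first_page[-1]["discovery_date"], limit=2)
```

With an exclusive `after`, two items sharing the same `discovery_date` could be split across pages and one of them skipped; a real implementation would need a tie-breaking rule, which this sketch does not attempt.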
All of the results of `origin_metadata_get` and
`origin_metadata_get_by_provider_type` can be considered authoritative
for the given origin at the given `discovery_date`, unless the provider type
is `registry`.

moranegg: I'm reading this paragraph over and over and this is ambiguous.
We don't presume to have th…
as I wrote in the new comment, when we distinguish between attached extrinsic metadata and independent extrinsic metadata, we can "qualify" the attached extrinsic metadata as more authoritative. I would erase this paragraph. Unless there is another way to say that we are not the authority.
`metadata` is a JSON-encodable dictionary. Its format is specific to each
provider, and it is treated as an opaque value by the storage.

moranegg: add `raw`

vlorentz: Where?

Unifying these various formats into a common language is outside the scope
of this specification.
zack (on the title): rather than about specifying the metadata, I think this spec is about specifying the "workflow" for gathering/archiving them, so maybe "Extrinsic metadata archival" here?