
Write a specification of extrinsic origin metadata storage.

Authored by vlorentz on May 23 2019, 3:53 PM.


Maniphest Tasks
T1737: Define and specify metadata providers

This is based on my understanding of how it currently works, and
this comment by @moranegg:

I introduced the following changes:

  • Introduced the concept of "gatherer" as a metadata provider, a catch-all for everything that does not fit in the other types, e.g. if we decide on implementing this: We can discard it afterward if we don't need it.
  • Dropped the concept of 'tool'. As far as I understand, tools are an intermediary between the provider and the SWH archive. IMO they only complicate things if we plan on extracting metadata from providers without any change.
  • Changed the described endpoints to drop the concept of "provider id" (you know I don't like extrinsic identifiers by now :) )
  • Changed the described endpoints to give consistent names to the parameters.

Resolves T1737.

Diff Detail

rDSTO Storage manager
Event Timeline

I really like it!
Good job!


At first I thought that we might see metadata when loading, but due to our recent discussion, maybe this is not the case and only the software artifact itself is fetched without additional information.


In the future, we will provide the possibility to deposit only metadata about an artifact in the archive.

This revision is now accepted and ready to land. May 24 2019, 10:43 AM

maybe we should add that metadata on registries is 'extrinsic' metadata

Mention registries in the definition of extrinsic metadata

I may have made stupid comments, but...


and may discover metadata as a side-effect

this is a bit weird, not sure it is necessary. The list is described as

We define five types of metadata providers:

but today listers are not really metadata providers, unless we consider origins as metadata.

Wouldn't it be simpler to not consider a lister as a metadata provider? It looks like what is described here is in fact a 'gatherer', according to the definition below.

Then (but it is an implementation detail not an architectural aspect), both the gatherer and the lister could be done at the same time by the same worker.


"origins" has not been defined here, nor has it been said that origins are in fact what a lister produces. I suggest adding a sentence in the 'lister' definition stating this fact.


They may either discover metadata as a side-effect of loading source code, or be dedicated to fetching metadata.

Kinda same as above: we are defining these components (i.e. what they are, not what they can be), so this 'it could be used to do other stuff "as a side effect"' should not be in there. This "prospective" part should be in a dedicated paragraph/list/section.



is a bit weird here. It does not appear that we are defining the deposit as a metadata provider.


you describe the list as

defined by these three properties:

but only list 2 elements.

And the 'name' seems to correspond to the 'type' column in the examples below.

The type/name/instance/xxx family of terms needs to be clarified IMHO.


Do we really want these tables in the storage?


This depends on which component (yes, an implementation detail) is in use when retrieving metadata.
If the component is a lister, then lister_gitlab, for example, is the mechanism with which we fetched the metadata.

If we decide that a new component is needed it will be a new provider_type.

Maybe keep these potential provider_types to help decide which should be implemented.


For now the only provider_type that is not "prospective" is deposit_client.

The list isn't closed: at the moment we have only one provider_type, deposit_client, but the metadata provider entity should also be used in the following future cases:

when listing: lister type
when loading: loader type
when fetching metadata from registries: registry type

Because it is not implemented, I'm not sure if we should list the types above in the docs


maybe should be:

:term:`deposit clients <deposit>` are the clients of the deposit component that push metadata to SWH using credentials, usually at the same time as a :term:`software artifact`


You are right.

  • type and name are mixed up in the table below
    • type is 'deposit_client'
    • name is 'hal'

one = name
two = url
third = type
Where type is used to qualify the metadata, for example, deposit_client metadata is more certified than registry metadata
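
To make the naming above concrete, here is a minimal Python sketch of a provider record. The class name `MetadataProvider`, the field names, and the example URL are illustrative assumptions, not the actual swh-storage schema. It assumes, as discussed in this thread, that name and url identify a provider while type only qualifies the metadata.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class MetadataProvider:
    """Hypothetical provider record; class and field names are assumptions."""

    name: str  # e.g. "hal" or "wikidata"
    url: str  # where the provider lives (placeholder value below)
    # The type qualifies the metadata (e.g. "deposit_client", "registry").
    # compare=False keeps it out of equality and hashing, so it is not
    # part of the provider's identity.
    type: str = field(compare=False)


# Placeholder URL, used only for illustration.
hal = MetadataProvider(name="hal", url="https://hal.example.org/", type="deposit_client")
```

With this layout, two records with the same name and url compare equal even if their type differs, which matches the "two properties" reading of the unique identifier.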

All comments are open for discussion and IMHO should not block accepting this diff.
On the other hand, I hadn't noticed that name and type are mixed up in the table; that should be changed before pushing.

This revision now requires changes to proceed. May 24 2019, 2:33 PM

today listers are not really metadata providers,

Indeed they are not. This spec describes what we want to do, not the current state.

It looks like it's in fact a 'gatherer', according the definition below, that is described here.

"gatherer" is a concept I introduced as a catch-all for everything that is not a lister, loader, or deposit. I have not yet checked how much metadata listers and loaders can fetch. If it turns out they can't get a lot of metadata, I'll drop them from the list and replace them with gatherers.

(EDIT: like moranegg just said ^^)


Origins are part of the SWH data model, which I assume is known to readers of the spec. I should probably mention it at the beginning.


Yeah, I should say "two properties". I used to count the type, but it turns out it's not part of the unique identifier.


Yes. They will be part of the archive. (And they already are.)

vlorentz marked 2 inline comments as done.
  • better explain origins
  • reword description of deposit clients
  • s/three properties/two properties/

Still type and name are mixed up...
deposit_client is type
hal is name

while registry is type and
wikidata is name






all the rest are in the correct order


is information about a software artifact that is not included
in the source code itself but should describe different aspects of the software artifact.
We distinguish attached extrinsic metadata found with the software artifact (on the web view of a repository's forge and/or its API, the metadata attached to a deposit)
and independent extrinsic metadata found in other locations (such as software catalogs and registries).

I changed the text a bit and introduced two new terms:

  • attached extrinsic metadata
  • independent extrinsic metadata

This new remark, if accepted by @zack, will affect the rest of the document: the specification would be divided between these 2 categories.


in a format specific to each provider type

I think we should use the same metadata schema for each provider type.
This will be important when we document information about specific registries.


I've read this paragraph over and over, and it is ambiguous.
We don't presume to have the authority to say this is true and this is false.
We state this as a fact:

We found on this day at this location from this provider this metadata

As I wrote in the new comment, when we distinguish between attached extrinsic metadata and independent extrinsic metadata, we can "qualify" the attached extrinsic metadata as more authoritative.

I would erase this paragraph. Unless there is another way to say that we are not the authority.


add raw


swh-storage should store data as close to the original source as possible.
Unification/translation should be done as an indexer, so it can be re-started based on the content of swh-storage.

If we did the translation before sending to swh-storage, then a bug in the translation means all existing data is corrupted.
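
The argument above can be sketched as follows. This is only an illustration under stated assumptions: the in-memory `raw_store`, the function names, and the `"example-json"` format label are all hypothetical and do not reflect the real swh-storage or indexer APIs. The point is that the raw bytes are archived untouched, so the translation can be re-run from scratch after a bug fix.

```python
import json
from typing import Dict, List, Tuple

# Hypothetical in-memory stand-in for swh-storage: raw bytes are kept as received.
RawRow = Tuple[str, str, bytes]  # (origin_url, format, raw_metadata)
raw_store: List[RawRow] = []


def store_raw(origin_url: str, fmt: str, raw: bytes) -> None:
    """Archive the metadata exactly as the provider sent it."""
    raw_store.append((origin_url, fmt, raw))


def translate(fmt: str, raw: bytes) -> Dict[str, str]:
    """Hypothetical indexer step: map a provider-specific format to common keys."""
    doc = json.loads(raw)
    if fmt == "example-json":  # assumed format name, for illustration only
        return {"name": doc["title"]}
    raise ValueError(f"unknown format: {fmt}")


def reindex() -> Dict[str, Dict[str, str]]:
    """Rebuild every translation from the raw store; can be re-run at any time."""
    return {origin: translate(fmt, raw) for origin, fmt, raw in raw_store}


store_raw("https://example.org/repo", "example-json", b'{"title": "demo"}')
index = reindex()
```

If `translate` turns out to be buggy, only `index` is wrong; fixing the function and calling `reindex()` again recovers, because nothing in `raw_store` was altered.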



zack requested changes to this revision. Jun 12 2019, 1:31 PM

Thanks @vlorentz for this first draft. In spite of all the comments above, I think it's a very good start.

Overall, I've two reservations: one is a bunch of minor issues noted above and that can be easily fixed, I think; the other is more profound, I think we haven't yet found a model general enough to describe metadata providers, but I don't have a concrete proposal yet. Maybe this deserves a shared brainstorming session with a whiteboard?


Rather than specifying the metadata, I think this spec is about specifying the "workflow" for gathering/archiving them, so maybe "Extrinsic metadata archival" here?


Aside from the various comments on terminology already raised, can't we just refer to the glossary, and avoid re-explaining what origins/loaders/listers are? They are already prerequisites for this doc, so...


Before defining the types, you need to define the notion of metadata provider here, I think, which is just a title in this version of the spec. One phrase would suffice, but is needed.


We're inconsistent in the type of entities we mention in this list. Three of them (loaders, listers, gatherers) are active components that do stuff to retrieve metadata, whereas two of them (registries, deposit clients) are passive information stores that need to be actively consulted to retrieve metadata. That should be made uniform.

Bonus point, get rid of the fact it is a deposit client, that's just an implementation detail, the metadata are part of the deposit, no matter what piece of software has been used to transfer it to SWH.


general writing suggestion: to avoid the inconsistencies between preambles/suffixes/lists, just avoid repeating how many properties are in the list; it's annoying to maintain and doesn't add any value


I don't think this ontology based on three properties is working well, but I cannot yet pinpoint what the key problem is. Meanwhile, here are some symptoms which are IMO red flags:

  • we have a gitlab loader? what is that? we have loaders that are specific to VCS for now, not to specific forges that hold those VCSs
  • sort of the same for deposits, we have one deposit interface, which just happens to accept submissions from multiple open access site "users"; why do we need to replicate the full list of "users" as metadata providers?
  • an example of a gatherer is missing

the registry line looks fine, but overall I can't shake the feeling that we haven't found the right general model yet


minor: either "looks up a" or "looks for a"


Not sure that implementation details belong here. For the spec it is enough to know that metadata are stored somewhere, and can be cross-referenced with source code artifacts available in the archive.


A common use case will be retrieving the most recent metadata provided by a given provider; I don't see how this API endpoint will allow doing that (short of accepting a special value for the "after" parameter which means "most recent", which isn't terribly elegant).
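
One possible way to cover this use case, sketched in Python, is a dedicated "latest" lookup next to the date-filtered one, rather than a magic value for "after". Everything here is a hypothetical illustration: the names `MetadataEntry`, `metadata_get`, and `metadata_get_latest`, the field names, and the list-based storage are assumptions, not the endpoints described in the diff.

```python
from datetime import datetime, timezone
from typing import List, NamedTuple, Optional


class MetadataEntry(NamedTuple):
    """Hypothetical stored row; all field names are assumptions."""

    origin_url: str
    provider_name: str
    discovery_date: datetime
    raw: bytes


def metadata_get(
    entries: List[MetadataEntry],
    origin_url: str,
    provider_name: str,
    after: Optional[datetime] = None,
) -> List[MetadataEntry]:
    """Return matching entries, oldest first, optionally only those after a date."""
    found = [
        e
        for e in entries
        if e.origin_url == origin_url
        and e.provider_name == provider_name
        and (after is None or e.discovery_date > after)
    ]
    return sorted(found, key=lambda e: e.discovery_date)


def metadata_get_latest(
    entries: List[MetadataEntry], origin_url: str, provider_name: str
) -> Optional[MetadataEntry]:
    """Return the most recent entry for this origin and provider, or None."""
    found = metadata_get(entries, origin_url, provider_name)
    return found[-1] if found else None


entries = [
    MetadataEntry("https://example.org/repo", "hal", datetime(2019, 5, 1, tzinfo=timezone.utc), b"old"),
    MetadataEntry("https://example.org/repo", "hal", datetime(2019, 6, 1, tzinfo=timezone.utc), b"new"),
]
latest = metadata_get_latest(entries, "https://example.org/repo", "hal")
```

A separate function keeps the meaning of "after" unambiguous; a real backend would presumably push the ordering and limit into the database query instead of sorting in Python.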

This revision now requires changes to proceed. Jun 12 2019, 1:31 PM
moranegg added inline comments.

I now understand why you thought I didn't want the raw metadata, but here we discuss the information about the provider and not the fetched metadata.
This is metadata about the provider.