
Define and specify extrinsic origin metadata
Closed, Resolved (Public)

Description

We already have endpoints dealing with them in the storage API (origin_metadata_add and origin_metadata_get_by), but they are not properly documented.

Event Timeline

vlorentz triaged this task as Normal priority. May 23 2019, 11:23 AM
vlorentz created this task.
vlorentz renamed this task from Define and specify origin extrinsic metadata to Define and specify extrinsic origin metadata. May 23 2019, 11:26 AM

After this morning's discussion on extrinsic_metadata with @zack, @vlorentz, @olasd and @douardda, here is a quick recap.
Feel free to add, comment and rewrite.

  1. the notions of attached extrinsic metadata and independent extrinsic metadata were introduced to aid understanding; they do not necessarily have to be implemented
  2. the term "metadata provider" was discussed and should not be used; we will continue with "authority"
  3. the distinction between the authority providing the metadata and the tool that retrieves it should be documented and reflected in the specs and implementation
  4. two options on how extrinsic metadata can be fetched and kept:
  • crawling a source (authority) and keeping all metadata with its source
    metadata_url             | metadata_loader
    github.com/foo/bar       | github_metadata_loader
    gitlab.inria.org/foo/bar | gitlab_metadata_loader
  • fetching metadata that describes a code repository (potentially in our archive)
    origin_url         | authority  | tool              | time-stamp | raw_metadata
    github.com/foo/bar | github.com | github_loader     | ts         | rm
    github.com/foo/bar | wikidata   | wikidata_gatherer | ts         | rm
    github.com/foo/bar | fsf.org    | fsf_gatherer      | ts         | rm
  5. types of authorities:
    • code hosts
    • deposit clients
    • registries
  6. points to clarify:
    • metadata found on a mirror that kept the data from a different authority (the Antelink scenario)
    • do we want to keep metadata found without the associated origin (a.k.a. code repository)?
    • do we want to document extrinsic metadata at other granularity levels (content, directory, revision, snapshot)? does this type of metadata exist?
    • should we create new tools of the lister/loader type, named metadata_loader/lister, or refactor existing tools?
    • this was not discussed, but there is a table for the providers, now called authority, where metadata about the providers should be kept in a known metadata schema (D1509#inline-9677)
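
To make the second option (the five-column table) more concrete, here is a minimal Python sketch of one row and of a lookup by origin. The class name, the helper function and the payload values are hypothetical placeholders for illustration only; they are not the storage schema or API.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class ExtrinsicMetadataRecord:
    """One row of the five-column table (hypothetical names, not the real schema)."""

    origin_url: str      # the code repository being described
    authority: str       # who provides the metadata (code host, deposit client, registry)
    tool: str            # the software that fetched the metadata
    timestamp: datetime  # when the metadata was fetched
    raw_metadata: bytes  # kept as-is, before any syntactic or semantic translation


def by_authority(records, origin_url):
    """Group the raw metadata known for one origin by the authority that provided it."""
    grouped = {}
    for record in records:
        if record.origin_url == origin_url:
            grouped.setdefault(record.authority, []).append(record.raw_metadata)
    return grouped


# Placeholder rows mirroring the table above; "ts" and "rm" values are made up.
ts = datetime(2019, 6, 19, tzinfo=timezone.utc)
records = [
    ExtrinsicMetadataRecord("github.com/foo/bar", "github.com", "github_loader", ts, b"rm-github"),
    ExtrinsicMetadataRecord("github.com/foo/bar", "wikidata", "wikidata_gatherer", ts, b"rm-wikidata"),
    ExtrinsicMetadataRecord("github.com/foo/bar", "fsf.org", "fsf_gatherer", ts, b"rm-fsf"),
]
grouped = by_authority(records, "github.com/foo/bar")
```

In this sketch the same origin can carry metadata from several authorities, each fetched by its own tool, which is the point of keeping authority and tool as separate columns.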

Actions on D1509:
I propose to abandon this diff and open a new one that takes into account the comments from the discussion, in particular:

  • specify authority instead of provider and add description of tools
  • specify that raw metadata will be kept anyway before syntax and semantic translation

Here is the current implementation:
swh:1:cnt:ea4b149cd76c67c304425771caa67ec5641a1b64;lines=381-428

zack added a comment. Jun 19 2019, 2:33 PM

Thanks a lot for this recap, Morane!

Regarding the two options you mention, I'm pretty sure we decided to go for the second (the 5-column table, at least conceptually). I'm not sure I understand the first option, nor whether it is an alternative to the second or an addition to it.

As an additional note: we discussed that the tool needs a version and/or configuration, similar to what we have for intrinsic metadata indexers (although the actual representation could perhaps be improved), and unlike what we do when loading content into the archive. We agreed that, in theory, code loaders should also have a version/configuration; the fact that they don't is a bug that we do not want to replicate here.
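
As a rough illustration of what a tool with a version and configuration could look like, here is a small Python sketch. The class, its field names, the key format and the example values are all assumptions made for this example, not the representation used by the indexers.

```python
import json
from dataclasses import dataclass, field


@dataclass
class MetadataTool:
    """Hypothetical description of a metadata-fetching tool."""

    name: str
    version: str
    configuration: dict = field(default_factory=dict)

    def key(self) -> str:
        """Return an identifier that changes whenever the version or configuration does."""
        config = json.dumps(self.configuration, sort_keys=True)
        return f"{self.name}@{self.version}:{config}"


# Made-up versions and configuration, only to show that two releases stay distinguishable.
old = MetadataTool("github_loader", "0.0.1", {"api": "v3"})
new = MetadataTool("github_loader", "0.0.2", {"api": "v3"})
```

With such a key, metadata fetched by `old` and by `new` can be told apart even though both come from the same authority.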

As discussed F2F, I concur we can restart from scratch with D1509, and I'll be happy to review its reincarnation when ready.

@vlorentz I think we can resolve this thanks to D1614?

vlorentz closed this task as Resolved. Jul 18 2019, 3:23 PM

Indeed