We already have endpoints dealing with them in the storage API (origin_metadata_add and origin_metadata_get_by), but they are not properly documented.
Description
Status | Assigned | Task | ||
---|---|---|---|---|
Migrated | gitlab-migration | T2201 Indexing / mining | ||
Migrated | gitlab-migration | T2202 Collect extrinsic metadata | ||
Migrated | gitlab-migration | T833 When listing an origin, add origin level metadata to RMD storage | ||
Migrated | gitlab-migration | T4283 Load https://github.com/chromium/chromium with a higher packfile size limit | ||
Migrated | gitlab-migration | T3273 Use "fork" relationships to speed-up initial load of large repositories | ||
Migrated | gitlab-migration | T1102 Handle all GitHub elements | ||
Migrated | gitlab-migration | T1740 fetch extrinsic origin metadata from GitHub | ||
Migrated | gitlab-migration | T1344 Write specs about metadata workflow | ||
Migrated | gitlab-migration | T1738 Define and specify extrinsic origin metadata | ||
Migrated | gitlab-migration | T1737 Define and specify metadata providers |
Event Timeline
After this morning's discussion on extrinsic_metadata with @zack, @vlorentz, @olasd and @douardda, here is a quick recap.
Feel free to add, comment and rewrite.
- the notions of attached extrinsic metadata and independent extrinsic metadata were introduced to aid understanding, without any requirement to implement them
- the term metadata provider was discussed and should no longer be used; we will continue with authority
- the distinction between the authority providing the metadata and the tool used to fetch it should be documented and reflected in the specs and implementation
- two options for how extrinsic metadata can be fetched and kept:
  - crawling a source (authority) and keeping all metadata with its source
    metadata_url | metadata_loader |
    ---|---|
    github.com/foo/bar | github_metadata_loader |
    gitlab.inria.org/foo/bar | gitlab_metadata_loader |
  - fetching metadata that describes a code repository (potentially in our archive)
    origin_url | authority | tool | time-stamp | raw_metadata |
    ---|---|---|---|---|
    github.com/foo/bar | github.com | github_loader | ts | rm |
    github.com/foo/bar | wikidata | wikidata_gatherer | ts | rm |
    github.com/foo/bar | fsf.org | fsf_gatherer | ts | rm |
- types of authorities:
  - code hosts
  - deposit clients
  - registries
- points to clarify:
  - metadata found on a mirror that kept the data from a different authority (Antelink scenario)
  - do we want to keep metadata found without the associated origin (a.k.a. code repository)?
  - do we want to document extrinsic metadata at other granularity levels (content, directory, revision, snapshot)? Does this type of metadata exist?
  - should we create new lister/loader-type tools named metadata_loader/lister, or refactor existing tools?
- this was not discussed, but there is a table for the providers, now called authority, where metadata about the providers should be kept in a known metadata schema (D1509#inline-9677)
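The second option in the recap (one row per origin, authority, tool and fetch time) can be sketched as a small data structure. This is only an illustration under stated assumptions, not the storage schema: the class name `RawExtrinsicMetadata`, the helper `metadata_for_origin`, the field names, the placeholder timestamp and the empty payloads are all made up for the example; only the column names and sample origins come from the table above.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class RawExtrinsicMetadata:
    """One row of the five-column table (hypothetical, not the real schema)."""
    origin_url: str
    authority: str       # who vouches for the metadata, e.g. a code host
    tool: str            # what fetched it
    fetched_at: datetime
    raw_metadata: bytes  # kept untouched, before any translation


def metadata_for_origin(rows, origin_url):
    """Group the rows describing one origin by the authority that provided them."""
    by_authority = {}
    for row in rows:
        if row.origin_url == origin_url:
            by_authority.setdefault(row.authority, []).append(row)
    return by_authority


ts = datetime(2020, 1, 1, tzinfo=timezone.utc)  # placeholder timestamp
rows = [
    RawExtrinsicMetadata("github.com/foo/bar", "github.com", "github_loader", ts, b"{}"),
    RawExtrinsicMetadata("github.com/foo/bar", "wikidata", "wikidata_gatherer", ts, b"{}"),
    RawExtrinsicMetadata("github.com/foo/bar", "fsf.org", "fsf_gatherer", ts, b"{}"),
]
grouped = metadata_for_origin(rows, "github.com/foo/bar")
print(sorted(grouped))
```

Keeping the authority and the tool as separate fields is what lets the same origin carry metadata from several authorities at once, which is the point of the second option.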
Actions on D1509:
I propose to abandon this diff and open a new one that takes into account the comments from the discussion, in particular:
- specify authority instead of provider and add description of tools
- specify that raw metadata will be kept in any case, before any syntactic or semantic translation
Here is the current implementation:
swh:1:cnt:ea4b149cd76c67c304425771caa67ec5641a1b64;lines=381-428
Thanks a lot for this recap, Morane!
Regarding the two options you mention, I'm pretty sure we decided to go for the second (the 5-column table, at least conceptually). I'm not sure I understand the first option, nor whether it is an alternative or an addition to the second.
As an additional note: we discussed that the tool needs a version and/or configuration, similar to what we have for intrinsic metadata indexers (although maybe the actual representation can be improved), and unlike what we do when loading content into the archive. We agreed that, in theory, code loaders should also have a version/configuration; the fact that they don't is a bug that we do not want to replicate here.
As discussed F2F, I concur we can restart from scratch with D1509, and I'll be happy to review its reincarnation when ready.
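To make the note above about tool version and configuration concrete, here is a minimal sketch of how a tool could be identified so that two runs with different versions or settings are told apart. Everything in it is an assumption for illustration: the class name `MetadataTool`, the `key` method, the version strings and the configuration contents are hypothetical and do not describe the indexer tables or any existing API.

```python
import json
from dataclasses import dataclass, field


@dataclass
class MetadataTool:
    """Hypothetical identity of a metadata-fetching tool."""
    name: str
    version: str
    configuration: dict = field(default_factory=dict)

    def key(self):
        """Return a comparable identity covering name, version and configuration."""
        # sort_keys makes the serialized configuration independent of dict order
        return (self.name, self.version, json.dumps(self.configuration, sort_keys=True))


# Illustrative values only; the version numbers and settings are made up.
old = MetadataTool("github_loader", "0.0.1", {"api": "v3", "per_page": 100})
new = MetadataTool("github_loader", "0.0.2", {"api": "v3", "per_page": 100})
same_as_old = MetadataTool("github_loader", "0.0.1", {"per_page": 100, "api": "v3"})

print(old.key() == new.key())          # differs by version
print(old.key() == same_as_old.key())  # same tool, same settings
```

Recording such a key next to each piece of raw metadata would make it possible to tell which tool release and settings produced it.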