Define and specify extrinsic origin metadata
Closed, Migrated (Edits Locked)

Assigned To

Authored By

	vlorentz
	May 23 2019, 11:23 AM

Description

We already have endpoints dealing with them in the storage API (origin_metadata_add and origin_metadata_get_by), but they are not properly documented.

Related Objects

Status	Assigned	Task
Migrated	gitlab-migration	T2201 Indexing / mining
Migrated	gitlab-migration	T2202 Collect extrinsic metadata
Migrated	gitlab-migration	T833 When listing an origin, add origin level metadata to RMD storage
Migrated	gitlab-migration	T4283 Load https://github.com/chromium/chromium with a higher packfile size limit
Migrated	gitlab-migration	T3273 Use "fork" relationships to speed-up initial load of large repositories
Migrated	gitlab-migration	T1102 Handle all GitHub elements
Migrated	gitlab-migration	T1740 fetch extrinsic origin metadata from GitHub
		Unknown Object (Maniphest Task)
Migrated	gitlab-migration	T1344 Write specs about metadata workflow
Migrated	gitlab-migration	T1738 Define and specify extrinsic origin metadata
Migrated	gitlab-migration	T1737 Define and specify metadata providers

Event Timeline

vlorentz triaged this task as Normal priority. May 23 2019, 11:23 AM

vlorentz created this task.

vlorentz renamed this task from "Define and specify origin extrinsic metadata" to "Define and specify extrinsic origin metadata". May 23 2019, 11:26 AM

Following this morning's discussion on extrinsic_metadata with @zack, @vlorentz, @olasd and @douardda, here is a quick recap.
Feel free to add, comment and rewrite.

- the notions of attached extrinsic metadata and independent extrinsic metadata were introduced to aid understanding; they do not necessarily need to be implemented
- the term "metadata provider" was discussed and should not be used; we will continue with "authority"
- the distinction between the authority providing the metadata and the tool that retrieves it should be documented and reflected in the specs and the implementation
- two options for how extrinsic metadata can be fetched and kept:

crawling a source (authority) and keeping all metadata with its source

`metadata_url`	`metadata_loader`
github.com/foo/bar	github_metadata_loader
gitlab.inria.org/foo/bar	gitlab_metadata_loader

fetching metadata that describes a code repository (potentially in our archive)

`origin_url`	`authority`	`tool`	`time-stamp`	`raw_metadata`
github.com/foo/bar	github.com	github_loader	ts	rm
github.com/foo/bar	wikidata	wikidata_gatherer	ts	rm
github.com/foo/bar	fsf.org	fsf_gatherer	ts	rm
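
The five-column table above can be read as a record type. Below is a minimal sketch under that reading, assuming one row per (origin, authority, tool, timestamp) with the raw metadata kept as opaque bytes. All class, field and method names are hypothetical; this is neither the swh-storage schema nor its origin_metadata_add / origin_metadata_get_by API.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List


@dataclass(frozen=True)
class OriginMetadataRow:
    """One row of the five-column table (hypothetical names)."""
    origin_url: str
    authority: str
    tool: str
    timestamp: datetime
    raw_metadata: bytes  # kept as fetched, no parsing or translation


class InMemoryMetadataStore:
    """Toy stand-in for a storage backend; not the real storage API."""

    def __init__(self) -> None:
        self._rows: List[OriginMetadataRow] = []

    def add(self, row: OriginMetadataRow) -> None:
        self._rows.append(row)

    def get_by_origin(self, origin_url: str) -> List[OriginMetadataRow]:
        # Several authorities may describe the same origin.
        return [r for r in self._rows if r.origin_url == origin_url]


store = InMemoryMetadataStore()
now = datetime.now(timezone.utc)
for authority, tool in [
    ("github.com", "github_loader"),
    ("wikidata", "wikidata_gatherer"),
    ("fsf.org", "fsf_gatherer"),
]:
    store.add(OriginMetadataRow("github.com/foo/bar", authority, tool, now, b"{}"))
```

Keying rows on the origin rather than on the authority is what lets a single repository accumulate metadata from several independent sources.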

types of authorities:
- code hosts
- deposit clients
- registries
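
One way to make these authority types explicit is an enumeration attached to each authority. This is only a sketch; the enum values and the Authority class are hypothetical and not taken from D1509 or from the current implementation.

```python
from dataclasses import dataclass
from enum import Enum


class AuthorityType(Enum):
    """The three kinds of authority listed in the recap (hypothetical values)."""
    CODE_HOST = "code_host"
    DEPOSIT_CLIENT = "deposit_client"
    REGISTRY = "registry"


@dataclass(frozen=True)
class Authority:
    type: AuthorityType
    name: str  # illustrative identifier, e.g. a hostname


# Assumption: github.com is classified as a code host.
github = Authority(AuthorityType.CODE_HOST, "github.com")
```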

points to clarify:
- metadata found on a mirror that kept the data from a different authority (Antelink scenario)
- do we want to keep metadata found without the associated origin (a.k.a. code repository)?
- do we want to document extrinsic metadata about other granularity levels (content, directory, revision, snapshot)? does this type of metadata exist?
- should we create new tools of the lister/loader type, named metadata_loader/lister, or refactor existing tools?
- this was not discussed, but there is a table for the providers, now called authority, where metadata about the providers should be kept in a known metadata schema (D1509#inline-9677)

Actions on D1509:
I propose abandoning this diff and opening a new one that takes into account the comments from the discussion, in particular:

- specify authority instead of provider and add a description of tools
- specify that raw metadata will be kept anyway, before any syntactic or semantic translation
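
The second point, keeping raw metadata regardless of later translation, could look like the sketch below. It assumes, purely for illustration, that the raw payload is JSON and that "translation" is just parsing it; the names StoredMetadata and ingest are hypothetical.

```python
import json
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass(frozen=True)
class StoredMetadata:
    raw: bytes                             # always kept, exactly as fetched
    translated: Optional[Dict[str, Any]]   # None when translation fails


def ingest(raw: bytes) -> StoredMetadata:
    """Keep the raw bytes unconditionally, then attempt a translation."""
    try:
        parsed = json.loads(raw.decode("utf-8"))
        translated = parsed if isinstance(parsed, dict) else None
    except (UnicodeDecodeError, json.JSONDecodeError):
        translated = None
    return StoredMetadata(raw=raw, translated=translated)
```

Because the raw bytes survive a failed translation, the translation step can be rerun later with an improved tool.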

Here is the current implementation:
swh:1:cnt:ea4b149cd76c67c304425771caa67ec5641a1b64;lines=381-428

moranegg added a subtask: T1737: Define and specify metadata providers. Jun 14 2019, 3:39 PM

Thanks a lot for this recap, Morane!

Regarding the two options you mention, I'm pretty sure we decided to go for the second (the five-column table, at least conceptually). I'm not sure I understand the first option, nor whether it is an alternative or an addition to the second.

As an additional note: we discussed that the tool needs a version and/or configuration, similar to what we have for intrinsic metadata indexers (although the actual representation could perhaps be improved) and unlike what we do when loading content into the archive. We agreed that, in theory, code loaders should also have a version/configuration; the fact that they don't is a bug that we do not want to replicate here.
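
A minimal sketch of a tool description carrying a version and a configuration, as suggested above. The MetadataTool class, its fields and the key format are hypothetical and do not reflect how the indexers represent tools today.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass(frozen=True)
class MetadataTool:
    name: str
    version: str
    configuration: Dict[str, Any] = field(default_factory=dict)

    def key(self) -> str:
        """Stable identifier: same name, version and configuration give the same key."""
        config = json.dumps(self.configuration, sort_keys=True)
        return f"{self.name}/{self.version}/{config}"


old = MetadataTool("github_loader", "0.0.1", {"api": "v3"})
new = MetadataTool("github_loader", "0.0.2", {"api": "v3"})
```

Recording the version and configuration next to each metadata row makes it possible to tell which rows came from an outdated tool and need refetching.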

As discussed F2F, I concur we can restart from scratch with D1509, and I'll be happy to review its reincarnation when ready.

vlorentz closed subtask T1737: Define and specify metadata providers as Resolved. Jul 4 2019, 11:14 AM

@vlorentz I think we can resolve this thanks to D1614?

Indeed

moranegg mentioned this in T3681: Review extrinsic metadata specification. Oct 21 2021, 12:36 PM

This task has been migrated to GitLab.
