Generic storage for extrinsic, qualified metadata related to any node of the swh archive
Closed, Migrated (Edits Locked)
Authored By

	olasd
	Mar 9 2020, 8:48 PM

Description

Our archive is currently in a bizarre middle-ground where some nodes in our graph can have free-form metadata attached (currently, that's revisions and origins), while others can't.

For revisions, some of that metadata is an integral part of the identifier computation (e.g. referencing arbitrary headers stored in a git commit); the rest of that metadata is attached to the revision object, without taking part in the identifier computation.

This is an issue on several fronts:

we end up creating revisions and storing metadata there, even when objects should conceptually be releases (that's T1258 as well as T1755)
we've been wary of adding a free-form metadata field to other objects, as we felt the need to update identifier computation to support it...
...as in the current state, metadata that is not part of the identifier computation is lossy: if we get different metadata, but generate the same revision id as something that has already been loaded, we will not be loading the new metadata to the archive (making the "idempotent" nature of our archive graph weaker, and in the worst case, losing (meta)data).
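
To make the lossiness concrete, here is a minimal sketch of the failure mode. It assumes a toy identifier (a SHA-1 over the intrinsic fields only) and a hypothetical `IdempotentStore` class; neither is the actual swh-model or swh-storage code.

```python
import hashlib


def revision_id(intrinsic: dict) -> str:
    """Toy identifier (assumption): SHA-1 over the intrinsic fields only."""
    payload = "\n".join(f"{key}={intrinsic[key]}" for key in sorted(intrinsic))
    return hashlib.sha1(payload.encode("utf-8")).hexdigest()


class IdempotentStore:
    """Hypothetical store that skips objects whose id is already known."""

    def __init__(self):
        self.objects = {}

    def add(self, intrinsic: dict, metadata: dict) -> bool:
        """Insert the object; return False if its id was already present."""
        obj_id = revision_id(intrinsic)
        if obj_id in self.objects:
            # Same id as an existing object: the new metadata is dropped.
            return False
        self.objects[obj_id] = {"intrinsic": intrinsic, "metadata": metadata}
        return True


store = IdempotentStore()
intrinsic = {"directory": "abc123", "message": "release 1.0"}

first = store.add(intrinsic, {"author": "Alice"})
second = store.add(intrinsic, {"author": "Bob"})  # same id, different metadata
stored = store.objects[revision_id(intrinsic)]["metadata"]
```

After both calls, only the metadata from the first load is kept; the second load is a no-op even though it carried different information.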

This issue has come up again while discussing our upcoming work with the scientific community (notably HAL/Archives Ouvertes). During this discussion, we've clarified a somewhat common misconception about our object identifiers:

Not all swh object identifiers are created equal.

persistent content identifiers are fully intrinsic, and are therefore totally suitable for the very long-term identification and retrieval of source code;
persistent directory identifiers are, as well, fully intrinsic (given proper normalization of file modes, which we're doing on the tarballs that we load). While harder to match "by chance" (as the complete hierarchy needs to be bit by bit identical), they're still likely to be usable in the very long term to retrieve source by id;
In the specific case of objects synthesized by Software Heritage (e.g. revisions or releases generated from deposits of source code, or from tarballs of project releases), the persistent revision, release and snapshot identifiers are less useful to the long-term identification of software. In essence, Software Heritage acts as a source of truth for these object ids, and expecting third parties to be able to replicate them in a long-term future is dubious at best.
For objects created by third parties (e.g. commits and tags from version control systems), the fact that v1 of the SWH persistent revision and release identifiers is compatible with the corresponding git object identifiers helps with their longer-term usefulness, but in the future there's a good chance that we'll need to generate our own identifiers from scratch, and to store these external identifiers as free-form metadata as well.
currently, snapshot objects are purely swh-specific.

Having said that, we've agreed on a way forward for storing extrinsic metadata on the graph:

we want a way to attach free-form, qualified metadata to objects at all levels of the graph (there's a good chance we can replace T1260 with that)
- we should be able to insert this metadata at object creation time
- separate crawlers should be able to insert this metadata post-hoc (T1739)
- trusted third parties should be able to push this metadata to us, e.g. via a SWORD / deposit process
  - for new objects (deposit of source code with attached metadata)
  - for existing objects (deposit of metadata only, attached to an object created externally, e.g. by loading a git origin);
this metadata store should remain completely outside of the object identifier computation
- minimizing the metadata accounted for inside of our object identifiers (and therefore, improving their "intrinsicness") increases the probability that they can be reproduced and used by third parties in the very long term;
this metadata store should be outside of the main graph storage
- We'll surely want to use / experiment on the metadata store separately from our work on the main graph

Once this separate metadata store is introduced, we should export the current "identifier-excluded" metadata out of the objects currently stored in the graph, then harden the archive storage schema to only allow intrinsic, identifier-included metadata fields in the main archive storage.

A minimum viable implementation of this metadata store would allow queries of the metadata attached to a given object, by PID, so that metadata can be displayed on the website and made available via the public API.
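
As a rough illustration of that minimum viable interface, the sketch below keeps metadata in a structure separate from the graph and looks it up by PID. The `MetadataStore` class and its `add`/`get` methods are hypothetical names chosen for this example, and the hash in the PID is a placeholder.

```python
from collections import defaultdict


class MetadataStore:
    """Hypothetical in-memory stand-in for a store kept outside the main graph."""

    def __init__(self):
        # PID -> list of metadata documents attached to that object.
        self._by_pid = defaultdict(list)

    def add(self, pid: str, metadata: dict) -> None:
        """Attach one more metadata document to the object named by `pid`."""
        self._by_pid[pid].append(metadata)

    def get(self, pid: str) -> list:
        """Return every metadata document attached to `pid` (possibly none)."""
        return list(self._by_pid.get(pid, []))


# Placeholder PID: the 40-hex-digit hash is not a real object.
directory_pid = "swh:1:dir:" + "0" * 40

store = MetadataStore()
store.add(directory_pid, {"title": "Example software", "source": "deposit"})

known = store.get(directory_pid)
unknown = store.get("swh:1:rev:" + "1" * 40)
```

Nothing here touches identifier computation: the PID is only used as a lookup key, which is the property the task asks for.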

Metadata-based or faceted search is a further step that is out of scope for this task.

(this is the summary of parts of an IRL discussion with @rdicosmo, @douardda, @vlorentz, @moranegg and @ardumont; corrections are, of course, welcome)

Revisions and Commits

rDSTO Storage manager
	Closed		D3623 Rename object_metadata to raw_extrinsic_metadata.
	Closed		D3456 Make metadata-related endpoints consistent with other endpoints by using Iterables of swh-model objects instead of a dict.
		D3357	rDSTOffe6b9253ecc Add content_metadata_{add,get}.
		D3356	rDSTO869679a85c55 Add context columns to object_metadata table and object_metadata_{add,get}.
		D3355	rDSTO27e942621cef Generalize origin_metadata to allow support for other object types in the…
		D3154	rDSTO213f1b1239a8 Add artifact metadata to the extrinsic metadata storage specification.

Related Objects

Status	Assigned	Task
Migrated	gitlab-migration	T3197 Mirror: fix common issues of a replayer session
Migrated	gitlab-migration	T3201 Mirror: unsupported Unicode escape sequence
Migrated	gitlab-migration	T1258 Synthesize release objects for all upstream things that match the concept of a release
Migrated	gitlab-migration	T2059 Generate (swh) releases from all git tags
Migrated	gitlab-migration	T3089 Remove the 'metadata' column of the 'revision' table
Migrated	gitlab-migration	T2513 Copy metadata on revisions to the extrinsic metadata storage
Migrated	gitlab-migration	T3097 Expose metadata in the WebApp and make it searchable
Migrated	gitlab-migration	T2064 Add metadata from deposits to metadata search
Migrated	gitlab-migration	T2073 Index extrinsic metadata from the journal in swh-search/Elasticsearch
		Restricted Maniphest Task
Migrated	gitlab-migration	T4401 Index metadata from the deposit
Migrated	gitlab-migration	T4459 Deploy swh-indexer > v2.6 on staging then production
Migrated	gitlab-migration	T4429 Deploy swh-indexer v2.3.0 on production and staging
Migrated	gitlab-migration	T4477 staging origin intrinsic metadata indexer are stuck
Migrated	gitlab-migration	T4606 Deploy swh-indexer v2.7.0
Migrated	gitlab-migration	T4694 Use directory metadata in origin search
Migrated	gitlab-migration	T2201 Indexing / mining
Migrated	gitlab-migration	T2202 Collect extrinsic metadata
Migrated	gitlab-migration	T2328 Collect metadata about software from ScanR
Migrated	gitlab-migration	T2311 Review the deposit of CodeMeta metadata in xml (following SWORD V2 specs)
Migrated	gitlab-migration	T2512 Make all loaders write their extrinsic metadata to the appropriate storage
Migrated	gitlab-migration	T2496 Write deposit metadata for revisions in the generic metadata storage
Migrated	gitlab-migration	T2514 Add raw_extrinsic_metadata to the journal backfiller
Migrated	gitlab-migration	T2074 Publish extrinsic metadata to swh-journal/Kafka
Migrated	gitlab-migration	T2344 Build a connector for software deposit via Zenodo/InvenioRDM
Migrated	gitlab-migration	T1732 Extend metadata for portals depositing software through SWORD
Migrated	gitlab-migration	T2306 Generic storage for extrinsic, qualified metadata related to any node of the swh archive

Event Timeline

olasd triaged this task as Normal priority. Mar 9 2020, 8:48 PM

olasd created this task.

Thanks a lot for this summary of a recurrent discussion we've had over the past few years now.

+1 on the general idea. It fits well as the backing store of the "factual knowledge base" for software artifacts stored in the main (graph) storage of Software Heritage.

Just an extra comment on the fact that out-of-graph metadata might be context-dependent.
E.g., two different HAL deposits might be depositing the very same source code tree but associate different, potentially even conflicting metadata (e.g., each deposit declaring that the author is a different person).
Another example that came up in the past are license statements: the same content blob might be recognized (by some tool or human review) as being distributed under GPL in a given repo, and under MIT in a different one.

This is absolutely not in conflict with the general idea here, but it does have a couple of design implications:

the metadata storage should support 1-N mappings from graph objects to metadata and associate contextual information with each (object, metadata) pair
in terms of UI, showing the metadata to users (e.g., on archive.s.o) will require some judgment calls on whether the user is browsing the graph object in the "right" context or not (this is a significant difference from what we do now, where we can just show all the metadata we have)
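
The two implications above can be sketched as follows, reusing the license example. The `MetadataEntry` fields (`target`, `authority`, `context`, `payload`) and the `entries_for` helper are assumptions made for illustration, not a proposed schema; the PID hash and URLs are placeholders.

```python
from dataclasses import dataclass, field


@dataclass
class MetadataEntry:
    """Hypothetical record: one statement about one graph object."""

    target: str      # identifier of the graph object
    authority: str   # who made the statement (a tool, a deposit client...)
    context: str     # where the statement applies (e.g. an origin URL)
    payload: dict = field(default_factory=dict)


def entries_for(entries, target, context=None):
    """Return the entries for `target`, optionally restricted to one context."""
    return [
        entry
        for entry in entries
        if entry.target == target and (context is None or entry.context == context)
    ]


# Placeholder identifier for a single content blob seen in two repositories.
blob = "swh:1:cnt:" + "0" * 40

entries = [
    MetadataEntry(blob, "license-scanner", "https://example.org/repo-a", {"license": "GPL"}),
    MetadataEntry(blob, "license-scanner", "https://example.org/repo-b", {"license": "MIT"}),
]

all_statements = entries_for(entries, blob)
in_repo_a = entries_for(entries, blob, context="https://example.org/repo-a")
```

Both statements coexist for the same object; a UI would pick which ones to surface by passing the context the user is browsing in.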

Let's go for this!

Thanks @olasd for the accurate and detailed summary!

A thought: we have the origin_metadata table, where at the moment we only store HAL metadata, with provider and tool as FKs.
This table also keeps discovery_date, so an origin can potentially have multiple metadata entries at different dates. This is visible in the db-schema.
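
For readers unfamiliar with that table, here is a simplified sketch of the idea using an in-memory SQLite database. The column names and types are assumptions for illustration and do not reproduce the actual swh-storage schema; the origin URL and tool name are placeholders.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE provider (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE tool (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    -- Simplified stand-in for origin_metadata: several rows per origin,
    -- told apart by discovery_date, provider and tool.
    CREATE TABLE origin_metadata (
        id INTEGER PRIMARY KEY,
        origin TEXT NOT NULL,
        discovery_date TEXT NOT NULL,
        provider_id INTEGER NOT NULL REFERENCES provider(id),
        tool_id INTEGER NOT NULL REFERENCES tool(id),
        metadata TEXT NOT NULL
    );
    """
)
conn.execute("INSERT INTO provider (id, name) VALUES (1, 'hal')")
conn.execute("INSERT INTO tool (id, name) VALUES (1, 'example-deposit-tool')")

origin = "https://example.org/software/1"  # placeholder origin URL
conn.executemany(
    "INSERT INTO origin_metadata"
    " (origin, discovery_date, provider_id, tool_id, metadata)"
    " VALUES (?, ?, ?, ?, ?)",
    [
        (origin, "2020-01-10T00:00:00+00:00", 1, 1, '{"title": "v1"}'),
        (origin, "2020-03-09T00:00:00+00:00", 1, 1, '{"title": "v2"}'),
    ],
)

# All metadata entries for one origin, oldest first.
history = conn.execute(
    "SELECT discovery_date, metadata FROM origin_metadata"
    " WHERE origin = ? ORDER BY discovery_date",
    (origin,),
).fetchall()
```

The same origin ends up with two rows, one per discovery date, which is the 1-N shape discussed above.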

To answer @zack's design implications: I completely agree that 1-N mappings should be supported for all artifacts, maybe with the discovery_date.

About showing metadata to the user: we are far from showing all metadata to the user at the moment, it's quite well hidden ;-)
And we don't show the intrinsic metadata at all.
I think we can keep the idea of showing all metadata, but some metadata will have higher visibility than others, and that choice will be subject to the necessary judgement calls.

Here is the link to the deposit-metadata specifications (written in 2018):
https://docs.softwareheritage.org/devel/swh-deposit/specs/spec-meta-deposit.html
I'll open a task to review and improve that.

rdicosmo added a parent task: T2328: Collect metadata about software from ScanR. Mar 20 2020, 6:59 PM

rdicosmo added a parent task: T1732: Extend metadata for portals depositing software through SWORD. Mar 21 2020, 11:50 AM

vlorentz claimed this task. Mar 24 2020, 12:04 PM

vlorentz removed a subtask: T2311: Review the deposit of CodeMeta metadata in xml (following SWORD V2 specs).

vlorentz added a parent task: T2311: Review the deposit of CodeMeta metadata in xml (following SWORD V2 specs).

rdicosmo added a parent task: T2344: Build a connector for software deposit via Zenodo/InvenioRDM. Apr 1 2020, 5:43 PM

ardumont mentioned this in T2371: nixguix: fails to use previous visit snapshot. Apr 24 2020, 9:21 AM

vlorentz added a revision: D3154: Add artifact metadata to the extrinsic metadata storage specification. May 14 2020, 2:53 PM

vlorentz added a commit: rDSTO213f1b1239a8: Add artifact metadata to the extrinsic metadata storage specification. May 26 2020, 1:02 PM

douardda mentioned this in D3247: [WIP] Add content_metadata_{add,get}. Jun 10 2020, 2:57 PM

vlorentz added revisions: D3355: Generalize origin_metadata to allow support for other object types in the future., D3356: Add context columns to object_metadata table and object_metadata_{add,get}., D3357: Add content_metadata_{add,get}. Jun 25 2020, 5:56 PM