
Generic storage for extrinsic, qualified metadata related to any node of the swh archive
Closed, Resolved (Public)

Description

Our archive is currently in a bizarre middle-ground where some nodes in our graph can have free-form metadata attached (currently, that's revisions and origins), while others can't.

For revisions, some of that metadata is an integral part of the identifier computation (e.g. referencing arbitrary headers stored in a git commit); the rest of that metadata is attached to the revision object, without taking part in the identifier computation.

This is an issue on several fronts:

  • we end up creating revisions and storing metadata there, even when objects should conceptually be releases (that's T1258 as well as T1755)
  • we've been wary of adding a free-form metadata field to other objects, as we felt the need to update identifier computation to support it...
  • ...as in the current state, metadata that is not part of the identifier computation is lossy: if we get different metadata, but generate the same revision id as something that has already been loaded, we will not load the new metadata into the archive (weakening the "idempotent" nature of our archive graph and, in the worst case, losing (meta)data).
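
To make the lossiness described in the last point concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: `revision_id` hashes a JSON dump of the intrinsic fields (this is not the actual swh identifier computation), and `ToyArchive` is a hypothetical in-memory stand-in for the archive storage.

```python
import hashlib
import json


def revision_id(intrinsic: dict) -> str:
    """Hypothetical identifier: a hash over the intrinsic fields only."""
    canonical = json.dumps(intrinsic, sort_keys=True).encode()
    return hashlib.sha1(canonical).hexdigest()


class ToyArchive:
    """Hypothetical store that deduplicates revisions by identifier."""

    def __init__(self):
        self.revisions = {}

    def add_revision(self, intrinsic: dict, metadata: dict):
        rid = revision_id(intrinsic)
        if rid in self.revisions:
            # Same id as an already loaded object: nothing is written,
            # so the metadata passed in this call is silently dropped.
            return rid, False
        self.revisions[rid] = {"intrinsic": intrinsic, "metadata": metadata}
        return rid, True


archive = ToyArchive()
intrinsic = {"directory": "abc", "message": "release 1.0"}

# Two loads of the same intrinsic data with different attached metadata.
rid1, stored1 = archive.add_revision(intrinsic, {"author": "Alice"})
rid2, stored2 = archive.add_revision(intrinsic, {"author": "Bob"})
```

Both calls yield the same id, and only the metadata from the first load survives in `archive.revisions`.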

This issue has come up again while discussing our upcoming work with the scientific community (notably HAL/Archives Ouvertes). During this discussion, we've clarified a somewhat common misconception about our object identifiers:

Not all swh object identifiers are created equal.

  • persistent content identifiers are fully intrinsic, and are therefore totally suitable for the very long-term identification and retrieval of source code;
  • persistent directory identifiers are, as well, fully intrinsic (given proper normalization of file modes, which we're doing on the tarballs that we load). While harder to match "by chance" (as the complete hierarchy needs to be bit by bit identical), they're still likely to be usable in the very long term to retrieve source by id;
  • In the specific case of objects synthesized by Software Heritage (e.g. revisions or releases generated from deposits of source code, or from tarballs of project releases), the persistent revision, release and snapshot identifiers are less useful to the long-term identification of software. In essence, Software Heritage acts as a source of truth for these object ids, and expecting third parties to be able to replicate them in a long-term future is dubious at best.
  • For objects created by third parties (e.g. commits and tags from version control systems), the fact that v1 SWH persistent revision and release identifiers are compatible with the corresponding git object identifiers helps with their longer-term usefulness, but in the future there's a good chance that we'll need to generate our own identifiers from scratch, and to store these external identifiers as free-form metadata as well.
  • currently, snapshot objects are purely swh-specific.
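
As an illustration of what "fully intrinsic" means, the sketch below computes a git-compatible blob hash from nothing but the bytes of a file. The helper names `git_blob_sha1` and `content_pid` are made up for this example, and the `swh:1:cnt:` prefix is shown on the assumption that v1 content identifiers use this git-compatible hash.

```python
import hashlib


def git_blob_sha1(data: bytes) -> str:
    """SHA-1 of the data prefixed with git's blob header: b"blob <size>\\0"."""
    header = b"blob %d\0" % len(data)
    return hashlib.sha1(header + data).hexdigest()


def content_pid(data: bytes) -> str:
    """Hypothetical helper: format the hash as a v1 content identifier."""
    return "swh:1:cnt:" + git_blob_sha1(data)


# Anyone holding the same bytes can recompute the same identifier,
# without asking Software Heritage for anything.
empty_blob_id = git_blob_sha1(b"")
```

Because the only input is the content itself, a third party can reproduce the identifier independently, which is not the case for objects synthesized by the archive.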

Having said that, we've settled on a way forward for storing extrinsic metadata on the graph:

  • we want a way to attach free-form, qualified metadata to objects at all levels of the graph (there's a good chance we can replace T1260 with that)
    • we should be able to insert this metadata at object creation time
    • separate crawlers should be able to insert this metadata post-hoc (T1739)
    • trusted third parties should be able to push this metadata to us, e.g. via a SWORD / deposit process
      • for new objects (deposit of source code with attached metadata)
      • for existing objects (deposit of metadata only, attached to an object created externally, e.g. by loading a git origin);
  • this metadata store should remain completely outside of the object identifier computation
    • minimizing the metadata accounted for inside of our object identifiers (and therefore, improving their "intrinsicness") increases the probability that they can be reproduced and used by third parties in the very long term;
  • this metadata store should be outside of the main graph storage
    • We'll surely want to use / experiment on the metadata store separately from our work on the main graph

Once this separate metadata store is introduced, we should export the current "identifier-excluded" metadata out of the objects currently stored in the graph, then harden the archive storage schema to only allow intrinsic, identifier-included metadata fields in the main archive storage.

A minimum viable implementation of this metadata store would allow queries of the metadata attached to a given object, by PID, so that metadata can be displayed on the website and made available via the public API.
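
A rough sketch of what such a minimum viable store could look like follows. It is an in-memory toy, not a proposal for the actual schema: the `MetadataEntry` fields (`authority`, `discovery_date`, `payload_format`, `payload`) and the `MetadataStore` class are assumptions chosen for the example.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class MetadataEntry:
    pid: str                  # identifier of the graph object described
    authority: str            # who provided the metadata (hypothetical field)
    discovery_date: datetime  # when the metadata was obtained
    payload_format: str       # e.g. "json"; kept opaque by the store
    payload: bytes            # raw, free-form metadata


class MetadataStore:
    """Toy store living outside the graph storage, queried by PID."""

    def __init__(self):
        self._by_pid = defaultdict(list)

    def add(self, entry: MetadataEntry) -> None:
        # Adding never touches the graph object or its identifier,
        # and re-adding an identical entry is a no-op.
        if entry not in self._by_pid[entry.pid]:
            self._by_pid[entry.pid].append(entry)

    def get(self, pid: str) -> list:
        # All entries attached to the object, oldest first.
        return sorted(self._by_pid.get(pid, []), key=lambda e: e.discovery_date)


pid = "swh:1:dir:" + "0" * 40  # placeholder identifier
store = MetadataStore()
store.add(MetadataEntry(pid, "authority-b",
                        datetime(2020, 3, 10, tzinfo=timezone.utc),
                        "json", b'{"author": "B"}'))
store.add(MetadataEntry(pid, "authority-a",
                        datetime(2020, 3, 9, tzinfo=timezone.utc),
                        "json", b'{"author": "A"}'))
```

`store.get(pid)` returns both entries ordered by discovery date, which is roughly what the website or the public API would need in order to display them.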

Metadata-based or faceted search is a further step that is out of scope for this task.

(this is the summary of parts of an IRL discussion with @rdicosmo, @douardda, @vlorentz, @moranegg and @ardumont; corrections are, of course, welcome)

Event Timeline

olasd triaged this task as Normal priority. Mar 9 2020, 8:48 PM
olasd created this task.
zack added a subscriber: zack. Mar 10 2020, 11:39 AM

Thanks a lot for this summary of a recurrent discussion we've had over the past few years now.

+1 on the general idea. It fits well as the backing store of the "factual knowledge base" for software artifacts stored in the main (graph) storage of Software Heritage.

Just an extra comment on the fact that out-of-graph metadata might be context-dependent.
E.g., two different HAL deposits might deposit the very same source code tree but associate different, potentially even conflicting, metadata with it (e.g., each deposit declaring that the author is a different person).
Another example that came up in the past is license statements: the same content blob might be recognized (by some tool or human review) as being distributed under GPL in a given repo, and under MIT in a different one.

This is absolutely not in conflict with the general idea here, but it does have a couple of design implications:

  • the metadata storage should support 1-N mappings from graph objects to metadata, and associate contextual information with each (object, metadata) pair
  • in terms of UI, showing the metadata to the user (e.g., on archive.s.o) will require some judgment calls on whether the user is browsing the graph object in the "right" context or not (this is a significant difference from what we do now, where we can just show all the metadata we have)
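
To illustrate these two implications with the license example above, here is a small sketch. The tuple layout `(object, context, key, value)`, the identifiers, and both helper functions are hypothetical, chosen only to show the idea.

```python
from collections import defaultdict

# Hypothetical claims: (object id, context, key, value).
claims = [
    ("cnt:1234", "origin:repo-a", "license", "GPL-3.0"),
    ("cnt:1234", "origin:repo-b", "license", "MIT"),
    ("cnt:1234", "origin:repo-a", "language", "C"),
    ("cnt:1234", "origin:repo-b", "language", "C"),
]


def conflicting_keys(claims, object_id):
    """Keys for which different contexts assert different values."""
    values = defaultdict(set)
    for obj, _context, key, value in claims:
        if obj == object_id:
            values[key].add(value)
    return {key: sorted(vals) for key, vals in values.items() if len(vals) > 1}


def claims_in_context(claims, object_id, context):
    """Metadata to show when the object is browsed in a given context."""
    return {
        key: value
        for obj, ctx, key, value in claims
        if obj == object_id and ctx == context
    }


conflicts = conflicting_keys(claims, "cnt:1234")
in_repo_b = claims_in_context(claims, "cnt:1234", "origin:repo-b")
```

Here `conflicts` flags the license key as disputed, while `in_repo_b` holds only what was asserted in the second repository; a UI would need some rule of this kind to decide what to display.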

Let's go for this!

Thanks @olasd for the accurate and detailed summary!

A thought: we have the origin_metadata table, where at the moment we only store HAL metadata, with provider and tool as foreign keys.
This table also keeps discovery_date, so an origin can potentially have multiple metadata entries at different dates. This is visible in the db-schema.

To answer @zack's design implications, I completely agree that 1-N mappings should be supported for all artifacts, maybe keyed with the discovery_date.

About showing metadata to the user: we are far from showing all metadata to the user at the moment, it's quite well hidden ;-)
And we don't show the intrinsic metadata at all.
I think we can keep the idea of showing all metadata, but some metadata will have higher visibility than the rest, and that is where the judgement calls will be needed.

Here is the link to the deposit-metadata specifications (written in 2018):
https://docs.softwareheritage.org/devel/swh-deposit/specs/spec-meta-deposit.html
I'll open a task to review and improve that.

ardumont changed the task status from Open to Work in Progress. Jul 1 2020, 3:50 PM

Heads-up on status: the storage-related endpoints developed by vlorentz are deployed in storage 0.9.0:

  • origin-metadata (we already had it, but it got abstracted)
  • content-metadata (new)

vlorentz moved this task from Restricted Project Column to Restricted Project Column on the Restricted Project board. Mon, Jul 27, 2:43 PM