Our archive is currently in a bizarre middle-ground where some nodes in our graph can have free-form metadata attached (currently, that's revisions and origins), while others can't.
For revisions, some of that metadata is an integral part of the identifier computation (e.g. referencing arbitrary headers stored in a git commit); the rest of that metadata is attached to the revision object, without taking part in the identifier computation.
This is an issue on several fronts:
- we end up creating revisions and storing metadata there, even when objects should conceptually be releases (that's T1258 as well as T1755)
- we've been wary of adding a free-form metadata field to other objects, as we felt the need to update identifier computation to support it...
- ...because, as things stand, metadata that is not part of the identifier computation is lossy: if we get different metadata but compute the same revision id as an object that has already been loaded, the new metadata is not loaded into the archive (this weakens the "idempotent" nature of our archive graph and, in the worst case, loses (meta)data).
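The lossy behaviour described in the last point can be illustrated with a minimal sketch. Everything here is hypothetical (`revision_id`, `ToyArchive`, the field names and the hashing scheme are not the actual SWH implementation); it only assumes an identifier computed from intrinsic fields and a store that skips objects it already knows:

```python
import hashlib


def revision_id(revision: dict) -> str:
    """Hypothetical identifier: hashes only the intrinsic fields."""
    intrinsic = "\n".join(
        [revision["directory"], revision["author"], revision["message"]]
    )
    return hashlib.sha1(intrinsic.encode()).hexdigest()


class ToyArchive:
    """Toy idempotent store: an object that already exists is never rewritten."""

    def __init__(self):
        self.revisions = {}

    def add_revision(self, revision: dict) -> bool:
        rev_id = revision_id(revision)
        if rev_id in self.revisions:
            # Same id as an already loaded object: the new metadata is dropped.
            return False
        self.revisions[rev_id] = revision
        return True


archive = ToyArchive()
base = {"directory": "d1", "author": "alice", "message": "release 1.0"}

# Two loads of the "same" revision, each carrying different extrinsic metadata.
first = archive.add_revision({**base, "metadata": {"source": "tarball"}})
second = archive.add_revision({**base, "metadata": {"source": "deposit"}})

stored = archive.revisions[revision_id(base)]
```

Only the metadata from the first load survives in `stored`; the second call is a no-op even though it carried information the archive did not have.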
This issue has come up again while discussing our upcoming work with the scientific community (notably HAL/Archives Ouvertes). During this discussion, we've clarified a somewhat common misconception about our object identifiers:
Not all swh object identifiers are created equal.
- persistent content identifiers are fully intrinsic, and are therefore totally suitable for the very long-term identification and retrieval of source code;
- persistent directory identifiers are, as well, fully intrinsic (given proper normalization of file modes, which we're doing on the tarballs that we load). While harder to match "by chance" (as the complete hierarchy needs to be bit-for-bit identical), they're still likely to be usable in the very long term to retrieve source by id;
- In the specific case of objects synthesized by Software Heritage (e.g. revisions or releases generated from deposits of source code, or from tarballs of project releases), the persistent revision, release and snapshot identifiers are less useful for the long-term identification of software. In essence, Software Heritage acts as the source of truth for these object ids, and expecting third parties to be able to replicate them in the long term is dubious at best.
- For objects created by third parties (e.g. commits and tags from version control systems), the fact that v1 SWH persistent revision and release identifiers are compatible with the corresponding git object identifiers helps with their longer-term usefulness, but in the future there's a good chance that we'll need to generate our own identifiers from scratch, and to store these external identifiers as free-form metadata as well.
- currently, snapshot objects are purely swh-specific.
Having said that, we've agreed on a way forward for storing extrinsic metadata on the graph:
- we want a way to attach free-form, qualified metadata to objects at all levels of the graph (there's a good chance we can replace T1260 with that)
  - we should be able to insert this metadata at object creation time
  - separate crawlers should be able to insert this metadata post-hoc (T1739)
  - trusted third parties should be able to push this metadata to us, e.g. via a SWORD / deposit process
    - for new objects (deposit of source code with attached metadata)
    - for existing objects (deposit of metadata only, attached to an object created externally, e.g. by loading a git origin);
- this metadata store should remain completely outside of the object identifier computation
  - minimizing the metadata accounted for inside of our object identifiers (and therefore, improving their "intrinsicness") increases the probability that they can be reproduced and used by third parties in the very long term;
- this metadata store should be outside of the main graph storage
  - we'll surely want to use / experiment on the metadata store separately from our work on the main graph
Once this separate metadata store is introduced, we should export the current "identifier-excluded" metadata out of the objects currently stored in the graph, then harden the archive storage schema to only allow intrinsic, identifier-included metadata fields in the main archive storage.
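As a rough sketch of that export step, the following splits the free-form metadata of a revision into the part that stays in the archive and the part that moves to the metadata store. The key names (`extra_headers`, `original_artifact`) and the `INTRINSIC_METADATA_KEYS` allow-list are assumptions made for illustration; which fields actually take part in the identifier computation would have to be checked against the real schema:

```python
# Assumption: only these metadata keys take part in the identifier computation.
INTRINSIC_METADATA_KEYS = frozenset({"extra_headers"})


def split_metadata(revision):
    """Return (hardened_revision, exported_metadata) without mutating the input."""
    metadata = revision.get("metadata") or {}
    kept = {k: v for k, v in metadata.items() if k in INTRINSIC_METADATA_KEYS}
    exported = {k: v for k, v in metadata.items() if k not in INTRINSIC_METADATA_KEYS}
    # The hardened revision keeps only identifier-included metadata.
    hardened = {**revision, "metadata": kept}
    return hardened, exported


revision = {
    "id": "placeholder-id",
    "message": "release 1.0",
    "metadata": {
        "extra_headers": [["encoding", "utf-8"]],
        "original_artifact": {"filename": "project-1.0.tar.gz"},
    },
}

hardened, exported = split_metadata(revision)
```

Here `hardened` is what would remain in the main archive storage, and `exported` is what would be written to the separate metadata store, keyed by the object's identifier.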
A minimum viable implementation of this metadata store would allow queries of the metadata attached to a given object, by PID, so that metadata can be displayed on the website and made available via the public API.
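A minimal in-memory sketch of such a store might look like the following. The class and field names (`MetadataStore`, `MetadataRecord`, `authority`, `discovered_at`) are hypothetical, and the PID used in the example is a placeholder; the only properties taken from the description above are that records are qualified by who provided them, can be added at any time, and are looked up by PID without touching the identifier computation:

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class MetadataRecord:
    pid: str                 # identifier of the object the metadata is attached to
    authority: str           # who provided the metadata (loader, crawler, depositor)
    discovered_at: datetime  # when the metadata was recorded
    payload: dict            # free-form metadata, never hashed into the PID


class MetadataStore:
    """Append-only store, kept separate from the main graph storage."""

    def __init__(self):
        self._records = defaultdict(list)

    def add(self, pid: str, authority: str, payload: dict,
            discovered_at: Optional[datetime] = None) -> MetadataRecord:
        record = MetadataRecord(
            pid=pid,
            authority=authority,
            discovered_at=discovered_at or datetime.now(timezone.utc),
            payload=payload,
        )
        # Records accumulate: new metadata never replaces earlier metadata.
        self._records[pid].append(record)
        return record

    def get(self, pid: str) -> list:
        """Return all metadata records attached to the given PID, oldest first."""
        return list(self._records.get(pid, []))


store = MetadataStore()
pid = "swh:1:rev:" + "0" * 40  # placeholder PID, not a real object

store.add(pid, authority="loader", payload={"source": "tarball"})
store.add(pid, authority="deposit-client", payload={"title": "Example project"})

records = store.get(pid)
unknown = store.get("swh:1:dir:" + "f" * 40)
```

Because records are appended rather than merged, metadata arriving later for an object that already exists is kept alongside the earlier metadata instead of being discarded.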
Metadata-based or faceted search is a further step that is out of scope for this task.