One of the important goals of Software Heritage is to let users find out when and where Software Heritage has seen a given content, that is, to track provenance information.
There are two sides to this coin:
- Mapping contents to revisions (and corresponding paths)
- Mapping revisions to origins (and corresponding {visit, branch} pairs)
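
As a rough illustration of the two mappings above, here is a minimal in-memory sketch. The class name `ProvenanceIndex`, its methods, and the string identifiers are all hypothetical and chosen for illustration only; they do not reflect the actual Software Heritage storage schema or API.

```python
from collections import defaultdict


class ProvenanceIndex:
    """Hypothetical in-memory index holding the two provenance mappings."""

    def __init__(self):
        # content id -> set of (revision id, path inside that revision)
        self.content_to_revisions = defaultdict(set)
        # revision id -> set of (origin URL, visit number, branch name)
        self.revision_to_origins = defaultdict(set)

    def add_content(self, content_id, revision_id, path):
        self.content_to_revisions[content_id].add((revision_id, path))

    def add_revision(self, revision_id, origin, visit, branch):
        self.revision_to_origins[revision_id].add((origin, visit, branch))

    def where_seen(self, content_id):
        """Yield (revision, path, origin, visit, branch) for a content.

        Results are sorted so that the output order is deterministic.
        """
        # .get() avoids creating empty entries for unknown identifiers.
        for revision_id, path in sorted(
            self.content_to_revisions.get(content_id, ())
        ):
            for origin, visit, branch in sorted(
                self.revision_to_origins.get(revision_id, ())
            ):
                yield (revision_id, path, origin, visit, branch)
```

Answering "where has this content been seen?" then amounts to joining the two mappings, which is what `where_seen` does. Keeping them separate means a revision shared by many origins is only recorded once per content.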
Our current metadata storage pushes the space efficiency / query efficiency tradeoff all the way towards space efficiency: we aggressively deduplicate and normalize every single layer of our DAG, and each deduplicatable object is stored exactly once.
We can traverse our graph in the "right" direction (root to leaves) fairly efficiently, as the recursive traversal always yields a reasonable, finite number of objects. Traversing it from leaves to roots, however, is very costly, as many contents are referenced from a huge number of places (for instance, the empty file can be found in millions of revisions).
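
The asymmetry between the two directions can be seen on a toy example. This is only a sketch under simplifying assumptions: the `children` dictionary and the node names are made up, and a real archive would hold far too many objects to fit in a single in-memory dictionary.

```python
from collections import defaultdict

# Toy deduplicated store: each object is stored once and only records
# its children (the "forward" direction, from roots to leaves).
children = {
    "rev1": ["dir_a"],
    "rev2": ["dir_b"],
    "dir_a": ["empty", "readme"],
    "dir_b": ["dir_a", "empty"],
    "empty": [],
    "readme": [],
}


def reachable(start, edges):
    """Return every node reachable from `start` by following `edges`."""
    seen = set()
    stack = [start]
    while stack:
        current = stack.pop()
        for nxt in edges.get(current, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def build_parents(children):
    """Invert the forward edges; this requires a scan of the whole store."""
    parents = defaultdict(set)
    for node, kids in children.items():
        for kid in kids:
            parents[kid].add(node)
    return parents


# Forward: cheap, only touches the objects under one revision.
descendants_of_rev1 = reachable("rev1", children)

# Backward: needs the inverted index, and a popular leaf such as the
# empty file fans out to a large share of the whole graph.
parents = build_parents(children)
ancestors_of_empty = reachable("empty", parents)
```

Here the forward traversal from `rev1` only visits the three objects below it, while the backward traversal from `empty` reaches every directory and revision in the store. The inverted index is exactly the extra, space-hungry data that the deduplicated storage does not keep.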
Now that some external, more elastic resources are becoming available, we can assess the resource requirements for answering content origin queries efficiently.
The prototype will be limited in volume (operating only on a subset, yet to be defined, of "interesting" origins), but we need to make sure that it can scale up as the available resources increase.