
Azure prototype: Content provenance information API
Closed, Migrated

Description

One of the important goals of Software Heritage is to allow users to find out when and where Software Heritage has seen a given content, i.e. to track provenance information.

There are two sides to this coin:

  • Mapping contents to revisions (and corresponding paths)
  • Mapping revisions to origins (and corresponding {visit, branch} couples)
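
To make the two mappings concrete, here is a minimal sketch of how they could compose into a single provenance lookup. All names (`RevisionPath`, `Occurrence`, `content_to_revisions`, `revision_to_origins`, `provenance`) are hypothetical and the in-memory dicts are only stand-ins for real storage; this is not the Software Heritage schema.

```python
from __future__ import annotations

from collections import defaultdict
from typing import NamedTuple


class RevisionPath(NamedTuple):
    revision: str  # revision identifier
    path: str      # path of the content inside that revision


class Occurrence(NamedTuple):
    origin: str  # origin URL
    visit: int   # visit number for that origin
    branch: str  # branch through which the revision was found


# Hypothetical in-memory stand-ins for the two mappings.
content_to_revisions: dict[str, set[RevisionPath]] = defaultdict(set)
revision_to_origins: dict[str, set[Occurrence]] = defaultdict(set)


def provenance(content_id: str) -> set[tuple[RevisionPath, Occurrence]]:
    """Compose both mappings: content -> (revision, path) -> (origin, visit, branch)."""
    return {
        (rev_path, occurrence)
        for rev_path in content_to_revisions.get(content_id, set())
        for occurrence in revision_to_origins.get(rev_path.revision, set())
    }
```

Keeping the two mappings separate means a revision shared by many origins is listed once per content, not once per (content, origin) pair.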

Our current metadata storage pushes the space efficiency / query efficiency tradeoff all the way towards space efficiency: we aggressively deduplicate and normalize every single layer of our DAG, and each deduplicatable object is stored exactly once.

We can traverse our tree in the "right" direction (root to leaves) in a fairly efficient manner, as the recursive traversal always yields a reasonable, finite number of objects. Traversing our tree from leaves to roots, however, is very costly, as a lot of contents are heavily duplicated (for instance, the empty file can be found in millions of revisions).
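
As an illustration of the asymmetry, the sketch below walks a toy archive root to leaves and inverts the result into a content-to-revision index. The toy data, identifiers, and function names are assumptions made for the example, not the real data model.

```python
from collections import defaultdict

# Hypothetical toy archive: each directory maps entry names to either a
# sub-directory id or a content id; each revision points at a root directory.
directories = {
    "d_root": {"README": ("content", "c_readme"), "src": ("directory", "d_src")},
    "d_src": {"__init__.py": ("content", "c_empty")},
}
revisions = {"r1": "d_root", "r2": "d_src"}


def walk(directory_id, prefix=""):
    """Root-to-leaves traversal: yield (path, content_id) under a directory."""
    for name, (kind, target) in directories[directory_id].items():
        path = prefix + name
        if kind == "directory":
            yield from walk(target, path + "/")
        else:
            yield path, target


def build_reverse_index():
    """Invert the forward traversal into content -> {(revision, path)}."""
    index = defaultdict(set)
    for revision_id, root in revisions.items():
        for path, content_id in walk(root):
            index[content_id].add((revision_id, path))
    return index
```

The forward walk only ever touches the objects of one revision, while the reverse index holds one entry per (content, revision, path) triple, so a heavily duplicated content such as the empty file accumulates a very large entry set.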

Now that some external, more elastic resources are starting to be available, we can assess the resource requirements to provide efficient queries of content origin information.

The prototype will be limited in volume (operating only on a subset - to be defined - of "interesting" origins), but we need to make sure that it can expand further as the available resources increase.

Event Timeline

zack added a subscriber: zack.

Ack on all the above. Just a clarification on the revisions→origin mapping.

At the first visit of a (new) origin, all the revisions we find will be marked as having been seen at the moment of the first visit.
At subsequent visits we will, in 99.99% of cases, see new revisions as well as revisions that we have already seen in the past.
There are at least two ways to go about what we store in the revisions→origin mapping for subsequent visits:

  1. transitive closure: at each visit, we store in the mapping all revisions that are reachable starting from repository roots at the time of visit
  2. new revisions only: at each visit, we store in the mapping only the new revisions that we haven't seen in the past. Two variants of this are possible:
    A. global cache: we consider "revisions we haven't seen" to be revisions not seen anywhere in the Software Heritage archive
    B. local cache: we consider "revisions we haven't seen" to be revisions not seen in the past for a specific origin (the one being visited)
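
The difference between the options above can be sketched as follows. This is only an illustration under simplifying assumptions: the class name, the strategy labels, and the in-memory sets are hypothetical, and `reachable` is assumed to be the full set of revisions reachable from the repository roots at visit time.

```python
from collections import defaultdict


class RevisionOriginMapping:
    """Toy revisions->origin mapping supporting the three storage strategies."""

    STRATEGIES = ("transitive", "global", "local")

    def __init__(self, strategy):
        if strategy not in self.STRATEGIES:
            raise ValueError(f"unknown strategy: {strategy}")
        self.strategy = strategy
        self.rows = []                      # stored (revision, origin, visit) rows
        self.seen_global = set()            # revisions seen anywhere in the archive
        self.seen_local = defaultdict(set)  # origin -> revisions seen for that origin

    def record_visit(self, origin, visit, reachable):
        """Store the rows for one visit and return the revisions that were stored."""
        reachable = set(reachable)
        if self.strategy == "transitive":
            to_store = reachable
        elif self.strategy == "global":
            to_store = reachable - self.seen_global
        else:  # "local"
            to_store = reachable - self.seen_local[origin]
        self.rows.extend((rev, origin, visit) for rev in sorted(to_store))
        self.seen_global |= reachable
        self.seen_local[origin] |= reachable
        return to_store
```

For example, with origin A visited twice ({r1, r2}, then {r1, r2, r3}) and a fork B visited once ({r1, r2, r3, r4}), this sketch stores 9 rows with the transitive closure, 4 with the global cache, and 7 with the local cache. Only the global cache drops the fact that B itself contains r1 to r3.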

As per yesterday's F2F discussion, we are going to experiment (first) with 2.B (new revisions only with local cache).

The rationale is twofold:

  • there is no loss of information with it (if we want, we can always further "unroll" transitive revisions later)
  • at subsequent visits our loaders (both git and svn) only process new revisions anyhow, so we can't really promise anything about past revisions. Those revisions will be exactly the same as before only up to what the VCS guarantees, e.g., up to SHA1 collisions for git and up to repository tampering for svn

In T547#9188, @zack wrote:

> local cache: we consider "revisions we haven't seen" to be revisions not seen in the past for a specific origin (the one being visited)
>
> As per yesterday's F2F discussion, we are going to experiment (first) with 2.B (new revisions only with local cache).
>
> The rationale is twofold:
>
>   • there is no loss of information with it (if we want, we can always further "unroll" transitive revisions later)

In a second step, we could also look into storing ranges of visits, which would reduce duplication a lot.

>   • at subsequent visits our loaders (both git and svn) only process new revisions anyhow, so we can't really promise anything about past revisions. Those revisions will be exactly the same as before only up to what the VCS guarantees, e.g., up to SHA1 collisions for git and up to repository tampering for svn

... although this would still hold, so the first visit would be the only one where we have really seen a new revision...
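
The "ranges of visits" idea mentioned above could look roughly like the sketch below. It assumes visits of a given origin are numbered with consecutive integers, which is an assumption of this example; the function name is hypothetical.

```python
def compress_visits(visits):
    """Collapse visit numbers into sorted, inclusive (first, last) ranges."""
    ranges = []
    for visit in sorted(set(visits)):
        if ranges and visit == ranges[-1][1] + 1:
            # Extend the current range when the visit is contiguous with it.
            ranges[-1] = (ranges[-1][0], visit)
        else:
            ranges.append((visit, visit))
    return ranges
```

A revision seen at every visit of an origin would then take a single (first, last) row per origin instead of one row per visit.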

zack added a parent task: Unknown Object (Maniphest Task). Aug 30 2016, 12:53 PM
ardumont closed subtask Unknown Object (Maniphest Task) as Resolved. Sep 19 2016, 10:38 AM
zack renamed this task from Prototype: Content provenance information API to Azure prototype: Content provenance information API. Feb 12 2017, 6:17 PM
zack lowered the priority of this task from High to Low.
zack claimed this task.
zack added a subscriber: grouss.

We're taking a different route for this now, based on @grouss's WIP.

gitlab-migration changed the status of subtask Unknown Object (Maniphest Task) from Resolved to Migrated.