Azure prototype: Content provenance information API
Closed, Migrated

Authored By

	olasd
	Aug 29 2016, 4:29 PM

Description

One of the important goals of Software Heritage is to allow users to find out when and where Software Heritage has seen a given content, i.e., to track provenance information.

There are two sides to this coin:

Mapping contents to revisions (and corresponding paths)
Mapping revisions to origins (and corresponding {visit, branch} couples)
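
As a minimal sketch of how the two mappings above combine to answer a provenance query, the following joins them in memory. The dictionaries, identifiers, and the `content_provenance` helper are hypothetical and only illustrate the shape of the data, not the actual storage schema.

```python
# Hypothetical in-memory stand-ins for the two mappings (all identifiers are made up).
# content -> list of (revision, path)
content_to_revisions = {
    "content-a": [("rev-1", "src/main.c"), ("rev-2", "src/main.c")],
}
# revision -> list of (origin, visit, branch)
revision_to_origins = {
    "rev-1": [("https://example.org/repo.git", 1, "refs/heads/master")],
    "rev-2": [
        ("https://example.org/repo.git", 2, "refs/heads/master"),
        ("https://example.org/fork.git", 1, "refs/heads/master"),
    ],
}


def content_provenance(content_id):
    """Join both mappings: content -> (revision, path) -> (origin, visit, branch)."""
    results = []
    for revision, path in content_to_revisions.get(content_id, []):
        for origin, visit, branch in revision_to_origins.get(revision, []):
            results.append(
                {
                    "content": content_id,
                    "revision": revision,
                    "path": path,
                    "origin": origin,
                    "visit": visit,
                    "branch": branch,
                }
            )
    return results


if __name__ == "__main__":
    for occurrence in content_provenance("content-a"):
        print(occurrence)
```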

Our current metadata storage pushes the space efficiency / query efficiency tradeoff all the way towards space efficiency: we aggressively deduplicate and normalize every single layer of our DAG, and each deduplicatable object is stored exactly once.

We can traverse our tree in the "right" direction (root to leaves) in a fairly efficient manner, as the recursive traversal always yields a reasonable, finite number of objects. Traversing our tree from leaves to roots, however, is very costly, as a lot of contents are heavily duplicated (for instance, the empty file can be found in millions of revisions).
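
To illustrate the asymmetry, here is a rough sketch on a toy deduplicated DAG: walking a revision from root to leaves is a bounded traversal, and a content → revision index can be built by running that walk once per revision. The `revisions` and `directories` structures and both function names are assumptions made for this example, not the real data model.

```python
from collections import defaultdict

# Toy deduplicated DAG (hypothetical identifiers).
# revision -> root directory
revisions = {"rev-1": "dir-root-1", "rev-2": "dir-root-2"}
# directory -> list of (name, kind, target)
directories = {
    "dir-root-1": [("README", "content", "c-readme"), ("src", "directory", "dir-src")],
    "dir-root-2": [("README", "content", "c-readme-v2"), ("src", "directory", "dir-src")],
    # The same empty content appears under two names, and "dir-src" is shared
    # by both revisions: each object is stored once but referenced many times.
    "dir-src": [("empty.txt", "content", "c-empty"), ("__init__.py", "content", "c-empty")],
}


def walk_revision(revision_id):
    """Yield (content_id, path) for every file reachable from a revision (root to leaves)."""
    stack = [(revisions[revision_id], "")]
    while stack:
        directory_id, prefix = stack.pop()
        for name, kind, target in directories[directory_id]:
            path = prefix + name
            if kind == "directory":
                stack.append((target, path + "/"))
            else:
                yield target, path


def build_content_index(revision_ids):
    """Precompute the reverse (leaves to roots) lookup: content -> {(revision, path)}."""
    index = defaultdict(set)
    for revision_id in revision_ids:
        for content_id, path in walk_revision(revision_id):
            index[content_id].add((revision_id, path))
    return index


if __name__ == "__main__":
    index = build_content_index(["rev-1", "rev-2"])
    for content_id, occurrences in sorted(index.items()):
        print(content_id, sorted(occurrences))
```

Even in this tiny example the empty content ends up with four occurrences across two revisions, which hints at why such an index trades space for query speed.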

Now that some external, more elastic resources are starting to be available, we can assess the resource requirements to provide efficient queries of content origin information.

The prototype will be limited in volume (operating only on a subset - to be defined - of "interesting" origins), but we need to make sure that it can expand further as the available resources increase.

Related Objects

Status	Assigned	Task
		Unknown Object (Maniphest Task)
Migrated	gitlab-migration	T547 Azure prototype: Content provenance information API
Migrated	gitlab-migration	T550 Add cache tables for provenance information API
Migrated	gitlab-migration	T551 List interesting origins for the content provenance information prototype
Migrated	gitlab-migration	T553 Open api endpoint /api/1/provenance/ to read a content's provenance information
Migrated	gitlab-migration	T554 List the revisions added between two subsequent visits of an origin
		Unknown Object (Maniphest Task)
Migrated	gitlab-migration	T552 List and populate contents per revision in cache_content_revision (api endpoint)
		Unknown Object (Maniphest Task)
Migrated	gitlab-migration	T598 Store content -> revision cache in azure table storage

Event Timeline

olasd created this task. Aug 29 2016, 4:29 PM

Ack on all the above. Just a clarification on the revisions→origin mapping.

At the first visit of a (new) origin, all the revisions we find will be marked as having been seen at the moment of the first visit.
At subsequent visits (and in 99.99% of the cases) we will see new revisions as well as revisions that we have already seen in the past.
There are at least two ways to go about what we store in the revisions→origin mapping for subsequent visits:

1. transitive closure: at each visit, we store in the mapping all revisions that are reachable starting from repository roots at the time of visit
2. new revisions only: at each visit, we store in the mapping only the new revisions that we haven't seen in the past. Two variants of this are possible:
   A. global cache: we consider "revisions we haven't seen" to be revisions not seen anywhere in the Software Heritage archive
   B. local cache: we consider "revisions we haven't seen" to be revisions not seen in the past for a specific origin (the one being visited)

As per yesterday's F2F discussion, we are going to experiment (first) with 2.B (new revisions only with local cache).

The rationale is twofold:

there is no loss of information with it (if we want, we can always further "unroll" transitive revisions later)
at subsequent visits our loaders (both git and svn) only process new revisions anyhow, so we can't really promise anything about past revisions. Those revisions will be exactly the same as before only up to what the VCS guarantees, e.g., up to SHA1 collisions for git and up to repository tampering for svn
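
The difference between the storage strategies discussed in this comment can be sketched as follows. This is only an illustration under simplifying assumptions: each visit is modelled as the set of revisions reachable at that time, and the function name, strategy labels, and sample data are all hypothetical.

```python
def revisions_to_record(visits, strategy):
    """Return what each visit would store in the revision -> origin mapping.

    visits: chronological list of (origin, set of revisions reachable at that visit).
    strategy: "transitive_closure", "new_global" or "new_local" (labels made up here).
    Returns a list of (origin, visit_number, sorted recorded revisions).
    """
    seen_globally = set()   # revisions seen anywhere in the archive
    seen_per_origin = {}    # origin -> revisions already seen for that origin
    visit_counts = {}
    records = []
    for origin, reachable in visits:
        visit_counts[origin] = visit_counts.get(origin, 0) + 1
        seen_here = seen_per_origin.setdefault(origin, set())
        if strategy == "transitive_closure":
            recorded = set(reachable)
        elif strategy == "new_global":
            recorded = reachable - seen_globally
        elif strategy == "new_local":
            recorded = reachable - seen_here
        else:
            raise ValueError(f"unknown strategy: {strategy}")
        records.append((origin, visit_counts[origin], sorted(recorded)))
        seen_globally |= reachable
        seen_here |= reachable
    return records


if __name__ == "__main__":
    # Origin B is a fork of A, visited between two visits of A.
    visits = [
        ("A", {"r1", "r2"}),
        ("B", {"r1", "r2", "r3"}),
        ("A", {"r1", "r2", "r4"}),
    ]
    for strategy in ("transitive_closure", "new_global", "new_local"):
        print(strategy, revisions_to_record(visits, strategy))
```

In this toy run, the global-cache variant stores nothing that links r1 and r2 to origin B, whereas the local-cache variant keeps that link while still avoiding re-recording r1 and r2 at the second visit of A.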

In T547#9188, @zack wrote:

> local cache: we consider "revisions we haven't seen" to be revisions not seen in the past for a specific origin (the one being visited)
>
> As per yesterday's F2F discussion, we are going to experiment (first) with 2.B (new revisions only with local cache).
>
> The rationale is twofold:
>
> there is no loss of information with it (if we want, we can always further "unroll" transitive revisions later)

In a second step, we could also look into storing ranges of visits, which would reduce duplication a lot.

> at subsequent visits our loaders (both git and svn) only process new revisions anyhow, so we can't really promise anything about past revisions. Those revisions will be exactly the same as before only up to what the VCS guarantees, e.g., up to SHA1 collisions for git and up to repository tampering for svn

... although this would still hold, so the first visit would be the only one where we have really seen a new revision...
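
The "ranges of visits" idea mentioned in this reply could look roughly like the sketch below: instead of one row per (revision, visit), consecutive visits in which a revision was reachable are collapsed into inclusive ranges. The `to_visit_ranges` helper is hypothetical and not part of any existing Software Heritage API.

```python
def to_visit_ranges(visit_numbers):
    """Collapse visit numbers into inclusive (first, last) ranges of consecutive visits."""
    ranges = []
    for visit in sorted(set(visit_numbers)):
        if ranges and visit == ranges[-1][1] + 1:
            # Extend the current range by one visit.
            ranges[-1] = (ranges[-1][0], visit)
        else:
            # Start a new range after a gap (or at the very first visit).
            ranges.append((visit, visit))
    return ranges


if __name__ == "__main__":
    # A revision reachable at visits 1-4, missing at 5-6, reachable again at 7-8.
    print(to_visit_ranges([1, 2, 3, 4, 7, 8]))
```

For the sample input, six per-visit rows shrink to two ranges, (1, 4) and (7, 8).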

olasd created subtask T550: Add cache tables for provenance information API. Aug 30 2016, 12:12 PM

olasd created subtask T551: List interesting origins for the content provenance information prototype. Aug 30 2016, 12:15 PM

ardumont created subtask T552: List and populate contents per revision in cache_content_revision (api endpoint). Aug 30 2016, 12:20 PM

ardumont created subtask T553: Open api endpoint /api/1/provenance/ to read a content's provenance information. Aug 30 2016, 12:30 PM

olasd created subtask T554: List the revisions added between two subsequent visits of an origin. Aug 30 2016, 12:53 PM

zack added a parent task: Unknown Object (Maniphest Task). Aug 30 2016, 12:53 PM

olasd closed subtask T550: Add cache tables for provenance information API as Resolved. Aug 30 2016, 2:53 PM

ardumont closed subtask T552: List and populate contents per revision in cache_content_revision (api endpoint) as Resolved. Aug 30 2016, 4:58 PM

ardumont removed a subtask: T552: List and populate contents per revision in cache_content_revision (api endpoint). Aug 30 2016, 5:19 PM

ardumont added a subtask: Unknown Object (Maniphest Task).

olasd closed subtask T554: List the revisions added between two subsequent visits of an origin as Resolved. Sep 1 2016, 2:29 PM

ardumont closed subtask T553: Open api endpoint /api/1/provenance/ to read a content's provenance information as Resolved. Sep 1 2016, 2:43 PM

olasd closed subtask T551: List interesting origins for the content provenance information prototype as Resolved. Sep 9 2016, 4:30 PM

ardumont closed subtask Unknown Object (Maniphest Task) as Resolved. Sep 19 2016, 10:38 AM

olasd created subtask T598: Store content -> revision cache in azure table storage. Nov 7 2016, 4:04 PM

zack renamed this task from Prototype: Content provenance information API to Azure prototype: Content provenance information API. Feb 12 2017, 6:17 PM

zack lowered the priority of this task from High to Low.

We're taking a different route for this now, based on @grouss's WIP.

zack closed subtask T598: Store content -> revision cache in azure table storage as Wontfix. Sep 15 2017, 9:58 AM

gitlab-migration changed the status of subtask T550: Add cache tables for provenance information API from Resolved to Migrated. Jan 8 2023, 4:19 PM

gitlab-migration changed the status of subtask T553: Open api endpoint /api/1/provenance/ to read a content's provenance information from Resolved to Migrated.

gitlab-migration changed the status of subtask T554: List the revisions added between two subsequent visits of an origin from Resolved to Migrated.

gitlab-migration changed the status of subtask T598: Store content -> revision cache in azure table storage from Wontfix to Migrated.

This task has been migrated to GitLab.

gitlab-migration changed the status of subtask T551: List interesting origins for the content provenance information prototype from Resolved to Migrated. Jan 8 2023, 9:56 PM

gitlab-migration changed the status of subtask Unknown Object (Maniphest Task) from Resolved to Migrated.
