Page MenuHomeSoftware Heritage

New model for the origin layer
Closed, MigratedEdits Locked

Description

The origin layer is the index in swh.provenance allowing to answer the question "in which origin can the revision R be found?" and eventually "what is the most probable origin O in which the revision R has been created?"

Rationale

The current implementation of the origin layer is pretty basic and simple, but it performs poorly, both in terms of ingestion bandwidth and storage usage.

It consists in 2 tables:

  • Revision in Origin (RiO) : a table with 2 columns (revision, origin).
  • Revision before Revision (RbR): a table with 2 columns (prev, next)

The table RiO stores the head revisions found in a origin, ie. the revisions that have been identified as a branch in a snapshot of a visit of the given origin.

The table RbR stores a "synthetic" vision of revision's topological graph; i.e. it stores (non recursively) couples of revision ids where the second column is the id of any revision in an origin, and the first column is the id of a head revision in that origin.

For example, a repo like:

with only one visit producing a single snapshot with h1, h2 and h3 heads (branch/tags) would end up filling the tables described above like:

for the RiO table :

OrigRev
O1r5
O1r4
O1r2

and for the RbR table

Rev (next)Rev (prev)
r2r1
r4r2
r4r1
r5r4
r5r3
r5r2
r5r1

This data model lacks storage efficiency (some informations on the topological structure of the git history are duplicated) and ingestion efficiency (each head is ingested one at a time, thus the whole git history is recomputed for each head).

Proposal

We can make the required storage a bit more efficient and improve the efficiency of the ingestion algorithm using the following data model.

Use 3 tables:

  • Head in Origin (HiO), same as the RiO above, but we make it explicit that revisions stored in this table are actually heads,
  • Head before Head (HbH), similar to the RbR above, but only for head revisions
  • Rev before Head (RbH) to store the location of all other (non head) revisions with regard to heads.

Using such a model, the example above would become:

HiO:

OrigHead
O1r5
O1r4
O1r2

HbH:

Head (next)Head (prev)
r5r4
r5r2
r4r2

RbH:

HeadRev
r5r3
r2r1