The origin layer is the index in swh.provenance allowing to answer the question "in which origin can the revision R be found?" and eventually "what is the most probable origin O in which the revision R has been created?"
The current implementation of the origin layer is pretty basic and simple, but it performs poorly, both in terms of ingestion bandwidth and storage usage.
It consists in 2 tables:
- Revision in Origin (RiO) : a table with 2 columns (revision, origin).
- Revision before Revision (RbR): a table with 2 columns (prev, next)
The table RiO stores the head revisions found in a origin, ie. the revisions that have been identified as a branch in a snapshot of a visit of the given origin.
The table RbR stores a "synthetic" vision of revision's topological graph; i.e. it stores (non recursively) couples of revision ids where the second column is the id of any revision in an origin, and the first column is the id of a head revision in that origin.
For example, a repo like:
with only one visit producing a single snapshot with h1, h2 and h3 heads (branch/tags) would end up filling the tables described above like:
for the RiO table :
and for the RbR table
|Rev (next)||Rev (prev)|
This data model lacks storage efficiency (some informations on the topological structure of the git history are duplicated) and ingestion efficiency (each head is ingested one at a time, thus the whole git history is recomputed for each head).
We can make the required storage a bit more efficient and improve the efficiency of the ingestion algorithm using the following data model.
Use 3 tables:
- Head in Origin (HiO), same as the RiO above, but we make it explicit that revisions stored in this table are actually heads,
- Head before Head (HbH), similar to the RbR above, but only for head revisions
- Rev before Head (RbH) to store the location of all other (non head) revisions with regard to heads.
Using such a model, the example above would become:
|Head (next)||Head (prev)|