The origin layer is the index in swh.provenance allowing to answer the question "in which origin can the revision R be found?" and eventually "what is the most probable origin O in which the revision R has been created?"
# Rationale
The current implementation of the origin layer is pretty basic and simple, but it performs poorly, both in terms of ingestion bandwidth and storage usage.
It consists in 2 tables:
- Revision in Origin (`RiO`) : a table with 2 columns (revision, origin).
- Revision before Revision (`RbR`): a table with 2 columns (prev, next)
The table `RiO` stores the head revisions found in a origin, ie. the revisions that have been identified as a branch in a snapshot of a visit of the given origin.
The table `RbR` stores a "synthetic" vision of revision's topological graph; i.e. it stores (non recursively) couples of revision ids where the second column is the id of any revision in an origin, and the first column is the id of a head revision in that origin.
For example, a repo like:
{F5919997} {F5920000}
with only one visit producing a single snapshot with `h1`, `h2` and `h3` heads (branch/tags) would end up filling the tables described above like:
for the `RiO` table :
| Orig | Rev |
| O1 | r5 |
| O1 | r4 |
| O1 | r2 |
and for the `RbR` table
| Rev | Rev |
| r2 | r1 |
| r4 | r2 |
| r4 | r1 |
| r5 | r4 |
| r5 | r3 |
| r5 | r2 |
| r5 | r1 |
This data model lacks storage efficiency (some informations on the topological structure of the git history are duplicated) and ingestion efficiency (each head is ingested one at a time, thus the whole git history is recomputed for each head).
## Proposal
We can make the required storage a bit more efficient and improve the efficiency of the ingestion algorithm using the following data model.
Use 3 tables:
- Head in Origin (`HiO`), same as the `RiO` above, but we make it explicit that revisions stored in this table are actually heads,
- Head before Head (`HbH`), similar to the `RbR` above, but only for head revisions
- Rev before Head (`RbH`) to store the location of all other (non head) revisions with regard to heads.
Using such a model, the example above would become:
`HiO`:
| Orig | Head |
| O1 | r5 |
| O1 | r4 |
| O1 | r2 |
`HbH`:
| Head | Head |
| r5 | r4 |
| r5 | r2 |
| r4 | r2 |
`RbH`:
| Head | Rev |
| r5 | r3 |
| r2 | r1 |