Page MenuHomeSoftware Heritage

occurrence history table: use an array of timestamps instead of intervals
Closed, MigratedEdits Locked

Description

Rationale: with intervals we do lose information. Also, given the update frequency for each repository is not necessarily going to be stable, it is not clear how to interpret intervals.
We should worry about "compressing" timestamps using intervals the day storing timestamps become a problem, and not before.

(Strictly speaking, the SQL modeling does not need to be based on arrays, but arrays are likely the more efficient representation for this. To be verified.)

Event Timeline

zack raised the priority of this task from to Low.
zack updated the task description. (Show Details)
zack added projects: Developers, Storage manager.

Our main query on occurrences is looking for occurrences that are

  • from a given origin
  • on a given branch (or all branches)
  • that are the newest, or the closest to a given timestamp.

That last condition is the hardest to have in SQL and we need to make sure that our representation of timestamps is relevant for it. Indexing the contents of an array is very inefficient in postgres, and we can't really do more than indexing inclusion. If the first two conditions return few enough rows, then using an array for timestamps should not be an issue.

Another possibility would be to use list_history, which will only have one line per origin per timestamp, and make the occurrence_history column an indexed "array foreign key" to list_history. We can work with indexing array contents then, as the timestamp search would be done on list_history which has a single timestamp column.

Still thinking about this .

The timestamps in occurrence_history seem to conflate different concepts.

  • The Software Heritage authority's timestamp says: "We have retrieved the refs of this repository at date X, and we have noticed that they pointed to the object Y"
  • The GNU authority's timestamp says: "This file has been added to the GNU mirror with timestamp Z"

Those are two very different kinds of information: "apparition time according to some authority" versus "visibility time for Software Heritage". By definition the first one is a single point in time, while the other is several points in time.

I think we also want to be able to store that "Software Heritage has visited this origin at times X, Y and Z, and has noticed this occurrence only at time X and Z (time Y saw another thing)". Basically, we want to be able to fill a grid with axes "visit time" and "branch name", and be able to link objects to that grid. Something along those lines:

\ Visit timet1t2t3t4
Branch Name \ Visit id1234occurrence_history contents
refs/heads/masterrev abcdrev abcdrev def0rev def0refs/heads/master rev abcd {1,2}; refs/heads/master rev def0 {3,4}
refs/tags/2.6.11-treedirectory f001directory f001emptydirectory f001refs/tags/2.6.11-tree dir f001 {1,2,4}
refs/heads/tmprev def0emptyemptyrev 0431refs/heads/tmp rev def0 {1}; refs/heads/tmp rev 0431 {4}

Therefore, I believe we need two tables:

create table origin_visit (
  origin bigint not null references origin(id),
  visit bigint not null,
  date timestamptz not null,
  primary key (origin, visit)
);

create table occurrence_validity (
  occurrence bigint not null references occurrence_history(object_id),
  authority uuid not null references entity(uuid),
  creation timestamptz not null,
  reason text not null
  primary key (occurrence, authority)
);

We can then modify the occurrence_history table to store an array of visits. We can even decide to index the contents of that array if its length ever becomes significant.

The reason field in occurrence_validity would be used to store "why" we think that the authority has seen the occurrence created on the given date.

full SQL code with the new schemata for origin_visit, occurrence_history and occurrence. Those three tables are implicitly relevant only for the "Software Heritage" authority.

create table origin_visit (
  origin bigint not null references origin(id),
  visit bigint not null,
  date timestamptz not null,
  primary key (origin, visit)
);

create table occurrence_history (
  origin bigint not null references origin(id),
  branch bytea not null,
  target sha1_git not null,
  target_type object_type not null,
  visits bigint[] not null, -- references origin_visit(visit) where origin_visit.origin = occurrence_history.origin 
  primary key (origin, branch, target, target_type)
);

create table occurrence (
  origin bigint not null references origin(id),
  branch bytea not null,
  target sha1_git not null,
  target_type object_type not null,
  primary key (origin, branch)
);

occurrence can be populated with a query such as

delete from occurrence where origin = origin_id;
insert into occurrence (origin, branch, target, target_type)
    select origin, branch, target, target_type
    from occurrence_history
    where origin = origin_id and
          any(visits) = (select visit
                         from origin_visit
                         where origin = origin_id
                         order by date desc 
                         limit 1);

or can even be updated on the fly when we register a new visit for a given origin.

AIUI, the occurrence_validity stuff is completely orthogonal to this.

In T72#3501, @olasd wrote:

full SQL code with the new schemata for origin_visit, occurrence_history and occurrence. Those three tables are implicitly relevant only for the "Software Heritage" authority.

This part looks good to me!

(I also agree that the occurrence_validity part is orthogonal, and propose to let it linger a bit more, while we go ahead implementing the visit part to unblock updates.)

olasd claimed this task.

This schema change is now done in production.

olasd changed the visibility from "All Users" to "Public (No Login Required)".May 13 2016, 5:05 PM