occurrence history table: use an array of timestamps instead of intervals
Closed, MigratedEdits Locked
Actions

Assigned To

Authored By

	zack
	Oct 6 2015, 10:52 AM

Description

Rationale: with intervals we do lose information. Also, given the update frequency for each repository is not necessarily going to be stable, it is not clear how to interpret intervals.
We should worry about "compressing" timestamps using intervals the day storing timestamps become a problem, and not before.

(Strictly speaking, the SQL modeling does not need to be based on arrays, but arrays are likely the more efficient representation for this. To be verified.)

Related Objects
Search...

		Status	Assigned	Task
		Migrated	gitlab-migration	T286 view /browse/origin/ORIGIN_ID/ - calendar-like UI
		Migrated	gitlab-migration	T72 occurrence history table: use an array of timestamps instead of intervals

Event Timeline

zack created this task.Oct 6 2015, 10:52 AM

zack raised the priority of this task from to Low.

zack updated the task description. (Show Details)

zack added projects: Developers, Storage manager.

Herald added a project: Staff. · View Herald TranscriptOct 6 2015, 10:52 AM

Our main query on occurrences is looking for occurrences that are

from a given origin
on a given branch (or all branches)
that are the newest, or the closest to a given timestamp.

That last condition is the hardest to have in SQL and we need to make sure that our representation of timestamps is relevant for it. Indexing the contents of an array is very inefficient in postgres, and we can't really do more than indexing inclusion. If the first two conditions return few enough rows, then using an array for timestamps should not be an issue.

Another possibility would be to use list_history, which will only have one line per origin per timestamp, and make the occurrence_history column an indexed "array foreign key" to list_history. We can work with indexing array contents then, as the timestamp search would be done on list_history which has a single timestamp column.

Still thinking about this .

The timestamps in occurrence_history seem to conflate different concepts.

The Software Heritage authority's timestamp says: "We have retrieved the refs of this repository at date X, and we have noticed that they pointed to the object Y"
The GNU authority's timestamp says: "This file has been added to the GNU mirror with timestamp Z"

Those are two very different kinds of information: "apparition time according to some authority" versus "visibility time for Software Heritage". By definition the first one is a single point in time, while the other is several points in time.

I think we also want to be able to store that "Software Heritage has visited this origin at times X, Y and Z, and has noticed this occurrence only at time X and Z (time Y saw another thing)". Basically, we want to be able to fill a grid with axes "visit time" and "branch name", and be able to link objects to that grid. Something along those lines:

Branch Name \ Visit id	1	2	3	4	occurrence_history contents
\ Visit time	t1	t2	t3	t4
refs/heads/master	rev abcd	rev abcd	rev def0	rev def0	refs/heads/master rev abcd {1,2}; refs/heads/master rev def0 {3,4}
refs/tags/2.6.11-tree	directory f001	directory f001	~~empty~~	directory f001	refs/tags/2.6.11-tree dir f001 {1,2,4}
refs/heads/tmp	rev def0	~~empty~~	~~empty~~	rev 0431	refs/heads/tmp rev def0 {1}; refs/heads/tmp rev 0431 {4}

Therefore, I believe we need two tables:

create table origin_visit (
  origin bigint not null references origin(id),
  visit bigint not null,
  date timestamptz not null,
  primary key (origin, visit)
);

create table occurrence_validity (
  occurrence bigint not null references occurrence_history(object_id),
  authority uuid not null references entity(uuid),
  creation timestamptz not null,
  reason text not null
  primary key (occurrence, authority)
);

We can then modify the occurrence_history table to store an array of visits. We can even decide to index the contents of that array if its length ever becomes significant.

The reason field in occurrence_validity would be used to store "why" we think that the authority has seen the occurrence created on the given date.

full SQL code with the new schemata for origin_visit, occurrence_history and occurrence. Those three tables are implicitly relevant only for the "Software Heritage" authority.

create table origin_visit (
  origin bigint not null references origin(id),
  visit bigint not null,
  date timestamptz not null,
  primary key (origin, visit)
);

create table occurrence_history (
  origin bigint not null references origin(id),
  branch bytea not null,
  target sha1_git not null,
  target_type object_type not null,
  visits bigint[] not null, -- references origin_visit(visit) where origin_visit.origin = occurrence_history.origin 
  primary key (origin, branch, target, target_type)
);

create table occurrence (
  origin bigint not null references origin(id),
  branch bytea not null,
  target sha1_git not null,
  target_type object_type not null,
  primary key (origin, branch)
);

occurrence can be populated with a query such as

delete from occurrence where origin = origin_id;
insert into occurrence (origin, branch, target, target_type)
    select origin, branch, target, target_type
    from occurrence_history
    where origin = origin_id and
          any(visits) = (select visit
                         from origin_visit
                         where origin = origin_id
                         order by date desc 
                         limit 1);

or can even be updated on the fly when we register a new visit for a given origin.

AIUI, the occurrence_validity stuff is completely orthogonal to this.

In T72#3501, @olasd wrote:

full SQL code with the new schemata for origin_visit, occurrence_history and occurrence. Those three tables are implicitly relevant only for the "Software Heritage" authority.

This part looks good to me!

(I also agree that the occurrence_validity part is orthogonal, and propose to let it linger a bit more, while we go ahead implementing the visit part to unblock updates.)

olasd mentioned this in rDSTO92114e983d3f: sql: Use a new schema for occurrences.Jan 27 2016, 5:53 PM

ardumont added a parent task: T286: view /browse/origin/ORIGIN_ID/ - calendar-like UI.Jan 28 2016, 4:21 PM

This schema change is now done in production.

zack removed projects: Developers, Staff.Mar 10 2016, 5:53 PM

olasd changed the visibility from "All Users" to "Public (No Login Required)".May 13 2016, 5:05 PM

qcampos mentioned this in rDOBJS273cb6f0594f: sql: Use a new schema for occurrences.Jun 27 2016, 1:46 PM

This task has been migrated to GitLab.

occurrence history table: use an array of timestamps instead of intervalsClosed, MigratedEdits LockedActions

Description

Related ObjectsSearch...

Event Timeline

occurrence history table: use an array of timestamps instead of intervals
Closed, MigratedEdits Locked
Actions

Related Objects
Search...