Refactor the origin visit data model (aka get rid of the OriginVisit model object)
Open, High, Public

Description

This model object represents the notion of "one visit of an origin". The current model looks like:

+--------+     +--------------+     +-------------------+
| Origin |     | OriginVisit  |     | OriginVisitStatus |
+--------+     +--------------+     +-------------------+
| - id <-|-----|-- origin     |     | - origin (same)   |
| - url  |     | - visit <----|-----|-- visit           |
|        |     | - type       |     | - type (same)     |
|        |     | - date       |     | - date            |
|        |     |              |     | - status          |
|        |     |              |     | - metadata        |
|        |     |              |     | - snapshot        |
+--------+     +--------------+     +-------------------+

In this model, an Origin (O) can have several OriginVisit objects (OV), i.e. the cardinality is Origin <-(*-1)- OriginVisit, and each visit can in turn have several OriginVisitStatus objects (OVS), with the same cardinality.

The only attribute of an OriginVisit that is not duplicated in the OriginVisitStatus objects related to this visit is the date. However, the current implementation packs the creation of the first OriginVisitStatus object together with the creation of the OriginVisit it is related to. So in practice, the OriginVisit object does not carry any useful information.
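The redundancy described above can be sketched with plain Python dataclasses. This is only an illustration of the diagram, not the actual model classes: the field types, the example URL, the status values and the assumption that the first status is stamped with the visit's own date are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class Origin:
    id: int
    url: str


@dataclass
class OriginVisit:
    origin: int      # references Origin.id
    visit: int       # per-origin visit counter
    type: str
    date: datetime


@dataclass
class OriginVisitStatus:
    origin: int      # same value as OriginVisit.origin
    visit: int       # references OriginVisit.visit
    type: str        # same value as OriginVisit.type
    date: datetime
    status: str
    metadata: Optional[dict] = None   # assumed type
    snapshot: Optional[bytes] = None  # assumed type


# Hypothetical data: one origin, one visit, two statuses for that visit.
origin = Origin(id=1, url="https://example.org/repo.git")
t0 = datetime(2022, 7, 1, 12, 0, tzinfo=timezone.utc)
visit = OriginVisit(origin=origin.id, visit=1, type="git", date=t0)

# Assumption: the first status is created together with the visit,
# so it carries the same date as the visit.
statuses = [
    OriginVisitStatus(origin.id, 1, "git", t0, "created"),
    OriginVisitStatus(origin.id, 1, "git", t0 + timedelta(minutes=5), "full",
                      snapshot=bytes(20)),
]

# Every OriginVisit field can be recovered from the first status alone.
first = min(statuses, key=lambda s: s.date)
rebuilt = OriginVisit(first.origin, first.visit, first.type, first.date)
visit_is_redundant = rebuilt == visit
```

Under these assumptions, `rebuilt` compares equal to `visit`, which is the sense in which the OriginVisit row adds nothing beyond what its statuses already store.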

In this model, the OV only holds a per-origin visit counter, whose main purpose is probably to ease pagination in the origin visit API.

It seems clear that the OriginVisit object is not very useful; moreover, it can make the replayer process (used for mirrors) harder to implement correctly.

Possible evolution (from olasd)

A possible model would be to get rid of the OriginVisit object, use the origin URL directly (instead of the origin pkey id), and replace the visit id with a UUID, so that there is no need to maintain a (reliable) counter any more.

A simple solution could look like:

OriginVisitStatus:
  origin: str # origin url
  status_date: datetime
  type: enum
  visit_id: UUID
  status: enum
  snapshot_id: sha1

where the primary key is (origin, status_date, type, visit_id).
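As a rough sketch of how the proposed schema could behave, the following Python models it as a frozen dataclass. The field types, the `primary_key` and `group_by_visit` helpers, the example URL and the status values are assumptions made for illustration only; they are not part of the proposal or of any existing code.

```python
import uuid
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass(frozen=True)
class OriginVisitStatus:
    origin: str                 # origin URL, used directly
    status_date: datetime
    type: str
    visit_id: uuid.UUID         # replaces the per-origin counter
    status: str
    snapshot_id: Optional[bytes] = None

    def primary_key(self):
        # Composite key as stated in the proposal.
        return (self.origin, self.status_date, self.type, self.visit_id)


def group_by_visit(statuses):
    """Rebuild visits by grouping statuses on visit_id (hypothetical helper)."""
    visits = defaultdict(list)
    for s in statuses:
        visits[s.visit_id].append(s)
    for history in visits.values():
        history.sort(key=lambda s: s.status_date)
    return dict(visits)


url = "https://example.org/repo.git"
t0 = datetime(2022, 7, 1, 12, 0, tzinfo=timezone.utc)

# A visit id is generated locally; no shared counter has to be consulted.
first_visit = uuid.uuid4()
second_visit = uuid.uuid4()

statuses = [
    OriginVisitStatus(url, t0 + timedelta(minutes=5), "git", first_visit,
                      "full", bytes(20)),
    OriginVisitStatus(url, t0, "git", first_visit, "created"),
    OriginVisitStatus(url, t0 + timedelta(days=1), "git", second_visit,
                      "created"),
]

visits = group_by_visit(statuses)
```

In this sketch a "visit" is simply the set of statuses sharing a `visit_id`, so writers that generate their own UUIDs never need to agree on a counter value.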

Event Timeline

douardda triaged this task as High priority. Jul 1 2022, 4:35 PM
douardda created this task.