# Current problem
Currently, origin visits are the only mutable object type in our data model. A visit is first created with the "ongoing" status, then optionally updated several times with a partial snapshot, and finally has its status set to either "full" or "partial", with an optional snapshot.
This causes an issue with messages pushed to Kafka, which does not guarantee that the last update of a message will be the one preserved when compacting, so the journal may end up with an outdated version of a visit.
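The failure mode can be illustrated with a toy model. This is a sketch only, not Kafka's actual compaction algorithm: it assumes messages are keyed by `(origin_url, visit_id)` and that compaction keeps whichever message for a key was written to the log last. The `compact` helper, the message layout and the example URL are hypothetical.

```python
# Toy model of key-based compaction; NOT Kafka's real implementation.
# Assumption: one message per visit state, keyed by (origin_url, visit_id).

def compact(log):
    """Keep only the last message written to the log for each key."""
    retained = {}
    for key, value in log:
        retained[key] = value  # a later write overwrites an earlier one
    return retained


key = ("https://example.org/repo.git", 1)

# The loader produced "ongoing" and then "full"; assume the two messages
# end up in the log in the opposite order.
log = [
    (key, {"status": "full", "snapshot": "snp-2"}),
    (key, {"status": "ongoing", "snapshot": "snp-1"}),
]

compacted = compact(log)
print(compacted[key]["status"])  # the stale "ongoing" state is what survives
```

With a single mutable record per key, nothing in the retained message tells a consumer that a newer state ever existed.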
# Solution
We should split a visit object from its updates in the journal: have one topic for the visits themselves (identified by `(origin_url, visit_id)`) and one for the successive updates (identified by `(origin_url, visit_id, update_id)`, where `update_id` must be totally ordered). This way, all visit updates would be stored in the journal, and both the updates and their order would be preserved by the replayer, even though they would be replayed in an arbitrary order.
The current fields of origin visits would be split this way:
* the new "visit" objects would get the `type` (git/tar/...), maybe (?) `metadata`, and maybe a new `start_date` field
* the "visit_update" objects would get the other fields (`date`, `status`, `metadata` (?), `snapshot`).
There are two ways to make the replayer work with this:
1. ~~simply add a "version" field to origin visits, so that when the replayer passes `origin_visit_upsert` this version, and swh-storage would discard the upsert if it's older than what is currently in the DB~~
2. make visit updates a first class citizen of the data model and swh-storage, and always keep all updates
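A minimal sketch of option 2, assuming each update is keyed by `(origin_url, visit_id, date)`, with `date` playing the role of the totally ordered `update_id` (an assumption made for illustration). Because every update is kept, the latest status is the same whatever order the replayer delivers the updates in. `replay` and `latest_status` are hypothetical helpers, not swh-storage functions.

```python
import itertools
from datetime import datetime, timezone


def replay(updates):
    """Store every update under its full key; nothing is ever overwritten."""
    store = {}
    for u in updates:
        store[(u["origin"], u["visit"], u["date"])] = u
    return store


def latest_status(store, origin, visit):
    """Return the update with the greatest date for the given visit."""
    candidates = [u for (o, v, _), u in store.items() if (o, v) == (origin, visit)]
    return max(candidates, key=lambda u: u["date"])


origin = "https://example.org/repo.git"  # hypothetical origin URL
updates = [
    {"origin": origin, "visit": 1, "status": "ongoing",
     "date": datetime(2020, 1, 1, 12, 0, tzinfo=timezone.utc)},
    {"origin": origin, "visit": 1, "status": "ongoing",
     "date": datetime(2020, 1, 1, 12, 5, tzinfo=timezone.utc)},
    {"origin": origin, "visit": 1, "status": "full",
     "date": datetime(2020, 1, 1, 12, 10, tzinfo=timezone.utc)},
]

# Replay the updates in every possible order and collect the final status.
final_statuses = {
    latest_status(replay(order), origin, 1)["status"]
    for order in itertools.permutations(updates)
}
print(final_statuses)
```

Whatever the arrival order, the only final status observed is "full", because ordering is decided by the key rather than by arrival time.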
# New data model for the visits
An origin-visit represents a run of a loader. It currently carries the following information:
* origin url
* visit id: unique for a given origin
* type (git, hg, ...)
* start_date: when the loader was started, shortly before it created the origin visit
* (snapshot): snapshot of all the branches already/currently loaded
* (metadata): associated metadata (unused)
Note: a field name wrapped in `(parentheses)` is optional.
An origin-visit-status represents a snapshot of a visit's loader at a point in time (sent from time to time by the loader, like a heartbeat). It has the fields:
* origin url
* visit id
* date: the timestamp of the snapshot of the loader task
* status: Status of the visit (possible values: created, ongoing, full, partial)
* (snapshot): snapshot of all the branches already/currently loaded
* (metadata): associated metadata (not used, kept for future update)
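The two field lists above can be sketched as immutable dataclasses. This is an illustration only: the class layout, the Python types (in particular `str` for the snapshot identifier) and the validation are assumptions, and the actual swh-model classes may differ.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

# Possible values of the status field, as listed above.
VISIT_STATUSES = ("created", "ongoing", "full", "partial")


@dataclass(frozen=True)
class OriginVisit:
    origin: str                      # origin url
    visit: int                       # unique for a given origin
    type: str                        # git, hg, ...
    start_date: datetime             # when the loader was started
    snapshot: Optional[str] = None   # optional; identifier type is an assumption
    metadata: Optional[dict] = None  # optional, currently unused


@dataclass(frozen=True)
class OriginVisitStatus:
    origin: str
    visit: int
    date: datetime                   # timestamp of this status
    status: str                      # one of VISIT_STATUSES
    snapshot: Optional[str] = None
    metadata: Optional[dict] = None

    def __post_init__(self):
        # Reject statuses outside the documented set.
        if self.status not in VISIT_STATUSES:
            raise ValueError(f"invalid status: {self.status!r}")


status = OriginVisitStatus(
    origin="https://example.org/repo.git",  # hypothetical origin URL
    visit=1,
    date=datetime(2020, 1, 1, tzinfo=timezone.utc),
    status="ongoing",
)
print(status.status, status.snapshot)
```

Freezing both classes reflects the intent that a visit is never modified after creation and that a change of state is expressed by adding a new status object.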
The following operations are supported on origin visits:
* creating a visit (to get a unique id from the storage)
* getting a visit from its origin url + id
* listing visits of an origin (with filters, order, etc.)
* upserting a visit (to add a visit with a predetermined id, needed for the replayer)
Note that there is no equivalent of the current `origin_visit_update()` endpoint, nor any need for one, as origin visits are now immutable.
The following operations are supported on origin visit updates (statuses):
* adding a new status (using the origin url and visit id; there must be an override to allow the replayer to add an update for a visit that doesn't exist yet)
* getting the last status of a visit (so one can know if it's completed and get the id of its snapshot)
* listing all statuses of a visit (?) (there is no need for it yet)
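The operations listed above can be sketched with a toy in-memory storage. The method names follow the endpoints mentioned in this task, but the signatures, the dict-based objects and the `allow_missing_visit` flag (standing in for the replayer override) are assumptions, not the swh-storage API.

```python
from datetime import datetime, timezone


class InMemoryVisitStorage:
    """Toy storage: visits are immutable, statuses are append-only."""

    def __init__(self):
        self._visits = {}    # (origin, visit_id) -> visit dict
        self._statuses = {}  # (origin, visit_id) -> list of status dicts

    def origin_visit_add(self, origin, type, start_date):
        """Create a visit and return an id unique for this origin."""
        next_id = 1 + max((v for (o, v) in self._visits if o == origin), default=0)
        self._visits[(origin, next_id)] = {
            "origin": origin, "visit": next_id,
            "type": type, "start_date": start_date,
        }
        return next_id

    def origin_visit_upsert(self, visit):
        """Add a visit with a predetermined id (needed by the replayer)."""
        self._visits[(visit["origin"], visit["visit"])] = dict(visit)

    def origin_visit_get(self, origin, visit_id):
        return self._visits.get((origin, visit_id))

    def origin_visit_status_add(self, status, allow_missing_visit=False):
        """Append a status; the flag is a hypothetical replayer override."""
        key = (status["origin"], status["visit"])
        if key not in self._visits and not allow_missing_visit:
            raise ValueError(f"unknown visit: {key}")
        self._statuses.setdefault(key, []).append(dict(status))

    def origin_visit_status_get_latest(self, origin, visit_id):
        """Return the status with the greatest date, or None if there is none."""
        statuses = self._statuses.get((origin, visit_id), [])
        return max(statuses, key=lambda s: s["date"], default=None)


storage = InMemoryVisitStorage()
origin = "https://example.org/repo.git"  # hypothetical origin URL
t0 = datetime(2020, 1, 1, tzinfo=timezone.utc)

visit_id = storage.origin_visit_add(origin, "git", t0)
storage.origin_visit_status_add(
    {"origin": origin, "visit": visit_id, "date": t0.replace(hour=1),
     "status": "ongoing", "snapshot": None})
storage.origin_visit_status_add(
    {"origin": origin, "visit": visit_id, "date": t0.replace(hour=2),
     "status": "full", "snapshot": "snp-1"})

latest = storage.origin_visit_status_get_latest(origin, visit_id)
print(latest["status"], latest["snapshot"])
```

Since statuses are only ever appended, "getting the last status" reduces to picking the entry with the greatest date, regardless of insertion order.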
# Example
Loaders will use the storage API like this:
```
# start
id = storage.origin_visit_add(OriginVisit(origin=origin_url, type=..., start_date=...))
# load some stuff
storage.origin_visit_status_add(OriginVisitUpdate(origin=origin_url, visit=id, date=..., status='ongoing', snapshot=..., metadata=None))
# load more stuff
storage.origin_visit_status_add(OriginVisitUpdate(origin=origin_url, visit=id, date=..., status='ongoing', snapshot=..., metadata=None))
# finish loading everything
storage.origin_visit_status_add(OriginVisitUpdate(origin=origin_url, visit=id, date=..., status='full', snapshot=..., metadata=None))
```
and readers (mostly just swh-web) would get this from the API:
```
visit = storage.origin_visit_get_latest(origin_url)
# visit == {'origin': origin_url, 'visit': id, 'start_date': ...}
status = storage.origin_visit_status_get_latest(origin_url, id)
# status == {'origin': origin_url, 'visit': id, 'date': ..., 'status': ..., 'snapshot': ..., 'metadata': ...}
```
# Work plan
- [x] Amend `origin_visit_update` to make the status mandatory, like in the model: D2886 D2887 D2888 D2889 D2891
- [x] D2880: add OriginVisitUpdate to swh-model
- [x] D2941: Update swh-model documentation
- [ ] D2879: storage*: Split `origin_visit` table in backends (internal change): new origin_visit_status table
  - [x] D2937: in-memory
  - [x] D2938: pg-storage
  - [x] D2938: sql migration scripts
  - [x] D2939: cassandra
- [ ] Deploy and migrate data
- [ ] storage*: add origin_visit_status_add endpoint
- [ ] storage*: remove origin_visit_upsert
- [ ] journal: adapt swh.journal.replayer to use origin_visit_status_add
- [ ] model: remove fields of OriginVisit
- [ ] write: migrate loader code
- [ ] read: swh-web
- [ ] storage/journal: produce to new kafka topic (origin_visit_add, origin_visit_status)
- [ ] kafka topics: backfill `origin_visit_status`, overwrite `origin_visit` to remove fields (can be done at the end)
- [ ] journal: adapt replayer to new topics