# Current problem
Currently, origin visits are the only mutable object type in our data model. A visit is first created with the "ongoing" status, then optionally updated several times with a partial snapshot, and finally has its status set to either "full" or "partial", with an optional snapshot.
This causes an issue with messages pushed to Kafka: when a topic is compacted, Kafka does not guarantee that the message preserved for a given visit is its last update, so the journal may end up with an outdated version of that visit.
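To make the failure mode concrete, here is a toy model of key-based compaction. It is not Kafka's actual implementation: it assumes compaction keeps, for each key, the record appended last, and that two writes for the same visit reached the log in the opposite order from the one in which they were made. The `compact` helper and the record layout are hypothetical.

```python
from typing import Dict, List, Tuple

Key = Tuple[str, int]  # (origin_url, visit_id)
Record = Tuple[Key, dict]  # the offset is the position in the list


def compact(log: List[Record]) -> Dict[Key, dict]:
    """Toy compaction: keep, per key, the record appended last."""
    compacted: Dict[Key, dict] = {}
    for key, value in log:  # later offsets overwrite earlier ones
        compacted[key] = value
    return compacted


key = ("https://example.org/repo.git", 1)
ongoing = {"status": "ongoing", "snapshot": None}
full = {"status": "full", "snapshot": "snp-1"}

# The visit went ongoing -> full, but the messages were appended full -> ongoing.
log = [(key, full), (key, ongoing)]

compacted = compact(log)
print(compacted[key]["status"])  # prints: ongoing
```

With a single mutable message per visit, nothing in the compacted topic reveals that the surviving record is stale.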
# Solution
The solution is to separate a visit object from its updates in the journal: one topic for the visits themselves (identified by `(origin_url, visit_id)`) and one for the successive updates (identified by `(origin_url, visit_id, update_id)`, where `update_id` must be totally ordered). This way, every visit update is stored in the journal, and the replayer preserves both the updates and their order, even though it replays them in an arbitrary order.
The current fields of origin visits would be split this way:
* the new "visit" objects would get the `type` (git/tar/...), maybe (?) `metadata`, and maybe a new `start_date` field
* the "visit_update" objects would get the other fields (`date`, `status`, `metadata` (?), `snapshot`).
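The split described in this list can be sketched as a function that takes a visit in its current, mutable shape and returns the two new objects. This is only an illustration: the dict layout and the `split_visit` name are hypothetical, and keeping `metadata` on the update side and reusing `date` as `start_date` are assumptions, since both points are still open.

```python
from typing import Any, Dict, Tuple


def split_visit(legacy: Dict[str, Any]) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Split a legacy visit dict into (visit, visit_update)."""
    visit = {
        "origin": legacy["origin"],
        "visit": legacy["visit"],
        "type": legacy["type"],
        # Assumption: the only known date doubles as the start date.
        "start_date": legacy["date"],
    }
    update = {
        "origin": legacy["origin"],
        "visit": legacy["visit"],
        "date": legacy["date"],
        "status": legacy["status"],
        "snapshot": legacy.get("snapshot"),
        # Assumption: metadata travels with the update, not the visit.
        "metadata": legacy.get("metadata"),
    }
    return visit, update


legacy_visit = {
    "origin": "https://example.org/repo.git",
    "visit": 1,
    "type": "git",
    "date": "2020-03-01T12:00:00+00:00",
    "status": "full",
    "snapshot": "snp-1",
    "metadata": None,
}
visit, update = split_visit(legacy_visit)
```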
There are two ways to make the replayer work with this:
1. ~~simply add a "version" field to origin visits, so that the replayer passes this version to `origin_visit_upsert`, and swh-storage discards the upsert if it is older than what is currently in the DB~~
2. make visit updates a first-class citizen of the data model and swh-storage, and always keep all updates
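Keeping every update is what makes the replay order irrelevant: if updates are stored under their full `(origin_url, visit_id, update_id)` key and the latest one is found by comparing `update_id`, any replay order produces the same state. A minimal sketch, assuming `update_id` is an integer (only a total order is required) and using hypothetical `replay` and `latest` helpers:

```python
import random
from typing import Dict, Iterable, Tuple

Key = Tuple[str, int, int]  # (origin_url, visit_id, update_id)


def replay(updates: Iterable[Tuple[Key, dict]]) -> Dict[Key, dict]:
    """Store every update under its full key; re-inserting is idempotent."""
    store: Dict[Key, dict] = {}
    for key, value in updates:
        store[key] = value
    return store


def latest(store: Dict[Key, dict], origin_url: str, visit_id: int) -> dict:
    """Return the update with the highest update_id for one visit."""
    candidates = [k for k in store if k[:2] == (origin_url, visit_id)]
    return store[max(candidates)]


origin = "https://example.org/repo.git"
updates = [
    ((origin, 1, 0), {"status": "ongoing"}),
    ((origin, 1, 1), {"status": "ongoing"}),
    ((origin, 1, 2), {"status": "full"}),
]

shuffled = updates[:]
random.shuffle(shuffled)

in_order = replay(updates)
out_of_order = replay(shuffled)
```

Both stores are equal, and `latest` returns the "full" update in either case, because no update ever overwrites another.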
The goal is to have a loader use the storage API like this:
```
# start
id = storage.origin_visit_add(OriginVisit(origin=origin_url, type=..., start_date=...))
# load some stuff
storage.origin_visit_update_add(OriginVisitUpdate(origin=origin_url, visit=id, date=..., status='ongoing', snapshot=..., metadata=None))
# load more stuff
storage.origin_visit_update_add(OriginVisitUpdate(origin=origin_url, visit=id, date=..., status='ongoing', snapshot=..., metadata=None))
# finish loading everything
storage.origin_visit_update_add(OriginVisitUpdate(origin=origin_url, visit=id, date=..., status='full', snapshot=..., metadata=None))
```
and readers (mostly just swh-web) would get this from the API:
```
visit = storage.origin_visit_get_latest(origin_url)
# visit == {'origin': origin_url, 'visit': id, 'start_date': ...}
update = storage.origin_visit_update_get_latest(origin_url, id)
# update == {'origin': origin_url, 'visit': id, 'date': ..., 'status': ..., 'snapshot': ..., 'metadata': ...}
```
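To check that the write and read sides fit together, the calls above can be exercised against a toy in-memory backend. This is a sketch under stated assumptions, not the swh-storage implementation: `ToyStorage` is hypothetical, visit ids are assumed to be per-origin counters starting at 1, and the latest update is assumed to be the one with the greatest `date`.

```python
from dataclasses import asdict, dataclass
from datetime import datetime, timedelta, timezone
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class OriginVisit:
    origin: str
    type: str
    start_date: datetime


@dataclass(frozen=True)
class OriginVisitUpdate:
    origin: str
    visit: int
    date: datetime
    status: str
    snapshot: Optional[str] = None
    metadata: Optional[dict] = None


class ToyStorage:
    """Hypothetical in-memory backend exposing the four proposed endpoints."""

    def __init__(self) -> None:
        self._visits: Dict[str, List[OriginVisit]] = {}
        self._updates: Dict[Tuple[str, int], List[OriginVisitUpdate]] = {}

    def origin_visit_add(self, visit: OriginVisit) -> int:
        visits = self._visits.setdefault(visit.origin, [])
        visits.append(visit)
        return len(visits)  # assumption: ids are per-origin counters from 1

    def origin_visit_update_add(self, update: OriginVisitUpdate) -> None:
        # Updates are only ever appended, never modified in place.
        self._updates.setdefault((update.origin, update.visit), []).append(update)

    def origin_visit_get_latest(self, origin: str) -> dict:
        visits = self._visits[origin]
        return {
            "origin": origin,
            "visit": len(visits),
            "start_date": visits[-1].start_date,
        }

    def origin_visit_update_get_latest(self, origin: str, visit: int) -> dict:
        updates = self._updates[(origin, visit)]
        # Assumption: updates are ordered by their date.
        return asdict(max(updates, key=lambda u: u.date))


storage = ToyStorage()
origin_url = "https://example.org/repo.git"
t0 = datetime(2020, 3, 1, tzinfo=timezone.utc)

visit_id = storage.origin_visit_add(
    OriginVisit(origin=origin_url, type="git", start_date=t0)
)
# Add the final update first to show that insertion order does not matter.
storage.origin_visit_update_add(
    OriginVisitUpdate(origin_url, visit_id, t0 + timedelta(minutes=5), "full", "snp-1")
)
storage.origin_visit_update_add(
    OriginVisitUpdate(origin_url, visit_id, t0, "ongoing")
)

latest_visit = storage.origin_visit_get_latest(origin_url)
latest_update = storage.origin_visit_update_get_latest(origin_url, visit_id)
```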
# Work plan
- [x] Amend `origin_visit_update` to make the status mandatory, like in the model: D2886 D2887 D2888 D2889 D2891
- [x] D2880: add OriginVisitUpdate to swh-model
- [ ] D2941: Update swh-model documentation
- [ ] D2879: storage*: Split `origin_visit` table in backends (internal change): new origin_visit_update table
  - [ ] D2937: in-memory
  - [ ] D2938: pg-storage
  - [ ] D2938: sql migration scripts
  - [ ] D2939: cassandra
- [ ] storage*: add `origin_visit_update_add` endpoint
- [ ] storage*: remove `origin_visit_upsert`
- [ ] journal: adapt `swh.journal.replayer` to use `origin_visit_update_add`
- [ ] model: remove fields of OriginVisit
- [ ] write: migrate loader code
- [ ] read: swh-web
- [ ] storage/journal: produce to new Kafka topic (`origin_visit_add`, `origin_visit_update`)
- [ ] Kafka topics: backfill `origin_visit_update`, overwrite `origin_visit` to remove fields (can be done at the end)
- [ ] journal: adapt replayer to new topics