Currently, origin visits are the only mutable object type in our data model. A visit is first created with the "ongoing" status, then optionally updated several times with a partial snapshot, and finally has its status set to either "full" or "partial", with an optional snapshot.
This causes an issue with messages pushed to Kafka, which does not guarantee that the last update of a message will be the one preserved when compacting, so the journal may end up with an outdated version of a visit.
The solution is to split a visit object from its updates in the journal: have one topic for the visits themselves (identified by `(origin_url, visit_id)`), and one for the successive updates (identified by `(origin_url, visit_id, update_id)`, where `update_id` must be totally ordered). This way, all visit updates would be stored in the journal, and both the updates and their order would be preserved by the replayer, even though they will be replayed in an arbitrary order.
The current fields of origin visits would be split this way:
* the new "visit" objects would get the `type` (git/tar/...), maybe (?) `metadata`, and maybe a new `start_date` field
* the "visit_update" objects would get the other fields (`date`, `status`, `metadata` (?), `snapshot`).
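A minimal sketch of the split described above, to make the two keys concrete. The field names come from the lists above; the Python types, the integer `update_id`, the example URL, and the snapshot identifiers are assumptions for illustration only, and `metadata` is left out because its placement is still undecided.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional, Tuple


@dataclass(frozen=True)
class OriginVisit:
    """Immutable part of a visit, keyed by (origin_url, visit_id)."""

    origin_url: str
    visit_id: int
    type: str  # e.g. "git", "tar"

    def key(self) -> Tuple[str, int]:
        return (self.origin_url, self.visit_id)


@dataclass(frozen=True)
class OriginVisitUpdate:
    """One update of a visit, keyed by (origin_url, visit_id, update_id)."""

    origin_url: str
    visit_id: int
    update_id: int  # assumption: any totally ordered value would do
    date: datetime
    status: str  # "ongoing", "full" or "partial"
    snapshot: Optional[str] = None

    def key(self) -> Tuple[str, int, int]:
        return (self.origin_url, self.visit_id, self.update_id)


url = "https://example.org/repo.git"  # hypothetical origin
visit = OriginVisit(url, 1, "git")
updates = [
    OriginVisitUpdate(url, 1, 0, datetime(2020, 1, 1, tzinfo=timezone.utc), "ongoing"),
    OriginVisitUpdate(
        url, 1, 1, datetime(2020, 1, 2, tzinfo=timezone.utc), "ongoing", "snp-partial"
    ),
    OriginVisitUpdate(
        url, 1, 2, datetime(2020, 1, 3, tzinfo=timezone.utc), "full", "snp-final"
    ),
]

# Each update has its own key, so no two messages compete for the same key.
update_keys = {update.key() for update in updates}
```

With this layout, neither topic ever carries two different versions of the same key, so compaction has nothing to choose between.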
There are two ways to make the replayer work with this:
1. ~~simply add a "version" field to origin visits, so that the replayer passes this version to `origin_visit_upsert`, and swh-storage discards the upsert if it is older than what is currently in the DB~~
2. make visit updates a first-class citizen of the data model and swh-storage, and always keep all updates
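To illustrate why option 2 tolerates an arbitrary replay order, here is a rough in-memory sketch. It is not the swh-storage API: the class `InMemoryVisitUpdates` and its methods are hypothetical, and `update_id` is assumed to be an integer. Every update is kept, re-delivering an update is a no-op, and the current state of a visit is derived from the highest `update_id`.

```python
from typing import Dict, Optional, Tuple

VisitKey = Tuple[str, int]
UpdateValue = Tuple[str, Optional[str]]  # (status, snapshot)


class InMemoryVisitUpdates:
    """Hypothetical store that keeps every update of every visit."""

    def __init__(self) -> None:
        # (origin_url, visit_id) -> {update_id: (status, snapshot)}
        self._updates: Dict[VisitKey, Dict[int, UpdateValue]] = {}

    def add(
        self,
        origin_url: str,
        visit_id: int,
        update_id: int,
        status: str,
        snapshot: Optional[str] = None,
    ) -> None:
        """Insert one update; adding the same update twice changes nothing."""
        per_visit = self._updates.setdefault((origin_url, visit_id), {})
        per_visit.setdefault(update_id, (status, snapshot))

    def latest(self, origin_url: str, visit_id: int) -> Optional[UpdateValue]:
        """Return the update with the highest update_id, or None."""
        per_visit = self._updates.get((origin_url, visit_id))
        if not per_visit:
            return None
        return per_visit[max(per_visit)]

    def count(self, origin_url: str, visit_id: int) -> int:
        """Number of distinct updates stored for a visit."""
        return len(self._updates.get((origin_url, visit_id), {}))


store = InMemoryVisitUpdates()
url = "https://example.org/repo.git"  # hypothetical origin

# Replayed out of order, with one duplicate delivery.
for update_id, status, snapshot in [
    (2, "full", "snp-final"),
    (0, "ongoing", None),
    (1, "ongoing", "snp-partial"),
    (0, "ongoing", None),
]:
    store.add(url, 1, update_id, status, snapshot)

latest = store.latest(url, 1)
```

Because nothing is ever overwritten, no version comparison is needed on insert, which is the main difference from the struck-out option 1.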
----
The plan:
* [x] Amend `origin_visit_update` to make the status mandatory, like in the model: D2886 D2887 D2888 D2889 D2891
* [ ] split the `origin_visit` table in the backends (with no change in the API)
* [ ] D2879: implementation-wise, internal changes in origin_visit_add, origin_visit_update, origin_visit_upsert, and origin_visit_get*
  - [ ] D2937: in-memory
  - [ ] D2938: pg-storage
  - [ ] D2939: cassandra
* [ ] D2938: SQL migration scripts (sql,...)
* [ ] D2880, D2941: add OriginVisitUpdate to swh-model
* [ ] add origin_visit_update_add
* [ ] remove origin_visit_upsert
* [ ] adapt swh.journal.replayer
* [ ] remove fields of OriginVisit
* [ ] write: migrate loader code
* [ ] read: swh-web
* [ ] produce to new Kafka topic (origin_visit_add, origin_visit_update)
* [ ] Kafka topics: backfill origin_visit_update, overwrite origin_visit to remove fields (can be done at the end)
* [ ] adapt replayer to the new topics