# Current problem
Currently, origin visits are the only mutable object type in our data model. A visit is first created with the "ongoing" status, then optionally updated several times with a partial snapshot, and finally has its status set to either "full" or "partial", with an optional snapshot.
This causes an issue with messages pushed to Kafka, which does not guarantee that the last update of a message will be the one preserved when compacting, so the journal may end up with an outdated version of a visit.
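To make the failure mode concrete, here is a minimal sketch in plain Python (no Kafka client involved). It assumes a simplified model of compaction in which only the most recently written message per key survives, and hypothetical message dicts keyed by `(origin, visit)`. In this scenario an older version of a visit is written after a newer one (for instance because of a retry), so the compacted topic keeps the stale version.

```python
def compact(log):
    """Simplified compaction: keep only the last-written message per key."""
    latest = {}
    for key, value in log:
        latest[key] = value  # later writes overwrite earlier ones
    return latest


key = ("https://example.org/repo.git", 1)  # hypothetical (origin, visit) key

# Messages in the order they were written to the topic. The "full" version
# was written before a delayed "ongoing" version of the same visit.
log = [
    (key, {"status": "ongoing", "snapshot": None}),
    (key, {"status": "full", "snapshot": "snp-2"}),
    (key, {"status": "ongoing", "snapshot": "snp-1"}),  # late, outdated write
]

compacted = compact(log)
print(compacted[key]["status"])  # prints "ongoing": the final state is lost
```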
# Solution
We should split a visit object from its updates in the journal: one topic for the visits themselves (identified by `(origin_url, visit_id)`), and one for the successive updates (identified by `(origin_url, visit_id, update_id)`, where `update_id` must be totally ordered). This way, all visit updates are stored in the journal, and the replayer preserves both the updates and their order, even though it replays them in an arbitrary order.
The current fields of origin visits would be split this way:
* the new "visit" objects would get the `type` (git/tar/...), maybe (?) `metadata`, and maybe a new `start_date` field
* the "visit_update" objects would get the other fields (`date`, `status`, `metadata` (?), `snapshot`).
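The benefit of giving each update its own key can be sketched as follows. This is an illustration under assumptions, not the journal's actual code: updates are plain dicts, `update_id` is an integer, and `latest_update` is a hypothetical helper. Because no update shares a key with another, none is overwritten, and the total order on `update_id` lets a consumer recover the final state whatever the delivery order.

```python
import random


def latest_update(updates):
    """Return, for each (origin_url, visit_id), the update with the greatest update_id."""
    latest = {}
    for upd in updates:
        visit_key = (upd["origin_url"], upd["visit_id"])
        if visit_key not in latest or upd["update_id"] > latest[visit_key]["update_id"]:
            latest[visit_key] = upd
    return latest


origin = "https://example.org/repo.git"  # hypothetical origin
updates = [
    {"origin_url": origin, "visit_id": 1, "update_id": 1, "status": "ongoing"},
    {"origin_url": origin, "visit_id": 1, "update_id": 2, "status": "ongoing"},
    {"origin_url": origin, "visit_id": 1, "update_id": 3, "status": "full"},
]

# A replayer may receive the updates in any order and still end up with the
# same final state, since it compares update_id rather than arrival order.
shuffled = updates[:]
random.shuffle(shuffled)
assert latest_update(shuffled) == latest_update(updates)
print(latest_update(shuffled)[(origin, 1)]["status"])  # prints "full"
```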
There are two ways to make the replayer work with this:
1. ~~simply add a "version" field to origin visits, so that the replayer passes this version to `origin_visit_upsert`, and swh-storage discards the upsert if it is older than what is currently in the DB~~
2. make visit updates a first-class citizen of the data model and swh-storage, and always keep all updates
# New data model for the visits
An origin-visit represents a run of a loader. It currently carries the following information:
* origin url
* visit id: unique for a given origin
* type (git, hg, ...)
* start_date: when the loader was started, shortly before it created the origin visit
* (snapshot): snapshot of all the branches already/currently loaded
* (metadata): associated metadata (unused)
Note: a property name wrapped in `(parentheses)` is optional.
An origin-visit-status represents a snapshot of a visit's loader at a point in time (sent from time to time by the loader, like a heartbeat). It has the following fields:
* origin url
* visit id
* date: the timestamp of the snapshot of the loader task
* status: Status of the visit (possible values: created, ongoing, full, partial)
* (snapshot): snapshot of all the branches already/currently loaded
* (metadata): associated metadata (not used, kept for future update)
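The two field lists above can be written down as immutable records. The sketch below uses standard-library dataclasses; the class names `VisitSketch` and `VisitStatusSketch` are hypothetical stand-ins (not the swh-model classes), and representing a snapshot by an identifier string is an assumption.

```python
from dataclasses import FrozenInstanceError, dataclass
from datetime import datetime, timezone
from typing import Optional

VALID_STATUSES = ("created", "ongoing", "full", "partial")


@dataclass(frozen=True)
class VisitSketch:
    """One run of a loader (hypothetical stand-in, not the swh-model class)."""
    origin: str
    visit: int
    type: str
    start_date: datetime
    snapshot: Optional[str] = None   # optional; assumed to be an identifier string
    metadata: Optional[dict] = None  # optional


@dataclass(frozen=True)
class VisitStatusSketch:
    """State of a visit's loader at a point in time (hypothetical stand-in)."""
    origin: str
    visit: int
    date: datetime
    status: str
    snapshot: Optional[str] = None
    metadata: Optional[dict] = None

    def __post_init__(self):
        # Reject anything outside the four statuses listed above.
        if self.status not in VALID_STATUSES:
            raise ValueError(f"unknown status: {self.status!r}")


now = datetime(2020, 1, 1, tzinfo=timezone.utc)
visit = VisitSketch(origin="https://example.org/repo.git", visit=1, type="git", start_date=now)
status = VisitStatusSketch(origin=visit.origin, visit=visit.visit, date=now, status="ongoing")

try:
    visit.type = "hg"  # frozen dataclasses reject attribute assignment
except FrozenInstanceError:
    print("visits are immutable")
```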
The following operations are supported on origin visits:
* creating a visit (to get a unique id from the storage)
* getting a visit from its origin url + id
* listing visits of an origin (with filters, order, etc.)
* upserting a visit (to add a visit with a predetermined id, needed for the replayer)
Note that there is no equivalent of the current `origin_visit_update()` endpoint, and no need for one, as origin visits are now immutable.
and on origin visit updates:
* adding a new status (using the origin url and visit id; there must be an override to allow the replayer to add an update for a visit that doesn't exist yet)
* getting the last status of a visit (so one can know if it's completed and get the id of its snapshot)
* listing all statuses of a visit (?) (there is no need for it yet)
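A toy in-memory storage can illustrate how these operations fit together. This is a sketch under assumptions, not swh-storage's implementation: the class name, the method signatures, and the `allow_missing_visit` flag (standing in for the replayer override mentioned above) are all hypothetical, and "latest" is taken to mean "greatest `date`".

```python
from datetime import datetime, timezone


class InMemoryVisitStorage:
    """Toy storage; hypothetical, not swh-storage's implementation."""

    def __init__(self):
        self._visits = {}    # (origin, visit id) -> visit dict
        self._statuses = {}  # (origin, visit id) -> list of status dicts

    def origin_visit_add(self, origin, visit_type, start_date, visit=None):
        """Create a visit; pass `visit` to reuse a predetermined id (replayer)."""
        if visit is None:
            existing = [v for (o, v) in self._visits if o == origin]
            visit = max(existing, default=0) + 1
        record = {"origin": origin, "visit": visit,
                  "type": visit_type, "start_date": start_date}
        self._visits.setdefault((origin, visit), record)  # never overwritten
        return visit

    def origin_visit_get(self, origin, visit):
        return self._visits.get((origin, visit))

    def origin_visit_list(self, origin):
        return sorted((v for (o, _), v in self._visits.items() if o == origin),
                      key=lambda v: v["visit"])

    def origin_visit_status_add(self, origin, visit, date, status,
                                snapshot=None, allow_missing_visit=False):
        """Append a status; statuses are only ever added, never modified."""
        if (origin, visit) not in self._visits and not allow_missing_visit:
            raise KeyError(f"unknown visit {visit} for {origin}")
        self._statuses.setdefault((origin, visit), []).append(
            {"origin": origin, "visit": visit, "date": date,
             "status": status, "snapshot": snapshot, "metadata": None})

    def origin_visit_status_get_latest(self, origin, visit):
        statuses = self._statuses.get((origin, visit), [])
        return max(statuses, key=lambda s: s["date"], default=None)


def at(minute):
    return datetime(2020, 1, 1, 12, minute, tzinfo=timezone.utc)


storage = InMemoryVisitStorage()
origin = "https://example.org/repo.git"  # hypothetical origin
visit = storage.origin_visit_add(origin, "git", at(0))
storage.origin_visit_status_add(origin, visit, at(5), "full", snapshot="snp-2")
storage.origin_visit_status_add(origin, visit, at(1), "ongoing", snapshot="snp-1")  # arrives late
# The replayer may add a status before the visit itself has been replayed.
storage.origin_visit_status_add(origin, 7, at(2), "ongoing", allow_missing_visit=True)

print(storage.origin_visit_status_get_latest(origin, visit)["status"])  # prints "full"
```

Because statuses are appended rather than overwritten, the late "ongoing" status does not hide the "full" one.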
# Example
Loaders will use the storage API like this:
```
# start
id = storage.origin_visit_add(OriginVisit(origin=origin_url, type=..., start_date=...))
# load some stuff
storage.origin_visit_status_add(OriginVisitStatus(origin=origin_url, visit=id, date=..., status='ongoing', snapshot=..., metadata=None))
# load more stuff
storage.origin_visit_status_add(OriginVisitStatus(origin=origin_url, visit=id, date=..., status='ongoing', snapshot=..., metadata=None))
# finish loading everything
storage.origin_visit_status_add(OriginVisitStatus(origin=origin_url, visit=id, date=..., status='full', snapshot=..., metadata=None))
```
and readers (mostly just swh-web) would get this from the API:
```
# returns {'origin': origin_url, 'visit': id, 'start_date': ...}
visit = storage.origin_visit_get_latest(origin_url)
# returns {'origin': origin_url, 'visit': id, 'date': ..., 'status': ..., 'snapshot': ..., 'metadata': ...}
status = storage.origin_visit_status_get_latest(origin_url, id)
```
# Work plan
- [x] Amend `origin_visit_update` to make the status mandatory, like in the model: D2886 D2887 D2888 D2889 D2891
- [x] D2880, D3001: add OriginVisitStatus to swh-model
- [x] D2941: Update swh-model documentation
- [x] D2879: storage*: Split `origin_visit` table in backends (internal change): new origin_visit_status table
  - [x] D2937: in-memory
  - [x] D2938: pg-storage
  - [x] D2938: sql migration scripts
  - [x] D2939: cassandra
- [x] D3101: pg-storage: Write both origin-visit and origin-visit-status in parallel
- [x] Deploy storage (and migrate data, this now can occur while loaders are running)
- [x] migrate remaining data (data that did not get migrated during the first migration, while loaders continued their work)
- [x] D3180: pg-storage: Switch over the read queries to the new tables (=> revert D3101)
- [x] Deploy storage
- [x] D3212: storage*: add origin_visit_status_add endpoint
- [x] D3238: storage*: make origin-visit-add write origin-visit-status as well
- [x] D3244: storage*: make origin-visit-update write origin-visit-status as well
- [x] D3251: storage*: make origin-visit-upsert write origin-visit-status as well
- [x] Deploy storage
- [x] Migrate swh code away from origin-visit-update (use origin-visit-status-add endpoints)
  - [x] D3253: loader (replace origin-visit-update calls)
  - [x] D3259: swh-web (tests: replace origin-visit-update)
  - [x] D3260: swh-indexer (tests: replace origin-visit-update)
- [x] Migrate swh code away from origin-visit-upsert (use origin-visit-add endpoint)
  - [x] D3262: storage: Align origin-visit-add with other endpoint
  - [x] D3264: loader: migrate to the new origin-visit-add contract
  - [x] D3267: swh-web (test): migrate to the new origin-visit-add contract
  - [x] D3265: indexer (test): migrate to the new origin-visit-add contract
  - [x] D3273: journal: adapt swh.journal.replayer to stop using origin-visit-upsert (use origin-visit-add instead)
- [x] storage/journal: produce to new kafka topic (origin_visit_add, origin_visit_status)
- [x] storage*: remove inconsistent & unused endpoints (origin-visit-upsert, origin-visit-update)
  - [x] D3276: Drop origin-visit-update
  - [x] D3281: Drop origin-visit-upsert
- [x] Deploy storage/loaders
- [ ] model: remove no longer used OriginVisit fields (snapshot, metadata, status)
- [x] D3296: storage: open origin-visit-status-get-latest
- [x] D3314: Deprecate storage.snapshot-get-latest, expose algos.snapshot.snapshot-get-latest instead
- [x] migrate clients to use origin-visit-status-get-latest/snapshot-get-latest
  - [x] D3301: indexer
  - [x] D3305: loader-core
  - [x] D3308: loader-git
  - [x] D3306: loader-mercurial
  - [x] D3307: loader-svn
  - [x] D3316: webapp
- [x] Deploy everything (storage & clients), make sure everything is ok. in-progress
- [x] D3359, D3350: Fix timeouts on loaders (large numbers of visits make snapshot-get-latest timeout)
- [x] Make obsolete fields optional to avoid breaking the world
  - [x] D3340: Adapt model to make those fields optional
  - [x] D3344: Adapt journal test data
  - [x] D3342: Adapt storage to remove those fields in backend
  - [x] D3360: replay: update origin-visit fixer to drop the now "extra" origin-visit fields
- [x] Adapt clients code to stop providing unnecessary fields (loader, web, ...)
  - [x] D3363: storage
  - [x] D3361: loaders
  - [x] D3362: indexer
  - [x] D3365: webapp
- [x] Check deployment is fine (docker, staging)
- [ ] Deploy everything (storage, loader) + checks
- [ ] D3337: adapt model to drop those fields
- [ ] All builds should still be green
- [ ] Deploy everything + checks
- [ ] kafka fill-in-the-hole
  - [x] D3299: backfiller: make it able to run origin-visit-status
  - [ ] backfill `origin_visit_status`
  - [ ] backfill `origin_visit` to remove dropped fields (can be done at the end)