Make origin visits immutable
Closed, Migrated

Description

Current problem

Currently, origin visits are the only mutable object type in our data model. A visit is first created with the "ongoing" status, then optionally updated several times with a partial snapshot, and finally has its status set to either "full" or "partial", with an optional snapshot.

This causes an issue with messages pushed to Kafka, which does not guarantee that the last update of a message is the one preserved when compacting, so the journal may end up with an outdated version of a visit.

Solution

We should split a visit object from its updates in the journal: one topic for the visits themselves (identified by (origin_url, visit_id)), and one for the successive updates (identified by (origin_url, visit_id, update_id), where update_id must be totally ordered). This way, all visit updates would be stored in the journal; the updates and their order would be preserved by the replayer, even though they may be replayed in an arbitrary order.
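The keying scheme above can be illustrated with a minimal sketch. The two dictionaries are hypothetical in-memory stand-ins for the two topics, and replay() and ordered_updates() are illustrative helpers, not part of swh-journal. The point is only that a totally ordered update_id lets a consumer recover the order of updates whatever order the messages arrive in.

```python
import random

# Hypothetical in-memory stand-ins for the two journal topics.
visits = {}         # key: (origin_url, visit_id)
visit_updates = {}  # key: (origin_url, visit_id, update_id)


def replay(messages):
    """Apply journal messages in whatever order they arrive."""
    for topic, key, value in messages:
        if topic == "visit":
            visits[key] = value
        else:
            visit_updates[key] = value


def ordered_updates(origin_url, visit_id):
    """Return the updates of one visit, sorted by update_id."""
    keys = sorted(k for k in visit_updates if k[:2] == (origin_url, visit_id))
    return [visit_updates[k] for k in keys]


url = "https://example.org/repo"  # made-up origin, for illustration only
messages = [
    ("visit", (url, 1), {"type": "git"}),
    ("visit_update", (url, 1, 1), {"status": "ongoing"}),
    ("visit_update", (url, 1, 2), {"status": "ongoing"}),
    ("visit_update", (url, 1, 3), {"status": "full"}),
]
random.shuffle(messages)  # simulate replay in an arbitrary order
replay(messages)
statuses = [update["status"] for update in ordered_updates(url, 1)]
```

Because every update has its own key, no update overwrites another, and sorting on update_id restores the original sequence.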

The current fields of origin visits would be split this way:

  • the new "visit" objects would get the type (git/tar/...), maybe (?) metadata, and maybe a new start_date field
  • the "visit_update" objects would get the other fields (date, status, metadata (?), snapshot).

There are two ways to make the replayer work with this:

  1. simply add a "version" field to origin visits, so that the replayer passes this version to origin_visit_upsert, and swh-storage discards the upsert if it is older than what is currently in the DB
  2. make visit updates a first class citizen of the data model and swh-storage, and always keep all updates
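As a rough sketch of the first option, assuming an integer "version" field and a dictionary standing in for the DB table: upsert_versioned_visit is a hypothetical helper written for this illustration, not the swh-storage origin_visit_upsert endpoint.

```python
db = {}  # hypothetical stand-in for the origin visit table


def upsert_versioned_visit(visit):
    """Store the visit unless the DB already holds a newer version.

    Returns True if the visit was written, False if it was discarded.
    """
    key = (visit["origin"], visit["visit"])
    current = db.get(key)
    if current is not None and visit["version"] < current["version"]:
        return False  # incoming message is older than the stored one
    db[key] = visit
    return True


url = "https://example.org/repo"  # made-up origin, for illustration only
# The replayer happens to see the newest message first, then a stale one.
applied_new = upsert_versioned_visit(
    {"origin": url, "visit": 1, "version": 2, "status": "full"}
)
applied_old = upsert_versioned_visit(
    {"origin": url, "visit": 1, "version": 1, "status": "ongoing"}
)
```

With this approach only the newest version survives, whereas the second option keeps the full history of updates.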

New data model for the visits

An origin-visit represents a run of a loader. It currently carries the following information:

  • origin url
  • visit id: unique for a given origin
  • type (git, hg, ...)
  • start_date: when the loader was started, shortly before it created the origin visit
  • (snapshot): snapshot of all the branches already/currently loaded
  • (metadata): associated metadata (unused)

Note: parentheses around a field name indicate that the field is optional.

An origin-visit-status represents the state of a visit's loader at a point in time (sent from time to time by the loader, like a heartbeat). It has the following fields:

  • origin url
  • visit id
  • date: the timestamp of the snapshot of the loader task
  • status: Status of the visit (possible values: created, ongoing, full, partial)
  • (snapshot): snapshot of all the branches already/currently loaded
  • (metadata): associated metadata (not used, kept for future update)
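The two objects described above could be sketched as frozen dataclasses. This is an illustration built on the standard library only, not the actual swh-model classes; the field types (bytes for the snapshot identifier, dict for metadata) are assumptions.

```python
from dataclasses import dataclass, FrozenInstanceError
from datetime import datetime, timezone
from typing import Optional

VISIT_STATUSES = ("created", "ongoing", "full", "partial")


@dataclass(frozen=True)
class OriginVisit:
    origin: str                       # origin url
    visit: int                        # unique for a given origin
    type: str                         # git, hg, ...
    start_date: datetime              # when the loader was started
    snapshot: Optional[bytes] = None  # optional (type is an assumption)
    metadata: Optional[dict] = None   # optional, unused


@dataclass(frozen=True)
class OriginVisitStatus:
    origin: str
    visit: int
    date: datetime                    # timestamp of this status
    status: str                       # one of VISIT_STATUSES
    snapshot: Optional[bytes] = None
    metadata: Optional[dict] = None

    def __post_init__(self):
        # Reject statuses outside the documented set.
        if self.status not in VISIT_STATUSES:
            raise ValueError(f"unknown visit status: {self.status!r}")


example_status = OriginVisitStatus(
    origin="https://example.org/repo",  # made-up origin
    visit=1,
    date=datetime(2020, 1, 1, tzinfo=timezone.utc),
    status="ongoing",
)
```

Freezing both classes means a change of state can only be expressed by creating a new OriginVisitStatus, never by modifying an existing object.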

The following operations are supported on origin visits:

  • creating a visit (to get a unique id from the storage)
  • getting a visit from its origin url + id
  • listing visits of an origin (with filters, order, etc.)
  • upserting a visit (to add a visit with a predetermined id, needed for the replayer)

(Note that there is no equivalent of the current "origin_visit_update()" endpoint, and no need for one, as origin visits are now immutable.)

and on origin visit updates:

  • adding a new status (using the origin url and visit id; there must be an override to allow the replayer to add an update for a visit that doesn't exist yet)
  • getting the last status of a visit (so one can know if it's completed and get the id of its snapshot)
  • listing all statuses of a visit (?) (there is no need for it yet)
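A minimal in-memory sketch of these operations follows. The class, the method signatures, and the allow_missing_visit flag (standing for the replayer override mentioned above) are assumptions made for illustration; they do not reproduce the swh-storage API.

```python
from datetime import datetime, timezone


class InMemoryVisitStorage:
    """Toy storage: immutable visits plus an append-only list of statuses."""

    def __init__(self):
        self._visits = {}    # (origin_url, visit_id) -> visit dict
        self._statuses = {}  # (origin_url, visit_id) -> list of status dicts

    def origin_visit_add(self, origin_url, type, start_date):
        """Create a visit and return its id, unique for the origin."""
        existing = [vid for (url, vid) in self._visits if url == origin_url]
        visit_id = max(existing, default=0) + 1
        self._visits[(origin_url, visit_id)] = {
            "origin": origin_url, "visit": visit_id,
            "type": type, "start_date": start_date,
        }
        return visit_id

    def origin_visit_get(self, origin_url, visit_id):
        return self._visits.get((origin_url, visit_id))

    def origin_visit_status_add(self, origin_url, visit_id, date, status,
                                snapshot=None, allow_missing_visit=False):
        """Append a status; never modifies a previous one."""
        key = (origin_url, visit_id)
        if key not in self._visits and not allow_missing_visit:
            raise KeyError(f"no such visit: {key}")
        self._statuses.setdefault(key, []).append({
            "origin": origin_url, "visit": visit_id,
            "date": date, "status": status, "snapshot": snapshot,
        })

    def origin_visit_status_get_latest(self, origin_url, visit_id):
        """Return the status with the most recent date, or None."""
        statuses = self._statuses.get((origin_url, visit_id), [])
        return max(statuses, key=lambda s: s["date"], default=None)


def utc(day):
    return datetime(2020, 1, day, tzinfo=timezone.utc)


url = "https://example.org/repo"  # made-up origin, for illustration only
storage = InMemoryVisitStorage()
visit_id = storage.origin_visit_add(url, "git", utc(1))
# Statuses added out of order, as a replayer might do.
storage.origin_visit_status_add(url, visit_id, utc(3), "full")
storage.origin_visit_status_add(url, visit_id, utc(2), "ongoing")
latest = storage.origin_visit_status_get_latest(url, visit_id)
```

Selecting the latest status by date rather than by insertion order is what makes the result independent of the order in which statuses were written.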

Example

Loaders will use the storage API like this:

# start
id = storage.origin_visit_add(OriginVisit(origin=origin_url, type=..., start_date=...))
# load some stuff
storage.origin_visit_status_add(OriginVisitStatus(origin=origin_url, visit=id, date=..., status='ongoing', snapshot=..., metadata=None))
# load more stuff
storage.origin_visit_status_add(OriginVisitStatus(origin=origin_url, visit=id, date=..., status='ongoing', snapshot=..., metadata=None))
# finish loading everything
storage.origin_visit_status_add(OriginVisitStatus(origin=origin_url, visit=id, date=..., status='full', snapshot=..., metadata=None))

and readers (mostly just swh-web) would get this from the API:

{'origin': origin_url, 'visit': id, 'start_date': ...} = storage.origin_visit_get_latest(origin_url)
{'origin': origin_url, 'visit': id, 'date': ..., 'status': ..., 'snapshot': ..., 'metadata': ...} = storage.origin_visit_status_get_latest(origin_url, id)

Work plan

  • Amend origin_visit_update to make the status mandatory, like in the model: D2886 D2887 D2888 D2889 D2891
  • D2880: D3001: add OriginVisitStatus to swh-model
  • D2941: Update swh-model documentation
  • D2879: storage*: Split origin_visit table in backends (internal change): new origin_visit_status table
  • D3101: pg-storage: Write both origin-visit and origin-visit-status in parallel
  • Deploy storage (and migrate data, this now can occur while loaders are running)
  • migrate remaining data (data that did not get migrated during the first migration, while loaders continued their work)
  • D3180: pg-storage: Switch over the read queries to the new tables (=> revert D3101)
  • Deploy storage
  • D3212: storage*: add origin_visit_status_add endpoint
  • D3238: storage*: make origin-visit-add write origin-visit-status as well
  • D3244: storage*: make origin-visit-update write origin-visit-status as well
  • D3251: storage*: make origin-visit-upsert write origin-visit-status as well
  • Deploy storage
  • Migrate swh code away from origin-visit-update (use origin-visit-status-add endpoints)
    • D3253: loader (replace origin-visit-update calls)
    • D3259: swh-web (tests: replace origin-visit-update)
    • D3260: swh-indexer (tests: replace origin-visit-update)
  • Migrate swh code away from origin-visit-upsert (use origin-visit-add endpoint)
    • D3262: storage: Align origin-visit-add with other endpoint
    • D3264: loader: migrate to the new origin-visit-add contract
    • D3267: swh-web (test): migrate to the new origin-visit-add contract
    • D3265: indexer (test): migrate to the new origin-visit-add contract
    • D3273: journal: adapt swh.journal.replayer to stop using origin-visit-upsert (use origin-visit-add instead)
  • storage/journal: produce to new kafka topic (origin_visit_add, origin_visit_status)
  • storage*: remove inconsistent & unused endpoints (origin-visit-upsert, origin-visit-update)
    • D3276: Drop origin-visit-update
    • D3281: Drop origin-visit-upsert
  • Deploy storage/loaders
  • model: remove no longer used OriginVisit fields (snapshot, metadata, status)
    • D3296: storage: open origin-visit-status-get-latest
    • D3314: Deprecate storage.snapshot-get-latest, expose algos.snapshot.snapshot-get-latest instead
    • migrate clients to use origin-visit-status-get-latest/snapshot-get-latest
    • Deploy everything (storage & clients), make sure everything is ok. in-progress
    • D3359, D3350: Fix timeouts on loaders (large numbers of visits make snapshot-get-latest timeout)
    • Make obsolete fields optional to avoid breaking the world
      • D3340: Adapt model to make those fields optional
      • D3344: Adapt journal test data
      • D3342: Adapt storage to remove those fields in backend
      • D3360: replay: update origin-visit fixer to drop origin-visit's now "extra" fields
      • Adapt clients code to stop providing unnecessary fields (loader, web, ...)
    • Check deployment is fine (docker, staging)
    • Deploy "everything" (storage, loader) + checks
    • D3380: Adapt some left-over unneeded conversion in storage with potential world-breakage
    • Deploy storage
    • D3337: adapt model to drop those fields
    • D3376: journal: Adapt some tests data
    • Tag swh-{model,journal}: All master builds should stay green
  • Deploy everything + checks
    • staging is fine
    • prod
  • D3416: Is the replayer fine in the end? Yes, only the origin-visit-add endpoint (pg) needed to stop allowing upsert change (internal change).
  • deploy storage (v0.9.2)
  • T2478: kafka fill-in-the-hole
    • D3299: backfiller: make it able to run origin-visit-status
    • backfill origin_visit_status
    • backfill origin_visit to remove dropped fields (can be done at the end)

Revisions and Commits

rDJNL Journal infrastructure
D3376
D3344
D3241
rDLDG Git loader
D3308
D3272
rDWAPPS Web applications
D3365
D3316
D3267
D3259
rDLDHG Mercurial loader
D3306
rDLDBASE Generic VCS/Package Loader
D3361
D3305
D3333
D3253
rDCIDX Metadata indexer
D3362
D3301
D3265
D3260
rDSTO Storage manager
D3416
D3380
D3350
D3363
D3360
D3342
D3314
D3313
D3296
D3281
D3276
D3273
D3278
D3279
D3275
D3262
D3251
D3244
D3238
D3212
D3180
D3101
D3080
D2939
D2938
D2937
rDMOD Data model
D3337
D3341
D3340
rDLDSVN Subversion (SVN) loader
D3307

Event Timeline

The main part, actually making the origin-visit immutable, is done.
It's been fully deployed now.

What remains is backfilling the topics.
As that depends on hardware right now, I opened T2478,
to be done when we can.

I'll close this one (which has been open for what looks like an eternity).

ardumont claimed this task.