
Add an index for raw_extrinsic_metadata.id in swh.storage.postgresql
Closed, Migrated

Description

  1. new column with the hash
  2. fill it (will need a migration with Python code)
  3. add unique index
  4. make the new index the primary key? (see the SQL sketch after this list)
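
A minimal SQL sketch of steps 1, 3 and 4, with illustrative column, index and constraint names (the actual swh.storage schema may use a dedicated hash domain rather than plain bytea); step 2 cannot be expressed in SQL, since the id is computed by swh.model from the object's contents:

    -- step 1: new column holding the swh.model-computed hash
    ALTER TABLE raw_extrinsic_metadata ADD COLUMN id bytea;

    -- step 2 (not SQL): a Python migration recomputes each object's id
    -- with swh.model and fills the new column

    -- step 3: unique index, built without blocking writes
    CREATE UNIQUE INDEX CONCURRENTLY raw_extrinsic_metadata_id_idx
        ON raw_extrinsic_metadata (id);

    -- step 4: promote the unique index to the primary key
    -- (PostgreSQL also marks the column NOT NULL as part of this)
    ALTER TABLE raw_extrinsic_metadata
        ADD CONSTRAINT raw_extrinsic_metadata_pkey
        PRIMARY KEY USING INDEX raw_extrinsic_metadata_id_idx;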

Event Timeline

vlorentz triaged this task as High priority. (Feb 2 2021, 1:37 PM)
vlorentz created this task.
vlorentz renamed this task from "Allow querying raw_extrinsic_metadata by hash in swh.storage.postgresql" to "Add an index for raw_extrinsic_metadata.id in swh.storage.postgresql". (Feb 2 2021, 2:20 PM)

After a lot of back and forth, and the release of swh.model v2.3.0 and swh.storage v0.26.0, this is now all done and deployed in staging and production.

We took the following approach for the migration:

  • make swh.model write id fields in the journal
  • deploy swh.storage with the new swh.model (so all writes happen with the new model)
  • run swh storage backfill on the raw_extrinsic_metadata topic to fill the journal with objects using the new model
  • (make sure the journal gets compacted to remove old versions of the objects, using a combination of topic.retention.ms settings and several runs of the backfill before it covered all the real-world data)
  • run swh storage replay on raw_extrinsic_metadata, using a fork of swh.storage that wrote objects to a new table (using the new schema)
  • once the replayer caught up, run some queries to spot-check that all the data got properly migrated
  • once validated, stop the workers and the replayer; deploy the new version of swh.storage with the new schema, move the new table in place of the old one (and take care of logical replication, as sketched below); then restart the workers
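
A rough SQL sketch of that final cut-over, assuming the fork wrote into a table named raw_extrinsic_metadata_new (the table and publication names here are illustrative, not necessarily the ones actually used):

    BEGIN;
    -- keep the old table around until the migration is confirmed good
    ALTER TABLE raw_extrinsic_metadata RENAME TO raw_extrinsic_metadata_old;
    -- move the new-schema table, filled by the replayer, into place
    ALTER TABLE raw_extrinsic_metadata_new RENAME TO raw_extrinsic_metadata;
    COMMIT;

    -- logical replication does not follow the swap automatically: the table
    -- that was renamed into place typically has to be added to the publication,
    -- and subscribers refreshed
    ALTER PUBLICATION softwareheritage ADD TABLE raw_extrinsic_metadata;
    -- on each subscriber:
    -- ALTER SUBSCRIPTION softwareheritage REFRESH PUBLICATION;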