
Investigate how to upgrade the schema of the Cassandra storage
Closed, Migrated

Description

With Cassandra, we cannot just "ALTER TABLE" to make arbitrary changes like in postgresql, so we need a solution for schema upgrades if we want to use it in production.

Event Timeline

vlorentz triaged this task as Normal priority. Sep 15 2020, 1:56 PM
vlorentz created this task.

What we can do, however:

  • remove non-PK columns
  • add a non-PK column, with all values set to null
  • add/remove tables

This, possibly combined with a short Python script to fill in the data, should be enough for simple migrations, including adding/removing index tables.
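For illustration, here is a minimal sketch of such a simple migration, using the DataStax Python driver; the contact point, keyspace, table and column names, and the derivation of the new values are all hypothetical:

```
# Hypothetical sketch: add a nullable non-PK column and backfill it with
# a short script. Uses the DataStax cassandra-driver.
from cassandra.cluster import Cluster


def derive_extra_metadata(message):
    # placeholder for whatever the real migration would compute
    return None if message is None else message[:80]


cluster = Cluster(["cassandra-seed"])  # hypothetical contact point
session = cluster.connect("swh")       # hypothetical keyspace

# Adding a non-PK column is cheap; existing rows just get null for it.
session.execute("ALTER TABLE revision ADD extra_metadata text")

# Short backfill pass over the table (the driver pages the full scan).
update = session.prepare("UPDATE revision SET extra_metadata = ? WHERE id = ?")
for row in session.execute("SELECT id, message FROM revision"):
    session.execute(update, (derive_extra_metadata(row.message), row.id))
```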

For the harder cases that involve changes to the PK, we could do something like this:

  • create a new table with a new name (e.g. revision_v[n+1]; like we do in swh-search, except Cassandra does not support aliases)
  • start an extra storage backend that reads from that table instead of the old one (e.g. revision_v[n]), and also reads from all the other tables as usual
  • have a multiplexing storage proxy (like we have for the objstorage) that queries this new backend (which reads from v[n+1]), and falls back to the old backend (which reads from v[n])
  • run a script that reads from v[n], adds to v[n+1], then optionally removes from v[n] (I'm leaning toward not removing, for better performance and to allow rollbacks)
  • when the script is done, drop v[n] and the old backend
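
A rough sketch of what the read path of such a multiplexing proxy could look like; the class and method names below are hypothetical and only loosely follow the swh-storage getter style:

```
# Hypothetical read-path sketch for the multiplexing proxy: query the
# backend reading from revision_v[n+1] first, and fall back to the backend
# reading from revision_v[n] for objects that aren't migrated yet.
class MultiplexedStorage:
    def __init__(self, new_storage, old_storage):
        self.new = new_storage  # reads from revision_v[n+1]
        self.old = old_storage  # reads from revision_v[n]

    def revision_get(self, revision_ids):
        results = dict(zip(revision_ids, self.new.revision_get(revision_ids)))
        # fall back to the old backend for ids the new one doesn't have yet
        missing = [i for i, rev in results.items() if rev is None]
        if missing:
            for i, rev in zip(missing, self.old.revision_get(missing)):
                if rev is not None:
                    results[i] = rev
        return [results[i] for i in revision_ids]
```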

The issue with this is the interaction between multiple versions of the code: we need to either keep an old version of swh-storage deployed, or have code in swh-storage that can deal with both versions, for as long as the data from v[n] isn't completely migrated.

This can accumulate if we want to push a new update/migration while the previous one isn't completely migrated.

Thoughts?

> For the harder cases that involve changes to the PK, we could do something like this:
>
> • create a new table with a new name (e.g. revision_v[n+1]; like we do in swh-search, except Cassandra does not support aliases)
> • start an extra storage backend that reads from that table instead of the old one (e.g. revision_v[n]), and also reads from all the other tables as usual
> • have a multiplexing storage proxy (like we have for the objstorage) that queries this new backend (which reads from v[n+1]), and falls back to the old backend (which reads from v[n])

And, I guess, somehow pushes writes to both backends?

> • run a script that reads from v[n], adds to v[n+1], then optionally removes from v[n] (I'm leaning toward not removing, for better performance and to allow rollbacks)

(because if you want to support rollbacks, you need new writes to happen in both tables)

> • when the script is done, drop v[n] and the old backend

> The issue with this is the interaction between multiple versions of the code: we need to either keep an old version of swh-storage deployed, or have code in swh-storage that can deal with both versions, for as long as the data from v[n] isn't completely migrated.

I think we will need the multiplexed storage proxy to select which writes go to which backend, but in general the approach sounds okay.
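
For illustration, a rough sketch of that write routing, here simply duplicating writes to both backends so the old table stays complete and a rollback to v[n] remains possible; the class and method names are hypothetical:

```
# Hypothetical write-path sketch: duplicate writes so v[n] stays complete
# (keeping a rollback possible) while the migration script fills v[n+1].
class DualWriteStorage:
    def __init__(self, new_storage, old_storage):
        self.new = new_storage  # writes to revision_v[n+1]
        self.old = old_storage  # writes to revision_v[n]

    def revision_add(self, revisions):
        self.old.revision_add(revisions)        # keep v[n] authoritative
        return self.new.revision_add(revisions)
```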

> This can accumulate if we want to push a new update/migration while the previous one isn't completely migrated.

The biggest migration of this kind we'll have to handle is SWHIDs v2, and I think this will be a long-running, multi-step process (because we'll have to compute new ids for all objects, in layers); I don't see us performing many such large migrations in parallel any time soon.

> Thoughts?

I think having multiple concurrent versions of the same module running in parallel (rather than having layers and layers of compatibility code) is the most compelling option.

Unfortunately, that's really not something that our current deployment scenarios support well:

  • all packages are installed system-wide in a shared site-packages directory
  • all packages are installed with debian packages, which don't support the installation of multiple versions concurrently
  • our debian archive only supports one version per package anyway

But I think we really need to move towards being able to better control deployment versions, and sooner rather than later. This probably means generally migrating to virtualenv-based deployments, for which I see a ton of upsides (most notably, if we do it properly, we'll be getting better reproducibility of environments between tests, staging and production), and a few downsides (managing the up-to-dateness of dependencies in a set of virtualenvs is still an open problem, AFAICT).

When we have that, it should be easy enough to have a venv with the legacy code running as a fallback on which the multiplexer can direct requests if needed.

Doesn't this deserve a state-of-the-art kind of thing? Are there documentation materials on the subject? How do other (big) Cassandra users handle this?

Maybe also ask questions on one or two mailing lists?

> Doesn't this deserve a state-of-the-art kind of thing?

As far as I can tell, the state-of-the-art is "having a bunch of .cql files (or hand-written Java classes), and a script to remember which ones were already run" (like what we do in postgresql with dbversion); all the existing tools I found follow this pattern.
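
For reference, a minimal sketch of that pattern (numbered .cql files applied in order, with a table recording what was already run); the file layout, keyspace and table names are hypothetical:

```
# Hypothetical sketch of the usual pattern: numbered .cql migration files
# plus a table recording which ones were already applied (analogous to
# the dbversion table we use with postgresql).
import glob
import os

from cassandra.cluster import Cluster

session = Cluster(["cassandra-seed"]).connect("swh")  # hypothetical

session.execute(
    "CREATE TABLE IF NOT EXISTS schema_version (version int PRIMARY KEY)"
)
applied = {row.version for row in session.execute("SELECT version FROM schema_version")}

# e.g. migrations/0002_add_index_table.cql; naive split on ';' for the sketch
for path in sorted(glob.glob("migrations/*.cql")):
    version = int(os.path.basename(path).split("_")[0])
    if version in applied:
        continue
    with open(path) as f:
        for statement in f.read().split(";"):
            if statement.strip():
                session.execute(statement)
    session.execute("INSERT INTO schema_version (version) VALUES (%s)", (version,))
```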

> Are there documentation materials on the subject? How do other (big) Cassandra users handle this?

All the discussions and documentation I could find on the subject either blissfully ignore that CQL isn't omnipotent, or assume you get the primary key right on the first try, or expect you to write your own custom code.

And it's not exactly surprising, given the restrictions Cassandra has.

> Maybe also ask questions on one or two mailing lists?

Someone did, here: https://community.datastax.com/questions/4201/mange-schema-change.html

I just discussed the multiplexer-based migration process I described above with ardumont/olasd/vsellier.

It's actually quite tricky to preserve consistency, because it breaks the transactionality of writes:

  1. if the multiplexer writes only to the "new" backend, then no rollback is possible (as mentioned by olasd)
  2. if the multiplexer writes to both backends, but can read from both, then an interrupted write will still show a hit when reading, so (depending on write order) either of the backends will silently be missing some objects
  3. if the multiplexer writes to both backends serially, but reads only from one of them (the last it writes to), then we will be missing objects after either the switchover or the rollback

Therefore, we discussed another approach:

  1. deploy a new backend; loaders keep writing only to the old one
  2. set up a journal replayer that fills the new backend from kafka
  3. wait for the replayer to be close to the end of the log (may take from a few days to maybe a month, let's wait for benchmarks)
  4. when the replayer is close to the end, stop the loaders, and wait for the replayer to reach the end
  5. stop the replayer
  6. switch over to the new backend

This has the extra advantage of exercising the replayer on *all* objects before committing to the new schema (so we can catch bugs and inefficiencies).
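
A minimal sketch of what such a replayer amounts to, using a plain Kafka consumer; the topic name, consumer group, deserialization and the new backend's API are placeholders rather than what swh-journal actually provides:

```
# Hypothetical replayer sketch: consume a journal topic from the beginning
# and write every object into the new backend, while loaders keep writing
# only to the old one.
import msgpack
from confluent_kafka import Consumer


def replay_revisions(new_storage, bootstrap_servers="kafka:9092"):
    consumer = Consumer({
        "bootstrap.servers": bootstrap_servers,
        "group.id": "cassandra-schema-migration",  # hypothetical group
        "auto.offset.reset": "earliest",           # start from the beginning of the log
    })
    consumer.subscribe(["swh.journal.objects.revision"])  # hypothetical topic name
    try:
        while True:
            msg = consumer.poll(1.0)
            if msg is None or msg.error():
                continue
            revision = msgpack.unpackb(msg.value(), raw=False)
            new_storage.revision_add([revision])  # write to the new schema only
    finally:
        consumer.close()
```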