Page MenuHomeSoftware Heritage

Add raw_extrinsic_metadata.id column in postgresql.
ClosedPublic

Authored by vlorentz on Feb 5 2021, 2:39 PM.

Details

Summary

For now, this has absolutely no effect on the API users,
as rows are already deduplicated based on a subset of the
fields hashed by the id.

Diff Detail

Repository
rDSTO Storage manager
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

Harbormaster returned this revision to the author for changes because remote builds failed.Feb 5 2021, 2:39 PM
Harbormaster failed remote builds in B19042: Diff 17923!

Build has FAILED

Patch application report for D5029 (id=17923)

Could not rebase; Attempt merge onto efd8815b89...

Merge made by the 'recursive' strategy.
 sql/upgrades/168.sql                               |  24 +
 swh/storage/cassandra/common.py                    |   5 -
 swh/storage/cassandra/converters.py                |   2 +-
 swh/storage/cassandra/cql.py                       |   3 +-
 swh/storage/cassandra/storage.py                   |   3 +-
 swh/storage/cli.py                                 |  82 ++
 swh/storage/postgresql/db.py                       |  14 +-
 swh/storage/postgresql/storage.py                  |   1 +
 swh/storage/sql/30-schema.sql                      |   4 +-
 swh/storage/sql/60-indexes.sql                     |   4 +
 .../tests/data/sql-v0.18.0/10-superuser-init.sql   |  27 +
 swh/storage/tests/data/sql-v0.18.0/15-flavor.sql   |  22 +
 swh/storage/tests/data/sql-v0.18.0/20-enums.sql    |  23 +
 swh/storage/tests/data/sql-v0.18.0/30-schema.sql   | 499 +++++++++++
 swh/storage/tests/data/sql-v0.18.0/40-funcs.sql    | 960 +++++++++++++++++++++
 swh/storage/tests/data/sql-v0.18.0/60-indexes.sql  | 283 ++++++
 .../logical_replication/replication_source.sql     |  25 +
 swh/storage/tests/storage_tests.py                 |  59 +-
 swh/storage/tests/test_postgresql_migrated.py      |  63 ++
 swh/storage/tests/test_postgresql_migration.py     | 194 +++++
 swh/storage/utils.py                               |   5 +
 21 files changed, 2276 insertions(+), 26 deletions(-)
 create mode 100644 sql/upgrades/168.sql
 create mode 100644 swh/storage/tests/data/sql-v0.18.0/10-superuser-init.sql
 create mode 100644 swh/storage/tests/data/sql-v0.18.0/15-flavor.sql
 create mode 100644 swh/storage/tests/data/sql-v0.18.0/20-enums.sql
 create mode 100644 swh/storage/tests/data/sql-v0.18.0/30-schema.sql
 create mode 100644 swh/storage/tests/data/sql-v0.18.0/40-funcs.sql
 create mode 100644 swh/storage/tests/data/sql-v0.18.0/60-indexes.sql
 create mode 100644 swh/storage/tests/data/sql-v0.18.0/logical_replication/replication_source.sql
 create mode 100644 swh/storage/tests/test_postgresql_migrated.py
 create mode 100644 swh/storage/tests/test_postgresql_migration.py
Changes applied before test
commit 645dafed5454d6ce7e970dd8c3aeb9239a75cbc1
Merge: efd8815b 14bfef5c
Author: Jenkins user <jenkins@localhost>
Date:   Fri Feb 5 13:40:17 2021 +0000

    Merge branch 'diff-target' into HEAD

commit 14bfef5ca8780291f71af7eae2e3b9c45051a101
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Fri Feb 5 13:56:15 2021 +0100

    Add raw_extrinsic_metadata.id column in postgresql.
    
    For now, this has absolutely no effect on the API users,
    as rows are already deduplicated based on a subset of the
    fields hashed by the id.

commit 27440f9862b7df4e36b400a4fb74f1679d9ec6f4
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Feb 4 14:36:50 2021 +0100

    Add basic migration tests for postgresql
    
    This adds two test files:
    
    * `test_postgresql_migrated.py` applies an old schema definition, runs the
      migrations, then runs all the usual tests
    * `test_postgresql_migration.py` applies an old schema definition, inserts
      data, runs the migrations, and checks the data is still available
    
    `test_postgresql_migration.py` will probably break in some releases as
    it uses the old SQL with the new Python to insert, but it should be good
    enough, and we can disable it in some releases when needed.

commit e9441fef13c11c3eb500403275ba7337ed0c77e1
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Feb 4 14:29:38 2021 +0100

    postgresql: Fix dbversion() to return the max version instead of a random one.

commit 508399ce2abf21f813acc9c56422cbbccca0ae3d
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Feb 4 13:59:09 2021 +0100

    storage_tests: recompute ids when evolving RawExtrinsicMetadata objects.
    
    For now this does nothing as RawExtrinsicMetadata has no 'id' field,
    but the equality assertions will become errors when the next version
    of swh.model is released.

Link to build: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1137/
See console output for more information: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1137/console

vlorentz edited the test plan for this revision. (Show Details)

Build has FAILED

Patch application report for D5029 (id=17971)

Could not rebase; Attempt merge onto b0383833fe...

Updating b0383833..d3872c70
Fast-forward
 sql/upgrades/168.sql                               |  24 +
 swh/storage/cassandra/common.py                    |   5 -
 swh/storage/cassandra/converters.py                |   2 +-
 swh/storage/cassandra/cql.py                       |   3 +-
 swh/storage/cassandra/storage.py                   |   3 +-
 swh/storage/cli.py                                 |  82 ++
 swh/storage/postgresql/db.py                       |   5 +-
 swh/storage/postgresql/storage.py                  |   1 +
 swh/storage/sql/30-schema.sql                      |   4 +-
 swh/storage/sql/60-indexes.sql                     |   4 +
 .../tests/data/sql-v0.18.0/10-superuser-init.sql   |  27 +
 swh/storage/tests/data/sql-v0.18.0/15-flavor.sql   |  22 +
 swh/storage/tests/data/sql-v0.18.0/20-enums.sql    |  23 +
 swh/storage/tests/data/sql-v0.18.0/30-schema.sql   | 499 +++++++++++
 swh/storage/tests/data/sql-v0.18.0/40-funcs.sql    | 960 +++++++++++++++++++++
 swh/storage/tests/data/sql-v0.18.0/60-indexes.sql  | 283 ++++++
 .../logical_replication/replication_source.sql     |  25 +
 swh/storage/tests/storage_tests.py                 |  59 +-
 swh/storage/tests/test_postgresql_migrated.py      |  63 ++
 swh/storage/tests/test_postgresql_migration.py     | 194 +++++
 swh/storage/utils.py                               |   5 +
 21 files changed, 2268 insertions(+), 25 deletions(-)
 create mode 100644 sql/upgrades/168.sql
 create mode 100644 swh/storage/tests/data/sql-v0.18.0/10-superuser-init.sql
 create mode 100644 swh/storage/tests/data/sql-v0.18.0/15-flavor.sql
 create mode 100644 swh/storage/tests/data/sql-v0.18.0/20-enums.sql
 create mode 100644 swh/storage/tests/data/sql-v0.18.0/30-schema.sql
 create mode 100644 swh/storage/tests/data/sql-v0.18.0/40-funcs.sql
 create mode 100644 swh/storage/tests/data/sql-v0.18.0/60-indexes.sql
 create mode 100644 swh/storage/tests/data/sql-v0.18.0/logical_replication/replication_source.sql
 create mode 100644 swh/storage/tests/test_postgresql_migrated.py
 create mode 100644 swh/storage/tests/test_postgresql_migration.py
Changes applied before test
commit d3872c70a5b0a431ef92a3b3c5cc9ddfc8f6251e
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Fri Feb 5 13:56:15 2021 +0100

    Add raw_extrinsic_metadata.id column in postgresql.
    
    For now, this has absolutely no effect on the API users,
    as rows are already deduplicated based on a subset of the
    fields hashed by the id.

commit ebe4dab07feed7eb8ae03cd407980ba7bea78b37
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Feb 4 14:36:50 2021 +0100

    Add basic migration tests for postgresql
    
    This adds two test files:
    
    * `test_postgresql_migrated.py` applies an old schema definition, runs the
      migrations, then runs all the usual tests
    * `test_postgresql_migration.py` applies an old schema definition, inserts
      data, runs the migrations, and checks the data is still available
    
    `test_postgresql_migration.py` will probably break in some releases as
    it uses the old SQL with the new Python to insert, but it should be good
    enough, and we can disable it in some releases when needed.

commit 75939e4f9b02e130224a90929eb3ed2ba0730592
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Feb 4 13:59:09 2021 +0100

    storage_tests: recompute ids when evolving RawExtrinsicMetadata objects.
    
    For now this does nothing as RawExtrinsicMetadata has no 'id' field,
    but the equality assertions will become errors when the next version
    of swh.model is released.

Link to build: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1144/
See console output for more information: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1144/console

douardda added inline comments.
sql/upgrades/168.sql
18 ↗(On Diff #17971)

add?

Build has FAILED

Patch application report for D5029 (id=17991)

Could not rebase; Attempt merge onto b0383833fe...

Updating b0383833..386bc2fb
Fast-forward
 sql/upgrades/168.sql                               |  24 +
 swh/storage/cassandra/common.py                    |   5 -
 swh/storage/cassandra/converters.py                |   2 +-
 swh/storage/cassandra/cql.py                       |   3 +-
 swh/storage/cassandra/storage.py                   |   3 +-
 swh/storage/cli.py                                 |  82 ++
 swh/storage/postgresql/db.py                       |   5 +-
 swh/storage/postgresql/storage.py                  |   1 +
 swh/storage/sql/30-schema.sql                      |   4 +-
 swh/storage/sql/60-indexes.sql                     |   4 +
 .../tests/data/sql-v0.18.0/10-superuser-init.sql   |  27 +
 swh/storage/tests/data/sql-v0.18.0/15-flavor.sql   |  22 +
 swh/storage/tests/data/sql-v0.18.0/20-enums.sql    |  23 +
 swh/storage/tests/data/sql-v0.18.0/30-schema.sql   | 499 +++++++++++
 swh/storage/tests/data/sql-v0.18.0/40-funcs.sql    | 960 +++++++++++++++++++++
 swh/storage/tests/data/sql-v0.18.0/60-indexes.sql  | 283 ++++++
 .../logical_replication/replication_source.sql     |  25 +
 swh/storage/tests/storage_tests.py                 |  59 +-
 swh/storage/tests/test_postgresql_migrated.py      |  63 ++
 swh/storage/tests/test_postgresql_migration.py     | 194 +++++
 swh/storage/utils.py                               |   5 +
 21 files changed, 2268 insertions(+), 25 deletions(-)
 create mode 100644 sql/upgrades/168.sql
 create mode 100644 swh/storage/tests/data/sql-v0.18.0/10-superuser-init.sql
 create mode 100644 swh/storage/tests/data/sql-v0.18.0/15-flavor.sql
 create mode 100644 swh/storage/tests/data/sql-v0.18.0/20-enums.sql
 create mode 100644 swh/storage/tests/data/sql-v0.18.0/30-schema.sql
 create mode 100644 swh/storage/tests/data/sql-v0.18.0/40-funcs.sql
 create mode 100644 swh/storage/tests/data/sql-v0.18.0/60-indexes.sql
 create mode 100644 swh/storage/tests/data/sql-v0.18.0/logical_replication/replication_source.sql
 create mode 100644 swh/storage/tests/test_postgresql_migrated.py
 create mode 100644 swh/storage/tests/test_postgresql_migration.py
Changes applied before test
commit 386bc2fbe8de8bc46aba88247c42cd241927e8b2
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Fri Feb 5 13:56:15 2021 +0100

    Add raw_extrinsic_metadata.id column in postgresql.
    
    For now, this has absolutely no effect on the API users,
    as rows are already deduplicated based on a subset of the
    fields hashed by the id.

commit ebe4dab07feed7eb8ae03cd407980ba7bea78b37
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Feb 4 14:36:50 2021 +0100

    Add basic migration tests for postgresql
    
    This adds two test files:
    
    * `test_postgresql_migrated.py` applies an old schema definition, runs the
      migrations, then runs all the usual tests
    * `test_postgresql_migration.py` applies an old schema definition, inserts
      data, runs the migrations, and checks the data is still available
    
    `test_postgresql_migration.py` will probably break in some releases as
    it uses the old SQL with the new Python to insert, but it should be good
    enough, and we can disable it in some releases when needed.

commit 75939e4f9b02e130224a90929eb3ed2ba0730592
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Feb 4 13:59:09 2021 +0100

    storage_tests: recompute ids when evolving RawExtrinsicMetadata objects.
    
    For now this does nothing as RawExtrinsicMetadata has no 'id' field,
    but the equality assertions will become errors when the next version
    of swh.model is released.

Link to build: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1148/
See console output for more information: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1148/console

Build has FAILED

Patch application report for D5029 (id=18606)

Could not rebase; Attempt merge onto 88ff2c2fa0...

Updating 88ff2c2f..ed597441
Fast-forward
 sql/upgrades/169.sql                               |  26 +
 swh/storage/postgresql/db.py                       |   5 +-
 swh/storage/postgresql/storage.py                  |   1 +
 swh/storage/sql/30-schema.sql                      |   4 +-
 swh/storage/sql/60-indexes.sql                     |   4 +
 .../tests/data/sql-v0.18.0/10-superuser-init.sql   |  27 +
 swh/storage/tests/data/sql-v0.18.0/15-flavor.sql   |  22 +
 swh/storage/tests/data/sql-v0.18.0/20-enums.sql    |  23 +
 swh/storage/tests/data/sql-v0.18.0/30-schema.sql   | 499 +++++++++++
 swh/storage/tests/data/sql-v0.18.0/40-funcs.sql    | 960 +++++++++++++++++++++
 swh/storage/tests/data/sql-v0.18.0/60-indexes.sql  | 283 ++++++
 .../logical_replication/replication_source.sql     |  25 +
 swh/storage/tests/test_postgresql_migrated.py      |  63 ++
 swh/storage/tests/test_postgresql_migration.py     | 194 +++++
 14 files changed, 2134 insertions(+), 2 deletions(-)
 create mode 100644 sql/upgrades/169.sql
 create mode 100644 swh/storage/tests/data/sql-v0.18.0/10-superuser-init.sql
 create mode 100644 swh/storage/tests/data/sql-v0.18.0/15-flavor.sql
 create mode 100644 swh/storage/tests/data/sql-v0.18.0/20-enums.sql
 create mode 100644 swh/storage/tests/data/sql-v0.18.0/30-schema.sql
 create mode 100644 swh/storage/tests/data/sql-v0.18.0/40-funcs.sql
 create mode 100644 swh/storage/tests/data/sql-v0.18.0/60-indexes.sql
 create mode 100644 swh/storage/tests/data/sql-v0.18.0/logical_replication/replication_source.sql
 create mode 100644 swh/storage/tests/test_postgresql_migrated.py
 create mode 100644 swh/storage/tests/test_postgresql_migration.py
Changes applied before test
commit ed597441f79dda6b4eb424e9b596e4c164e966d2
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Fri Feb 5 13:56:15 2021 +0100

    Add raw_extrinsic_metadata.id column in postgresql.
    
    For now, this has absolutely no effect on the API users,
    as rows are already deduplicated based on a subset of the
    fields hashed by the id.

commit e5fc49846c1d9ca9961b0b7edf7e0a993401daf6
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Feb 4 14:36:50 2021 +0100

    Add basic migration tests for postgresql
    
    This adds two test files:
    
    * `test_postgresql_migrated.py` applies an old schema definition, runs the
      migrations, then runs all the usual tests
    * `test_postgresql_migration.py` applies an old schema definition, inserts
      data, runs the migrations, and checks the data is still available
    
    `test_postgresql_migration.py` will probably break in some releases as
    it uses the old SQL with the new Python to insert, but it should be good
    enough, and we can disable it in some releases when needed.

Link to build: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1178/
See console output for more information: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1178/console

rebase without migration tests

Build has FAILED

Patch application report for D5029 (id=18607)

Rebasing onto 88ff2c2fa0...

Current branch diff-target is up to date.
Changes applied before test
commit 7ed181fe084152ebb2ad5b1792a8d2d9e5a1c429
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Fri Feb 5 13:56:15 2021 +0100

    Add raw_extrinsic_metadata.id column in postgresql.
    
    For now, this has absolutely no effect on the API users,
    as rows are already deduplicated based on a subset of the
    fields hashed by the id.

Link to build: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1179/
See console output for more information: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1179/console

Build is green

Patch application report for D5029 (id=18626)

Rebasing onto 88ff2c2fa0...

Current branch diff-target is up to date.
Changes applied before test
commit 7ed181fe084152ebb2ad5b1792a8d2d9e5a1c429
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Fri Feb 5 13:56:15 2021 +0100

    Add raw_extrinsic_metadata.id column in postgresql.
    
    For now, this has absolutely no effect on the API users,
    as rows are already deduplicated based on a subset of the
    fields hashed by the id.

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1183/ for more details.

The comment in the migration doesn't sound accurate; I'm not convinced the production kafka contains all the entries from this table, so if we want to use kafka, we need to truncate and backfill the kafka topic with object ids first, then we can do the update of the table.

Now I wonder if it wouldn't be simpler to just do an ad-hoc migration of the postgresql data directly:

  • stop storage
  • rename the existing table
  • create new table with the final new schema
  • restart storage, which will write new objects to the new table
  • read entries from the old table, compute their id, write them to the new table
  • drop the old table
  • truncate and backfill kafka from the new table

I don't remember if loaders actually read data from this table (in which case we can just stop them for the duration of the migration, which should go fairly quickly, for 5m entries) or not.

swh/storage/postgresql/db.py
1229–1230

this should be switched to the id column

swh/storage/sql/60-indexes.sql
271

this should probably be deuniqueified

swh/storage/postgresql/db.py
1229–1230
swh/storage/sql/60-indexes.sql
271

I prefer backfilling to, then replaying from kafka because

In D5029#131887, @olasd wrote:
  • read entries from the old table, compute their id, write them to the new table

is actually not as easy as it sounds, and I don't want to write and test throwaway code just for this while we already have the replayer.

And the replayer already handles parallelization and restart-on-error thanks to kafka

Build was aborted

Patch application report for D5029 (id=18960)

Rebasing onto 8dd9f7b635...

Current branch diff-target is up to date.
Changes applied before test
commit 2d540b0580cc9699bd8a593db45942b1f14d8e21
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Fri Feb 5 13:56:15 2021 +0100

    Add raw_extrinsic_metadata.id column in postgresql.
    
    For now, this has absolutely no effect on the API users,
    as rows are already deduplicated based on a subset of the
    fields hashed by the id.

Link to build: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1229/
See console output for more information: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1229/console

Build is green

Patch application report for D5029 (id=18960)

Rebasing onto 8dd9f7b635...

Current branch diff-target is up to date.
Changes applied before test
commit 2d540b0580cc9699bd8a593db45942b1f14d8e21
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Fri Feb 5 13:56:15 2021 +0100

    Add raw_extrinsic_metadata.id column in postgresql.
    
    For now, this has absolutely no effect on the API users,
    as rows are already deduplicated based on a subset of the
    fields hashed by the id.

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1231/ for more details.

This revision is now accepted and ready to land.Mar 22 2021, 10:45 AM