Paths

Table of Contentst

Differential D8760

Add payloads to ExtIDs
Needs RevisionPublic
Actions

Authored by samplet on Oct 25 2022, 5:28 AM.

Details

Reviewers

vlorentz

Group Reviewers

Reviewers

Summary

Update the storage layer to support the model changes
in D8759

I'm not entirely sure about the SQL schema migrartion. Running
the SQL manually seems to work, but I couldn't get it to
work from swh db upgrade. It seems to be an issue with
my test environment rather than something with the upgrade SQL
itself, though.

Diff Detail

Repository

rDSTO Storage manager

Branch

disarchive-3

Lint

No Linters Available

Unit

No Unit Test Coverage

Build Status

Buildable 32537
Build 50962: Phabricator diff pipeline on jenkins	Jenkins console · Jenkins
Build 50961: arc lint + arc unit

Event Timeline

samplet created this revision.Oct 25 2022, 5:28 AM

Herald added a reviewer: Reviewers. · View Herald TranscriptOct 25 2022, 5:28 AM

Build has FAILED

Link to build: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1690/
See console output for more information: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1690/console

Harbormaster returned this revision to the author for changes because remote builds failed.Oct 25 2022, 5:28 AM

Harbormaster failed remote builds in B32537: Diff 31583!

samplet retitled this revision from Add payloads to ExtIDs Depends on D8759 to Add payloads to ExtIDs.Oct 25 2022, 5:30 AM

samplet added a parent revision: D8759: model: Add payload to ExtID class.

samplet requested review of this revision.Oct 25 2022, 5:34 AM

vlorentz edited the summary of this revision. (Show Details)Oct 25 2022, 8:41 AM

vlorentz mentioned this in D8759: model: Add payload to ExtID class.Oct 25 2022, 9:07 AM

Thanks, looks good too.

A few nitpicks below

swh/storage/cassandra/model.py
320–328	I just realized this doesn't match the actual schema. Change it to: PARTITION_KEY = ("target", "target_type") CLUSTERING_KEY = ( "extid_version", "extid", "extid_type", "payload_type", "payload", ) (this doesn't really matter for now, because it would only affect the behavior of queries with sorted results; which we don't use on this table yet. Plus, these attributes are only used by the in-memory storage anyway)
swh/storage/cassandra/schema.py
283	Looks like this is our first schema change since Cassandra is running with actual data! We'll need to finally figure how to migrate schema when we land this :D
swh/storage/sql/30-schema.sql
521
swh/storage/sql/60-indexes.sql
328–329	hmmmm Is there no way to use the same index for both?
swh/storage/sql/upgrades/187.sql
17
swh/storage/tests/storage_tests.py
1706

This revision now requires changes to proceed.Oct 27 2022, 10:03 AM

Switch to using Git SHA-1 for the payload.

Build has FAILED

Patch application report for D8760 (id=31679)

Rebasing onto 82ad28bb29...

Current branch diff-target is up to date.

Changes applied before test

commit 2ac685ae155d2e62f1487dbd04ba3cb45068d64f
Author: Timothy Sample <samplet@ngyro.com>
Date:   Tue Sep 27 15:30:32 2022 -0400

    Add payloads to ExtIDs

Link to build: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1691/
See console output for more information: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1691/console

Harbormaster failed remote builds in B32631: Diff 31679!Oct 28 2022, 1:24 AM

It looks like the changes fail because CI doesn't know to use the model updates in D8759. Is there something I can do to fix it or do I just wait to run CI after D8759 is landed?

In D8760#inline-62232, @vlorentz wrote:

I just realized this doesn't match the actual schema. Change it to: [...]

Should that be separate commit or should I just sneak it into the payload changes?

In D8760#inline-62235, @vlorentz wrote:

Is there no way to use the same index for both?

Starting in Postgres 15, you can mark a unique index as NULLS NOT DISTINCT. Before that, Postgres always lets you have multiple NULL values in a column included in a unique index. I want to ensure that you can't have duplicate ExtIDs with NULL payloads (preserving the old constraint). I also want to make sure that the payloads are unique. Using two partial indexes communicates this to Postgres. I agree it's a little cumbersome, but it should be harmless. The only other idea I came across is to coalesce the potentially NULL columns in the index, but you would have to decide on a sentinel value to replace NULL, which seems worse to me.

It looks like the changes fail because CI doesn't know to use the model updates in D8759.

yes, the ci build uses pulled dependencies out of the requirements* files (pulled from pypi.org).

Is there something I can do to fix it

Well, you could tweak the requirements-swh.txt to target a specific swh.model commit (and then update yet again that diff).
as in [1] (i'm targetting the line inside the requirements-swh.txt, prior to my old comment there).
But that feels like a lot of fumbling around (which you'll have to revert prior to landing this).

or...

[1] https://forge.softwareheritage.org/D8539#inline-60772

or do I just wait to run CI after D8759 is landed?

... you can just wait for the other diff to land and a new swh.model is released (which
will do as soon as this gets landed to unstuck this).

Cheers,

ardumont added a subscriber: vsellier.Oct 31 2022, 2:54 PM

ardumont added inline comments.

swh/storage/cassandra/schema.py
283	@vsellier heads up ^

olasd added a subscriber: olasd.Nov 2 2022, 2:39 PM

olasd added inline comments.

swh/storage/sql/60-indexes.sql
328–331	Is there any reason to distinguish these two indexes? PostgreSQL can index null columns just fine.

samplet added inline comments.Nov 2 2022, 10:06 PM

swh/storage/sql/60-indexes.sql
328–331	With one index, Postgres will allow duplicate rows (provided the payload is null). This is because it does not treat the null payloads as equal for the purposes of the unique constraint. I mentioned this in my comment here: D8760#228500. To be more concrete, here's an example: create table test (a integer not null, b integer); create unique index on test (a, b); insert into test (a) values (1), (1); select * from test; This will result in: a \| b ---+--- 1 \| 1 \| (2 rows) Intuitively, you would think it would violate the unique constraint, but it doesn't. Using one index and allowing null payloads amounts to saying "everything with a payload is unique." Using two indexes is like saying "(1) everything without a payload is unique, and (2) everything with a payload is unique."