Page MenuHomeSoftware Heritage

Add payloads to ExtIDs
Needs RevisionPublic

Authored by samplet on Oct 25 2022, 5:28 AM.

Details

Reviewers
vlorentz
Group Reviewers
Reviewers
Summary

Update the storage layer to support the model changes
in D8759

I'm not entirely sure about the SQL schema migrartion. Running
the SQL manually seems to work, but I couldn't get it to
work from swh db upgrade. It seems to be an issue with
my test environment rather than something with the upgrade SQL
itself, though.

Diff Detail

Repository
rDSTO Storage manager
Branch
disarchive-3
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 32631
Build 51119: Phabricator diff pipeline on jenkinsJenkins console · Jenkins
Build 51118: arc lint + arc unit

Event Timeline

Harbormaster returned this revision to the author for changes because remote builds failed.Oct 25 2022, 5:28 AM
Harbormaster failed remote builds in B32537: Diff 31583!
samplet retitled this revision from Add payloads to ExtIDs Depends on D8759 to Add payloads to ExtIDs.Oct 25 2022, 5:30 AM
vlorentz added a subscriber: vlorentz.

Thanks, looks good too.

A few nitpicks below

swh/storage/cassandra/model.py
320–329

I just realized this doesn't match the actual schema. Change it to:

PARTITION_KEY = ("target", "target_type")
CLUSTERING_KEY = (
    "extid_version",
    "extid",
    "extid_type",
    "payload_type",
    "payload",
)

(this doesn't really matter for now, because it would only affect the behavior of queries with sorted results; which we don't use on this table yet. Plus, these attributes are only used by the in-memory storage anyway)

swh/storage/cassandra/schema.py
282

Looks like this is our first schema change since Cassandra is running with actual data! We'll need to finally figure how to migrate schema when we land this :D

swh/storage/sql/30-schema.sql
520
swh/storage/sql/60-indexes.sql
328–331

hmmmm

Is there no way to use the same index for both?

swh/storage/sql/upgrades/187.sql
17
swh/storage/tests/storage_tests.py
1705
This revision now requires changes to proceed.Oct 27 2022, 10:03 AM

Switch to using Git SHA-1 for the payload.

Build has FAILED

Patch application report for D8760 (id=31679)

Rebasing onto 82ad28bb29...

Current branch diff-target is up to date.
Changes applied before test
commit 2ac685ae155d2e62f1487dbd04ba3cb45068d64f
Author: Timothy Sample <samplet@ngyro.com>
Date:   Tue Sep 27 15:30:32 2022 -0400

    Add payloads to ExtIDs

Link to build: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1691/
See console output for more information: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1691/console

It looks like the changes fail because CI doesn't know to use the model updates in D8759. Is there something I can do to fix it or do I just wait to run CI after D8759 is landed?

I just realized this doesn't match the actual schema. Change it to: [...]

Should that be separate commit or should I just sneak it into the payload changes?

Is there no way to use the same index for both?

Starting in Postgres 15, you can mark a unique index as NULLS NOT DISTINCT. Before that, Postgres always lets you have multiple NULL values in a column included in a unique index. I want to ensure that you can't have duplicate ExtIDs with NULL payloads (preserving the old constraint). I also want to make sure that the payloads are unique. Using two partial indexes communicates this to Postgres. I agree it's a little cumbersome, but it should be harmless. The only other idea I came across is to coalesce the potentially NULL columns in the index, but you would have to decide on a sentinel value to replace NULL, which seems worse to me.

It looks like the changes fail because CI doesn't know to use the model updates in D8759.

yes, the ci build uses pulled dependencies out of the requirements* files (pulled from pypi.org).

Is there something I can do to fix it

Well, you could tweak the requirements-swh.txt to target a specific swh.model commit (and then update yet again that diff).
as in [1] (i'm targetting the line inside the requirements-swh.txt, prior to my old comment there).
But that feels like a lot of fumbling around (which you'll have to revert prior to landing this).

or...

[1] https://forge.softwareheritage.org/D8539#inline-60772

or do I just wait to run CI after D8759 is landed?

... you can just wait for the other diff to land and a new swh.model is released (which
will do as soon as this gets landed to unstuck this).

Cheers,

ardumont added inline comments.
swh/storage/cassandra/schema.py
282

@vsellier heads up ^

olasd added inline comments.
swh/storage/sql/60-indexes.sql
328–329

Is there any reason to distinguish these two indexes? PostgreSQL can index null columns just fine.

swh/storage/sql/60-indexes.sql
328–329

With one index, Postgres will allow duplicate rows (provided the payload is null). This is because it does not treat the null payloads as equal for the purposes of the unique constraint. I mentioned this in my comment here: D8760#228500. To be more concrete, here's an example:

create table test (a integer not null, b integer);
create unique index on test (a, b);
insert into test (a) values (1), (1);
select * from test;

This will result in:

 a | b
---+---
 1 |
 1 |
(2 rows)

Intuitively, you would think it would violate the unique constraint, but it doesn't. Using one index and allowing null payloads amounts to saying "everything with a payload is unique." Using two indexes is like saying "(1) everything without a payload is unique, and (2) everything with a payload is unique."

swh/storage/sql/60-indexes.sql
328–329

Ah, thanks, I always forget this quirk of postgresql unique indexes.

vlorentz added inline comments.
swh/storage/cassandra/model.py
320–329

you forgot this

swh/storage/sql/30-schema.sql
520

and this

swh/storage/sql/upgrades/187.sql
17

and this

This revision now requires changes to proceed.Nov 10 2022, 1:05 PM