Software Heritage

Deal with new checksum blake2s256 in storage
ClosedPublic

Authored by ardumont on Mar 25 2017, 12:54 AM.

Details

Reviewers
olasd
Group Reviewers
Reviewers
Commits
R65:ada557cc8ecf: swh.storage: Extract key variable for insertion
R65:4ff43076886c: swh.storage: Add checksum blake2s256
R65:0b5e9382abdf: sql/upgrades: add 103 -> 104
R65:c94ba898a288: d/control: Add python3-swh.journal dependency with version
R65:abeb873111c4: Revert "swh.storage: Use upsert scheme on (skipped_)content_add function"
R65:76ed4584948d: swh.storage: Use aggregate key to filter on missing skipped contents
R65:8dddc91a7860: swh.storage: Use upsert scheme on (skipped_)content_add function
Summary
  • Tests are OK.
  • Test run with loader-tar is OK.

What I tried, and works:

  1. DB installed from scratch, then loaded some tarballs

-> All checksums are computed as expected.

  2. DB installed from a prior version, then migrated (as in production):
    • load tarballs (old-gnu)
    • migrate to 104 (the version under review)
    • recompute the new blake2s256 checksum for existing contents
    • import some more new contents

-> All checksums are computed as expected.

Related T703

Diff Detail

Repository
rDSTO Storage manager
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

ardumont created this revision.Mar 25 2017, 12:54 AM
ardumont edited the summary of this revision. (Show Details)Mar 25 2017, 1:04 AM
ardumont added inline comments.
debian/control
34

Out of scope; I should have pushed this before creating the diff.

sql/swh-indexes.sql
12 ↗(On Diff #657)

This is the constraint that allows using the upsert scheme in swh_content_add.

78 ↗(On Diff #657)

Same for the upsert scheme used in swh_skipped_content_add.

sql/swh-schema.sql
46 ↗(On Diff #657)

The column is nullable here for now, since we still need to compute that checksum for the existing contents.
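The migration pattern implied here (add the column as nullable, backfill it, tighten the constraint later) can be sketched on a toy table. SQLite stands in for the real PostgreSQL schema, and the table layout is illustrative, not swh.storage's actual one:

```python
import hashlib
import sqlite3

# Toy stand-in for the content table (the real one does not store data).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE content (sha1 BLOB PRIMARY KEY, data BLOB)")
db.execute("INSERT INTO content VALUES (?, ?)",
           (hashlib.sha1(b"hello").digest(), b"hello"))

# Step 1: add the new column as nullable, so existing rows stay valid.
db.execute("ALTER TABLE content ADD COLUMN blake2s256 BLOB")

# Step 2: backfill the new checksum for existing rows
# (hashlib.blake2s produces a 256-bit digest by default).
rows = db.execute("SELECT sha1, data FROM content").fetchall()
for sha1, data in rows:
    db.execute("UPDATE content SET blake2s256 = ? WHERE sha1 = ?",
               (hashlib.blake2s(data).digest(), sha1))

# Step 3 (a later migration, e.g. version 105): once no row is missing the
# checksum, the NOT NULL constraint can be added safely.
missing = db.execute(
    "SELECT count(*) FROM content WHERE blake2s256 IS NULL").fetchone()[0]
print(missing)  # 0
```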

175 ↗(On Diff #657)

same here

ardumont edited the summary of this revision. (Show Details)Mar 25 2017, 1:06 AM
ardumont retitled this revision from Deal with new checksum blake2s256 in schema to Deal with new checksum blake2s256 in storage.
ardumont added a project: Storage manager.
ardumont updated this revision to Diff 659.Mar 27 2017, 11:50 AM

Fix sql formatting which was off

ardumont edited the summary of this revision. (Show Details)Mar 27 2017, 12:13 PM
ardumont edited the summary of this revision. (Show Details)Mar 27 2017, 12:28 PM
olasd edited edge metadata.Mar 27 2017, 1:40 PM

There is some confusion in our current schema about what the uniqueness expectations are. This diff adds more on top, so we should clear this up before moving any further.

  • We currently use sha1 as a primary key on our object storage, so this needs to stay unique for now, at least on the content table.
  • We currently use sha1_git as a foreign key in directory_entry_file; this needs to stay unique as well, on both content and skipped_content.

Both of those concerns should be alleviated once we make our foreign keys use constructed identifiers (T698).

There's no point in keeping sha256 unique, and there's no point in making blake2s256 unique either, so I think we can remove those constraints now.
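The change being proposed (drop the unique indexes on sha256 and blake2s256, keeping plain indexes for lookups) can be sketched on a toy table; SQLite stands in for PostgreSQL here and the schema is illustrative:

```python
import sqlite3

db = sqlite3.connect(":memory:")
# Toy stand-in: sha1 stays the primary key, sha256 starts out unique.
db.execute("CREATE TABLE content (sha1 BLOB PRIMARY KEY, sha256 BLOB)")
db.execute("CREATE UNIQUE INDEX content_sha256_idx ON content (sha256)")

# Relax the constraint: drop the unique index, recreate it as a plain
# index so lookups by sha256 stay fast.
db.execute("DROP INDEX content_sha256_idx")
db.execute("CREATE INDEX content_sha256_idx ON content (sha256)")

# Two different contents sharing a sha256 are now accepted.
db.execute("INSERT INTO content VALUES (?, ?)", (b"\x01", b"\xaa"))
db.execute("INSERT INTO content VALUES (?, ?)", (b"\x02", b"\xaa"))
count = db.execute("SELECT count(*) FROM content").fetchone()[0]
print(count)  # 2
```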

sql/swh-func.sql
157 ↗(On Diff #659)

This is a "primary key" check. It must not change until all the checksum columns are NOT NULL; otherwise the equality will fail and we'll get duplicate entries: old entries with only the old checksums will not match newer entries with the new checksum.

269 ↗(On Diff #659)

Unfortunately, for now, this check is not equivalent to the previous one.

287 ↗(On Diff #659)

And neither is this one.

sql/swh-indexes.sql
12 ↗(On Diff #657)

No, it's not. on conflict (foo, bar, baz, quux) will do something if any constraint on columns foo, bar, baz or quux fails.

As each checksum column has a unique index, those will always hit before the complex constraint.
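The point about per-column unique indexes firing first can be illustrated on a toy table (SQLite in place of PostgreSQL; schema illustrative): a second row reusing an existing sha1 is rejected by the sha1 unique index alone, before any composite constraint could come into play.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE content (sha1 BLOB, sha256 BLOB)")
db.execute("CREATE UNIQUE INDEX content_sha1_idx ON content (sha1)")

db.execute("INSERT INTO content VALUES (?, ?)", (b"\x01", b"\xaa"))
try:
    # Same sha1, different sha256: the per-column unique index rejects the
    # row even though the full (sha1, sha256) pair is new.
    db.execute("INSERT INTO content VALUES (?, ?)", (b"\x01", b"\xbb"))
    conflicted = False
except sqlite3.IntegrityError:
    conflicted = True
print(conflicted)  # True
```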

78 ↗(On Diff #657)

The same remark applies to this constraint as well.

swh/storage/storage.py
104 ↗(On Diff #659)

This whole logic seems buggy. Why would sha1_git be "more important" than other checksums here?

121–122 ↗(On Diff #659)

This should get factored out in db. (db.Db.content_keys?)

134–135 ↗(On Diff #659)

Same here (db.Db.skipped_content_keys?)

ardumont marked 2 inline comments as done.EditedMar 27 2017, 3:04 PM
In D200#4134, @olasd wrote:

> There is some confusion in our current schema about what the uniqueness expectations are. This diff adds more on top, so we should clear this up before moving any further.

Sure.

>   • We currently use sha1 as a primary key on our object storage, so this needs to stay unique for now, at least on the content table.
>   • We currently use sha1_git as a foreign key in directory_entry_file; this needs to stay unique as well, on both content and skipped_content.
>
> Both of those concerns should be alleviated once we make our foreign keys use constructed identifiers (T698).

OK.

> There's no point in keeping sha256 unique, and there's no point in making blake2s256 unique either, so I think we can remove those constraints now.

OK, dropping the unique indexes and creating simple indexes then.

sql/swh-func.sql
157 ↗(On Diff #659)

Yes indeed!
This can be installed once the migration is complete (that is, once the rehash routine is deployed and has run), i.e. once the blake2s256 (or blake2b512) column is constrained to NOT NULL.

(In a future version 105, for example.)

269 ↗(On Diff #659)

Yes, this goes in version 105 as well.

287 ↗(On Diff #659)

Yes, this goes in version 105 as well.

sql/swh-indexes.sql
12 ↗(On Diff #657)

> No, it's not. on conflict (foo, bar, baz, quux) will do something if any constraint on columns foo, bar, baz or quux fails.

So, if I understood you (and the documentation) right, I shall update swh_content_add and swh_skipped_content_add to use 'on conflict do update' (in the 105.sql script).

> As each checksum column has a unique index,

Well, no longer all columns, since we will drop the unicity constraints on sha256 and blake2s256 (as mentioned above).

> As each checksum column has a unique index, those will always hit before the complex constraint.

Thus, again, rewriting the queries to use 'on conflict do nothing' seems appropriate.
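That rewrite can be sketched on a toy table: an insert guarded by on conflict ... do nothing silently skips a sha1 that is already present. SQLite (3.24 or later for the upsert syntax) stands in for PostgreSQL, and the table is illustrative:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE content (sha1 BLOB PRIMARY KEY, sha256 BLOB)")

db.execute("INSERT INTO content VALUES (?, ?) ON CONFLICT (sha1) DO NOTHING",
           (b"\x01", b"\xaa"))
# Re-inserting the same sha1 is silently skipped instead of raising.
db.execute("INSERT INTO content VALUES (?, ?) ON CONFLICT (sha1) DO NOTHING",
           (b"\x01", b"\xbb"))
rows = db.execute("SELECT sha1, sha256 FROM content").fetchall()
print(rows)  # [(b'\x01', b'\xaa')]
```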

swh/storage/storage.py
121–122 ↗(On Diff #659)

db.Db.content_get_metadata_keys already holds the same list.

134–135 ↗(On Diff #659)

right.

ardumont marked an inline comment as done.Mar 27 2017, 3:14 PM
ardumont updated this revision to Diff 661.Mar 27 2017, 3:14 PM
  • swh.storage: Extract key variable for insertion
  • swh.storage: Use upsert scheme on (skipped_)content_add function
  • Revert "swh.storage: Use upsert scheme on (skipped_)content_add function"
  • db version 104: Update schema properly
Harbormaster completed remote builds in B837: Diff 661.
ardumont updated this revision to Diff 663.Mar 27 2017, 4:08 PM
  • swh.storage: Use aggregate key to filter on missing skipped contents
ardumont marked an inline comment as done.Mar 27 2017, 4:31 PM
ardumont added inline comments.
sql/swh-indexes.sql
12 ↗(On Diff #657)

> Thus, again, rewriting the queries to use 'on conflict do nothing' seems appropriate.

If this were possible, though, which I doubt...

Reading the documentation again, what I understand is that we can only put one 'on conflict' clause, acting on a single constraint, index, etc.

Also, if I remember right, a bare 'on conflict do nothing' (without specifying anything) does not work; I remember PostgreSQL asking for a specific constraint/index to act upon.

Anyway, moving on; that point is, for now, out of scope.

swh/storage/storage.py
104 ↗(On Diff #659)

I don't know the initial reason. I assumed it was to be able to use sets.

I kept that set logic, but used an aggregate key of all hashes (in a specific order).
This is then used later (on line 144) to check for the existence of such a skipped content.
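A minimal sketch of that aggregate-key filtering (the hash names follow swh.storage; the helper itself is hypothetical):

```python
# Identify each content by the tuple of all its checksums, in a fixed
# order, and use that tuple as a set element to filter out contents
# that are already known.
HASH_ORDER = ('sha1', 'sha1_git', 'sha256', 'blake2s256')

def unique_key(content):
    return tuple(content[h] for h in HASH_ORDER)

existing = [{'sha1': b'a', 'sha1_git': b'b',
             'sha256': b'c', 'blake2s256': b'd'}]
known = {unique_key(c) for c in existing}

same = {'sha1': b'a', 'sha1_git': b'b', 'sha256': b'c', 'blake2s256': b'd'}
other = {'sha1': b'x', 'sha1_git': b'y', 'sha256': b'z', 'blake2s256': b'w'}
missing = [c for c in (same, other) if unique_key(c) not in known]
print(len(missing))  # 1: only the content with unseen checksums is missing
```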

ardumont updated this revision to Diff 664.Mar 28 2017, 9:54 AM
ardumont marked an inline comment as done.

Improve test on skipped_content_add

> Ok, dropping the unique indexes and creating simple indexes then.

Done. That is, I:

  • dropped the unique indexes on sha256 and blake2s256
  • created simple indexes on sha256 and blake2s256
ardumont updated this revision to Diff 666.Mar 29 2017, 12:08 PM

Use sql/bin/db-upgrade to generate the 103-104 sql migration script

  • sql/upgrades: add 103 -> 104
  • swh.storage: Use aggregate key to filter on missing skipped contents
olasd added a comment.Mar 29 2017, 3:48 PM

Looks good with one easily solved caveat inline.

The 104.sql upgrade script will have to be run by hand, and probably not in that order, but it looks reasonable nonetheless.

swh/storage/storage.py
95–106 ↗(On Diff #666)

This won't work if one of the values is None, which might happen (the git loader currently computes all checksums for contents that are too big, but nothing imposes it). That constraint should also be reflected in tests.

It should work if you make the "unique key" a tuple of values (return tuple(values) instead of b''.join(values)).
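A small sketch of the failure mode and the suggested fix (field names follow swh.storage; the snippet itself is illustrative):

```python
# One checksum may legitimately be missing, e.g. blake2s256 not yet
# computed for an oversized content.
content = {'sha1': b'\x01', 'sha1_git': b'\x02',
           'sha256': b'\x03', 'blake2s256': None}
values = [content[h] for h in ('sha1', 'sha1_git', 'sha256', 'blake2s256')]

try:
    key = b''.join(values)  # raises: None is not a bytes-like object
except TypeError:
    key = None

tuple_key = tuple(values)  # fine: None is an ordinary tuple element
seen = {tuple_key}         # and tuples are hashable, so sets still work
print(tuple_key in seen)   # True
```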

> Looks good with one easily solved caveat inline.

Yes, thanks, I will adapt.

> The 104.sql upgrade script will have to be run by hand, and probably not in that order, but it looks reasonable nonetheless.

OK.

sql/swh-func.sql
288 ↗(On Diff #666)

Bug: missing coalesce(blake2s256, '')

sql/upgrades/104.sql
133 ↗(On Diff #666)

Bug: missing coalesce(blake2s256, '')
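Why the coalesce matters can be sketched on a toy pair of tables: in SQL, NULL = NULL is not true, so an equality comparison on a nullable checksum column silently drops rows where the value is missing on both sides. SQLite stands in for PostgreSQL here (in the real SQL the fallback is the empty bytea ''), and the tables are illustrative:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE tmp (sha1 BLOB, blake2s256 BLOB)")
db.execute("CREATE TABLE content (sha1 BLOB, blake2s256 BLOB)")
db.execute("INSERT INTO tmp VALUES (?, NULL)", (b"\x01",))
db.execute("INSERT INTO content VALUES (?, NULL)", (b"\x01",))

# Naive equality: NULL = NULL evaluates to NULL, so the row is dropped.
naive = db.execute(
    "SELECT count(*) FROM tmp JOIN content "
    "ON tmp.sha1 = content.sha1 AND tmp.blake2s256 = content.blake2s256"
).fetchone()[0]

# With coalesce, both NULLs become the empty blob and compare equal.
fixed = db.execute(
    "SELECT count(*) FROM tmp JOIN content ON tmp.sha1 = content.sha1 "
    "AND coalesce(tmp.blake2s256, x'') = coalesce(content.blake2s256, x'')"
).fetchone()[0]
print(naive, fixed)  # 0 1
```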

swh/storage/storage.py
95–106 ↗(On Diff #666)

> This won't work if one of the values is None, which might happen (the git loader currently computes all checksums for contents that are too big, but nothing imposes it).

Oh, right!

> That constraint should also be reflected in tests.

Yes, indeed; adding that case to the tests makes them break!
Nice catch.

> It should work if you make the "unique key" a tuple of values (return tuple(values) instead of b''.join(values)).

Right. I did not know we could use tuples in a set, awesome, thanks. That's way clearer.

And \o/

This also had the fortunate effect of revealing a reverted bug fix in my SQL.
I had already fixed it in the previously hand-crafted 104.sql, but it was then missing from swh-func.sql...

(cf. the swh_skipped_content_add function)

ardumont updated this revision to Diff 667.Mar 29 2017, 4:37 PM

Fixes according to latest review:

  • fix the unique key computation (using a tuple)
  • fix the SQL for the missing default value (on the new column)

Squashed and rebased commits:

  • sql/upgrades: add 103 -> 104
  • swh.storage: Use aggregate key to filter on missing skipped contents
ardumont marked 14 inline comments as done.Mar 29 2017, 4:39 PM
ardumont updated this revision to Diff 668.Mar 30 2017, 2:21 PM

Do not use the new blake2s256 column for filtering yet (content_add, skipped_content_add)

This makes the db upgrade work with older versions of python3-swh.storage (version 82, the latest on pergamon; tested and OK)

  • sql/upgrades: add 103 -> 104
  • swh.storage: Use aggregate key to filter on missing skipped contents