Page MenuHomeSoftware Heritage

Deal with new checksum blake2s256 in storage
ClosedPublic

Authored by ardumont on Mar 25 2017, 12:54 AM.

Details

Reviewers
olasd
Group Reviewers
Reviewers
Commits
rDSTOC8dddc91a7860: swh.storage: Use upsert scheme on (skipped_)content_add function
rDSTOCabeb873111c4: Revert "swh.storage: Use upsert scheme on (skipped_)content_add function"
rDSTOC76ed4584948d: swh.storage: Use aggregate key to filter on missing skipped contents
rDSTOCada557cc8ecf: swh.storage: Extract key variable for insertion
rDSTOC0b5e9382abdf: sql/upgrades: add 103 -> 104
rDSTOC4ff43076886c: swh.storage: Add checksum blake2s256
rDSTOCc94ba898a288: d/control: Add python3-swh.journal dependency with version
R65:76ed4584948d: swh.storage: Use aggregate key to filter on missing skipped contents
R65:c94ba898a288: d/control: Add python3-swh.journal dependency with version
R65:8dddc91a7860: swh.storage: Use upsert scheme on (skipped_)content_add function
R65:0b5e9382abdf: sql/upgrades: add 103 -> 104
R65:abeb873111c4: Revert "swh.storage: Use upsert scheme on (skipped_)content_add function"
R65:4ff43076886c: swh.storage: Add checksum blake2s256
R65:ada557cc8ecf: swh.storage: Extract key variable for insertion
rDSTO76ed4584948d: swh.storage: Use aggregate key to filter on missing skipped contents
rDSTOabeb873111c4: Revert "swh.storage: Use upsert scheme on (skipped_)content_add function"
rDSTO0b5e9382abdf: sql/upgrades: add 103 -> 104
rDSTO8dddc91a7860: swh.storage: Use upsert scheme on (skipped_)content_add function
rDSTOada557cc8ecf: swh.storage: Extract key variable for insertion
rDSTO4ff43076886c: swh.storage: Add checksum blake2s256
rDSTOc94ba898a288: d/control: Add python3-swh.journal dependency with version
Summary
  • Tests are ok.
  • Run tested with loader-tar and ok.

What i tried and is ok:

  1. Db installed from scratch and load some tarballs

-> All checksums are computed accordingly.

  1. DB installed from prior version and migration (as in production):
    • load tarballs (old-gnu)
    • Migrate to 104 (current version to be reviewed)
    • Recompute existing new checksums blake2s256.
    • Import some more new contents

-> All checksums are computed accordingly.

Related T703

Diff Detail

Repository
rDSTO Storage manager
Branch
master
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 835
Build 1119: Software Heritage Python tests
Build 1118: arc lint + arc unit

Event Timeline

ardumont added inline comments.
debian/control
34

out of scope, i should have pushed this before creating the diff.

sql/swh-indexes.sql
11

This is the constraint which permits to use the upsert scheme in the swh_content_add.

78

Same for upsert scheme used in swh_skipped_content_add.

sql/swh-schema.sql
45

It's not null here for now since we need to run the computations on that column for the existing hash.

174

same here

ardumont retitled this revision from Deal with new checksum blake2s256 in schema to Deal with new checksum blake2s256 in storage.
ardumont added a project: Storage manager.

Fix sql formatting which was off

There is some confusion in our current schema about what the unicity expectations are. This diff adds some on top, so we should clear them up before moving any further.

  • We currently use sha1 as a primary key on our object storage, so this needs to stay unique for now, at least on the content table.
  • We currently use sha1_git as a foreign key in directory_entry_file; this needs to stay unique as well, on both content and skipped_content.

Both of those concerns should be alleviated once we make our foreign keys use constructed identifiers (T698).

There's no point in keeping sha256 unique, and there's no point in making blake2s256 unique either, so I think we can remove those constraints now.

sql/swh-func.sql
157

This is a "primary key" check. It must not change until all the columns are not null, else the equality will fail and we'll get duplicate entries: the old entries with only the old checksums will not match newer entries with the new checksum.

269

Unfortunately, for now, this check is not equivalent to the previous one.

287

And neither is this one.

sql/swh-indexes.sql
11

No it's not. on conflict (foo, bar, baz, quux) will do something if any constraint on columns foo, bar, baz or quux fails.

As each checksum column has a unique index, those will always hit before the complex constraint.

78

The same remark applies to this constraint as well.

swh/storage/storage.py
104

This whole logic seems buggy. Why would sha1_git be "more important" than other checksums here?

121–122

This should get factored out in db. (db.Db.content_keys?)

134–135

Same here (db.Db.skipped_content_keys?)

ardumont marked 2 inline comments as done.EditedMar 27 2017, 3:04 PM
In D200#4134, @olasd wrote:

There is some confusion in our current schema about what the unicity expectations are. This diff adds some on top, so we should clear them up before moving any further.

sure

  • We currently use sha1 as a primary key on our object storage, so this needs to stay unique for now, at least on the content table.
  • We currently use sha1_git as a foreign key in directory_entry_file; this needs to stay unique as well, on both content and skipped_content.

Both of those concerns should be alleviated once we make our foreign keys use constructed identifiers (T698).

ok

There's no point in keeping sha256 unique, and there's no point in making blake2s256 unique either, so I think we can remove those constraints now.

Ok, dropping the unique indexes and creating simple index then.

sql/swh-func.sql
157

Yes indeed!
This can be installed once the migration is complete (through the rehash routine is deployed and ran).
That is once that the blake2s256 (or blake2b512) is constrained to not be null.

(In a future 105 version for example).

269

Yes, this goes in 105 version as well.

287

Yes, this goes in 105 version as well.

sql/swh-indexes.sql
11

No it's not. on conflict (foo, bar, baz, quux) will do something if any constraint on columns foo, bar, baz or quux fails.

So, if i understood you right (and the doc), i shall update the swh_content_add and swh_skipped_content_add with the 'on conflict do update' (in the 105.sql script).

As each checksum column has a unique index,

well no longer all columns since we will drop unicity constraint on sha256 and blake2s256 (as per mentioned above).

As each checksum column has a unique index, those will always hit before the complex constraint.

Thus, again, rewriting the queries to add to on conflict do nothing seems appropriate.

swh/storage/storage.py
121–122

db.Db.content_get_metadata_keys already holds the same list.

134–135

right.

  • swh.storage: Extract key variable for insertion
  • swh.storage: Use upsert scheme on (skipped_)content_add function
  • Revert "swh.storage: Use upsert scheme on (skipped_)content_add function"
  • db version 104: Update schema properly
  • swh.storage: Use agreggate key to filter on missing skipped contents
ardumont added inline comments.
sql/swh-indexes.sql
11

Thus, again, rewriting the queries to add to on conflict do nothing seems appropriate.

If this was possible though, which i doubt...

Reading again the documentation...
What I understand is that we can only put one 'on conflict' clause either on one constraint, or index, ...

Also if i remember right, having directly one 'on conflict do nothing' (without specifying anything), does not work.
I remember posgresql asking for a specific constraint/index/etc... to act upon.

Anyway, moving on, that point is, for now, out of scope.

swh/storage/storage.py
104

I don't know the initial reason. I assumed it was to be able to use sets.

I kept that set logic but use an aggregate key of all hashes (in a specific order).
This is then used later (on line 144) to check the existence of such skipped content.

ardumont marked an inline comment as done.

Improve test on skipped_content_add

Ok, dropping the unique indexes and creating simple index then.

Done. That is, I:

  • dropped the unique index on sha256 and on blake2s256
  • creating simple index on sha256 and blake2s256

Use sql/bin/db-upgrade to generate the 103-104 sql migration script

  • sql/upgrades: add 103 -> 104
  • swh.storage: Use aggregate key to filter on missing skipped contents

Looks good with one easily solved caveat inline.

The 104.sql upgrade script will have to be run by hand, and probably not in that order, but it looks reasonable nonetheless.

swh/storage/storage.py
91–102

This won't work if one of the values is None, which might happen (the git loader currently computes all checksums for contents that are too big, but nothing imposes it). That constraint should also be reflected in tests.

It should work if you make the "unique key" a tuple of values (return tuple(values) instead of b''.join(values)).

Looks good with one easily solved caveat inline.

Yes, thanks, i will adapt.

The 104.sql upgrade script will have to be run by hand, and probably not in that order, but it looks reasonable nonetheless.

Ok.

sql/swh-func.sql
286–287

bug coalesce(blake2s256, '')

sql/upgrades/104.sql
133

bug coalesce(blake2s256, '')

swh/storage/storage.py
91–102

This won't work if one of the values is None, which might happen (the git loader currently computes all checksums for contents that are too big, but nothing imposes it).

Oh, right!

That constraint should also be reflected in tests.

Yes, indeed, adding the case in test make it break!
Nice catch.

It should work if you make the "unique key" a tuple of values (return tuple(values) instead of b''.join(values)).

Right. I did not know we could use tuple in set, awesome thanks. That's way clearer.

And \o/

This also had the fortunate effect to show a reverted bug fix in my sql.
I already fixed this in the previous manually crafted 104.sql.
But then, it was missing in swh-func.sql...

(cf. swh_skipped_content_add function...)

Fixes according to latest review:

  • fix unique key computation (using tuple)
  • fix sql about missing default value (on new column)

Squashed and rebased commits:

  • sql/upgrades: add 103 -> 104
  • swh.storage: Use aggregate key to filter on missing skipped contents

Do not use yet the new column blake2s256 for filtering (content_add, skipped_content_add)

This makes the db upgrade work with older version of python3-swh.storage (82 which is the latest on pergamon, tested and ok)

  • sql/upgrades: add 103 -> 104
  • swh.storage: Use aggregate key to filter on missing skipped contents