Store the value of token(partition_key) in content_by_* table, instead of three hashes.
ClosedPublic

Authored by vlorentz on Tue, Mar 10, 1:51 PM.

Details

Summary

This is a big win in terms of disk space and shouldn't hurt performance:
the only added cost is one extra query in content_add when a sha1 or sha1_git collision occurs.
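To illustrate the idea, here is a minimal in-memory sketch in plain Python, not the code from this diff. It assumes a main content table partitioned by the four hashes (sha1, sha1_git, sha256, blake2s256) and index tables that map one hash to the token of the main row. The `ContentStore` class, the `make_hashes` helper and the `token` function are hypothetical; in particular `token` is only a stand-in for Cassandra's `token()` and does not reproduce its values.

```python
import hashlib
from collections import defaultdict

# Assumed order of the partition key columns of the main content table.
HASH_NAMES = ("sha1", "sha1_git", "sha256", "blake2s256")


def make_hashes(data):
    """Hypothetical helper: compute the four hashes of a blob."""
    return {
        "sha1": hashlib.sha1(data).digest(),
        "sha1_git": hashlib.sha1(b"blob %d\0" % len(data) + data).digest(),
        "sha256": hashlib.sha256(data).digest(),
        "blake2s256": hashlib.blake2s(data, digest_size=32).digest(),
    }


def token(partition_key):
    """Stand-in for Cassandra's token(): a signed 64-bit integer derived
    from the partition key. The real partitioner uses a different hash."""
    digest = hashlib.blake2b(b"".join(partition_key), digest_size=8).digest()
    return int.from_bytes(digest, "big", signed=True)


class ContentStore:
    """In-memory model of the main table and the content_by_* index tables."""

    def __init__(self):
        # Main table: token -> rows sharing that token.
        self.content = defaultdict(list)
        # Index tables: hash value -> set of tokens (not the three other hashes).
        self.content_by = {
            "sha1": defaultdict(set),
            "sha1_git": defaultdict(set),
        }

    def add(self, hashes):
        tok = token(tuple(hashes[name] for name in HASH_NAMES))
        if hashes not in self.content[tok]:
            self.content[tok].append(dict(hashes))
        for algo, index in self.content_by.items():
            index[hashes[algo]].add(tok)
        return tok

    def find(self, algo, value):
        """Look up tokens in the index, then fetch the main rows by token.
        Rows are filtered because unrelated keys may share a token."""
        rows = []
        for tok in self.content_by[algo].get(value, ()):
            for row in self.content.get(tok, ()):
                if row[algo] == value:
                    rows.append(row)
        return rows
```

Under these assumptions, each index row carries one 8-byte token instead of three hashes (20 + 32 + 32 = 84 bytes when indexing by sha1), which is where the disk savings come from. The trade-off is that the other hashes are no longer available from the index alone, so recovering them (for example to report a collision) takes one more read of the main table, as `find` does here.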

Resolves T2304.

When this diff is accepted, I'll do a similar one for skipped_content.

Diff Detail

Repository
rDSTO Storage manager
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

vlorentz created this revision. Tue, Mar 10, 1:51 PM
vlorentz edited the summary of this revision. Tue, Mar 10, 1:55 PM
vlorentz edited the summary of this revision.

Sounds fine.

I still need to look at the tests.

Don't forget to drop the dead (and short-lived) code mentioned ;)

swh/storage/cassandra/storage.py, line 98

Drop row_to_content_hashes if it's no longer used ;)

ardumont accepted this revision. Tue, Mar 10, 2:27 PM
This revision is now accepted and ready to land. Tue, Mar 10, 2:27 PM

How do you migrate the tables now?

I'm waiting for the current replay to finish to get the sha1 collision stats. Then I'll drop the tables, recreate them, reset the offsets on swh.journal.objects.content, and restart the replay.

As the Cassandra cluster is paused indefinitely, I'm landing this diff now.

This revision was landed with ongoing or failed builds. Mon, Mar 23, 3:21 PM
This revision was automatically updated to reflect the committed changes.