
Store the value of token(partition_key) in content_by_* table, instead of three hashes.

Authored by vlorentz on Mar 10 2020, 1:51 PM.



That's a big win in terms of disk space, and it shouldn't hurt performance:
the only added cost is an extra query in content_add when a sha1 or sha1_git collision occurs.
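For readers unfamiliar with the idea, here is a minimal in-memory sketch of what "store the token instead of the hashes" could look like. It is not the code from this diff: `ContentStore`, its method names, and the `token()` stand-in are all hypothetical, and the real token is computed by Cassandra's partitioner rather than by application code. The sketch only assumes that each content_by_* index maps one hash value to the token of the matching row in the main table, and that lookups re-check the hash because different rows can share a token.

```python
import hashlib
from typing import Dict, List, Set, Tuple

# Hash columns assumed to form the partition key of the main content table.
HASHES = ("sha1", "sha1_git", "sha256", "blake2s256")


def token(partition_key: Tuple[bytes, ...]) -> int:
    """Hypothetical stand-in for Cassandra's token(): a signed 64-bit integer.

    The real value comes from the cluster's partitioner, not from this function.
    """
    digest = hashlib.blake2b(b"|".join(partition_key), digest_size=8).digest()
    return int.from_bytes(digest, "big", signed=True)


class ContentStore:
    """Toy model of a main table plus token-based secondary index tables."""

    def __init__(self) -> None:
        # Main table: token -> rows whose partition key maps to that token.
        self.content: Dict[int, List[dict]] = {}
        # Index tables (content_by_sha1, content_by_sha1_git):
        # hash value -> tokens only, instead of a copy of the other hashes.
        self.content_by: Dict[str, Dict[bytes, Set[int]]] = {
            "sha1": {},
            "sha1_git": {},
        }

    def content_add(self, row: dict) -> None:
        tok = token(tuple(row[h] for h in HASHES))
        rows = self.content.setdefault(tok, [])
        if row not in rows:
            rows.append(row)
        for algo, index in self.content_by.items():
            index.setdefault(row[algo], set()).add(tok)

    def content_find(self, algo: str, value: bytes) -> List[dict]:
        # Step 1: the index yields candidate tokens (8 bytes each).
        tokens = self.content_by[algo].get(value, set())
        # Step 2: fetch rows by token and re-check the hash, since
        # unrelated rows may share a token.
        return [
            row
            for tok in sorted(tokens)
            for row in self.content.get(tok, [])
            if row[algo] == value
        ]
```

In this model, two contents sharing a sha1 simply produce two tokens under the same index entry, which is the case where a lookup costs more than one query against the main table.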

Resolves T2304.

When this diff is accepted, I'll do a similar one for skipped_content.

Diff Detail

rDSTO Storage manager
Automatic diff as part of commit; lint not applicable.
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

vlorentz edited the summary of this revision.

Sounds fine.

I still need to look at the tests.

Don't forget to drop the dead (and short-lived) code mentioned ;)


Drop row_to_content_hashes if it's no longer used ;)

This revision is now accepted and ready to land. Mar 10 2020, 2:27 PM

How do you migrate the tables now?

I'm waiting for the current replay to finish to get the sha1 collision stats. Then I'll drop the tables, recreate them, reset the offsets on swh.journal.objects.content, and restart the replay.

As the Cassandra cluster is paused indefinitely, I'm landing this diff now.