
Store the value of token(partition_key) in content_by_* table, instead of three hashes.
Closed, Public

Authored by vlorentz on Mar 10 2020, 1:51 PM.

Details

Summary

This is a big win in disk space and shouldn't hurt performance: the only
added cost is an extra query in content_add when a sha1 or sha1_git collision occurs.

Resolves T2304.

When this diff is accepted, I'll do a similar one for skipped_content.
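
To illustrate the idea in the summary, here is a minimal in-memory sketch, not the code from this diff. It assumes the content table is partitioned by four hashes (sha1 and sha1_git are named above; sha256 and blake2s256 are assumed for illustration), and that each content_by_* index maps one hash to the token of the full partition key. The `token` function is a dependency-free stand-in for Cassandra's token(); the `ContentStore` class and `make_row` helper are hypothetical.

```python
import hashlib
from collections import defaultdict

# Assumed partition key columns of the main content table (illustrative).
HASH_ALGOS = ("sha1", "sha1_git", "sha256", "blake2s256")


def token(partition_key):
    """Stand-in for Cassandra's token(): map a partition key to a signed
    64-bit integer. blake2b is used here only to avoid extra dependencies;
    it is NOT the hash Cassandra uses."""
    digest = hashlib.blake2b(b"".join(partition_key), digest_size=8).digest()
    return int.from_bytes(digest, "big", signed=True)


def make_row(data):
    """Hypothetical helper: compute the four hashes of a blob."""
    return {
        "sha1": hashlib.sha1(data).digest(),
        "sha1_git": hashlib.sha1(b"blob %d\0" % len(data) + data).digest(),
        "sha256": hashlib.sha256(data).digest(),
        "blake2s256": hashlib.blake2s(data, digest_size=32).digest(),
    }


class ContentStore:
    def __init__(self):
        # Main table: token -> rows (distinct keys may share a token).
        self.content = defaultdict(list)
        # content_by_<algo>: hash value -> set of tokens (one 64-bit
        # integer per entry instead of three more hashes).
        self.content_by = {
            algo: defaultdict(set) for algo in ("sha1", "sha1_git")
        }

    def content_add(self, row):
        tok = token(tuple(row[algo] for algo in HASH_ALGOS))
        if row not in self.content[tok]:
            self.content[tok].append(row)
        for algo, index in self.content_by.items():
            index[row[algo]].add(tok)
        return tok

    def content_get_by(self, algo, value):
        rows = []
        for tok in self.content_by[algo].get(value, ()):
            # Tokens can collide, so re-check the hash on fetched rows.
            rows.extend(r for r in self.content[tok] if r[algo] == value)
        return rows


store = ContentStore()
row = make_row(b"hello")
store.content_add(row)
found = store.content_get_by("sha1", row["sha1"])
```

In this sketch a lookup costs one index read plus one read of the main table per token found, and the hash is re-checked on the fetched rows because two different partition keys may map to the same token.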

Diff Detail

Repository
rDSTO Storage manager
Branch
content-murmur3
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 11286
Build 17062: tox-on-jenkins (Jenkins)
Build 17061: arc lint + arc unit

Event Timeline

vlorentz edited the summary of this revision.

Sounds fine.

I still need to look at the tests.

Don't forget to drop the dead (and short-lived) code mentioned ;)

swh/storage/cassandra/storage.py:98

Drop row_to_content_hashes if it's no longer used ;)

This revision is now accepted and ready to land. Mar 10 2020, 2:27 PM

How do you migrate the tables now?

I'm waiting for the current replay to finish so I can get the sha1 collision stats. Then I'll drop the tables, recreate them, reset the offsets on swh.journal.objects.content, and restart the replay.

As the Cassandra cluster is paused indefinitely, I'm landing this diff now.