Cassandra storage: Reduce the size of the "secondary lookup tables" for contents
Open, Normal, Public

Description

Currently, all the per-hash lookup tables for contents store the full quadruplet of hashes.

Looking up a content by a given hash requires fetching one row from the lookup table, then fetching the matching row from the main content metadata table.

Instead of storing all hashes in the "index" tables, we could store only the partition token of the row in the main table, then re-check that the hash actually matches when retrieving the row.

This would save a substantial amount of storage space, and the performance should be fairly similar:

Before:

  • (client) Lookup hash quadruplet in lookup table
  • (client) Lookup full metadata by quadruplet
    • (cassandra driver) Compute partition token for returned hash quadruplet
    • (cassandra driver) Query the server for the data
    • (cassandra driver) Recheck that the quadruplet matches
    • (cassandra driver) Return full metadata
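
A minimal in-memory sketch of the current layout, in Python, may help make the two lookups concrete. It assumes the quadruplet is (sha1, sha1_git, sha256, blake2s256); `CurrentLayout`, `compute_quadruplet` and the dict-based "tables" are hypothetical stand-ins for illustration, not the actual storage code or schema.

```python
import hashlib

# Assumed composition of the hash quadruplet.
HASH_NAMES = ("sha1", "sha1_git", "sha256", "blake2s256")


def compute_quadruplet(data: bytes) -> tuple:
    """Return the four hashes of a blob, in HASH_NAMES order."""
    git_blob = b"blob %d\0" % len(data) + data  # git blob header + payload
    return (
        hashlib.sha1(data).digest(),
        hashlib.sha1(git_blob).digest(),
        hashlib.sha256(data).digest(),
        hashlib.blake2s(data).digest(),
    )


class CurrentLayout:
    """Hypothetical model: every index row repeats the full quadruplet."""

    def __init__(self):
        self.content = {}  # main table: quadruplet -> metadata
        # one lookup table per hash: hash value -> [quadruplet, ...]
        self.index = {name: {} for name in HASH_NAMES}

    def add(self, data: bytes, metadata: dict) -> None:
        quadruplet = compute_quadruplet(data)
        self.content[quadruplet] = metadata
        for name, value in zip(HASH_NAMES, quadruplet):
            self.index[name].setdefault(value, []).append(quadruplet)

    def find(self, name: str, value: bytes) -> list:
        # Step 1: the lookup table yields full quadruplets.
        quadruplets = self.index[name].get(value, [])
        # Step 2: each quadruplet is the exact key of a main-table row.
        return [self.content[q] for q in quadruplets]
```

In this model the second step is an exact-key read, so no client-side filtering is needed, at the cost of storing all four hashes in every lookup row.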

After:

  • (client) Lookup partition token in lookup table
  • (client) Lookup rows by partition token
    • (cassandra driver) Query the server for the data
  • (client) Filter for matching quadruplets
  • (client) Return metadata
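
A minimal in-memory sketch of the proposed layout, in Python, could look as follows. It assumes the quadruplet is (sha1, sha1_git, sha256, blake2s256). `TokenIndexedLayout`, `compute_quadruplet` and `stand_in_token` are hypothetical names; in particular `stand_in_token` only imitates the shape of a partition token (a signed 64-bit integer) and is not Cassandra's actual partitioner.

```python
import hashlib

# Assumed composition of the hash quadruplet.
HASH_NAMES = ("sha1", "sha1_git", "sha256", "blake2s256")


def compute_quadruplet(data: bytes) -> tuple:
    """Return the four hashes of a blob, in HASH_NAMES order."""
    git_blob = b"blob %d\0" % len(data) + data  # git blob header + payload
    return (
        hashlib.sha1(data).digest(),
        hashlib.sha1(git_blob).digest(),
        hashlib.sha256(data).digest(),
        hashlib.blake2s(data).digest(),
    )


def stand_in_token(quadruplet: tuple) -> int:
    """Illustrative signed 64-bit token; NOT the real partitioner."""
    digest = hashlib.blake2b(b"".join(quadruplet), digest_size=8).digest()
    return int.from_bytes(digest, "big", signed=True)


class TokenIndexedLayout:
    """Hypothetical model: index rows store only the main-table token."""

    def __init__(self, token_fn=stand_in_token):
        self.token_fn = token_fn
        self.content = {}  # main table: token -> [(quadruplet, metadata), ...]
        # one lookup table per hash: hash value -> {token, ...}
        self.index = {name: {} for name in HASH_NAMES}

    def add(self, data: bytes, metadata: dict) -> None:
        quadruplet = compute_quadruplet(data)
        tok = self.token_fn(quadruplet)
        self.content.setdefault(tok, []).append((quadruplet, metadata))
        for name, value in zip(HASH_NAMES, quadruplet):
            self.index[name].setdefault(value, set()).add(tok)

    def find(self, name: str, value: bytes) -> list:
        position = HASH_NAMES.index(name)
        results = []
        # Step 1: the lookup table yields tokens only.
        for tok in sorted(self.index[name].get(value, ())):
            # Step 2: fetch every row stored under that token.
            for quadruplet, metadata in self.content.get(tok, ()):
                # Step 3: client-side re-check, since unrelated rows
                # may share the same token.
                if quadruplet[position] == value:
                    results.append(metadata)
        return results
```

The client-side filter matters because distinct rows can share a token; passing `token_fn=lambda q: 0` forces every row onto the same token and `find` still returns only the matching content. As a rough payload estimate under the same assumptions, a lookup row currently carries 20 + 20 + 32 + 32 = 104 bytes of hashes, whereas a row holding its own key hash plus an 8-byte token would carry 28 to 40 bytes, ignoring Cassandra's per-row overhead and compression.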