
worker to efficiently (re)compute content blob checksums
Closed, Migrated

Description

To keep our list of content blob hashes current over the long run, we need a highly parallel way to (re)compute a given set of checksums on contents available in our object storage, and to update the content table accordingly.
Here is a rough specification of what this might look like:

  • it should be a worker that we can spawn with a set of content IDs as input, specified using the current primary key of the content table (note: in the future the key might end up being a tuple of column values rather than a single value)
  • the worker will be parametric in at least the following extra config values:
    • the set of checksums that should be computed. For variable-length checksums, a desired checksum length should also be provided. An example of this config parameter could then be: sha1, sha256, sha3:224, blake2:512 (the syntax is totally hypothetical at this point; see the sketch after this list)
    • whether checksums that already exist in the DB should be recomputed/updated or left untouched
  • for each content ID given as input, the worker will (as sketched below):
    • retrieve the set of checksums currently available from the content table, and determine which ones need to be (re)computed according to the given configuration
    • retrieve the content from the object storage (only once), and compute all checksums that need (re)computing
    • update the content table entry with the new and/or recomputed checksums
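A minimal Python sketch of such a worker, covering both the config parametrization and the per-ID steps; the storage.get_checksums / storage.update_checksums / objstorage.get helpers are hypothetical placeholders (the real APIs may differ), and the checksum-name parsing follows the hypothetical syntax above:

```python
import hashlib

def hasher(name: str):
    """Build a hashlib object for a (possibly variable-length) checksum name."""
    if name.startswith("blake2:"):
        # config lengths are in bits, blake2b digest_size is in bytes
        return hashlib.blake2b(digest_size=int(name.split(":")[1]) // 8)
    if name.startswith("sha3:"):
        return hashlib.new("sha3_" + name.split(":")[1])
    return hashlib.new(name)

def process(content_ids, checksums, recompute, storage, objstorage):
    """(Re)compute `checksums` for each content ID and update the content table.

    `storage` and `objstorage` are hypothetical handles on the content
    table and the object storage, respectively.
    """
    for content_id in content_ids:
        # 1. checksums currently stored for this content (hypothetical API)
        known = storage.get_checksums(content_id)
        needed = [c for c in checksums if recompute or known.get(c) is None]
        if not needed:
            continue
        # 2. fetch the blob only once and feed it to every hasher
        data = objstorage.get(content_id)
        hashers = {name: hasher(name) for name in needed}
        for h in hashers.values():
            h.update(data)
        # 3. write the new and/or recomputed checksums back (hypothetical API)
        storage.update_checksums(
            content_id, {name: h.hexdigest() for name, h in hashers.items()}
        )
```

Reading each blob only once and feeding it to all hashers keeps object storage traffic down to one fetch per content, which matters at the scale at which we want to parallelize this.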

Note: the above assumes that the content table has already been modified to have one column for each checksum we might want to compute; that is fine, because the set of columns will not change very often. We only need to pay attention to the fact that there is a bijective mapping between checksum names in the worker configuration and columns of the content table, as illustrated below.
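For instance, that mapping could be kept as a simple dictionary (the column names here are hypothetical):

```python
# Hypothetical checksum-name -> content-table-column mapping; each
# configured checksum must resolve to exactly one column and vice versa.
CHECKSUM_COLUMN = {
    "sha1": "sha1",
    "sha256": "sha256",
    "sha3:224": "sha3_224",
    "blake2:512": "blake2b_512",
}

# Sanity check: no two checksum names map to the same column.
assert len(set(CHECKSUM_COLUMN.values())) == len(CHECKSUM_COLUMN)
```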
