
worker to efficiently (re)compute content blob checksums
Closed, Migrated

Description

To keep our list of content blob hashes current over the long run, we need a highly parallel way to (re)compute a given set of checksums on contents available in our object storage, and to update the content table accordingly.
Here is a rough specification of what this might look like:

  • it should be a worker that we can spawn with a set of content IDs as input, specified using the current primary key of the content table (note: in the future the key might end up being a tuple of column values rather than a single value)
  • the worker will be parametric in at least the following extra config values:
    • the set of checksums that should be computed. For variable-length checksums, a desired checksum length should also be provided. An example of this config parameter could then be: sha1, sha256, sha3:224, blake2:512 (the syntax is totally hypothetical at this point; see the sketch after this list)
    • whether checksums that already exist in the DB should be recomputed/updated or left untouched
  • for each content ID given as input, the worker will (as sketched below):
    • retrieve the set of checksums currently available from the content table, and determine which ones need to be (re)computed according to the given configuration
    • retrieve the content from the object storage (only once), and compute all checksums that need (re)computing
    • update the content table entry with the new and/or recomputed checksums
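A minimal Python sketch of such a worker, covering both the config parametrization and the per-ID steps; the storage.get_checksums / storage.update_checksums / objstorage.get helpers are hypothetical placeholders (the real APIs may differ), and the checksum-name parsing follows the hypothetical syntax above:

```python
import hashlib

def hasher(name: str):
    """Build a hashlib object for a (possibly variable-length) checksum name."""
    if name.startswith("blake2:"):
        # config lengths are in bits, blake2b digest_size is in bytes
        return hashlib.blake2b(digest_size=int(name.split(":")[1]) // 8)
    if name.startswith("sha3:"):
        return hashlib.new("sha3_" + name.split(":")[1])
    return hashlib.new(name)

def process(content_ids, checksums, recompute, storage, objstorage):
    """(Re)compute `checksums` for each content ID and update the content table.

    `storage` and `objstorage` are hypothetical handles on the content
    table and the object storage, respectively.
    """
    for content_id in content_ids:
        # 1. checksums currently stored for this content (hypothetical API)
        known = storage.get_checksums(content_id)
        needed = [c for c in checksums if recompute or known.get(c) is None]
        if not needed:
            continue
        # 2. fetch the blob only once and feed it to every hasher
        data = objstorage.get(content_id)
        hashers = {name: hasher(name) for name in needed}
        for h in hashers.values():
            h.update(data)
        # 3. write the new and/or recomputed checksums back (hypothetical API)
        storage.update_checksums(
            content_id, {name: h.hexdigest() for name, h in hashers.items()}
        )
```

Reading each blob only once and feeding it to all hashers keeps object storage traffic down to one fetch per content, which matters at the scale at which we want to parallelize this.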

Note: the above assumes that the content table has already been modified to have one column for each checksum we might want to compute; that is fine, because the set of columns will not change very often. We only need to pay attention to the fact that there is a bijective mapping between checksum names in the worker configuration and columns of the content table, as illustrated below.
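For instance, that mapping could be kept as a simple dictionary (the column names here are hypothetical):

```python
# Hypothetical checksum-name -> content-table-column mapping; each
# configured checksum must resolve to exactly one column and vice versa.
CHECKSUM_COLUMN = {
    "sha1": "sha1",
    "sha256": "sha256",
    "sha3:224": "sha3_224",
    "blake2:512": "blake2b_512",
}

# Sanity check: no two checksum names map to the same column.
assert len(set(CHECKSUM_COLUMN.values())) == len(CHECKSUM_COLUMN)
```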
