
regularly scrub all the data stores of swh
Closed, Migrated

Description

Make sure we have background jobs that regularly/constantly check data integrity in all the SWH data sources:

  • check hashes stored in the main postgresql storage (and replicas?)
  • check objects stored in kafka
  • check blob hashes for objects stored in all the objstorages (saam, azure, s3)

For example, while doing mirroring tests, I found several blob objects in S3 that appear to be corrupted (while the original copies in the main objstorage are fine).
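
For blob objects, the check boils down to re-reading the bytes, recomputing the content hashes, and comparing them with the hashes the object is stored under; presumably the real script would go through swh.model's hashing helpers, but here is a minimal sketch with plain hashlib, where `expected` is a hypothetical mapping from algorithm name to hex digest for the object being checked:

    import hashlib

    def compute_blob_hashes(data: bytes) -> dict:
        """Recompute the hashes SWH keeps for a content blob."""
        return {
            "sha1": hashlib.sha1(data).hexdigest(),
            "sha256": hashlib.sha256(data).hexdigest(),
            # sha1_git is the git-style blob hash: sha1 of b"blob <length>\0" + data
            "sha1_git": hashlib.sha1(
                b"blob %d\x00" % len(data) + data
            ).hexdigest(),
        }

    def blob_is_corrupted(data: bytes, expected: dict) -> bool:
        """True if any recomputed hash differs from the expected (hex) digest."""
        recomputed = compute_blob_hashes(data)
        return any(
            recomputed[algo] != digest
            for algo, digest in expected.items()
            if algo in recomputed
        )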

Event Timeline

douardda renamed this task from regularly scrub all the data sources of swh to regularly scrub all the data stores of swh. Jan 11 2022, 12:31 PM
douardda triaged this task as Normal priority.
douardda created this task.
douardda removed a project: Roadmap 2021.

I wrote a script to scrub postgres and kafka: https://forge.softwareheritage.org/source/snippets/browse/master/vlorentz/recheck_consistency.py

Checking all Kafka objects takes a couple of months. For postgresql, it takes ~15 day-threads for revisions and ~1350 day-threads for directories (releases/snapshots are negligible), and it is highly parallelizable (16 threads should not be an issue), so about 100 days in total.
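
The parallelization itself is nothing fancy; a rough sketch of fanning the work out over a thread pool, where `check_range` is a hypothetical function that re-checks all objects whose id falls in one range and returns the corrupted SWHIDs it found:

    from concurrent.futures import ThreadPoolExecutor

    def scrub_in_parallel(check_range, ranges, threads=16):
        """Run one check_range(start, end) call per range, 16 threads by default."""
        corrupted = []
        with ThreadPoolExecutor(max_workers=threads) as executor:
            for result in executor.map(lambda r: check_range(*r), ranges):
                corrupted.extend(result)
        return corrupted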

Remaining questions below.


First, where to store the results? I am thinking of a simple postgresql table with these columns:

  1. SWHID
  2. first date it was found corrupted
  3. last date it was found corrupted (or should the row be deleted as soon as it no longer is?)
  4. if recoverable, the msgpacked model object (including raw_manifest)

(Rationale for postgresql: it is available concurrently and over the network, and scale shouldn't be an issue, as there should never be more than a million entries.)
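
As an illustrative sketch of that table (table and column names are mine, not an agreed-on schema), created from the script with psycopg2:

    import psycopg2

    # Illustrative schema for the proposed table; names are not final.
    CREATE_CORRUPT_OBJECT_TABLE = """
    CREATE TABLE IF NOT EXISTS corrupt_object (
        id          text PRIMARY KEY,      -- SWHID
        first_seen  timestamptz NOT NULL,  -- first date it was found corrupted
        last_seen   timestamptz NOT NULL,  -- last date it was found corrupted
        object      bytea                  -- msgpacked model object (incl. raw_manifest), if recoverable
    )
    """

    def create_table(dsn: str) -> None:
        with psycopg2.connect(dsn) as db, db.cursor() as cur:
            cur.execute(CREATE_CORRUPT_OBJECT_TABLE)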


Next: currently the script also tries to find an origin and recover the object from there. Should that part be split into a different script?

I don't see much reason either way.


Finally, how to structure it? Currently it's a snippet, but I think I will move it to its own repository. We could also put a script to scrub objstorages there.

You'll need a column for which datastore has the corrupted object.

I think it's fine to remove the entries when we don't need them anymore (i.e. the object has been restored). Worst case, it'll be re-added at the next iteration of the script :-)

When you move the snippet to its own repository, I guess we'll want to structure the "checking" and the "recovery" algorithms in different submodules. You'll have different checks for kafka, postgres and cassandra, and they'll all want to use the same sort of logic for the recovery process.
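
As a very rough sketch of what that split could look like (module names and interfaces are purely hypothetical at this point):

    # Hypothetical layout for the future repository:
    #
    #   scrubber/
    #       check_storage.py     # postgresql / cassandra checks
    #       check_journal.py     # kafka checks
    #       check_objstorage.py  # blob hash checks
    #       recover.py           # shared recovery logic (other datastores, then origins)
    #       db.py                # corrupt_object / recovery tables

    from abc import ABC, abstractmethod
    from typing import Callable, Iterator

    class BaseChecker(ABC):
        """One subclass per datastore; all of them feed the same recovery code."""

        @abstractmethod
        def check(self) -> Iterator[str]:
            """Yield the SWHIDs of objects found corrupted in this datastore."""

    def run(checker: BaseChecker, recover: Callable[[str], None]) -> None:
        # `recover` stands for the shared recovery logic: try other datastores,
        # then fall back to re-loading from an upstream origin.
        for swhid in checker.check():
            recover(swhid)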

In T3841#80779, @olasd wrote:

You'll need a column for which datastore has the corrupted object.

Actually, two: one for the datastore type (objstorage vs storage), and one for the instance (URL/DSN).

In T3841#80779, @olasd wrote:

I think it's fine to remove the entries when we don't need them anymore (i.e. the object has been restored). Worst case, it'll be re-added at the next iteration of the script :-)

Actually, it would make sense to have a separate "recovery" table with the swhid, the recovered_manifest, and lightly structured information on how the manifest was recovered (when it was recovered, whether it was pulled from another datastore or from an upstream origin, and if so which one).
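
A possible shape for such a recovery table, again purely illustrative (column names are mine, not an agreed-on schema):

    # Purely illustrative; lives in the same postgresql database as corrupt_object.
    CREATE_RECOVERED_OBJECT_TABLE = """
    CREATE TABLE IF NOT EXISTS recovered_object (
        id                  text PRIMARY KEY,     -- SWHID
        recovered_manifest  bytea NOT NULL,       -- msgpacked model object (incl. raw_manifest)
        recovered_at        timestamptz NOT NULL,
        recovery_method     text NOT NULL,        -- e.g. 'other-datastore' or 'origin'
        recovered_from      text                  -- datastore instance or origin URL
    )
    """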

vlorentz added a subtask: Restricted Maniphest Task. Jul 13 2022, 9:40 AM
vlorentz closed subtask Restricted Maniphest Task as Resolved. Aug 5 2022, 3:07 PM
vlorentz reopened subtask Restricted Maniphest Task as Open. Aug 8 2022, 2:23 PM
vlorentz closed subtask Restricted Maniphest Task as Resolved. Aug 9 2022, 10:16 AM
gitlab-migration changed the status of subtask Restricted Maniphest Task from Resolved to Migrated. Jan 8 2023, 10:04 PM