regularly scrub all the data stores of swh
Closed, Migrated

Description

Make sure we have background jobs that regularly/constantly check data integrity in all the SWH data sources:

check hashes stored in the main postgresql storage (and replicas?)
check objects stored in kafka
check blob hashes for objects stored in all the objstorages (saam, azure, s3)

For example, while running mirroring tests, I found several blob objects in S3 that appear to be corrupted (the original copies in the main objstorage are fine).
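As an illustration of the blob check listed above, here is a minimal sketch that recomputes checksums for a blob and compares them with the recorded ones. The function names and the choice of checksums are assumptions made for the example; the task does not say which hashes each objstorage is keyed on. The `sha1_git` value follows the standard git blob convention (SHA-1 over a `blob <length>\0` header followed by the content).

```python
import hashlib
from typing import Dict


def compute_hashes(data: bytes) -> Dict[str, str]:
    """Recompute a few checksums for a blob (hypothetical helper)."""
    # Git hashes a blob as sha1(b"blob <length>\0" + content).
    git_header = b"blob %d\0" % len(data)
    return {
        "sha1": hashlib.sha1(data).hexdigest(),
        "sha1_git": hashlib.sha1(git_header + data).hexdigest(),
        "sha256": hashlib.sha256(data).hexdigest(),
    }


def is_corrupted(data: bytes, expected: Dict[str, str]) -> bool:
    """Return True if any recorded checksum differs from the recomputed one."""
    actual = compute_hashes(data)
    return any(actual[name] != value for name, value in expected.items())
```

A scrubber along these lines would read each blob back from the objstorage, call `is_corrupted` with the checksums recorded for it, and report the mismatches.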

Revisions and Commits

rDENV Development environment
	D7347	rDENVf8d6c825dc64 Add swh-scrubber to .mrconfig
rCJSWH Jenkins jobs
	D7346	rCJSWH40111ac0fb20 Add swh-scrubber package to the CI
rDSCRUB Datastore Scrubber
	D7360	rDSCRUBbe9a35c0c397 Initialize DB schema and postgresql storage checker

Related Objects

Status	Assigned	Task
Migrated	gitlab-migration	T3135 Improve integrity of ingested content
Migrated	gitlab-migration	T3841 regularly scrub all the data stores of swh
Migrated	gitlab-migration	T4136 Add an "history completeness check"
		Restricted Maniphest Task
		Restricted Maniphest Task
Migrated	gitlab-migration	T4371 Deploy swh-scrubber on all storage instances

Event Timeline

douardda renamed this task from regularly scrub all the data sources of swh to regularly scrub all the data stores of swh. Jan 11 2022, 12:31 PM

douardda triaged this task as Normal priority.

douardda created this task.

douardda removed a project: Roadmap 2021.

I wrote a script to scrub postgres and kafka: https://forge.softwareheritage.org/source/snippets/browse/master/vlorentz/recheck_consistency.py

For all Kafka objects, it takes a couple of months. For postgresql, it takes ~15 day-threads for revisions and ~1350 day-threads for directories (releases/snapshots are negligible), and is highly parallelizable (16 threads should not be an issue; so about 100 days in total).
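The wall-clock estimate can be checked with a few lines of arithmetic. This sketch uses only the figures quoted above and assumes a perfect speedup across threads, which is optimistic; the variable names are illustrative.

```python
# Figures quoted in the comment above (approximate day-threads).
revision_day_threads = 15
directory_day_threads = 1350
threads = 16

total_day_threads = revision_day_threads + directory_day_threads
# Assumes the work splits evenly across threads with no overhead.
wall_clock_days = total_day_threads / threads

print(f"{total_day_threads} day-threads over {threads} threads "
      f"is about {wall_clock_days:.0f} days")
```

With ideal scaling this comes out at roughly 85 days, so "about 100 days" leaves some margin for contention and overhead.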

Remaining questions below.

First, where to store the results? I am thinking of a simple postgresql table with these columns:

SWHID
first date it was found corrupted
last date it was found corrupted (or delete the entry as soon as it no longer is?)
if recoverable, msgpacked model object (including raw_manifest)

(Rationale for postgresql: available concurrently and via the network; and scale shouldn't be an issue as there shouldn't be more than a million entries ever)
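To make the proposed table concrete, here is a rough sketch of the four columns and of the "first date / last date" bookkeeping. It uses an in-memory `sqlite3` database purely as a stand-in so the example is self-contained; the proposal itself targets postgresql. The table name, column names, and the `record_corruption` helper are all hypothetical, and the msgpack serialization is left out (the last column simply holds bytes).

```python
import sqlite3
from datetime import datetime, timezone
from typing import Optional

# Hypothetical schema mirroring the four columns listed above.
SCHEMA = """
CREATE TABLE corrupt_object (
    swhid TEXT PRIMARY KEY,
    first_seen TEXT NOT NULL,   -- first date it was found corrupted
    last_seen TEXT NOT NULL,    -- last date it was found corrupted
    recovered_object BLOB       -- serialized model object, if recoverable
)
"""


def record_corruption(conn: sqlite3.Connection, swhid: str,
                      recovered_object: Optional[bytes] = None) -> None:
    """Insert a new entry, or refresh last_seen if the SWHID is already known."""
    now = datetime.now(timezone.utc).isoformat()
    cur = conn.execute(
        "UPDATE corrupt_object SET last_seen = ? WHERE swhid = ?", (now, swhid)
    )
    if cur.rowcount == 0:
        conn.execute(
            "INSERT INTO corrupt_object VALUES (?, ?, ?, ?)",
            (swhid, now, now, recovered_object),
        )


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
placeholder_swhid = "swh:1:rev:" + "0" * 40  # placeholder, not a real object
record_corruption(conn, placeholder_swhid)
record_corruption(conn, placeholder_swhid)  # second scrub pass, same object
```

Seeing the same object in a later pass only moves `last_seen` forward, so the table stays at one row per corrupted object.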

Next, currently, the script tries to find an origin and recover the object from there. Should it be split into a different script?

I don't see much reason either way

Finally, how to structure it? Currently it's a snippet, but I think I will move it to its own repository. We could also put a script to scrub objstorages there.

You'll need a column for which datastore has the corrupted object.

I think it's fine to remove the entries when we don't need them anymore (i.e. the object has been restored). Worst case, it'll be re-added at the next iteration of the script :-)

When you move the snippet to its own repository, I guess we'll want to structure the "checking" and the "recovery" algorithms in different submodules. You'll have different checks for kafka, postgres and cassandra, and they'll all want to use the same sort of logic for the recovery process.
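One possible shape for that split, sketched here with hypothetical names: each datastore contributes its own checker, and every checker feeds the same recovery routine. Nothing in this sketch reflects how the eventual repository is organized.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Iterator, List, Optional


@dataclass(frozen=True)
class CorruptObject:
    swhid: str
    datastore: str  # which datastore reported the corruption


# A checker returns the SWHIDs that failed verification in one datastore.
Checker = Callable[[], Iterable[str]]
# A fetcher tries to obtain a good copy of an object, or returns None.
Fetcher = Callable[[str], Optional[bytes]]


def run_checkers(checkers: Dict[str, Checker]) -> Iterator[CorruptObject]:
    """Run every datastore-specific checker and tag results with their source."""
    for datastore, check in checkers.items():
        for swhid in check():
            yield CorruptObject(swhid=swhid, datastore=datastore)


def recover(obj: CorruptObject, fetchers: List[Fetcher]) -> Optional[bytes]:
    """Shared recovery logic: try each fetcher in order, keep the first success."""
    for fetch in fetchers:
        manifest = fetch(obj.swhid)
        if manifest is not None:
            return manifest
    return None
```

The checkers for kafka, postgres and cassandra would live in separate modules, while `recover` stays datastore-agnostic and can be reused by all of them.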

In T3841#80779, @olasd wrote:

You'll need a column for which datastore has the corrupted object.

Actually, two: one for the datastore type (objstorage vs storage), one for the instance (URL/DSN)

In T3841#80779, @olasd wrote:

I think it's fine to remove the entries when we don't need them anymore (i.e. the object has been restored). Worst case, it'll be re-added at the next iteration of the script :-)

Actually it would make sense to have a separate "recovery" table with the swhid, the recovered_manifest, and lightly structured information on how the manifest was recovered (when it was recovered, whether pulled from another datastore, or pulled from an upstream origin, and if so which one).
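The two refinements above (datastore type plus instance, and a separate recovery record) could be modeled roughly as follows. All class and field names are assumptions made for illustration; only the information they carry comes from the comments.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from enum import Enum
from typing import Optional


class DatastoreType(Enum):
    STORAGE = "storage"
    OBJSTORAGE = "objstorage"


class RecoverySource(Enum):
    OTHER_DATASTORE = "other_datastore"
    UPSTREAM_ORIGIN = "upstream_origin"


@dataclass(frozen=True)
class CorruptionRecord:
    swhid: str
    datastore_type: DatastoreType
    datastore_instance: str  # URL or DSN of the affected instance


@dataclass(frozen=True)
class RecoveryRecord:
    swhid: str
    recovered_manifest: bytes
    recovered_at: datetime
    source: RecoverySource
    origin_url: Optional[str] = None  # only meaningful for upstream origins

    def __post_init__(self) -> None:
        # If the manifest came from an upstream origin, record which one.
        if self.source is RecoverySource.UPSTREAM_ORIGIN and not self.origin_url:
            raise ValueError("origin_url is required for upstream recoveries")
```

Keeping recovery information in its own record means a corruption entry can be deleted once the object is restored without losing the trace of how it was recovered.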

vlorentz claimed this task. Mar 15 2022, 11:25 AM

vlorentz removed a project: meta-task.

vlorentz mentioned this in D7346: Add swh-scrubber package to the CI. Mar 15 2022, 11:40 AM

vlorentz added a revision: D7346: Add swh-scrubber package to the CI. Mar 15 2022, 11:40 AM

vlorentz added a revision: D7347: Add swh-scrubber to .mrconfig.

vlorentz added a commit: rDENVf8d6c825dc64: Add swh-scrubber to .mrconfig. Mar 15 2022, 11:54 AM

vlorentz added a commit: rCJSWH40111ac0fb20: Add swh-scrubber package to the CI.

vlorentz mentioned this in T3878: Fix existing corrupt objects. Mar 15 2022, 1:46 PM

vlorentz added a revision: D7360: Initialize DB schema and postgresql storage checker. Mar 16 2022, 4:08 PM

vlorentz added a commit: rDSCRUBbe9a35c0c397: Initialize DB schema and postgresql storage checker. Mar 22 2022, 11:20 AM

bchauvet added projects: Roadmap 2022, meta-task. Mar 23 2022, 4:38 PM

vlorentz moved this task from Backlog to Work in progress on the Roadmap 2022 board. Jun 2 2022, 9:57 AM