content integrity checker
Closed, Migrated · Edits Locked

Description

No matter how many backup copies we have (see T239), each object contained in each backup copy should be periodically checked for integrity, for protection against bit flips and other sources of corruption.

While bit flip protection can be offered by low-level mechanisms at the disk and/or file system level, we might still want to periodically check all objects (e.g., with swh.storage.ObjStorage.check) to protect against human- or application-level errors.
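For illustration, here is a minimal sketch of such a periodic sweep, assuming an objstorage-like client that is iterable over object ids and exposes a check(obj_id) method raising an exception on corrupted or missing content (modeled on the swh.objstorage API; the names are assumptions, not the actual implementation):

  import logging

  def sweep(objstorage):
      """Check every object currently in the storage once."""
      for obj_id in objstorage:          # iterate over all object ids
          try:
              objstorage.check(obj_id)   # recompute the hash, compare to the id
          except Exception as exc:       # corrupted or missing content
              logging.error("integrity failure for %s: %s", obj_id, exc)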

Event Timeline

olasd changed the visibility from "All Users" to "Public (No Login Required)". May 13 2016, 5:08 PM

I see two options for the checker: it can run on each backup server, or only on the master storage, which would manage check scheduling and ordering.

  • The first option means we won't increase the master's load, but it requires the slaves to run some extra software (celery, cron?)
  • The second has a bigger impact on the master's load, but would allow the slave storages to remain storage-oriented, with only the objstorage api.server module running. It also allows the master storage to get the result and, if needed, reschedule the archival of a corrupted content. However, it also means the check request is sent over HTTP (though the check itself runs entirely on the backup); see the sketch after this list.
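As a rough illustration of the second option, the master could drive checks through each backup's objstorage HTTP API along these lines (remote_storage and reschedule_archival are hypothetical stand-ins, not actual swh modules):

  def check_on_backup(remote_storage, obj_id, reschedule_archival):
      try:
          # The check itself runs on the backup; only the request
          # and the result travel over HTTP.
          remote_storage.check(obj_id)
      except Exception:
          # Corrupted or missing on that backup: ask the archiver
          # to re-create a copy from another backup.
          reschedule_archival(obj_id)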

Check order

To be sure all the content gets checked, one idea is to start with the content whose last check is oldest: every X <amount of time>, the checker checks the first Y contents sorted by oldest last check time (see the sketch below).
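A sketch of that stateful ordering, assuming a hypothetical content table with a last_check column in a PostgreSQL database (table and column names are illustrative):

  import logging
  import time

  BATCH_SIZE = 1000   # Y
  PERIOD = 3600       # X, expressed in seconds

  def run(db, objstorage):
      while True:
          cur = db.cursor()
          # Never-checked contents (NULL last_check) come first.
          cur.execute("""SELECT id FROM content
                         ORDER BY last_check ASC NULLS FIRST
                         LIMIT %s""", (BATCH_SIZE,))
          for (obj_id,) in cur.fetchall():
              try:
                  objstorage.check(obj_id)
              except Exception:
                  logging.error("corrupted content: %s", obj_id)
              cur.execute("UPDATE content SET last_check = now()"
                          " WHERE id = %s", (obj_id,))
          db.commit()
          time.sleep(PERIOD)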

A content is checked when archived, so there would be no need to catch up on checks after the archival completes.

In T304#6389, @qcampos wrote:

I see two options for the checker: it can run on each backup server, or only on the master storage, which would manage check scheduling and ordering.

  • The first option means we won't increase the master's load, but it requires the slaves to run some extra software (celery, cron?)

I think that in the beginning this option is preferable, because it makes the archive copies more independent and more resilient (e.g., a single failure in the checking routine will not necessarily impact other copies).
I also like the fact that, upon corruption detection, individual copies can take care of restoring from other copies without central coordination.

In terms of deployment, we should aim at an independent checker daemon that does not need external scheduling logic (à la celery). It can either be something that runs in the background forever and checks at its own pace, or something that is run externally by cron. Given the amount of data, I suspect it will just be permanently busy, so running independently, without even needing cron, is probably better and easier (see the sketch below).
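A minimal shape for such a daemon, with batch selection and corruption handling passed in as callables (all names are illustrative, not the deployed module):

  import time

  def checker_daemon(objstorage, pick_batch, handle_corruption, pause=0.0):
      # Runs forever at its own pace; no celery or cron involved.
      while True:
          for obj_id in pick_batch():
              try:
                  objstorage.check(obj_id)
              except Exception:
                  handle_corruption(obj_id)
          time.sleep(pause)   # with this much data, likely permanently busy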

Check order

To be sure all the content gets checked, one idea is to start with the content whose last check is oldest: every X <amount of time>, the checker checks the first Y contents sorted by oldest last check time.

I wasn't thinking of stateful checks that keep track of when objects have been checked (e.g., in a DB table), but rather of a stateless probabilistic approach that only guarantees that, in the long run, every object gets a chance of being checked. My rationale is that a bit flip might occur the moment after you've checked an object, so no matter what, what we're doing here only gives guarantees up to a given failure probability. Hence the simplest approach that could possibly work is probably also the best one.
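A sketch of that stateless approach: sample objects uniformly at random and keep no record of what was checked when (all_obj_ids, taken here as a list of ids, and handle_corruption are assumptions for illustration):

  import random

  def probabilistic_check(objstorage, all_obj_ids, handle_corruption,
                          sample_size=1000):
      # No persistent state: over enough rounds, every object gets
      # checked with probability approaching 1, which is all this
      # scheme promises.
      while True:
          for obj_id in random.sample(all_obj_ids, sample_size):
              try:
                  objstorage.check(obj_id)
              except Exception:
                  handle_corruption(obj_id)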

A content is checked when archived, so there would be no need to catch up on checks after the archival completes.

Unfortunately no, for the reason above (bit flips appear at random).

An update on this: there are 3 checker implementations (all of them puppetized independently; a rough shape is sketched after the list):

  • LogChecker, which simply logs corrupted contents
  • RepairChecker (inherits LogChecker) - It fixes any corrupted content it encounters by politely asking the backup storages for their copies, iterating over the backups until one sends the copy. If none is found, it logs it.
  • ArchiveNotifierChecker (inherits LogChecker) - It updates the archiver's db with that content's new 'corrupted' or 'missing' status. It is then up to the archiver to do its job.
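For reference, the class hierarchy roughly looks like this (method names, the add/get calls, and the archiver db interface are paraphrased from the description above, not the actual swh source):

  import logging

  class LogChecker:
      def handle_corruption(self, obj_id):
          logging.error("corrupted content: %s", obj_id)

  class RepairChecker(LogChecker):
      def __init__(self, objstorage, backups):
          self.objstorage = objstorage
          self.backups = backups

      def handle_corruption(self, obj_id):
          # Ask each backup in turn for its copy; restore from the
          # first one that answers, fall back to logging otherwise.
          for backup in self.backups:
              try:
                  content = backup.get(obj_id)
              except Exception:
                  continue
              self.objstorage.add(content, obj_id)  # overwrite the bad copy
              return
          super().handle_corruption(obj_id)

  class ArchiveNotifierChecker(LogChecker):
      def __init__(self, archiver_db):
          self.archiver_db = archiver_db

      def handle_corruption(self, obj_id, status='corrupted'):
          # Record the new 'corrupted' or 'missing' status; the
          # archiver then takes care of re-archiving the content.
          self.archiver_db.update_status(obj_id, status)  # hypothetical API
          super().handle_corruption(obj_id)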

I think this can be closed.

@qcampos Am I missing something?

That's accurate, @ardumont. Nothing to add. Thanks for the puppet stuff!

This closes T304; we just need to deploy the checker and it should be running.