Provide a mechanism to report (with persistence) objects that fail to get replayed (mirror)
Closed, Migrated

Description

When mirroring the archive with the storage replayer tooling, we want to go as fast as possible without losing any object. If a problem prevents an object from being inserted in the destination storage, we want to be made aware of it through a better communication channel than the process logs.

In order to make sure we do not insert invalid/corrupted objects, a mirroring session will have to use a ValidationProxyStorage step in the destination storage config.

Having a TenacityProxyStorage in the destination storage config pipeline also makes sense: one does not want to drop a whole batch of objects when only one of them is invalid or fails to be inserted, and it adds a bit of resiliency in case of transient failures.
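To illustrate the batch concern, here is a minimal Python sketch of a "retry one by one" fallback. It is not the actual TenacityProxyStorage implementation; `add_with_fallback` and the `add_batch` callable are hypothetical names introduced for this example only.

```python
from typing import Callable, Iterable, List, Tuple


def add_with_fallback(
    add_batch: Callable[[List[dict]], None],
    objects: Iterable[dict],
) -> Tuple[List[dict], List[Tuple[dict, Exception]]]:
    """Insert a batch; if it fails, retry each object individually.

    `add_batch` is a hypothetical stand-in for a storage "add" method.
    Returns the inserted objects and the (object, error) pairs that failed.
    """
    batch = list(objects)
    try:
        # Fast path: the whole batch goes through in one call.
        add_batch(batch)
        return batch, []
    except Exception:
        # Slow path below: isolate the offending object(s).
        pass

    inserted: List[dict] = []
    failed: List[Tuple[dict, Exception]] = []
    for obj in batch:
        try:
            add_batch([obj])
            inserted.append(obj)
        except Exception as exc:
            # Keep the failure so it can be reported, not just logged.
            failed.append((obj, exc))
    return inserted, failed
```

The `failed` list is what a reporting mechanism would need to persist.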

Currently, these 2 proxy storages do log insertion errors, but there is no mechanism to report, in a consistent and persistent database, the list of objects that failed to be inserted for some reason.

Ideally, the reported objects should be stored in a k/v-like database, using a unique key as identifier, typically the swhid, the hash, or something forged from BaseModel.unique_key() (a bit like what is done in the kafka writer, but the latter uses msgpack-encoded keys, which makes them not very practical for a k/v store like redis). The value should be the kafka message, so one can introspect the problem and possibly replay the insertion for these objects.
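A rough sketch of what such a reporter could look like, in Python. Everything here is an assumption made for illustration: `forge_key`, `ErrorReporter` and the key layout (`<object type>:<hex or text>`) are hypothetical, and a plain dict stands in for the k/v backend so the example runs without a server.

```python
from collections.abc import Mapping
from typing import MutableMapping, Optional


def forge_key(object_type: str, unique_key) -> str:
    """Build a printable key from an object type and its unique key.

    `unique_key` is assumed to be bytes, str, or a mapping of those
    (hypothetical layout, chosen to be readable in a k/v store).
    """

    def to_text(part) -> str:
        # Hashes are bytes: render them as hex rather than msgpack.
        return part.hex() if isinstance(part, bytes) else str(part)

    if isinstance(unique_key, Mapping):
        suffix = ",".join(
            f"{name}={to_text(value)}"
            for name, value in sorted(unique_key.items())
        )
    else:
        suffix = to_text(unique_key)
    return f"{object_type}:{suffix}"


class ErrorReporter:
    """Store the raw message of each object that failed to be inserted."""

    def __init__(self, backend: Optional[MutableMapping[str, bytes]] = None):
        # A dict stands in for the persistent k/v database.
        self.backend = {} if backend is None else backend

    def report(self, object_type: str, unique_key, message: bytes) -> str:
        key = forge_key(object_type, unique_key)
        self.backend[key] = message
        return key
```

With a Redis backend, the dict assignment could presumably be replaced by a call such as the redis-py client's `set(key, message)`. The stored message can then be read back to inspect the failure or to replay the insertion.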

This task is about adding such a reporting mechanism.

Redis is probably a good candidate to use as database backend for this reporting tool.

Event Timeline

douardda created this task.
douardda claimed this task.