
Content replayer may try to copy objects before they are available in an objstorage
Open, Normal, Public

Description

There is going to be an issue with the content replayer in steady state: content objects (without the data) are written to Kafka before the data is written to the objstorage.
So if the content replayers are fast enough (and they probably will be), they will try to access the data in the objstorage before it is there.

(And if we wrote to Kafka after writing to the objstorage, there would be a risk of Kafka missing some content in case of failure, which is worse.)

A possible solution is to have the content replayer retry reading objects until they are available.

There is however the issue of missing objects (T1817), so it can't retry forever for all objects or it will get stuck. We see two possible solutions:

  • a retry timeout, but it means that some objects might be skipped when they shouldn't be (e.g. if an object takes a long time to become available in the objstorage); a rough sketch of this option is given after this list
  • "hardcoding" a list of missing objects in the configuration, but it could possibly grow large with time (hopefully it won't)

Event Timeline

vlorentz triaged this task as Normal priority. Sep 17 2019, 11:58 AM
vlorentz created this task.

My gut feeling on this task is that what we need (what we really, really need) is a third solution:

Have a client-specific FIFO queue into which failed content ids are pushed. Then a garbage-collection process is in charge of pulling these content ids from the queue and attempting to copy them again a number of times (giving up after that if they still fail).

This would prevent the main 'steady-state' content replayers from being held up by temporary failures of some sort.
The queue needs to be bounded to ensure we do not overflow it (i.e. if a majority of object copies fail, we do not want to fill this queue as fast as Kafka messages are pushed on the content topic).

A simple solution would be to limit the capacity of this queue and make the content replayer stop/crash if it needs to push an object id into this garbage-collector queue but cannot do so for some reason (e.g. the queue is full).

This queue could be stored as a Kafka topic, but that is an implementation detail, so it could be backed by any FIFO-capable provider. In fact, since it is purely a client-side concern, we do not want to provide any such 'service' as allowing a journal consumer (client) to create new topics in the main Kafka.
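
Roughly, I imagine something like this (all names and limits are made up, this is just to illustrate the shape of the thing, independently of what backs the queue):

```python
# Hypothetical sketch: the replayer pushes ids of contents it failed to copy
# into a bounded client-side FIFO and crashes when the queue is full; a
# separate garbage-collection process pops ids and retries the copy a fixed
# number of times before giving up.
MAX_QUEUE_LEN = 100_000   # made-up bound
MAX_RETRIES = 5           # made-up retry budget


class FailedContentQueueFull(Exception):
    """Raised so the replayer stops (and stops committing Kafka offsets)."""


def record_failure(queue, obj_id):
    """Called by the steady-state replayer when a copy fails."""
    if queue.size() >= MAX_QUEUE_LEN:
        raise FailedContentQueueFull(obj_id)
    queue.push(obj_id)


def garbage_collect(queue, copy_one):
    """Separate process: retry each failed id, give up after MAX_RETRIES."""
    while True:
        obj_id = queue.pop()  # blocks until an id is available
        for _ in range(MAX_RETRIES):
            if copy_one(obj_id):
                break
        # if all retries failed, the id is dropped (it could also be logged)
```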

olasd added a subscriber: olasd. Sep 17 2019, 1:33 PM

I agree with @douardda that the "failed content" queue + separate processor approach would be the most sensible.

I also agree that the depth of this queue must be limited, and that reaching the queue depth should prevent further reads from being committed to kafka.

We can't really know the size of a Kafka topic, so I don't think that would be an appropriate solution. The queue needs to be resilient across restarts of the client infrastructure, so memcached is out; it also needs to be shared across workers (ideally), so we should probably look at something rabbitmq- (ugh) or redis-based.
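
For instance, a redis-backed list would give us both properties (survives client restarts, shared across workers). Key name, URL and the exact calls below are just a sketch, not a worked-out design:

```python
import redis


class RedisFailedContentQueue:
    """Failed-content FIFO backed by a Redis list, shared across workers."""

    def __init__(self, url='redis://localhost:6379',
                 key='content_replayer:failed'):
        self.redis = redis.Redis.from_url(url)
        self.key = key

    def size(self):
        return self.redis.llen(self.key)

    def push(self, obj_id: bytes):
        self.redis.lpush(self.key, obj_id)

    def pop(self) -> bytes:
        # BRPOP blocks until an element is available; returns (key, value)
        _, obj_id = self.redis.brpop(self.key)
        return obj_id
```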

👍 to both your messages