
Tweak content backfill order to help content replayer
Closed, Migrated

Description

The current implementation of the content backfiller processes content metadata in sequential sha1 order.

This works fine, but forces the content replayer to process most contents in sha1 order as well. This gives us a reasonably good cache effect when sourcing contents from a pathslicing objstorage, but doesn't help parallelization when the source is a striped storage with separate prefixes, since consecutive contents all share the same prefix.

Before running the backfiller again, we should try to shuffle the order of contents to have the data in a better shape.

Two approaches have been suggested:

  • process contents in order of sha1_git or another hash, which will shuffle the contents' sha1s.
  • process contents in order of sha1, but mixing up the ranges to provide different prefixes (e.g. 000000 100000 200000 ... f00000 000001 100001 200001 ...).

The first approach would be more elegant (and also work with a 4-character diff), but the second approach still gives a slight cache effect, at the expense of a slightly larger code change.
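
As a rough illustration of the second approach, here is a minimal sketch of how the interleaved range order could be generated. The function names (`interleaved_prefixes`, `prefix_to_range`) and the 6-hex-digit range width are assumptions made for this example, and the range bounds are assumed to be inclusive; none of this is taken from the actual backfiller code.

```python
from itertools import islice
from typing import Iterator, Tuple


def interleaved_prefixes(width: int = 6) -> Iterator[str]:
    """Yield every hex prefix of `width` digits, leading digit varying fastest.

    Hypothetical helper: consecutive prefixes start with different hex
    digits, so consecutive ranges land on different storage prefixes.
    """
    if width < 1:
        raise ValueError("width must be at least 1")
    for rest in range(16 ** (width - 1)):
        # Remaining digits, zero-padded; empty when width == 1.
        tail = f"{rest:0{width - 1}x}" if width > 1 else ""
        for lead in range(16):
            yield f"{lead:x}{tail}"


def prefix_to_range(prefix: str) -> Tuple[bytes, bytes]:
    """Return the (first, last) sha1 covered by a prefix, both inclusive."""
    start = bytes.fromhex(prefix.ljust(40, "0"))
    end = bytes.fromhex(prefix.ljust(40, "f"))
    return start, end


if __name__ == "__main__":
    # The 16th and 17th prefixes show the wrap-around: f00000, then 000001.
    print(list(islice(interleaved_prefixes(6), 17))[-2:])
```

Within each range the contents are still read in sha1 order, which is where the slight cache effect would come from.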

Event Timeline

olasd triaged this task as High priority. Jun 18 2019, 3:44 PM
olasd created this task.

I'm inclined to prefer option 2, since performance is an issue we cannot underestimate...

I've launched 16 content backfillers in parallel, one for each hex digit prefix, which should help with this.
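
For reference, the split into one range per hex digit could look like the sketch below. The helper name `hex_digit_ranges` is made up for this example and the bounds are assumed to be inclusive; how each range is actually passed to a backfiller process is not shown, since that depends on the backfiller's own interface.

```python
from typing import List, Tuple

HEX_DIGITS = "0123456789abcdef"


def hex_digit_ranges() -> List[Tuple[str, str]]:
    """Return 16 (start, end) sha1 hex ranges, one per leading hex digit.

    Hypothetical helper: both bounds are inclusive 40-character hex strings.
    """
    return [(d + "0" * 39, d + "f" * 39) for d in HEX_DIGITS]


if __name__ == "__main__":
    for start, end in hex_digit_ranges():
        # Each pair would be handed to one backfiller process.
        print(start, end)
```

With 16 processes each walking its own range in sha1 order, the stream as a whole mixes all 16 leading digits even though each process remains sequential.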