
Improve directory journal backfill performance
Closed, Migrated

Description

The current performance of the directory backfiller is poor: at the current rate, the ETA is one to two years.

This isn't too surprising, as the current code is very naive: it processes directories in hash order and, for each directory, runs a massive join to fetch the directory entries.
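To make the "massive join" concrete, here is a rough sketch of the per-directory lookup, assuming entry ids are stored as arrays on the directory row and resolved against the shared directory_entry_* tables (names loosely follow the SWH schema; this is a simplification, not the actual backfiller code):

```python
def fetch_directory_entries(cur, dir_id):
    # One heavy join per directory: resolve the entry ids stored on the
    # directory row against the shared directory_entry_* tables. The
    # real query also covers dir_entries and rev_entries; only the
    # file-entry arm is shown here for brevity.
    cur.execute(
        """
        SELECT e.name, e.target, e.perms
        FROM directory d
        JOIN directory_entry_file e ON e.id = ANY(d.file_entries)
        WHERE d.id = %s
        """,
        (dir_id,),
    )
    return cur.fetchall()
```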

There are a few suggestions for improving this:

  • @douardda suggested processing directories in insertion order (roughly, the infamous internal object_id field), which may give better performance thanks to cache effects on the shared directory_entry_* tables (see the sketch after this list).
  • We could just try using the new, more performant database server and see whether the ETA is more sensible.
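Below is a minimal sketch of the insertion-order idea, assuming a monotonically increasing object_id column on the directory table; process_directory is a hypothetical stand-in for the per-directory work:

```python
def process_directory(cur, dir_id):
    # Hypothetical stand-in for the per-directory work
    # (entry join + journal write).
    ...

def backfill_by_insertion_order(cur, batch_size=1000):
    # Walk the directory table in object_id (insertion) order rather
    # than hash order, so consecutive directories tend to hit warm
    # pages of the shared directory_entry_* tables.
    last_id = 0
    while True:
        cur.execute(
            """
            SELECT id, object_id FROM directory
            WHERE object_id > %s
            ORDER BY object_id
            LIMIT %s
            """,
            (last_id, batch_size),
        )
        rows = cur.fetchall()
        if not rows:
            break
        for dir_id, object_id in rows:
            process_directory(cur, dir_id)
            last_id = object_id
```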

Event Timeline

olasd triaged this task as High priority. Jun 18 2019, 3:57 PM
olasd created this task.

Do we have any insight yet into the behavior of the backfiller against belvedere?

Running the directory backfiller (single instance) against belvedere yields an ETA of 250 days, which is around a 3x speedup from somerset.

Running 16 jobs in parallel gets the ETA down to 45 days, with a barely noticeable load increase on belvedere.
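For illustration, parallelizing like this amounts to partitioning the directory id space into disjoint ranges and running one backfill process per range; a rough sketch follows (backfill_directories is a hypothetical worker, and the real backfiller's invocation differs):

```python
from multiprocessing import Pool

NUM_JOBS = 16
ID_SPACE = 2 ** 160  # size of the sha1 identifier space

def backfill_directories(start, end):
    # Placeholder: each worker would run the backfiller over its own
    # [start, end) slice of the directory id space.
    ...

def run_range(bounds):
    start, end = bounds
    backfill_directories(start, end)

if __name__ == "__main__":
    # Partition the id space into NUM_JOBS contiguous, disjoint ranges.
    ranges = [
        (i * ID_SPACE // NUM_JOBS, (i + 1) * ID_SPACE // NUM_JOBS)
        for i in range(NUM_JOBS)
    ]
    with Pool(NUM_JOBS) as pool:
        pool.map(run_range, ranges)
```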

After adding more cores to getty to alleviate the CPU-boundedness, this should get trimmed down some more. I think that would be acceptable?

olasd changed the task status from Open to Work in Progress.Jun 25 2019, 6:31 PM

With 16 processes still running in parallel, adding more CPUs gives an ETA of ~1 month, which is still fairly bad.

One month is good enough. Let's stick with this.

At this point, I don't think we'll get much better with postgres as the source.

(the backfill had, in fact, completed within a month)