
swh-loader-core: doesn't properly batch object insertions (for big requests)
Closed, Migrated

Description

The core loader uses a queue system to bundle loading of objects.

This works well when you repeatedly store small numbers of objects, since those small requests get bundled up into bigger ones. However, when loading a large number of objects at once (e.g. loading the contents of a big Debian package such as libreoffice), it doesn't split the insertion up into batches.

This needs to be fixed so that loading doesn't overwhelm the database.

Event Timeline

The core loader uses a queue system to bundle loading of objects.

Indeed, it's one queue per object type.
But it's not necessarily the same queue implementation for each object type (the contents queue is notably different, due to the size parameter which can vary a lot).

However, when loading a large number of objects at once (e.g. loading the contents of a big Debian package such as libreoffice), it doesn't split the insertion up into batches.

The queue system for contents uses two properties to trigger the sending to the storage.
One is the number of elements (default threshold: 10k contents), the other is the cumulative size of all contents currently queued (default threshold: 100 MiB).
When either threshold is reached, the queued data is actually sent.
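
For illustration, a rough sketch of that mechanism (hypothetical names, not the actual swh-loader-core code; content dicts are assumed to carry a length key):

    # Rough sketch of the two-threshold queue described above: contents
    # accumulate until either the count or the cumulative size threshold
    # is crossed, and are then sent to the storage in a single call.
    class ContentQueue:
        def __init__(self, storage, threshold_nb=10_000,
                     threshold_size=100 * 1024 * 1024):
            self.storage = storage
            self.threshold_nb = threshold_nb      # default: 10k contents
            self.threshold_size = threshold_size  # default: 100 MiB
            self.contents = []
            self.total_size = 0

        def add(self, contents):
            """Queue a collection of contents, then flush if a threshold is hit."""
            for content in contents:
                self.contents.append(content)
                self.total_size += content['length']
            if (len(self.contents) >= self.threshold_nb
                    or self.total_size >= self.threshold_size):
                # Everything queued so far goes out in one single storage
                # call, however large the queue has grown -- the behavior
                # this task is about.
                self.storage.content_add(self.contents)
                self.contents = []
                self.total_size = 0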

So adapting that max size threshold for that queue would help.

it doesn't split the insertion up into batches.

Oh, yeah, if some files are bigger than that default, they end up being sent in batches of 1 content...
So maybe for the Debian loader, the default content threshold (which is what is currently used, according to swh-site/default.yaml) is not high enough.

Also, of course, if there is a more Pythonic way to improve on the current implementation, I'm all for it :)

I'm going to try to clarify my point:

  • You configure the queue to send objects in batches of B objects (for instance, 1000 objects).
  • Somehow, you queue N objects to be sent, with N >> B (for instance, you uncompress the Linux kernel and get 100,000 objects to send).

expected:

  • the queue sends objects in ceil(N / B) batches of max B objects (for that instance, 100 batches of 1000 objects)

current behavior:

  • the queue sends one batch of N objects (for that instance, one batch of 100k objects)
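
A minimal sketch of that expected splitting (hypothetical helper name, not the actual loader code):

    def send_in_batches(objects, send, max_batch_size=1000):
        """Send `objects` via `send` in ceil(N / max_batch_size) calls of at
        most `max_batch_size` objects each, instead of one call with all N."""
        for start in range(0, len(objects), max_batch_size):
            send(objects[start:start + max_batch_size])

    # With N = 100_000 and B = 1_000, this makes 100 calls of 1000 objects each.
    send_in_batches(list(range(100_000)), lambda batch: None, max_batch_size=1000)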

The size limit should apply on top of the number-of-items limit, but the behavior should be the same.

The size limit should apply on top of the number-of-items limit, but the behavior should be the same.

Right! Thanks for the clarification.

The currently implemented behavior (when a threshold is reached, send all of the queue's objects to the storage in one go) is indeed too naive for N >> B.

Git Loader's send_in_packets function might come in handy there.
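
For the record, a rough sketch of the idea behind such a helper (the actual send_in_packets in the Git loader may have a different signature; this version also takes the per-packet size budget into account, as discussed above for contents):

    def grouper_by_size(objects, get_size, max_nb=1000,
                        max_size=100 * 1024 * 1024):
        """Yield packets of objects, closing a packet as soon as it holds
        max_nb objects or would exceed max_size cumulative bytes, whichever
        comes first. An object bigger than max_size ends up alone in its
        own packet."""
        packet, packet_size = [], 0
        for obj in objects:
            size = get_size(obj)
            if packet and (len(packet) >= max_nb
                           or packet_size + size > max_size):
                yield packet
                packet, packet_size = [], 0
            packet.append(obj)
            packet_size += size
        if packet:
            yield packet

    # Each packet can then be stored separately, e.g.:
    #   for packet in grouper_by_size(contents, lambda c: c['length']):
    #       storage.content_add(packet)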