Indexers: Send range of ids instead of list of ids
Closed, Migrated

Description

It has been discussed that sending lists of content ids to the indexers is not a good pattern.

For one, the ids are task parameters, so they are part of the message, and that message can grow quite large.

One proposed improvement is to send a range of ids instead.

Event Timeline

ardumont triaged this task as Normal priority. Mar 14 2018, 1:22 PM
ardumont created this task.

That would bring us closer to using the real scheduler (swh-scheduler) [1] rather than the volatile one [2].

[1] Well, that would not be enough. Today, the swh-scheduler has a mechanism to check the length of a single queue before scheduling more tasks. For the indexer, it's slightly more complex, as we need to check the length of multiple queues (orchestrator, mimetype, fossology_license...). That's what [2] does.

[2] https://forge.softwareheritage.org/source/snippets/browse/master/ardumont/schedule_with_queue_length_check.py
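
To illustrate the multi-queue check described in [1], here is a minimal sketch. It is not taken from the snippet in [2]: the function name `free_slots`, the single shared threshold, and the way queue lengths are passed in as a plain mapping are all assumptions made for the example.

```python
from typing import Mapping


def free_slots(queue_lengths: Mapping[str, int], max_length: int) -> int:
    """Return how many tasks could be scheduled before the fullest queue
    reaches max_length (hypothetical helper, for illustration only)."""
    if not queue_lengths:
        # No queue to watch: the whole budget is available.
        return max_length
    # The fullest queue is the bottleneck, so it bounds the budget.
    fullest = max(queue_lengths.values())
    return max(0, max_length - fullest)


# Example with the queue names mentioned above (lengths are made up).
lengths = {"orchestrator": 10, "mimetype": 80, "fossology_license": 5}
budget = free_slots(lengths, max_length=100)
```

A scheduler loop would only send more tasks when the returned budget is above zero, which is the difference from checking one queue alone.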

As an implementation strategy, I think we can aim at:

  • sending the workers a start and end of range
  • at the beginning of each worker task, the worker
    • if it's incremental :
      • lists, from the indexer database, the already indexed objects in the range (new APIs in swh.indexer.storage)
    • else:
      • uses an empty list as filter :)
    • fetches all the objects that exist in the given range, excluding the objects found by the previous step (new APIs in swh.storage)
  • then the worker can iterate its work on the fetched objects
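
The steps above can be sketched as follows. This is only an illustration under assumptions: the two "list ids in a range" callables stand in for the new APIs that swh.indexer.storage and swh.storage would need, and their names and signatures here are hypothetical. Ranges are treated as inclusive on both ends.

```python
from typing import Callable, Iterable, Iterator, List, Set

# Hypothetical signature: given (start, end), return the ids in that
# inclusive range. The real storage APIs may look different.
RangeLister = Callable[[bytes, bytes], Iterable[bytes]]


def ids_to_index(
    start: bytes,
    end: bytes,
    list_contents: RangeLister,
    list_indexed: RangeLister,
    incremental: bool = True,
) -> Iterator[bytes]:
    """Yield the ids in [start, end] that still need indexing."""
    if incremental:
        # Skip what the indexer database already holds for this range.
        already_indexed: Set[bytes] = set(list_indexed(start, end))
    else:
        # Full run: empty filter, so every object is (re)indexed.
        already_indexed = set()
    for content_id in list_contents(start, end):
        if content_id not in already_indexed:
            yield content_id


def make_lister(ids: Iterable[bytes]) -> RangeLister:
    """In-memory stand-in for a storage range query (illustration only)."""
    sorted_ids: List[bytes] = sorted(ids)

    def lister(start: bytes, end: bytes) -> List[bytes]:
        return [i for i in sorted_ids if start <= i <= end]

    return lister


# Tiny example with 1-byte ids instead of real hashes.
storage = make_lister([b"\x01", b"\x02", b"\x03", b"\x10"])
indexed = make_lister([b"\x02"])
todo = list(ids_to_index(b"\x00", b"\x0f", storage, indexed))
```

With this shape, the task message carries only `start` and `end`, whatever the number of objects in the range.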

This would reduce task generation to making a list of ranges, and this list of tasks-per-range would never change and could even be added to the scheduler as standard recurring tasks.
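
As a sketch of what "making a list of ranges" could look like, the following splits the whole id space into contiguous, inclusive ranges. It assumes ids are fixed-length byte strings (20 bytes by default, as for a sha1) and that an even split of the id space is acceptable; `make_ranges` is a hypothetical name, not an existing swh function.

```python
from typing import List, Tuple


def make_ranges(n: int, id_length: int = 20) -> List[Tuple[bytes, bytes]]:
    """Split the space of id_length-byte ids into n contiguous ranges.

    Each range is an inclusive (start, end) pair of big-endian byte strings.
    """
    space = 1 << (8 * id_length)  # number of possible ids
    if not 1 <= n <= space:
        raise ValueError("n must be between 1 and the size of the id space")
    ranges = []
    for i in range(n):
        start = i * space // n
        end = (i + 1) * space // n - 1  # last id before the next range starts
        ranges.append(
            (start.to_bytes(id_length, "big"), end.to_bytes(id_length, "big"))
        )
    return ranges


# Small example with 1-byte ids so the boundaries are easy to read.
small = make_ranges(4, id_length=1)
```

Because the boundaries depend only on `n` and the id length, the same list can be regenerated at any time, which is what would make one recurring task per range possible.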

ardumont renamed this task from General improvement of the indexers: Send range of ids instead of raw ids to Indexers: Send range of ids instead of raw ids. Oct 3 2018, 12:00 PM

By the way, status on this:

  • mimetype indexer: migrated, deployed
  • fossology_license indexer: migrated, deployed
  • ctags: stand-by (not running in production)
  • language: stand-by (not running in production)

What remains now is to determine the ranges.

ardumont renamed this task from Indexers: Send range of ids instead of raw ids to Indexers: Send range of ids instead of list of ids. Nov 21 2018, 3:26 PM