Indexers: batch content analyzer infrastructure
Closed, Migrated, Edits Locked

Description

We want to be able to analyze, in batch, all the content blobs stored by Software Heritage.

Sample use cases are:

  • compute mime type (service running)
  • detect the license using ninka/fossology (service running)
  • detect the programming language (service stopped)

To this end, we need scheduling tooling that lets us add and remove analyzers, (re)run analyses in batch, and incrementally stay up to date with new incoming content blobs.
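As a rough illustration of those three requirements, here is a minimal in-memory sketch. It is not the Software Heritage indexer or scheduler API: `AnalyzerRegistry`, its methods, and the `(analyzer name, blob id)` bookkeeping are all hypothetical names chosen for this example, and a real deployment would persist results and dispatch work through a task queue.

```python
from typing import Callable, Dict, Set, Tuple

# Hypothetical type: an analyzer maps raw blob bytes to a result string.
Analyzer = Callable[[bytes], str]


class AnalyzerRegistry:
    """Toy registry: add/remove analyzers, run them in batch, skip done work."""

    def __init__(self) -> None:
        self._analyzers: Dict[str, Analyzer] = {}
        # Remember which (analyzer name, blob id) pairs were already processed.
        self._done: Set[Tuple[str, str]] = set()
        self.results: Dict[Tuple[str, str], str] = {}

    def add(self, name: str, analyzer: Analyzer) -> None:
        self._analyzers[name] = analyzer

    def remove(self, name: str) -> None:
        self._analyzers.pop(name, None)

    def run(self, blobs: Dict[str, bytes], force: bool = False) -> int:
        """Run every registered analyzer on every blob.

        Pairs already processed are skipped unless force=True (a re-run).
        Returns the number of analyses actually performed.
        """
        performed = 0
        for name, analyzer in self._analyzers.items():
            for blob_id, data in blobs.items():
                key = (name, blob_id)
                if key in self._done and not force:
                    continue
                self.results[key] = analyzer(data)
                self._done.add(key)
                performed += 1
        return performed
```

Calling `run` again after new blobs arrive only analyzes the new ones, which is the "incrementally stay up to date" part; `force=True` covers the batch re-run, and a newly added analyzer automatically catches up on every blob it has not seen.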

Event Timeline

zack updated the task description.
olasd changed the visibility from "All Users" to "Public (No Login Required)". May 13 2016, 5:09 PM
ardumont renamed this task from batch blob analyzer infrastructure to Indexers: batch blob analyzer infrastructure. Oct 5 2018, 2:47 PM

Unplugging T528 as per discussion.

We need to rework the current indexer implementation to work on ranges of contents instead (T991).
After that, we can schedule 256 ranges of contents to index through the scheduler stack.
Then we'll see where that goes.
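For illustration, one way to cut the content space into 256 ranges is to split the hash key space evenly. The sketch below assumes contents are keyed by a 20-byte SHA1, which this comment does not state; `sha1_ranges` is a hypothetical helper, not a function from the indexer or scheduler code.

```python
from typing import List, Tuple


def sha1_ranges(n_ranges: int = 256) -> List[Tuple[bytes, bytes]]:
    """Split the 20-byte SHA1 key space into n_ranges contiguous ranges.

    Each range is an inclusive (start, end) pair of 20-byte identifiers.
    Assumption: contents are indexed by SHA1; adjust the width otherwise.
    """
    space = 1 << 160  # number of possible 20-byte values
    ranges = []
    for i in range(n_ranges):
        start = i * space // n_ranges
        end = (i + 1) * space // n_ranges - 1
        ranges.append((start.to_bytes(20, "big"), end.to_bytes(20, "big")))
    return ranges
```

With the default of 256, each range covers exactly one value of the first byte, so a range-based indexing task could be scheduled per `(start, end)` pair.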

ardumont renamed this task from Indexers: batch blob analyzer infrastructure to Indexers: batch content analyzer infrastructure. Oct 19 2018, 8:44 AM
ardumont raised the priority of this task from Low to Normal.
ardumont updated the task description.
ardumont added a project: Indexer.
ardumont claimed this task.

> We need to rework the current indexer implementation to work on ranges of contents instead (T991).
> After that, we can schedule 256 ranges of contents to index through the scheduler stack.
> Then we'll see where that goes.

Done.

So in effect:

> To this end, we need scheduling tooling that lets us add and remove analyzers, (re)run analyses in batch, and incrementally stay up to date with new incoming content blobs.

Done.