
Indexers: batch content analyzer infrastructure
Closed, Resolved (Public)

Description

We want to be able to analyze, in batch, all the content blobs stored by Software Heritage.

Sample use cases are:

  • compute mime type (service running)
  • detect the license using ninka/fossology (service running)
  • detect the programming language (service stopped)

To this end we need scheduling tooling that lets us add/remove analyzers, (re)run analyses in batch, and incrementally stay up to date with new incoming content blobs.
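The three requirements above (add/remove analyzers, batch re-runs, incremental catch-up) can be illustrated with a small in-memory sketch. Everything here is hypothetical: `AnalyzerRegistry`, its methods, and the dict-based blob store are illustration only and are not the Software Heritage indexer or scheduler API.

```python
from typing import Callable, Dict, Tuple

# Hypothetical type: an analyzer maps raw blob bytes to a textual result.
Analyzer = Callable[[bytes], str]


class AnalyzerRegistry:
    """Toy registry, assumed for illustration; not the real indexer API."""

    def __init__(self) -> None:
        self._analyzers: Dict[str, Analyzer] = {}
        # Results are keyed by (analyzer name, blob id).
        self.results: Dict[Tuple[str, str], str] = {}

    def add(self, name: str, analyzer: Analyzer) -> None:
        self._analyzers[name] = analyzer

    def remove(self, name: str) -> None:
        # Drop the analyzer and every result it produced.
        self._analyzers.pop(name, None)
        self.results = {k: v for k, v in self.results.items() if k[0] != name}

    def run(self, blobs: Dict[str, bytes], rerun: bool = False) -> int:
        """Analyze blobs; return how many (analyzer, blob) pairs were computed.

        With rerun=False only missing pairs are computed, which gives the
        incremental behaviour; rerun=True recomputes everything.
        """
        computed = 0
        for name, analyzer in self._analyzers.items():
            for blob_id, data in blobs.items():
                key = (name, blob_id)
                if rerun or key not in self.results:
                    self.results[key] = analyzer(data)
                    computed += 1
        return computed
```

Calling `run` again after new blobs arrive only processes the pairs that have no stored result yet, while `run(blobs, rerun=True)` recomputes everything, for instance after an analyzer is upgraded.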

Event Timeline

zack created this task. Apr 1 2016, 10:51 AM
zack updated the task description.
olasd changed the visibility from "All Users" to "Public (No Login Required)". May 13 2016, 5:09 PM

A POC is ongoing as T548.

ardumont renamed this task from batch blob analyzer infrastructure to Indexers: batch blob analyzer infrastructure. Oct 5 2018, 2:47 PM

Unplugging T528 as per discussion.

We need to rework the current indexer implementation to use ranges instead (T991).
After that, we can schedule 256 ranges of contents to index using the scheduler stack.
Then we see where that goes.
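As a rough illustration of what "256 ranges of contents" could look like, here is a minimal sketch that splits an identifier space into contiguous ranges. It assumes contents are keyed by 20-byte SHA1 digests and that the ranges are equal-sized; the comment states neither, and `sha1_ranges` is a hypothetical helper, not part of the indexer or scheduler code.

```python
from typing import List, Tuple

SHA1_BYTES = 20  # assumption: contents are identified by 20-byte SHA1 digests


def sha1_ranges(n_ranges: int = 256) -> List[Tuple[bytes, bytes]]:
    """Split the SHA1 key space into n_ranges contiguous (start, end) pairs.

    Both bounds are inclusive, so every possible digest falls in exactly
    one range.
    """
    space = 1 << (8 * SHA1_BYTES)
    if not 1 <= n_ranges <= space:
        raise ValueError("n_ranges must be between 1 and the key space size")
    bounds = [i * space // n_ranges for i in range(n_ranges + 1)]
    return [
        (
            bounds[i].to_bytes(SHA1_BYTES, "big"),
            (bounds[i + 1] - 1).to_bytes(SHA1_BYTES, "big"),
        )
        for i in range(n_ranges)
    ]


ranges = sha1_ranges()
```

With 256 ranges, each one covers all identifiers sharing the same first byte, so each (start, end) pair could be handed to the scheduler as one independent indexing task.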

ardumont renamed this task from Indexers: batch blob analyzer infrastructure to Indexers: batch content analyzer infrastructure.
ardumont raised the priority of this task from Low to Normal.
ardumont updated the task description.
ardumont added a project: Indexer.
ardumont closed this task as Resolved. Jan 15 2019, 2:44 PM
ardumont claimed this task.

We need to rework the current indexer implementation to use ranges instead (T991).
After that, we can schedule 256 ranges of contents to index using the scheduler stack.
Then we see where that goes.

Done.

So in effect:

To this end we need scheduling tooling that lets us add/remove analyzers, (re)run analyses in batch, and incrementally stay up to date with new incoming content blobs.

Done.