
Pipeline to compute mimetype, encoding, languages, ctags, ...
Closed, Migrated

Description

The swh.indexer project is in charge of this.

As explained by email:

  • a producer sends batches of contents (read from the azure storages), fetches the corresponding filenames from the swh storage, and sends each content (sha1, one random filename) to the next queue
  • the 1st queue is in charge of reading the raw content for each sha1, enriching the content with it, and sending it to the next queue
  • the 2nd queue is in charge of computing the mimetype and encoding, enriching the content, saving the result in storage, and sending the enriched content to the next queue
  • the 3rd queue is in charge of computing the programming language, enriching the content, saving the result, and sending it to the next queue
  • the 4th queue is in charge of computing the ctags, enriching the content, saving the result, and stopping there (for now)
  • etc...
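
To make the chaining concrete, here is a minimal in-memory sketch of the first stages. It is not the swh.indexer code: the dictionaries RAW_CONTENTS, FILENAMES and RESULTS are hypothetical stand-ins for the azure storage, the swh storage and the indexer storage, every function name is made up for illustration, and the mimetype/encoding step is a placeholder rather than a real detection.

```python
import random
from queue import Queue

# Hypothetical stand-ins for the azure blob storage and the swh storage.
RAW_CONTENTS = {"sha1-a": b"#!/bin/sh\necho hi\n"}
FILENAMES = {"sha1-a": ["run.sh", "start.sh"]}
RESULTS = {}  # stand-in for the indexer storage (raw content is not kept)


def produce(sha1s):
    """Producer: pair each sha1 with one randomly chosen known filename."""
    for sha1 in sha1s:
        names = FILENAMES.get(sha1, [])
        yield {"sha1": sha1, "name": random.choice(names) if names else None}


def save(content):
    """Persist the enriched content, leaving the raw bytes out."""
    RESULTS[content["sha1"]] = {k: v for k, v in content.items() if k != "data"}


def read_content(content):
    """1st queue: enrich the content with its raw bytes."""
    return {**content, "data": RAW_CONTENTS[content["sha1"]]}


def compute_mimetype_encoding(content):
    """2nd queue: placeholder detection (assumption, not the real tool)."""
    try:
        content["data"].decode("utf-8")
        props = {"mimetype": "text/plain", "encoding": "utf-8"}
    except UnicodeDecodeError:
        props = {"mimetype": "application/octet-stream", "encoding": "binary"}
    enriched = {**content, **props}
    save(enriched)
    return enriched


def run_pipeline(sha1s):
    """Drain each queue through its stage and feed the next queue."""
    queue = Queue()
    for content in produce(sha1s):
        queue.put(content)
    for stage in (read_content, compute_mimetype_encoding):
        next_queue = Queue()
        while not queue.empty():
            next_queue.put(stage(queue.get()))
        queue = next_queue
    return RESULTS
```

The language and ctags stages would follow the same shape: take the enriched content, add keys, save, pass it on.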

A POC has been developed and deployed on worker01.euwest.azure.softwareheritage.org

Note:

  • The storage I keep referring to is mongodb, and it runs on the same azure node I run the tests on (so I need to deploy it elsewhere)
  • The raw content is not stored in the storage
  • The storage is write-only for now, but at some point it will be used for reading as well (typically for full-text search on ctags, from the webapp through a new api endpoint)

Event Timeline

I had some issues with utf-8 decoding errors, so now I run the file tool early enough in the pipeline to help determine the encoding of the files and enrich the result.

So now, I exploit the encoding field to:

  • reduce decoding errors
  • filter out contents with the 'binary' encoding from programming language detection and ctags computation (it made no sense to me to allow them, but I could be wrong).

Note that decoding failures are still possible (since some encodings reported by the file tool may make no sense to python). Those contents are marked as such with a 'decoding_failure' key entry.
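
As an illustration of that logic, a minimal sketch follows. The function name decode_content and the 'skipped' and 'text' keys are hypothetical; only the 'encoding' field, the 'binary' value and the 'decoding_failure' key come from the description above.

```python
def decode_content(content):
    """Return the content enriched with decoded text, or flagged on failure."""
    encoding = content.get("encoding")
    if encoding is None or encoding == "binary":
        # Binary contents are left out of language and ctags computation.
        return {**content, "skipped": True}
    try:
        text = content["data"].decode(encoding)
    except (LookupError, UnicodeDecodeError):
        # LookupError: the reported encoding name is unknown to python.
        # UnicodeDecodeError: the bytes do not match the reported encoding.
        return {**content, "decoding_failure": True}
    return {**content, "text": text}
```

For instance, the file tool can report an encoding such as unknown-8bit, which python has no codec for; that case ends up in the LookupError branch.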

I also had lots of issues with many recognition tools, which expect some inputs to be provided by default:

  • ctags
  • github-linguist
  • ohcount (blackduck's)

In one form or another, they expect either the filename (surely for the extension) or the language to be provided. If it is not, they simply don't do anything.

Anyway, that's why I now pick one filename at random from the cache.

The python3-pygments library works by analyzing the code itself (well, it has some heuristics as well, the shebang for one), and it still works well.
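
For reference, content-based detection with pygments can look roughly like the sketch below. The helper name detect_language is hypothetical, and the import is guarded because pygments is a third-party package that may not be installed; guess_lexer and ClassNotFound are the pygments calls used.

```python
try:
    from pygments.lexers import guess_lexer
    from pygments.util import ClassNotFound
except ImportError:  # pygments is not installed
    guess_lexer = None


def detect_language(text):
    """Return the name of the lexer pygments guesses for `text`, or None."""
    if guess_lexer is None:
        return None
    try:
        # guess_lexer relies on the text only (shebang, content heuristics).
        return guess_lexer(text).name
    except ClassNotFound:
        return None


language = detect_language("#!/usr/bin/env python\nprint('hi')\n")
```

No filename is needed here, which is the difference with the tools listed above.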

I also tried to exploit the language name, but that cannot work.
ctags knows only 40 languages or so by default (42 I think), but python3-pygments (which I use here) knows 404...
So mapping one to the other cannot go well, not by a long shot...

ardumont added a parent task: Unknown Object (Maniphest Task).Oct 5 2016, 10:45 AM
ardumont changed the edit policy from "All Users" to "Staff (Project)".