Pipeline to compute mimetype, encoding, languages, ctags, ...
Closed, Migrated, Edits Locked

Assigned To

Authored By

	ardumont
	Sep 30 2016, 5:45 PM

Description

The swh.indexer project is in charge of this.

As explained by email:

A producer sends batches of contents (read from the Azure storages), fetches the corresponding filenames from the swh storage, and sends each content (sha1, one random filename) to the next queue
The 1st queue reads the raw content for each sha1, enriches the content with it, and sends it to the next queue
The 2nd queue computes the mimetype and encoding, enriches the content, saves the result in storage, and sends the enriched content to the next queue
The 3rd queue computes the programming language, enriches the content, saves the result, and sends it to the next queue
The 4th queue computes the ctags, enriches the content, saves the result, and stops (for now)
etc.
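
The chain of queues above can be sketched roughly as follows. This is only an illustration, not the swh.indexer code: the stages are plain functions called in order instead of real queues, RAW_OBJECTS and RESULTS are hypothetical in-memory stand-ins for the object storage and the result storage, and the detection logic is a placeholder.

```python
from typing import Callable, Dict, List

Content = Dict[str, object]
Stage = Callable[[Content], Content]

# Hypothetical in-memory stand-ins for the object storage and the result storage.
RAW_OBJECTS: Dict[str, bytes] = {"sha1-a": b"#!/bin/sh\necho hello\n"}
RESULTS: Dict[str, Content] = {}


def save(content: Content) -> None:
    """Store the enriched fields, leaving the raw content out."""
    fields = {k: v for k, v in content.items() if k != "data"}
    RESULTS.setdefault(str(content["sha1"]), {}).update(fields)


def read_raw(content: Content) -> Content:
    """1st queue: attach the raw bytes matching the sha1."""
    return {**content, "data": RAW_OBJECTS[str(content["sha1"])]}


def mimetype_encoding(content: Content) -> Content:
    """2nd queue: placeholder detection (a NUL byte means binary), then save."""
    data = content["data"]
    is_binary = isinstance(data, bytes) and b"\x00" in data
    enriched = {
        **content,
        "mimetype": "application/octet-stream" if is_binary else "text/plain",
        "encoding": "binary" if is_binary else "utf-8",
    }
    save(enriched)
    return enriched


def language(content: Content) -> Content:
    """3rd queue: placeholder language detection, then save."""
    enriched = {**content, "lang": None}  # a real detector would go here
    save(enriched)
    return enriched


def run_pipeline(content: Content, stages: List[Stage]) -> Content:
    """Pass the content through each stage in order, as the queues would."""
    for stage in stages:
        content = stage(content)
    return content


STAGES: List[Stage] = [read_raw, mimetype_encoding, language]
enriched = run_pipeline({"sha1": "sha1-a", "name": "hello.sh"}, STAGES)
```

Each stage returns a new dict rather than mutating its input, so a stage can be retried or swapped without affecting the previous ones; a ctags stage would be appended to STAGES in the same way.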

A POC has been developed and deployed on worker01.euwest.azure.softwareheritage.org

Note:

The storage I keep referring to is MongoDB; it runs on the same Azure node I run the tests on (so I need to deploy it elsewhere)
The raw content is not stored in the storage
The storage is write-only for now, but at some point it will be used for reading as well (typically for full-text search on ctags, from the webapp with a new API endpoint)

Related Objects

		Status	Assigned	Task
				Unknown Object (Maniphest Task)
		Migrated	gitlab-migration	T571 Pipeline to compute mimetype, encoding, languages, ctags, ...

Event Timeline

ardumont created this task. Sep 30 2016, 5:45 PM

I had some issues with UTF-8 decoding errors, so I now run the file tool early enough to help determine the encoding of the files and enrich the result.

So now I use the encoding field to:

reduce decoding errors
filter out contents with a 'binary' encoding from programming language detection and ctags computation (it made no sense to me to allow them, but I could be wrong).

Note that decoding failures are still possible (since some encodings reported by the file tool may make no sense to Python). Those contents are marked as such with a 'decoding_failure' key entry.
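
A minimal sketch of how the encoding field might be used this way, assuming each content is a dict carrying 'data' (raw bytes) and 'encoding' (as reported by the file tool). The function names and the 'text' key are hypothetical; only the 'decoding_failure' key comes from the description above.

```python
from typing import Dict


def should_index(content: Dict[str, object]) -> bool:
    """Skip contents the file tool flagged as binary."""
    return content.get("encoding") != "binary"


def decode_text(content: Dict[str, object]) -> Dict[str, object]:
    """Decode the raw bytes with the detected encoding.

    On failure, mark the content with a 'decoding_failure' entry
    instead of raising.
    """
    data = content["data"]
    encoding = content["encoding"]
    assert isinstance(data, bytes) and isinstance(encoding, str)
    try:
        text = data.decode(encoding)
    except LookupError:
        # The encoding name is unknown to Python, e.g. "unknown-8bit".
        return {**content, "decoding_failure": True}
    except UnicodeDecodeError:
        # The bytes are not valid in the announced encoding.
        return {**content, "decoding_failure": True}
    return {**content, "text": text}
```

For instance, decode_text({"data": b"abc", "encoding": "unknown-8bit"}) returns the content with 'decoding_failure' set, because Python has no codec of that name.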

I also had lots of issues with many recognition tools, which by default expect certain inputs to be provided:

ctags
github-linguist
ohcount (blackduck's)

In some form or another, they expect either the filename (surely for the extension) or the language to be provided. If neither is given, they simply do nothing.

Anyway, that's why I now randomly choose one filename from the cache.

The python3-pygments library works by analyzing the code itself (it has some heuristics as well, the shebang for one), and it still works well.
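
A rough sketch of content-based detection with pygments, using its guess_lexer and guess_lexer_for_filename functions. The detect_language wrapper is hypothetical, and passing the filename as an extra hint is an assumption of this sketch, not necessarily what the POC does. The import is guarded so the function simply returns None when pygments is not installed.

```python
from typing import Optional

try:
    from pygments.lexers import guess_lexer, guess_lexer_for_filename
    from pygments.util import ClassNotFound
except ImportError:  # pygments is a third-party package and may be missing
    guess_lexer = None


def detect_language(text: str, filename: Optional[str] = None) -> Optional[str]:
    """Return the name of the lexer pygments guesses, or None if it cannot tell."""
    if guess_lexer is None:
        return None
    try:
        if filename:
            # Narrow the candidates by filename, then analyze the text.
            lexer = guess_lexer_for_filename(filename, text)
        else:
            # Rely on the text alone (shebang and other heuristics).
            lexer = guess_lexer(text)
    except ClassNotFound:
        return None
    return lexer.name


lang = detect_language("#!/usr/bin/env python\nprint('hi')\n")
```

The returned value is pygments' own lexer name, which is the name that then has to be mapped to whatever ctags or other tools expect.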

I also tried to use the language name, but that cannot work.
By default ctags knows only 40 languages or so (42, I think), but python3-pygments (which I use here) knows 404...
So mapping one onto the other cannot go well, not by a long shot...

ardumont added a parent task: Unknown Object (Maniphest Task). Oct 5 2016, 10:45 AM

ardumont changed the edit policy from "All Users" to "Staff (Project)".

ardumont edited parent tasks, added: Unknown Object (Maniphest Task); removed: Unknown Object (Maniphest Task).

The need evolved - cf. T574

ardumont closed this task as Invalid. Oct 5 2016, 10:58 AM

ardumont mentioned this in T577: Mimetype/Encoding indexer. Oct 5 2016, 11:17 AM

ardumont mentioned this in T578: Language indexer. Oct 5 2016, 11:19 AM

This task has been migrated to GitLab.
