normalize encoding values across mimetype and language indexers
Open, Normal, Public

Description

In the language indexer, we need to know the content's encoding so that we can decode it to text and detect the language.

As an earlier step (the mimetype indexer) already processes the content to detect its mimetype and encoding, the language indexer should reuse that encoding instead of detecting it again.
An implementation detail currently prevents this.

The encoding value reported by the 'file' command-line tool, which the mimetype indexer uses, does not match what the native decoding of our environment (Python) expects.
We should normalize these values.
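
One possible way to normalize, sketched here under assumptions: map the label stored by the mimetype indexer onto Python's canonical codec name with the standard `codecs.lookup`, and treat labels that have no Python codec as "not decodable". The function names `normalize_encoding` and `decode_content`, and the set `NON_TEXT_LABELS`, are hypothetical and not taken from the indexer code; the labels in that set are examples of values 'file' can report for non-text content.

```python
import codecs
from typing import Optional

# Hypothetical: labels for which no text decoding should be attempted.
NON_TEXT_LABELS = {"binary", "unknown-8bit"}


def normalize_encoding(label: str) -> Optional[str]:
    """Return Python's canonical codec name for `label`, or None if
    Python has no matching codec (or the label marks non-text data)."""
    cleaned = label.strip().lower()
    if not cleaned or cleaned in NON_TEXT_LABELS:
        return None
    try:
        # codecs.lookup resolves aliases such as "us-ascii" or "latin1".
        return codecs.lookup(cleaned).name
    except LookupError:
        return None


def decode_content(raw: bytes, label: str) -> Optional[str]:
    """Decode `raw` using the normalized form of `label`.
    Returns None when the label is unusable or decoding fails."""
    codec = normalize_encoding(label)
    if codec is None:
        return None
    try:
        return raw.decode(codec)
    except UnicodeDecodeError:
        return None
```

With something along these lines, the language indexer could call `decode_content(raw, stored_encoding)` and fall back to its own detection only when the result is `None`.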

Event Timeline

zack renamed this task from "Reuse encoding detected in mimetype indexer for language indexer" to "normalize encoding values across mimetype and language indexers". Jun 6 2017, 1:53 PM