
Improve language indexer performance
Closed, Resolved · Public

Description

The language indexer is slow due to the tool used underneath (pygments) and possibly the contents' size.

To give some details, pygments is used for language detection because it is the tool that detects the most languages.
The problem is that its API works only on text, not on bytes (and we deal with bytes), so we first need to detect each content's encoding and then decode it appropriately.
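The decode step can be sketched as follows. This is a hypothetical helper, not the indexer's actual code: instead of a full incremental detector it simply tries UTF-8 first and falls back to latin-1, which accepts every byte value and therefore never fails.

```python
def decode_for_pygments(raw: bytes) -> str:
    """Turn raw bytes into text so pygments can classify them (sketch)."""
    try:
        return raw.decode('utf-8')
    except UnicodeDecodeError:
        # latin-1 maps all 256 byte values, so this cannot raise;
        # the result may be wrong for exotic encodings, but it is text.
        return raw.decode('latin-1')
```

The decoded string could then be fed to `pygments.lexers.guess_lexer`, which is the expensive part this task aims to speed up.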

It has already been improved recently to detect the encoding incrementally (no task references this), but that is not enough.

Hints:

  • take only the first 10k of the raw contents (as a possible configuration option).
  • take only a percentage portion of the content (also a possible configuration option).
  • use the detected encoding from the mimetype indexer and pass along that optional information.
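The first two hints amount to capping how much raw content reaches pygments. A minimal sketch, assuming hypothetical configuration names (`max_size`, `ratio`) that are not the indexer's real configuration keys:

```python
def truncate_content(raw: bytes, max_size: int = 10240, ratio=None) -> bytes:
    """Keep only a prefix of the raw content before language detection.

    max_size: absolute cap in bytes (the "first 10k" hint).
    ratio: optional fraction of the content to keep (the "percentage" hint).
    """
    if ratio is not None:
        # keep at least one byte so tiny contents are not emptied
        raw = raw[:max(1, int(len(raw) * ratio))]
    return raw[:max_size]
```

Truncating a file mid-way can split a multi-byte character, which is why the decoding policy on bad chunking sequences (rDCIDX6ea5daa123a6) matters here.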

Related Objects

Mentioned In
rDCIDXd7d57350ee65: Added tests for language indexer (T722)
rDCIDX6ea5daa123a6: language: Improve decoding policy on bad chunking sequence
rSPSITE74f31d4f1b87: data/defaults: indexer: Balance concurrency between indexers
rDCIDX86b06785bfa1: swh.indexer.language: Reduce verbosity
rDCIDXc636a53b8f56: swh.indexer.mimetype: Fix wrong default configuration
rSPSITE8014c4e931e9: data/defaults: mimetype - Fix wrong configuration
rSPSITEfe47a239891e: data/defaults: Keep the quote in the configuration part
rSPSITE31c4ca30a0b9: data/defaults: indexer: Adapt configuration properly for latest version
rDSTO4bcd830d1b82: sql/upgrades: create db upgrade 105->106
rDCIDX8de98f7b5015: swh.indexer: Add tests on mimetype indexer
rDCIDX637110903931: swh.indexer: Update to latest swh.storage api to use indexer conf id
rDCIDX1b1cffa4f00c: swh.indexer.language: Use raw content's subset if content too large
rDSTO5ff3979b9d6c: swh.storage: Update db schema to new version
rDSTOa95a3c424c74: Add new entry for language indexer tool
rDSTO8f7a5c54d476: swh.storage: fossology license endpoints: use idx_configuration_id
rDSTO7738a768f63c: swh.storage: indexer endpoints: Fix filtering missing data issue
rDSTOf18e2dfa9315: swh.storage: ctags endpoints: use indexer_configuration_id
rDSTOa8ce0d9208f1: swh.storage.tests: Refactor reading the indexer tools
rDSTOf3600de87b22: swh.storage: language endpoints: use indexer_configuration_id
rDSTO45a923bb5edb: swh.storage: mimetype endpoints: use indexer_configuration_id
Mentioned Here
T728: normalize encoding values across mimetype and language indexers
P163 average length, variance, standard deviation on ~42m language indexed contents

Event Timeline

ardumont triaged this task as High priority. May 29 2017, 1:15 PM
ardumont renamed this task from Make the language Indexer faster to Improve language indexer performance. May 29 2017, 1:48 PM
ardumont updated the task description.
ardumont updated the task description. Edited May 30 2017, 11:40 AM

I created paste P163 instead of commenting the results directly here.

date of computation: Tue May 29 2017

mean:

softwareheritage=> select avg(length) from content_language cl inner join content c on cl.id=c.sha1;
        avg
--------------------
 26862.385193867011
(1 row)

variance:

softwareheritage=> select variance(length) from content_language cl inner join content c on cl.id=c.sha1;
       variance
-----------------------
 125835685708.88915180
(1 row)

standard deviation:

softwareheritage=> select stddev(length) from content_language cl inner join content c on cl.id=c.sha1;
     stddev
-----------------
 355008.08759433
(1 row)

Those contents were extracted and stored in uffizi:/srv/storage/space/lists/content-language-id-size.txt.gz (format per line: <sha1>\t<length>).

Using https://forge.softwareheritage.org/rDSNIP49ffa63356d7bcee7ea259381d078a3a87359bed, graph https://forge.softwareheritage.org/F2250694 was drawn.

ardumont added a comment. Edited Jun 6 2017, 10:57 AM
  • take only the first 10k of the raw contents (as a possible configuration option).

This has been implemented, tested, and deployed.
It also pulled in a fix for a limitation around concurrent tooling (same name/version with a different configuration).

  • take only a percentage portion of the content (also a possible configuration option).

This has not been tested, and therefore not implemented.

  • use the detected encoding from the mimetype indexer and pass along that optional information.

This has not been implemented.

The encoding names output by the mimetype indexer (based on the 'file' CLI) do not match Python's codec names.
Using them would mean maintaining a translation dict somewhere in the code to convert them appropriately.
That was the main reason they were not used in the initial implementation.

I did not find a simple reference listing all possible encodings 'file' can output, so I did not look further in that direction.

A new task about this has been added (T728).

Note:
I have extracted a snapshot of the encodings actually detected, though. It is an extract from the content_mimetype table (as of 31/05/2017).
It is stored in uffizi:/srv/storage/space/lists/unique-encodings-found.txt (only 8 so far: binary, ebcdic, iso-8859-1, unknown-8bit, us-ascii, utf-16be, utf-16le, utf-8).
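Given that only 8 encodings were observed, the translation dict mentioned above could start out quite small. This is an illustrative sketch, not the indexer's code: 'binary' and 'unknown-8bit' have no faithful Python codec (latin-1 is a lossy stand-in chosen here), and 'cp037' is just one of several EBCDIC code pages, not necessarily the one `file` detected.

```python
import codecs

# Hypothetical mapping from `file` encoding names to Python codec names,
# covering the 8 values observed in content_mimetype.
FILE_TO_PYTHON_CODEC = {
    'binary': None,             # not text: skip language detection
    'ebcdic': 'cp037',          # assumption: US EBCDIC code page
    'iso-8859-1': 'iso8859-1',
    'unknown-8bit': 'latin-1',  # lossy guess
    'us-ascii': 'ascii',
    'utf-16be': 'utf_16_be',
    'utf-16le': 'utf_16_le',
    'utf-8': 'utf-8',
}

# Sanity check: every mapped name resolves to a real Python codec.
for name in FILE_TO_PYTHON_CODEC.values():
    if name is not None:
        codecs.lookup(name)
```

Maintaining such a table is exactly the burden the comment above describes, which is why this was split out into T728.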

ardumont closed this task as Resolved.Jun 6 2017, 10:59 AM