
Improve language indexer performance
Closed, Resolved · Public

Description

The language indexer is slow due to the tool used underneath (pygments) and possibly the contents' size.

To give some details, pygments is used for language detection because it is the tool that detects the most languages.
The problem is that its API works only on text, not on bytes (and we deal with bytes), so we first need to detect each content's encoding and then decode it appropriately.
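The decode step can be sketched as follows. This is a hypothetical helper, not the indexer's actual code: instead of a full incremental detector it simply tries UTF-8 first and falls back to latin-1, which accepts every byte value and therefore never fails.

```python
def decode_for_pygments(raw: bytes) -> str:
    """Turn raw bytes into text so pygments can classify them (sketch)."""
    try:
        return raw.decode('utf-8')
    except UnicodeDecodeError:
        # latin-1 maps all 256 byte values, so this cannot raise;
        # the result may be wrong for exotic encodings, but it is text.
        return raw.decode('latin-1')
```

The decoded string could then be fed to `pygments.lexers.guess_lexer`, which is the expensive part this task aims to speed up.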

It has already been improved recently to detect the encoding incrementally (no task references this), but that is not enough.

Hints:

  • take only the first 10k of the raw contents (as a possible configuration option).
  • take only a percentage portion of the content (also a possible configuration option).
  • use the detected encoding from the mimetype indexer and pass along that optional information.
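The first two hints amount to capping how much raw content reaches pygments. A minimal sketch, assuming hypothetical configuration names (`max_size`, `ratio`) that are not the indexer's real configuration keys:

```python
def truncate_content(raw: bytes, max_size: int = 10240, ratio=None) -> bytes:
    """Keep only a prefix of the raw content before language detection.

    max_size: absolute cap in bytes (the "first 10k" hint).
    ratio: optional fraction of the content to keep (the "percentage" hint).
    """
    if ratio is not None:
        # keep at least one byte so tiny contents are not emptied
        raw = raw[:max(1, int(len(raw) * ratio))]
    return raw[:max_size]
```

Truncating a file mid-way can split a multi-byte character, which is why the decoding policy on bad chunking sequences (rDCIDX6ea5daa123a6) matters here.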

Related Objects

Mentioned In
rDCIDXd7d57350ee65: Added tests for language indexer (T722)
rDCIDX6ea5daa123a6: language: Improve decoding policy on bad chunking sequence
rSPSITE74f31d4f1b87: data/defaults: indexer: Balance concurrency between indexers
rDCIDX86b06785bfa1: swh.indexer.language: Reduce verbosity
rDCIDXc636a53b8f56: swh.indexer.mimetype: Fix wrong default configuration
rSPSITE8014c4e931e9: data/defaults: mimetype - Fix wrong configuration
rSPSITEfe47a239891e: data/defaults: Keep the quote in the configuration part
rSPSITE31c4ca30a0b9: data/defaults: indexer: Adapt configuration properly for latest version
rDSTO4bcd830d1b82: sql/upgrades: create db upgrade 105->106
rDCIDX8de98f7b5015: swh.indexer: Add tests on mimetype indexer
rDCIDX637110903931: swh.indexer: Update to latest swh.storage api to use indexer conf id
rDCIDX1b1cffa4f00c: swh.indexer.language: Use raw content's subset if content too large
rDSTO5ff3979b9d6c: swh.storage: Update db schema to new version
rDSTOa95a3c424c74: Add new entry for language indexer tool
rDSTO8f7a5c54d476: swh.storage: fossology license endpoints: use idx_configuration_id
rDSTO7738a768f63c: swh.storage: indexer endpoints: Fix filtering missing data issue
rDSTOf18e2dfa9315: swh.storage: ctags endpoints: use indexer_configuration_id
rDSTOa8ce0d9208f1: swh.storage.tests: Refactor reading the indexer tools
rDSTOf3600de87b22: swh.storage: language endpoints: use indexer_configuration_id
rDSTO45a923bb5edb: swh.storage: mimetype endpoints: use indexer_configuration_id
Mentioned Here
T728: normalize encoding values across mimetype and language indexers
P163 average length, variance, standard deviation on ~42m language indexed contents

Event Timeline

ardumont triaged this task as High priority. May 29 2017, 1:15 PM
ardumont renamed this task from Make the language Indexer faster to Improve language indexer performance. May 29 2017, 1:48 PM
ardumont updated the task description.
ardumont updated the task description. Edited May 30 2017, 11:40 AM

I created paste P163 instead of commenting the results directly here.

date of computation: Tue May 29 2017

mean:

softwareheritage=> select avg(length) from content_language cl inner join content c on cl.id=c.sha1;
        avg
--------------------
 26862.385193867011
(1 row)

variance:

softwareheritage=> select variance(length) from content_language cl inner join content c on cl.id=c.sha1;
       variance
-----------------------
 125835685708.88915180
(1 row)

standard deviation:

softwareheritage=> select stddev(length) from content_language cl inner join content c on cl.id=c.sha1;
     stddev
-----------------
 355008.08759433
(1 row)

Those contents were extracted and stored in uffizi:/srv/storage/space/lists/content-language-id-size.txt.gz (format per line: <sha1>\t<length>).

Using https://forge.softwareheritage.org/rDSNIP49ffa63356d7bcee7ea259381d078a3a87359bed, graph https://forge.softwareheritage.org/F2250694 was drawn.

ardumont added a comment. Edited Jun 6 2017, 10:57 AM
  • take only the first 10k of the raw contents (as a possible configuration option).

This has been implemented, tested, and deployed.
It also pulled in a fix for a limitation around concurrent tooling (same name/version with a different configuration).

  • take only a percentage portion of the content (also a possible configuration option).

This has not been tested, and therefore not implemented.

  • use the detected encoding from the mimetype indexer and pass along that optional information.

This has not been implemented.

The encoding names output by the mimetype indexer (based on the 'file' CLI) do not match Python's codec names.
Using them would mean maintaining a translation dict somewhere in the code to convert them appropriately.
That was the main reason they were not used in the initial implementation.

I did not find a simple reference listing all possible encodings 'file' can output, so I did not look further in that direction.

A new task about this has been added (T728).

Note:
I have extracted a snapshot of the encodings actually detected, though. It is an extract from the content_mimetype table (as of 31/05/2017).
It is stored in uffizi:/srv/storage/space/lists/unique-encodings-found.txt (only 8 so far: binary, ebcdic, iso-8859-1, unknown-8bit, us-ascii, utf-16be, utf-16le, utf-8).
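Given that only 8 encodings were observed, the translation dict mentioned above could start out quite small. This is an illustrative sketch, not the indexer's code: 'binary' and 'unknown-8bit' have no faithful Python codec (latin-1 is a lossy stand-in chosen here), and 'cp037' is just one of several EBCDIC code pages, not necessarily the one `file` detected.

```python
import codecs

# Hypothetical mapping from `file` encoding names to Python codec names,
# covering the 8 values observed in content_mimetype.
FILE_TO_PYTHON_CODEC = {
    'binary': None,             # not text: skip language detection
    'ebcdic': 'cp037',          # assumption: US EBCDIC code page
    'iso-8859-1': 'iso8859-1',
    'unknown-8bit': 'latin-1',  # lossy guess
    'us-ascii': 'ascii',
    'utf-16be': 'utf_16_be',
    'utf-16le': 'utf_16_le',
    'utf-8': 'utf-8',
}

# Sanity check: every mapped name resolves to a real Python codec.
for name in FILE_TO_PYTHON_CODEC.values():
    if name is not None:
        codecs.lookup(name)
```

Maintaining such a table is exactly the burden the comment above describes, which is why this was split out into T728.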

ardumont closed this task as Resolved.Jun 6 2017, 10:59 AM