
browse/utils: Robustify content encoding detection
Closed · Public

Authored by anlambert on Feb 17 2022, 5:07 PM.

Details

Summary

When attempting to re-encode non-UTF-8 textual content, first use chardet
to detect the encoding, and use the detected encoding only if the detection
confidence is very high.

Previously, some encodings such as SHIFT_JIS (used for Japanese) were
not correctly detected, so content was badly rendered in the
browse Web UI (see example).
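
The approach described in the summary can be sketched roughly as follows. This is a minimal illustration and not the code from this diff: the function name `reencode_to_utf8`, the 0.9 confidence threshold, and the choice to return the content unchanged when detection is not confident enough are all assumptions. Only `chardet.detect`, which returns a dict with `encoding` and `confidence` keys, is the real library call.

```python
from typing import Callable, Optional, Tuple


def _chardet_detect(raw: bytes) -> dict:
    # Imported lazily so the sketch can be exercised with a stub detector.
    import chardet

    return chardet.detect(raw)


def reencode_to_utf8(
    raw: bytes,
    min_confidence: float = 0.9,  # assumed threshold, not taken from the diff
    detect: Callable[[bytes], dict] = _chardet_detect,
) -> Tuple[Optional[str], bytes]:
    """Return (encoding, utf8_bytes); encoding is None if nothing was done.

    Hypothetical helper illustrating "detect first, trust only a
    high-confidence result".
    """
    # Content that is already valid UTF-8 needs no re-encoding.
    try:
        raw.decode("utf-8")
        return "utf-8", raw
    except UnicodeDecodeError:
        pass

    result = detect(raw)
    encoding = result.get("encoding")
    confidence = result.get("confidence") or 0.0

    # Use the detected encoding only when the detector is confident.
    if encoding and confidence >= min_confidence:
        try:
            return encoding, raw.decode(encoding).encode("utf-8")
        except (UnicodeDecodeError, LookupError):
            # Wrong guess or codec unknown to Python: fall through.
            pass

    # Assumed fallback: leave the content untouched.
    return None, raw
```

A caller could pass the raw content bytes and check whether the returned encoding is `None` before deciding how to render the content. The `detect` parameter exists only so the sketch can be tested without depending on chardet's actual guess for a given input.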

Diff Detail

Repository
rDWAPPS Web applications
Branch
robustify-content-encoding-detection
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 26971
Build 42170: Phabricator diff pipeline on jenkins
Build 42169: arc lint + arc unit

Event Timeline

Build is green

Patch application report for D7197 (id=26100)

Rebasing onto d858c9b457...

Current branch diff-target is up to date.
Changes applied before test
commit d9944bdd56c4df9c6a4c614cea07b16a0e33728c
Author: Antoine Lambert <anlambert@softwareheritage.org>
Date:   Thu Feb 17 16:57:29 2022 +0100

    browse/utils: Robustify content encoding detection
    
    When attempting to re-encode non UTF-8 textual content, use chardet
    to find the encoding first and use it if the detection confidence
    is really high.
    
    Previously some encoding like SHIFT_JIS (for japanese language) were
    not correctly detected and thus content were badly rendered in the
    browse Web UI.

See https://jenkins.softwareheritage.org/job/DWAPPS/job/tests-on-diff/1389/ for more details.

This revision is now accepted and ready to land. Feb 17 2022, 6:44 PM