browse/utils: Robustify content encoding detection
Closed, Public

Authored by anlambert on Feb 17 2022, 5:07 PM.

Details

Summary

When attempting to re-encode non-UTF-8 textual content, first use chardet
to detect the encoding, and use the detected encoding only if the
detection confidence is very high.

Previously, some encodings such as SHIFT_JIS (used for Japanese) were
not correctly detected, so the content was badly rendered in the
browse Web UI (see example).
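
The approach described in this summary could look roughly like the sketch below. It is not the code from this diff: the function name `reencode_to_utf8`, the `detect` parameter and the 0.9 threshold are assumptions made for illustration (the summary only says the confidence must be very high). The only chardet call used is `chardet.detect`, which returns a dict with `encoding` and `confidence` keys.

```python
from typing import Callable, Optional

try:
    import chardet  # third-party package: pip install chardet
except ImportError:  # keep the sketch importable when chardet is missing
    chardet = None

# Assumed value; the summary does not state the actual threshold.
CONFIDENCE_THRESHOLD = 0.9


def reencode_to_utf8(
    raw: bytes,
    detect: Optional[Callable[[bytes], dict]] = None,
    threshold: float = CONFIDENCE_THRESHOLD,
) -> bytes:
    """Return `raw` re-encoded as UTF-8 when its encoding can be trusted.

    `detect` is a hypothetical hook for swapping the detector (handy in
    tests); by default chardet.detect is used when chardet is installed.
    """
    try:
        raw.decode("utf-8")
        return raw  # already valid UTF-8, nothing to do
    except UnicodeDecodeError:
        pass

    if detect is None:
        if chardet is None:
            return raw  # no detector available, leave bytes untouched
        detect = chardet.detect

    result = detect(raw)
    encoding = result.get("encoding")
    confidence = result.get("confidence") or 0.0

    # Trust the detected encoding only when the confidence is high enough.
    if encoding and confidence >= threshold:
        try:
            return raw.decode(encoding).encode("utf-8")
        except (UnicodeDecodeError, LookupError):
            pass  # wrong guess or unknown codec name

    return raw  # fall back to the original bytes
```

When the detector is not confident, the bytes are returned unchanged, so a caller can still apply whatever fallback it used before.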

Diff Detail

Repository
rDWAPPS Web applications
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

Build is green

Patch application report for D7197 (id=26100)

Rebasing onto d858c9b457...

Current branch diff-target is up to date.
Changes applied before test
commit d9944bdd56c4df9c6a4c614cea07b16a0e33728c
Author: Antoine Lambert <anlambert@softwareheritage.org>
Date:   Thu Feb 17 16:57:29 2022 +0100

    browse/utils: Robustify content encoding detection
    
    When attempting to re-encode non UTF-8 textual content, use chardet
    to find the encoding first and use it if the detection confidence
    is really high.
    
    Previously some encoding like SHIFT_JIS (for japanese language) were
    not correctly detected and thus content were badly rendered in the
    browse Web UI.

See https://jenkins.softwareheritage.org/job/DWAPPS/job/tests-on-diff/1389/ for more details.

This revision is now accepted and ready to land. Feb 17 2022, 6:44 PM