Page MenuHomeSoftware Heritage

Decode textual content from utf-8 before displaying it
ClosedPublic

Authored by anlambert on Jan 14 2020, 5:50 PM.

Details

Summary

This fixes the rendering of textual contents when using Django >= 2.x.

The issue was spotted thanks to the cypress tests (see https://jenkins.softwareheritage.org/job/DWAPPS/job/cypress/431/console).

Depends on D2529

Diff Detail

Repository
rDWAPPS Web applications
Branch
django2-fix-content-rendering
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 10107
Build 14975: Cypress tests for swh-web diffsJenkins
Build 14974: tox-on-jenkinsJenkins
Build 14973: arc lint + arc unit

Event Timeline

anlambert created this revision.Jan 14 2020, 5:50 PM

Tests are now failing since the release of swh-storage 0.0.163 because of that breaking API change (rDSTO024eaea78b8075bcf75a3eda34fa43f4fc5f3672).

I will fix them tomorrow.

olasd requested changes to this revision.Jan 15 2020, 11:44 AM
olasd added a subscriber: olasd.
olasd added inline comments.
swh/web/browse/utils.py
291

I think this should be hardened a bit. We could:

  • use chardet to try to detect the actual encoding of the content
  • or pass the encoding information from the caller (I guess the mime type detection also does some sort of encoding detection)
  • or add a errors='replace' to avoid exploding on invalid utf-8

The tests should also be expanded to have more "adversarial" contents (legacy encodings, mojibake, ...).

Finally, if that's possible, it'd be great to allow users to override the encoding detected in the UI, like we do for highlighting (but that's clearly out of scope for this diff).

This revision now requires changes to proceed.Jan 15 2020, 11:44 AM
anlambert added inline comments.Jan 15 2020, 12:03 PM
swh/web/browse/utils.py
291

I think this should be hardened a bit. We could:

use chardet to try to detect the actual encoding of the content
or pass the encoding information from the caller (I guess the mime type detection also does some sort of encoding detection)
or add a errors='replace' to avoid exploding on invalid utf-8

This is not really explicit in the diff but the input content_data is guaranteed to be UTF-8 encoded if the content is textual.

When fetching content from the archive, the [[ https://forge.softwareheritage.org/source/swh-web/browse/master/swh/web/browse/utils.py$148-239 | request_content ]] function is used and will encode any textual content to UTF-8 before passing it to the prepare_content_for_display function.

Nevertheless, that code is quite a mess and should be improved / simplified but this is out of scope for that diff.

The tests should also be expanded to have more "adversarial" contents (legacy encodings, mojibake, ...).

We already have those kind of tests at an upper level in tests/browse/views/test_content.py. Refactoring the content preprocessing pipeline will allow us to add better tests though.

Finally, if that's possible, it'd be great to allow users to override the encoding detected in the UI, like we do for highlighting (but that's clearly out of scope for this diff).

Agreed, it exists cases where the detected encoding will not be the right one.

anlambert added inline comments.Jan 15 2020, 12:31 PM
swh/web/browse/utils.py
291

or add a errors='replace' to avoid exploding on invalid utf-8

I will add that just in case.

anlambert updated this revision to Diff 9006.Jan 15 2020, 1:21 PM

Update:

  • rebase
  • decode base64 from ascii instead of utf-8
  • add errors='replace' parameter when attempting to decode textual content from utf-8
anlambert updated this revision to Diff 9013.Jan 15 2020, 2:03 PM

Update copyright years

vlorentz added inline comments.
swh/web/browse/utils.py
291

+1 on errors='replace'.

The tests should also be expanded to have more "adversarial" contents (legacy encodings, mojibake, ...).

We already have those kind of tests at an upper level in tests/browse/views/test_content.py

no we don't. There's only one test for non-utf8 text, and it's on a well-formed utf-16le file.

Tests will be fixed when D2533 will be landed

olasd accepted this revision.Jan 15 2020, 2:38 PM

Accepting in so far as my concerns have been handled.

When fetching content from the archive, the request_content function is used and will encode any textual content to UTF-8 before passing it to the prepare_content_for_display function.

I see. Ideally this would convert to a unicode string rather than dump stuff back to bytes again (to end up converting it once more when rendering) but I understand that would make handling of the return value a bit harder.

This revision is now accepted and ready to land.Jan 15 2020, 2:38 PM