Page MenuHomeSoftware Heritage

Decode textual content from utf-8 before displaying it
ClosedPublic

Authored by anlambert on Jan 14 2020, 5:50 PM.

Details

Summary

This fixes the rendering of textual contents when using Django >= 2.x.

The issue was spotted thanks to the cypress tests (see https://jenkins.softwareheritage.org/job/DWAPPS/job/cypress/431/console).

Depends on D2529

Diff Detail

Repository
rDWAPPS Web applications
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

Tests are now failing since the release of swh-storage 0.0.163 because of that breaking API change (rDSTO024eaea78b8075bcf75a3eda34fa43f4fc5f3672).

I will fix them tomorrow.

olasd requested changes to this revision.Jan 15 2020, 11:44 AM
olasd added a subscriber: olasd.
olasd added inline comments.
swh/web/browse/utils.py
290

I think this should be hardened a bit. We could:

  • use chardet to try to detect the actual encoding of the content
  • or pass the encoding information from the caller (I guess the mime type detection also does some sort of encoding detection)
  • or add a errors='replace' to avoid exploding on invalid utf-8

The tests should also be expanded to have more "adversarial" contents (legacy encodings, mojibake, ...).

Finally, if that's possible, it'd be great to allow users to override the encoding detected in the UI, like we do for highlighting (but that's clearly out of scope for this diff).

This revision now requires changes to proceed.Jan 15 2020, 11:44 AM
swh/web/browse/utils.py
290

I think this should be hardened a bit. We could:

use chardet to try to detect the actual encoding of the content
or pass the encoding information from the caller (I guess the mime type detection also does some sort of encoding detection)
or add a errors='replace' to avoid exploding on invalid utf-8

This is not really explicit in the diff but the input content_data is guaranteed to be UTF-8 encoded if the content is textual.

When fetching content from the archive, the [[ https://forge.softwareheritage.org/source/swh-web/browse/master/swh/web/browse/utils.py$148-239 | request_content ]] function is used and will encode any textual content to UTF-8 before passing it to the prepare_content_for_display function.

Nevertheless, that code is quite a mess and should be improved / simplified but this is out of scope for that diff.

The tests should also be expanded to have more "adversarial" contents (legacy encodings, mojibake, ...).

We already have those kind of tests at an upper level in tests/browse/views/test_content.py. Refactoring the content preprocessing pipeline will allow us to add better tests though.

Finally, if that's possible, it'd be great to allow users to override the encoding detected in the UI, like we do for highlighting (but that's clearly out of scope for this diff).

Agreed, it exists cases where the detected encoding will not be the right one.

swh/web/browse/utils.py
290

or add a errors='replace' to avoid exploding on invalid utf-8

I will add that just in case.

Update:

  • rebase
  • decode base64 from ascii instead of utf-8
  • add errors='replace' parameter when attempting to decode textual content from utf-8
vlorentz added inline comments.
swh/web/browse/utils.py
290

+1 on errors='replace'.

The tests should also be expanded to have more "adversarial" contents (legacy encodings, mojibake, ...).

We already have those kind of tests at an upper level in tests/browse/views/test_content.py

no we don't. There's only one test for non-utf8 text, and it's on a well-formed utf-16le file.

Accepting in so far as my concerns have been handled.

When fetching content from the archive, the request_content function is used and will encode any textual content to UTF-8 before passing it to the prepare_content_for_display function.

I see. Ideally this would convert to a unicode string rather than dump stuff back to bytes again (to end up converting it once more when rendering) but I understand that would make handling of the return value a bit harder.

This revision is now accepted and ready to land.Jan 15 2020, 2:38 PM