Details

swh/web/browse/utils.py
290	I think this should be hardened a bit. We could:
- use chardet to try to detect the actual encoding of the content, or
- pass the encoding information from the caller (I guess the mime type detection also does some sort of encoding detection), or
- add an `errors='replace'` to avoid exploding on invalid utf-8.

The tests should also be expanded to have more "adversarial" contents (legacy encodings, mojibake, ...).

Finally, if that's possible, it'd be great to allow users to override the encoding detected in the UI, like we do for highlighting (but that's clearly out of scope for this diff).
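To make the three options concrete, here is a minimal sketch of how they could be combined. It is not the swh-web code: `decode_for_display` and its `encoding` parameter are hypothetical names, and chardet is treated as an optional third-party dependency that may not be installed.

```python
from typing import Optional

try:
    import chardet  # optional third-party detector; assumed, not required
except ImportError:
    chardet = None


def decode_for_display(content_data: bytes, encoding: Optional[str] = None) -> str:
    """Hypothetical helper: decode bytes for display without ever raising."""
    candidates = []
    if encoding:
        # Option 2: the caller already knows the encoding and passes it in.
        candidates.append(encoding)
    elif chardet is not None:
        # Option 1: ask chardet for a guess (it may return None).
        guess = chardet.detect(content_data).get("encoding")
        if guess:
            candidates.append(guess)
    candidates.append("utf-8")

    for candidate in candidates:
        try:
            return content_data.decode(candidate)
        except (UnicodeDecodeError, LookupError):
            # Wrong guess or unknown codec name: try the next candidate.
            continue

    # Option 3: last resort, replace undecodable bytes with U+FFFD.
    return content_data.decode("utf-8", errors="replace")
```

With this shape, a wrong or missing hint degrades to replacement characters instead of an exception.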

This revision now requires changes to proceed. Jan 15 2020, 11:44 AM

anlambert added inline comments. Jan 15 2020, 12:03 PM

swh/web/browse/utils.py
290	> I think this should be hardened a bit. We could: use chardet to try to detect the actual encoding of the content or pass the encoding information from the caller (I guess the mime type detection also does some sort of encoding detection) or add an errors='replace' to avoid exploding on invalid utf-8

This is not really explicit in the diff, but the input content_data is guaranteed to be UTF-8 encoded if the content is textual. When fetching content from the archive, the [[ https://forge.softwareheritage.org/source/swh-web/browse/master/swh/web/browse/utils.py$148-239 | `request_content` ]] function is used, and it encodes any textual content to UTF-8 before passing it to the `prepare_content_for_display` function. Nevertheless, that code is quite a mess and should be improved and simplified, but this is out of scope for this diff.

> The tests should also be expanded to have more "adversarial" contents (legacy encodings, mojibake, ...).

We already have that kind of test at an upper level in `tests/browse/views/test_content.py`. Refactoring the content preprocessing pipeline will allow us to add better tests, though.

> Finally, if that's possible, it'd be great to allow users to override the encoding detected in the UI, like we do for highlighting (but that's clearly out of scope for this diff).

Agreed, there are cases where the detected encoding will not be the right one.

anlambert added inline comments. Jan 15 2020, 12:31 PM

swh/web/browse/utils.py
290	> or add a errors='replace' to avoid exploding on invalid utf-8

I will add that just in case.

Update:

rebase
decode base64 from ascii instead of utf-8
add errors='replace' parameter when attempting to decode textual content from utf-8
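As an illustration of the last two items, the following sketch shows the general shape such a change could take. It is an assumption, not the committed code: `prepare_for_display` and its `is_text` flag are hypothetical stand-ins, since the thread does not show the real signature of `prepare_content_for_display`.

```python
import base64


def prepare_for_display(content_data: bytes, is_text: bool) -> str:
    """Hypothetical stand-in for the display preparation step."""
    if is_text:
        # Textual content is expected to be UTF-8 already; errors='replace'
        # turns any stray invalid byte into U+FFFD instead of raising.
        return content_data.decode("utf-8", errors="replace")
    # Base64 output only uses ASCII characters, so decoding it as ASCII
    # states the intent more precisely than decoding it as UTF-8.
    return base64.b64encode(content_data).decode("ascii")
```

Both decodings yield the same string for Base64 output; choosing ASCII simply documents that no non-ASCII byte can occur there.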

Build has FAILED

Link to build: https://jenkins.softwareheritage.org/job/DWAPPS/job/tox/885/
See console output for more information: https://jenkins.softwareheritage.org/job/DWAPPS/job/tox/885/console

Harbormaster failed remote builds in B10107: Diff 9006! Jan 15 2020, 1:27 PM

Build is green
See https://jenkins.softwareheritage.org/job/DWAPPS/job/cypress-diff/493/ for more details.

Update copyright years

Build has FAILED

Link to build: https://jenkins.softwareheritage.org/job/DWAPPS/job/tox/888/
See console output for more information: https://jenkins.softwareheritage.org/job/DWAPPS/job/tox/888/console

Harbormaster failed remote builds in B10115: Diff 9013! Jan 15 2020, 2:13 PM

vlorentz added a subscriber: vlorentz. Jan 15 2020, 2:14 PM

vlorentz added inline comments.

swh/web/browse/utils.py
290	+1 on `errors='replace'`.

> > The tests should also be expanded to have more "adversarial" contents (legacy encodings, mojibake, ...).
>
> We already have those kind of tests at an upper level in tests/browse/views/test_content.py

No, we don't. There's only one test for non-utf8 text, and it's on a well-formed utf-16le file.
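For reference, "adversarial" inputs of the kind mentioned here could look like the sketch below. The sample set, `decode_lossy`, and `check_never_raises` are illustrative names only and are not taken from the swh-web test suite. The sketch checks just one property: lossy UTF-8 decoding never raises on such inputs.

```python
# Byte strings that are not clean UTF-8 text (or are valid but garbled).
ADVERSARIAL_SAMPLES = {
    "latin-1 legacy text": "café".encode("latin-1"),
    "windows-1252 quotes": "\u201cquoted\u201d".encode("cp1252"),
    "truncated utf-8": "é".encode("utf-8")[:1],
    # Mojibake: UTF-8 bytes misread as latin-1, then re-encoded as UTF-8.
    "mojibake": "é".encode("utf-8").decode("latin-1").encode("utf-8"),
    "utf-16le without BOM": "héllo".encode("utf-16-le"),
}


def decode_lossy(data: bytes) -> str:
    """Decode as UTF-8, replacing invalid sequences with U+FFFD."""
    return data.decode("utf-8", errors="replace")


def check_never_raises(samples: dict) -> dict:
    """Decode every sample and return the results keyed by sample name."""
    return {name: decode_lossy(data) for name, data in samples.items()}
```

Note that the mojibake sample is valid UTF-8, so it decodes without any replacement character. A test that only looks for U+FFFD would not flag it.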

In D2530#60226, @swh-public-ci wrote:

Build has FAILED

Link to build: https://jenkins.softwareheritage.org/job/DWAPPS/job/tox/888/
See console output for more information: https://jenkins.softwareheritage.org/job/DWAPPS/job/tox/888/console

Tests will be fixed once D2533 has landed.

Build is green
See https://jenkins.softwareheritage.org/job/DWAPPS/job/cypress-diff/496/ for more details.

Accepting in so far as my concerns have been handled.

> When fetching content from the archive, the request_content function is used and will encode any textual content to UTF-8 before passing it to the prepare_content_for_display function.

I see. Ideally this would convert to a unicode string rather than dump stuff back to bytes again (to end up converting it once more when rendering) but I understand that would make handling of the return value a bit harder.
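The difference between the two approaches can be sketched as follows. Both function names are hypothetical, and the sketch assumes the source encoding is already known; it does not reproduce `request_content`.

```python
def normalize_to_utf8_bytes(raw: bytes, source_encoding: str) -> bytes:
    """Re-encode to UTF-8 bytes; the caller must decode again to render."""
    return raw.decode(source_encoding).encode("utf-8")


def normalize_to_text(raw: bytes, source_encoding: str) -> str:
    """Decode once and return a str, ready for rendering."""
    return raw.decode(source_encoding)
```

The first variant keeps a single return type (bytes) for textual and binary content alike, which is the convenience alluded to above. The second avoids the extra encode and decode round trip.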

This revision is now accepted and ready to land. Jan 15 2020, 2:38 PM

Rebase

Build is green
See https://jenkins.softwareheritage.org/job/DWAPPS/job/tox/892/ for more details.

Build is green
See https://jenkins.softwareheritage.org/job/DWAPPS/job/cypress-diff/500/ for more details.

Harbormaster completed remote builds in B10120: Diff 9018. Jan 15 2020, 3:13 PM

Closed by commit rDWAPPSb2115d5aaf4d: browse/utils: Decode textual content from utf-8 before displaying it (authored by anlambert). Jan 15 2020, 3:14 PM

This revision was automatically updated to reflect the committed changes.

anlambert added a commit: rDWAPPSb2115d5aaf4d: browse/utils: Decode textual content from utf-8 before displaying it.