Changeset View
Standalone View
swh/web/browse/utils.py
# Copyright (C) 2017-2019 The Software Heritage developers | # Copyright (C) 2017-2020 The Software Heritage developers | ||||
# See the AUTHORS file at the top-level directory of this distribution | # See the AUTHORS file at the top-level directory of this distribution | ||||
# License: GNU Affero General Public License version 3, or any later version | # License: GNU Affero General Public License version 3, or any later version | ||||
# See top-level LICENSE file for more information | # See top-level LICENSE file for more information | ||||
import base64 | import base64 | ||||
import magic | import magic | ||||
import pypandoc | import pypandoc | ||||
import stat | import stat | ||||
▲ Show 20 Lines • Show All 265 Lines • ▼ Show 20 Lines | def prepare_content_for_display(content_data, mime_type, path): | ||||
if not language: | if not language: | ||||
language = 'nohighlight' | language = 'nohighlight' | ||||
elif mime_type.startswith('application/'): | elif mime_type.startswith('application/'): | ||||
mime_type = mime_type.replace('application/', 'text/') | mime_type = mime_type.replace('application/', 'text/') | ||||
if mime_type.startswith('image/'): | if mime_type.startswith('image/'): | ||||
if mime_type in _browsers_supported_image_mimes: | if mime_type in _browsers_supported_image_mimes: | ||||
content_data = base64.b64encode(content_data) | content_data = base64.b64encode(content_data).decode('ascii') | ||||
content_data = content_data.decode('utf-8') | |||||
else: | else: | ||||
content_data = None | content_data = None | ||||
if mime_type.startswith('image/svg'): | if mime_type.startswith('image/svg'): | ||||
mime_type = 'image/svg+xml' | mime_type = 'image/svg+xml' | ||||
if mime_type.startswith('text/'): | |||||
olasd: I think this should be hardened a bit. We could:
- use chardet to try to detect the actual… | |||||
Done Inline Actions
This is not really explicit in the diff but the input content_data is guaranteed to be UTF-8 encoded if the content is textual. When fetching content from the archive, the [[ https://forge.softwareheritage.org/source/swh-web/browse/master/swh/web/browse/utils.py$148-239 | request_content ]] function is used and will encode any textual content to UTF-8 before passing it to the prepare_content_for_display function. Nevertheless, that code is quite a mess and should be improved / simplified but this is out of scope for that diff.
We already have those kind of tests at an upper level in tests/browse/views/test_content.py. Refactoring the content preprocessing pipeline will allow us to add better tests though.
Agreed, it exists cases where the detected encoding will not be the right one. anlambert: > I think this should be hardened a bit. We could:
>
> use chardet to try to detect the… | |||||
Done Inline Actions
I will add that just in case. anlambert: > or add a errors='replace' to avoid exploding on invalid utf-8
I will add that just in case. | |||||
Not Done Inline Actions+1 on errors='replace'.
no we don't. There's only one test for non-utf8 text, and it's on a well-formed utf-16le file. vlorentz: +1 on `errors='replace'`.
>> The tests should also be expanded to have more "adversarial"… | |||||
content_data = content_data.decode('utf-8', errors='replace') | |||||
return {'content_data': content_data, | return {'content_data': content_data, | ||||
'language': language, | 'language': language, | ||||
'mimetype': mime_type} | 'mimetype': mime_type} | ||||
def process_snapshot_branches(snapshot): | def process_snapshot_branches(snapshot): | ||||
""" | """ | ||||
Process a dictionary describing snapshot branches: extract those | Process a dictionary describing snapshot branches: extract those | ||||
▲ Show 20 Lines • Show All 806 Lines • Show Last 20 Lines |
I think this should be hardened a bit. We could:
The tests should also be expanded to have more "adversarial" contents (legacy encodings, mojibake, ...).
Finally, if that's possible, it'd be great to allow users to override the encoding detected in the UI, like we do for highlighting (but that's clearly out of scope for this diff).