Differential D2530 Diff 9018 swh/web/browse/utils.py

Changeset View

Standalone View

swh/web/browse/utils.py

# Copyright (C) 2017-2019 The Software Heritage developers		# Copyright (C) 2017-2020 The Software Heritage developers
# See the AUTHORS file at the top-level directory of this distribution		# See the AUTHORS file at the top-level directory of this distribution
# License: GNU Affero General Public License version 3, or any later version		# License: GNU Affero General Public License version 3, or any later version
# See top-level LICENSE file for more information		# See top-level LICENSE file for more information

import base64		import base64
import magic		import magic
import pypandoc		import pypandoc
import stat		import stat
▲ Show 20 Lines • Show All 265 Lines • ▼ Show 20 Lines	def prepare_content_for_display(content_data, mime_type, path):

if not language:		if not language:
language = 'nohighlight'		language = 'nohighlight'
elif mime_type.startswith('application/'):		elif mime_type.startswith('application/'):
mime_type = mime_type.replace('application/', 'text/')		mime_type = mime_type.replace('application/', 'text/')

if mime_type.startswith('image/'):		if mime_type.startswith('image/'):
if mime_type in _browsers_supported_image_mimes:		if mime_type in _browsers_supported_image_mimes:
content_data = base64.b64encode(content_data)		content_data = base64.b64encode(content_data).decode('ascii')
content_data = content_data.decode('utf-8')
else:		else:
content_data = None		content_data = None

if mime_type.startswith('image/svg'):		if mime_type.startswith('image/svg'):
mime_type = 'image/svg+xml'		mime_type = 'image/svg+xml'

		if mime_type.startswith('text/'):
		olasdUnsubmitted Not Done Inline Actions I think this should be hardened a bit. We could: use chardet to try to detect the actual encoding of the content or pass the encoding information from the caller (I guess the mime type detection also does some sort of encoding detection) or add a `errors='replace'` to avoid exploding on invalid utf-8 The tests should also be expanded to have more "adversarial" contents (legacy encodings, mojibake, ...). Finally, if that's possible, it'd be great to allow users to override the encoding detected in the UI, like we do for highlighting (but that's clearly out of scope for this diff). olasd: I think this should be hardened a bit. We could: - use chardet to try to detect the actual…
		anlambertAuthorUnsubmitted Done Inline Actions I think this should be hardened a bit. We could: use chardet to try to detect the actual encoding of the content or pass the encoding information from the caller (I guess the mime type detection also does some sort of encoding detection) or add a errors='replace' to avoid exploding on invalid utf-8 This is not really explicit in the diff but the input content_data is guaranteed to be UTF-8 encoded if the content is textual. When fetching content from the archive, the [[ https://forge.softwareheritage.org/source/swh-web/browse/master/swh/web/browse/utils.py$148-239 \| `request_content` ]] function is used and will encode any textual content to UTF-8 before passing it to the `prepare_content_for_display` function. Nevertheless, that code is quite a mess and should be improved / simplified but this is out of scope for that diff. The tests should also be expanded to have more "adversarial" contents (legacy encodings, mojibake, ...). We already have those kind of tests at an upper level in `tests/browse/views/test_content.py`. Refactoring the content preprocessing pipeline will allow us to add better tests though. Finally, if that's possible, it'd be great to allow users to override the encoding detected in the UI, like we do for highlighting (but that's clearly out of scope for this diff). Agreed, it exists cases where the detected encoding will not be the right one. anlambert: > I think this should be hardened a bit. We could: > > use chardet to try to detect the…
		anlambertAuthorUnsubmitted Done Inline Actions or add a errors='replace' to avoid exploding on invalid utf-8 I will add that just in case. anlambert: > or add a errors='replace' to avoid exploding on invalid utf-8 I will add that just in case.
		vlorentzUnsubmitted Not Done Inline Actions +1 on `errors='replace'`. The tests should also be expanded to have more "adversarial" contents (legacy encodings, mojibake, ...). We already have those kind of tests at an upper level in tests/browse/views/test_content.py no we don't. There's only one test for non-utf8 text, and it's on a well-formed utf-16le file. vlorentz: +1 on `errors='replace'`. >> The tests should also be expanded to have more "adversarial"…
		content_data = content_data.decode('utf-8', errors='replace')

return {'content_data': content_data,		return {'content_data': content_data,
'language': language,		'language': language,
'mimetype': mime_type}		'mimetype': mime_type}


def process_snapshot_branches(snapshot):		def process_snapshot_branches(snapshot):
"""		"""
Process a dictionary describing snapshot branches: extract those		Process a dictionary describing snapshot branches: extract those
▲ Show 20 Lines • Show All 806 Lines • Show Last 20 Lines