
npm.client: Handle parsing of non-UTF-8 encoded package.json files
Closed, Public

Authored by anlambert on Apr 11 2019, 6:13 PM.

Details

Summary

Some package.json files are encoded in something other than ASCII/UTF-8,
so detect the file encoding using chardet before parsing the file.
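The idea can be sketched roughly as follows. This is not the code from the revision: the function name `parse_package_json` is hypothetical, and trying `utf-8-sig` first with a `latin-1` last resort is an assumption made so the sketch also runs where chardet is not installed. Only `chardet.detect`, which returns a dict with an `encoding` key, comes from the chardet library itself.

```python
import json

try:
    import chardet  # third-party detector named in the summary
except ImportError:
    chardet = None  # lets the sketch run where chardet is not installed


def parse_package_json(raw: bytes) -> dict:
    """Decode raw package.json bytes with a best-effort encoding, then parse.

    Hypothetical helper for illustration, not the loader's actual code.
    """
    try:
        # utf-8-sig decodes plain ASCII/UTF-8 and also strips a leading BOM.
        text = raw.decode("utf-8-sig")
    except UnicodeDecodeError:
        # Not valid UTF-8: ask chardet for a guess (may be None).
        guess = chardet.detect(raw)["encoding"] if chardet else None
        try:
            text = raw.decode(guess or "latin-1", errors="replace")
        except LookupError:
            # Python does not know the guessed codec; latin-1 accepts any byte.
            text = raw.decode("latin-1")
    return json.loads(text)
```

Reading the file in binary mode and decoding explicitly keeps `json` from relying on the default text-mode decoding that produced the errors listed in this summary.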

Previously, the following errors were raised:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/swh/loader/core/loader.py", line 893, in load
    more_data_to_fetch = self.fetch_data()
  File "/usr/lib/python3/dist-packages/swh/loader/npm/loader.py", line 203, in fetch_data
    data = next(self.new_versions)
  File "/usr/lib/python3/dist-packages/swh/loader/npm/client.py", line 145, in prepare_package_versions
    version_data)
  File "/usr/lib/python3/dist-packages/swh/loader/npm/client.py", line 197, in _prepare_package_version
    package_json = json.load(package_json_file)
  File "/usr/lib/python3.5/json/__init__.py", line 265, in load
    return loads(fp.read(),
  File "/usr/lib/python3.5/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xcd in position 183: invalid continuation byte

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/swh/loader/core/loader.py", line 893, in load
    more_data_to_fetch = self.fetch_data()
  File "/usr/lib/python3/dist-packages/swh/loader/npm/loader.py", line 203, in fetch_data
    data = next(self.new_versions)
  File "/usr/lib/python3/dist-packages/swh/loader/npm/client.py", line 145, in prepare_package_versions
    version_data)
  File "/usr/lib/python3/dist-packages/swh/loader/npm/client.py", line 197, in _prepare_package_version
    package_json = json.load(package_json_file)
  File "/usr/lib/python3.5/json/__init__.py", line 268, in load
    parse_constant=parse_constant, object_pairs_hook=object_pairs_hook, **kw)
  File "/usr/lib/python3.5/json/__init__.py", line 315, in loads
    s, 0)
json.decoder.JSONDecodeError: Unexpected UTF-8 BOM (decode using utf-8-sig): line 1 column 1 (char 0)
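Both failure modes can be reproduced with the standard library alone. The payloads below are made-up samples, not data from real packages, and `failure_kind` is a hypothetical helper; the sketch only assumes that the file content is decoded as strict UTF-8 before being handed to `json`.

```python
import json

# Hypothetical sample payloads, not taken from real packages.
latin1_bytes = '{"author": "\u00cd"}'.encode("latin-1")  # contains byte 0xcd
bom_bytes = b'\xef\xbb\xbf{"name": "pkg"}'  # UTF-8 with a leading BOM


def failure_kind(raw: bytes) -> str:
    """Return the name of the exception raised when parsing raw as strict UTF-8."""
    try:
        json.loads(raw.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError) as exc:
        return type(exc).__name__
    return "ok"
```

Here `failure_kind(latin1_bytes)` reports a `UnicodeDecodeError` and `failure_kind(bom_bytes)` reports a `JSONDecodeError`, matching the two tracebacks above.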

Related T1644

Test Plan

A test for non-UTF-8 encoded package.json files has been added.

Some refactoring was also performed to make it easier to add new test data.

Diff Detail

Repository
rDLDNPM npm loader
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

anlambert edited the summary of this revision.
This revision is now accepted and ready to land. Apr 11 2019, 6:24 PM
This revision was automatically updated to reflect the committed changes.