
npm.client: Ensure package.json parsing
ClosedPublic

Authored by anlambert on May 21 2019, 5:04 PM.

Details

Summary

Ensure the package.json file can still be parsed when its content cannot be decoded
properly because its encoding was not detected correctly.

So try to decode as UTF-8 first, then fall back to the encoding detected by chardet, using the replace error handling to substitute characters that cannot be decoded.

Even if the package.json content cannot be loaded correctly, this is not critical,
as this data is only added to the metadata of a swh revision. The original package.json file
can still be obtained from the archive content.
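
The decoding strategy described above could be sketched roughly as follows. This is not the code from the diff: decode_package_json and load_package_json are hypothetical names, returning None on a JSON error is an assumption, and the guarded import is only there so the sketch also runs where chardet is not installed.

```python
import json

try:
    import chardet  # third-party detector, only needed on the fallback path
except ImportError:
    chardet = None  # assumption: keep the sketch runnable without chardet


def decode_package_json(raw):
    """Hypothetical helper: decode package.json bytes without raising."""
    try:
        # First attempt: strict UTF-8, the most common case.
        return raw.decode('utf-8')
    except UnicodeDecodeError:
        pass
    # Fallback: ask chardet for a guess (its 'encoding' value may be None).
    guess = chardet.detect(raw)['encoding'] if chardet else None
    try:
        # errors='replace' substitutes U+FFFD for undecodable bytes.
        return raw.decode(guess or 'utf-8', errors='replace')
    except LookupError:
        # The guessed name is not a codec Python knows about.
        return raw.decode('utf-8', errors='replace')


def load_package_json(raw):
    """Hypothetical helper: return the parsed JSON, or None if it is invalid."""
    try:
        return json.loads(decode_package_json(raw))
    except json.JSONDecodeError:
        # Not critical: the original file stays available in the archive.
        return None
```

With this shape, a wrong chardet guess can at worst produce replacement characters in the metadata; it can no longer abort the load.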

This should fix the following kinds of reported errors:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/swh/loader/core/loader.py", line 895, in load
    more_data_to_fetch = self.fetch_data()
  File "/usr/lib/python3/dist-packages/swh/loader/npm/loader.py", line 203, in fetch_data
    data = next(self.new_versions)
  File "/usr/lib/python3/dist-packages/swh/loader/npm/client.py", line 149, in prepare_package_versions
    version_data)
  File "/usr/lib/python3/dist-packages/swh/loader/npm/client.py", line 207, in _prepare_package_version
    package_json = json.loads(package_json_bytes.decode(file_encoding))
  File "/usr/lib/python3.5/encodings/cp1254.py", line 15, in decode
    return codecs.charmap_decode(input,errors,decoding_table)
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 42: character maps to <undefined>
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/swh/loader/core/loader.py", line 895, in load
    more_data_to_fetch = self.fetch_data()
  File "/usr/lib/python3/dist-packages/swh/loader/npm/loader.py", line 203, in fetch_data
    data = next(self.new_versions)
  File "/usr/lib/python3/dist-packages/swh/loader/npm/client.py", line 145, in prepare_package_versions
    version_data)
  File "/usr/lib/python3/dist-packages/swh/loader/npm/client.py", line 197, in _prepare_package_version
    package_json = json.load(package_json_file)
  File "/usr/lib/python3.5/json/__init__.py", line 268, in load
    parse_constant=parse_constant, object_pairs_hook=object_pairs_hook, **kw)
  File "/usr/lib/python3.5/json/__init__.py", line 315, in loads
    s, 0)
json.decoder.JSONDecodeError: Unexpected UTF-8 BOM (decode using utf-8-sig): line 1 column 1 (char 0)
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/swh/loader/core/loader.py", line 895, in load
    more_data_to_fetch = self.fetch_data()
  File "/usr/lib/python3/dist-packages/swh/loader/npm/loader.py", line 203, in fetch_data
    data = next(self.new_versions)
  File "/usr/lib/python3/dist-packages/swh/loader/npm/client.py", line 149, in prepare_package_versions
    version_data)
  File "/usr/lib/python3/dist-packages/swh/loader/npm/client.py", line 204, in _prepare_package_version
    with open(package_json_path, 'rb') as package_json_file:
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/swh.loader.npm/swh.loader.npm.jrx67u3_-2344/@lpmraven/link-components/0.1.1/package/package.json'
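
The second traceback above comes from a package.json that starts with a UTF-8 byte order mark. As a minimal illustration (the sample bytes are made up, and this is not necessarily how the diff handles it), decoding with the utf-8-sig codec suggested by the error message strips the BOM before the JSON parser sees it:

```python
import codecs
import json

# Made-up sample: a tiny package.json preceded by a UTF-8 BOM.
raw = codecs.BOM_UTF8 + b'{"name": "demo"}'

# A plain utf-8 decode keeps the BOM as U+FEFF, which json.loads rejects.
try:
    json.loads(raw.decode('utf-8'))
    bom_rejected = False
except json.JSONDecodeError:
    bom_rejected = True

# utf-8-sig drops a leading BOM and otherwise behaves like utf-8.
package_json = json.loads(raw.decode('utf-8-sig'))
```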

Related T1726

Diff Detail

Repository
rDLDNPM npm loader
Branch
package-json-decode-error
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 5863
Build 8030: tox-on-jenkins (Jenkins)
Build 8029: arc lint + arc unit

Event Timeline

There are other types of package.json parsing errors not related to encoding issues, so I am planning changes to handle all of them in this diff.

How many packages have that particular issue?

What if you try to read them as UTF-8, and fall back to chardet if it fails?

E.g., I looked at one of the errors (for https://www.npmjs.com/package/stb-cli), and this would fix the issue:

>>> b = open('package.json', 'rb').read()
>>> import chardet
>>> chardet.detect(b)
{'language': 'Turkish', 'confidence': 0.4514518884746974, 'encoding': 'Windows-1254'}
>>> s = b.decode('Windows-1254')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.5/encodings/cp1254.py", line 15, in decode
    return codecs.charmap_decode(input,errors,decoding_table)
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 115: character maps to <undefined>
>>> s = b.decode('utf-8')
>>>
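
The same behaviour can be reproduced without the stb-cli file. In the sketch below the input is made up: 'Á' encodes to the bytes 0xC3 0x81 in UTF-8, and 0x81 is one of the bytes that the cp1254 codec (the one shown in the traceback for Windows-1254) leaves undefined.

```python
# Made-up sample: valid UTF-8 whose bytes include 0x81.
raw = '{"author": "Á"}'.encode('utf-8')

# Strict decoding with the misdetected codec fails, as in the session above.
try:
    raw.decode('cp1254')
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# errors='replace' never raises, but it loses the original character.
lossy = raw.decode('cp1254', errors='replace')

# Decoding as UTF-8 recovers the text exactly.
exact = raw.decode('utf-8')
```

This is why trying UTF-8 before trusting a low-confidence chardet guess seems preferable: the replace handler avoids the exception but yields mojibake for input that was valid UTF-8 all along.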

Yes, I noticed that too. That's the direction I am currently taking for the decoding issue.

Update: refine JSON loading code and fix more errors related to package.json parsing

anlambert edited the summary of this revision.
This revision is now accepted and ready to land. May 23 2019, 3:02 PM
This revision was automatically updated to reflect the committed changes.