
extract_npm_package_author: Handle list of dict authors layout
Closed, Public

Authored by anlambert on Apr 11 2019, 2:50 PM.

Details

Summary

Some package.json files may contain an authors field consisting of
a list of dicts. Handle that case to avoid errors such as:

[2019-04-11 12:03:21,650: ERROR/ForkPoolWorker-19] Loading failure, updating to `partial` status
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/swh/loader/core/loader.py", line 893, in load
    more_data_to_fetch = self.fetch_data()
  File "/usr/lib/python3/dist-packages/swh/loader/npm/loader.py", line 203, in fetch_data
    data = next(self.new_versions)
  File "/usr/lib/python3/dist-packages/swh/loader/npm/client.py", line 145, in prepare_package_versions
    version_data)
  File "/usr/lib/python3/dist-packages/swh/loader/npm/client.py", line 200, in _prepare_package_version
    author = extract_npm_package_author(package_json)
  File "/usr/lib/python3/dist-packages/swh/loader/npm/utils.py", line 92, in extract_npm_package_author
    author_data = parse_npm_package_author(package_json['authors'][0])
  File "/usr/lib/python3/dist-packages/swh/loader/npm/utils.py", line 52, in parse_npm_package_author
    author_str.replace('<>', '').replace('()', ''),
AttributeError: 'dict' object has no attribute 'replace'

Related T1644
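
The case described above can be sketched as follows. This is a minimal illustration and not the code from this diff: the helper names `author_entry_to_str` and `first_author_str` are hypothetical, and the dict keys (`name`, `email`, `url`) are assumed to follow the usual npm "people" field layout. It keeps the existing behavior of picking the first entry of a list.

```python
def author_entry_to_str(entry):
    """Turn one author entry into a 'Name <email> (url)' string.

    Hypothetical helper for illustration; not the loader's actual code.
    """
    if isinstance(entry, dict):
        # Assumed keys: 'name', 'email', 'url' (npm "people" layout).
        parts = []
        if entry.get('name'):
            parts.append(entry['name'])
        if entry.get('email'):
            parts.append('<%s>' % entry['email'])
        if entry.get('url'):
            parts.append('(%s)' % entry['url'])
        return ' '.join(parts)
    if isinstance(entry, list):
        # Favor the first entry; an empty list yields an empty string.
        return author_entry_to_str(entry[0]) if entry else ''
    # Plain strings pass through; anything else is treated as missing.
    return entry if isinstance(entry, str) else ''


def first_author_str(package_json):
    """Return an author string from the 'author' or 'authors' field."""
    for key in ('author', 'authors'):
        if key in package_json:
            return author_entry_to_str(package_json[key])
    return ''


# A list of dicts no longer reaches str.replace() as a dict.
print(first_author_str(
    {'authors': [{'name': 'Jane Doe', 'email': 'jane@example.org'}]}
))  # prints: Jane Doe <jane@example.org>
```

Normalizing every entry to a string first means the downstream parsing step only ever sees `str` input, whichever layout the package.json uses.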

Diff Detail

Repository
rDLDNPM npm loader
Branch
authors-fix
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 5337
Build 7237: tox-on-jenkins (Jenkins)
Build 7236: arc lint + arc unit

Event Timeline

olasd added a subscriber: olasd.

This change looks sensible in the context of what's already here, but I'm not sure that favoring the first author over the other authors is the best choice.

I don't remember how we've solved that issue in the case of the deposit, for instance.

Could we turn that question into an issue so we can make a consistent decision across our loaders?

This revision is now accepted and ready to land. Apr 11 2019, 4:11 PM

The deposit loader uses the tar loader under the hood, which sets the author of each produced revision to Software Heritage <robot@softwareheritage.org> [1].
The real information about authors can be found in the revision metadata, see [2] as an example.

For npm, a dump of the package.json file is also available in each produced revision metadata, including the full authors list.
I agree that how to handle the multiple-authors case should be discussed; I have created T1645 on the subject.

[1] https://forge.softwareheritage.org/source/swh-loader-tar/browse/master/swh/loader/tar/build.py$71
[2] https://archive.softwareheritage.org/browse/revision/76b3c170a150af6ee788d799cefe6bf756cabadc/?origin=https://hal.archives-ouvertes.fr/hal-01882337

This revision was automatically updated to reflect the committed changes.