Page MenuHomeSoftware Heritage

Implement a loader for npm
ClosedPublic

Authored by anlambert on Mar 15 2019, 8:16 PM.

Details

Summary

First diff showcasing the work done so far in order to ingest into the archive
the numerous source packages hosted on the npm registry.

This work is greatly inspired from the PyPI loader written by @ardumont.
Apart some slight differences between PyPI and npm, the process to load
their content into the archive is almost the same: read package metatada
from a JSON endpoint then download tarballs and load them into the archive.

This clearly calls for a future refactoring to remove some code duplication
and establish a loading worflow to ingest content from other package managers.

The loader can be tested through the following command line:

$ python3 -m swh.loader.npm.loader <npm_package_name>

To correctly browse the results with swh-web, you will need to use its
development version as current package published on PyPI does not handle
browsing this new type of origin yet.

Related T1379

Diff Detail

Repository
rDLDNPM npm loader
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

anlambert added projects: Origin-npm, Npm loader.
anlambert edited the summary of this revision. (Show Details)

I have not read the tests yet but the runtime code sound good already (as a first feedback draft ;)

Update:

  • add more tests
  • some fixes and cleanup in the loader implementation

Next steps: add proper code documentation and README

Update:

  • finish code documentation and fill README.md
  • fix some tests as I improved the way to parse author (and thus some revision ids have changed)
  • add celery task for the npm loader
anlambert retitled this revision from [WIP] Implement a loader for npm to Implement a loader for npm.Apr 1 2019, 5:54 PM
anlambert edited the summary of this revision. (Show Details)

heh, just noticed you finalized it, i'll take a look tomorrow :)

swh/loader/npm/client.py
87

I tend to use os.path.basename even on url, that works, any counter indication to that?

134

Maybe log that we did not find it?

swh/loader/npm/tests/test_client.py
28 ↗(On Diff #4234)

Not blocking but as a heads up, we tend towards moving away from unittest now and try to embrace the pytest and its fixture ;)

What that entails is we can drop the class thing and use function directly, letting pytest injects its or even our own fixture dependencies (tmpdir, test data, client, etc...) as function parameter.

(Again, not blocking ;)

swh/loader/npm/tests/test_loader.py
248 ↗(On Diff #4234)

We could even demo that it's supposed to be the same snapshot (using the id is enough) as the first round.

As in (from previous test):

self.assertSnapshotEqual(_expected_new_snapshot_first_visit,
                         _expected_branches_first_visit)
277 ↗(On Diff #4234)

Why 2 snapshots?

douardda added inline comments.
swh/loader/npm/client.py
87

I do use urllib.parse.urlparse() for this kind of stuff. In this case, it helps managing params and query elements one can find in an URL.

Not sure if we can encounter real urls as tarball_url here, but I would rather use urllib stuff by default.

swh/loader/npm/client.py
87

Right!

This revision is now accepted and ready to land.Apr 2 2019, 12:32 PM

Update: CI build should work now

Build is green

\m/

I love it when a plan comes together ;)

swh/loader/npm/client.py
87

I would go for the os.path.basename approach as it induces less instructions to extract the filename, see below:

>>> urllib.parse.urlparse('https://registry.npmjs.org/webpack/-/webpack-0.1.2.tgz')
ParseResult(scheme='https', netloc='registry.npmjs.org', path='/webpack/-/webpack-0.1.2.tgz', params='', query='', fragment='')
>>> os.path.basename('https://registry.npmjs.org/webpack/-/webpack-0.1.2.tgz')
'webpack-0.1.2.tgz'
134

ack

swh/loader/npm/tests/test_client.py
28 ↗(On Diff #4234)

I need to document myself about pytest fixtures first. I keep it like this at the moment.

swh/loader/npm/tests/test_loader.py
248 ↗(On Diff #4234)

Indeed, I should have added that test here.

277 ↗(On Diff #4234)

Because a single instance of the npm loader is used in these tests. So after the second visit, two snapshots will be stored in the in-memory storage. I proceeded like this as I wanted to test that only new found objects are sent to the storage after the first visit. If I instantiated a loader before each test, the in-memory storage is recreated from scratch and I did not want that to happen.

swh/loader/npm/tests/test_client.py
28 ↗(On Diff #4234)

Sure thing!

swh/loader/npm/tests/test_loader.py
277 ↗(On Diff #4234)

ack.

Update:

  • address ardumont comments
  • pass the logger instance from the loader to the npm client
  • also test that temporary directory has been correctly removed once the loading process finished
vlorentz added inline comments.
swh/loader/npm/client.py
20–21

You should document that this class cannot be used to manipulate multiple packages at once.

60

known_versions should be renamed skip_versions, so it's clearer what it does.

65–66

Type could be Union[dict, set], as this function uses only features available for both

114–115

Same comment as above

118–125

You should use a namedtuple instead.

swh/loader/npm/loader.py
96

You don't need safe='', the way you parse package names prevents them from containing slashes.

211

I think you can use .encode('ascii') here. From npm's specs:

The name ends up being part of a URL, an argument on the command line, and a folder name. Therefore, the name can’t contain any non-URL-safe characters.

222

Same:

Version must be parseable by node-semver

274

Same

swh/loader/npm/tests/common.py
16 ↗(On Diff #4240)

No need for safe=''.

22 ↗(On Diff #4240)

No need to inherit from object explicitely

swh/loader/npm/client.py
20–21

ack

60

as said orally by @ardumont, let's keep the parameter names close to those that can be founded in the PyPI loader implementation. It will ease future refactoring.

swh/loader/npm/loader.py
96

Unfortunately, package names containing '/' exist, for instance https://www.npmjs.com/package/@01ht/ht-date, so that's why I need to override the safe parameter.

As you can see below:

https://replicate.npmjs.com/@01ht/ht-date => package not found
https://replicate.npmjs.com/%4001ht%2Fht-date => package found

211

Considering that dir_path is a concatenation of a temporary directory and the package name, I would keep the utf-8 decoding here. Indeed, the temporary directory can be provided by configuration so it may contain some non ascii characters in it.

222

ack

274

ack

swh/loader/npm/tests/common.py
16 ↗(On Diff #4240)

will simplify for test

22 ↗(On Diff #4240)

ack

Update: Address some vlorentz comments

This revision was automatically updated to reflect the committed changes.