
swh.lister.npm: Add an incremental npm lister
Closed, Public

Authored by anlambert on Nov 30 2018, 2:57 PM.

Details

Summary

This new lister retrieves only the npm packages that are new or have
been updated since the last listing operation.

As I explained in the task description, the idea is to use the full
npm lister to create the first batch of oneshot loading tasks for npm
packages, then to run this incremental lister on a regular basis so
that only the packages with changes get loaded again.
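
The incremental step described above can be pictured with a minimal sketch. It assumes the registry exposes a change feed in which each record carries an increasing sequence number and a package name; the record keys `seq` and `id` and the function name are hypothetical and are not taken from this diff.

```python
from typing import Dict, Iterable, List, Tuple


def packages_changed_since(
    changes: Iterable[Dict], last_seq: int
) -> Tuple[List[str], int]:
    """Return the packages changed after last_seq and the new last_seq.

    Each change record is assumed (hypothetically) to look like
    {"seq": <int>, "id": <package name>}.
    """
    names: List[str] = []
    seen = set()
    new_last_seq = last_seq
    for change in changes:
        seq = change["seq"]
        if seq <= last_seq:
            # Already handled by a previous listing operation.
            continue
        new_last_seq = max(new_last_seq, seq)
        name = change["id"]
        if name not in seen:
            # Report a package once even if it changed several times.
            seen.add(name)
            names.append(name)
    return names, new_last_seq
```

Storing the returned sequence number between runs is what lets the next listing skip everything that was already seen.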

Related T1378
Closes T1398

Diff Detail

Repository
rDLS Listers
Branch
npm-incremental-lister
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 2766
Build 3471: tox-on-jenkins (Jenkins)
Build 3470: arc lint + arc unit

Event Timeline

Sounds fine.

As usual, the diff is long because of the test data file ;)

swh/lister/npm/lister.py
151

I do not see anything in the file past this line; don't you have to add the pass keyword at least?

I have just a couple of questions ;)

swh/lister/npm/lister.py
57–58

Why oneshot?

anlambert added inline comments.
swh/lister/npm/lister.py
57–58

To create loading tasks only when they are needed. The npm registry is pretty huge (more than 800 000 packages), so to avoid having too many recurring tasks, I think the approach "ingest all packages once, then ingest only those with updates at the next listing" is better here.

151

Good catch!

Update: Add missing pass keyword

swh/lister/npm/lister.py
57–58

I'm not completely convinced.
We are far from the 80M git repositories, for example ;)

It's my understanding that the javascript world moves a lot. Won't that create a lot of oneshot tasks anyway?

Another question pops up. What's a package?
Is it a source code archive at a specific version or a group of source code archives (1 per version)?

anlambert added inline comments.
swh/lister/npm/lister.py
57–58

It's my understanding that the javascript world moves a lot. Won't that create a lot of oneshot tasks anyway?

Surely, but fewer than with recurring tasks for all available packages. From my point of view, if we can benefit from a listing
that returns only the packages updated since the last run, we should exploit it. Having tons of recurring tasks that will
do nothing more after the first ingestion (for packages that are no longer maintained, for instance) feels wrong to me, as it will
delay the ingestion of the relevant ones.

Another question pops up. What's a package?
Is it a source code archive at a specific version or a group of source code archives (1 per version)?

In npm semantics, a package is a project, so you can see it as a group of source code archives (1 per version).

ardumont added inline comments.
swh/lister/npm/lister.py
57–58

Surely, but fewer than with recurring tasks for all available packages. From my point of view, if we can benefit from a listing
that returns only the packages updated since the last run, we should exploit it. Having tons of recurring tasks that will
do nothing more after the first ingestion (for packages that are no longer maintained, for instance) feels wrong to me, as it will
delay the ingestion of the relevant ones.

Now, I'm sold \m/

This revision is now accepted and ready to land. Nov 30 2018, 4:11 PM

Our policy up until now was not to trust update feeds, and to keep low-frequency recurrent tasks for all origins, even when having an update feed available to do higher frequency oneshot tasks.

If we decide that we should change the policy, we should do it globally rather than in a single lister.

Our policy up until now was not to trust update feeds, and to keep low-frequency recurrent tasks for all origins, even when having an update feed available to do higher frequency oneshot tasks.
If we decide that we should change the policy, we should do it globally rather than in a single lister.

I'm not sure we must change our policy globally, but I think it is worth trying for npm, as their update feed works like a charm.

Nevertheless, I will add the possibility to configure the policy used when creating tasks.
This way, every type of loading policy can be handled.
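
As a rough illustration of such a configurable policy, a minimal sketch could look like the following. The configuration key `loading_task_policy`, the task type `load-npm`, and the shape of the task dictionary are hypothetical and do not reflect the actual scheduler API; only the two policy names and the 'recurring' default come from this revision.

```python
from typing import Dict, Optional

# Hypothetical default configuration; 'recurring' matches the default
# chosen in this revision.
DEFAULT_CONFIG = {"loading_task_policy": "recurring"}
VALID_POLICIES = ("recurring", "oneshot")


def make_loading_task(package_url: str, config: Optional[Dict] = None) -> Dict:
    """Build a loading task description using the configured policy."""
    effective = dict(DEFAULT_CONFIG)
    effective.update(config or {})
    policy = effective["loading_task_policy"]
    if policy not in VALID_POLICIES:
        raise ValueError(f"unknown loading task policy: {policy!r}")
    # Hypothetical task layout, for illustration only.
    return {
        "type": "load-npm",
        "policy": policy,
        "arguments": {"args": [], "kwargs": {"url": package_url}},
    }
```

With this shape, a deployment that trusts the update feed would set the policy to 'oneshot', while the default keeps the existing recurring behaviour.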

Update: Allow to set loading task policy through configuration

Update: Set default task policy to 'recurring'

Update: Use a different config file name for the incremental lister

This revision was automatically updated to reflect the committed changes.