swh.lister.npm: Add an incremental npm lister
Closed, Public

Authored by anlambert on Fri, Nov 30, 2:57 PM.

Details

Summary

This new lister retrieves only the npm packages that are new or updated since
the last listing operation.

As I explained in the task description, the idea would be to use the
full npm lister for creating the first batch of oneshot loading tasks
for npm packages. Then, use this incremental lister on a regular basis
to only get relevant packages to load again.
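The "only new or updated since the last listing" idea can be sketched as follows. This is a minimal illustration, not the code under review: the shape of the update feed (entries with a `seq` number and a package `name`) and the function name `list_updated_packages` are assumptions made for the example.

```python
def list_updated_packages(changes, last_seq):
    """Return the packages changed after last_seq, plus the new last_seq.

    `changes` is a hypothetical update feed: an iterable of dicts with
    a monotonically increasing "seq" number and a package "name".
    """
    updated = []
    newest_seq = last_seq
    for change in changes:
        if change["seq"] <= last_seq:
            # Already seen during a previous listing operation.
            continue
        if change["name"] not in updated:
            updated.append(change["name"])
        newest_seq = max(newest_seq, change["seq"])
    return updated, newest_seq


# Hypothetical feed content, for illustration only.
feed = [
    {"seq": 1, "name": "left-pad"},
    {"seq": 2, "name": "express"},
    {"seq": 3, "name": "left-pad"},
]
packages, seq = list_updated_packages(feed, last_seq=1)
print(packages, seq)
```

Storing the returned sequence number between runs is what would let the next listing skip everything already processed.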

Related T1378
Closes T1398

Diff Detail

Repository
rDLS Listers
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

anlambert created this revision. Fri, Nov 30, 2:57 PM

Sounds fine.

As usual, the diff is long because of the test data file ;)

swh/lister/npm/lister.py
152

I do not see anything in the file past this line. Don't you have to add at least the pass keyword?
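For context, a small sketch of why a missing `pass` matters. The content of line 152 is not shown in this review, so the snippets below are hypothetical and assume the line opens a class whose body is otherwise empty.

```python
# Hypothetical sources; not the actual content of swh/lister/npm/lister.py.
WITHOUT_PASS = "class Example(object):\n"
WITH_PASS = "class Example(object):\n    pass\n"


def compiles(source):
    """Return True if the given source text is syntactically valid Python."""
    try:
        compile(source, "<sketch>", "exec")
        return True
    except SyntaxError:
        # IndentationError is a subclass of SyntaxError, so an empty
        # block is caught here too.
        return False


print(compiles(WITHOUT_PASS))
print(compiles(WITH_PASS))
```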

I have just a couple of questions ;)

swh/lister/npm/lister.py
57–58

Why oneshot?

anlambert marked 2 inline comments as done. Fri, Nov 30, 3:40 PM
anlambert added inline comments.
swh/lister/npm/lister.py
57–58

In order to create loading tasks only when they are needed. The npm registry is pretty huge (more than 800 000 packages), so to avoid having too many recurring tasks, I think the approach "ingest all packages once, then ingest only those with updates at the next listing" is better here.

152

Good catch!

anlambert updated this revision to Diff 2369. Fri, Nov 30, 3:42 PM

Update: Add missing pass keyword

ardumont added inline comments. Fri, Nov 30, 3:52 PM
swh/lister/npm/lister.py
57–58

I'm not completely convinced.
We are far from the 80M git repositories, for example ;)

It's my understanding that the JavaScript world moves a lot. Won't that create a lot of oneshot tasks anyway?

Another question pops up. What's a package?
Is it a source code archive at a specific version or a group of source code archives (1 per version)?

anlambert marked an inline comment as done. Fri, Nov 30, 4:09 PM
anlambert added inline comments.
swh/lister/npm/lister.py
57–58

It's my understanding that the JavaScript world moves a lot. Won't that create a lot of oneshot tasks anyway?

Surely, but fewer than with recurring tasks for all available packages. From my point of view, if we can benefit from a listing
that returns only the packages updated since the last time, we should exploit it. Having tons of recurring tasks that will
do nothing more after the first ingestion (for packages that are no longer maintained, for instance) feels wrong to me, as it will
delay the ingestion of relevant ones.

Another question pops up. What's a package?
Is it a source code archive at a specific version or a group of source code archives (1 per version)?

In npm semantics, a package is a project, so you can see it as a group of source code archives (1 per version).

ardumont accepted this revision. Fri, Nov 30, 4:11 PM
ardumont added inline comments.
swh/lister/npm/lister.py
57–58

Surely, but fewer than with recurring tasks for all available packages. From my point of view, if we can benefit from a listing
that returns only the packages updated since the last time, we should exploit it. Having tons of recurring tasks that will
do nothing more after the first ingestion (for packages that are no longer maintained, for instance) feels wrong to me, as it will
delay the ingestion of relevant ones.

Now, I'm sold \m/

This revision is now accepted and ready to land. Fri, Nov 30, 4:11 PM
olasd added a subscriber: olasd. Fri, Nov 30, 4:55 PM

Our policy up until now was not to trust update feeds, and to keep low-frequency recurrent tasks for all origins, even when having an update feed available to do higher frequency oneshot tasks.

If we decide that we should change the policy, we should do it globally rather than in a single lister.

Our policy up until now was not to trust update feeds, and to keep low-frequency recurrent tasks for all origins, even when having an update feed available to do higher frequency oneshot tasks.
If we decide that we should change the policy, we should do it globally rather than in a single lister.

I am not sure we must change our policy globally, but I think it is worth trying for npm, as their update feed works like a charm.

Nevertheless, I will add the possibility to configure the policy when creating tasks.
This way, all types of loading policy can be handled.

anlambert updated this revision to Diff 2381. Fri, Nov 30, 5:59 PM

Update: Allow setting the loading task policy through configuration

anlambert updated this revision to Diff 2382. Fri, Nov 30, 6:02 PM

Update: Set default task policy to 'recurring'
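A configurable task policy with 'recurring' as the default could look roughly like the sketch below. It is an assumption-laden illustration, not the committed code: the configuration key `loading_task_policy`, the task type `load-npm`, the helper `make_loading_task`, and the task dictionary layout are all hypothetical.

```python
# Hypothetical default configuration; only the 'recurring' default
# comes from the review discussion.
DEFAULT_CONFIG = {"loading_task_policy": "recurring"}

VALID_POLICIES = ("recurring", "oneshot")


def make_loading_task(package_url, config=None):
    """Build a loading task description using the configured policy."""
    # User-provided settings override the defaults.
    merged = {**DEFAULT_CONFIG, **(config or {})}
    policy = merged["loading_task_policy"]
    if policy not in VALID_POLICIES:
        raise ValueError("unknown loading task policy: %r" % policy)
    return {
        "type": "load-npm",  # hypothetical task type name
        "policy": policy,
        "arguments": {"url": package_url},
    }


# The URL below is a placeholder used only for this example.
print(make_loading_task("https://example.org/package/foo"))
print(make_loading_task("https://example.org/package/foo",
                        {"loading_task_policy": "oneshot"}))
```

Keeping the policy in configuration lets a deployment follow the existing "recurring for all origins" policy by default while still allowing oneshot tasks where an update feed is trusted.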

anlambert updated this revision to Diff 2400. Mon, Dec 3, 6:00 PM

Update: Use a different config file name for the incremental lister

This revision was automatically updated to reflect the committed changes.