Page MenuHomeSoftware Heritage

npm: Reimplement lister using new Lister API
ClosedPublic

Authored by anlambert on Jan 18 2021, 6:34 PM.

Details

Summary

Port npm lister to swh.lister.pattern.Lister API.

As before, the lister can be run in full or incremental mode.
When using incremental mode, only new and modified packages will
be returned since the last incremental listing process.
Otherwise, all packages will be listed in lexicographical order.

One major improvement to be noted, latest package update date
is now retrieved when available and sent to scheduler database.

I have successfully tested the two lister modes in docker environment,
all npm packages can be listed in a little more than one hour.
Subsequent execution of incremental listing is then much faster.

Closes T2972

Diff Detail

Repository
rDLS Listers
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

Build is green

Patch application report for D4877 (id=17316)

Rebasing onto 9fd91f007d...

Current branch diff-target is up to date.
Changes applied before test
commit 7d05a37b902a1bcf1a772fd2df1111ae8f52a909
Author: Antoine Lambert <antoine.lambert@inria.fr>
Date:   Fri Jan 15 17:48:00 2021 +0100

    npm: Reimplement lister using new Lister API
    
    Port npm lister to `swh.lister.pattern.Lister` API.
    
    As before, the lister can be run in full or incremental mode.
    When using incremental mode, only new and modified packages will
    be returned since the last incremental listing process.
    Otherwise, all packages will be listed in lexicographical order.
    
    One major improvement to be noted, latest package update date
    is now retrieved when available and sent to scheduler database.
    
    Closes T2972

See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/116/ for more details.

douardda added inline comments.
swh/lister/npm/lister.py
71

please explain this incremented page size

86

I guess it works also, but I'd prefer read {[...], "include_docs": True}

Also a line of comment to tell why this flag is set would be helpful IMHO

109

why a Union[int, str]? why not just an int (or Optional[int] if needed)?

121

no need for the response var here. Also body is a bit misleading (it usually refers to a the raw body of the response). I'd prefer a simple data = self.page_request(last_package_id).json() here

136

see above about the type of last_package_id, but this f-string looks weird here. Why not just keep an int everywhere?

183

why not swap these 2 lines? call self.get_state_from_scheduler() only if self.state.last_seq is not None

swh/lister/npm/lister.py
71

ack

86

Unfortunately, using True does not work here as the npm API expects the include_docs parameter value in lowercase only.

109

I guess it was the end of my dev day and just wanted to make mypy happy.

Indeed, a simple str type is enough here.

In incremental mode, the last_package_id is an integer while in non incremental mode last_package_id corresponds to the name of the last listed package.

Passing integer query parameter as string works so the union can go away.

121

Better naming indeed

136

see my comment above, in full listing mode last_package_id is a package name and it must be double quoted.

183

Well seen, I will rather merge the inverse condition in the first if as it is equivalent.

Build has FAILED

Patch application report for D4877 (id=17402)

Rebasing onto 62c825b8cb...

Current branch diff-target is up to date.
Changes applied before test
commit 9c6da23240f4322f786bec3e8f62eed042a73598
Author: Antoine Lambert <antoine.lambert@inria.fr>
Date:   Fri Jan 15 17:48:00 2021 +0100

    npm: Reimplement lister using new Lister API
    
    Port npm lister to `swh.lister.pattern.Lister` API.
    
    As before, the lister can be run in full or incremental mode.
    When using incremental mode, only new and modified packages will
    be returned since the last incremental listing process.
    Otherwise, all packages will be listed in lexicographical order.
    
    One major improvement to be noted, latest package update date
    is now retrieved when available and sent to scheduler database.
    
    Closes T2972

Link to build: https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/124/
See console output for more information: https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/124/console

Build is green

Patch application report for D4877 (id=17404)

Rebasing onto 62c825b8cb...

Current branch diff-target is up to date.
Changes applied before test
commit 5401f49e1ca4ec8ff1bcb9bba8231f2977173620
Author: Antoine Lambert <antoine.lambert@inria.fr>
Date:   Fri Jan 15 17:48:00 2021 +0100

    npm: Reimplement lister using new Lister API
    
    Port npm lister to `swh.lister.pattern.Lister` API.
    
    As before, the lister can be run in full or incremental mode.
    When using incremental mode, only new and modified packages will
    be returned since the last incremental listing process.
    Otherwise, all packages will be listed in lexicographical order.
    
    One major improvement to be noted, latest package update date
    is now retrieved when available and sent to scheduler database.
    
    Closes T2972

See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/125/ for more details.

vlorentz added a subscriber: vlorentz.
vlorentz added inline comments.
swh/lister/npm/lister.py
131–141

What about this?

This revision is now accepted and ready to land.Jan 22 2021, 10:44 AM
swh/lister/npm/lister.py
131–141

This could raise an error in full listing mode when processing the last page as seq field is missing in that case.
So keeping the code in its current state.

Update: Rebase, set rate limit log level to WARNING, improve docstring and remove debug print.

Build is green

Patch application report for D4877 (id=17509)

Rebasing onto 5411141e3a...

Current branch diff-target is up to date.
Changes applied before test
commit ff232f0d918000325f245657e43ce8fbea4d556e
Author: Antoine Lambert <antoine.lambert@inria.fr>
Date:   Fri Jan 15 17:48:00 2021 +0100

    npm: Reimplement lister using new Lister API
    
    Port npm lister to `swh.lister.pattern.Lister` API.
    
    As before, the lister can be run in full or incremental mode.
    When using incremental mode, only new and modified packages will
    be returned since the last incremental listing process.
    Otherwise, all packages will be listed in lexicographical order.
    
    One major improvement to be noted, latest package update date
    is now retrieved when available and sent to scheduler database.
    
    Closes T2972

See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/131/ for more details.