Page MenuHomeSoftware Heritage

debian: Reimplement lister using new Lister API
ClosedPublic

Authored by anlambert on Jan 22 2021, 12:18 PM.

Details

Summary

Port debian lister to swh.lister.pattern.Lister API.

The new implementation will produce one instance of ListedOrigin model
per package, notably containing the set of parameters expected by the
debian loader.

The lister is also stateful, meaning only new packages and those with
new found versions since the last listing will be returned.

It has been successfully tested for Debian and Ubuntu distributions.

Porting to the new API was a little bit tricky as that lister is different
from others based on querying a REST API.
Here, each page corresponds to the list of package sources info related
to a given debian suite (e.g. buster, bullseye) and component (e.g. main, universe).
Thus a package name can be seen multiple times across pages so I had to ensure
that the total number of listed origins was accurate in returned lister statistics.

Depends on D4924

Closes T2979

Diff Detail

Repository
rDLS Listers
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

Build has FAILED

Patch application report for D4925 (id=17520)

Rebasing onto ff232f0d91...

Current branch diff-target is up to date.
Changes applied before test
commit d9614ef21bc855a7e99c14b47c4ff5ce65ef4aa1
Author: Antoine Lambert <antoine.lambert@inria.fr>
Date:   Wed Jan 20 14:17:15 2021 +0100

    debian: Reimplement lister using new Lister API
    
    Port debian lister to `swh.lister.pattern.Lister` API.
    
    The new implementation will produce one instance of ListedOrigin model
    per package, notably containing the set of parameters expected by the
    debian loader.
    
    The lister is also stateful, meaning only new packages and those with
    new found versions since the last listing will be returned.
    
    Closes T2979

Link to build: https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/132/
See console output for more information: https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/132/console

Harbormaster returned this revision to the author for changes because remote builds failed.Jan 22 2021, 12:19 PM
Harbormaster failed remote builds in B18652: Diff 17520!

Build has FAILED

Patch application report for D4925 (id=17526)

Rebasing onto ff232f0d91...

Current branch diff-target is up to date.
Changes applied before test
commit 924d4fbf0670d90124cb246766fbdf942bcbe2bd
Author: Antoine Lambert <antoine.lambert@inria.fr>
Date:   Wed Jan 20 14:17:15 2021 +0100

    debian: Reimplement lister using new Lister API
    
    Port debian lister to `swh.lister.pattern.Lister` API.
    
    The new implementation will produce one instance of ListedOrigin model
    per package, notably containing the set of parameters expected by the
    debian loader.
    
    The lister is also stateful, meaning only new packages and those with
    new found versions since the last listing will be returned.
    
    Closes T2979

Link to build: https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/133/
See console output for more information: https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/133/console

Harbormaster returned this revision to the author for changes because remote builds failed.Jan 22 2021, 12:27 PM
Harbormaster failed remote builds in B18658: Diff 17526!

Build has FAILED

Patch application report for D4925 (id=17520)

Rebasing onto ff232f0d91...

Current branch diff-target is up to date.
Changes applied before test
commit d9614ef21bc855a7e99c14b47c4ff5ce65ef4aa1
Author: Antoine Lambert <antoine.lambert@inria.fr>
Date:   Wed Jan 20 14:17:15 2021 +0100

    debian: Reimplement lister using new Lister API
    
    Port debian lister to `swh.lister.pattern.Lister` API.
    
    The new implementation will produce one instance of ListedOrigin model
    per package, notably containing the set of parameters expected by the
    debian loader.
    
    The lister is also stateful, meaning only new packages and those with
    new found versions since the last listing will be returned.
    
    Closes T2979

Link to build: https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/132/
See console output for more information: https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/132/console

CI build will fail until D4924 gets landed and a new swh-scheduler tag created.

ardumont added inline comments.
swh/lister/debian/__init__.py
13

Can't we drop the entry altogether?

(we did in the cgit port but maybe we need to empty it as well, lister tests did not complain but we may have missed something)

Build is green

Patch application report for D4925 (id=17602)

Rebasing onto bea9d6d147...

Current branch diff-target is up to date.
Changes applied before test
commit 5dc3c949e0a2110878f24b141cae6b1f86c24530
Author: Antoine Lambert <antoine.lambert@inria.fr>
Date:   Wed Jan 20 14:17:15 2021 +0100

    debian: Reimplement lister using new Lister API
    
    Port debian lister to `swh.lister.pattern.Lister` API.
    
    The new implementation will produce one instance of ListedOrigin model
    per package, notably containing the set of parameters expected by the
    debian loader.
    
    The lister is also stateful, meaning only new packages and those with
    new found versions since the last listing will be returned.
    
    Closes T2979

See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/172/ for more details.

tenma added inline comments.
swh/lister/debian/lister.py
75

why not "url"?

131

I'm not sure it is unexpected, many of these requests are bound to fail.
Do you have enough info with this log (debug, so not in prod) and the exception below, to find the potential problem?

209–240

You could use a variable to store the ListedOrigin and not do the 2 lookups in multiple spots after.
Not shown: one more spot at L258.

212

you may put this ListedOrigin in a variable, as you use it multiple times after (self.listed_origins[origin_url]).

swh/lister/debian/tasks.py
11

do we use **lister_args because there is no support for credentials?

In any case, bravo for the effort on this tough lister!

swh/lister/debian/tests/test_lister.py
49–50

You can also use ad-hoc custom types (either aliases or new types), when the same underlying type is used for different things like here.
In addition to improved readability, it can make the code more robust because of type checking.

swh/lister/debian/__init__.py
13

I did not known the implications of removing that field so I kept it at the moment. Looking at the lister codebase I am not sure it is even used but I guess it is related to the lister CLI or entrypoints.

I think we should remove all of them when all listers have been ported.

swh/lister/debian/lister.py
75

That's how the parameter was named in previous lister implementation and it allows to explicit that a debian packages mirror URL is expected as parameter.

131

I guess I could simply log that the Sources file with given compression is not available on the packages mirror.

209–240

same answer as above

212

Python dictionaries have an average time complexity of O(1) for lookup operations so the performance gain
will be negligible by using a variable, especially with string keys.

For the record, I did not observe any performance issues when testing the lister in docker, it runs in a couple of minutes depending on the number of debian suites and components listed.

swh/lister/debian/tasks.py
11

yes, here task parameters are the same as the lister ones

swh/lister/debian/tests/test_lister.py
49–50

Ack, good idea indeed.

swh/lister/debian/lister.py
212

Not just dicts, attribute lookup machinery is more expensive, but of course, if it is not significant, decision is yours :)

Build is green

Patch application report for D4925 (id=17618)

Rebasing onto bea9d6d147...

Current branch diff-target is up to date.
Changes applied before test
commit 53eccbabad7b339d39ce62dd4791540df6bddfd6
Author: Antoine Lambert <antoine.lambert@inria.fr>
Date:   Wed Jan 20 14:17:15 2021 +0100

    debian: Reimplement lister using new Lister API
    
    Port debian lister to `swh.lister.pattern.Lister` API.
    
    The new implementation will produce one instance of ListedOrigin model
    per package, notably containing the set of parameters expected by the
    debian loader.
    
    The lister is also stateful, meaning only new packages and those with
    new found versions since the last listing will be returned.
    
    Closes T2979

See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/176/ for more details.

This revision is now accepted and ready to land.Jan 26 2021, 3:11 PM

Build is green

Patch application report for D4925 (id=17626)

Rebasing onto 22eeb0956e...

Current branch diff-target is up to date.
Changes applied before test
commit 54382e44cf8775e2f05090f068f6b438102cafd8
Author: Antoine Lambert <antoine.lambert@inria.fr>
Date:   Wed Jan 20 14:17:15 2021 +0100

    debian: Reimplement lister using new Lister API
    
    Port debian lister to `swh.lister.pattern.Lister` API.
    
    The new implementation will produce one instance of ListedOrigin model
    per package, notably containing the set of parameters expected by the
    debian loader.
    
    The lister is also stateful, meaning only new packages and those with
    new found versions since the last listing will be returned.
    
    Closes T2979

See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/180/ for more details.

Rebase and remove conftest.py file

Build is green

Patch application report for D4925 (id=17635)

Rebasing onto 97254a19f2...

Current branch diff-target is up to date.
Changes applied before test
commit 5a46ad608c5f4c893df26aaf0daba128189a2092
Author: Antoine Lambert <antoine.lambert@inria.fr>
Date:   Wed Jan 20 14:17:15 2021 +0100

    debian: Reimplement lister using new Lister API
    
    Port debian lister to `swh.lister.pattern.Lister` API.
    
    The new implementation will produce one instance of ListedOrigin model
    per package, notably containing the set of parameters expected by the
    debian loader.
    
    The lister is also stateful, meaning only new packages and those with
    new found versions since the last listing will be returned.
    
    Closes T2979

See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/186/ for more details.

Build is green

Patch application report for D4925 (id=17639)

Rebasing onto 6cd31769c1...

Current branch diff-target is up to date.
Changes applied before test
commit bb0184c004d2c91c36affc298074981682b60915
Author: Antoine Lambert <antoine.lambert@inria.fr>
Date:   Wed Jan 20 14:17:15 2021 +0100

    debian: Reimplement lister using new Lister API
    
    Port debian lister to `swh.lister.pattern.Lister` API.
    
    The new implementation will produce one instance of ListedOrigin model
    per package, notably containing the set of parameters expected by the
    debian loader.
    
    The lister is also stateful, meaning only new packages and those with
    new found versions since the last listing will be returned.
    
    Closes T2979

See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/187/ for more details.