Page MenuHomeSoftware Heritage

maven-lister: initialise lister.
AbandonedPublic

Authored by borisbaldassari on Aug 25 2021, 2:25 PM.

Details

Reviewers
vlorentz
douardda
Group Reviewers
Reviewers
Maniphest Tasks
T1724: Maven Central repository support
Summary

The Maven lister retrieves the maven central indexes, exports them in a
convenient text format, and parse them to identify all src archives and
pom files in the maven repository. Then the pom files are downloaded and
analysed to find and yield any scm reference.

Superseded by https://forge.softwareheritage.org/D6395

Related to T1724

Diff Detail

Repository
rDLS Listers
Branch
T1724_maven_central_lister
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 23565
Build 36764: Phabricator diff pipeline on jenkinsJenkins console · Jenkins
Build 36763: arc lint + arc unit

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
borisbaldassari marked 2 inline comments as done.
  • maven-lister: Fix inaccurate comment (review D6133).

Build has FAILED

Patch application report for D6133 (id=22590)

Rebasing onto e904f4760e...

Current branch diff-target is up to date.
Changes applied before test
commit 73a36afd6676a719904d88495ff8cc4ce61057f3
Author: Boris Baldassari <boris@chrysalice.org>
Date:   Sun Sep 12 19:02:28 2021 +0200

    maven-lister: Fix inaccurate comment (review D6133).

commit 4898b8b2f3dfbeb0ad585fd13d1734f0ce45b8cf
Author: Boris Baldassari <boris@chrysalice.org>
Date:   Sun Sep 12 18:47:29 2021 +0200

    maven-lister: add changes from review D6133.

commit 76a16ec85601ace455e787812b6b5a7849f73e30
Author: Boris Baldassari <boris@chrysalice.org>
Date:   Sun Aug 29 20:38:12 2021 +0200

    maven-lister: set last modif time to epoch int.

commit 4da2ce52acd55d49d25016d0293b9efdb4808596
Author: Boris Baldassari <boris@chrysalice.org>
Date:   Sun Aug 29 20:02:11 2021 +0200

    maven-lister: add last modif time to extra args for jar files.

commit 91756c69df863e5756d6d7c2f5bb3a835e017d66
Author: Boris Baldassari <boris@chrysalice.org>
Date:   Sun Aug 29 11:05:41 2021 +0200

    maven-lister: add extra loader arguments for jar loader output.

commit c8efa950aed306824169cab4167c4f0c7734bff0
Author: Boris Baldassari <boris@chrysalice.org>
Date:   Sun Aug 29 11:04:06 2021 +0200

    maven-lister: add extra loader arguments for jar loader output.

commit 6c6eb0ff2ee8a683b197039622d020168d50bb64
Author: Boris Baldassari <boris@chrysalice.org>
Date:   Sat Aug 28 10:58:44 2021 +0200

    maven-lister: fix function doc comment for sphinx.

commit 4f541908f3d9f3194248cd90a0dbe964f2f21fdf
Author: Boris Baldassari <boris@chrysalice.org>
Date:   Sat Aug 28 10:44:39 2021 +0200

    maven-lister: add maven to list_cli for testing.

commit db31540ec9a515927e4d8e6872aae633a425a781
Author: Boris Baldassari <boris@chrysalice.org>
Date:   Sat Aug 28 10:31:40 2021 +0200

    maven-lister: fix mocking of index download.

commit 5567fc3c4105c2bd44403146da4190f30ea7fd96
Author: Boris Baldassari <boris@chrysalice.org>
Date:   Thu Aug 26 21:07:23 2021 +0200

    maven-lister: fix test_task with new parameters.

commit 64065c02dd4aac6669b3ae8b5f27af104bfc14ae
Author: Boris Baldassari <boris@chrysalice.org>
Date:   Thu Aug 26 21:01:25 2021 +0200

    maven-lister: refactor to externalize index export.

commit 9c210e68de85db0057d2f5bf1bd6799d99711c9d
Author: Boris Baldassari <boris@chrysalice.org>
Date:   Wed Aug 25 14:35:24 2021 +0200

    maven-lister: add docker in requirements.txt..

commit d6a4713170940b83d75c6d4e058d8a9f916c55e7
Author: Boris Baldassari <boris@chrysalice.org>
Date:   Wed Aug 25 14:30:43 2021 +0200

    maven-lister: initialise lister.

commit 7b49ba6bf7dea3491c35e0807f6e16b0bff46bdf
Author: Boris Baldassari <boris@chrysalice.org>
Date:   Wed Aug 25 14:23:09 2021 +0200

    maven-lister: initialise lister.

Link to build: https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/349/
See console output for more information: https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/349/console

borisbaldassari marked an inline comment as done.
  • maven-lister: Fix tests (review D6133)

Build is green

Patch application report for D6133 (id=22630)

Rebasing onto e904f4760e...

Current branch diff-target is up to date.
Changes applied before test
commit d878c30064437b297c43847c1e41d4d10af45d1f
Author: Boris Baldassari <boris@chrysalice.org>
Date:   Tue Sep 14 08:25:23 2021 +0200

    maven-lister: Fix tests (review D6133)

commit 73a36afd6676a719904d88495ff8cc4ce61057f3
Author: Boris Baldassari <boris@chrysalice.org>
Date:   Sun Sep 12 19:02:28 2021 +0200

    maven-lister: Fix inaccurate comment (review D6133).

commit 4898b8b2f3dfbeb0ad585fd13d1734f0ce45b8cf
Author: Boris Baldassari <boris@chrysalice.org>
Date:   Sun Sep 12 18:47:29 2021 +0200

    maven-lister: add changes from review D6133.

commit 76a16ec85601ace455e787812b6b5a7849f73e30
Author: Boris Baldassari <boris@chrysalice.org>
Date:   Sun Aug 29 20:38:12 2021 +0200

    maven-lister: set last modif time to epoch int.

commit 4da2ce52acd55d49d25016d0293b9efdb4808596
Author: Boris Baldassari <boris@chrysalice.org>
Date:   Sun Aug 29 20:02:11 2021 +0200

    maven-lister: add last modif time to extra args for jar files.

commit 91756c69df863e5756d6d7c2f5bb3a835e017d66
Author: Boris Baldassari <boris@chrysalice.org>
Date:   Sun Aug 29 11:05:41 2021 +0200

    maven-lister: add extra loader arguments for jar loader output.

commit c8efa950aed306824169cab4167c4f0c7734bff0
Author: Boris Baldassari <boris@chrysalice.org>
Date:   Sun Aug 29 11:04:06 2021 +0200

    maven-lister: add extra loader arguments for jar loader output.

commit 6c6eb0ff2ee8a683b197039622d020168d50bb64
Author: Boris Baldassari <boris@chrysalice.org>
Date:   Sat Aug 28 10:58:44 2021 +0200

    maven-lister: fix function doc comment for sphinx.

commit 4f541908f3d9f3194248cd90a0dbe964f2f21fdf
Author: Boris Baldassari <boris@chrysalice.org>
Date:   Sat Aug 28 10:44:39 2021 +0200

    maven-lister: add maven to list_cli for testing.

commit db31540ec9a515927e4d8e6872aae633a425a781
Author: Boris Baldassari <boris@chrysalice.org>
Date:   Sat Aug 28 10:31:40 2021 +0200

    maven-lister: fix mocking of index download.

commit 5567fc3c4105c2bd44403146da4190f30ea7fd96
Author: Boris Baldassari <boris@chrysalice.org>
Date:   Thu Aug 26 21:07:23 2021 +0200

    maven-lister: fix test_task with new parameters.

commit 64065c02dd4aac6669b3ae8b5f27af104bfc14ae
Author: Boris Baldassari <boris@chrysalice.org>
Date:   Thu Aug 26 21:01:25 2021 +0200

    maven-lister: refactor to externalize index export.

commit 9c210e68de85db0057d2f5bf1bd6799d99711c9d
Author: Boris Baldassari <boris@chrysalice.org>
Date:   Wed Aug 25 14:35:24 2021 +0200

    maven-lister: add docker in requirements.txt..

commit d6a4713170940b83d75c6d4e058d8a9f916c55e7
Author: Boris Baldassari <boris@chrysalice.org>
Date:   Wed Aug 25 14:30:43 2021 +0200

    maven-lister: initialise lister.

commit 7b49ba6bf7dea3491c35e0807f6e16b0bff46bdf
Author: Boris Baldassari <boris@chrysalice.org>
Date:   Wed Aug 25 14:23:09 2021 +0200

    maven-lister: initialise lister.

See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/350/ for more details.

douardda added a subscriber: douardda.

I'm not done yet but here is first review on my side.

Also, please stash your git revisions before the final review/ acceptance; this Diff should consist in one (maybe 2) git revisions in the end (using git rebase -i to rework your git history).

swh/lister/maven/lister.py
58

Why is docker involved here? I see no other mention of a docker stuff anywhere in this diff.
I don't understand what this "index_url" is and how it is supposed to be used.

127

why not use the context manager API of the NamedTemporaryFile here?

130–134

Please document a bit these regex. Also please prefer named match groups when possible (?P<name>...) which helps to "self-document" regexes.

141

a python text file object is iterable, so one would prefer the form:

for line in file_txt:
  [...]
143

while line: is enough here

147

this could be handled by the regex itself

150–163

why use urljoin while hand-building the URL by concatenation of strings with '/'?

I mean urljoin does support multiple arguments, like:

urljoin(self.BASE_URL, path, aid, version, f"{aid}-{version}.{ext}")
166

and ext in ("zip", "jar")

166

BTW, what about an upper case ext value?

184–186
out_src[url_src] = {"g": gid, "a": aid, "v": version}

Not sure out_src and out_pom really need to be defaultdict actually.

196

Now that I read this, why not do the processing on the flight? Why bother storing the file on disk then read it back line by line to do a bunch of regex?

requests does provide a nice API for this:
https://docs.python-requests.org/en/master/api/#requests.Response.iter_lines

201

no need for the .keys() here, iterating on a dict is iterating on its keys.
But more importantly, you probably want to actually iterate on both keys and values:

for src, val in out_src.items():
  ...
  yield {
    ...
    "time": val["t"],
      ...

BTW, if you use proper keys in out_src[*] (i.e. "time" instead of "t" and so on) you can just use it as is here:

  yield {
    "type": "jar",
    "url": src,
   **val
}
This revision now requires changes to proceed.Sep 20 2021, 4:33 PM

I'm not done yet but here is first review on my side.

Thanks for your time and interest.

Also, please stash your git revisions before the final review/ acceptance; this Diff should consist in one (maybe 2) git revisions in the end (using git rebase -i to rework your git history).

That's planned before the final review, and that should happen soon hopefully.
We still need to sort out a point with vlorentz about the jar/maven loader that impacts both the lister and loader. But that's not big, I am confident we're close to the end of the first iteration (i.e. without incremental update).
That's good, every little thing done now is not on the road ahead, and I learned a lot from your comments. Thanks!

swh/lister/maven/lister.py
58

It's been a long discussion held back in June on IRC and in task T1724. In a nutshell, we need two tools to transform the maven indexes into something readable: maven-indexer-cli and clue. Rather than have a virtual machine (there is no way to run java code in python without one) it was requested that the tools be put in a docker container. End of August, the docker image was ready and olasd asked me (08-25) to put it on a separate server so the lister would simply have to query it on the network.

so: index_url is the name (or IP address) of this local server that hosts the docker image.

127

As far as i can remember, because I wanted to stream it in order to reduce memory footprint: the download can be huge.
For maven-central the main download would be 49GB.

130–134

You're definitely right. Fixed, thanks.

141

Yes, I'm shameful.
The reason for this is there is a second readline later on in the loop, and I could not find a more elegant way yet. There are newlines in some of the fields, I didn't come up with a nice way to do it. Yet.

143

Fixed, thanks.

150–163

Hum. It yields [1] when I try, and that's not what I've read [2].

[1] "TypeError: urljoin() takes from 2 to 3 positional arguments but 6 were given"
[2] https://docs.python.org/3/library/urllib.parse.html#urllib.parse.urljoin

Am I missing something?

Note: I have fixed the ugliness of if by using f-strings, and it looks a lot better.

166

ext in (a, b) => Far more elegant, of course. Fixed, thanks.

uppercase => Yes, good point. As a matter of fact there is no uppercase extensions on maven central (just checked) but I'm not sure why (part of the maven convention, maybe?) and that can surely happen.
Anyway, it's definitely good practice, fixed, thanks.

184–186

For uniqueness of entries (some entries tend to appear a few times). Is there a way to do it better with Python?

196

Short answer: because I didn't know of iter_*lines*. Oh god. :-)
Thanks!!

swh/lister/maven/lister.py
58

Ok, so please document all this some where (in this diff). The README file shoudl give some high level explanations, and this docstring should refer to this former doc.

Also, where is documented/specified this index file format? It should be somewhere.

127

The context manager is unrelated with loading the file in RAM or not.

150–163

No you are right, my mistake, I was assuming urljoin has a decent API, which is not the case. sorry.

184–186

I don't see how using defaultdict is related with this uniqueness question.
What does it bring that a simple dict would not?

some more :-)

swh/lister/maven/lister.py
219

I'm always nervous when I see a bytes.decode() called on some content coming from The Workd™.

Is there any change of getting some encoding error here?

234

use items() on the dict:

for src, project in out_pom.items():
   yield {"type": "scm", "url": src, "project": project}

BTW, why build the dict to yield its values just after building it?

Why not yielding values directly from the for pom in out_pom loop?

253–254

no need to compile the regex if it's used only once. Just use the match function directly:

m_scm = re.match(r"^scm:([^:]+):(.*)$", page["url"])

Also please prefer named group matching (https://docs.python.org/3/library/re.html#index-17)

260

what's the comment for?

264–271

do you really need a regex to check for a '.git' at the end? I means page["url"].enndswith(".git") should do the trick here.

278

why the commented line?

It would be nice to have a README fil in swh/lister/maven/tests/data explaining what the data files are, where they come from, how they have been generated, etc.

swh/lister/maven/tests/test_lister.py
53–68

Is it really necessary for these functions to be fixtures?

72

The "two collections" is not clear to me.

79–82

why not just assert sorted_lister_urls == sorted_scheduler_origins ?
Assuming you build sorted_scheduler_origins as sorted(x.url for x in scheduler_origins) I mean.

Also the list in list(sorted(x)) is not necessary I believe.

95–97

Why these local variables? They look unnecessary to me.

113

It's generally not a good idea to shadow a builtin function/type (list here). Better choose another variable name. Especially since it looks to be a dictionary...

115–130
148

please explain which HTTP error this is actually testing.

152

I don't get why we have 2 listed origins if there have been HTTP error in the process. If it's expected, it needs a bit of explanation.

borisbaldassari marked 2 inline comments as done.
  • maven-lister: Fix various comments from review (review D6133).
  • maven-lister: add README.md to describe design decisions (review D6133).
  • maven-lister: Fix useless storage of file + minor fixes (review D6133).

Build is green

Patch application report for D6133 (id=22991)

Rebasing onto 5ab6b00408...

First, rewinding head to replay your work on top of it...
Applying: maven-lister: initialise lister.
Applying: maven-lister: initialise lister.
Applying: maven-lister: add docker in requirements.txt..
Applying: maven-lister: refactor to externalize index export.
Applying: maven-lister: fix test_task with new parameters.
Applying: maven-lister: fix mocking of index download.
Applying: maven-lister: add maven to list_cli for testing.
Applying: maven-lister: fix function doc comment for sphinx.
Applying: maven-lister: add extra loader arguments for jar loader output.
Applying: maven-lister: add extra loader arguments for jar loader output.
Applying: maven-lister: add last modif time to extra args for jar files.
Applying: maven-lister: set last modif time to epoch int.
Applying: maven-lister: add changes from review D6133.
Applying: maven-lister: Fix inaccurate comment (review D6133).
Applying: maven-lister: Fix tests (review D6133)
Applying: maven-lister: Fix various comments from review (review D6133).
Applying: maven-lister: add README.md to describe design decisions (review D6133).
Applying: maven-lister: Fix useless storage of file + minor fixes (review D6133).
Changes applied before test
commit bd65ba11a74db9c80d47b43df6460fbd25458ec3
Author: Boris Baldassari <boris@chrysalice.org>
Date:   Wed Sep 22 23:34:59 2021 +0200

    maven-lister: Fix useless storage of file + minor fixes (review D6133).

commit a9ffb05c7005daf0f8195030f3697d808b8e9e43
Author: Boris Baldassari <boris@chrysalice.org>
Date:   Wed Sep 22 23:14:58 2021 +0200

    maven-lister: add README.md to describe design decisions (review D6133).

commit 6b9e490e7eaedb9f9d5c38c619a3cb34655c40f7
Author: Boris Baldassari <boris@chrysalice.org>
Date:   Mon Sep 20 21:48:01 2021 +0200

    maven-lister: Fix various comments from review (review D6133).

commit b0cef1f1dcdcc994486220ef38bb2a87639470ca
Author: Boris Baldassari <boris@chrysalice.org>
Date:   Tue Sep 14 08:25:23 2021 +0200

    maven-lister: Fix tests (review D6133)

commit 22416930c84101903d859dc5f64f2218022fa3d1
Author: Boris Baldassari <boris@chrysalice.org>
Date:   Sun Sep 12 19:02:28 2021 +0200

    maven-lister: Fix inaccurate comment (review D6133).

commit 0e70975749ea58eacea21f4988c6e1d432673474
Author: Boris Baldassari <boris@chrysalice.org>
Date:   Sun Sep 12 18:47:29 2021 +0200

    maven-lister: add changes from review D6133.

commit 3e31a4d453f6f28c5b35df5d9815194c9ff3ecb4
Author: Boris Baldassari <boris@chrysalice.org>
Date:   Sun Aug 29 20:38:12 2021 +0200

    maven-lister: set last modif time to epoch int.

commit 81ac809bb57eb6722168afd992f14dcccd5df771
Author: Boris Baldassari <boris@chrysalice.org>
Date:   Sun Aug 29 20:02:11 2021 +0200

    maven-lister: add last modif time to extra args for jar files.

commit acccc7929719cbcaa799fefc48e6938f71b78915
Author: Boris Baldassari <boris@chrysalice.org>
Date:   Sun Aug 29 11:05:41 2021 +0200

    maven-lister: add extra loader arguments for jar loader output.

commit ac1cb0467cbc848d05a8a1db88fed3312c9a5cc6
Author: Boris Baldassari <boris@chrysalice.org>
Date:   Sun Aug 29 11:04:06 2021 +0200

    maven-lister: add extra loader arguments for jar loader output.

commit e3c4b484cd5a3bade744f2118ed76a679e875b19
Author: Boris Baldassari <boris@chrysalice.org>
Date:   Sat Aug 28 10:58:44 2021 +0200

    maven-lister: fix function doc comment for sphinx.

commit e1b864ebcf100fd89c7a24edae291eb0d3f07dc7
Author: Boris Baldassari <boris@chrysalice.org>
Date:   Sat Aug 28 10:44:39 2021 +0200

    maven-lister: add maven to list_cli for testing.

commit e829dc19d2bb3fb7b50069870471690f375b0e20
Author: Boris Baldassari <boris@chrysalice.org>
Date:   Sat Aug 28 10:31:40 2021 +0200

    maven-lister: fix mocking of index download.

commit c367d4d3316616ef94bf27c861279e89edd72aa3
Author: Boris Baldassari <boris@chrysalice.org>
Date:   Thu Aug 26 21:07:23 2021 +0200

    maven-lister: fix test_task with new parameters.

commit c12b3f6d10b6ee1ab7a13396deb41a2341fd77e7
Author: Boris Baldassari <boris@chrysalice.org>
Date:   Thu Aug 26 21:01:25 2021 +0200

    maven-lister: refactor to externalize index export.

commit 172ea2b428356008c0503b699f2ecd0018f3f19d
Author: Boris Baldassari <boris@chrysalice.org>
Date:   Wed Aug 25 14:35:24 2021 +0200

    maven-lister: add docker in requirements.txt..

commit c8fb4bdd3f36677dbd0da7c539f5ceb52b7b4267
Author: Boris Baldassari <boris@chrysalice.org>
Date:   Wed Aug 25 14:30:43 2021 +0200

    maven-lister: initialise lister.

commit de050dc02b4048b952a5738f9687c84f7e4bb7a4
Author: Boris Baldassari <boris@chrysalice.org>
Date:   Wed Aug 25 14:23:09 2021 +0200

    maven-lister: initialise lister.

See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/378/ for more details.

borisbaldassari marked 5 inline comments as done.EditedSep 22 2021, 11:57 PM

Hum. This diff is a mess after the refactoring of the main loop.
Please give me some time to fix all of the current issues, then rebase the diff to make it more readable. Hope it will clear up the lines mix.

swh/lister/maven/lister.py
58

so: index_url is the name (or IP address) of this local server that hosts the docker image.

so: index_url is the name (or IP address) of this local server that hosts the docker image *and* the exported indexes to be downloaded.

58

Added a README.md (in this arc diff) and added a link to the readme in the f-string (yet to commit).

141

Fixed with refactoring (see below).

150–163

I did assume that too at some point, rings a bell. yeah.

196

That's exactly why we need peers: to get out of our own train of thought.

Ok, fixed as you proposed. The file is now parsed as it is downloaded, and that solved a few other points. When the design is broken, everything looks weird, right?
Yet to commit: no more file kept.

Thanks a lot for the feedback!

borisbaldassari marked 24 inline comments as done.
  • maven-lister: initialise lister.
  • maven-lister: Refactor file get loop (review D6133).
  • maven-lister: update readme, set tests to use http_index
  • maven-lister: Add statefulness
  • maven-lister: Add incremental task and tests

Build is green

Patch application report for D6133 (id=23788)

Rebasing onto 24bc671679...

First, rewinding head to replay your work on top of it...
Applying: maven-lister: Add maven lister with review D6133.
Applying: maven-lister: Add statefulness
Changes applied before test
commit d72cd83d02f4a9356fd70d00e7f4c69827019287
Author: Boris Baldassari <boris@chrysalice.org>
Date:   Sat Oct 23 11:06:31 2021 +0200

    maven-lister: Add statefulness
    
    maven-lister: Add incremental task and tests

commit da50be30204dc2ceb22c4e919b89a6c8bf3a24ed
Author: Boris Baldassari <boris@chrysalice.org>
Date:   Sat Oct 2 21:20:08 2021 +0200

    maven-lister: Add maven lister with review D6133.
    maven-lister: Fix call to maven loader
    maven-lister: initialise lister.
    maven-lister: initialise lister.
    maven-lister: add docker in requirements.txt..
    maven-lister: refactor to externalize index export.
    maven-lister: fix test_task with new parameters.
    maven-lister: fix mocking of index download.
    maven-lister: add maven to list_cli for testing.
    maven-lister: fix function doc comment for sphinx.
    maven-lister: add extra loader arguments for jar loader output.
    maven-lister: add extra loader arguments for jar loader output.
    maven-lister: add last modif time to extra args for jar files.
    maven-lister: set last modif time to epoch int.
    maven-lister: add changes from review D6133.
    maven-lister: Fix inaccurate comment (review D6133).
    maven-lister: Fix tests (review D6133)
    maven-lister: Fix various comments from review (review D6133).
    maven-lister: add README.md to describe design decisions (review D6133).
    maven-lister: Fix useless storage of file + minor fixes (review D6133).
    maven-lister: Document regexpes (review D6133).
    maven-lister: various small fixes and optimisations (review D6133).
    maven-lister: various small fixes and optimisations (review D6133).
    maven-lister: Fix and add tests (review D6133).
    maven-lister: Refactor file get loop + minor fixes (review D6133).
    
    maven-lister: update readme, set tests to use http_index

See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/403/ for more details.

swh/lister/maven/lister.py
127

Done.

147

Yes, it could. But it seems to me that readability is better this way.
As you say. Want me to do it?

184–186

Ok, I get it. That's probably an old Perl habit to explicitly have hashes of hashes.
Fixed to Dicts, thanks.

219

Thanks for spotting that, it needs some consideration.

Theoretically, no: we're decoding the content of a file downloaded from a local server (transport errors should be ok), which is output by clue asynchronously (i.e. file is not served if the process fails), so corruption or weird content is unlikely, but.. it's still clearly out of our control. So yes, we never know and you're definitely right.

OTOH we can't go on without that data, and I have the feeling we should rather fail (throw an exception about decoding and end execution) than pass silently (adding errors='ignore' to decode could do that). Would you like to try & catch, and then throw a specific error? What would you recommend?

=> added errors='ignore' so the list will simply be empty.
234

Very good point, moved it to the for pom in out_pom loop. Thanks!

253–254

Right, fixed, thank you! :-)

260

Removed.

swh/lister/maven/tests/test_lister.py
53–68

I'm quite new to python fixtures, but as I understand it I'd say yes. The script is expected to http-get these files, so fixtures are the right way to mock it, right? That's what I've seen on other listers too.

Is there a better way to do it?

79–82

Well spotted, thanks. I removed the function and did an assert == on sorted lists. And renamed the variable, you're once again definitely right. Thanks!

95–97

Agreed. Were there once for readability, not relevant anymore. Thanks!

113

Very good point. Changed to src. Thanks!

115–131

Refactored.

148

Added documentation and more tests:

  • http error on maven_index not found,
  • added error 404 error code.
152

Right. Yes, it's expected. Added this comment:

  1. If the maven_index step succeeded but not the get_pom step,
  2. then we get only the 2 jar origins (and not the 2 additional
  3. src origins).