Page MenuHomeSoftware Heritage

lister: Add new maven lister
ClosedPublic

Authored by borisbaldassari on Oct 2 2021, 9:21 PM.

Details

Summary

The Maven lister retrieves the maven central indexes, exports them in a
convenient text format, and parse them to identify all src archives and
pom files in the maven repository. Then the pom files are downloaded and
analysed to find and yield any scm reference.

Note: This is a new version of the maven lister diff D6133 which takes
into account the initial round of reviews.

Related to T1724

Diff Detail

Repository
rDLS Listers
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

I will let you know when the full chain is executed.

Well, if works well enough to fill up my vm's hard drive. I'm not sure how to check
the actual loaded content, but it took Gigabytes silently running behind my back.

Hello, that's great news. It means you the chaining works then ;)

I'll add the diffs to review/land/deploy in my weekly plan for next week.

I need to check your last modifications, I did not yet had the chance to do it.

I don't think I can test further the full listing and ingestion; my own little
(bandwidth/resources) limits have been reached.

It's fine. Thanks already.

What shall we do next please?

I'd say if we are confident enough it does the right thing and it's well covered (I
think it is right now) you should be able to rebase all your diffs on latest master. And
then "land" [1] them!

Then we need to prepare our infrastructure to deploy the lister and loader. And then
make the "full chain" (lister + loader) run on the staging infra first.

That should lift the resource limit.

[1] git pull --rebase && git push origin master

Cheers,

I'd say if we are confident enough it does the right thing and it's well covered (I
think it is right now) you should be able to rebase all your diffs on latest master. And
then "land" [1] them!

Fine. I'll wait for your review and land the diff when you tell me to then.

Then we need to prepare our infrastructure to deploy the lister and loader. And then
make the "full chain" (lister + loader) run on the staging infra first.

Good.

For the records, I'm currently working on a list of maven servers that could be retrieved with this lister, i.e. servers that use the same version of maven indexer and lucene indexes as the central.
That should be ready soon.

@borisbaldassari I've renamed T1724 and gave a rather succinct summary

Can you please improve it a bit the summary, especially the part about the required
module that feeds data to the lister?

It would also definitely help to open the diff on the docker environment you had to make
to actually execute the full chain of indexer/lister/loader.

Thanks in advance.

I had another round.
I've added suggestions about docs, fixing log instructions, etc...

I also have one question about the possibility to retrieve the last_update to provide to the ListedOrigin constructor.

Otherwise, i'm still fine with it.

Please, just fix the papercuts i mentioned ;)

swh/lister/maven/lister.py
146–166

integrate this into the docstring ^ (dropping the comment) so
that makes it into the documentation site as well.

147

The link is missing, can you please add it.

180

Please, add a fixme explaining we should raise instead and let's move on.
So we can actually see what happens when deployed.

219–221
222

log instruction.

245

debug log instruction, please

297

if unused, drop it?

324

log ;)

354

Is it possible to retrieve the last_update as well [1]?

What's the timestamp present in the output of the indexer?
Could it be used for that purpose (parsing it first)?

[1] The last_update is actually used for scheduling purposes.

378

same question about last_update (see previous point)

384

same question about last_update (see previous point)

swh/lister/maven/tests/test_lister.py
42

My previous question about the "time", what is this time value?

This revision now requires changes to proceed.Nov 22 2021, 1:27 PM
borisbaldassari marked 9 inline comments as done.
  • maven-loader: fix last_update

Build has FAILED

Patch application report for D6395 (id=24228)

Rebasing onto 97553d8984...

First, rewinding head to replay your work on top of it...
Applying: maven-lister: Add maven lister with review D6133.
Applying: maven-lister: Add statefulness
Applying: maven-lister: Fix README, 88 chars, minor typos
Applying: maven-lister: Split parsing and processing
Applying: maven-lister: Fix groupId bug, add tests.
Applying: maven-loader: fix last_update
Changes applied before test
commit c8d523e9725a0971e36b5b77f75571e419acf866
Author: Boris Baldassari <boris@chrysalice.org>
Date:   Tue Nov 23 07:49:13 2021 +0100

    maven-loader: fix last_update

commit 3a429b74fb2bb2d1fac4097540bcb34decc5fdac
Author: Boris Baldassari <boris@chrysalice.org>
Date:   Tue Nov 2 15:16:56 2021 +0100

    maven-lister: Fix groupId bug, add tests.

commit 4380c017697cb8d07fac3fb6befae5b5112c6ec2
Author: Boris Baldassari <boris@chrysalice.org>
Date:   Fri Oct 29 15:29:31 2021 +0200

    maven-lister: Split parsing and processing

commit 06c5b10a52ab863fff251a077e9840c8b1f2c0d8
Author: Boris Baldassari <boris@chrysalice.org>
Date:   Wed Oct 27 21:45:35 2021 +0200

    maven-lister: Fix README, 88 chars, minor typos
    maven-lister: Add datatype to describe RepoPage
    maven-lister: Various readability and minor fixes
    maven-lister: Various readability and minor fixes

commit 96ad71ef6f43a2dbe1d5f270ba3ce3b63df075ef
Author: Boris Baldassari <boris@chrysalice.org>
Date:   Sat Oct 23 11:06:31 2021 +0200

    maven-lister: Add statefulness
    
    maven-lister: Add incremental task and tests
    
    maven-lister: Improve pom/doc recovery with incremental

commit 29adf6740acecf3d89af580d50d311dc97753809
Author: Boris Baldassari <boris@chrysalice.org>
Date:   Sat Oct 2 21:20:08 2021 +0200

    maven-lister: Add maven lister with review D6133.
    maven-lister: Fix call to maven loader
    maven-lister: initialise lister.
    maven-lister: initialise lister.
    maven-lister: add docker in requirements.txt..
    maven-lister: refactor to externalize index export.
    maven-lister: fix test_task with new parameters.
    maven-lister: fix mocking of index download.
    maven-lister: add maven to list_cli for testing.
    maven-lister: fix function doc comment for sphinx.
    maven-lister: add extra loader arguments for jar loader output.
    maven-lister: add extra loader arguments for jar loader output.
    maven-lister: add last modif time to extra args for jar files.
    maven-lister: set last modif time to epoch int.
    maven-lister: add changes from review D6133.
    maven-lister: Fix inaccurate comment (review D6133).
    maven-lister: Fix tests (review D6133)
    maven-lister: Fix various comments from review (review D6133).
    maven-lister: add README.md to describe design decisions (review D6133).
    maven-lister: Fix useless storage of file + minor fixes (review D6133).
    maven-lister: Document regexpes (review D6133).
    maven-lister: various small fixes and optimisations (review D6133).
    maven-lister: various small fixes and optimisations (review D6133).
    maven-lister: Fix and add tests (review D6133).
    maven-lister: Refactor file get loop + minor fixes (review D6133).
    
    maven-lister: update readme, set tests to use http_index

Link to build: https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/411/
See console output for more information: https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/411/console

borisbaldassari marked 4 inline comments as done.
  • maven-lister: Add maven lister with review D6133.
  • maven-lister: Add statefulness
  • maven-lister: Fix README, 88 chars, minor typos
  • maven-lister: Split parsing and processing
  • maven-lister: Fix groupId bug, add tests.
  • maven-loader: fix last_update

Build has FAILED

Patch application report for D6395 (id=24230)

Rebasing onto 97553d8984...

Current branch diff-target is up to date.
Changes applied before test
commit df26572cffa69de80efffcbf0f95b64c4478c1db
Author: Boris Baldassari <boris@chrysalice.org>
Date:   Tue Nov 23 07:49:13 2021 +0100

    maven-loader: fix last_update

commit c4286e5726a7f6aa34aa2d2f24cf248755b90edf
Author: Boris Baldassari <boris@chrysalice.org>
Date:   Tue Nov 2 15:16:56 2021 +0100

    maven-lister: Fix groupId bug, add tests.

commit 6480c41961b70cdb487cbe99bf380f5d18930252
Author: Boris Baldassari <boris@chrysalice.org>
Date:   Fri Oct 29 15:29:31 2021 +0200

    maven-lister: Split parsing and processing

commit 329e79b835045efee30556aae23be253833c0c26
Author: Boris Baldassari <boris@chrysalice.org>
Date:   Wed Oct 27 21:45:35 2021 +0200

    maven-lister: Fix README, 88 chars, minor typos
    maven-lister: Add datatype to describe RepoPage
    maven-lister: Various readability and minor fixes
    maven-lister: Various readability and minor fixes

commit a0d11e5e91c0eac6211fb6204be5b21d4be8b803
Author: Boris Baldassari <boris@chrysalice.org>
Date:   Sat Oct 23 11:06:31 2021 +0200

    maven-lister: Add statefulness
    
    maven-lister: Add incremental task and tests
    
    maven-lister: Improve pom/doc recovery with incremental

commit 43aacdfecbd74c01afe10af04ab4ee58bd1879e2
Author: Boris Baldassari <boris@chrysalice.org>
Date:   Sat Oct 2 21:20:08 2021 +0200

    maven-lister: Add maven lister with review D6133.
    maven-lister: Fix call to maven loader
    maven-lister: initialise lister.
    maven-lister: initialise lister.
    maven-lister: add docker in requirements.txt..
    maven-lister: refactor to externalize index export.
    maven-lister: fix test_task with new parameters.
    maven-lister: fix mocking of index download.
    maven-lister: add maven to list_cli for testing.
    maven-lister: fix function doc comment for sphinx.
    maven-lister: add extra loader arguments for jar loader output.
    maven-lister: add extra loader arguments for jar loader output.
    maven-lister: add last modif time to extra args for jar files.
    maven-lister: set last modif time to epoch int.
    maven-lister: add changes from review D6133.
    maven-lister: Fix inaccurate comment (review D6133).
    maven-lister: Fix tests (review D6133)
    maven-lister: Fix various comments from review (review D6133).
    maven-lister: add README.md to describe design decisions (review D6133).
    maven-lister: Fix useless storage of file + minor fixes (review D6133).
    maven-lister: Document regexpes (review D6133).
    maven-lister: various small fixes and optimisations (review D6133).
    maven-lister: various small fixes and optimisations (review D6133).
    maven-lister: Fix and add tests (review D6133).
    maven-lister: Refactor file get loop + minor fixes (review D6133).
    
    maven-lister: update readme, set tests to use http_index

Link to build: https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/412/
See console output for more information: https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/412/console

Hum. I don't understand why the tests fail on the sourceforge lister when it's not listed in my diff. Any idea please?

(note that the tests pass on my setup, including with the git pull --rebase).

swh/lister/maven/lister.py
146–166

Of course, it's far better. Thanks.

Also added the link to the README.md file in the current directory.

180

Ok, thanks.

Added a fixme with a link to this very place.

324

hum.. yeah, ok, I got it, sorry.. ;-)

354

The timestamp is indeed the last update time of the artefact, see the Maven indexer documentation for this release. So I guess it can be used for that purpose, yes.

I'll add the parameter (with some parsing) and submit the fix for all 3 following comments too. Thanks.

swh/lister/maven/tests/test_lister.py
42

Improved comment in loader.py to explain this variable: it's the last update time of the artefact on the server.

Thanks for the update.

Hum. I don't understand why the tests fail on the sourceforge lister when it's not
listed in my diff. Any idea please? (note that the tests pass on my setup, including
with the git pull --rebase).

It's a type issue unrelated to your diff, the master build must be broken as well. Since
nothing changed on the lister for a while, I guess a new mypy release (with some
internal change) make that type now break.

Don't worry about it, I'll have a look to fix master, push and then you can rebase so
your build gets green again.

It's a type issue unrelated to your diff, the master build must be broken as well. Since
nothing changed on the lister for a while, I guess a new mypy release (with some
internal change) make that type now break.

Don't worry about it, I'll have a look to fix master, push and then you can rebase so
your build gets green again.

Heads up, I was correct and that'd be D6666 for the fix.

Heads up, I was correct and that'd be D6666 for the fix.

Thanks @ardumont for the update. I'll do that Thursday (I'm busy until then).

Heads up, I was correct and that'd be D6666 for the fix.

Thanks @ardumont for the update. I'll do that Thursday (I'm busy until then).

Ack, sure.

fwiw, you can now rebase as i landed the diff.
Your build should get back to its feet ;)

Cheers,

ardumont retitled this revision from maven-lister: Add maven lister with review to lister: Add new maven lister.Nov 23 2021, 2:45 PM
ardumont edited the summary of this revision. (Show Details)
  • maven-lister: Add maven lister with review D6133.
  • maven-lister: Add statefulness
  • maven-lister: Fix README, 88 chars, minor typos
  • maven-lister: Split parsing and processing
  • maven-lister: Fix groupId bug, add tests.
  • maven-loader: fix last_update

Build has FAILED

Patch application report for D6395 (id=24302)

Rebasing onto 3ffea8f525...

Current branch diff-target is up to date.
Changes applied before test
commit ebde9f41646e5a54e0a0c8232079e4f83e8a7185
Author: Boris Baldassari <boris@chrysalice.org>
Date:   Tue Nov 23 07:49:13 2021 +0100

    maven-loader: fix last_update

commit 5ebd9f26a1dc96f66e476855d81e8c00836f4c75
Author: Boris Baldassari <boris@chrysalice.org>
Date:   Tue Nov 2 15:16:56 2021 +0100

    maven-lister: Fix groupId bug, add tests.

commit bee0c0c68a05e7334c44e09b4e5398bbf6516684
Author: Boris Baldassari <boris@chrysalice.org>
Date:   Fri Oct 29 15:29:31 2021 +0200

    maven-lister: Split parsing and processing

commit 4afb2f5b7b602cf855ec3d612728ad782922779d
Author: Boris Baldassari <boris@chrysalice.org>
Date:   Wed Oct 27 21:45:35 2021 +0200

    maven-lister: Fix README, 88 chars, minor typos
    maven-lister: Add datatype to describe RepoPage
    maven-lister: Various readability and minor fixes
    maven-lister: Various readability and minor fixes

commit 6c2ce03bc43f85c18f71187d59aadf6a96ccaec3
Author: Boris Baldassari <boris@chrysalice.org>
Date:   Sat Oct 23 11:06:31 2021 +0200

    maven-lister: Add statefulness
    
    maven-lister: Add incremental task and tests
    
    maven-lister: Improve pom/doc recovery with incremental

commit 4f1574b22154c9e453547d6044af50978d324331
Author: Boris Baldassari <boris@chrysalice.org>
Date:   Sat Oct 2 21:20:08 2021 +0200

    maven-lister: Add maven lister with review D6133.
    maven-lister: Fix call to maven loader
    maven-lister: initialise lister.
    maven-lister: initialise lister.
    maven-lister: add docker in requirements.txt..
    maven-lister: refactor to externalize index export.
    maven-lister: fix test_task with new parameters.
    maven-lister: fix mocking of index download.
    maven-lister: add maven to list_cli for testing.
    maven-lister: fix function doc comment for sphinx.
    maven-lister: add extra loader arguments for jar loader output.
    maven-lister: add extra loader arguments for jar loader output.
    maven-lister: add last modif time to extra args for jar files.
    maven-lister: set last modif time to epoch int.
    maven-lister: add changes from review D6133.
    maven-lister: Fix inaccurate comment (review D6133).
    maven-lister: Fix tests (review D6133)
    maven-lister: Fix various comments from review (review D6133).
    maven-lister: add README.md to describe design decisions (review D6133).
    maven-lister: Fix useless storage of file + minor fixes (review D6133).
    maven-lister: Document regexpes (review D6133).
    maven-lister: various small fixes and optimisations (review D6133).
    maven-lister: various small fixes and optimisations (review D6133).
    maven-lister: Fix and add tests (review D6133).
    maven-lister: Refactor file get loop + minor fixes (review D6133).
    
    maven-lister: update readme, set tests to use http_index

Link to build: https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/415/
See console output for more information: https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/415/console

I noticed you kept all your commits.
Can you please squash them into one?

Also the commit message should be amended.
I've tried to synthesize in the diff description so maybe it's good enough.

Cheers,

Build is green

Patch application report for D6395 (id=24307)

Rebasing onto 3ffea8f525...

Current branch diff-target is up to date.
Changes applied before test
commit fd3eea58846813cd5fbffa457dc4065f31c0f778
Author: Boris Baldassari <boris@chrysalice.org>
Date:   Sat Oct 2 21:20:08 2021 +0200

    maven-lister: Add maven lister as per T1724 and D6395.

See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/416/ for more details.

Can you please squash them into one?
Also the commit message should be amended.

Done.

I've tried to synthesize in the diff description so maybe it's good enough.

Yes, LGTM.

Cheers,

Cheers!

Can you please squash them into one?
Also the commit message should be amended.

Done.

Not totally. I meant, can you please change the commit message (first line) to

lister: Add new maven lister

And then for the body of the commit copy the actual content of the diff description?
(It's in there that you can refer to other tasks.)

Note: The diff and the commit message are actually 2 distincts entities. It's just that
you do the commit and then later when you open a diff, arcanist by default will use the
commit to pre-fill the diff description. After that, it's up to the user to actually
synchronize them.

Thanks in advance.

Not totally. I meant, can you please change the commit message (first line) to

Done. :-)

Build is green

Patch application report for D6395 (id=24310)

Rebasing onto 3ffea8f525...

Current branch diff-target is up to date.
Changes applied before test
commit 7b9ffa67412493619783be5e9129f5119c51308b
Author: Boris Baldassari <boris@chrysalice.org>
Date:   Sat Oct 2 21:20:08 2021 +0200

    lister: Add new maven lister
    
    The Maven lister retrieves the maven central indexes, exports them in a
    convenient text format, and parse them to identify all src archives and
    pom files in the maven repository. Then the pom files are downloaded and
    analysed to find and yield any scm reference.
    
    Note: This is a new version of the maven lister diff D6133 which takes
    into account the initial round of reviews.
    
    Related to T1724

See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/418/ for more details.

Awesome, thanks for the last updates. LGTM.

I just read back and have one question about the page.time (that we pass as an int timestamp) inline.

Can you please check that prior to land it [1]?

Thanks in advance.

[1] It's a tad important since if we change after, it we'll need yhet another coordination
deployment between the lister and loader.

swh/lister/maven/lister.py
371

can't we make it the same value as last_update in string so as to avoid to have to do the same computation as line 358 in the loader?

Make repo.time an iso8601 string

Build has FAILED

Patch application report for D6395 (id=24311)

Rebasing onto 3ffea8f525...

Current branch diff-target is up to date.
Changes applied before test
commit 582d1340eee41f825a3dbea4209d91e1725d3a2a
Author: Boris Baldassari <boris@chrysalice.org>
Date:   Sat Oct 2 21:20:08 2021 +0200

    lister: Add new maven lister
    
    The Maven lister retrieves the maven central indexes, exports them in a
    convenient text format, and parse them to identify all src archives and
    pom files in the maven repository. Then the pom files are downloaded and
    analysed to find and yield any scm reference.
    
    Note: This is a new version of the maven lister diff D6133 which takes
    into account the initial round of reviews.
    
    Related to T1724

Link to build: https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/419/
See console output for more information: https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/419/console

Thanks for the update.

I'm ready to accept the diff, just make the build go green again ;)
I've suggested how.

Cheers,

swh/lister/maven/lister.py
357–362

Optional[datetime] is Union[None, datetime] but in shorter form ;)

Also to fix the current build failure [1], you need to ensure page.time is not None, hence the suggested assert.

[1]

11:12:55  swh/lister/maven/lister.py:360: error: Argument 1 to "parse_date" has incompatible type "Optional[str]"; expected "str"
borisbaldassari marked an inline comment as done.
  • maven-lister: set time to be iso8601 date str

Build has FAILED

Patch application report for D6395 (id=24335)

Rebasing onto 3ffea8f525...

Current branch diff-target is up to date.
Changes applied before test
commit 498824baa47751c594f2492b519a05c472c3440e
Author: Boris Baldassari <boris@chrysalice.org>
Date:   Fri Nov 26 16:11:37 2021 +0100

    maven-lister: set time to be iso8601 date str

commit 582d1340eee41f825a3dbea4209d91e1725d3a2a
Author: Boris Baldassari <boris@chrysalice.org>
Date:   Sat Oct 2 21:20:08 2021 +0200

    lister: Add new maven lister
    
    The Maven lister retrieves the maven central indexes, exports them in a
    convenient text format, and parse them to identify all src archives and
    pom files in the maven repository. Then the pom files are downloaded and
    analysed to find and yield any scm reference.
    
    Note: This is a new version of the maven lister diff D6133 which takes
    into account the initial round of reviews.
    
    Related to T1724

Link to build: https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/420/
See console output for more information: https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/420/console

borisbaldassari marked an inline comment as done.

Fix timezone issue, rebase.

Build is green

Patch application report for D6395 (id=24338)

Rebasing onto 3ffea8f525...

Current branch diff-target is up to date.
Changes applied before test
commit e36342fbc51689f847e19c62a52e1e2372992e4e
Author: Boris Baldassari <boris@chrysalice.org>
Date:   Sat Oct 2 21:20:08 2021 +0200

    lister: Add new maven lister
    
    The Maven lister retrieves the maven central indexes, exports them in a
    convenient text format, and parse them to identify all src archives and
    pom files in the maven repository. Then the pom files are downloaded and
    analysed to find and yield any scm reference.
    
    Note: This is a new version of the maven lister diff D6133 which takes
    into account the initial round of reviews.
    
    Related to T1724

See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/421/ for more details.

This revision is now accepted and ready to land.Nov 26 2021, 6:18 PM

To land this, you just need to push

git push origin master

If the origin/master has not been updated since your last diff update, this will close the diff automatically.
If not, you just need to rebase your master on top of origin/master, update the diff first and then push.

Cheers,

Build is green

Patch application report for D6395 (id=24364)

Rebasing onto 3ffea8f525...

Current branch diff-target is up to date.
Changes applied before test
commit e36342fbc51689f847e19c62a52e1e2372992e4e
Author: Boris Baldassari <boris@chrysalice.org>
Date:   Sat Oct 2 21:20:08 2021 +0200

    lister: Add new maven lister
    
    The Maven lister retrieves the maven central indexes, exports them in a
    convenient text format, and parse them to identify all src archives and
    pom files in the maven repository. Then the pom files are downloaded and
    analysed to find and yield any scm reference.
    
    Note: This is a new version of the maven lister diff D6133 which takes
    into account the initial round of reviews.
    
    Related to T1724

See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/422/ for more details.

Rebase with latest changes from master

Build is green

Patch application report for D6395 (id=24375)

Rebasing onto 3ffea8f525...

Current branch diff-target is up to date.
Changes applied before test
commit 8991c625ea560cbb55af35cf8adc208b63ae5fc1
Author: Boris Baldassari <boris@chrysalice.org>
Date:   Sat Oct 2 21:20:08 2021 +0200

    lister: Add new maven lister
    
    The Maven lister retrieves the maven central indexes, exports them in a
    convenient text format, and parse them to identify all src archives and
    pom files in the maven repository. Then the pom files are downloaded and
    analysed to find and yield any scm reference.
    
    Note: This is a new version of the maven lister diff D6133 which takes
    into account the initial round of reviews.
    
    Related to T1724

See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/423/ for more details.

This revision was automatically updated to reflect the committed changes.