Page MenuHomeSoftware Heritage

Maven Lister
Needs RevisionPublic

Authored by nahimilega on Tue, May 21, 1:21 PM.

Details

Reviewers
douardda
olasd
Group Reviewers
Reviewers
Maniphest Tasks
T1724: Maven Central repository Lister
Summary

Implemented a Maven Central lister which recursively lists the file system of the maven central website ( http://central.maven.org/maven2/) and then extracts the package names and URL from the file system.

Closes T1724

Diff Detail

Repository
rDLS Listers
Branch
maven (branched from master)
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 5865
Build 8034: tox-on-jenkinsJenkins
Build 8033: arc lint + arc unit

Event Timeline

nahimilega created this revision.Tue, May 21, 1:21 PM
nahimilega updated this revision to Diff 4911.Tue, May 21, 1:29 PM

Removed useless commit messages

anlambert added inline comments.
swh/lister/maven_central/lister.py
32

FYI, we are currently renaming the scheduler task names to something more rational (see T1508).
This should be changed to:

_type = 'load-%s' % origin_type
nahimilega updated this revision to Diff 4924.Tue, May 21, 6:07 PM
nahimilega marked an inline comment as done.
  • Added Maven Central lister in README.md and cli.py
douardda requested changes to this revision.Wed, May 22, 9:58 AM
douardda added a subscriber: douardda.

Also, please use the imperative form also for your commit messages.

swh/lister/maven_central/lister.py
25

Please do not put this '(Override)' in your docstrings (here and below).

31–37

A detail, but I'm not sure I see any added value to use variables in this method. Could be:

def task_dict(self, origin_type, origin_url, **kwargs):
    '''[snip]'''
    return utils.create_task_dict(
        'load-%s' % origin_type, 'recurring', 
        kwargs.get('name'), origin_url,
        project_metadata_url=kwargs.get('html_url'))
33–34

I am a bit surprised by these 2 dict.get() here. Aren't these parameters mandatory for the loader task creation and execution to succeed? What is expected to happen if 'name' or 'html_url' are not found in the kwargs dict?

44–46

prefer the list comprehension form:

groups = [b.text for b in soup.find_all('a')]
48–49

Reading this, I'm a bit worried this code is very fragile... Where do these 1 and -6 come from? Why are they needed?

58

what? why are you keeping this transient state (currently visited page) in your instance (as the PAGE attribute)? This seems bad design to me.

Where is this self.PAGE used elsewhere in the class?

69

do not return None if the function/method is not expected to return anything, just use return.

72

s/of/if/ I guess

76–83

I don't like this 'almost duplicated' code here. I believe this method can be simplified and improved.

133

no need for parenthesis here. But more importantly, why no use re.findall()?

This revision now requires changes to proceed.Wed, May 22, 9:58 AM
olasd requested changes to this revision.Wed, May 22, 10:18 AM
olasd added subscribers: ardumont, olasd.

Thanks for this first pass at a Maven lister.

I'm afraid I hadn't had the time to think about the implementation until this morning, but here's a few things that jumped at me:

  • the repository format is "Maven"; "Maven Central" is only one instance of a Maven repository. There's many more public Maven repositories that would be useful to index, for instance Clojars or the Google Android maven repo : https://www.deps.co/guides/public-maven-repositories/. You'll need to rename the lister to "maven", and to modify the code to avoid hard-coding the maven repository root, making it an argument to the task instead (as we will want to list projects for several instances).
  • you went for a scraping approach, which is fine as a last resort. However, a quick search for "maven central index" brought up https://maven.apache.org/repository/central-index.html. Looks like these indexes are available to allow importing the full

It looks like these indexes are available at least for the following maven repositories :

The index also provides an incremental version (referenced in a properties file) which would allow for incremental updates without having to re-download the full index.

The Google repo also has an index https://developer.android.com/studio/build/dependencies.html#google-maven but it looks very different from the other maven repos I've found. However, it's fairly small compared to the others, so it shouldn't be too hard to sort it out as well.

Please investigate the format of these repository indexes, and the data they provide, and see whether that would be suitable for use as the data source for the lister.

As a final note your tests are testing the PyPI lister, which is probably not what you wanted to do! :)

In further tasks, please outline the implementation plan _in the task_, and leave @ardumont or myself some time to acknowledge the plan, before jumping right into the code. It's going to feel a bit slower, but it will avoid you wasting time on less-than-optimal approaches!

douardda added inline comments.Wed, May 22, 11:07 AM
swh/lister/maven_central/lister.py
133

or even just url.split('/') as starting point in fact...

nahimilega marked 2 inline comments as done.Wed, May 22, 10:59 PM
  • the repository format is "Maven"; "Maven Central" is only one instance of a Maven repository. There's many more public Maven repositories that would be useful to index, for instance Clojars or the Google Android maven repo : https://www.deps.co/guides/public-maven-repositories/. You'll need to rename the lister to "maven", and to modify the code to avoid hard-coding the maven repository root, making it an argument to the task instead (as we will want to list projects for several instances).
  • you went for a scraping approach, which is fine as a last resort. However, a quick search for "maven central index" brought up https://maven.apache.org/repository/central-index.html. Looks like these indexes are available to allow importing the full

It looks like these indexes are available at least for the following maven repositories :

The index also provides an incremental version (referenced in a properties file) which would allow for incremental updates without having to re-download the full index.
The Google repo also has an index https://developer.android.com/studio/build/dependencies.html#google-maven but it looks very different from the other maven repos I've found. However, it's fairly small compared to the others, so it shouldn't be too hard to sort it out as well.
Please investigate the format of these repository indexes, and the data they provide, and see whether that would be suitable for use as the data source for the lister.

Thanks for a heads up, I didn't knew about this. I will go through the repository indexes and their provided data and inform you about it by latest.

As a final note your tests are testing the PyPI lister, which is probably not what you wanted to do! :)

Oh sorry, I forgot to change it

In further tasks, please outline the implementation plan _in the task_, and leave @ardumont or myself some time to acknowledge the plan, before jumping right into the code. It's going to feel a bit slower, but it will avoid you wasting time on less-than-optimal approaches!

I will surely remember this.

And I will add your message in the task of MAVEN Lister (T1724) if you don't mind.

nahimilega retitled this revision from Maven Central Lister to Maven Lister.Tue, May 28, 7:50 PM