
Maven Central repository Lister
Open, Normal, Public

Details

Differential Revisions
D1497: Maven Lister

Event Timeline

nahimilega created this object in space S1 Public.
nahimilega triaged this task as Normal priority.
nahimilega renamed this task from Maven Central (JAVA) lister to Maven Lister. Edited May 22 2019, 11:01 PM
nahimilega added subscribers: ardumont, olasd.

Comment by @olasd

  • the repository format is "Maven"; "Maven Central" is only one instance of a Maven repository. There are many more public Maven repositories that would be useful to index, for instance Clojars or the Google Android Maven repo: https://www.deps.co/guides/public-maven-repositories/. You'll need to rename the lister to "maven", and modify the code so that the Maven repository root is no longer hard-coded but passed as an argument to the task (as we will want to list projects for several instances).
  • you went for a scraping approach, which is fine as a last resort. However, a quick search for "maven central index" brought up https://maven.apache.org/repository/central-index.html. Looks like these indexes are available to allow importing the full

It looks like these indexes are available at least for the following Maven repositories:

The index also provides an incremental version (referenced in a properties file) which would allow for incremental updates without having to re-download the full index.
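A minimal Python sketch of how the incremental scheme could be consumed with the repository root passed in as an argument. It assumes the properties file follows the Maven Indexer convention of a `nexus.index.last-incremental` counter and chunk files named `nexus-maven-repository-index.<N>.gz` under `.index/`; those key and file names are assumptions not confirmed in this thread, and `parse_properties` and `chunks_to_fetch` are hypothetical helpers.

```python
from typing import Dict, List

# Assumed names, following Maven Indexer conventions; verify them against the
# actual repository before relying on them.
INDEX_BASENAME = "nexus-maven-repository-index"
LAST_INCREMENTAL_KEY = "nexus.index.last-incremental"


def parse_properties(text: str) -> Dict[str, str]:
    """Parse the simple key=value subset of a Java .properties file."""
    props: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and the two comment styles used by .properties files.
        if not line or line.startswith(("#", "!")):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def chunks_to_fetch(base_url: str, props: Dict[str, str], last_seen: int) -> List[str]:
    """Return the URLs of incremental chunks newer than last_seen."""
    latest = int(props[LAST_INCREMENTAL_KEY])
    root = base_url.rstrip("/") + "/.index/"
    return [f"{root}{INDEX_BASENAME}.{n}.gz" for n in range(last_seen + 1, latest + 1)]
```

Keeping `base_url` as a parameter means the same helper could be pointed at any repository that publishes an index in this layout, with `last_seen` stored by the lister between runs.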

The Google repo also has an index https://developer.android.com/studio/build/dependencies.html#google-maven but it looks very different from the other maven repos I've found. However, it's fairly small compared to the others, so it shouldn't be too hard to sort it out as well.

Please investigate the format of these repository indexes, and the data they provide, and see whether that would be suitable for use as the data source for the lister.

nahimilega updated the task description. May 22 2019, 11:02 PM

As recommended by @olasd, I checked out the Maven Central index ( https://repo.maven.apache.org/maven2/.index/); this is a

This, when extracted, is a Lucene-format file. Instructions for extracting and viewing the data are at https://maven.apache.org/repository/central-index.html.
I converted this Lucene-format file to an XML file using the GUI mentioned at that link. Here is a short snippet of the XML file
(https://forge.softwareheritage.org/P405)

The 'u' field of each entry gives the artifact's place in the repository's file structure:

<doc id='40'>
<field name='1' flags='Idfp--S--------------'>
<val>aa5edf398f7c9451782d3f1aac68f11b91aa8469</val>
</field>
<field name='i' flags='------S--------------'>
<val>tar.gz|1122889985000|23943|2|2|0|tar.gz</val>
</field>
<field name='m' flags='------S--------------'>
<val>1318434019343</val>
</field>
<field name='u' flags='Idfp--S--------------'>
<val>xstream|xstream|0.2|src|tar.gz</val>
</field>
</doc>
<doc id='41'>
<field name='i' flags='------S--------------'>
<val>distribution-zip|1162775420000|62884|2|2|0|distribution-zip</val>
</field>
<field name='m' flags='------S--------------'>
<val>1318434019344</val>
</field>
<field name='u' flags='Idfp--S--------------'>
<val>xstream|xstream|0.2|src|distribution-zip</val>
</field>
</doc>
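For reference, a minimal Python sketch of reading such a dump with the standard library, assuming the input is a run of `<doc>` elements shaped like the snippet above (no XML declaration, no enclosing root). `parse_docs` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List


def parse_docs(xml_fragment: str) -> List[Dict[str, str]]:
    """Turn a run of <doc> elements into one {field name: value} dict per doc."""
    # The fragment has no single root element, so wrap it in a synthetic one.
    root = ET.fromstring(f"<docs>{xml_fragment}</docs>")
    records = []
    for doc in root.iter("doc"):
        record = {}
        for field in doc.findall("field"):
            val = field.find("val")
            if val is not None and val.text is not None:
                record[field.get("name")] = val.text
        records.append(record)
    return records
```

Each record then exposes the raw 'u', 'i', 'm' and '1' values by name; fields that are absent from a doc are simply missing from its dict.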

It would be tricky, however, to find a general way to iterate over all the entries and derive the package link, because some entries are irregular, like this one:

<doc id='31'>
<field name='1' flags='Idfp--S--------------'>
<val>e4336eeae47ef9762ef441ffc54a630ce7df1887</val>
</field>
<field name='i' flags='------S--------------'>
<val>zip|1122889986000|1298621|0|0|0|zip</val>
</field>
<field name='m' flags='------S--------------'>
<val>1318434019247</val>
</field>
<field name='u' flags='Idfp--S--------------'>
<val>xstream|xstream|1.0-rc1|NA</val>
</field>
</doc>

Comparing the two XML snippets, we can see that one states the file type after the version and the other doesn't, and there are many more kinds of irregularities.

Moreover, I could not find a way to extract the zipped index file into a Lucene-format file and convert it to XML using Python.
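One possible reading of the two kinds of entries, sketched in Python. It assumes the 'u' value is groupId|artifactId|version|classifier with an optional trailing extension, that "NA" stands for "no classifier", and that the last slot of the 'i' value carries the extension when 'u' omits it. This layout is an assumption drawn from the snippets and from Maven Indexer conventions, not something confirmed in this thread; `Coordinates`, `parse_coordinates` and `repository_path` are hypothetical names.

```python
from typing import NamedTuple, Optional


class Coordinates(NamedTuple):
    group_id: str
    artifact_id: str
    version: str
    classifier: Optional[str]
    extension: str


def parse_coordinates(u_value: str, i_value: str) -> Coordinates:
    """Combine the 'u' and 'i' field values into one set of coordinates."""
    u_parts = u_value.split("|")
    group_id, artifact_id, version = u_parts[:3]
    # Assumption: the fourth slot is a classifier and "NA" means "none".
    classifier = u_parts[3] if len(u_parts) > 3 and u_parts[3] != "NA" else None
    if len(u_parts) > 4:
        extension = u_parts[4]
    else:
        # Assumption: when 'u' omits it, the last slot of 'i' is the extension.
        extension = i_value.split("|")[-1]
    return Coordinates(group_id, artifact_id, version, classifier, extension)


def repository_path(c: Coordinates) -> str:
    """Build a path following the usual Maven repository layout."""
    suffix = f"-{c.classifier}" if c.classifier else ""
    group_path = c.group_id.replace(".", "/")
    return (
        f"{group_path}/{c.artifact_id}/{c.version}/"
        f"{c.artifact_id}-{c.version}{suffix}.{c.extension}"
    )
```

Values such as distribution-zip look like packaging names rather than file extensions, so any path derived this way would still need to be checked against the repository.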

Extending on what I wrote in the previous comment, I did a bit more research about this.

I found that if we use Java, the job can be done quickly; there are many examples of building a self-updating indexer for Maven Central (which could be extended to other repositories), like this one: https://github.com/apache/maven-indexer/tree/master/indexer-examples/indexer-examples-basic
But I am afraid I didn't find any way to do the job entirely in Python.

Another interesting thing I found is that the code I wrote to scrape Maven Central can be adapted to all the Maven repositories (except the Google Android Maven repository) with some minor changes. Scraping is pretty much the same for all of them, although there are some differences in file structure that need to be addressed.

So shall I proceed with the Java approach or the scraping approach?

It is not recommended that you scrape or rsync:// a full copy of central as there is a large amount of data there and doing so will get you banned. You can use a program such as those described on the Repository Management page to run your internal repository's server, download from the internet as required, and then hold the artifacts in your internal repository for faster downloading later.

This is mentioned on the https://maven.apache.org/guides/introduction/introduction-to-repositories.html page, in the "Setting up the Internal Repository" section.
Is this a matter of concern?

Creating Your Own Mirror
The size of the central repository is increasing steadily. To save us bandwidth and you time, mirroring the entire central repository is not allowed. (Doing so will get you automatically banned.) Instead, we suggest you setup a repository manager as a proxy.

If you really want to become an official mirror, contact us at MVNCENTRAL with your location and we'll work to get you setup.

As mentioned in https://maven.apache.org/guides/mini/guide-mirror-settings.html

hboutemy renamed this task from Maven Lister to Maven Central repository Lister. Jun 2 2019, 12:10 PM

Hi,

I renamed the issue title to "Maven Central repository Lister" if the intent is to focus on this repository https://maven.apache.org/repository/index.html

A few years ago I discussed the Maven Central repository content with Roberto di Cosmo. Since then, I have worked on Reproducible Builds for the JVM: https://reproducible-builds.org/docs/jvm/
There is one key issue with the Central repository that I documented again on that page:

Notice that ${artifactId}-${version}-sources.jar files published in Maven repositories are not buildable sources, but sources for IDEs. Source tarballs, intended for building, are not always published in repositories but only sometimes, with 2 classical naming conventions:

${artifactId}-${version}-source-release.zip (see artifacts in Central providing such source tarballs) https://search.maven.org/search?q=l:source-release
${artifactId}-${version}-src.zip (see artifacts in Central providing such source tarballs) https://search.maven.org/search?q=l:src

This issue is probably the same with every repository in Maven format: in general, there are no buildable source files, at most IDE-oriented -sources.jar files.
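To illustrate, a small Python sketch that sorts file names according to the conventions quoted above. The suffixes come from that text; the label strings and the `classify_source_artifact` helper are hypothetical.

```python
from typing import Optional

# Suffixes from the two source-tarball conventions, plus the IDE-oriented jar.
# The label values on the right are made up for this sketch.
SOURCE_SUFFIXES = {
    "-source-release.zip": "source-release",
    "-src.zip": "src",
    "-sources.jar": "ide-sources",
}


def classify_source_artifact(filename: str, artifact_id: str, version: str) -> Optional[str]:
    """Return a label if filename is ${artifactId}-${version} plus a known suffix."""
    prefix = f"{artifact_id}-{version}"
    if not filename.startswith(prefix):
        return None
    return SOURCE_SUFFIXES.get(filename[len(prefix):])
```

A lister could keep only the entries labelled "source-release" or "src" and treat "ide-sources" separately.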

IMHO, what you can get is just the scm entry of every pom, hoping that people have filled them in properly: I can provide you a tar.bz2 archive that is only 304 MB with absolutely every pom.xml from Central, so you can check before working on becoming a full Central mirror.

I renamed the issue title to "Maven Central repository Lister" if the intent is to focus on this repository https://maven.apache.org/repository/index.html

Actually, we are trying to figure out some method which can work on all the maven instances.

Notice that ${artifactId}-${version}-sources.jar files published in Maven repositories are not buildable sources, but sources for IDEs. Source tarballs, intended for building, are not always published in repositories but only sometimes, with two classical naming conventions:

${artifactId}-${version}-source-release.zip (see artifacts in Central providing such source tarballs) https://search.maven.org/search?q=l:source-release
${artifactId}-${version}-src.zip (see artifacts in Central providing such source tarballs) https://search.maven.org/search?q=l:src

This issue is probably the same with every repository in Maven format: in general, there are no buildable source files, at most IDE-oriented -sources.jar files.

Thanks for this information; we are interested in archiving source code. If the files that follow these naming conventions contain the source code, then we only need to list these types of files.

If you search for source-release or src, yes, you'll find archivable sources.
The only issue is that you won't find much content...

olasd added a comment. Jun 3 2019, 4:18 PM

Thanks a lot to @hboutemy for your valuable insights on sources in the Maven central repository, and for the pointer to Reproducible Builds on the JVM.

I think there's value in making the Maven lister support more Maven repositories than just Maven Central, even if we focus on Maven Central as the first proof of concept.

In my opinion, the most sensible way to list the projects in a Maven repository would be to use the Lucene indexes, but that poses an architectural challenge:

  • the infrastructure available for using Lucene indexes from Python code is not very practical: the official PyLucene package isn't pip-installable and depends on having a JVM and the Lucene jars available
  • adding yet another language component (any JVM-compatible language which would allow us to use the native Lucene libraries) to the current lister infrastructure needs some forethought: this adds complexity to the Docker development environment we're currently using, and it adds complexity to the production environment as well.

We're going to need to take a step back, and look at these challenges as a team, before we move further on this lister.

Coming back to the fact that Maven itself doesn't hold the sources for a lot of projects: these questions are more for @hboutemy, as I expect he has far more experience with Maven than all of us combined ;)

  • What is present in the "IDE sources" (-sources.jar files) and how are they different from source releases? Would there be value in archiving them as a last resort?
  • From the search links you provided, 71k artifacts (packages ?) provide a -source-release.zip, and 9k artifacts a -src.zip. Looks like overall it would make sense to archive these.
  • Do you have an idea of how many artifacts reference an upstream source control repository that we could use to archive their source code?

It might make sense for the Maven lister to generate both

  • origins for artifacts with a source-release/src version
  • origins for upstream source control repositories

But that'll only make sense if extracting the upstream source control repo information doesn't cost us too much (i.e. I don't think parsing all the new pom files to get repository information will be very efficient - this kinda matches the problem we're having with the NuGet lister).

If the repository information isn't available in an easily harvestable format, getting a tarball of all the pom files and parsing them (as a one-off) to find potentially interesting new data sources (that is, forges that aren't available in our regular processing yet) could be useful.
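A rough Python sketch of what harvesting the scm entry from a single pom could look like, using only the standard library. It assumes the scm block sits directly in the pom being parsed (an scm block declared in a parent pom would not be seen), and `extract_scm` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET
from typing import Dict

# Namespace used by Maven 4.0.0 POM files.
POM_NS = {"m": "http://maven.apache.org/POM/4.0.0"}


def extract_scm(pom_xml: str) -> Dict[str, str]:
    """Return the non-empty scm sub-elements (url, connection, ...) of a POM."""
    root = ET.fromstring(pom_xml)
    scm = root.find("m:scm", POM_NS)
    if scm is None:
        # Fall back to POMs that declare no namespace at all.
        scm = root.find("scm")
    if scm is None:
        return {}
    result = {}
    for child in scm:
        tag = child.tag.split("}", 1)[-1]  # strip the "{namespace}" prefix
        if child.text and child.text.strip():
            result[tag] = child.text.strip()
    return result
```

Run once over a tarball of pom files, the collected url and connection values could then be grouped by host to spot forges that are not yet covered.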

I think there's value in making the Maven lister support more Maven repositories than just Maven Central, even if we focus on Maven Central as the first proof of concept.

makes sense. IMHO, it's good to have one issue per repository, since every repository will have its specific topics

What is present in the "IDE sources" (-sources.jar files) and how are they different from source releases?

For each .class in a .jar, *-sources.jar is expected to provide the corresponding .java source file, to display in the IDE instead of decompiled bytecode.
In a source release, this Java source would be in src/main/java, or even [Maven-module]/src/main/java.
In addition, a source release has the build file (whatever the build tool), whereas *-sources.jar does not provide any build file.
*-sources.jar may also contain generated source code, because this would help the IDE.
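As a rough illustration of that difference, a Python sketch that checks whether a zip or jar contains something named like a build file. The list of build file names is a hypothetical, non-exhaustive guess, and entries under META-INF/ are skipped on the assumption that a pom stored there is packaging metadata rather than a usable build file.

```python
import zipfile
from typing import IO, Union

# Hypothetical, non-exhaustive list of build file names.
BUILD_FILES = {"pom.xml", "build.gradle", "build.gradle.kts", "build.xml", "build.sbt"}


def has_build_file(archive: Union[str, IO[bytes]]) -> bool:
    """Return True if any entry outside META-INF/ is named like a build file."""
    with zipfile.ZipFile(archive) as zf:
        for name in zf.namelist():
            if name.startswith("META-INF/"):
                continue  # assumed to be packaging metadata, not a build file
            if name.rsplit("/", 1)[-1] in BUILD_FILES:
                return True
    return False
```

This is only a heuristic: it says nothing about whether the build would actually succeed from the archive's content.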

Would there be value in archiving them as a last resort?

If there is a high volume, perhaps this will be better than nothing, but these *-sources.jar files really have less value than buildable source release content.

From the search links you provided, 71k artifacts (packages ?) provide a -source-release.zip, and 9k artifacts a -src.zip. Looks like overall it would make sense to archive these.

Yes. I suppose you have already archived a lot of them, since projects that publish such archives are in general quite big ones. It would be useful to share the list of projects that provide a source release archive but were not archived previously.

Do you have an idea of how many artifacts reference an upstream source control repository that we could use to archive their source code?

sadly no