
Maven Central repository support
Open, Normal, Public

Description

As of the latest discussion and development, this means:

  • D6740: maven exporter/indexer [1]
  • D6395: maven lister
  • D6396: maven (package) loader
  • D6784: Enable the lister in Docker
  • D6955: Adds a container that runs the maven exporter in docker

[1] in charge of extracting the information that the maven lister needs in order to
list properly

The maven index exporter is in charge of downloading all index files, extracting and exporting them as raw text files. It is executed on a separate server and relies on a Docker image.
Once the text file is generated, it has to be made available on a local http server so the lister can fetch it and start the parsing (extract src jars and scm info).

The full export process is documented in [1] and a list of maven repositories that can be used/extracted with the tool is provided at [2]. The repository also provides a list of downloadable text files [3] ready-to-use by the lister. One can simply uncompress the tar.bz2 on the local http server and run the lister for tests.

[1] https://github.com/borisbaldassari/maven-index-exporter
[2] https://github.com/borisbaldassari/maven-index-exporter/tree/main/docs/maven_repositories
[3] https://icedrive.net/1/01BQpqC6rA

Revisions and Commits

rDENV Development environment
Accepted
D6784
rDLSMAVEXP maven-index-exporter
D6740
rDLDBASE Generic VCS/Package Loader
Abandoned
D6396
rDLS Listers
Abandoned
Abandoned
Abandoned
Abandoned
D6395
rDCORE Foundations and core functionalities
D6159

Event Timeline

nahimilega created this task.
nahimilega created this object in space S1 Public.
nahimilega renamed this task from Maven Central (JAVA) lister to Maven Lister. Edited May 22 2019, 11:01 PM
nahimilega added subscribers: ardumont, olasd.

Comment by @olasd

  • the repository format is "Maven"; "Maven Central" is only one instance of a Maven repository. There are many more public Maven repositories that would be useful to index, for instance Clojars or the Google Android maven repo: https://www.deps.co/guides/public-maven-repositories/. You'll need to rename the lister to "maven", and to modify the code to avoid hard-coding the maven repository root, making it an argument to the task instead (as we will want to list projects for several instances).
  • you went for a scraping approach, which is fine as a last resort. However, a quick search for "maven central index" brought up https://maven.apache.org/repository/central-index.html. Looks like these indexes are available to allow importing the full

It looks like these indexes are available at least for the following maven repositories :

The index also provides an incremental version (referenced in a properties file) which would allow for incremental updates without having to re-download the full index.

The Google repo also has an index https://developer.android.com/studio/build/dependencies.html#google-maven but it looks very different from the other maven repos I've found. However, it's fairly small compared to the others, so it shouldn't be too hard to sort it out as well.

Please investigate the format of these repository indexes, and the data they provide, and see whether that would be suitable for use as the data source for the lister.

As recommended by @olasd, I checked out the Maven Central index ( https://repo.maven.apache.org/maven2/.index/) this is a

This, when extracted, is a Lucene format file. The instructions to extract and view the data can be found on this link ( https://maven.apache.org/repository/central-index.html)
I converted this Lucene format file to XML file using the GUI mentioned in the link above. Here is a short snippet of that XML file
(https://forge.softwareheritage.org/P405)

Here we can get the file structure of the website from the field with name='u':

<doc id='40'>
<field name='1' flags='Idfp--S--------------'>
<val>aa5edf398f7c9451782d3f1aac68f11b91aa8469</val>
</field>
<field name='i' flags='------S--------------'>
<val>tar.gz|1122889985000|23943|2|2|0|tar.gz</val>
</field>
<field name='m' flags='------S--------------'>
<val>1318434019343</val>
</field>
<field name='u' flags='Idfp--S--------------'>
<val>xstream|xstream|0.2|src|tar.gz</val>
</field>
</doc>
<doc id='41'>
<field name='i' flags='------S--------------'>
<val>distribution-zip|1162775420000|62884|2|2|0|distribution-zip</val>
</field>
<field name='m' flags='------S--------------'>
<val>1318434019344</val>
</field>
<field name='u' flags='Idfp--S--------------'>
<val>xstream|xstream|0.2|src|distribution-zip</val>
</field>
</doc>

But it would be tricky to find a generalised way to iterate over all the rows and find the package link, because some entries are quite irregular, like this one:

<doc id='31'>
<field name='1' flags='Idfp--S--------------'>
<val>e4336eeae47ef9762ef441ffc54a630ce7df1887</val>
</field>
<field name='i' flags='------S--------------'>
<val>zip|1122889986000|1298621|0|0|0|zip</val>
</field>
<field name='m' flags='------S--------------'>
<val>1318434019247</val>
</field>
<field name='u' flags='Idfp--S--------------'>
<val>xstream|xstream|1.0-rc1|NA</val>
</field>
</doc>

Comparing both XML snippets, we can see that one states the file type after the version and the other doesn't, and there are many more types of irregularities.
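A minimal Python sketch of how both shapes of 'u' value shown above could be handled. It assumes the fields are groupId|artifactId|version|classifier, optionally followed by an extension, and that "NA" stands for "no classifier"; this is a reading of the snippets, not something confirmed in this thread. The names Coordinates and parse_u_field are hypothetical. When the fifth part is missing, the first part of the 'i' value ("zip" in the example above) looks like it carries the file type.

```python
from typing import NamedTuple, Optional


class Coordinates(NamedTuple):
    group_id: str
    artifact_id: str
    version: str
    classifier: Optional[str]  # None when the index stores "NA" (assumption)
    extension: Optional[str]   # None when the 'u' value has only four parts


def parse_u_field(value: str) -> Coordinates:
    """Split a 'u' value such as 'xstream|xstream|0.2|src|tar.gz'."""
    parts = value.strip().split("|")
    if len(parts) not in (4, 5):
        # Other irregular shapes are reported rather than guessed at.
        raise ValueError(f"unexpected 'u' value: {value!r}")
    group_id, artifact_id, version, classifier = parts[:4]
    extension = parts[4] if len(parts) == 5 else None
    return Coordinates(
        group_id,
        artifact_id,
        version,
        None if classifier == "NA" else classifier,
        extension,
    )
```

Entries that fit neither shape raise a ValueError, so they can be logged and reviewed instead of being silently mis-parsed.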

Moreover, I could not find a way to extract the zipped index file to Lucene format file and convert it to XML using python.

Extending on what I wrote in the previous comment, I did a bit more research about this.

I found that if we use Java, the job can be done quickly; there are many examples of making a self-updating indexer for Maven Central (which could be extended to others), like this one: https://github.com/apache/maven-indexer/tree/master/indexer-examples/indexer-examples-basic
But I am afraid I didn't find any way to do the job completely using Python.

One more interesting thing I found: the code I currently have for scraping Maven Central can be adapted to all the Maven repositories (except Google Android Maven) with some minor changes, as scraping is pretty much the same for all of them. There are some differences in file structure that would need to be addressed to make the code generic.

So shall I proceed with the Java approach or the scraping approach?

It is not recommended that you scrape or rsync:// a full copy of central as there is a large amount of data there and doing so will get you banned. You can use a program such as those described on the Repository Management page to run your internal repository's server, download from the internet as required, and then hold the artifacts in your internal repository for faster downloading later.

This is mentioned in https://maven.apache.org/guides/introduction/introduction-to-repositories.html page in Setting up the Internal Repository section.
Is this a matter of concern?

Creating Your Own Mirror
The size of the central repository is increasing steadily. To save us bandwidth and you time, mirroring the entire central repository is not allowed. (Doing so will get you automatically banned) Instead, we suggest you setup a repository manager as a proxy.

If you really want to become an official mirror, contact us at MVNCENTRAL with your location and we'll work to get you setup.

As mentioned in https://maven.apache.org/guides/mini/guide-mirror-settings.html

hboutemy renamed this task from Maven Lister to Maven Central repository Lister. Jun 2 2019, 12:10 PM

Hi,

I renamed the issue title to "Maven Central repository Lister" if the intent is to focus on this repository https://maven.apache.org/repository/index.html

I discussed a few years ago with Roberto di Cosmo about Maven Central repository content. Since then, I worked on Reproducible Builds for the JVM: https://reproducible-builds.org/docs/jvm/
There is one key issue in Central repository that I re-documented in previous page:

Notice that ${artifactId}-${version}-sources.jar files published in Maven repositories are not buildable sources, but sources for IDEs. Source tarballs, intended for building, are not always published in repositories but only sometimes, with 2 classical naming conventions:

${artifactId}-${version}-source-release.zip (see artifacts in Central providing such source tarballs) https://search.maven.org/search?q=l:source-release
${artifactId}-${version}-src.zip (see artifacts in Central providing such source tarballs) https://search.maven.org/search?q=l:src

This issue is probably the same with every repository in Maven format: in general, there are no buildable source files, at most IDE-oriented -sources.jar
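For illustration, a small Python sketch that builds the candidate download URLs for the two naming conventions above. It assumes the standard Maven repository layout (groupId dots become path segments, then artifactId, then version); the function names and the org.example coordinates are hypothetical, and nothing here checks that the files actually exist.

```python
from typing import List, Optional


def artifact_url(
    base_url: str,
    group_id: str,
    artifact_id: str,
    version: str,
    classifier: Optional[str] = None,
    extension: str = "jar",
) -> str:
    """Build <base>/<group path>/<artifactId>/<version>/<file name>."""
    group_path = group_id.replace(".", "/")
    suffix = f"-{classifier}" if classifier else ""
    filename = f"{artifact_id}-{version}{suffix}.{extension}"
    return f"{base_url.rstrip('/')}/{group_path}/{artifact_id}/{version}/{filename}"


def source_archive_candidates(
    base_url: str, group_id: str, artifact_id: str, version: str
) -> List[str]:
    """URLs to try for buildable sources: -source-release.zip, then -src.zip."""
    return [
        artifact_url(base_url, group_id, artifact_id, version, classifier, "zip")
        for classifier in ("source-release", "src")
    ]


if __name__ == "__main__":
    for url in source_archive_candidates(
        "https://repo.maven.apache.org/maven2", "org.example", "demo", "1.0"
    ):
        print(url)
```

A lister or loader would still have to probe each URL, since, as noted above, these archives are only sometimes published.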

IMHO, what you can get is just the scm entry of every pom, hoping that people have filed them properly: I can provide you a tar.bz2 archive that is only 304 MB with absolutely every pom.xml from central, so you can check before working on becoming a full Central mirror

I renamed the issue title to "Maven Central repository Lister" if the intent is to focus on this repository https://maven.apache.org/repository/index.html

Actually, we are trying to figure out some method which can work on all the maven instances.

Notice that ${artifactId}-${version}-sources.jar files published in Maven repositories are not buildable sources, but sources for IDEs. Source tarballs, intended for building, are not always published in repositories but only sometimes, with two classical naming conventions:

${artifactId}-${version}-source-release.zip (see artifacts in Central providing such source tarballs) https://search.maven.org/search?q=l:source-release
${artifactId}-${version}-src.zip (see artifacts in Central providing such source tarballs) https://search.maven.org/search?q=l:src

This issue is probably the same with every repository in Maven format: in general, there are no buildable source files, at most IDE-oriented -sources.jar

Thanks for this information; we are interested in archiving source code. If the files following these specific naming formats contain the source code, then we only need to list these types of files.

if you search for source-release or src, yes, you'll find archivable sources
the only issue is that you won't find much content...

Thanks a lot to @hboutemy for your valuable insights on sources in the Maven central repository, and for the pointer to Reproducible Builds on the JVM.

I think there's value in making the Maven lister support more Maven repositories than just Maven Central, even if we focus on Maven Central as the first proof of concept.

In my opinion, the most sensible way to work through listing of projects in a Maven repository would be through using the lucene indexes, but that poses an architectural challenge :

  • the infrastructure available to use lucene indexes in python code is not very practical : the official pylucene package isn't pip-installable and depends on having a jvm and the lucene jars available
  • adding yet another language component (any jvm-compatible language which would allow us to use the native lucene libraries) to the current lister infrastructure needs some forethought: this adds complexity to the docker development environment we're currently using, and it adds complexity to the production environment as well.

We're going to need to take a step back, and look at these challenges as a team, before we move further on this lister.

Coming back to the fact that maven itself doesn't hold the sources for a lot of projects: these questions are more for @hboutemy, as I expect he has far more experience on maven than all of us combined ;)

  • What is present in the "IDE sources" (-sources.jar files) and how are they different from source releases? Would there be value in archiving them as a last resort?
  • From the search links you provided, 71k artifacts (packages ?) provide a -source-release.zip, and 9k artifacts a -src.zip. Looks like overall it would make sense to archive these.
  • Do you have an idea of how many artifacts reference an upstream source control repository that we could use to archive their source code?

It might make sense for the Maven lister to generate both

  • origins for artifacts with a source-release/src version
  • origins for upstream source control repositories

But that'll only make sense if extracting the upstream source control repo information doesn't cost us too much (i.e. I don't think parsing all the new pom files to get repository information will be very efficient - this kinda matches the problem we're having with the NuGET lister).

If the repository information isn't available in an easily harvestable format, getting a tarball of all the pom files and parsing them (as a one-off) to find potentially interesting new data sources (that is, forges that aren't available in our regular processing yet) could be useful.

I think there's value in making the Maven lister support more Maven repositories than just Maven Central, even if we focus on Maven Central as the first proof of concept.

makes sense. IMHO, it's good to have one issue per repository, since every repository will have its specific topics

What is present in the "IDE sources" (-sources.jar files) and how are they different from source releases?

For each .class in a .jar, *-sources.jar is expected to provide the corresponding .java source file, to display in the IDE instead of decompiled bytecode.
In source release, this java source would be in src/main/java, or even [Maven-module]/src/main/java.
In addition, source-release has the build file (whatever the build tool), where *-sources.jar does not provide any build file.
*-sources.jar may also contain generated source code, because this would help the IDE

Would there be value in archiving them as a last resort?

If there is a high volume, perhaps this will be better than nothing, but these *-sources.jar have much less value than buildable source release content.

From the search links you provided, 71k artifacts (packages ?) provide a -source-release.zip, and 9k artifacts a -src.zip. Looks like overall it would make sense to archive these.

yes. I suppose you have already archived a lot of them, since projects that publish such archives are in general quite big ones. It would be useful to share the list of projects that provide a source release archive but were not archived previously

Do you have an idea of how many artifacts reference an upstream source control repository that we could use to archive their source code?

sadly no

Here is an interesting update on the issue of listing Maven Central. Great people at the FASTEN EU project are analyzing software dependencies and for that they are working on a tool to download projects from various sources, including Maven.
The tool is here: https://github.com/fasten-project/source-populate
It appears to be more about downloading a known project source than listing the content of a repository, but we could try and share efforts in this space.

@hboutemy : I wonder if you are aware that we have now in place a grant program that allows to fund development of listers like this one.
All the information is available at https://www.softwareheritage.org/grants and you can mail me for more info if needed.

After recent exchanges with @hboutemy and Charles Sabourdin, here is a clarification of the scope of this task.
We need a Maven repository lister that addresses the following issues:

  • list the packages contained in a Maven repository (can be Maven Central, or one of the endangered repositories like jcenter and bintray https://www.infoq.com/news/2021/02/jfrog-jcenter-bintray-closure/)
  • harvest any source code bundle present there (as described above in the thread); considering that Maven is a binary package manager, we should not expect to find that many, though
  • extract from the package metadata (the POM files) the references to the source code, and load the corresponding repository. For example, this POM file for Sat4J contains the stanza
  <scm>
    <connection>scm:git:https://gitlab.ow2.org/sat4j/sat4j/</connection>
    <url>https://gitlab.ow2.org/sat4j/sat4j/</url>
    <developerConnection>scm:git:https://gitlab.ow2.org/sat4j/sat4j/</developerConnection>
    <tag>2_3_6</tag>
  </scm>

here we find that the source code is in a git repo located at https://gitlab.ow2.org/sat4j/sat4j/
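A rough Python sketch of that extraction step, using only the standard library. It assumes the POM either declares the usual http://maven.apache.org/POM/4.0.0 namespace or none at all, and that the connection string has the scm:<provider>:<url> shape seen in the Sat4J example; extract_scm and split_scm_connection are hypothetical names, not existing lister code.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Optional, Tuple

POM_NS = "{http://maven.apache.org/POM/4.0.0}"


def extract_scm(pom_xml: str) -> Dict[str, str]:
    """Return the children of the top-level <scm> element as a dict."""
    root = ET.fromstring(pom_xml)
    scm = root.find(f"{POM_NS}scm")
    if scm is None:
        # Some POMs omit the namespace declaration (assumption).
        scm = root.find("scm")
    if scm is None:
        return {}
    result = {}
    for child in scm:
        tag = child.tag.split("}", 1)[-1]  # drop the namespace prefix, if any
        if child.text and child.text.strip():
            result[tag] = child.text.strip()
    return result


def split_scm_connection(connection: str) -> Tuple[Optional[str], str]:
    """Split 'scm:git:https://host/repo' into ('git', 'https://host/repo')."""
    parts = connection.split(":", 2)
    if len(parts) == 3 and parts[0] == "scm":
        return parts[1], parts[2]
    # Not in the expected shape: return it untouched, provider unknown.
    return None, connection
```

POMs without an scm stanza yield an empty dict, which matches the caveat above that this only works when people have filled the field in properly.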

Some stats on number of packages in Maven Central are here: https://search.maven.org/stats

Few more cents in the bucket..

  • scraping is explicitly forbidden, see https://repo1.maven.org/terms.html -- however making contact first will help us go through most of the abuse-limiting rules I guess.
  • regarding fasten, there are indeed some bits that could be useful. However most of our difficulties are in getting a list of projects, whereas this information is already provided by the user in the case of fasten. So, interesting and useful, but not a game changer regarding the difficult part of our job.

So, to sum up the options we have.. Basically we "just" need all artifacts coordinates. From there for each artifact we can:

  1. fetch the pom.xml and look for the scm attribute.
  2. try to download the artifact-src or source-release package.

Once we have the artifactId/group we can easily get the list of versions (e.g. for updates) by reading the maven-metadata.xml file at the package level:
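As a hedged illustration of that step, the sketch below computes where the maven-metadata.xml of an artifact would live under the standard repository layout and reads the version list out of it. The element path versioning/versions/version is the usual shape of that file; metadata_url and list_versions are hypothetical names, and the download itself is left out.

```python
import xml.etree.ElementTree as ET
from typing import List


def metadata_url(base_url: str, group_id: str, artifact_id: str) -> str:
    """URL of the artifact-level maven-metadata.xml."""
    group_path = group_id.replace(".", "/")
    return f"{base_url.rstrip('/')}/{group_path}/{artifact_id}/maven-metadata.xml"


def list_versions(metadata_xml: str) -> List[str]:
    """Return every <version> listed under <versioning><versions>."""
    root = ET.fromstring(metadata_xml)
    return [
        node.text.strip()
        for node in root.findall("./versioning/versions/version")
        if node.text and node.text.strip()
    ]
```

Comparing this list with the versions already seen would be one way to detect updates for a known artifact.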

As for "how to get the list of artifacts":

  • Scraping could work but is explicitly forbidden. Pages could easily be parsed, and it would allow us to identify *all* artifacts.
  • Using Maven indexes would require a jvm and bring some complexity to the docker & prod setups. OTOH it is the "official" way to retrieve information from a maven repository and most (if not all really) repositories provide this feature. It would also enable a smart incremental listing. The information we need (scm field in the pom) is NOT indexed however, so we would need another GET request to fetch the pom and extract the scm tag. Some Maven-Indexer-related pages mention a REST API (which sounds like something we could http-retrieve) that IDEs can use to fetch the information required for their auto-completion and similar features, but I can't find any information to actually use it.
  • A third path could be to parse all the pom.xml's that we find and follow all artifactId's recursively, building a graph of dependencies and parent poms. This is more of a non-complete heuristic, and we would miss leaf nodes (i.e. artifacts that are not used by others), but it could help completing the list.

It should be noted also that there are two main implementations of maven repositories: Nexus and Artifactory. By being more specific we could use the respective APIs of these products to get information. But we'd lose any generic treatment doing so.

Some more information about the maven indexer. Beware people it's a bit dirty, and you're not going to like it infra-wise.

  • Maven-Indexer is actually a (thick) wrapper around lucene. It parses the repository and stores documents, fields and terms in an index. One can extract the lucene index from a maven index using the command: java -jar indexer-cli-5.1.1.jar --unpack nexus-maven-repository-index.gz --destination test --type full. Note however that 5.1.1 is an old version of maven indexer, but newer versions of the maven indexer won't work on the central indexes.
  • The unpacked lucene index uses an OLD version of the format (Lucene 5.4), so we'll need an old version of lucene and co to actually read it. Clue is a CLI tool to read lucene indexes, and this (old) version works with our maven indexes. One can use the following command to export the index to text: java -jar clue-6.2.0-1.0.0.jar maven/central-lucene-index/ export central_export text. This is the closest I could get as of now to readable content. We'll need to parse the text file to extract artifactId/group/extension information -- however the scm attribute is NOT indexed so we'll need to get the pom in any case, from the artifact coordinates.

The exported text file looks like this:

doc 0
  field 0
    name u
    type string
    value com.redhat.rhevm.api|rhevm-api-powershell-jaxrs|1.0-rc1.16|javadoc|jar
  field 1
    name m
    type string
    value 1321264789727
  field 2
    name i
    type string
    value jar|1320743675000|768291|2|2|1|jar
  field 10
    name n
    type string
    value RHEV-M API Powershell Wrapper Implementation JAX-RS
  field 13
    name 1
    type string
    value 454eb6762e5bb14a75a21ae611ce2048dd548550

I still need to do some prototyping, but I think the information is there. To be continued.
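To give an idea of what that parsing could look like, here is a small Python sketch that turns the exported text into one dict per doc, keyed by field name ('u', 'm', 'i', 'n', '1', ...). It assumes the layout is exactly as in the excerpt above (doc / field / name / type / value lines, one value per line); iter_docs is a hypothetical name and this is not the actual lister code.

```python
from typing import Dict, Iterable, Iterator, Optional


def iter_docs(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield one {field name: value} dict per 'doc N' block."""
    doc: Optional[Dict[str, str]] = None
    name: Optional[str] = None
    for raw in lines:
        line = raw.strip()
        if line.startswith("doc "):
            if doc is not None:
                yield doc  # previous doc is complete
            doc, name = {}, None
        elif doc is None:
            continue  # ignore anything before the first doc
        elif line.startswith("field "):
            name = None  # a new field starts; wait for its name
        elif line.startswith("name "):
            name = line[len("name "):]
        elif line.startswith("value ") and name is not None:
            doc[name] = line[len("value "):]
            name = None
    if doc is not None:
        yield doc
```

Because it works on any iterable of lines, the same function can be fed an open file or a streamed HTTP response without loading the whole export in memory.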

As far as I can see, everything that uses the maven indexes will need a dedicated docker container I'm afraid..

Update for the Maven Indexer prototype: it works! (finally)

The above-mentioned export file indeed has all the information we need, and I could write a script to extract the download URL of:

  • all pom files from the repository (1121094 poms)
  • all sources jars from the repository (5774890 jars)

Doing some random testing (i.e. try to download some random URL/lines of the list) works well, although I didn't try *all* lines.

That will require some more testing/fine-tuning, but the "prototyping" step is done and ok imho.

Updates:

  • A ticket has been submitted in the Sonatype JIRA to let them know we will fetch maven poms and src jars soon.
  • An email has been sent on the maven-dev mailing list with a few kind answers, mainly stating to let Sonatype know through a JIRA issue.
  • Hervé Boutemy provided some precious insights about the best way to use the poms; it seems we can get a near-complete list of maven repositories worldwide by parsing some pom arguments and following dependencies up. It should probably not be used directly by the lister (which should provide only the list of src jars and scm attributes to the loaders), but we can output it somewhere to feed the lister manually.
ardumont renamed this task from Maven Central repository Lister to Maven Central repository support. Nov 22 2021, 11:54 AM
ardumont updated the task description.

Hi there,

As stated elsewhere, I'm working on a list of maven servers that would be compatible with the lister (lucene/maven indexer wise). I should have a first release of that list by the end of the week, and will update this task with more information.
I'll also create a diff on the docker repository as requested.

Please stay tuned.

Hi @ardumont

I'm not sure what you mean by the docker diff. Is that the update of the maven-index-exporter repository at D6740?
The above-mentioned repository has documentation to build, test and run the text index generation. As mentioned there I've also created a bunch of compressed text index exports, that can be used to test the lister/loader without running the docker image immediately. They are all real-world extracts obtained by running the docker image on the list of Maven repositories I could get as of last week. They together represent a few million artefacts.

I'm not sure what you mean by the docker diff.

Prior to you actually landing the diffs on the lister and the loader, we asked you to
run your diffs through the swh docker-dev tool, so as to make sure they ran appropriately
according to your understanding and what you developed. And you did, as mentioned in
the discussion in one of those diffs.

I'm asking you for a diff with the exact changes you had to make in the
swh-environment/docker/docker-compose.yml (and other folders) to actually make it run.
That will definitely help for the deployment on staging.

Thanks in advance.

Is that the update of the maven-index-exporter repository at D6740?

Nope (see my explanation above ;)

The above-mentioned repository has documentation to build, test and run the text index
generation. As mentioned there I've also created a bunch of compressed text index
exports, that can be used to test the lister/loader without running the docker image
immediately. They are all real-world extracts obtained by running the docker image on
the list of Maven repositories I could get as of last week. They together represent a
few million artefacts.

Good, thanks for that information.

I'm asking you for a diff with the exact changes you had to make in the
swh-environment/docker/docker-compose.yml (and other folders) to actually make it run.
That will definitely help for the deployment on staging.

Ah, ok! Yes sure. I've just submitted the diff D6784 with what git identifies as my changes in the docker directory. Apart from the process described at [1] it's all I needed to fill up my hdd. ;-)

[1] https://docs.softwareheritage.org/devel/getting-started/using-docker.html#docker-environment

It seems the docker-compose.override.yml file is not picked up by the commit, so I'm providing it here.

Am I missing anything?

On second thoughts: in order to run the docker-dev setup, I also had to run a virtual machine alongside the swh setup to host the text index file, and make sure the swh vm could access it.
I suppose that any vm/docker/baremetal machine with an apache/nginx server could do for that, as long as the lister can http-fetch the .fld file.

As a reminder the HTTP URL of the fld file is provided as a parameter to the lister, and can be any form of URL, e.g. http://192.168.0.10/maven/_0.fld
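A minimal sketch of that fetch, assuming the file is plain UTF-8 text served as-is; iter_index_lines is a hypothetical helper, not the lister's actual code. It streams the response line by line with the standard library, so a large .fld file does not have to fit in memory.

```python
import urllib.request
from typing import Iterator


def iter_index_lines(url: str, encoding: str = "utf-8") -> Iterator[str]:
    """Yield the lines of the text index found at `url`, without newlines."""
    with urllib.request.urlopen(url) as response:
        for raw in response:  # the response object iterates over byte lines
            yield raw.decode(encoding, errors="replace").rstrip("\r\n")


if __name__ == "__main__":
    # Example address taken from the comment above; adjust to the local server.
    for line in iter_index_lines("http://192.168.0.10/maven/_0.fld"):
        print(line)
```

Any server able to serve a static file (apache, nginx, ...) should do, since only a plain GET is involved.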

I'm not sure how to pass on this information as this is an external resource, but please feel free to ask for more details.

vlorentz updated the task description.

On second thoughts: in order to run the docker-dev setup, I also had to run a virtual
machine alongside the swh setup to host the text index file, and make sure the swh vm
could access it. I suppose that any vm/docker/baremetal machine with an apache/nginx
server could do for that, as long as the lister can http-fetch the .fld file.

Can you please adapt the docker-dev setup to actually run that extra part ^ within the
docker-dev environment? There is an nginx container within our docker-dev setup btw. So
the maven lister within the docker environment could actually list from an output
exposed by that nginx.

My prerequisite is that the whole maven-swh scaffolding runs within docker-dev first
(like every other lister/loader we have). And then we can move on to actually
replicating it within the staging infrastructure.

That has multiple advantages:

  • other contributors can pick it up and hack on it without too much hassle (doco up).
  • demonstrates that currently the swh-maven scaffolding is ready (as long as it's not running too long because dev's hardware can become a limitation).
  • documents a bit in an executable form what's required for our infra to run it

Thanks in advance.

I'll be off in the next 2 weeks but @vlorentz [1] may be able to answer some more
questions (or as usual the #swh-devel irc chat can).

I've pinged him f2f about this ^ btw.

Hi @ardumont

Thanks! You did well, I had not been notified about your post and didn't know about it. Sorry for overlooking that. I'll have a look this week.
Happy new year btw, talk to you soon!

@ardumont I've added an nginx container to the main docker-compose file and made it serve one of the example fld files (in the conf/maven-index directory).
The served file can be accessed from the lister container, but for now the task doesn't pick anything up -- I don't see it in the lister container logs at all, and (thus) the psql command returns 0 rows. I'll investigate why (I made it work a month ago, so..), but a quick discussion about the scheduler on IRC might help. I'll be connected on IRC this Monday; if we can take the chance to discuss the issue (and check that the compose thing is ok), that would be helpful.