diff --git a/README.md b/README.md index 43c2060..ad86d68 100644 --- a/README.md +++ b/README.md @@ -1,335 +1,15 @@ # Maven Index Exporter -This Docker image reads a Maven Indexer index and extract information about -the indexed documents as a convenient text file. +This Docker image reads a Maven Indexer index and extracts information about the indexed documents as a convenient text file. -## Sequence +It takes as input the full set of Maven index files, as can be seen in the Maven Central repository, and uses two Java tools ([maven-indexer-cli](https://maven.apache.org/maven-indexer/) and [clue](https://github.com/javasoze/clue)) to unpack the indexes (into `indexes/`) and export them to the `export/` directory. -The index files can be dowloaded from any maven repository that uses -maven-indexer, like maven central: +You can read more about the sequence of actions in the `docs/` directory, including: +* [more information about the process](docs/README.md). +* [instructions to run the exporter](docs/run_maven-index-exporter.md). +* [instructions to build and test](docs/build_and_test.md) the Docker image. - https://repo1.maven.org/maven2/.index/ +An official Docker image is provided for quick tests on [DockerHub](https://hub.docker.com/r/bbaldassari/maven-index-exporter). -Copy all files (i.e. the main index, the updates and properties file) into -the volume directory (`$WORKDIR`). It will be mounted as `/work/` in the -docker image. -The export is then achieved in two steps: - -* Unpack the Lucene indexes from the Maven Indexer indexes using - `maven-indexer-cli`. The command used is: - -``` -$ java --illegal-access=permit -jar $INDEXER_JAR \ - --unpack $FILE_IN \ - --destination $WORKDIR/indexes/ \ - --type full -``` - -This generates a set of binary lucene files as shown below: - -``` -$ ls -lh $WORKDIR/indexes/ -total 5,2G --rw-r--r-- 1 root root 500M juil. 7 22:06 _4m.fdt --rw-r--r-- 1 root root 339K juil. 7 22:06 _4m.fdx --rw-r--r-- 1 root root 2,2K juil. 
7 22:07 _4m.fnm --rw-r--r-- 1 root root 166M juil. 7 22:07 _4m_Lucene50_0.doc --rw-r--r-- 1 root root 147M juil. 7 22:07 _4m_Lucene50_0.pos -[SNIP] --rw-r--r-- 1 root root 363 juil. 7 22:06 _e0.si --rw-r--r-- 1 root root 1,7K juil. 7 22:07 segments_2 --rw-r--r-- 1 root root 8 juil. 7 21:54 timestamp -``` - -* Export the Lucene documents from the Lucene indexes using `clue`. This - generates a set of text files as shown below: - -``` -$ java --illegal-access=permit -jar $JAR_CLUE $WORKDIR/indexes/ \ - export $WORKDIR/export/ text -``` - -This generates a bunch of text files relating to the Lucene indexes, made -available in `$WORKDIR/export/`. For our purpose we only keep the `*.fld` -file that includes the indexed documents. - -## Output - -The clue command is documented on [its github page](https://github.com/javasoze/clue). -The indexed Lucene documents are located in the `*.fld` file. - -A description of the fields used by maven-indexer can be found in the project's -API docs: -https://maven.apache.org/maven-indexer-archives/maven-indexer-6.0.0/indexer-core/apidocs/org/apache/maven/index/ArtifactInfo.html - -## How to build - -The build downloads binaries for both tools (maven-indexer-cli and clue), so make sure there is an internet connection. -Go to the `docker/` dorectory and issue the folowing command: - -``` -$ docker build . -t bbaldassari/maven-index-exporter --no-cache -``` - -An up-to-date docker image is also available on docker hub at -[bbaldassari/maven-index-exporter](https://hub.docker.com/r/bbaldassari/maven-index-exporter). - -``` -$ docker pull bbaldassari/maven-index-exporter -``` - -## How to use - -The Docker image uses volumes to exchanges files. Prepare a directory with -enough space disk (see warning below) and pass it to docker: - -``` -$ docker run -v /local/work/dir:/work bbaldassari/maven-index-exporter -``` - -Please note that the local work dir MUST be an absolute path, as docker won't -mount relative paths as volumes. 
- -For our purpose only the fld file is kept, so if you need other export files -you should simply edit the `extract_indexes.sh` script and comment the lines -that do the cleaning. - -### Running as cron - -The `run_full_export.py` script located in `resources` provides an easy way to run the -export as a cron batch job, and copy the resulting text export to a specific location. - -Simply use and adapt the crontab command as follows: - -``` -cd /home/boris/resources/ && /home/boris/resources/myvenv/bin/python /home/boris/resources/run_full_export.py https://repo.maven.apache.org/maven2/ /tmp/maven-index\ --exporter/ /var/www/html/maven_index_exporter/ 2>&1 > /home/boris/run_maven_exporter_$(date +"%Y%m%d-%H%M%S").log - -``` - -The script takes three mandatory arguments: - -``` -Usage: run_full_export.py url work_dir publish_dir - - url is the base url of the maven repository instance. - Example: https://repo.maven.apache.org/maven2/ - - work_dir must be an absolute path to the temp directory. - Example: /tmp/maven-index-exporter/ - - publish_dir must be an absolute path to the final directory. - Example: /var/www/html/ -``` - -It is recommended to setup a virtual environment to run the script. - -``` -$ python3 -m venv myvenv -$ source venv/bin/activate -``` - -Python modules to be installed are provided in the `requirements.txt` file. - -### size of generated files - -Beware that maven indexes are compressed and text export can become huge. -When executed on the maven central indexes (1.2 GB), the process generates -5.2 GB of intermediate files and 49 GB of final text data on disk: - -``` -$ du -sh /work/* -49G /work/export -5,2G /work/indexes -1,2G /work/nexus-maven-repository-index.gz -``` - -## How to test (the quick way) - -There is a bash script called `test_docker_image.sh` in the `resources/` directory, -simply execute it. Tests cover the creation of the docker image, and the results after -execution. 
- -``` -$ bash test_docker_image.sh -Script started on 20210911_181912. -* Writing log to test_docker_image.log. -* Docker image [maven-index-exporter] doesn't exist. -* Building docker image. -PASS: docker build returned 0. -PASS: Docker image is listed. -PASS: file [/home/boris/Projects/gh_maven-index-exporter/repository_test/export/_1.fld] has been created. -PASS: file [/home/boris/Projects/gh_maven-index-exporter/repository_test/export/_1.fld] has 7 docs. -PASS: file [/home/boris/Projects/gh_maven-index-exporter/repository_test/export/_1.fld] has 26 fields. -PASS: file [/home/boris/Projects/gh_maven-index-exporter/repository_test/export/_1.fld] has sprova4j-0.1.0-sources.jar. -PASS: file [/home/boris/Projects/gh_maven-index-exporter/repository_test/export/_1.fld] has sprova4j-0.1.0.pom. -PASS: file [/home/boris/Projects/gh_maven-index-exporter/repository_test/export/_1.fld] has sprova4j-0.1.1-sources.jar. -PASS: file [/home/boris/Projects/gh_maven-index-exporter/repository_test/export/_1.fld] has sprova4j-0.1.1.pom. -$ -``` - -## How to test (the long road) - -This repository has a simple, almost-empty maven-indexer index that can be used to test the docker build. To use it, make sure that the directory `repository_test/` is present and run this command: - -``` -$ docker run -v $(pwd)/repository_test:/work bbaldassari/maven-index-exporter -``` - -The exported files will be stored in `repository_test/export/`, and output should look like this: - -``` -$ docker run -v $(pwd)/repository_test:/work bbaldassari/maven-index-exporter -Docker Script started on 2021-08-27 06:32:22. -# Checks.. 
-* Content of /opt: -total 32156 --rw------- 1 root root 18000742 Jan 8 2018 clue-6.2.0-1.0.0.jar --rw-r--r-- 1 root root 2574 Aug 25 18:28 extract_indexes.sh --rw------- 1 root root 14914610 Nov 28 2017 indexer-cli-6.0.0.jar -drwxr-xr-x 3 root root 4096 Jun 29 16:23 java -* Content of /work: -total 36 --rw-r--r-- 1 1000 1000 254 Aug 26 09:21 nexus-maven-repository-index.1.gz --rw-r--r-- 1 1000 1000 32 Aug 26 09:21 nexus-maven-repository-index.1.gz.md5 --rw-r--r-- 1 1000 1000 40 Aug 26 09:21 nexus-maven-repository-index.1.gz.sha1 --rw-r--r-- 1 1000 1000 344 Aug 26 09:21 nexus-maven-repository-index.gz --rw-r--r-- 1 1000 1000 32 Aug 26 09:21 nexus-maven-repository-index.gz.md5 --rw-r--r-- 1 1000 1000 40 Aug 26 09:21 nexus-maven-repository-index.gz.sha1 --rw-r--r-- 1 1000 1000 193 Aug 26 09:21 nexus-maven-repository-index.properties --rw-r--r-- 1 1000 1000 32 Aug 26 09:21 nexus-maven-repository-index.properties.md5 --rw-r--r-- 1 1000 1000 40 Aug 26 09:21 nexus-maven-repository-index.properties.sha1 -* Will read files from [/work/nexus-maven-repository-index.gz]. -* Found file [/work/nexus-maven-repository-index.gz]. -* Found indexer [/opt/indexer-cli-6.0.0.jar]. -* Found clue [/opt/clue-6.2.0-1.0.0.jar]. -* Java version:. -openjdk version "11.0.11" 2021-04-20 -OpenJDK Runtime Environment AdoptOpenJDK-11.0.11+9 (build 11.0.11+9) -OpenJDK 64-Bit Server VM AdoptOpenJDK-11.0.11+9 (build 11.0.11+9, mixed mode) -############################# -Unpacking [/work/nexus-maven-repository-index.gz] to /work/indexes -SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder". -SLF4J: Defaulting to no-operation (NOP) logger implementation -SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details. -Index Folder: /work -Output Folder: /work/indexes -Total time: 0 sec -Final memory: 41M/1004M -Unpacking finished on 2021-08-27 06:32:23. 
-############################# -Exporting indexes /work/indexes to /work/export -no configuration found, using default configuration -Analyzer: class org.apache.lucene.analysis.standard.StandardAnalyzer -Query Builder: class com.senseidb.clue.api.DefaultQueryBuilder -Directory Builder: class com.senseidb.clue.api.DefaultDirectoryBuilder -IndexReader Factory: class com.senseidb.clue.api.DefaultIndexReaderFactory -Term Bytesref Display: class com.senseidb.clue.api.StringBytesRefDisplay -Payload Bytesref Display: class com.senseidb.clue.api.RawBytesRefDisplay -exporting index to text -Exporting finished on 2021-08-27 06:32:23. -############################# -Cleaning useless files. -Size before cleaning: -32.0K /work/export -28.0K /work/indexes -4.0K /work/nexus-maven-repository-index.1.gz -4.0K /work/nexus-maven-repository-index.1.gz.md5 -4.0K /work/nexus-maven-repository-index.1.gz.sha1 -4.0K /work/nexus-maven-repository-index.gz -4.0K /work/nexus-maven-repository-index.gz.md5 -4.0K /work/nexus-maven-repository-index.gz.sha1 -4.0K /work/nexus-maven-repository-index.properties -4.0K /work/nexus-maven-repository-index.properties.md5 -4.0K /work/nexus-maven-repository-index.properties.sha1 -* Removing useless exports. - Keeping only fld text extract. - Size after cleaning: -8.0K /work/export -28.0K /work/indexes -4.0K /work/nexus-maven-repository-index.1.gz -4.0K /work/nexus-maven-repository-index.1.gz.md5 -4.0K /work/nexus-maven-repository-index.1.gz.sha1 -4.0K /work/nexus-maven-repository-index.gz -4.0K /work/nexus-maven-repository-index.gz.md5 -4.0K /work/nexus-maven-repository-index.gz.sha1 -4.0K /work/nexus-maven-repository-index.properties -4.0K /work/nexus-maven-repository-index.properties.md5 -4.0K /work/nexus-maven-repository-index.properties.sha1 -* Make files modifiable by the end-user. -Docker Script execution finished on 2021-08-27 06:32:23. 
-``` - -The `_1.fld` file contains the fields for each document: - -``` -$ head repository_test/export/_1.fld -doc 0 - field 0 - name u - type string - value al.aldi|sprova4j|0.1.0|sources|jar - field 1 - name m - type string - value 1626111735737 - field 2 -``` - -### Building the test repository - -The test repository `repository_test` can be rebuilt from the `repository_src` -structure using [indexer-cli](https://search.maven.org/remotecontent?filepath=org/apache/maven/indexer/indexer-cli/6.0.0/indexer-cli-6.0.0.jar) -with the following commands: - -``` -$ cd repository_src -$ java -jar ~/Downloads/indexer-cli-6.0.0.jar -i index/ -d repository_test/ -r repo1 -s -c -SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder". -SLF4J: Defaulting to no-operation (NOP) logger implementation -SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details. -Repository Folder: /home/boris/Projects/maven-index-exporter/repository_src/repo1 -Index Folder: /home/boris/Projects/maven-index-exporter/repository_src/index -Output Folder: /home/boris/Projects/maven-index-exporter/repository_src/repository_test -Repository name: index -Indexers: [min, jarContent] -Will create checksum files for all published files (sha1, md5). -Will create incremental chunks for changes, along with baseline file. -Scanning started -Artifacts added: 2 -Artifacts deleted: 0 -Total time: 1 sec -Final memory: 48M/1012M -$ java -jar ~/Downloads/indexer-cli-6.0.0.jar -i index/ -d repository_test/ -r repo2 -s -c -SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder". -SLF4J: Defaulting to no-operation (NOP) logger implementation -SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details. 
-Repository Folder: /home/boris/Projects/maven-index-exporter/repository_src/repo2 -Index Folder: /home/boris/Projects/maven-index-exporter/repository_src/index -Output Folder: /home/boris/Projects/maven-index-exporter/repository_src/repository_test -Repository name: index -Indexers: [min, jarContent] -Will create checksum files for all published files (sha1, md5). -Will create incremental chunks for changes, along with baseline file. -Scanning started -Artifacts added: 2 -Artifacts deleted: 0 -Total time: 0 sec -Final memory: 7M/1012M -$ java -jar ~/Downloads/indexer-cli-6.0.0.jar -i index/ -d repository_test/ -r repo3 -s -c -SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder". -SLF4J: Defaulting to no-operation (NOP) logger implementation -SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details. -Repository Folder: /home/boris/Projects/maven-index-exporter/repository_src/repo3 -Index Folder: /home/boris/Projects/maven-index-exporter/repository_src/index -Output Folder: /home/boris/Projects/maven-index-exporter/repository_src/repository_test -Repository name: index -Indexers: [min, jarContent] -Will create checksum files for all published files (sha1, md5). -Will create incremental chunks for changes, along with baseline file. -Scanning started -Artifacts added: 1 -Artifacts deleted: 2 -Total time: 0 sec -Final memory: 8M/1012M -$ -``` diff --git a/docs/README.md b/docs/README.md new file mode 100644 index 0000000..efed837 --- /dev/null +++ b/docs/README.md @@ -0,0 +1,78 @@ + +# Documentation + + +## Sequence + +The index files can be downloaded from any Maven repository that uses +maven-indexer, like Maven Central: + + https://repo1.maven.org/maven2/.index/ + +Copy all files (i.e. the main index, the updates and the properties file) into +the volume directory (`$WORKDIR`). It will be mounted as `/work/` in the +docker image. 
+ +The export is then achieved in two steps: + +* Unpack the Lucene indexes from the Maven Indexer indexes using + `maven-indexer-cli`. The command used is: + +``` +$ java --illegal-access=permit -jar $INDEXER_JAR \ + --unpack $FILE_IN \ + --destination $WORKDIR/indexes/ \ + --type full +``` + +This generates a set of binary Lucene files as shown below: + +``` +$ ls -lh $WORKDIR/indexes/ +total 5,2G +-rw-r--r-- 1 root root 500M juil. 7 22:06 _4m.fdt +-rw-r--r-- 1 root root 339K juil. 7 22:06 _4m.fdx +-rw-r--r-- 1 root root 2,2K juil. 7 22:07 _4m.fnm +-rw-r--r-- 1 root root 166M juil. 7 22:07 _4m_Lucene50_0.doc +-rw-r--r-- 1 root root 147M juil. 7 22:07 _4m_Lucene50_0.pos +[SNIP] +-rw-r--r-- 1 root root 363 juil. 7 22:06 _e0.si +-rw-r--r-- 1 root root 1,7K juil. 7 22:07 segments_2 +-rw-r--r-- 1 root root 8 juil. 7 21:54 timestamp +``` + +* Export the Lucene documents from the Lucene indexes using `clue`. The + command used is: + +``` +$ java --illegal-access=permit -jar $JAR_CLUE $WORKDIR/indexes/ \ + export $WORKDIR/export/ text +``` + +This generates a set of text files relating to the Lucene indexes, made +available in `$WORKDIR/export/`. For our purpose we only keep the `*.fld` +file that includes the indexed documents. + +## Output + +The clue command is documented on [its GitHub page](https://github.com/javasoze/clue). +The indexed Lucene documents are located in the `*.fld` file. + +A description of the fields used by maven-indexer can be found in the project's +API docs: https://maven.apache.org/maven-indexer-archives/maven-indexer-6.0.0/indexer-core/apidocs/org/apache/maven/index/ArtifactInfo.html + +## How to build + +The build downloads binaries for both tools (maven-indexer-cli and clue), so make sure there is an internet connection. +Go to the `docker/` directory and issue the following command: + +``` +$ docker build . 
-t bbaldassari/maven-index-exporter --no-cache +``` + +An up-to-date docker image is also available on docker hub at +[bbaldassari/maven-index-exporter](https://hub.docker.com/r/bbaldassari/maven-index-exporter). + +``` +$ docker pull bbaldassari/maven-index-exporter +``` diff --git a/README.md b/docs/build_and_test.md similarity index 64% copy from README.md copy to docs/build_and_test.md index 43c2060..7a58849 100644 --- a/README.md +++ b/docs/build_and_test.md @@ -1,335 +1,195 @@ -# Maven Index Exporter +# Build and test Maven index exporter -This Docker image reads a Maven Indexer index and extract information about -the indexed documents as a convenient text file. - -## Sequence - -The index files can be dowloaded from any maven repository that uses -maven-indexer, like maven central: - - https://repo1.maven.org/maven2/.index/ - -Copy all files (i.e. the main index, the updates and properties file) into -the volume directory (`$WORKDIR`). It will be mounted as `/work/` in the -docker image. - -The export is then achieved in two steps: - -* Unpack the Lucene indexes from the Maven Indexer indexes using - `maven-indexer-cli`. The command used is: - -``` -$ java --illegal-access=permit -jar $INDEXER_JAR \ - --unpack $FILE_IN \ - --destination $WORKDIR/indexes/ \ - --type full -``` - -This generates a set of binary lucene files as shown below: - -``` -$ ls -lh $WORKDIR/indexes/ -total 5,2G --rw-r--r-- 1 root root 500M juil. 7 22:06 _4m.fdt --rw-r--r-- 1 root root 339K juil. 7 22:06 _4m.fdx --rw-r--r-- 1 root root 2,2K juil. 7 22:07 _4m.fnm --rw-r--r-- 1 root root 166M juil. 7 22:07 _4m_Lucene50_0.doc --rw-r--r-- 1 root root 147M juil. 7 22:07 _4m_Lucene50_0.pos -[SNIP] --rw-r--r-- 1 root root 363 juil. 7 22:06 _e0.si --rw-r--r-- 1 root root 1,7K juil. 7 22:07 segments_2 --rw-r--r-- 1 root root 8 juil. 7 21:54 timestamp -``` - -* Export the Lucene documents from the Lucene indexes using `clue`. 
This - generates a set of text files as shown below: - -``` -$ java --illegal-access=permit -jar $JAR_CLUE $WORKDIR/indexes/ \ - export $WORKDIR/export/ text -``` - -This generates a bunch of text files relating to the Lucene indexes, made -available in `$WORKDIR/export/`. For our purpose we only keep the `*.fld` -file that includes the indexed documents. - -## Output - -The clue command is documented on [its github page](https://github.com/javasoze/clue). -The indexed Lucene documents are located in the `*.fld` file. - -A description of the fields used by maven-indexer can be found in the project's -API docs: -https://maven.apache.org/maven-indexer-archives/maven-indexer-6.0.0/indexer-core/apidocs/org/apache/maven/index/ArtifactInfo.html - -## How to build - -The build downloads binaries for both tools (maven-indexer-cli and clue), so make sure there is an internet connection. -Go to the `docker/` dorectory and issue the folowing command: - -``` -$ docker build . -t bbaldassari/maven-index-exporter --no-cache -``` - -An up-to-date docker image is also available on docker hub at -[bbaldassari/maven-index-exporter](https://hub.docker.com/r/bbaldassari/maven-index-exporter). - -``` -$ docker pull bbaldassari/maven-index-exporter -``` - -## How to use - -The Docker image uses volumes to exchanges files. Prepare a directory with -enough space disk (see warning below) and pass it to docker: - -``` -$ docker run -v /local/work/dir:/work bbaldassari/maven-index-exporter -``` - -Please note that the local work dir MUST be an absolute path, as docker won't -mount relative paths as volumes. - -For our purpose only the fld file is kept, so if you need other export files -you should simply edit the `extract_indexes.sh` script and comment the lines -that do the cleaning. - -### Running as cron - -The `run_full_export.py` script located in `resources` provides an easy way to run the -export as a cron batch job, and copy the resulting text export to a specific location. 
- -Simply use and adapt the crontab command as follows: - -``` -cd /home/boris/resources/ && /home/boris/resources/myvenv/bin/python /home/boris/resources/run_full_export.py https://repo.maven.apache.org/maven2/ /tmp/maven-index\ --exporter/ /var/www/html/maven_index_exporter/ 2>&1 > /home/boris/run_maven_exporter_$(date +"%Y%m%d-%H%M%S").log - -``` - -The script takes three mandatory arguments: - -``` -Usage: run_full_export.py url work_dir publish_dir - - url is the base url of the maven repository instance. - Example: https://repo.maven.apache.org/maven2/ - - work_dir must be an absolute path to the temp directory. - Example: /tmp/maven-index-exporter/ - - publish_dir must be an absolute path to the final directory. - Example: /var/www/html/ -``` - -It is recommended to setup a virtual environment to run the script. - -``` -$ python3 -m venv myvenv -$ source venv/bin/activate -``` - -Python modules to be installed are provided in the `requirements.txt` file. - -### size of generated files - -Beware that maven indexes are compressed and text export can become huge. -When executed on the maven central indexes (1.2 GB), the process generates -5.2 GB of intermediate files and 49 GB of final text data on disk: - -``` -$ du -sh /work/* -49G /work/export -5,2G /work/indexes -1,2G /work/nexus-maven-repository-index.gz -``` ## How to test (the quick way) -There is a bash script called `test_docker_image.sh` in the `resources/` directory, -simply execute it. Tests cover the creation of the docker image, and the results after -execution. +There is a bash script called `test_docker_image.sh` in the `scripts/` directory, +simply execute it. Tests cover the creation of the docker image, its execution, and the +resulting output. ``` $ bash test_docker_image.sh Script started on 20210911_181912. * Writing log to test_docker_image.log. * Docker image [maven-index-exporter] doesn't exist. * Building docker image. PASS: docker build returned 0. PASS: Docker image is listed. 
PASS: file [/home/boris/Projects/gh_maven-index-exporter/repository_test/export/_1.fld] has been created. PASS: file [/home/boris/Projects/gh_maven-index-exporter/repository_test/export/_1.fld] has 7 docs. PASS: file [/home/boris/Projects/gh_maven-index-exporter/repository_test/export/_1.fld] has 26 fields. PASS: file [/home/boris/Projects/gh_maven-index-exporter/repository_test/export/_1.fld] has sprova4j-0.1.0-sources.jar. PASS: file [/home/boris/Projects/gh_maven-index-exporter/repository_test/export/_1.fld] has sprova4j-0.1.0.pom. PASS: file [/home/boris/Projects/gh_maven-index-exporter/repository_test/export/_1.fld] has sprova4j-0.1.1-sources.jar. PASS: file [/home/boris/Projects/gh_maven-index-exporter/repository_test/export/_1.fld] has sprova4j-0.1.1.pom. $ ``` + ## How to test (the long road) This repository has a simple, almost-empty maven-indexer index that can be used to test the docker build. To use it, make sure that the directory `repository_test/` is present and run this command: ``` $ docker run -v $(pwd)/repository_test:/work bbaldassari/maven-index-exporter ``` The exported files will be stored in `repository_test/export/`, and output should look like this: ``` $ docker run -v $(pwd)/repository_test:/work bbaldassari/maven-index-exporter Docker Script started on 2021-08-27 06:32:22. # Checks.. 
* Content of /opt: total 32156 -rw------- 1 root root 18000742 Jan 8 2018 clue-6.2.0-1.0.0.jar -rw-r--r-- 1 root root 2574 Aug 25 18:28 extract_indexes.sh -rw------- 1 root root 14914610 Nov 28 2017 indexer-cli-6.0.0.jar drwxr-xr-x 3 root root 4096 Jun 29 16:23 java * Content of /work: total 36 -rw-r--r-- 1 1000 1000 254 Aug 26 09:21 nexus-maven-repository-index.1.gz -rw-r--r-- 1 1000 1000 32 Aug 26 09:21 nexus-maven-repository-index.1.gz.md5 -rw-r--r-- 1 1000 1000 40 Aug 26 09:21 nexus-maven-repository-index.1.gz.sha1 -rw-r--r-- 1 1000 1000 344 Aug 26 09:21 nexus-maven-repository-index.gz -rw-r--r-- 1 1000 1000 32 Aug 26 09:21 nexus-maven-repository-index.gz.md5 -rw-r--r-- 1 1000 1000 40 Aug 26 09:21 nexus-maven-repository-index.gz.sha1 -rw-r--r-- 1 1000 1000 193 Aug 26 09:21 nexus-maven-repository-index.properties -rw-r--r-- 1 1000 1000 32 Aug 26 09:21 nexus-maven-repository-index.properties.md5 -rw-r--r-- 1 1000 1000 40 Aug 26 09:21 nexus-maven-repository-index.properties.sha1 * Will read files from [/work/nexus-maven-repository-index.gz]. * Found file [/work/nexus-maven-repository-index.gz]. * Found indexer [/opt/indexer-cli-6.0.0.jar]. * Found clue [/opt/clue-6.2.0-1.0.0.jar]. * Java version:. openjdk version "11.0.11" 2021-04-20 OpenJDK Runtime Environment AdoptOpenJDK-11.0.11+9 (build 11.0.11+9) OpenJDK 64-Bit Server VM AdoptOpenJDK-11.0.11+9 (build 11.0.11+9, mixed mode) ############################# Unpacking [/work/nexus-maven-repository-index.gz] to /work/indexes SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder". SLF4J: Defaulting to no-operation (NOP) logger implementation SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details. Index Folder: /work Output Folder: /work/indexes Total time: 0 sec Final memory: 41M/1004M Unpacking finished on 2021-08-27 06:32:23. 
############################# Exporting indexes /work/indexes to /work/export no configuration found, using default configuration Analyzer: class org.apache.lucene.analysis.standard.StandardAnalyzer Query Builder: class com.senseidb.clue.api.DefaultQueryBuilder Directory Builder: class com.senseidb.clue.api.DefaultDirectoryBuilder IndexReader Factory: class com.senseidb.clue.api.DefaultIndexReaderFactory Term Bytesref Display: class com.senseidb.clue.api.StringBytesRefDisplay Payload Bytesref Display: class com.senseidb.clue.api.RawBytesRefDisplay exporting index to text Exporting finished on 2021-08-27 06:32:23. ############################# Cleaning useless files. Size before cleaning: 32.0K /work/export 28.0K /work/indexes 4.0K /work/nexus-maven-repository-index.1.gz 4.0K /work/nexus-maven-repository-index.1.gz.md5 4.0K /work/nexus-maven-repository-index.1.gz.sha1 4.0K /work/nexus-maven-repository-index.gz 4.0K /work/nexus-maven-repository-index.gz.md5 4.0K /work/nexus-maven-repository-index.gz.sha1 4.0K /work/nexus-maven-repository-index.properties 4.0K /work/nexus-maven-repository-index.properties.md5 4.0K /work/nexus-maven-repository-index.properties.sha1 * Removing useless exports. Keeping only fld text extract. Size after cleaning: 8.0K /work/export 28.0K /work/indexes 4.0K /work/nexus-maven-repository-index.1.gz 4.0K /work/nexus-maven-repository-index.1.gz.md5 4.0K /work/nexus-maven-repository-index.1.gz.sha1 4.0K /work/nexus-maven-repository-index.gz 4.0K /work/nexus-maven-repository-index.gz.md5 4.0K /work/nexus-maven-repository-index.gz.sha1 4.0K /work/nexus-maven-repository-index.properties 4.0K /work/nexus-maven-repository-index.properties.md5 4.0K /work/nexus-maven-repository-index.properties.sha1 * Make files modifiable by the end-user. Docker Script execution finished on 2021-08-27 06:32:23. 
``` The `_1.fld` file contains the fields for each document: ``` $ head repository_test/export/_1.fld doc 0 field 0 name u type string value al.aldi|sprova4j|0.1.0|sources|jar field 1 name m type string value 1626111735737 field 2 ``` ### Building the test repository The test repository `repository_test` can be rebuilt from the `repository_src` structure using [indexer-cli](https://search.maven.org/remotecontent?filepath=org/apache/maven/indexer/indexer-cli/6.0.0/indexer-cli-6.0.0.jar) with the following commands: ``` $ cd repository_src $ java -jar ~/Downloads/indexer-cli-6.0.0.jar -i index/ -d repository_test/ -r repo1 -s -c SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder". SLF4J: Defaulting to no-operation (NOP) logger implementation SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details. Repository Folder: /home/boris/Projects/maven-index-exporter/repository_src/repo1 Index Folder: /home/boris/Projects/maven-index-exporter/repository_src/index Output Folder: /home/boris/Projects/maven-index-exporter/repository_src/repository_test Repository name: index Indexers: [min, jarContent] Will create checksum files for all published files (sha1, md5). Will create incremental chunks for changes, along with baseline file. Scanning started Artifacts added: 2 Artifacts deleted: 0 Total time: 1 sec Final memory: 48M/1012M $ java -jar ~/Downloads/indexer-cli-6.0.0.jar -i index/ -d repository_test/ -r repo2 -s -c SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder". SLF4J: Defaulting to no-operation (NOP) logger implementation SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details. 
 Repository Folder: /home/boris/Projects/maven-index-exporter/repository_src/repo2 Index Folder: /home/boris/Projects/maven-index-exporter/repository_src/index Output Folder: /home/boris/Projects/maven-index-exporter/repository_src/repository_test Repository name: index Indexers: [min, jarContent] Will create checksum files for all published files (sha1, md5). Will create incremental chunks for changes, along with baseline file. Scanning started Artifacts added: 2 Artifacts deleted: 0 Total time: 0 sec Final memory: 7M/1012M $ java -jar ~/Downloads/indexer-cli-6.0.0.jar -i index/ -d repository_test/ -r repo3 -s -c SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder". SLF4J: Defaulting to no-operation (NOP) logger implementation SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details. Repository Folder: /home/boris/Projects/maven-index-exporter/repository_src/repo3 Index Folder: /home/boris/Projects/maven-index-exporter/repository_src/index Output Folder: /home/boris/Projects/maven-index-exporter/repository_src/repository_test Repository name: index Indexers: [min, jarContent] Will create checksum files for all published files (sha1, md5). Will create incremental chunks for changes, along with baseline file. Scanning started Artifacts added: 1 Artifacts deleted: 2 Total time: 0 sec Final memory: 8M/1012M $ ``` diff --git a/docs/maven_repositories.md b/docs/maven_repositories.md new file mode 100644 index 0000000..ba05382 --- /dev/null +++ b/docs/maven_repositories.md @@ -0,0 +1,119 @@ +A list of remote Maven repositories using [Maven Indexer](https://maven.apache.org/maven-indexer/) for their catalogue. + +# Introduction + +In the Maven ecosystem, dependencies and artefacts required to develop Java projects can be automatically downloaded from remote Maven repositories using a set of unique identifiers (aka coordinates): the groupId, artifactId and version. 
+ +Maven repositories use a standard directory structure for their hosting, which makes it easy to identify and download any artefact with its (groupId, artifactId, version) coordinates. Although it is technically not *required*, Maven repositories often provide an index of all the files they host, mostly for IDEs (e.g. Eclipse, IntelliJ IDEA, or NetBeans). These index files are usually generated with [Maven Indexer](https://maven.apache.org/maven-indexer/) and consist of gzipped Lucene indexes stored in a `.index/` directory at the root of the repository. + +The largest and most used Maven repository is of course [Maven Central](https://search.maven.org/), but there are many [other repositories](https://mvnrepository.com/repos/central) available. These are set up by individuals, companies and organisations to provide their own builds or domain-specific repositories. Since registering a repository is by no means necessary, there is, as far as we know, no exhaustive list of Maven repositories. + +The resources in this directory are an attempt to identify a list of Maven repository servers, as complete as possible. We also publish a list of servers that provide public indexes that can be analysed and exported with the [Maven index exporter](https://github.com/borisbaldassari/maven-index-exporter) Docker image. + +# Method + +## Build a list of URLs from poms + +We started from a dump of all pom files hosted on Maven Central (6.9 million XML files at the time of collection). 
For each pom we looked for XML nodes that can represent Maven repositories; starting from the root of the document and using XPath expressions we specifically looked for:
+
+* `.//m:repositories/m:repository`
+* `.//m:pluginRepository`
+* `.//m:distributionManagement/m:snapshotRepository`
+* `.//m:distributionManagement/m:repository`
+
+The transformation can be reproduced with the scripts in the `scripts/` directory:
+
+```
+time bash extract_repositories_from_stock.sh list_poms.txt | tee extract.log
+```
+
+The full execution took 61 hours and produced a list of "only" 928808 lines. Each line provides the origin of the URL in the POM, the repository id, and the URL itself.
+
+```
+distrib_snapshot,ossrh,https://oss.sonatype.org/content/repositories/snapshots
+distrib_repo,ossrh,https://oss.sonatype.org/service/local/staging/deploy/maven2/
+```
+
+## Download properties
+
+The resulting set contains many duplicate, non-existent, private or invalid URLs.
+
+To make sure that we only list publicly available servers, we tried to download the Maven index properties file from every server. This properties file is mandatory in Maven Indexer; it can be found at `.index/nexus-maven-repository-index.properties` and contains the list of incremental updates to the index.
+
+The sequence of actions is as follows:
+
+* Remove printed comments, sort and remove duplicate lines:
+
+```
+grep -Ev "^# " extract.log | sort -u > extract_uniq.txt
+```
+
+* Extract the list of URLs (3rd column) and keep only http(s) links:
+
+```
+cut -d, -f 3 extract_uniq.txt | grep -E '^http' > list_urls.txt
+```
+
+* The output list has 7145 URLs to test. For each item, we try to get the file in `/.index/nexus-maven-repository-index.properties`. If it yields a file, save it.
+
+```shell
+SUFFIX="/.index/nexus-maven-repository-index.properties"
+
+for url in `cat list_urls.txt`; do
+    echo "Testing URL [$url]."
+    full_url="${url}${SUFFIX}"
+    name=$(echo "$url" | cut -d/ -f 3- | tr '/' '_')
+    full_name="${name}.properties"
+    echo "  Writing to [$full_name]."
+    wget -O servers/"$full_name" --tries=2 "$full_url" &
+done
+```
+
+* This downloads 3820 properties files into the `servers/` directory. Most of them are empty or contain invalid information; once these are discarded, only the files that contain an actual list of Maven Indexer compressed files remain.
+* Rebuild the list of URLs by removing 404s (i.e. servers that did not create a file). Remove trailing slashes to prevent duplicates, sort and make unique:
+
+```shell
+for f in `ls ../servers/`; do
+    url=$(echo "${f%.properties}" | tr '_' '/');
+    grep "${url%/}" list_urls_full.txt;
+done | sed 's:/*$::' | sort -u > list_urls_final.txt
+```
+
+The result is a list of 339 unique URLs, which can be downloaded here:
+
+[list_urls_final.txt](https://files.nuclino.com/files/e75205b3-354e-4794-a43a-d9f98ad08039/list_urls_final.txt)
+
+## Checking compatibility
+
+To ensure that these repositories can actually be parsed with the Maven index exporter, there is no better way than parsing them and generating the index and text export. For this, we first need to download all indexes from all servers:
+
+```
+bash scripts/convert_url_to_repo.sh
+```
+
+This relies on the list of directories downloaded previously, and generates a subdirectory for each server, containing the index files. If the index files already exist they won't be downloaded again.
+
+The next step is to execute the docker image from [bbaldassari/maven-index-exporter](https://github.com/borisbaldassari/maven-index-exporter) to export all text indexes in `/export/`.
+
+```shell
+mkdir -p ../maven_repositories/
+for i in `ls`; do
+    time docker run -v "/data/work/$i:/work" bbaldassari/maven-index-exporter | tee "../logs/$i.log";
+    mv "$i/" ../maven_repositories/;
+done
+```
+
+This again filters out some servers, namely those that use a Maven Indexer version that is not compatible with the Docker image.
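
The POM-scanning step described under "Build a list of URLs from poms" can be illustrated with a short Python sketch. This is not the content of `extract_repositories_from_stock.sh`, which is not shown here; it only applies the four XPath expressions listed above with the standard library. The function name `extract_repositories` and the origin labels `repository` and `plugin_repository` are hypothetical; `distrib_snapshot` and `distrib_repo` are taken from the sample output.

```python
import xml.etree.ElementTree as ET

# Namespace of Maven POM files; the "m" prefix matches the XPath list above.
NS = {"m": "http://maven.apache.org/POM/4.0.0"}

# (origin label, XPath) pairs. The first two labels are hypothetical.
QUERIES = [
    ("repository", ".//m:repositories/m:repository"),
    ("plugin_repository", ".//m:pluginRepository"),
    ("distrib_snapshot", ".//m:distributionManagement/m:snapshotRepository"),
    ("distrib_repo", ".//m:distributionManagement/m:repository"),
]


def extract_repositories(pom_text):
    """Return (origin, id, url) tuples for each repository node in a POM."""
    root = ET.fromstring(pom_text)
    rows = []
    for origin, xpath in QUERIES:
        for node in root.findall(xpath, NS):
            repo_id = (node.findtext("m:id", default="", namespaces=NS) or "").strip()
            url = (node.findtext("m:url", default="", namespaces=NS) or "").strip()
            if url:  # nodes without a URL are of no use for the list
                rows.append((origin, repo_id, url))
    return rows


# Small hand-written POM fragment used only for illustration.
SAMPLE_POM = """<project xmlns="http://maven.apache.org/POM/4.0.0">
  <distributionManagement>
    <snapshotRepository>
      <id>ossrh</id>
      <url>https://oss.sonatype.org/content/repositories/snapshots</url>
    </snapshotRepository>
    <repository>
      <id>ossrh</id>
      <url>https://oss.sonatype.org/service/local/staging/deploy/maven2/</url>
    </repository>
  </distributionManagement>
</project>"""

for row in extract_repositories(SAMPLE_POM):
    print(",".join(row))
```

Each printed line has the same `origin,id,url` shape as the sample output shown earlier, so the result could be fed to the same `sort -u` and `cut` steps.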
+ +# Result + +The final list contains only Maven repositories that: + +* use Maven Indexer for their indexing, +* are publicly available, +* are still available as of 2021-11-20, and +* can be extracted using the Maven index exporter Docker image. + +List of downloads: + +* list_maven_servers_with_indexes.txt \ No newline at end of file diff --git a/docs/maven_repositories/list_maven_ok.txt b/docs/maven_repositories/list_maven_ok.txt new file mode 100644 index 0000000..faf9f42 --- /dev/null +++ b/docs/maven_repositories/list_maven_ok.txt @@ -0,0 +1,307 @@ +http://apps.geomajas.org/nexus/content/repositories/snapshots/ +http://artifactory.javassh.com/opensource-releases/ +http://artifactory.javassh.com/opensource-snapshots/ +http://artifactory.mycore.de/libs-releases-local/ +http://artifactory.mycore.de/libs-snapshots-local/ +http://artifacts.codice.org/content/groups/public.tar.bz2 +http://artifacts.metaborg.org/content/repositories/releases/ +http://artifacts.metaborg.org/content/repositories/snapshots/ +http://bp-cms-commons.sourceforge.net/m2repo/ +http://ci.qaprosoft.com/8081/nexus/content/repositories/snapshots/ +http://clojars.org/repo/ +http://dist.wso2.org/maven2/ +http://dist.wso2.org/snapshots/maven2/ +http://files.couchbase.com/maven2/ +http://forum.soapui.org/repository/maven2/ +http://java.freehep.org/maven2/ +http://m2.duraspace.org/content/repositories/thirdparty/ +http://maven.alfresco.com/nexus/content/repositories/releases/ +http://maven.atlassian.com/public/ +http://maven.atlassian.com/repository/public/ +http://maven.ecs.soton.ac.uk/content/repositories/openimaj-releases/ +http://maven.ecs.soton.ac.uk/content/repositories/openimaj-snapshots/ +http://maven.geomajas.org/ +http://maven.imagej.net/content/groups/public/ +http://maven.imagej.net/content/repositories/thirdparty/ +http://maven.inria.fr/artifactory/malai-public-snapshot/ +http://maven.inria.fr/artifactory/spoon-public-snapshot/ +http://maven.jahia.org/maven2/ 
+http://maven.jarch.com.br/releases/ +http://maven.jarch.com.br/snapshots/ +http://maven.java.net/content/repositories/releases/ +http://maven.java.net/content/repositories/snapshots/ +http://maven.java.net/content/repositories/staging/ +http://maven.nuxeo.org/nexus/content/groups/public/ +http://maven.objectstyle.org/nexus/content/groups/cayenne-deps/ +http://maven.objectstyle.org/nexus/content/groups/linkrest/ +http://maven.objectstyle.org/nexus/content/repositories/bootique-snapshots/ +http://maven.objectstyle.org/nexus/content/repositories/linkrest-snapshots/ +http://maven.objectweb.org/maven2/ +http://maven.onehippo.com/maven2/ +http://maven.openimaj.org/ +http://maven.ow2.org/maven2/ +http://mavenrepo.openmrs.org/nexus/content/repositories/releases/ +http://mavenrepo.openmrs.org/nexus/content/repositories/snapshots/ +http://maven.repository.redhat.com/earlyaccess/all/ +http://maven.repository.redhat.com/techpreview/all/ +http://mavensync.zkoss.org/maven2/ +http://maven.vaadin.com/vaadin-addons/ +http://maven.vaadin.com/vaadin-prereleases/ +http://maven.wso2.org/nexus/content/groups/wso2-public/ +http://maven.wso2.org/nexus/content/repositories/releases/ +http://maven.wso2.org/nexus/content/repositories/snapshots/ +http://maven.xwiki.org/releases/ +http://nexus.ala.org.au/content/repositories/releases/ +http://nexus.ala.org.au/content/repositories/snapshots/ +http://nexus.fd.io/content/repositories/fd.io.release/ +http://nexus.nuiton.org/nexus/content/groups/scmwebeditor/ +http://nexus.opencast.org/nexus/content/groups/public/ +http://nexus.opendaylight.org/content/groups/public/ +http://nexus.opendaylight.org/content/repositories/opendaylight.release/ +http://nexus.opendaylight.org/content/repositories/opendaylight.snapshot/ +http://nexus.opendaylight.org/content/repositories/public/ +http://nexus.synyx.de/content/repositories/public-releases/ +http://nexus.synyx.de/content/repositories/public-snapshots/ +http://nexus.xwiki.org/nexus/content/groups/public/ 
+http://nexus.xwiki.org/nexus/content/groups/public-snapshots/ +http://nexus.xwiki.org/nexus/content/repositories/releases/ +http://nexus.xwiki.org/nexus/content/repositories/snapshots/ +http://nexus.xwiki.org/nexus/content/repositories/snapshots// +http://nexus.yifengx.com/content/repositories/releases/ +http://origin-repository.jboss.org/nexus/content/groups/ea/ +http://oss.sonatype.org/content/repositories/releases/ +http://repo.adobe.com/nexus/content/groups/public/ +http://repo.anahata.uno/artifactory/anahata-public/ +http://repo.basepoint.su/content/groups/public/ +http://repo.evolvedbinary.com/repository/exist-db/ +http://repo.fusesource.com/maven2/ +http://repo.fusesource.com/maven2// +http://repo.fusesource.com/nexus/content/groups/public/ +http://repo.fusesource.com/nexus/content/repositories/releases/ +http://repo.fusesource.com/nexus/content/repositories/snapshots/ +http://repo.hedgecode.org/content/repositories/releases/ +http://repo.hedgecode.org/content/repositories/releases// +http://repo.hedgecode.org/content/repositories/snapshots/ +http://repo.hedgecode.org/content/repositories/snapshots// +http://repo.heigit.org/artifactory/main/ +http://repo.jenkins-ci.org/public/ +http://repo.jenkins-ci.org/releases/ +http://repository.apache.org/content/groups/snapshots/ +http://repository.apache.org/content/groups/snapshots// +http://repository.apache.org/content/groups/snapshots-group/ +http://repository.apache.org/content/repositories/snapshots/ +http://repository.apache.org/snapshots/ +http://repository.exoplatform.org/content/repositories/juzu-snapshots/ +http://repository.exoplatform.org/public/ +http://repository.jboss.org/maven2/ +http://repository.mulesoft.org/releases/ +http://repository.mulesoft.org/snapshots/ +http://repository.ow2.org/nexus/content/repositories/releases/ +http://repository.ow2.org/nexus/content/repositories/snapshots/ +http://repository.sonatype.org/content/groups/flexgroup/ +http://repository.sonatype.org/content/groups/forge/ 
+http://repository.sonatype.org/content/groups/public/ +http://repository.sonatype.org/content/groups/sonatype-public-grid/ +http://repository.sonatype.org/content/repositories/flexmojos-releases/ +http://repository.sonatype.org/content/repositories/flexmojos-snapshots/ +http://repository.sonatype.org/content/repositories/nexus-plugins-snapshots/ +http://repository.sonatype.org/content/repositories/releases/ +http://repository.sonatype.org/content/repositories/snapshots/ +http://repository.sonatype.org/content/repositories/tycho-pseudo-releases/ +http://repo.spring.io/libs-milestone/ +http://repo.spring.io/libs-milestone-local/ +http://repo.spring.io/libs-release/ +http://repo.spring.io/libs-release-local/ +http://repo.spring.io/libs-snapshot/ +http://repo.spring.io/libs-snapshot-local/ +http://repo.spring.io/milestone/ +http://repo.spring.io/milestone// +http://repo.spring.io/plugins-release/ +http://repo.spring.io/plugins-release// +http://repo.spring.io/plugins-release-local/ +http://repo.spring.io/plugins-snapshot-local/ +http://repo.spring.io/release/ +http://repo.spring.io/snapshot/ +http://repos.zeroturnaround.com/nexus/content/groups/zt-public/ +http://repo.terasoluna.org/nexus/content/repositories/terasoluna-batch-releases/ +http://repo.terasoluna.org/nexus/content/repositories/terasoluna-batch-snapshots/ +http://repo.terasoluna.org/nexus/content/repositories/terasoluna-gfw-releases/ +http://repo.terasoluna.org/nexus/content/repositories/terasoluna-gfw-snapshots/ +https://artifactory.openntf.org/openntf/ +https://artifacts.alfresco.com/nexus/content/groups/public/ +https://artifacts.alfresco.com/nexus/content/groups/public-snapshots/ +https://artifacts.alfresco.com/nexus/content/repositories/activiti-releases/ +https://artifacts.alfresco.com/nexus/content/repositories/activiti-snapshots/ +https://artifacts.alfresco.com/nexus/content/repositories/public/ +https://artifacts.alfresco.com/nexus/content/repositories/releases/ 
+https://artifacts.alfresco.com/nexus/content/repositories/snapshots/ +https://artifacts.metaborg.org/content/repositories/snapshots/ +https://artifacts.metaborg.org/content/repositories/snapshots// +https://artifacts-oss.talend.com/nexus/content/repositories/TalendOpenSourceRelease/ +https://artifacts-oss.talend.com/nexus/content/repositories/TalendOpenSourceSnapshot/ +https://artifacts-zl.talend.com/nexus/content/repositories/TalendOpenSourceRelease/ +https://artifacts-zl.talend.com/nexus/content/repositories/TalendOpenSourceSnapshot/ +https://build.shibboleth.net/nexus/content/repositories/public/ +https://build.shibboleth.net/nexus/content/repositories/releases/ +https://build.shibboleth.net/nexus/content/repositories/snapshots/ +https://ci.qaprosoft.com/nexus/content/repositories/snapshots/ +https://clojars.org/repo/ +https://dev.majordodo.org/nexus/content/repositories/releases/ +https://dev.majordodo.org/nexus/content/repositories/snapshots/ +https://dev.majordodo.org/nexus/content/repositories/snapshots// +https://dist.wso2.org/maven2/ +https://jakarta.oss.sonatype.org/content/repositories/snapshots/ +https://jakarta.oss.sonatype.org/content/repositories/staging/ +https://jitpack.io/ +https://m2.duraspace.org/content/repositories/releases/ +https://m2.duraspace.org/content/repositories/snapshots/ +https://m2.duraspace.org/content/repositories/thirdparty/ +https://m2proxy.atlassian.com/repository/public/ +https://maven.atlassian.com/central-snapshot/ +https://maven.atlassian.com/content/groups/public/ +https://maven.atlassian.com/content/repositories/atlassian-public/ +https://maven.atlassian.com/private/ +https://maven.atlassian.com/private-snapshot/ +https://maven.atlassian.com/public/ +https://maven.atlassian.com/public-snapshot/ +https://maven.atlassian.com/repository/public/ +https://maven-central.storage-download.googleapis.com/repos/central/data/ +https://maven.imagej.net/content/groups/public/ +https://maven.java.net/content/groups/promoted/ 
+https://maven.java.net/content/repositories/promoted/ +https://maven.java.net/content/repositories/promoted// +https://maven.java.net/content/repositories/releases/ +https://maven.java.net/content/repositories/releases// +https://maven.java.net/content/repositories/snapshots/ +https://maven.java.net/content/repositories/staging/ +https://maven.mag-news.it/content/repositories/releases/ +https://maven.objectstyle.org/nexus/content/groups/cayenne-deps/ +https://maven.objectstyle.org/nexus/content/groups/linkrest/ +https://maven.objectstyle.org/nexus/content/repositories/bootique-snapshots/ +https://maven.objectstyle.org/nexus/content/repositories/linkrest-snapshots/ +https://maven.oracle.com/ +https://mavenrepo.openmrs.org/releases/ +https://mavenrepo.openmrs.org/snapshots/ +https://maven.repository.redhat.com/earlyaccess/all/ +https://maven.repository.redhat.com/ga// +https://maven.repository.redhat.com/techpreview/all// +https://maven.scijava.org/content/groups/public/ +https://maven.vaadin.com/vaadin-prereleases// +https://maven.wso2.org/nexus/content/groups/wso2-public/ +https://maven.wso2.org/nexus/content/repositories/releases/ +https://maven.wso2.org/nexus/content/repositories/snapshots/ +https://nexus.ala.org.au/content/repositories/releases/ +https://nexus.ala.org.au/content/repositories/snapshots/ +https://nexus.nuiton.org/nexus/content/repositories/central-releases/ +https://nexus.nuiton.org/nexus/content/repositories/snapshots/ +https://nexus.opencast.org/nexus/content/groups/public/ +https://nexus.xwiki.org/nexus/content/groups/public/ +https://nexus.xwiki.org/nexus/content/groups/public-snapshots/ +https://nexus.xwiki.org/nexus/content/repositories/snapshots/ +https://origin-repository.jboss.org/nexus/content/groups/ea/ +https://oss.sonatype.org/content/repositories/releases/ +https://packages.atlassian.com/maven/central/ +https://packages.atlassian.com/maven/central-snapshot/ +https://packages.atlassian.com/maven/public/ 
+https://packages.atlassian.com/maven/public-snapshot/ +https://packages.atlassian.com/maven/repository/public/ +https://packages.atlassian.com/mvn/maven-external/ +https://repo1.maven.org/maven2/ +https://repo.adobe.com/nexus/content/groups/public/ +https://repo.eclipse.org/content/groups/cbi/ +https://repo.eclipse.org/content/groups/microprofile/ +https://repo.eclipse.org/content/groups/releases// +https://repo.eclipse.org/content/groups/snapshots/ +https://repo.eclipse.org/content/repositories/californium-releases// +https://repo.eclipse.org/content/repositories/californium-snapshots/ +https://repo.eclipse.org/content/repositories/cbi/ +https://repo.eclipse.org/content/repositories/cbi-releases// +https://repo.eclipse.org/content/repositories/cbi-snapshots// +https://repo.eclipse.org/content/repositories/dash-licenses-releases/ +https://repo.eclipse.org/content/repositories/dash-licenses-snapshots// +https://repo.eclipse.org/content/repositories/ditto-releases/ +https://repo.eclipse.org/content/repositories/ditto-snapshots/ +https://repo.eclipse.org/content/repositories/ebr-releases/ +https://repo.eclipse.org/content/repositories/ebr-snapshots/ +https://repo.eclipse.org/content/repositories/ecf-releases/ +https://repo.eclipse.org/content/repositories/ecf-snapshots/ +https://repo.eclipse.org/content/repositories/geomesa-releases/ +https://repo.eclipse.org/content/repositories/geomesa-snapshots/ +https://repo.eclipse.org/content/repositories/hawkbit-releases/ +https://repo.eclipse.org/content/repositories/hawkbit-snapshots/ +https://repo.eclipse.org/content/repositories/hono-releases/ +https://repo.eclipse.org/content/repositories/hono-snapshots/ +https://repo.eclipse.org/content/repositories/jax-rs-api-releases/ +https://repo.eclipse.org/content/repositories/jax-rs-api-snapshots/ +https://repo.eclipse.org/content/repositories/jgit-releases/ +https://repo.eclipse.org/content/repositories/jgit-snapshots/ +https://repo.eclipse.org/content/repositories/jts-snapshots/ 
+https://repo.eclipse.org/content/repositories/leshan-releases/ +https://repo.eclipse.org/content/repositories/leshan-snapshots/ +https://repo.eclipse.org/content/repositories/lyo-releases/ +https://repo.eclipse.org/content/repositories/lyo-snapshots/ +https://repo.eclipse.org/content/repositories/microprofile-releases/ +https://repo.eclipse.org/content/repositories/microprofile-snapshots/ +https://repo.eclipse.org/content/repositories/nattable-releases/ +https://repo.eclipse.org/content/repositories/nattable-snapshots/ +https://repo.eclipse.org/content/repositories/paho-releases/ +https://repo.eclipse.org/content/repositories/paho-snapshots/ +https://repo.eclipse.org/content/repositories/proj4j-releases/ +https://repo.eclipse.org/content/repositories/proj4j-snapshots/ +https://repo.eclipse.org/content/repositories/releases// +https://repo.eclipse.org/content/repositories/scout-releases/ +https://repo.eclipse.org/content/repositories/scout-snapshots/ +https://repo.eclipse.org/content/repositories/snapshots/ +https://repo.eclipse.org/content/repositories/tycho-snapshots/ +https://repo.eclipse.org/content/repositories/yasson-releases/ +https://repo.eclipse.org/content/repositories/yasson-snapshots/ +https://repo.fusesource.com/nexus/content/groups/public/ +https://repo.hedgecode.org/content/repositories/releases/ +https://repo.hedgecode.org/content/repositories/snapshots/ +https://repo.heigit.org/artifactory/main/ +https://repo.huaweicloud.com/repository/maven/ +https://repo.jenkins-ci.org/public/ +https://repo.jenkins-ci.org/releases// +https://repo.locationtech.org/content/repositories/geomesa-releases/ +https://repo.locationtech.org/content/repositories/geomesa-snapshots/ +https://repo.locationtech.org/content/repositories/jts-snapshots/ +https://repo.locationtech.org/content/repositories/proj4j-releases/ +https://repo.locationtech.org/content/repositories/proj4j-snapshots/ +https://repo.openminted.eu/content/repositories/snapshots/ 
+https://repo.osgeo.org/repository/geotools-releases/ +https://repo.osgeo.org/repository/geotools-snapshots/ +https://repo.osgeo.org/repository/release/ +https://repo.osgeo.org/repository/snapshot/ +https://repository.apache.org/content/groups/public// +https://repository.apache.org/content/groups/snapshots// +https://repository.apache.org/content/groups/snapshots-group// +https://repository.apache.org/content/groups/staging/ +https://repository.apache.org/content/repositories/releases// +https://repository.apache.org/content/repositories/snapshots// +https://repository.apache.org/snapshots/ +https://repository.cloudera.com/artifactory/cloudera-repos/ +https://repository.cloudera.com/artifactory/ext-release-local/ +https://repository.jboss.org/maven2/ +https://repository.jboss.org/nexus/content/repositories/deprecated/ +https://repository.jboss.org/nexus/content/repositories/fs-releases/ +https://repository.jboss.org/nexus/content/repositories/fs-snapshots/ +https://repository.jboss.org/nexus/content/repositories/releases/ +https://repository.jboss.org/nexus/content/repositories/thirdparty-releases/ +https://repository.liferay.com/nexus/content/groups/public/ +https://repository.liferay.com/nexus/content/repositories/liferay-public-releases/ +https://repository.liferay.com/nexus/content/repositories/liferay-releases-ce/ +https://repository-master.mulesoft.org/nexus/content/repositories/releases// +https://repository-master.mulesoft.org/nexus/content/repositories/snapshots/ +https://repository-master.mulesoft.org/releases/ +https://repository-master.mulesoft.org/snapshots/ +https://repository.mulesoft.org/nexus/content/repositories/public/ +https://repository.mulesoft.org/nexus/content/repositories/releases/ +https://repository.mulesoft.org/nexus/content/repositories/snapshots/ +https://repository.mulesoft.org/releases// +https://repository.mulesoft.org/snapshots// +https://repository.ow2.org/nexus/content/repositories/snapshots/ 
+https://repository.sonatype.org/content/groups/flexgroup/
+https://repository.sonatype.org/content/groups/forge/
+https://repository.sonatype.org/content/groups/forge//
+https://repository.sonatype.org/content/groups/public/
+https://repository.sonatype.org/content/groups/sonatype-public-grid//
diff --git a/docs/run_maven-index-exporter.md b/docs/run_maven-index-exporter.md
new file mode 100644
index 0000000..3174d3b
--- /dev/null
+++ b/docs/run_maven-index-exporter.md
@@ -0,0 +1,60 @@
+
+# Run Maven index exporter
+
+
+## Running the full export
+
+The `run_full_export.py` script located in `scripts/` provides an easy way to run the
+export as a cron batch job, and copy the resulting text export to a specific location.
+
+
+## Running the image only
+
+The Docker image uses volumes to exchange files. Prepare a directory with
+enough disk space (see warning below) and pass it to docker:
+
+```
+$ docker run -v /local/work/dir:/work bbaldassari/maven-index-exporter
+```
+
+Please note that the local work dir MUST be an absolute path, as docker won't
+mount relative paths as volumes.
+
+For our purpose only the fld file is kept, so if you need other export files
+you should simply edit the `extract_indexes.sh` script and comment out the lines
+that do the cleaning. Then rebuild the Docker image and run it.
+
+
+## Running as cron
+
+The `run_full_export.py` script located in `scripts/` provides an easy way to run the
+export as a cron batch job, and copy the resulting text export to a specific location.
+
+Simply use and adapt the crontab command as follows:
+
+```
+cd /home/boris/maven-index-exporter/scripts/ && /home/boris/maven-index-exporter/scripts/myvenv/bin/python /home/boris/maven-index-exporter/scripts/run_full_export.py https://repo.maven.apache.org/maven2/ /tmp/maven-index\
+-exporter/ /var/www/html/maven_index_exporter/ > /home/boris/run_maven_exporter_$(date +"%Y%m%d-%H%M%S").log 2>&1
+
+```
+
+The script takes three mandatory arguments:
+
+```
+Usage: run_full_export.py url work_dir publish_dir
+ - url is the base url of the maven repository instance.
+   Example: https://repo.maven.apache.org/maven2/
+ - work_dir must be an absolute path to the temp directory.
+   Example: /tmp/maven-index-exporter/
+ - publish_dir must be an absolute path to the final directory.
+   Example: /var/www/html/
+```
+
+It is recommended to set up a virtual environment to run the script.
+
+```
+$ python3 -m venv myvenv
+$ source myvenv/bin/activate
+```
+
+The Python modules to install are listed in the `requirements.txt` file.
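
As an illustration of the three-argument contract in the usage text above, here is a minimal sketch of how such arguments could be checked before starting an export. `check_arguments` is a hypothetical helper written for this example; the actual `run_full_export.py` is not shown here and may handle its arguments differently.

```python
import os
from urllib.parse import urlparse


def check_arguments(argv):
    """Validate [url, work_dir, publish_dir]; return them or raise ValueError."""
    if len(argv) != 3:
        raise ValueError("Usage: run_full_export.py url work_dir publish_dir")
    url, work_dir, publish_dir = argv
    # The url argument is the base URL of the Maven repository instance.
    if urlparse(url).scheme not in ("http", "https"):
        raise ValueError(f"url must be an http(s) URL: {url}")
    # Both directories must be absolute paths, as stated in the usage text.
    for name, path in (("work_dir", work_dir), ("publish_dir", publish_dir)):
        if not os.path.isabs(path):
            raise ValueError(f"{name} must be an absolute path: {path}")
    return url, work_dir, publish_dir


if __name__ == "__main__":
    # Example values taken from the usage text above.
    print(check_arguments([
        "https://repo.maven.apache.org/maven2/",
        "/tmp/maven-index-exporter/",
        "/var/www/html/",
    ]))
```

Failing early on a relative path avoids the situation described under "Running the image only", where docker won't mount relative paths as volumes.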