diff --git a/README.md b/README.md
index 2018d86..43c2060 100644
--- a/README.md
+++ b/README.md
@@ -1,319 +1,335 @@
# Maven Index Exporter
This Docker image reads a Maven Indexer index and extract information about
the indexed documents as a convenient text file.
## Sequence
The index files can be dowloaded from any maven repository that uses
maven-indexer, like maven central:
https://repo1.maven.org/maven2/.index/
Copy all files (i.e. the main index, the updates and properties file) into
the volume directory (`$WORKDIR`). It will be mounted as `/work/` in the
docker image.
The export is then achieved in two steps:
* Unpack the Lucene indexes from the Maven Indexer indexes using
`maven-indexer-cli`. The command used is:
-````
+```
$ java --illegal-access=permit -jar $INDEXER_JAR \
--unpack $FILE_IN \
--destination $WORKDIR/indexes/ \
--type full
-````
+```
This generates a set of binary lucene files as shown below:
-````
+```
$ ls -lh $WORKDIR/indexes/
total 5,2G
-rw-r--r-- 1 root root 500M juil. 7 22:06 _4m.fdt
-rw-r--r-- 1 root root 339K juil. 7 22:06 _4m.fdx
-rw-r--r-- 1 root root 2,2K juil. 7 22:07 _4m.fnm
-rw-r--r-- 1 root root 166M juil. 7 22:07 _4m_Lucene50_0.doc
-rw-r--r-- 1 root root 147M juil. 7 22:07 _4m_Lucene50_0.pos
[SNIP]
-rw-r--r-- 1 root root 363 juil. 7 22:06 _e0.si
-rw-r--r-- 1 root root 1,7K juil. 7 22:07 segments_2
-rw-r--r-- 1 root root 8 juil. 7 21:54 timestamp
-````
+```
* Export the Lucene documents from the Lucene indexes using `clue`. This
generates a set of text files as shown below:
-````
+```
$ java --illegal-access=permit -jar $JAR_CLUE $WORKDIR/indexes/ \
export $WORKDIR/export/ text
-````
+```
This generates a bunch of text files relating to the Lucene indexes, made
available in `$WORKDIR/export/`. For our purpose we only keep the `*.fld`
file that includes the indexed documents.
## Output
The clue command is documented on [its github page](https://github.com/javasoze/clue).
The indexed Lucene documents are located in the `*.fld` file.
A description of the fields used by maven-indexer can be found in the project's
API docs:
https://maven.apache.org/maven-indexer-archives/maven-indexer-6.0.0/indexer-core/apidocs/org/apache/maven/index/ArtifactInfo.html
## How to build
The build downloads binaries for both tools (maven-indexer-cli and clue), so make sure there is an internet connection.
Go to the `docker/` dorectory and issue the folowing command:
```
$ docker build . -t bbaldassari/maven-index-exporter --no-cache
```
An up-to-date docker image is also available on docker hub at
[bbaldassari/maven-index-exporter](https://hub.docker.com/r/bbaldassari/maven-index-exporter).
-````
+```
$ docker pull bbaldassari/maven-index-exporter
-````
+```
## How to use
The Docker image uses volumes to exchanges files. Prepare a directory with
enough space disk (see warning below) and pass it to docker:
```
$ docker run -v /local/work/dir:/work bbaldassari/maven-index-exporter
```
Please note that the local work dir MUST be an absolute path, as docker won't
mount relative paths as volumes.
For our purpose only the fld file is kept, so if you need other export files
you should simply edit the `extract_indexes.sh` script and comment the lines
that do the cleaning.
### Running as cron
The `run_full_export.py` script located in `resources` provides an easy way to run the
export as a cron batch job, and copy the resulting text export to a specific location.
Simply use and adapt the crontab command as follows:
```
cd /home/boris/resources/ && /home/boris/resources/myvenv/bin/python /home/boris/resources/run_full_export.py https://repo.maven.apache.org/maven2/ /tmp/maven-index\
-exporter/ /var/www/html/maven_index_exporter/ 2>&1 > /home/boris/run_maven_exporter_$(date +"%Y%m%d-%H%M%S").log
```
The script takes three mandatory arguments:
```
Usage: run_full_export.py url work_dir publish_dir
- url is the base url of the maven repository instance.
Example: https://repo.maven.apache.org/maven2/
- work_dir must be an absolute path to the temp directory.
Example: /tmp/maven-index-exporter/
- publish_dir must be an absolute path to the final directory.
Example: /var/www/html/
```
It is recommended to setup a virtual environment to run the script.
```
$ python3 -m venv myvenv
$ source venv/bin/activate
```
Python modules to be installed are provided in the `requirements.txt` file.
### size of generated files
Beware that maven indexes are compressed and text export can become huge.
When executed on the maven central indexes (1.2 GB), the process generates
5.2 GB of intermediate files and 49 GB of final text data on disk:
```
$ du -sh /work/*
49G /work/export
5,2G /work/indexes
1,2G /work/nexus-maven-repository-index.gz
```
## How to test (the quick way)
There is a bash script called `test_docker_image.sh` in the `resources/` directory,
simply execute it. Tests cover the creation of the docker image, and the results after
execution.
-````
+```
$ bash test_docker_image.sh
Script started on 20210911_181912.
* Writing log to test_docker_image.log.
* Docker image [maven-index-exporter] doesn't exist.
* Building docker image.
PASS: docker build returned 0.
PASS: Docker image is listed.
PASS: file [/home/boris/Projects/gh_maven-index-exporter/repository_test/export/_1.fld] has been created.
PASS: file [/home/boris/Projects/gh_maven-index-exporter/repository_test/export/_1.fld] has 7 docs.
PASS: file [/home/boris/Projects/gh_maven-index-exporter/repository_test/export/_1.fld] has 26 fields.
PASS: file [/home/boris/Projects/gh_maven-index-exporter/repository_test/export/_1.fld] has sprova4j-0.1.0-sources.jar.
PASS: file [/home/boris/Projects/gh_maven-index-exporter/repository_test/export/_1.fld] has sprova4j-0.1.0.pom.
PASS: file [/home/boris/Projects/gh_maven-index-exporter/repository_test/export/_1.fld] has sprova4j-0.1.1-sources.jar.
PASS: file [/home/boris/Projects/gh_maven-index-exporter/repository_test/export/_1.fld] has sprova4j-0.1.1.pom.
$
-````
+```
## How to test (the long road)
This repository has a simple, almost-empty maven-indexer index that can be used to test the docker build. To use it, make sure that the directory `repository_test/` is present and run this command:
-````
+```
$ docker run -v $(pwd)/repository_test:/work bbaldassari/maven-index-exporter
-````
+```
The exported files will be stored in `repository_test/export/`, and output should look like this:
-````
+```
$ docker run -v $(pwd)/repository_test:/work bbaldassari/maven-index-exporter
Docker Script started on 2021-08-27 06:32:22.
# Checks..
* Content of /opt:
total 32156
-rw------- 1 root root 18000742 Jan 8 2018 clue-6.2.0-1.0.0.jar
-rw-r--r-- 1 root root 2574 Aug 25 18:28 extract_indexes.sh
-rw------- 1 root root 14914610 Nov 28 2017 indexer-cli-6.0.0.jar
drwxr-xr-x 3 root root 4096 Jun 29 16:23 java
* Content of /work:
total 36
-rw-r--r-- 1 1000 1000 254 Aug 26 09:21 nexus-maven-repository-index.1.gz
-rw-r--r-- 1 1000 1000 32 Aug 26 09:21 nexus-maven-repository-index.1.gz.md5
-rw-r--r-- 1 1000 1000 40 Aug 26 09:21 nexus-maven-repository-index.1.gz.sha1
-rw-r--r-- 1 1000 1000 344 Aug 26 09:21 nexus-maven-repository-index.gz
-rw-r--r-- 1 1000 1000 32 Aug 26 09:21 nexus-maven-repository-index.gz.md5
-rw-r--r-- 1 1000 1000 40 Aug 26 09:21 nexus-maven-repository-index.gz.sha1
-rw-r--r-- 1 1000 1000 193 Aug 26 09:21 nexus-maven-repository-index.properties
-rw-r--r-- 1 1000 1000 32 Aug 26 09:21 nexus-maven-repository-index.properties.md5
-rw-r--r-- 1 1000 1000 40 Aug 26 09:21 nexus-maven-repository-index.properties.sha1
* Will read files from [/work/nexus-maven-repository-index.gz].
* Found file [/work/nexus-maven-repository-index.gz].
* Found indexer [/opt/indexer-cli-6.0.0.jar].
* Found clue [/opt/clue-6.2.0-1.0.0.jar].
* Java version:.
openjdk version "11.0.11" 2021-04-20
OpenJDK Runtime Environment AdoptOpenJDK-11.0.11+9 (build 11.0.11+9)
OpenJDK 64-Bit Server VM AdoptOpenJDK-11.0.11+9 (build 11.0.11+9, mixed mode)
#############################
Unpacking [/work/nexus-maven-repository-index.gz] to /work/indexes
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
Index Folder: /work
Output Folder: /work/indexes
Total time: 0 sec
Final memory: 41M/1004M
Unpacking finished on 2021-08-27 06:32:23.
#############################
Exporting indexes /work/indexes to /work/export
no configuration found, using default configuration
Analyzer: class org.apache.lucene.analysis.standard.StandardAnalyzer
Query Builder: class com.senseidb.clue.api.DefaultQueryBuilder
Directory Builder: class com.senseidb.clue.api.DefaultDirectoryBuilder
IndexReader Factory: class com.senseidb.clue.api.DefaultIndexReaderFactory
Term Bytesref Display: class com.senseidb.clue.api.StringBytesRefDisplay
Payload Bytesref Display: class com.senseidb.clue.api.RawBytesRefDisplay
exporting index to text
Exporting finished on 2021-08-27 06:32:23.
#############################
Cleaning useless files.
Size before cleaning:
32.0K /work/export
28.0K /work/indexes
4.0K /work/nexus-maven-repository-index.1.gz
4.0K /work/nexus-maven-repository-index.1.gz.md5
4.0K /work/nexus-maven-repository-index.1.gz.sha1
4.0K /work/nexus-maven-repository-index.gz
4.0K /work/nexus-maven-repository-index.gz.md5
4.0K /work/nexus-maven-repository-index.gz.sha1
4.0K /work/nexus-maven-repository-index.properties
4.0K /work/nexus-maven-repository-index.properties.md5
4.0K /work/nexus-maven-repository-index.properties.sha1
* Removing useless exports.
Keeping only fld text extract.
Size after cleaning:
8.0K /work/export
28.0K /work/indexes
4.0K /work/nexus-maven-repository-index.1.gz
4.0K /work/nexus-maven-repository-index.1.gz.md5
4.0K /work/nexus-maven-repository-index.1.gz.sha1
4.0K /work/nexus-maven-repository-index.gz
4.0K /work/nexus-maven-repository-index.gz.md5
4.0K /work/nexus-maven-repository-index.gz.sha1
4.0K /work/nexus-maven-repository-index.properties
4.0K /work/nexus-maven-repository-index.properties.md5
4.0K /work/nexus-maven-repository-index.properties.sha1
* Make files modifiable by the end-user.
Docker Script execution finished on 2021-08-27 06:32:23.
-````
+```
The `_1.fld` file contains the fields for each document:
-````
+```
$ head repository_test/export/_1.fld
doc 0
field 0
name u
type string
value al.aldi|sprova4j|0.1.0|sources|jar
field 1
name m
type string
value 1626111735737
field 2
-````
+```
### Building the test repository
The test repository `repository_test` can be rebuilt from the `repository_src`
structure using [indexer-cli](https://search.maven.org/remotecontent?filepath=org/apache/maven/indexer/indexer-cli/6.0.0/indexer-cli-6.0.0.jar)
with the following commands:
-````
+```
$ cd repository_src
$ java -jar ~/Downloads/indexer-cli-6.0.0.jar -i index/ -d repository_test/ -r repo1 -s -c
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
Repository Folder: /home/boris/Projects/maven-index-exporter/repository_src/repo1
Index Folder: /home/boris/Projects/maven-index-exporter/repository_src/index
Output Folder: /home/boris/Projects/maven-index-exporter/repository_src/repository_test
Repository name: index
Indexers: [min, jarContent]
Will create checksum files for all published files (sha1, md5).
Will create incremental chunks for changes, along with baseline file.
Scanning started
Artifacts added: 2
Artifacts deleted: 0
Total time: 1 sec
Final memory: 48M/1012M
$ java -jar ~/Downloads/indexer-cli-6.0.0.jar -i index/ -d repository_test/ -r repo2 -s -c
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
Repository Folder: /home/boris/Projects/maven-index-exporter/repository_src/repo2
Index Folder: /home/boris/Projects/maven-index-exporter/repository_src/index
Output Folder: /home/boris/Projects/maven-index-exporter/repository_src/repository_test
Repository name: index
Indexers: [min, jarContent]
Will create checksum files for all published files (sha1, md5).
Will create incremental chunks for changes, along with baseline file.
Scanning started
Artifacts added: 2
Artifacts deleted: 0
Total time: 0 sec
Final memory: 7M/1012M
+$ java -jar ~/Downloads/indexer-cli-6.0.0.jar -i index/ -d repository_test/ -r repo3 -s -c
+SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
+SLF4J: Defaulting to no-operation (NOP) logger implementation
+SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
+Repository Folder: /home/boris/Projects/maven-index-exporter/repository_src/repo3
+Index Folder: /home/boris/Projects/maven-index-exporter/repository_src/index
+Output Folder: /home/boris/Projects/maven-index-exporter/repository_src/repository_test
+Repository name: index
+Indexers: [min, jarContent]
+Will create checksum files for all published files (sha1, md5).
+Will create incremental chunks for changes, along with baseline file.
+Scanning started
+Artifacts added: 1
+Artifacts deleted: 2
+Total time: 0 sec
+Final memory: 8M/1012M
$
-````
+```
diff --git a/repository_src/repo3/al/aldi/sprova4j/0.1.1/sprova4j-0.1.1-sources.jar b/repository_src/repo3/al/aldi/sprova4j/0.1.1/sprova4j-0.1.1-sources.jar
new file mode 100644
index 0000000..06fbedb
Binary files /dev/null and b/repository_src/repo3/al/aldi/sprova4j/0.1.1/sprova4j-0.1.1-sources.jar differ
diff --git a/repository_src/repo3/al/aldi/sprova4j/0.1.1/sprova4j-0.1.1.pom b/repository_src/repo3/al/aldi/sprova4j/0.1.1/sprova4j-0.1.1.pom
new file mode 100644
index 0000000..156640c
--- /dev/null
+++ b/repository_src/repo3/al/aldi/sprova4j/0.1.1/sprova4j-0.1.1.pom
@@ -0,0 +1,87 @@
+
+
+ 4.0.0
+ al.aldi
+ sprova4j
+ 0.1.1
+ sprova4j
+ Java client for Sprova Test Management
+ https://github.com/aldialimucaj/sprova4j
+ 2018
+
+
+ The Apache Software License, Version 2.0
+ http://www.apache.org/licenses/LICENSE-2.0.txt
+ repo
+
+
+
+
+ aldi
+ Aldi Alimucaj
+ aldi.alimucaj@gmail.com
+
+
+
+ scm:https://github.com/aldialimucaj/sprova4j.git
+ scm:ghttps://github.com/aldialimucaj/sprova4j.git
+ https://github.com/aldialimucaj/sprova4j
+
+
+
+ ch.qos.logback
+ logback-classic
+ 1.2.3
+ runtime
+
+
+ com.google.code.gson
+ gson
+ 2.8.5
+ runtime
+
+
+ com.squareup.okhttp3
+ okhttp
+ 3.10.0
+ runtime
+
+
+ com.squareup.okio
+ okio
+ 1.14.1
+ runtime
+
+
+ org.glassfish
+ javax.json
+ 1.1.2
+ runtime
+
+
+ javax.json
+ javax.json-api
+ 1.1.2
+ runtime
+
+
+ javax.validation
+ validation-api
+ 2.0.1.Final
+ runtime
+
+
+ junit
+ junit
+ 4.12
+ test
+
+
+ com.squareup.okhttp3
+ mockwebserver
+ 3.10.0
+ test
+
+
+
diff --git a/repository_src/repo3/com/jolira/wicket-guicier-parent/2.0.12/wicket-guicier-parent-2.0.12.pom b/repository_src/repo3/com/jolira/wicket-guicier-parent/2.0.12/wicket-guicier-parent-2.0.12.pom
new file mode 100644
index 0000000..6e32078
--- /dev/null
+++ b/repository_src/repo3/com/jolira/wicket-guicier-parent/2.0.12/wicket-guicier-parent-2.0.12.pom
@@ -0,0 +1,90 @@
+
+
+
+ superpom
+ com.jolira
+ 1.1.4
+
+ 4.0.0
+ wicket-guicier-parent
+ pom
+ 2.0.12
+ Wicket Guicier Parent
+ A resplacement for wicket-guice that uses constructor injection as an alternative to the excessive use of PageParameters.
+
+
+ The Apache Software License, Version 2.0
+ http://www.apache.org/licenses/LICENSE-2.0.txt
+ repo
+
+
+
+ 1.4.17
+ 3.0
+ 7.3.1.v20110307
+ 4.8.1
+ 1.6.1
+
+
+ guicier
+ demo
+
+
+
+
+ false
+ src/main/resources
+
+
+ false
+ src/main/java
+
+ **
+
+
+ **/*.java
+
+
+
+
+
+ false
+ src/test/java
+
+ **
+
+
+ **/*.java
+
+
+
+
+
+ true
+ org.apache.maven.plugins
+ maven-compiler-plugin
+ 2.3
+
+ 1.6
+ 1.6
+ true
+ true
+
+
+
+ org.apache.maven.plugins
+ maven-eclipse-plugin
+ 2.8
+
+ true
+
+
+
+
+
+ scm:git:git@github.com:jolira/wicket-guicier.git
+ scm:git:git@github.com:jolira/wicket-guicier.git
+ https://github.com/jolira/wicket-guicier
+
+
+
\ No newline at end of file
diff --git a/repository_test/nexus-maven-repository-index.1.gz b/repository_test/nexus-maven-repository-index.1.gz
index 095be4c..1b622e0 100644
Binary files a/repository_test/nexus-maven-repository-index.1.gz and b/repository_test/nexus-maven-repository-index.1.gz differ
diff --git a/repository_test/nexus-maven-repository-index.1.gz.md5 b/repository_test/nexus-maven-repository-index.1.gz.md5
index 1b423ec..56f3983 100644
--- a/repository_test/nexus-maven-repository-index.1.gz.md5
+++ b/repository_test/nexus-maven-repository-index.1.gz.md5
@@ -1 +1 @@
-68a4cb78e7427b4800dc75b457f58891
\ No newline at end of file
+e4addbbdc7ff603aab2e9fe8c83b60fb
\ No newline at end of file
diff --git a/repository_test/nexus-maven-repository-index.1.gz.sha1 b/repository_test/nexus-maven-repository-index.1.gz.sha1
index 473ca66..3f2eef2 100644
--- a/repository_test/nexus-maven-repository-index.1.gz.sha1
+++ b/repository_test/nexus-maven-repository-index.1.gz.sha1
@@ -1 +1 @@
-69178881fe2d14af5450599a45f3280b68d14f39
\ No newline at end of file
+a3d906a70df633d3c0eb84504838bf04ce31b3a9
\ No newline at end of file
diff --git a/repository_test/nexus-maven-repository-index.2.gz b/repository_test/nexus-maven-repository-index.2.gz
new file mode 100644
index 0000000..1f65f4a
Binary files /dev/null and b/repository_test/nexus-maven-repository-index.2.gz differ
diff --git a/repository_test/nexus-maven-repository-index.2.gz.md5 b/repository_test/nexus-maven-repository-index.2.gz.md5
new file mode 100644
index 0000000..ef0cf04
--- /dev/null
+++ b/repository_test/nexus-maven-repository-index.2.gz.md5
@@ -0,0 +1 @@
+2845c19dbeb673ccba9b0da44bed0cb1
\ No newline at end of file
diff --git a/repository_test/nexus-maven-repository-index.2.gz.sha1 b/repository_test/nexus-maven-repository-index.2.gz.sha1
new file mode 100644
index 0000000..8642f11
--- /dev/null
+++ b/repository_test/nexus-maven-repository-index.2.gz.sha1
@@ -0,0 +1 @@
+7ac2c6f1541541fedfcd97d5c21b2fe2540682c4
\ No newline at end of file
diff --git a/repository_test/nexus-maven-repository-index.gz b/repository_test/nexus-maven-repository-index.gz
index 0dbdc65..e37c312 100644
Binary files a/repository_test/nexus-maven-repository-index.gz and b/repository_test/nexus-maven-repository-index.gz differ
diff --git a/repository_test/nexus-maven-repository-index.gz.md5 b/repository_test/nexus-maven-repository-index.gz.md5
index f2ac466..1db9dbf 100644
--- a/repository_test/nexus-maven-repository-index.gz.md5
+++ b/repository_test/nexus-maven-repository-index.gz.md5
@@ -1 +1 @@
-194d598403fd6794ced6e250cc1cb286
\ No newline at end of file
+44b865f0fa3840f35159ba83635cf4e8
\ No newline at end of file
diff --git a/repository_test/nexus-maven-repository-index.gz.sha1 b/repository_test/nexus-maven-repository-index.gz.sha1
index f902328..0818cdc 100644
--- a/repository_test/nexus-maven-repository-index.gz.sha1
+++ b/repository_test/nexus-maven-repository-index.gz.sha1
@@ -1 +1 @@
-00447127acc6191ec26f69df74a3518b2467f9ee
\ No newline at end of file
+3e17b9dfaa911e31bb5283986a3118b13fbd0448
\ No newline at end of file
diff --git a/repository_test/nexus-maven-repository-index.properties b/repository_test/nexus-maven-repository-index.properties
index 929f56c..605fdad 100644
--- a/repository_test/nexus-maven-repository-index.properties
+++ b/repository_test/nexus-maven-repository-index.properties
@@ -1,6 +1,7 @@
-#Mon Jul 12 19:43:05 CEST 2021
-nexus.index.last-incremental=1
+#Tue Nov 02 21:10:23 CET 2021
+nexus.index.last-incremental=2
nexus.index.id=index
-nexus.index.incremental-0=1
-nexus.index.chain-id=1626111736087
-nexus.index.timestamp=20210712174305.018 +0000
+nexus.index.incremental-0=2
+nexus.index.incremental-1=1
+nexus.index.chain-id=1635883794695
+nexus.index.timestamp=20211102201023.455 +0000
diff --git a/repository_test/nexus-maven-repository-index.properties.md5 b/repository_test/nexus-maven-repository-index.properties.md5
index 439bec1..1659873 100644
--- a/repository_test/nexus-maven-repository-index.properties.md5
+++ b/repository_test/nexus-maven-repository-index.properties.md5
@@ -1 +1 @@
-2679daac618257c4fb51cd506e55d233
\ No newline at end of file
+1dca63be693bf9142c2373052a068355
\ No newline at end of file
diff --git a/repository_test/nexus-maven-repository-index.properties.sha1 b/repository_test/nexus-maven-repository-index.properties.sha1
index 92e39cc..ced9e8a 100644
--- a/repository_test/nexus-maven-repository-index.properties.sha1
+++ b/repository_test/nexus-maven-repository-index.properties.sha1
@@ -1 +1 @@
-0c75b1a7e3f58f0adde44a3c2ade04515a4a00b1
\ No newline at end of file
+64224d5d5aa9167e52c0c3022f0dca3b8d6cc2b9
\ No newline at end of file