Adapt run_full_export according to swh cli conventions
Closed, Public

Authored by ardumont on Mar 22 2022, 6:03 PM.

Details

Summary

This adapts the existing script to:

  • use click, which auto-documents the CLI
  • add default values for the less important parameters
  • switch to logging instead of print statements
  • allow providing another image (defaults to bbaldassari/maven-index-exporter).

This also adapts the documentation about the script accordingly.
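
For readers unfamiliar with the pattern, here is a minimal sketch of what such a click entry point with logging could look like. It is not the actual content of run_full_export.py: only --base-url and --docker-image appear in the test plan below; the --work-dir and --publish-dir option names, and the choice to make --base-url required, are assumptions made for illustration.

```python
import logging

import click

logger = logging.getLogger(__name__)


@click.command()
@click.option("--base-url", required=True,
              help="Base URL of the Maven repository to export.")
# The two option names below are assumed; only their default paths come
# from the log output of the test plan.
@click.option("--work-dir", default="/tmp/maven-index-exporter/",
              show_default=True, help="Directory holding the downloaded indexes.")
@click.option("--publish-dir", default="/tmp/maven-index-exporter/publish/",
              show_default=True, help="Directory receiving the final export.")
@click.option("--docker-image", default="bbaldassari/maven-index-exporter",
              show_default=True, help="Docker image used to run the export.")
def main(base_url, work_dir, publish_dir, docker_image):
    """Log the effective configuration of an export run (sketch only)."""
    logger.info("* URL: %s", base_url)
    logger.info("* Working directory: %s", work_dir)
    logger.info("* Publish directory: %s", publish_dir)
    logger.info("* Docker image: %s", docker_image)
```

When run as a script, one would typically call logging.basicConfig(level=logging.INFO) and then main(); click generates the --help text from the option declarations.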

Related to T3746

Test Plan
  • scripts/test_docker_image.sh is happy
  • actual run of the script is happy too:
$ cd docker/
# build the image
$ docker build -f Dockerfile -t $USER/maven-index-exporter .
Sending build context to Docker daemon  23.55kB
Step 1/8 : FROM adoptopenjdk/openjdk11:alpine-jre
 ---> b9a979a572aa
Step 2/8 : ADD https://github.com/javasoze/clue/releases/download/release-6.2.0-1.0.0/clue-6.2.0-1.0.0.jar /opt/
Downloading [==================================================>]     18MB/18MB

 ---> Using cache
 ---> 9e3136d449b6
Step 3/8 : ADD https://repo1.maven.org/maven2/org/apache/maven/indexer/indexer-cli/6.0.0/indexer-cli-6.0.0.jar /opt/
Downloading [==================================================>]  14.91MB/14.91MB

 ---> Using cache
 ---> 5d0e575fb7bd
Step 4/8 : COPY extract_indexes.sh /opt/
 ---> Using cache
 ---> 777ca2fa6853
Step 5/8 : WORKDIR /work/
 ---> Using cache
 ---> 8e291c569bd1
Step 6/8 : RUN ls /opt/
 ---> Using cache
 ---> ef435da9603e
Step 7/8 : RUN ls -R /work/
 ---> Using cache
 ---> 5146a6df8a47
Step 8/8 : CMD ["sh", "/opt/extract_indexes.sh", "/work/nexus-maven-repository-index.gz"]
 ---> Using cache
 ---> 40af3ac1add7
Successfully built 40af3ac1add7
Successfully tagged tony/maven-index-exporter:latest
$ cd ../scripts
$ python3 run_full_export.py --base-url https://repo.maven.apache.org/maven2/ --docker-image $USER/maven-index-exporter
INFO:__main__:Script: run_full_export
INFO:__main__:Timestamp: 2022-03-22 18:00:52
INFO:__main__:* URL: https://repo.maven.apache.org/maven2/
INFO:__main__:* Working directory: /tmp/maven-index-exporter/
INFO:__main__:* Publish directory: /tmp/maven-index-exporter/publish/
INFO:__main__:Work_Dir /tmp/maven-index-exporter/ exists. Reusing it.
INFO:__main__:Downloading all required indexes
INFO:__main__:  - Downloading /tmp/maven-index-exporter/nexus-maven-repository-index.properties.
INFO:__main__:  - File /tmp/maven-index-exporter/nexus-maven-repository-index.732.gz exists, skipping download.
INFO:__main__:  - File /tmp/maven-index-exporter/nexus-maven-repository-index.733.gz exists, skipping download.
INFO:__main__:  - File /tmp/maven-index-exporter/nexus-maven-repository-index.734.gz exists, skipping download.
INFO:__main__:  - File /tmp/maven-index-exporter/nexus-maven-repository-index.735.gz exists, skipping download.
INFO:__main__:  - File /tmp/maven-index-exporter/nexus-maven-repository-index.736.gz exists, skipping download.
INFO:__main__:  - File /tmp/maven-index-exporter/nexus-maven-repository-index.737.gz exists, skipping download.
INFO:__main__:  - File /tmp/maven-index-exporter/nexus-maven-repository-index.738.gz exists, skipping download.
INFO:__main__:  - File /tmp/maven-index-exporter/nexus-maven-repository-index.742.gz exists, skipping download.
INFO:__main__:  - File /tmp/maven-index-exporter/nexus-maven-repository-index.739.gz exists, skipping download.
INFO:__main__:  - File /tmp/maven-index-exporter/nexus-maven-repository-index.743.gz exists, skipping download.
INFO:__main__:  - File /tmp/maven-index-exporter/nexus-maven-repository-index.740.gz exists, skipping download.
INFO:__main__:  - File /tmp/maven-index-exporter/nexus-maven-repository-index.744.gz exists, skipping download.
INFO:__main__:  - File /tmp/maven-index-exporter/nexus-maven-repository-index.741.gz exists, skipping download.
INFO:__main__:  - File /tmp/maven-index-exporter/nexus-maven-repository-index.745.gz exists, skipping download.
INFO:__main__:  - File /tmp/maven-index-exporter/nexus-maven-repository-index.746.gz exists, skipping download.
INFO:__main__:  - File /tmp/maven-index-exporter/nexus-maven-repository-index.747.gz exists, skipping download.
INFO:__main__:  - File /tmp/maven-index-exporter/nexus-maven-repository-index.748.gz exists, skipping download.
INFO:__main__:  - File /tmp/maven-index-exporter/nexus-maven-repository-index.749.gz exists, skipping download.
INFO:__main__:  - File /tmp/maven-index-exporter/nexus-maven-repository-index.750.gz exists, skipping download.
INFO:__main__:  - File /tmp/maven-index-exporter/nexus-maven-repository-index.751.gz exists, skipping download.
INFO:__main__:  - File /tmp/maven-index-exporter/nexus-maven-repository-index.722.gz exists, skipping download.
INFO:__main__:  - File /tmp/maven-index-exporter/nexus-maven-repository-index.723.gz exists, skipping download.
INFO:__main__:  - File /tmp/maven-index-exporter/nexus-maven-repository-index.724.gz exists, skipping download.
INFO:__main__:  - File /tmp/maven-index-exporter/nexus-maven-repository-index.725.gz exists, skipping download.
INFO:__main__:  - File /tmp/maven-index-exporter/nexus-maven-repository-index.726.gz exists, skipping download.
INFO:__main__:  - File /tmp/maven-index-exporter/nexus-maven-repository-index.727.gz exists, skipping download.
INFO:__main__:  - File /tmp/maven-index-exporter/nexus-maven-repository-index.728.gz exists, skipping download.
INFO:__main__:  - File /tmp/maven-index-exporter/nexus-maven-repository-index.729.gz exists, skipping download.
INFO:__main__:  - File /tmp/maven-index-exporter/nexus-maven-repository-index.730.gz exists, skipping download.
INFO:__main__:  - File /tmp/maven-index-exporter/nexus-maven-repository-index.731.gz exists, skipping download.
INFO:__main__:  - File /tmp/maven-index-exporter/nexus-maven-repository-index.gz exists, skipping download.
INFO:__main__:Docker: Found image <Image: 'maven-index-exporter:latest', 'tony/maven-index-exporter:latest'> locally, ID is sha256:40af3ac1add75d24738839ccc1971f338192261927cc287460eb6892a8812092.
INFO:__main__:Docker log:
Docker Script started on 2022-03-22 17:00:52.
# Checks..
* Content of /opt:
total 30469
-rw-------    1 root     root      18000742 Dec  8 07:19 clue-6.2.0-1.0.0.jar
-rw-r--r--    1 root     root          2830 Feb 17 13:56 extract_indexes.sh
-rw-------    1 root     root      14914610 Nov 28  2017 indexer-cli-6.0.0.jar
drwxr-xr-x    3 root     root             3 Mar 20 19:38 java
* Content of /work:
total 1671296
drwxrwxrwx    2 root     root          4096 Mar 22 16:33 export
drwxrwxrwx    2 root     root         12288 Mar 22 16:11 indexes
-rw-r--r--    1 1000     1000       5097757 Mar 22 15:52 nexus-maven-repository-index.722.gz
-rw-r--r--    1 1000     1000       1836882 Mar 22 15:52 nexus-maven-repository-index.723.gz
-rw-r--r--    1 1000     1000       3870944 Mar 22 15:53 nexus-maven-repository-index.724.gz
-rw-r--r--    1 1000     1000       6913907 Mar 22 15:53 nexus-maven-repository-index.725.gz
-rw-r--r--    1 1000     1000       6706671 Mar 22 15:53 nexus-maven-repository-index.726.gz
-rw-r--r--    1 1000     1000       8135404 Mar 22 15:53 nexus-maven-repository-index.727.gz
-rw-r--r--    1 1000     1000      10113194 Mar 22 15:53 nexus-maven-repository-index.728.gz
-rw-r--r--    1 1000     1000       9004362 Mar 22 15:53 nexus-maven-repository-index.729.gz
-rw-r--r--    1 1000     1000       8548614 Mar 22 15:53 nexus-maven-repository-index.730.gz
-rw-r--r--    1 1000     1000       6347214 Mar 22 15:53 nexus-maven-repository-index.731.gz
-rw-r--r--    1 1000     1000       6820245 Mar 22 15:52 nexus-maven-repository-index.732.gz
-rw-r--r--    1 1000     1000      12821159 Mar 22 15:52 nexus-maven-repository-index.733.gz
-rw-r--r--    1 1000     1000       7003185 Mar 22 15:52 nexus-maven-repository-index.734.gz
-rw-r--r--    1 1000     1000       2413908 Mar 22 15:52 nexus-maven-repository-index.735.gz
-rw-r--r--    1 1000     1000       6380653 Mar 22 15:52 nexus-maven-repository-index.736.gz
-rw-r--r--    1 1000     1000      14646697 Mar 22 15:52 nexus-maven-repository-index.737.gz
-rw-r--r--    1 1000     1000      13275279 Mar 22 15:52 nexus-maven-repository-index.738.gz
-rw-r--r--    1 1000     1000       2210698 Mar 22 15:52 nexus-maven-repository-index.739.gz
-rw-r--r--    1 1000     1000      14045180 Mar 22 15:52 nexus-maven-repository-index.740.gz
-rw-r--r--    1 1000     1000       7083099 Mar 22 15:52 nexus-maven-repository-index.741.gz
-rw-r--r--    1 1000     1000      11225591 Mar 22 15:52 nexus-maven-repository-index.742.gz
-rw-r--r--    1 1000     1000       1419693 Mar 22 15:52 nexus-maven-repository-index.743.gz
-rw-r--r--    1 1000     1000       4036562 Mar 22 15:52 nexus-maven-repository-index.744.gz
-rw-r--r--    1 1000     1000       5782895 Mar 22 15:52 nexus-maven-repository-index.745.gz
-rw-r--r--    1 1000     1000       7869621 Mar 22 15:52 nexus-maven-repository-index.746.gz
-rw-r--r--    1 1000     1000       6477544 Mar 22 15:52 nexus-maven-repository-index.747.gz
-rw-r--r--    1 1000     1000       6774157 Mar 22 15:52 nexus-maven-repository-index.748.gz
-rw-r--r--    1 1000     1000       9752927 Mar 22 15:52 nexus-maven-repository-index.749.gz
-rw-r--r--    1 1000     1000      11490402 Mar 22 15:52 nexus-maven-repository-index.750.gz
-rw-r--r--    1 1000     1000      10019047 Mar 22 15:52 nexus-maven-repository-index.751.gz
-rw-r--r--    1 1000     1000     1483183512 Mar 22 15:55 nexus-maven-repository-index.gz
-rw-r--r--    1 1000     1000          1130 Mar 22 17:00 nexus-maven-repository-index.properties
drwxr-xr-x    2 1000     1000          4096 Mar 22 16:52 publish
* Will read files from [/work/nexus-maven-repository-index.gz].
*   Found file [/work/nexus-maven-repository-index.gz].
*   Found indexer [/opt/indexer-cli-6.0.0.jar].
*   Found clue [/opt/clue-6.2.0-1.0.0.jar].
* Java version:.
openjdk version "11.0.14.1" 2022-02-08
OpenJDK Runtime Environment Temurin-11.0.14.1+1 (build 11.0.14.1+1)
OpenJDK 64-Bit Server VM Temurin-11.0.14.1+1 (build 11.0.14.1+1, mixed mode)
#############################
Found /work/indexes, skipping index generation.
6.6G    /work/indexes/
Unpacking finished on 2022-03-22 17:00:52.
#############################
Found /work/export, skipping index export.
total 17G
-rwxrwxrwx    1 root     root       16.7G Mar 22 16:18 _n.fld
-rwxrwxrwx    1 root     root           0 Mar 22 16:11 write.lock
Exporting finished on 2022-03-22 17:00:52.
#############################
Cleaning useless files.
Size before cleaning:
16.7G   /work/export
6.6G    /work/indexes
4.9M    /work/nexus-maven-repository-index.722.gz
1.8M    /work/nexus-maven-repository-index.723.gz
3.7M    /work/nexus-maven-repository-index.724.gz
6.6M    /work/nexus-maven-repository-index.725.gz
6.4M    /work/nexus-maven-repository-index.726.gz
7.8M    /work/nexus-maven-repository-index.727.gz
9.6M    /work/nexus-maven-repository-index.728.gz
8.6M    /work/nexus-maven-repository-index.729.gz
8.2M    /work/nexus-maven-repository-index.730.gz
6.1M    /work/nexus-maven-repository-index.731.gz
6.5M    /work/nexus-maven-repository-index.732.gz
12.2M   /work/nexus-maven-repository-index.733.gz
6.7M    /work/nexus-maven-repository-index.734.gz
2.3M    /work/nexus-maven-repository-index.735.gz
6.1M    /work/nexus-maven-repository-index.736.gz
14.0M   /work/nexus-maven-repository-index.737.gz
12.7M   /work/nexus-maven-repository-index.738.gz
2.1M    /work/nexus-maven-repository-index.739.gz
13.4M   /work/nexus-maven-repository-index.740.gz
6.8M    /work/nexus-maven-repository-index.741.gz
10.7M   /work/nexus-maven-repository-index.742.gz
1.4M    /work/nexus-maven-repository-index.743.gz
3.9M    /work/nexus-maven-repository-index.744.gz
5.5M    /work/nexus-maven-repository-index.745.gz
7.5M    /work/nexus-maven-repository-index.746.gz
6.2M    /work/nexus-maven-repository-index.747.gz
6.5M    /work/nexus-maven-repository-index.748.gz
9.3M    /work/nexus-maven-repository-index.749.gz
11.0M   /work/nexus-maven-repository-index.750.gz
9.6M    /work/nexus-maven-repository-index.751.gz
1.4G    /work/nexus-maven-repository-index.gz
4.0K    /work/nexus-maven-repository-index.properties
16.7G   /work/publish
* Removing useless exports.
  Keeping only fld text extract.
  Size after cleaning:
16.7G   /work/export
6.6G    /work/indexes
4.9M    /work/nexus-maven-repository-index.722.gz
1.8M    /work/nexus-maven-repository-index.723.gz
3.7M    /work/nexus-maven-repository-index.724.gz
6.6M    /work/nexus-maven-repository-index.725.gz
6.4M    /work/nexus-maven-repository-index.726.gz
7.8M    /work/nexus-maven-repository-index.727.gz
9.6M    /work/nexus-maven-repository-index.728.gz
8.6M    /work/nexus-maven-repository-index.729.gz
8.2M    /work/nexus-maven-repository-index.730.gz
6.1M    /work/nexus-maven-repository-index.731.gz
6.5M    /work/nexus-maven-repository-index.732.gz
12.2M   /work/nexus-maven-repository-index.733.gz
6.7M    /work/nexus-maven-repository-index.734.gz
2.3M    /work/nexus-maven-repository-index.735.gz
6.1M    /work/nexus-maven-repository-index.736.gz
14.0M   /work/nexus-maven-repository-index.737.gz
12.7M   /work/nexus-maven-repository-index.738.gz
2.1M    /work/nexus-maven-repository-index.739.gz
13.4M   /work/nexus-maven-repository-index.740.gz
6.8M    /work/nexus-maven-repository-index.741.gz
10.7M   /work/nexus-maven-repository-index.742.gz
1.4M    /work/nexus-maven-repository-index.743.gz
3.9M    /work/nexus-maven-repository-index.744.gz
5.5M    /work/nexus-maven-repository-index.745.gz
7.5M    /work/nexus-maven-repository-index.746.gz
6.2M    /work/nexus-maven-repository-index.747.gz
6.5M    /work/nexus-maven-repository-index.748.gz
9.3M    /work/nexus-maven-repository-index.749.gz
11.0M   /work/nexus-maven-repository-index.750.gz
9.6M    /work/nexus-maven-repository-index.751.gz
1.4G    /work/nexus-maven-repository-index.gz
4.0K    /work/nexus-maven-repository-index.properties
16.7G   /work/publish
* Make files modifiable by the end-user.
Docker Script execution finished on 2022-03-22 17:00:52.

INFO:__main__:Export directory has the following files:
INFO:__main__:  - write.lock size 0
INFO:__main__:  - _n.fld size 17982862850
INFO:__main__:Found fld file: _n.fld
INFO:__main__:Copying files to /tmp/maven-index-exporter/publish/export.fld.
INFO:__main__:Script finished on 2022-03-22 18:01:12

# at the end of it all, the export.fld file exists with the massaged data
$ head -20 /tmp/maven-index-exporter/publish/export.fld
doc 0
  field 0
    name u
    type string
    value com.redhat.rhevm.api|rhevm-api-powershell-jaxrs|1.0-rc1.16|javadoc|jar
  field 1
    name m
    type string
    value 1321264789727
  field 2
    name i
    type string
    value jar|1320743675000|768291|2|2|1|jar
  field 10
    name n
    type string
    value RHEV-M API Powershell Wrapper Implementation JAX-RS
  field 13
    name 1
    type string
...
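
The layout of export.fld shown above (doc, then field, then name/type/value lines) can be read back with a few lines of Python. This is only a sketch based on the excerpt: parse_fld is a hypothetical helper, not part of the repository, and it assumes every value fits on a single line.

```python
def parse_fld(lines):
    """Yield one {field name: value} dict per `doc` entry."""
    doc = None
    name = None
    for raw in lines:
        # Indentation is not needed to tell the line kinds apart.
        key, _, rest = raw.strip().partition(" ")
        if key == "doc":
            if doc is not None:
                yield doc
            doc, name = {}, None
        elif doc is None:
            continue  # ignore anything before the first doc
        elif key == "field":
            name = None  # a new field starts, forget the previous name
        elif key == "name":
            name = rest
        elif key == "value" and name is not None:
            doc[name] = rest
    if doc is not None:
        yield doc
```

Being a generator, it can be fed the open file object directly, which matters for a file of this size.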

Diff Detail

Repository
rDLSMAVEXP maven-index-exporter
Branch
main
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 27750
Build 43438: arc lint + arc unit

Event Timeline

ardumont edited the test plan for this revision.

Other more general points, not directly related to this diff:

  • the image should be built from adoptopenjdk/openjdk11:debian-jre or any debian-based image
  • The user used in the container seems to be root, doesn't that cause permission issues in the temporary directory?
docs/run_maven_index_exporter.md
15

Why not use only maven-index-exporter as the image name?

The namespace is only necessary to push to Docker Hub.
It would allow using maven-index-exporter as the default value for the docker-image parameter of the python script.

scripts/run_full_export.py
35–36

possible improvement: add a --force-pull option to force a refresh of the image

49

Is there a reason for the third parameter to differ from the CMD line of the Dockerfile?
If they were unified, the command parameter could be removed here.

Other more general points, not directly related to this diff:

  • the image should be built from adoptopenjdk/openjdk11:debian-jre or any debian-based image

right, thx.

  • The user used in the container seems to be root, doesn't that cause permission issues in the temporary directory?

You mean in the mounted volume on the host?
It seems it's 777 so it's kind of fine...

# cut to 5 lines, remaining part is actually owned by my $USER
$ ls -lah /tmp/maven-index-exporter | head -5
total 1.6G
drwxr-xr-x  5 tony tony 4.0K Mar 22 17:52 .
drwxrwxrwt 45 root root  84K Mar 23 09:45 ..
drwxrwxrwx  2 root root 4.0K Mar 22 17:33 export
drwxrwxrwx  2 root root  12K Mar 22 17:11 indexes
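
The same check can be scripted rather than read off the ls output. A minimal sketch using only the standard library (describe_mode is a hypothetical helper name, not something from the repository):

```python
import os
import stat


def describe_mode(path):
    """Return (ls-style mode string, world-writable?) for path."""
    mode = os.stat(path).st_mode
    return stat.filemode(mode), bool(mode & stat.S_IWOTH)
```

For a directory created by root with mode 777, this would report it as world-writable, which is why the host user can still write into it.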
ardumont added inline comments.
docs/run_maven_index_exporter.md
15

Thanks!

scripts/run_full_export.py
35–36

Good idea, but that does not seem to be an option with the API used.

[1]

-> myimage = client.images.pull(repository=docker_image, force=True)
(Pdb++) client.images.pull?
Type:           method
String Form:    <bound method ImageCollection.pull of <docker.models.images.ImageCollection object at 0x7f664bc89fd0>>
File:           /home/tony/.virtualenvs/swh/lib/python3.9/site-packages/docker/models/images.py
Definition:     client.images.pull(repository, tag=None, all_tags=False, **kwargs)
Docstring:
        Pull an image of the given name and return it. Similar to the
        ``docker pull`` command.
        If ``tag`` is ``None`` or empty, it is set to ``latest``.
        If ``all_tags`` is set, the ``tag`` parameter is ignored and all image
        tags will be pulled.

        If you want to get the raw pull output, use the
        :py:meth:`~docker.api.image.ImageApiMixin.pull` method in the
        low-level API.

        Args:
            repository (str): The repository to pull
            tag (str): The tag to pull
            auth_config (dict): Override the credentials that are found in the
                config for this request.  ``auth_config`` should contain the
                ``username`` and ``password`` keys to be valid.
            platform (str): Platform in the format ``os[/arch[/variant]]``
            all_tags (bool): Pull all image tags

        Returns:
            (:py:class:`Image` or list): The image that has been pulled.
                If ``all_tags`` is True, the method will return a list
                of :py:class:`Image` objects belonging to this repository.

        Raises:
            :py:class:`docker.errors.APIError`
                If the server returns an error.

        Example:

            >>> # Pull the image tagged `latest` in the busybox repo
            >>> image = client.images.pull('busybox')

            >>> # Pull all tags in the busybox repo
            >>> images = client.images.pull('busybox', all_tags=True)
ardumont added inline comments.
scripts/run_full_export.py
49

Nope, plus the script is actually not parametrized at all, /work is hardcoded in there.

Adapt according to feedback and simplify the image listing computation

Drop force_pull=True which does not work (as per my comment in the diff).
It was a failed attempt i forgot to drop.

A couple of non-blocking remarks inline

docs/run_maven_index_exporter.md
16

Isn't the -t maven-index-exporter option needed to have the image named?
(and the -f Dockerfile isn't ;) )

scripts/run_full_export.py
35–36

I meant adding an option to the python script to force the call of this method, but it's just a possible improvement and not a blocker.

49

WDYT about changing the CMD in the Dockerfile to

CMD ["sh", "/opt/extract_indexes.sh"]

and removing line 49 completely?

This revision is now accepted and ready to land. Mar 23 2022, 4:25 PM
docs/run_maven_index_exporter.md
16

i tested that as is and it was fine....

since you asked, i double checked and i misanalyzed that part indeed (got hit by too efficient docker cache stuff ;).
So let's revert that part to keep the -t indeed.

Thanks.

[1]

$ docker image ls | grep maven-index-exporter
tony/maven-index-exporter          latest           40af3ac1add7   39 hours ago    181MB
bbaldassari/maven-index-exporter   latest           b592927dc5fa   7 months ago    181MB
$ docker build -f Dockerfile -t maven-index-exporter .
Sending build context to Docker daemon  23.55kB
Step 1/8 : FROM adoptopenjdk/openjdk11:debian-jre
 ---> e4102b823981
Step 2/8 : ADD https://github.com/javasoze/clue/releases/download/release-6.2.0-1.0.0/clue-6.2.0-1.0.0.jar /opt/
Downloading [==================================================>]     18MB/18MB

 ---> Using cache
 ---> 820824f2487e
Step 3/8 : ADD https://repo1.maven.org/maven2/org/apache/maven/indexer/indexer-cli/6.0.0/indexer-cli-6.0.0.jar /opt/
Downloading [==================================================>]  14.91MB/14.91MB

 ---> Using cache
 ---> 6f0db2290ab1
Step 4/8 : COPY extract_indexes.sh /opt/
 ---> Using cache
 ---> d34c1eb442ae
Step 5/8 : WORKDIR /work/
 ---> Using cache
 ---> 790502fc827a
Step 6/8 : RUN ls /opt/
 ---> Using cache
 ---> 56562bed71b9
Step 7/8 : RUN ls -R /work/
 ---> Using cache
 ---> 4a093a97da2e
Step 8/8 : CMD ["sh", "/opt/extract_indexes.sh", "/work/nexus-maven-repository-index.gz"]
 ---> Using cache
 ---> ffbefb8bdf05
Successfully built ffbefb8bdf05
Successfully tagged maven-index-exporter:latest
$ docker image ls | grep maven-index-exporter
maven-index-exporter               latest           ffbefb8bdf05   4 minutes ago   318MB
tony/maven-index-exporter          latest           40af3ac1add7   39 hours ago    181MB
bbaldassari/maven-index-exporter   latest           b592927dc5fa   7 months ago    181MB
scripts/run_full_export.py
35–36

i see now, thx.

49

I think i agree but i'll do that in another diff first.
Digging in that direction currently confuses me ;)

  • docker/Dockerfile: Switch base image to openjdk11:debian-jre
  • docs/maven-repositories.md: Fix typos
  • Add --docker-image-update flag to ease update
  • run_full_export.py: Fix some more logging instructions
  • Drop unnecessary instruction in the python script
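
One way an update flag such as --docker-image-update could be wired to the Docker SDK is sketched below. This is an assumption about the approach, not the code that landed: get_image is a hypothetical helper, and client is expected to behave like the object returned by docker.from_env().

```python
def get_image(client, name, update=False):
    """Return image `name`, pulling it when missing or when `update` is set."""
    if not update:
        # Reuse a local copy when one exists, as in the
        # "Docker: Found image ... locally" log line of the test plan.
        local = client.images.list(name=name)
        if local:
            return local[0]
    # Per the pull() docstring quoted earlier, a missing tag means `latest`.
    return client.images.pull(name)
```

Passing the client in keeps the function easy to exercise without a running Docker daemon.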
ardumont added inline comments.
scripts/run_full_export.py
49

adapted here in the end (in another commit though).