diff --git a/dockerfiles/Dockerfile b/dockerfiles/Dockerfile
--- a/dockerfiles/Dockerfile
+++ b/dockerfiles/Dockerfile
@@ -8,11 +8,8 @@
     -XX:+UseTLAB -XX:+ResizeTLAB \
     -Dlogback.configurationFile=configuration/logback.xml
 
-# Monitoring
-RUN yum install -y time
-
 # Download third party binaries and dependencies
-WORKDIR /swh/graph-lib
+WORKDIR /srv/softwareheritage/graph/lib
 RUN curl -O http://webgraph.di.unimi.it/webgraph-big-3.5.1-bin.tar.gz
 RUN tar xvfz webgraph-big-3.5.1-bin.tar.gz
@@ -26,6 +23,9 @@
 RUN cp law-2.5.2/law-2.5.2.jar .
 
 # Add user files
-WORKDIR /swh/app
+WORKDIR /srv/softwareheritage/graph/app
 COPY configuration configuration/
 COPY scripts scripts/
+
+# Default dir
+WORKDIR /srv/softwareheritage/graph
diff --git a/dockerfiles/scripts/compress_graph.sh b/dockerfiles/scripts/compress_graph.sh
--- a/dockerfiles/scripts/compress_graph.sh
+++ b/dockerfiles/scripts/compress_graph.sh
@@ -1,53 +1,60 @@
 #!/bin/bash
 
 usage() {
-    echo "Usage: --input <graph path> --output <out dir> --lib <lib dir>"
-    echo "  options:"
-    echo "    -t, --tmp <temporary dir> (default to /tmp/)"
-    echo "    --stdout <stdout file> (default to ./stdout)"
-    echo "    --stderr <stderr file> (default to ./stderr)"
-    echo "    --batch-size <batch size> (default to 10^6): WebGraph internals"
+    echo "Usage: compress_graph.sh --lib <lib dir> --input <graph path>"
+    echo "Options:"
+    echo "  -o, --outdir <out dir> (Default: GRAPH_DIR/compressed)"
+    echo "  -t, --tmp <temporary dir> (Default: OUT_DIR/tmp)"
+    echo "  --stdout <stdout file> (Default: OUT_DIR/stdout)"
+    echo "  --stderr <stderr file> (Default: OUT_DIR/stderr)"
+    echo "  --batch-size <batch size> (Default: 10^6): WebGraph internals"
     exit 1
 }
 
 graph_path=""
 out_dir=""
 lib_dir=""
-tmp_dir="/tmp/"
-stdout_file="stdout"
-stderr_file="stderr"
+stdout_file=""
+stderr_file=""
 batch_size=1000000
 
 while (( "$#" )); do
     case "$1" in
-        -i|--input) shift; graph_path=$1;;
-        -o|--output) shift; out_dir=$1;;
-        -l|--lib) shift; lib_dir=$1;;
-        -t|--tmp) shift; tmp_dir=$1;;
-        --stdout) shift; stdout_file=$1;;
-        --stderr) shift; stderr_file=$1;;
-        --batch-size) shift; batch_size=$1;;
-        *) usage;;
+        -i|--input) shift; graph_path=$1 ;;
+        -o|--outdir) shift; out_dir=$1 ;;
+        -l|--lib) shift; lib_dir=$1 ;;
+        -t|--tmp) shift; tmp_dir=$1 ;;
+        --stdout) shift; stdout_file=$1 ;;
+        --stderr) shift; stderr_file=$1 ;;
+        --batch-size) shift; batch_size=$1 ;;
+        *) usage ;;
     esac
     shift
 done
 
-if [[ -z $graph_path || -z $out_dir || -z $lib_dir ]]; then
+if [[ -z "$graph_path" || ! -d "$lib_dir" ]]; then
     usage
 fi
-
-if [[ -f "$stdout_file" || -f "$stderr_file" ]]; then
-    echo "Cannot overwrite previous compression stdout/stderr files"
-    exit 1
+if [ -z "$out_dir" ] ; then
+    out_dir="$(dirname $graph_path)/compressed"
+fi
+if [ -z "$tmp_dir" ] ; then
+    tmp_dir="${out_dir}/tmp"
+fi
+if [ -z "$stdout_file" ] ; then
+    stdout_file="${out_dir}/stdout"
+fi
+if [ -z "$stderr_file" ] ; then
+    stderr_file="${out_dir}/stderr"
 fi
 
 dataset=$(basename $graph_path)
-compr_graph_path="$out_dir/$dataset"
+compr_graph_path="${out_dir}/${dataset}"
 
-mkdir -p $out_dir
-mkdir -p $tmp_dir
+test -d "$out_dir" || mkdir -p "$out_dir"
+test -d "$tmp_dir" || mkdir -p "$tmp_dir"
 
 java_cmd () {
-    /usr/bin/time -v java -cp $lib_dir/'*' $*
+    java -cp $lib_dir/'*' $*
 }
 
 {
@@ -85,7 +92,7 @@
         $batch_size $tmp_dir &&
     java_cmd it.unimi.dsi.big.webgraph.BVGraph \
         --list $compr_graph_path-transposed
-} >> $stdout_file 2>> $stderr_file
+} > $stdout_file 2> $stderr_file
 
 if [[ $? -eq 0 ]]; then
     echo "Graph compression done."
diff --git a/docs/docker.rst b/docs/docker.rst
--- a/docs/docker.rst
+++ b/docs/docker.rst
@@ -1,6 +1,7 @@
 Graph Docker environment
 ========================
 
+
 Build
 -----
 
@@ -10,70 +11,65 @@
    $ cd swh-graph
    $ docker build --tag swh-graph dockerfiles
 
+
 Run
 ---
 
-Given a graph specified by:
+Given a graph ``g`` specified by:
 
 - ``g.edges.csv.gz``: gzip-compressed csv file with one edge per line, as a
-  "SRC_ID SPACE DST_ID" string, where identifiers are the `persistent identifier
-  <https://docs.softwareheritage.org/devel/swh-model/persistent-identifiers.html>`_
-  of each node.
+  "SRC_ID SPACE DST_ID" string, where identifiers are the
+  :ref:`persistent-identifiers` of each node.
 - ``g.nodes.csv.gz``: sorted list of unique node identifiers appearing in the
   corresponding ``g.edges.csv.gz`` file. The format is a gzip-compressed csv
   file with one persistent identifier per line.
 
 .. code:: bash
 
-    $ docker run \
-        --volume /path/to/graph/:/graph \
-        --volume /path/to/output/:/graph/compressed \
-        --name swh-graph --tty --interactive \
-        swh-graph:latest bash
+    $ docker run -ti \
+        --volume /PATH/TO/GRAPH/:/srv/softwareheritage/graph/data \
+        --publish 127.0.0.1:5009:5009 \
+        swh-graph:latest \
+        bash
+
+Where ``/PATH/TO/GRAPH`` is a directory containing the ``g.edges.csv.gz`` and
+``g.nodes.csv.gz`` files. By default, when entering the container the current
+working directory will be ``/srv/softwareheritage/graph``; all relative paths
+found below are intended to be relative to that dir.
 
-Where ``/path/to/graph`` is a directory containing the ``g.edges.csv.gz`` and
-``g.nodes.csv.gz`` files.
 
 Graph compression
 ~~~~~~~~~~~~~~~~~
 
-To start graph compression:
+To compress the graph:
 
 .. code:: bash
 
-    $ ./scripts/compress_graph.sh \
-        --input /graph/g \
-        --output /graph/compressed \
-        --lib /swh/graph-lib \
-        --tmp /graph/compressed/tmp \
-        --stdout /graph/compressed/stdout \
-        --stderr /graph/compressed/stderr
+    $ app/scripts/compress_graph.sh --lib lib/ --input data/g
 
 Warning: very large graphs may need a bigger batch size parameter for WebGraph
 internals (you can specify a value when running the compression script using:
 ``--batch-size 1000000000``).
 
-Node ids mapping
-~~~~~~~~~~~~~~~~
-
-To dump the mapping files:
+Node identifier mappings
+~~~~~~~~~~~~~~~~~~~~~~~~
 
-.. code:: bash
+To dump the mapping files (i.e., various node id <-> other info mapping files,
+in either ``.csv.gz`` or ad-hoc ``.map`` format):
 
-    $ java -cp /swh/app/swh-graph.jar \
-        org.softwareheritage.graph.backend.Setup /graph/compressed/g
+.. code:: bash
 
-This command outputs:
+    $ java -cp app/swh-graph.jar \
+        org.softwareheritage.graph.backend.Setup data/compressed/g
 
-- ``g.node2pid.csv``: long node id to string persistent identifier.
-- ``g.pid2node.csv``: string persistent identifier to long node id.
-
-REST API
-~~~~~~~~
+
+Graph server
+~~~~~~~~~~~~
 
-To start the REST API web-service:
+To start the swh-graph server:
 
 .. code:: bash
 
-    $ java -cp /swh/app/swh-graph.jar \
-        org.softwareheritage.graph.App /graph/compressed/g
+    $ java -cp app/swh-graph.jar \
+        org.softwareheritage.graph.App data/compressed/g
diff --git a/java/server/src/test/dataset/generate_graph.sh b/java/server/src/test/dataset/generate_graph.sh
--- a/java/server/src/test/dataset/generate_graph.sh
+++ b/java/server/src/test/dataset/generate_graph.sh
@@ -15,13 +15,13 @@
 gzip --force --keep example.edges.csv
 gzip --force --keep example.nodes.csv
 
-docker run \
-    --user $(id -u):$(id -g) \
-    --name swh-graph-test --rm --tty --interactive \
-    --volume $(pwd):/input \
-    --volume $(pwd)/output:/output \
-    swh-graph-test:latest \
-    ./scripts/compress_graph.sh \
-    --input /input/example --output /output \
-    --lib /swh/graph-lib --tmp /output/tmp \
-    --stdout /output/stdout --stderr /output/stderr
+docker run \
+    --user $(id -u):$(id -g) \
+    --name swh-graph-test --rm --tty --interactive \
+    --volume $(pwd):/input \
+    --volume $(pwd)/output:/output \
+    swh-graph-test:latest \
+    app/scripts/compress_graph.sh \
+    --lib lib/ \
+    --input /input/example \
+    --outdir /output
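
Reviewer note: the revised ``compress_graph.sh`` switches from requiring every path up front to deriving ``--outdir``, ``--tmp``, ``--stdout``, and ``--stderr`` from the input graph path when they are not given. A minimal standalone sketch of that parse-then-default pattern, runnable outside the container (the ``resolve_opts`` function and its output format are illustrative, not part of the patch):

```shell
#!/bin/bash
# Sketch of the option handling used by the revised compress_graph.sh:
# parse flags first, then fill in any unset path from the input graph path.
# resolve_opts is a hypothetical stand-in that just prints the resolved paths.
resolve_opts() {
    local graph_path="" out_dir="" tmp_dir=""
    while (( "$#" )); do
        case "$1" in
            -i|--input)  shift; graph_path=$1 ;;
            -o|--outdir) shift; out_dir=$1 ;;
            -t|--tmp)    shift; tmp_dir=$1 ;;
            *) echo "unknown option: $1" >&2; return 1 ;;
        esac
        shift
    done
    [ -n "$graph_path" ] || { echo "missing --input" >&2; return 1; }
    # Defaults mirror the patch: the output dir sits next to the graph,
    # and the tmp dir lives inside the output dir.
    : "${out_dir:=$(dirname "$graph_path")/compressed}"
    : "${tmp_dir:=${out_dir}/tmp}"
    echo "graph=$graph_path out=$out_dir tmp=$tmp_dir"
}

resolve_opts --input /data/g
```

Putting the defaults after parsing (rather than in the initial assignments, as the old script did) is what lets one default depend on another: ``tmp_dir`` can only inherit from ``out_dir`` once ``out_dir`` itself is settled.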