diff --git a/docs/cli.rst b/docs/cli.rst
index f660a03..b0dc4e3 100644
--- a/docs/cli.rst
+++ b/docs/cli.rst
@@ -1,6 +1,6 @@
Command-line interface
======================
-.. click:: swh.graph.cli:cli
+.. click:: swh.graph.cli:graph_cli_group
:prog: swh graph
:show-nested:
diff --git a/docs/compression.rst b/docs/compression.rst
index e46e7ec..edca8a7 100644
--- a/docs/compression.rst
+++ b/docs/compression.rst
@@ -1,123 +1,125 @@
+.. _graph-compression:
+
Graph compression
=================
The compression process is a pipeline implemented for the most part on top of
the `WebGraph framework <http://webgraph.di.unimi.it/>`_ and ecosystem
libraries. The compression pipeline consists of the following steps:
.. figure:: images/compression_steps.png
:align: center
:alt: Compression steps
Compression steps
Each of these steps is briefly described below. For more details see the
following paper:
.. note::
Paolo Boldi, Antoine Pietri, Sebastiano Vigna, Stefano Zacchiroli.
`Ultra-Large-Scale Repository Analysis via Graph Compression
`_. In
proceedings of `SANER 2020 `_: The 27th IEEE
International Conference on Software Analysis, Evolution and
Reengineering. IEEE 2020.
Links: `preprint
`_,
`bibtex
`_.
In order to perform graph compression in practice, install the ``swh.graph``
module and use the ``swh graph compress`` command line interface of the
compression driver, which runs the various steps in the right order.
See ``swh graph compress --help`` for usage details.
1. MPH
------
A node in the Software Heritage :ref:`data model ` is identified
by its SWHID (see :ref:`persistent identifiers
`). However, WebGraph internally uses integer ids to refer
to nodes.
A mapping between the string identifiers and the long ids is therefore needed
before compressing the graph. From the `Sux4J <http://sux.di.unimi.it/>`_
utility tool, we use the
`GOVMinimalPerfectHashFunction
`_
class, which maps N keys to N consecutive integers with no collisions.
This step produces a ``.mph`` file (MPH stands for *Minimal Perfect
Hash function*) storing the hash function, which takes a string as input and
returns a unique integer.
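As a toy illustration of the contract this step establishes — N distinct keys
mapped to the integers 0 to N-1 with no collisions — here is a minimal Python
sketch (the ``build_toy_mph`` helper is hypothetical, not part of swh.graph or
Sux4J; Sux4J computes such functions at scale without storing the keys):

```python
def build_toy_mph(swhids):
    """Return a function mapping each input SWHID to a unique int in [0, N).

    A plain dict stands in for the minimal perfect hash function: every
    key gets a distinct consecutive integer, which is the property the
    compression pipeline relies on.
    """
    index = {swhid: i for i, swhid in enumerate(sorted(set(swhids)))}
    return index.__getitem__

mph = build_toy_mph([
    "swh:1:cnt:0000000000000000000000000000000000000001",
    "swh:1:dir:0000000000000000000000000000000000000002",
    "swh:1:rev:0000000000000000000000000000000000000003",
])
```

Unlike this dict-based sketch, a real MPH uses space close to the
information-theoretic minimum and does not keep the keys around.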
2. BV compress
--------------
This is the first actual compression step, building a compressed version of the
input graph using WebGraph techniques presented in the framework paper. We use
the `ScatteredArcsASCIIGraph
`_
class, from WebGraph.
The resulting BV graph is stored as a set of files:
- ``.graph``: the compressed graph in the BV format
- ``.offsets``: offsets values to read the bit stream graph file
- ``.obl``: offsets cache to load the graph faster
- ``.properties``: entries used to correctly decode graph and offset files
3. BFS
-------
In the LLP paper, the authors present an empirical analysis linking node
ordering and compression ratio: it is important to use an ordering of node ids
such that vertices from the same host are close to one another.
Building on this insight, the compression obtained in the BV compress step is
improved by re-ordering node ids according to a BFS traversal order. We use
the `BFS
`_
class from the `LAW <http://law.di.unimi.it/>`_ library.
The resulting ordering is stored in the ``.order`` file, listing node ids in
order of traversal.
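The ordering computed in this step can be sketched in pure Python, assuming
the graph fits in memory as a dict of successor lists (the ``bfs_order``
helper below is a hypothetical toy, not the LAW implementation):

```python
from collections import deque

def bfs_order(graph, roots):
    """Return node ids in BFS traversal order.

    `graph` maps each node id to the list of its successors; nodes not
    reached from `roots` are visited afterwards, so the result is a
    permutation of all node ids.
    """
    seen, order = set(), []
    for start in list(roots) + list(graph):
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        while queue:
            node = queue.popleft()
            order.append(node)
            for succ in graph.get(node, ()):
                if succ not in seen:
                    seen.add(succ)
                    queue.append(succ)
    return order
```

The real implementation works on the compressed on-disk graph rather than an
in-memory dict, but produces the same kind of permutation.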
4. Permute
----------
Once the order is computed (by BFS or another ordering technique), the final
compressed graph is created from the initial BV compress result, applying the
new node order mapping. The permutation uses the `Transform
`_
class from the WebGraph framework.
The final compressed graph is stored only in the resulting ``.graph``,
``.offsets``, ``.obl``, and ``.properties`` files.
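The effect of applying a permutation can be sketched on a dict-based toy graph
(the ``permute_graph`` helper is hypothetical; WebGraph's Transform performs
the equivalent operation directly on the compressed representation):

```python
def permute_graph(graph, order):
    """Relabel a graph according to a new node ordering.

    `order` lists the old node ids in their new order, as produced by
    the BFS step: old id order[i] becomes new id i.
    """
    new_id = {old: new for new, old in enumerate(order)}
    return {new_id[u]: sorted(new_id[v] for v in succs)
            for u, succs in graph.items()}
```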
5. Stats
--------
Compute various statistics on the final compressed graph:
- ``.stats``: entries such as number of nodes, edges, avg/min/max degree,
average locality, etc.
- ``.indegree``: graph indegree distribution
- ``.outdegree``: graph outdegree distribution
This step uses the `Stats
`_
class from WebGraph.
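The two degree distributions can be sketched as follows on a dict-based toy
graph (the ``degree_distributions`` helper is hypothetical; the real Stats
class computes these, plus locality metrics, on the compressed graph):

```python
from collections import Counter

def degree_distributions(graph):
    """Return (indegree, outdegree) distributions as degree -> node count."""
    # Count incoming arcs per node.
    indeg = Counter()
    for succs in graph.values():
        for v in succs:
            indeg[v] += 1
    # Aggregate per-node degrees into distributions.
    indegree = Counter(indeg.get(u, 0) for u in graph)
    outdegree = Counter(len(succs) for succs in graph.values())
    return indegree, outdegree
```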
6. Transpose
------------
Create a transposed graph to allow backward traversal, using the `Transform
`_
class from WebGraph.
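Transposition itself is conceptually simple, as this toy sketch on a dict of
successor lists shows (the ``transpose`` helper below is hypothetical; the
actual step runs on the compressed representation):

```python
def transpose(graph):
    """Reverse every arc of the graph: u -> v becomes v -> u."""
    transposed = {u: [] for u in graph}
    for u, succs in graph.items():
        for v in succs:
            transposed.setdefault(v, []).append(u)
    return transposed
```

Keeping both the forward and the transposed graph is what allows the API to
answer both forward and backward traversal queries.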
diff --git a/docs/quickstart.rst b/docs/quickstart.rst
index b65334e..a06010a 100644
--- a/docs/quickstart.rst
+++ b/docs/quickstart.rst
@@ -1,174 +1,174 @@
Quickstart
==========
This quick tutorial shows how to compress and browse a graph using `swh.graph`.
It does not cover the technical details behind the graph compression techniques
-(refer to :ref:`Graph compression `).
+(refer to :ref:`graph-compression`).
Dependencies
------------
In order to run the `swh.graph` tool, you will need Python (>= 3.7) and a Java
JRE. You do not need the JDK if you install the package from PyPI, but you may
want to install it if you intend to hack on the code or install it from this
git repository. To compress a graph, you will need the zstd_ compression tools.
It is highly recommended to install this package in a virtualenv.
On a Debian stable (buster) system:
.. code:: bash
$ sudo apt install python3-virtualenv default-jre zstd
.. _zstd: https://facebook.github.io/zstd/
Install
-------
Create a virtualenv and activate it:
.. code:: bash
~/tmp$ mkdir swh-graph-tests
~/tmp$ cd swh-graph-tests
~/t/swh-graph-tests$ virtualenv swhenv
~/t/swh-graph-tests$ . swhenv/bin/activate
Install the `swh.graph` python package:
.. code:: bash
(swhenv) ~/t/swh-graph-tests$ pip install swh.graph
[...]
(swhenv) ~/t/swh-graph-tests$ swh graph --help
Usage: swh graph [OPTIONS] COMMAND [ARGS]...
Software Heritage graph tools.
Options:
-C, --config-file FILE YAML configuration file
-h, --help Show this message and exit.
Commands:
api-client client for the graph REST service
cachemount Cache the mmapped files of the compressed graph in a tmpfs.
compress Compress a graph using WebGraph Input: a pair of files...
map Manage swh-graph on-disk maps
rpc-serve run the graph REST service
Compression
-----------
Existing datasets
^^^^^^^^^^^^^^^^^
You can directly use compressed graph datasets provided by Software Heritage.
Here is a small and realistic dataset (3.1GB):
https://annex.softwareheritage.org/public/dataset/graph/latest/popular-3k-python/python3kcompress.tar
.. code:: bash
(swhenv) ~/t/swh-graph-tests$ curl -O https://annex.softwareheritage.org/public/dataset/graph/latest/popular-3k-python/python3kcompress.tar
(swhenv) ~/t/swh-graph-tests$ tar xvf python3kcompress.tar
(swhenv) ~/t/swh-graph-tests$ touch python3kcompress/*.obl # fix the mtime of cached offset files to allow faster loading
Note: not for the faint of heart, but the full dataset is available at:
https://annex.softwareheritage.org/public/dataset/graph/latest/compressed/
Own datasets
^^^^^^^^^^^^
A graph is described as both its adjacency list and the set of node
identifiers, in plain text format. An example of such a graph can be found in
the `swh/graph/tests/dataset/` folder.
You can compress the example graph on the command line like this:
.. code:: bash
(swhenv) ~/t/swh-graph-tests$ swh graph compress --graph swh/graph/tests/dataset/example --outdir output/
[...]
(swhenv) ~/t/swh-graph-tests$ ls output/
example-bv.properties example.mph example.obl example.outdegree example.swhid2node.bin example-transposed.offsets
example.graph example.node2swhid.bin example.offsets example.properties example-transposed.graph example-transposed.properties
example.indegree example.node2type.map example.order example.stats example-transposed.obl
API server
----------
To start a `swh.graph` API server of a compressed graph dataset, run:
.. code:: bash
(swhenv) ~/t/swh-graph-tests$ swh graph rpc-serve -g output/example
Loading graph output/example ...
Graph loaded.
======== Running on http://0.0.0.0:5009 ========
(Press CTRL+C to quit)
From there you can use this endpoint to query the compressed graph, for example
with httpie_ (``sudo apt install httpie``) from another terminal:
.. _httpie: https://httpie.org
.. code:: bash
~/tmp$ http :5009/graph/visit/nodes/swh:1:rel:0000000000000000000000000000000000000010
HTTP/1.1 200 OK
Content-Type: text/plain
Date: Tue, 15 Sep 2020 08:33:25 GMT
Server: Python/3.8 aiohttp/3.6.2
Transfer-Encoding: chunked
swh:1:rel:0000000000000000000000000000000000000010
swh:1:rev:0000000000000000000000000000000000000009
swh:1:rev:0000000000000000000000000000000000000003
swh:1:dir:0000000000000000000000000000000000000002
swh:1:cnt:0000000000000000000000000000000000000001
swh:1:dir:0000000000000000000000000000000000000008
swh:1:dir:0000000000000000000000000000000000000006
swh:1:cnt:0000000000000000000000000000000000000004
swh:1:cnt:0000000000000000000000000000000000000005
swh:1:cnt:0000000000000000000000000000000000000007
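The same endpoint can also be queried from Python using only the standard
library. This is a sketch assuming the server started above is listening on
``localhost:5009``; ``visit_nodes`` and ``visit_nodes_url`` are hypothetical
helper names, not part of the swh.graph API:

```python
from urllib.request import urlopen

def visit_nodes_url(swhid, base="http://localhost:5009"):
    """Build the URL for the /graph/visit/nodes endpoint."""
    return f"{base}/graph/visit/nodes/{swhid}"

def visit_nodes(swhid, base="http://localhost:5009"):
    """Yield each SWHID reached from `swhid`, one per response line."""
    with urlopen(visit_nodes_url(swhid, base)) as resp:
        for line in resp:
            yield line.decode().rstrip()
```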
Serving the existing `python3kcompress` dataset:
.. code:: bash
(swhenv) ~/t/swh-graph-tests$ swh graph rpc-serve -g python3kcompress/python3k
Loading graph python3kcompress/python3k ...
Graph loaded.
======== Running on http://0.0.0.0:5009 ========
(Press CTRL+C to quit)
~/tmp$ http :5009/graph/leaves/swh:1:dir:432d1b21c1256f7408a07c577b6974bbdbcc1323
HTTP/1.1 200 OK
Content-Type: text/plain
Date: Tue, 15 Sep 2020 08:35:19 GMT
Server: Python/3.8 aiohttp/3.6.2
Transfer-Encoding: chunked
swh:1:cnt:33af56e02dd970873d8058154bf016ec73b35dfb
swh:1:cnt:b03b4ffd7189ae5457d8e1c2ee0490b1938fd79f
swh:1:cnt:74d127c2186f7f0e8b14a27249247085c49d548a
swh:1:cnt:c0139aa8e79b338e865a438326629fa22fa8f472
[...]
swh:1:cnt:a6b60e797063fef707bbaa4f90cfb4a2cbbddd4a
swh:1:cnt:cc0a1deca559c1dd2240c08156d31cde1d8ed406
See the documentation of the :ref:`API ` for more details.