diff --git a/docs/cli.rst b/docs/cli.rst
index f660a03..b0dc4e3 100644
--- a/docs/cli.rst
+++ b/docs/cli.rst
@@ -1,6 +1,6 @@
 Command-line interface
 ======================
 
-.. click:: swh.graph.cli:cli
+.. click:: swh.graph.cli:graph_cli_group
    :prog: swh graph
    :show-nested:
diff --git a/docs/compression.rst b/docs/compression.rst
index e46e7ec..edca8a7 100644
--- a/docs/compression.rst
+++ b/docs/compression.rst
@@ -1,123 +1,125 @@
+.. _graph-compression:
+
 Graph compression
 =================
 
 The compression process is a pipeline implemented for the most part on top
 of the `WebGraph framework `_ and ecosystem libraries. The compression
 pipeline consists of the following steps:
 
 .. figure:: images/compression_steps.png
    :align: center
    :alt: Compression steps
 
    Compression steps
 
 Each of these steps is briefly described below. For more details see the
 following paper:
 
 .. note::
 
    Paolo Boldi, Antoine Pietri, Sebastiano Vigna, Stefano Zacchiroli.
    `Ultra-Large-Scale Repository Analysis via Graph Compression `_. In
    proceedings of `SANER 2020 `_: The 27th IEEE International Conference
    on Software Analysis, Evolution and Reengineering. IEEE 2020. Links:
    `preprint `_, `bibtex `_.
 
 To perform graph compression in practice, install the ``swh.graph`` module
 and use the ``swh graph compress`` command line interface of the
 compression driver, which will run the various steps in the right order.
 See ``swh graph compress --help`` for usage details.
 
 1. MPH
 ------
 
 A node in the Software Heritage :ref:`data model ` is identified by its
 SWHID (see :ref:`persistent identifiers `). However, WebGraph internally
 uses integers to refer to node ids. A mapping between the string
 identifiers and the long (integer) ids is therefore needed before
 compressing the graph. From the `Sux4J `_ utility tool, we use the
 `GOVMinimalPerfectHashFunction `_ class, which maps N keys to N
 consecutive integers with no collisions.
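To make the contract concrete, here is a toy Python sketch (our own illustration, not how Sux4J implements it): a minimal perfect hash maps N distinct keys, with no collisions, onto exactly the integers 0..N-1.

```python
# Toy illustration of the *contract* of a minimal perfect hash function:
# N distinct keys map bijectively onto the N consecutive integers 0..N-1.
# (Assumption: this is NOT the Sux4J algorithm; the real .mph structure
# achieves the same mapping in only a few bits per key.)

swhids = [
    "swh:1:cnt:0000000000000000000000000000000000000001",
    "swh:1:dir:0000000000000000000000000000000000000002",
    "swh:1:rev:0000000000000000000000000000000000000003",
]

# A plain dict is the simplest structure exhibiting the same behaviour:
mph = {swhid: i for i, swhid in enumerate(swhids)}

# Every key gets a unique id, and the ids are exactly 0..N-1: "minimal"
# (no gaps) and "perfect" (no collisions).
assert sorted(mph.values()) == list(range(len(swhids)))
```

Unlike a dict, the real structure does not store the keys themselves, which is why looking up a string that was not in the key set returns an arbitrary integer rather than an error.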
 The step produces a ``.mph`` file (MPH stands for *Minimal Perfect
 Hash-function*), storing the hash function, which takes a string as input
 and returns a unique integer.
 
 2. BV compress
 --------------
 
 This is the first actual compression step, building a compressed version of
 the input graph using the WebGraph techniques presented in the framework
 paper. We use the `ScatteredArcsASCIIGraph `_ class from WebGraph.
 
 The resulting BV graph is stored as a set of files:
 
 - ``.graph``: the compressed graph in the BV format
 - ``.offsets``: offset values to read the bit stream graph file
 - ``.obl``: offsets cache to load the graph faster
 - ``.properties``: entries used to correctly decode graph and offset files
 
 3. BFS
 ------
 
 In the LLP paper, the authors present an empirical analysis linking node
 ordering and compression ratio: it is important to use an ordering of node
 ids such that vertices from the same host are close to one another.
 
 Building on this insight, the compression results of the previous BV
 compress step are improved by re-ordering node ids using a BFS traversal
 order. We use the `BFS `_ class from the `LAW `_ library.
 
 The resulting ordering is stored in the ``.order`` file, listing node ids
 in traversal order.
 
 4. Permute
 ----------
 
 Once the order is computed (by BFS or another ordering technique), the
 final compressed graph is created from the initial BV compress result,
 using the new node order mapping. The permutation uses the `Transform `_
 class from the WebGraph framework.
 
 The final compressed graph is stored in the resulting ``.graph``,
 ``.offsets``, ``.obl``, and ``.properties`` files only.
 
 5. Stats
 --------
 
 This step computes various statistics on the final compressed graph:
 
 - ``.stats``: entries such as number of nodes, edges, avg/min/max degree,
   average locality, etc.
 - ``.indegree``: graph indegree distribution
 - ``.outdegree``: graph outdegree distribution
 
 It uses the `Stats `_ class from WebGraph.
 
 6. Transpose
 ------------
 
 This step creates a transposed graph to allow backward traversal, using the
 `Transform `_ class from WebGraph.
diff --git a/docs/quickstart.rst b/docs/quickstart.rst
index b65334e..a06010a 100644
--- a/docs/quickstart.rst
+++ b/docs/quickstart.rst
@@ -1,174 +1,174 @@
 Quickstart
 ==========
 
 This quick tutorial shows how to compress and browse a graph using
 `swh.graph`. It does not cover the technical details behind the graph
 compression techniques
-(refer to :ref:`Graph compression `).
+(refer to :ref:`graph-compression`).
 
 Dependencies
 ------------
 
 To run the `swh.graph` tool, you will need Python (>= 3.7) and a Java JRE;
 you do not need the JDK if you install the package from PyPI, but you may
 want it if you intend to hack on the code or install the package from this
 git repository. To compress a graph, you will need the zstd_ compression
 tools.
 
 It is highly recommended to install this package in a virtualenv.
 
 On a Debian stable (buster) system:
 
 .. code:: bash
 
    $ sudo apt install python3-virtualenv default-jre zstd
 
 .. _zstd: https://facebook.github.io/zstd/
 
 Install
 -------
 
 Create a virtualenv and activate it:
 
 .. code:: bash
 
    ~/tmp$ mkdir swh-graph-tests
    ~/tmp$ cd swh-graph-tests
    ~/t/swh-graph-tests$ virtualenv swhenv
    ~/t/swh-graph-tests$ . swhenv/bin/activate
 
 Install the `swh.graph` python package:
 
 .. code:: bash
 
    (swhenv) ~/t/swh-graph-tests$ pip install swh.graph
    [...]
    (swhenv) ~/t/swh-graph-tests$ swh graph --help
    Usage: swh graph [OPTIONS] COMMAND [ARGS]...
 
      Software Heritage graph tools.
 
    Options:
      -C, --config-file FILE  YAML configuration file
      -h, --help              Show this message and exit.
 
    Commands:
      api-client  client for the graph REST service
      cachemount  Cache the mmapped files of the compressed graph in a tmpfs.
      compress    Compress a graph using WebGraph Input: a pair of files...
      map         Manage swh-graph on-disk maps
      rpc-serve   run the graph REST service
 
 Compression
 -----------
 
 Existing datasets
 ^^^^^^^^^^^^^^^^^
 
 You can directly use compressed graph datasets provided by Software
 Heritage. Here is a small and realistic dataset (3.1GB):
 https://annex.softwareheritage.org/public/dataset/graph/latest/popular-3k-python/python3kcompress.tar
 
 .. code:: bash
 
    (swhenv) ~/t/swh-graph-tests$ curl -O https://annex.softwareheritage.org/public/dataset/graph/latest/popular-3k-python/python3kcompress.tar
    (swhenv) ~/t/swh-graph-tests$ tar xvf python3kcompress.tar
    (swhenv) ~/t/swh-graph-tests$ touch python3kcompress/*.obl # fix the mtime of cached offset files to allow faster loading
 
 Note: not for the faint of heart, but the full dataset is available at:
 https://annex.softwareheritage.org/public/dataset/graph/latest/compressed/
 
 Own datasets
 ^^^^^^^^^^^^
 
 A graph is described by both its adjacency list and the set of its node
 identifiers, in plain text format. An example of such a graph can be found
 in the `swh/graph/tests/dataset/` folder.
 
 You can compress the example graph on the command line like this:
 
 .. code:: bash
 
    (swhenv) ~/t/swh-graph-tests$ swh graph compress --graph swh/graph/tests/dataset/example --outdir output/
    [...]
    (swhenv) ~/t/swh-graph-tests$ ls output/
    example-bv.properties  example.mph             example.obl      example.outdegree   example.swhid2node.bin    example-transposed.offsets
    example.graph          example.node2swhid.bin  example.offsets  example.properties  example-transposed.graph  example-transposed.properties
    example.indegree       example.node2type.map   example.order    example.stats       example-transposed.obl
 
 API server
 ----------
 
 To start a `swh.graph` API server for a compressed graph dataset, run:
 
 .. code:: bash
 
    (swhenv) ~/t/swh-graph-tests$ swh graph rpc-serve -g output/example
    Loading graph output/example ...
    Graph loaded.
    ======== Running on http://0.0.0.0:5009 ========
    (Press CTRL+C to quit)
 
 From there you can query the compressed graph, for example with httpie_
 (`sudo apt install httpie`) from another terminal:
 
 .. _httpie: https://httpie.org
 
 .. code:: bash
 
    ~/tmp$ http :5009/graph/visit/nodes/swh:1:rel:0000000000000000000000000000000000000010
    HTTP/1.1 200 OK
    Content-Type: text/plain
    Date: Tue, 15 Sep 2020 08:33:25 GMT
    Server: Python/3.8 aiohttp/3.6.2
    Transfer-Encoding: chunked
 
    swh:1:rel:0000000000000000000000000000000000000010
    swh:1:rev:0000000000000000000000000000000000000009
    swh:1:rev:0000000000000000000000000000000000000003
    swh:1:dir:0000000000000000000000000000000000000002
    swh:1:cnt:0000000000000000000000000000000000000001
    swh:1:dir:0000000000000000000000000000000000000008
    swh:1:dir:0000000000000000000000000000000000000006
    swh:1:cnt:0000000000000000000000000000000000000004
    swh:1:cnt:0000000000000000000000000000000000000005
    swh:1:cnt:0000000000000000000000000000000000000007
 
 Running the server on the existing `python3kcompress` dataset:
 
 .. code:: bash
 
    (swhenv) ~/t/swh-graph-tests$ swh graph rpc-serve -g python3kcompress/python3k
    Loading graph python3kcompress/python3k ...
    Graph loaded.
    ======== Running on http://0.0.0.0:5009 ========
    (Press CTRL+C to quit)
 
    ~/tmp$ http :5009/graph/leaves/swh:1:dir:432d1b21c1256f7408a07c577b6974bbdbcc1323
    HTTP/1.1 200 OK
    Content-Type: text/plain
    Date: Tue, 15 Sep 2020 08:35:19 GMT
    Server: Python/3.8 aiohttp/3.6.2
    Transfer-Encoding: chunked
 
    swh:1:cnt:33af56e02dd970873d8058154bf016ec73b35dfb
    swh:1:cnt:b03b4ffd7189ae5457d8e1c2ee0490b1938fd79f
    swh:1:cnt:74d127c2186f7f0e8b14a27249247085c49d548a
    swh:1:cnt:c0139aa8e79b338e865a438326629fa22fa8f472
    [...]
    swh:1:cnt:a6b60e797063fef707bbaa4f90cfb4a2cbbddd4a
    swh:1:cnt:cc0a1deca559c1dd2240c08156d31cde1d8ed406
 
 See the documentation of the :ref:`API ` for more details.
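The endpoints shown above stream their results as plain text, one SWHID per line. As a small sketch of a client-side helper (our own illustration; `parse_swhids` is a hypothetical name, not part of the swh.graph API), parsing such a response body needs only the standard library:

```python
import re

# Hypothetical helper, not part of swh.graph: split a plain-text graph API
# response into validated SWHIDs. The regex covers the core SWHID shape
# ("swh:1:<type>:<40 hex chars>") used in the examples above; see the
# persistent-identifiers documentation for the full syntax.
SWHID_RE = re.compile(r"^swh:1:(cnt|dir|rev|rel|snp):[0-9a-f]{40}$")

def parse_swhids(body):
    """Return the list of SWHIDs found in a plain-text response body."""
    swhids = []
    for line in body.splitlines():
        line = line.strip()
        if line and SWHID_RE.match(line):
            swhids.append(line)
    return swhids

# Sample payload, taken from the /graph/visit/nodes response shown above:
body = """\
swh:1:rel:0000000000000000000000000000000000000010
swh:1:rev:0000000000000000000000000000000000000009
swh:1:dir:0000000000000000000000000000000000000002
"""
assert len(parse_swhids(body)) == 3
```

With a live server, the same parsing applies to a response body fetched with any HTTP client, e.g. ``urllib.request.urlopen`` from the standard library.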