diff --git a/docs/quickstart.rst b/docs/quickstart.rst
--- a/docs/quickstart.rst
+++ b/docs/quickstart.rst
@@ -1,12 +1,11 @@
 Quickstart
 ==========
 
-This quick tutorial shows how to use a compressed graph dataset like the ones
-provided by Software Heritage, and make it browsable using the `swh.graph` API
-server.
+This quick tutorial shows how to compress and browse a graph using the
+`swh.graph` API.
 
 It does not cover the technical details behind the graph compression techniques
-nor how to generate these compressed graph files.
+(refer to :ref:`Graph compression `).
 
 
 Dependencies
@@ -25,6 +24,11 @@
 
     $ sudo apt install python3-virtualenv default-jre
 
+To compress a graph, you will need zstd_ compression tools.
+
+
+.. _zstd: https://facebook.github.io/zstd/
+
 
 Install
 -------
@@ -60,32 +64,70 @@
       map        Manage swh-graph on-disk maps
       rpc-serve  run the graph REST service
 
+Compression
+-----------
 
-API server
-----------
+Existing datasets
+^^^^^^^^^^^^^^^^^
 
-To start a `swh.graph` API server, you need a compressed graph dataset. You can
-download a small dataset here:
+You can directly use compressed graph datasets provided by Software Heritage.
+Here is a small and realistic dataset (3.1GB):
 
 https://annex.softwareheritage.org/public/dataset/graph/latest/popular-3k-python/python3kcompress.tar
 
-And use it as dataset for the `swh.graph` API:
-
 .. code:: bash
 
     (swhenv) ~/t/swh-graph-tests$ curl -O https://annex.softwareheritage.org/public/dataset/graph/latest/popular-3k-python/python3kcompress.tar
     (swhenv) ~/t/swh-graph-tests$ tar xvf python3kcompress.tar
     (swhenv) ~/t/swh-graph-tests$ touch python3kcompress/*.obl # fix the mtime of cached offset files to allow faster loading
-    (swhenv) ~/t/swh-graph-tests$ swh graph rpc-serve -g python3kcompress/python3k
-    Loading graph python3kcompress/python3k ...
-    Graph loaded.
-    ======== Running on http://0.0.0.0:5009 ========
-    (Press CTRL+C to quit)
 
 Note: not for the faint heart, but the full dataset is available at:
 
 https://annex.softwareheritage.org/public/dataset/graph/latest/compressed/
 
+Own datasets
+^^^^^^^^^^^^
+
+A graph is described as both its adjacency list and the set of nodes identifiers
+in plain text format. Such graph example can be found in the
+`swh/graph/tests/dataset/` folder. Depending on the machine you are using, you
+might want to tune parameters down for lower RAM usage. Parameters are
+configured in a separate YAML file:
+
+.. code:: yaml
+
+    graph:
+        compress:
+            batch_size: 1000
+
+Then, we can run the compression:
+
+.. code:: bash
+
+
+    (swhenv) ~/t/swh-graph-tests$ swh graph -C config.yml compress --graph swh/graph/tests/dataset/example --outdir output/
+
+    [...]
+
+    (swhenv) ~/t/swh-graph-tests$ ls output/
+    example-bv.properties  example.mph            example.obl      example.outdegree     example.stats             example-transposed.offsets
+    example.graph          example.node2pid.bin   example.offsets  example.pid2node.bin  example-transposed.graph  example-transposed.properties
+    example.indegree       example.node2type.map  example.order    example.properties    example-transposed.obl
+
+
+API server
+----------
+
+To start a `swh.graph` API server of a compressed graph dataset, run:
+
+.. code:: bash
+
+    (swhenv) ~/t/swh-graph-tests$ swh graph rpc-serve -g output/example
+    Loading graph output/example ...
+    Graph loaded.
+    ======== Running on http://0.0.0.0:5009 ========
+    (Press CTRL+C to quit)
+
 From there you can use this endpoint to query the compressed graph, for example
 with httpie_ (`sudo apt install`) from another terminal:
 
@@ -94,10 +136,40 @@
 
 .. code:: bash
 
+    ~/tmp$ http :5009/graph/visit/nodes/swh:1:rel:0000000000000000000000000000000000000010
+    HTTP/1.1 200 OK
+    Content-Type: text/plain
+    Date: Tue, 15 Sep 2020 08:33:25 GMT
+    Server: Python/3.8 aiohttp/3.6.2
+    Transfer-Encoding: chunked
+
+    swh:1:rel:0000000000000000000000000000000000000010
+    swh:1:rev:0000000000000000000000000000000000000009
+    swh:1:rev:0000000000000000000000000000000000000003
+    swh:1:dir:0000000000000000000000000000000000000002
+    swh:1:cnt:0000000000000000000000000000000000000001
+    swh:1:dir:0000000000000000000000000000000000000008
+    swh:1:dir:0000000000000000000000000000000000000006
+    swh:1:cnt:0000000000000000000000000000000000000004
+    swh:1:cnt:0000000000000000000000000000000000000005
+    swh:1:cnt:0000000000000000000000000000000000000007
+
+
+Running the existing `python3kcompress` dataset:
+
+.. code:: bash
+
+    (swhenv) ~/t/swh-graph-tests$ swh graph rpc-serve -g python3kcompress/python3k
+    Loading graph python3kcompress/python3k ...
+    Graph loaded.
+    ======== Running on http://0.0.0.0:5009 ========
+    (Press CTRL+C to quit)
+
+
     ~/tmp$ http :5009/graph/leaves/swh:1:dir:432d1b21c1256f7408a07c577b6974bbdbcc1323
     HTTP/1.1 200 OK
     Content-Type: text/plain
-    Date: Thu, 03 Sep 2020 12:12:58 GMT
+    Date: Tue, 15 Sep 2020 08:35:19 GMT
     Server: Python/3.8 aiohttp/3.6.2
     Transfer-Encoding: chunked
 
@@ -108,7 +180,6 @@
     [...]
     swh:1:cnt:a6b60e797063fef707bbaa4f90cfb4a2cbbddd4a
     swh:1:cnt:cc0a1deca559c1dd2240c08156d31cde1d8ed406
-    ~/tmp$
 
 
 See the documentation of the :ref:`API ` for more details.
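
The `text/plain` responses shown in the httpie transcripts list one SWHID per
line, so they are easy to consume from a script. The following is a minimal
sketch, not part of `swh.graph` itself: the helper names (`endpoint_url`,
`parse_swhids`, `query`) are hypothetical, and it assumes the server started by
`swh graph rpc-serve` is reachable on `localhost` at port 5009, as in the
transcripts. Only the `/graph/visit/nodes/` and `/graph/leaves/` paths shown in
this quickstart are taken from the source.

.. code:: python

    from typing import List
    from urllib.parse import quote
    from urllib.request import urlopen

    # Assumption: the RPC server listens on port 5009 on this machine.
    BASE_URL = "http://localhost:5009"


    def endpoint_url(endpoint: str, swhid: str, base_url: str = BASE_URL) -> str:
        """Build a URL such as http://localhost:5009/graph/leaves/<swhid>."""
        # Colons are kept as-is because they are part of the SWHID syntax.
        return f"{base_url}/graph/{endpoint}/{quote(swhid, safe=':')}"


    def parse_swhids(body: str) -> List[str]:
        """Split a text/plain response body into SWHIDs, skipping blank lines."""
        return [line.strip() for line in body.splitlines() if line.strip()]


    def query(endpoint: str, swhid: str) -> List[str]:
        """Fetch an endpoint and return the SWHIDs it lists.

        This performs a real HTTP request, so it needs a running server.
        """
        with urlopen(endpoint_url(endpoint, swhid)) as response:
            return parse_swhids(response.read().decode("utf-8"))

With the example graph served from `output/example`, calling
`query("visit/nodes", "swh:1:rel:0000000000000000000000000000000000000010")`
should return the same identifiers as the first httpie transcript.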