docs/quickstart.rst
Quickstart | Quickstart | ||||
========== | ========== | ||||
This quick tutorial shows how to use a compressed graph dataset like the ones | This quick tutorial shows how to compress and browse a graph using `swh.graph`. | ||||
provided by Software Heritage, and make it browsable using the `swh.graph` API | |||||
server. | |||||
zack: I'd say using `swh-graph` (the API doesn't matter much).
It does not cover the technical details behind the graph compression techniques | It does not cover the technical details behind the graph compression techniques | ||||
nor how to generate these compressed graph files. | (refer to :ref:`Graph compression <compression>`). | ||||
Dependencies | Dependencies | ||||
------------ | ------------ | ||||
In order to run the `swh.graph` tool, you will need Python (>= 3.7) and Java | In order to run the `swh.graph` tool, you will need Python (>= 3.7) and Java | ||||
JRE, you do not need the JDK if you install the package from pypi, but may want | JRE, you do not need the JDK if you install the package from pypi, but may want | ||||
to install it if you want to hack the code or install it from this git | to install it if you want to hack the code or install it from this git | ||||
repository. | repository. To compress a graph, you will need zstd_ compression tools. | ||||
It is highly recommended to install this package in a virtualenv. | It is highly recommended to install this package in a virtualenv. | ||||
On a Debian stable (buster) system: | On a Debian stable (buster) system: | ||||
.. code:: bash | .. code:: bash | ||||
$ sudo apt install python3-virtualenv default-jre | $ sudo apt install python3-virtualenv default-jre zstd | ||||
.. _zstd: https://facebook.github.io/zstd/ | |||||
Install | Install | ||||
------- | ------- | ||||
[Done] zack: Please merge this with the previous stuff, i.e., add zstd to the first paragraph and the `zstd` package name to the apt install line.
Create a virtualenv and activate it: | Create a virtualenv and activate it: | ||||
.. code:: bash | .. code:: bash | ||||
~/tmp$ mkdir swh-graph-tests | ~/tmp$ mkdir swh-graph-tests | ||||
~/tmp$ cd swh-graph-tests | ~/tmp$ cd swh-graph-tests | ||||
~/t/swh-graph-tests$ virtualenv swhenv | ~/t/swh-graph-tests$ virtualenv swhenv | ||||
[16 unchanged lines hidden] | .. code:: bash | ||||
Commands: | Commands: | ||||
api-client client for the graph REST service | api-client client for the graph REST service | ||||
cachemount Cache the mmapped files of the compressed graph in a tmpfs. | cachemount Cache the mmapped files of the compressed graph in a tmpfs. | ||||
compress Compress a graph using WebGraph Input: a pair of files... | compress Compress a graph using WebGraph Input: a pair of files... | ||||
map Manage swh-graph on-disk maps | map Manage swh-graph on-disk maps | ||||
rpc-serve run the graph REST service | rpc-serve run the graph REST service | ||||
Compression | |||||
----------- | |||||
API server | Existing datasets | ||||
---------- | ^^^^^^^^^^^^^^^^^ | ||||
To start a `swh.graph` API server, you need a compressed graph dataset. You can | You can directly use compressed graph datasets provided by Software Heritage. | ||||
download a small dataset here: | Here is a small and realistic dataset (3.1GB): | ||||
https://annex.softwareheritage.org/public/dataset/graph/latest/popular-3k-python/python3kcompress.tar | https://annex.softwareheritage.org/public/dataset/graph/latest/popular-3k-python/python3kcompress.tar | ||||
[Not done] zack: This is also documented in more detail in the `swh-dataset` package, see, e.g., https://docs.softwareheritage.org/devel/swh-dataset/graph/dataset.html, which also lists other datasets (small and large). You might want to sphinx-crosslink that package from here.
[Done] haltode: Sure! However, the graphs compressed with `swh-graph` are not linked there (and it seems to be missing for the popular-4k dataset). Should I add those on the swh-dataset doc page first?
And use it as dataset for the `swh.graph` API: | |||||
.. code:: bash | .. code:: bash | ||||
(swhenv) ~/t/swh-graph-tests$ curl -O https://annex.softwareheritage.org/public/dataset/graph/latest/popular-3k-python/python3kcompress.tar | (swhenv) ~/t/swh-graph-tests$ curl -O https://annex.softwareheritage.org/public/dataset/graph/latest/popular-3k-python/python3kcompress.tar | ||||
(swhenv) ~/t/swh-graph-tests$ tar xvf python3kcompress.tar | (swhenv) ~/t/swh-graph-tests$ tar xvf python3kcompress.tar | ||||
(swhenv) ~/t/swh-graph-tests$ touch python3kcompress/*.obl # fix the mtime of cached offset files to allow faster loading | (swhenv) ~/t/swh-graph-tests$ touch python3kcompress/*.obl # fix the mtime of cached offset files to allow faster loading | ||||
(swhenv) ~/t/swh-graph-tests$ swh graph rpc-serve -g python3kcompress/python3k | |||||
Loading graph python3kcompress/python3k ... | |||||
Graph loaded. | |||||
======== Running on http://0.0.0.0:5009 ======== | |||||
(Press CTRL+C to quit) | |||||
Note: not for the faint heart, but the full dataset is available at: | Note: not for the faint heart, but the full dataset is available at: | ||||
https://annex.softwareheritage.org/public/dataset/graph/latest/compressed/ | https://annex.softwareheritage.org/public/dataset/graph/latest/compressed/ | ||||
Own datasets | |||||
^^^^^^^^^^^^ | |||||
A graph is described as both its adjacency list and the set of nodes identifiers | |||||
in plain text format. Such graph example can be found in the | |||||
`swh/graph/tests/dataset/` folder. Depending on the machine you are using, you | |||||
might want to tune parameters down for lower RAM usage. Parameters are | |||||
[Not done] zack: I begrudgingly ack this documentation change, because without it things would explode. But at the same time it would be better to find a way to have a sane default (e.g., computed as a proportion of the number of edges, in case it is already known at this stage?) that just works in most cases, rather than one that does not and comes with a recommended configuration tuning for users.
[Done] haltode: Agreed, I opened a new task for this: T2595.
configured in a separate YAML file: | |||||
.. code:: yaml | |||||
graph: | |||||
compress: | |||||
batch_size: 1000 | |||||
Then, we can run the compression: | |||||
.. code:: bash | |||||
(swhenv) ~/t/swh-graph-tests$ swh graph -C config.yml compress --graph swh/graph/tests/dataset/example --outdir output/ | |||||
[...] | |||||
(swhenv) ~/t/swh-graph-tests$ ls output/ | |||||
example-bv.properties example.mph example.obl example.outdegree example.stats example-transposed.offsets | |||||
example.graph example.node2pid.bin example.offsets example.pid2node.bin example-transposed.graph example-transposed.properties | |||||
example.indegree example.node2type.map example.order example.properties example-transposed.obl | |||||
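Before pointing the server at a freshly compressed graph, it can help to check that the output directory looks complete. The following is a minimal sketch, not part of `swh.graph`: the helper name `missing_outputs` is hypothetical, and the suffix list simply mirrors the `ls output/` listing above. Which of these files the server strictly requires is not stated here, so treat the list as an assumption.

```python
from pathlib import Path

# Suffixes copied from the `ls output/` listing above (assumption: a complete
# compression run produces all of them for a given basename).
LISTED_SUFFIXES = [
    "-bv.properties", ".mph", ".obl", ".outdegree", ".stats",
    "-transposed.offsets", ".graph", ".node2pid.bin", ".offsets",
    ".pid2node.bin", "-transposed.graph", "-transposed.properties",
    ".indegree", ".node2type.map", ".order", ".properties",
    "-transposed.obl",
]


def missing_outputs(outdir, basename, suffixes=LISTED_SUFFIXES):
    """Return the suffixes for which `<outdir>/<basename><suffix>` is not a file."""
    outdir = Path(outdir)
    return [s for s in suffixes if not (outdir / f"{basename}{s}").is_file()]
```

For the run above, `missing_outputs("output", "example")` would return an empty list if every listed file is present.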
API server | |||||
---------- | |||||
To start a `swh.graph` API server of a compressed graph dataset, run: | |||||
.. code:: bash | |||||
(swhenv) ~/t/swh-graph-tests$ swh graph rpc-serve -g output/example | |||||
Loading graph output/example ... | |||||
Graph loaded. | |||||
======== Running on http://0.0.0.0:5009 ======== | |||||
(Press CTRL+C to quit) | |||||
From there you can use this endpoint to query the compressed graph, for example | From there you can use this endpoint to query the compressed graph, for example | ||||
with httpie_ (`sudo apt install`) from another terminal: | with httpie_ (`sudo apt install`) from another terminal: | ||||
.. _httpie: https://httpie.org | .. _httpie: https://httpie.org | ||||
.. code:: bash | .. code:: bash | ||||
~/tmp$ http :5009/graph/visit/nodes/swh:1:rel:0000000000000000000000000000000000000010 | |||||
HTTP/1.1 200 OK | |||||
Content-Type: text/plain | |||||
Date: Tue, 15 Sep 2020 08:33:25 GMT | |||||
Server: Python/3.8 aiohttp/3.6.2 | |||||
Transfer-Encoding: chunked | |||||
swh:1:rel:0000000000000000000000000000000000000010 | |||||
swh:1:rev:0000000000000000000000000000000000000009 | |||||
swh:1:rev:0000000000000000000000000000000000000003 | |||||
swh:1:dir:0000000000000000000000000000000000000002 | |||||
swh:1:cnt:0000000000000000000000000000000000000001 | |||||
swh:1:dir:0000000000000000000000000000000000000008 | |||||
swh:1:dir:0000000000000000000000000000000000000006 | |||||
swh:1:cnt:0000000000000000000000000000000000000004 | |||||
swh:1:cnt:0000000000000000000000000000000000000005 | |||||
swh:1:cnt:0000000000000000000000000000000000000007 | |||||
Running the existing `python3kcompress` dataset: | |||||
.. code:: bash | |||||
(swhenv) ~/t/swh-graph-tests$ swh graph rpc-serve -g python3kcompress/python3k | |||||
Loading graph python3kcompress/python3k ... | |||||
Graph loaded. | |||||
======== Running on http://0.0.0.0:5009 ======== | |||||
(Press CTRL+C to quit) | |||||
~/tmp$ http :5009/graph/leaves/swh:1:dir:432d1b21c1256f7408a07c577b6974bbdbcc1323 | ~/tmp$ http :5009/graph/leaves/swh:1:dir:432d1b21c1256f7408a07c577b6974bbdbcc1323 | ||||
HTTP/1.1 200 OK | HTTP/1.1 200 OK | ||||
Content-Type: text/plain | Content-Type: text/plain | ||||
Date: Thu, 03 Sep 2020 12:12:58 GMT | Date: Tue, 15 Sep 2020 08:35:19 GMT | ||||
Server: Python/3.8 aiohttp/3.6.2 | Server: Python/3.8 aiohttp/3.6.2 | ||||
Transfer-Encoding: chunked | Transfer-Encoding: chunked | ||||
swh:1:cnt:33af56e02dd970873d8058154bf016ec73b35dfb | swh:1:cnt:33af56e02dd970873d8058154bf016ec73b35dfb | ||||
swh:1:cnt:b03b4ffd7189ae5457d8e1c2ee0490b1938fd79f | swh:1:cnt:b03b4ffd7189ae5457d8e1c2ee0490b1938fd79f | ||||
swh:1:cnt:74d127c2186f7f0e8b14a27249247085c49d548a | swh:1:cnt:74d127c2186f7f0e8b14a27249247085c49d548a | ||||
swh:1:cnt:c0139aa8e79b338e865a438326629fa22fa8f472 | swh:1:cnt:c0139aa8e79b338e865a438326629fa22fa8f472 | ||||
[...] | [...] | ||||
swh:1:cnt:a6b60e797063fef707bbaa4f90cfb4a2cbbddd4a | swh:1:cnt:a6b60e797063fef707bbaa4f90cfb4a2cbbddd4a | ||||
swh:1:cnt:cc0a1deca559c1dd2240c08156d31cde1d8ed406 | swh:1:cnt:cc0a1deca559c1dd2240c08156d31cde1d8ed406 | ||||
~/tmp$ | |||||
See the documentation of the :ref:`API <swh-graph-api>` for more details. | See the documentation of the :ref:`API <swh-graph-api>` for more details. |
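The same queries can be issued from Python instead of httpie. This is a minimal sketch using only the standard library. It assumes the server from this page is listening on port 5009 and that responses are `text/plain` with one identifier per line, as in the outputs shown above. The function names `parse_swhids`, `count_by_type` and `query` are illustrative and are not part of the `swh.graph` package.

```python
import re
from collections import Counter
from urllib.request import urlopen

# Loose pattern matching the identifiers shown in the sample responses:
# "swh:1:<3-letter type>:<40 hex digits>".
SWHID_RE = re.compile(r"^swh:1:[a-z]{3}:[0-9a-f]{40}$")


def parse_swhids(body):
    """Parse a text/plain response body into a list of identifiers."""
    ids = []
    for line in body.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if not SWHID_RE.match(line):
            raise ValueError(f"unexpected line in response: {line!r}")
        ids.append(line)
    return ids


def count_by_type(ids):
    """Count identifiers per object type (the third colon-separated field)."""
    return dict(Counter(i.split(":")[2] for i in ids))


def query(endpoint, swhid, base_url="http://localhost:5009/graph"):
    """Call e.g. `leaves` or `visit/nodes` on a running server (assumed URL)."""
    with urlopen(f"{base_url}/{endpoint}/{swhid}") as resp:
        return parse_swhids(resp.read().decode("utf-8"))
```

With the `python3kcompress` dataset being served, `query("leaves", "swh:1:dir:432d1b21c1256f7408a07c577b6974bbdbcc1323")` would return the same identifiers as the httpie call above, as a Python list.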