docs/quickstart.rst
.. _swh-graph-quickstart: | |||||||||
Quickstart | Quickstart | ||||||||
========== | ========== | ||||||||
This quick tutorial shows how to compress and browse a graph using ``swh.graph``. | This quick tutorial shows how to start the ``swh.graph`` service to query | ||||||||
an existing compressed graph with the high-level HTTP API. | |||||||||
It does not cover the technical details behind the graph compression techniques | |||||||||
(refer to :ref:`graph-compression`). | |||||||||
Dependencies | Dependencies | ||||||||
------------ | ------------ | ||||||||
In order to run the ``swh.graph`` tool, you will need Python (>= 3.7) and Java | In order to run the ``swh.graph`` tool, you will need Python (>= 3.7) and Java | ||||||||
JRE, you do not need the JDK if you install the package from pypi, but may want | JRE. On a Debian system: | ||||||||
to install it if you want to hack the code or install it from this git | |||||||||
repository. To compress a graph, you will need zstd_ compression tools. | |||||||||
It is highly recommended to install this package in a virtualenv. | |||||||||
On a Debian stable (buster) system: | |||||||||
.. code:: bash | |||||||||
$ sudo apt install python3-virtualenv default-jre zstd | |||||||||
.. code:: console | |||||||||
.. _zstd: https://facebook.github.io/zstd/ | $ sudo apt install python3 python3-venv default-jre | ||||||||
JaredR26: Starting from a raw AWS ubuntu instance, this needs python3-venv for the next step | |||||||||
Installing swh.graph | |||||||||
Install | -------------------- | ||||||||
------- | |||||||||
Create a virtualenv and activate it: | Create a virtualenv and activate it: | ||||||||
.. code:: bash | .. code:: console | ||||||||
~/tmp$ mkdir swh-graph-tests | $ python3 -m venv .venv | ||||||||
~/tmp$ cd swh-graph-tests | $ source .venv/bin/activate | ||||||||
[Done] JaredR26: This command didn't work for me, "/usr/bin/python3: Relative module names not supported". python3 -m venv workingDir did work. Then the next line needed to be 'source workingDir/bin/activate'. In addition, venv did something weird on my first try, but on my second, after doing a full apt update ; apt upgrade, it did not (create an extra venv dir?). I'd highlight that package upgrades need to be run to ensure commands work as written. | |||||||||
~/t/swh-graph-tests$ virtualenv swhenv | |||||||||
~/t/swh-graph-tests$ . swhenv/bin/activate | |||||||||
Install the ``swh.graph`` python package: | Install the ``swh.graph`` python package: | ||||||||
.. code:: bash | .. code:: console | ||||||||
(swhenv) ~/t/swh-graph-tests$ pip install swh.graph | (venv) $ pip install swh.graph | ||||||||
[Done] JaredR26: FYI this gave me an error "ERROR: Failed building wheel for blinker". The next command worked though. | |||||||||
[Done] seirl: Newer versions of pip just show that as a warning. | |||||||||
[...] | [...] | ||||||||
(swhenv) ~/t/swh-graph-tests swh graph --help | (venv) $ swh graph --help | ||||||||
Usage: swh graph [OPTIONS] COMMAND [ARGS]... | Usage: swh graph [OPTIONS] COMMAND [ARGS]... | ||||||||
Software Heritage graph tools. | Software Heritage graph tools. | ||||||||
Options: | Options: | ||||||||
-C, --config-file FILE YAML configuration file | -C, --config-file FILE YAML configuration file | ||||||||
-h, --help Show this message and exit. | -h, --help Show this message and exit. | ||||||||
Commands: | Commands: | ||||||||
api-client client for the graph RPC service | |||||||||
cachemount Cache the mmapped files of the compressed graph in a tmpfs. | |||||||||
compress Compress a graph using WebGraph Input: a pair of files... | compress Compress a graph using WebGraph Input: a pair of files... | ||||||||
map Manage swh-graph on-disk maps | |||||||||
rpc-serve run the graph RPC service | rpc-serve run the graph RPC service | ||||||||
Compression | |||||||||
----------- | |||||||||
Existing datasets | |||||||||
^^^^^^^^^^^^^^^^^ | |||||||||
You can directly use compressed graph datasets provided by Software Heritage. | |||||||||
Here is a small and realistic dataset (3.1GB): | |||||||||
https://annex.softwareheritage.org/public/dataset/graph/latest/popular-3k-python/python3kcompress.tar | .. _swh-graph-retrieving-compressed: | ||||||||
.. code:: bash | |||||||||
(swhenv) ~/t/swh-graph-tests$ curl -O https://annex.softwareheritage.org/public/dataset/graph/latest/popular-3k-python/python3kcompress.tar | Retrieving a compressed graph | ||||||||
(swhenv) ~/t/swh-graph-tests$ tar xvf python3kcompress.tar | ----------------------------- | ||||||||
(swhenv) ~/t/swh-graph-tests$ touch python3kcompress/*.obl # fix the mtime of cached offset files to allow faster loading | |||||||||
Note: not for the faint heart, but the full dataset is available at: | Software Heritage provides a list of off-the-shelf datasets that can be used | ||||||||
for various research or prototyping purposes. Most of them are available in | |||||||||
*compressed* representation, i.e., in a format suitable to be loaded and | |||||||||
queried by the ``swh-graph`` library. | |||||||||
https://annex.softwareheritage.org/public/dataset/graph/latest/compressed/ | All the publicly available datasets are documented on this page: | ||||||||
https://docs.softwareheritage.org/devel/swh-dataset/graph/dataset.html | |||||||||
Own datasets | A good way of retrieving these datasets is to use the `AWS S3 CLI | ||||||||
^^^^^^^^^^^^ | <https://docs.aws.amazon.com/cli/latest/reference/s3/>`_. | ||||||||
A graph is described as both its adjacency list and the set of nodes | |||||||||
identifiers in plain text format. Such graph example can be found in the | |||||||||
``swh/graph/tests/dataset/`` folder. | |||||||||
You can compress the example graph on the command line like this: | |||||||||
.. code:: bash | |||||||||
Here is an example with the dataset ``2021-03-23-popular-3k-python``, which has | |||||||||
[Done] JaredR26: May be worth noting, this graph will NOT load on an AWS nano instance (lack of RAM). I'm not sure if a micro can run it or not, but I succeeded with a t3a.small instance. Might save some frustration for someone trying to do this on the AWS free tier if they knew the requirements in advance. | |||||||||
a relatively reasonable size (~15 GiB including property data, with | |||||||||
the compressed graph itself being less than 700 MiB): | |||||||||
(swhenv) ~/t/swh-graph-tests$ swh graph compress --graph swh/graph/tests/dataset/example --outdir output/ | .. code:: console | ||||||||
(venv) $ pip install awscli | |||||||||
[...] | [...] | ||||||||
(venv) $ mkdir -p 2021-03-23-popular-3k-python/compressed | |||||||||
(venv) $ cd 2021-03-23-popular-3k-python/ | |||||||||
(venv) $ aws s3 cp --recursive s3://softwareheritage/graph/2021-03-23-popular-3k-python/compressed/ compressed | |||||||||
(swhenv) ~/t/swh-graph-tests$ ls output/ | |||||||||
example-bv.properties example.mph example.obl example.outdegree example.swhid2node.bin example-transposed.offsets | You can also retrieve larger graphs, but note that these graphs are generally | ||||||||
example.graph example.node2swhid.bin example.offsets example.properties example-transposed.graph example-transposed.properties | intended to be loaded fully in RAM, and do not fit on ordinary desktop | ||||||||
example.indegree example.node2type.map example.order example.stats example-transposed.obl | machines. The server we use in production to run the graph service has more | ||||||||
than 700 GiB of RAM. These memory considerations are discussed in more details | |||||||||
[Done] vlorentz: missing link | |||||||||
in :ref:`swh-graph-memory`. | |||||||||
**Note:** for testing purposes, a fake test dataset is available in the | |||||||||
[Done] JaredR26: | |||||||||
[Done] seirl: ? | |||||||||
[Done] JaredR26: Oh, I guess that's singular "its", not plural possessive. Never mind. | |||||||||
``swh-graph`` repository, with just a few dozen nodes. Its basename is | |||||||||
``swh-graph/swh/graph/tests/dataset/compressed/example``. | |||||||||
API server | API server | ||||||||
---------- | ---------- | ||||||||
To start a ``swh.graph`` API server of a compressed graph dataset, run: | To start a ``swh.graph`` API server of a compressed graph dataset, you need to | ||||||||
use the ``rpc-serve`` command with the basename of the graph, which is the path prefix | |||||||||
of all the graph files (e.g., with the basename ``compressed/graph``, it will | |||||||||
attempt to load the files located at | |||||||||
``compressed/graph.{graph,properties,offsets,...}``. | |||||||||
.. code:: bash | In our example: | ||||||||
.. code:: console | |||||||||
(swhenv) ~/t/swh-graph-tests$ swh graph rpc-serve -g output/example | (venv) $ swh graph rpc-serve -g compressed/graph | ||||||||
Loading graph output/example ... | Loading graph compressed/graph ... | ||||||||
Graph loaded. | Graph loaded. | ||||||||
======== Running on http://0.0.0.0:5009 ======== | ======== Running on http://0.0.0.0:5009 ======== | ||||||||
(Press CTRL+C to quit) | (Press CTRL+C to quit) | ||||||||
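The basename convention described above can be sketched in Python as a rough check before starting the server: every file of the compressed graph shares the basename as a path prefix. The helper name ``files_for_basename`` is hypothetical and not part of ``swh.graph``; it only lists matching files and makes no claim about which of them the server actually requires:

.. code:: python

    import glob
    import os


    def files_for_basename(basename):
        """Return the sorted file paths that start with ``basename``.

        Hypothetical helper, not part of swh.graph: it only illustrates
        that the basename is a path prefix shared by all graph files.
        """
        # glob.escape keeps special characters in the prefix literal.
        pattern = glob.escape(basename) + "*"
        return sorted(p for p in glob.glob(pattern) if os.path.isfile(p))

Calling ``files_for_basename("compressed/graph")`` from the dataset directory should list files such as ``compressed/graph.graph`` and ``compressed/graph.properties`` if the download completed; an empty list suggests the basename is wrong.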
From there you can use this endpoint to query the compressed graph, for example | From there you can use this endpoint to query the compressed graph, for example | ||||||||
with httpie_ (``sudo apt install``) from another terminal: | with httpie_ (``sudo apt install httpie``): | ||||||||
[Not Done] JaredR26: Two things:
1. I think the "from another terminal" addition is helpful. I would EXPECT people to understand that, but they might not notice the prompt below and think they can pass these commands directly into the graph terminal after they run rpc-serve.
2. Following these instructions was giving me 400 errors. I'm not sure why it isn't able to read the SWHID correctly?
http :5009/graph/leaves/swh:1:dir:432d1b21c1256f7408a07c577b6974bbdbcc1323
Unknown SWHID: swh
This is on a brand new ubuntu t3a.small instance on AWS, after doing apt update;upgrade. | |||||||||
[Not Done] JaredR26: FYI this was what the other terminal showed as received. It all looks right, I'm not sure why it didn't work.
~/2021-03-23-popular-3k-python$ swh graph rpc-serve -g compressed/graph
INFO:root:using swh-graph JAR: /home/ubuntu/workingDir/share/swh-graph/swh-graph-0.5.2.jar
Loading graph compressed/graph ...
Graph loaded.
======== Running on http://0.0.0.0:5009 ========
(Press CTRL+C to quit)
INFO:aiohttp.access:127.0.0.1 [17/May/2022:18:02:10 +0000] "GET /graph/leaves/swh:1:dir:432d1b21c1256f7408a07c577b6974bbdbcc1323 HTTP/1.1" 400 178 "-" "HTTPie/1.0.3"
INFO:aiohttp.access:127.0.0.1 [17/May/2022:18:02:27 +0000] "GET /graph/leaves/swh:1:dir:432d1b21c1256f7408a07c577b6974bbdbcc1323 HTTP/1.1" 400 178 "-" "HTTPie/1.0.3"
INFO:aiohttp.access:127.0.0.1 [17/May/2022:18:02:38 +0000] "GET /graph/leaves/dir:432d1b21c1256f7408a07c577b6974bbdbcc1323 HTTP/1.1" 400 180 "-" "HTTPie/1.0.3"
INFO:aiohttp.access:127.0.0.1 [17/May/2022:18:03:11 +0000] "GET /graph/visit/nodes/swh:1:rel:0000000000000000000000000000000000000010 HTTP/1.1" 400 178 "-" "HTTPie/1.0.3"
INFO:aiohttp.access:127.0.0.1 [17/May/2022:18:03:55 +0000] "GET /graph/leaves/swh:1:dir:432d1b21c1256f7408a07c577b6974bbdbcc1323 HTTP/1.1" 400 178 "-" "HTTPie/1.0.3" | |||||||||
.. _httpie: https://httpie.org | .. _httpie: https://httpie.org | ||||||||
.. code:: bash | .. code:: bash | ||||||||
~/tmp$ http :5009/graph/visit/nodes/swh:1:rel:0000000000000000000000000000000000000010 | |||||||||
HTTP/1.1 200 OK | |||||||||
Content-Type: text/plain | |||||||||
Date: Tue, 15 Sep 2020 08:33:25 GMT | |||||||||
Server: Python/3.8 aiohttp/3.6.2 | |||||||||
Transfer-Encoding: chunked | |||||||||
swh:1:rel:0000000000000000000000000000000000000010 | |||||||||
swh:1:rev:0000000000000000000000000000000000000009 | |||||||||
swh:1:rev:0000000000000000000000000000000000000003 | |||||||||
swh:1:dir:0000000000000000000000000000000000000002 | |||||||||
swh:1:cnt:0000000000000000000000000000000000000001 | |||||||||
swh:1:dir:0000000000000000000000000000000000000008 | |||||||||
swh:1:dir:0000000000000000000000000000000000000006 | |||||||||
swh:1:cnt:0000000000000000000000000000000000000004 | |||||||||
swh:1:cnt:0000000000000000000000000000000000000005 | |||||||||
swh:1:cnt:0000000000000000000000000000000000000007 | |||||||||
Running the existing ``python3kcompress`` dataset: | |||||||||
.. code:: bash | |||||||||
(swhenv) ~/t/swh-graph-tests$ swh graph rpc-serve -g python3kcompress/python3k | |||||||||
Loading graph python3kcompress/python3k ... | |||||||||
Graph loaded. | |||||||||
======== Running on http://0.0.0.0:5009 ======== | |||||||||
(Press CTRL+C to quit) | |||||||||
~/tmp$ http :5009/graph/leaves/swh:1:dir:432d1b21c1256f7408a07c577b6974bbdbcc1323 | ~/tmp$ http :5009/graph/leaves/swh:1:dir:432d1b21c1256f7408a07c577b6974bbdbcc1323 | ||||||||
HTTP/1.1 200 OK | HTTP/1.1 200 OK | ||||||||
Content-Type: text/plain | Content-Type: text/plain | ||||||||
Date: Tue, 15 Sep 2020 08:35:19 GMT | Date: Tue, 15 Sep 2020 08:35:19 GMT | ||||||||
Server: Python/3.8 aiohttp/3.6.2 | Server: Python/3.8 aiohttp/3.6.2 | ||||||||
Transfer-Encoding: chunked | Transfer-Encoding: chunked | ||||||||
swh:1:cnt:33af56e02dd970873d8058154bf016ec73b35dfb | swh:1:cnt:33af56e02dd970873d8058154bf016ec73b35dfb | ||||||||
swh:1:cnt:b03b4ffd7189ae5457d8e1c2ee0490b1938fd79f | swh:1:cnt:b03b4ffd7189ae5457d8e1c2ee0490b1938fd79f | ||||||||
swh:1:cnt:74d127c2186f7f0e8b14a27249247085c49d548a | swh:1:cnt:74d127c2186f7f0e8b14a27249247085c49d548a | ||||||||
swh:1:cnt:c0139aa8e79b338e865a438326629fa22fa8f472 | swh:1:cnt:c0139aa8e79b338e865a438326629fa22fa8f472 | ||||||||
[...] | [...] | ||||||||
swh:1:cnt:a6b60e797063fef707bbaa4f90cfb4a2cbbddd4a | swh:1:cnt:a6b60e797063fef707bbaa4f90cfb4a2cbbddd4a | ||||||||
swh:1:cnt:cc0a1deca559c1dd2240c08156d31cde1d8ed406 | swh:1:cnt:cc0a1deca559c1dd2240c08156d31cde1d8ed406 | ||||||||
See the documentation of the :ref:`API <swh-graph-api>` for more details on how | |||||||||
See the documentation of the :ref:`API <swh-graph-api>` for more details. | to use the HTTP graph querying API. |
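The same ``/graph/leaves/`` query shown with httpie can be issued from Python. The following is a minimal sketch using only the standard library, assuming the server runs locally on port 5009 as in the output above. The function names and the client-side SWHID check (including its list of object types) are assumptions of this example, not part of ``swh.graph``:

.. code:: python

    import re
    import urllib.request

    # Core SWHID shape: swh:1:<type>:<40 lowercase hex digits>.
    # The list of accepted types is an assumption of this sketch.
    SWHID_RE = re.compile(r"swh:1:(cnt|dir|rev|rel|snp|ori):[0-9a-f]{40}")


    def leaves_url(swhid, host="localhost", port=5009):
        """Build the URL of the ``leaves`` endpoint for a given SWHID."""
        if not SWHID_RE.fullmatch(swhid):
            raise ValueError(f"not a well-formed SWHID: {swhid!r}")
        return f"http://{host}:{port}/graph/leaves/{swhid}"


    def parse_swhids(body):
        """Split a text/plain response into one SWHID per non-empty line."""
        return [line.strip() for line in body.splitlines() if line.strip()]


    def fetch_leaves(swhid, host="localhost", port=5009):
        """Query a running ``swh graph rpc-serve`` instance for the leaves."""
        with urllib.request.urlopen(leaves_url(swhid, host, port)) as resp:
            return parse_swhids(resp.read().decode("ascii"))

With the server running in another terminal, ``fetch_leaves("swh:1:dir:432d1b21c1256f7408a07c577b6974bbdbcc1323")`` should return the same identifiers as the httpie call. Checking the identifier on the client side is a design choice of this sketch: it catches malformed SWHIDs before they reach the server.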