diff --git a/docs/api.rst b/docs/api.rst index 35b3e9e..9f254c2 100644 --- a/docs/api.rst +++ b/docs/api.rst @@ -1,218 +1,218 @@ Graph traversal API =================== Terminology ----------- This API uses the following notions: - **Node**: a node in the `Software Heritage graph `_, represented by a `persistent identifier `_ (abbreviated as *SWH PID*, or simply *PID*). - **Node type**: the 3-letter specifier from the node PID (``cnt``, ``dir``, ``rel``, ``rev``, ``snp``), or ``*`` for all node types. - **Edge type**: a comma-separated list of ``src:dst`` strings where ``src`` and ``dst`` are node types, or ``*`` for all edge types. Examples ~~~~~~~~ - ``swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2`` the PID of a node of type content containing the full text of the GPL3 license. - ``swh:1:rev:f39d7d78b70e0f39facb1e4fab77ad3df5c52a35`` the PID of a node of type revision corresponding to the commit in Linux that merged the 'x86/urgent' branch on 31 December 2017. - ``"dir:dir,dir:cnt"`` node types allowing edges from directories to directories nodes, or directories to contents nodes. - ``"rev:rev,dir:*"`` node types allowing edges from revisions to revisions nodes, or from directories nodes. - ``"*:rel"`` node types allowing all edges to releases. Leaves ------ .. http:get:: /graph/leaves/:src Performs a graph traversal and returns the leaves of the subgraph rooted at the specified source node. :param string src: source node specified as a SWH PID :query string edges: edges types the traversal can follow; default to - ``"*"`` + ``"*"`` :query string direction: direction in which graph edges will be followed; - can be either ``forward`` or ``backward``, default to ``forward`` + can be either ``forward`` or ``backward``, default to ``forward`` :statuscode 200: success :statuscode 400: invalid query string provided :statuscode 404: starting node cannot be found .. sourcecode:: http - HTTP/1.1 200 OK - Content-Type: application/json + HTTP/1.1 200 OK + Content-Type: application/json - [ - "swh:1:cnt:669ac7c32292798644b21dbb5a0dc657125f444d", - "swh:1:cnt:da4cb28febe66172a9fdf1a235525ae6c00cde1d", - ... - ] + [ + "swh:1:cnt:669ac7c32292798644b21dbb5a0dc657125f444d", + "swh:1:cnt:da4cb28febe66172a9fdf1a235525ae6c00cde1d", + ... + ] Neighbors --------- .. http:get:: /graph/neighbors/:src Returns node direct neighbors (linked with exactly one edge) in the graph. :param string src: source node specified as a SWH PID :query string edges: edges types allowed to be listed as neighbors; default - to ``"*"`` + to ``"*"`` :query string direction: direction in which graph edges will be followed; - can be either ``forward`` or ``backward``, default to ``forward`` + can be either ``forward`` or ``backward``, default to ``forward`` :statuscode 200: success :statuscode 400: invalid query string provided :statuscode 404: starting node cannot be found .. sourcecode:: http - HTTP/1.1 200 OK - Content-Type: application/json + HTTP/1.1 200 OK + Content-Type: application/json - [ - "swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2", - "swh:1:dir:d198bc9d7a6bcf6db04f476d29314f157507d505", - ... - ] + [ + "swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2", + "swh:1:dir:d198bc9d7a6bcf6db04f476d29314f157507d505", + ... + ] Walk ---- .. http:get:: /graph/walk/:src/:dst Performs a graph traversal and returns the first found path from source to destination (final destination node included). :param string src: starting node specified as a SWH PID :param string dst: destination node, either as a node PID or a node type. - The traversal will stop at the first node encountered matching the desired - destination. + The traversal will stop at the first node encountered matching the + desired destination. :query string edges: edges types the traversal can follow; default to - ``"*"`` + ``"*"`` :query string traversal: traversal algorithm; can be either ``dfs`` or - ``bfs``, default to ``dfs`` + ``bfs``, default to ``dfs`` :query string direction: direction in which graph edges will be followed; - can be either ``forward`` or ``backward``, default to ``forward`` + can be either ``forward`` or ``backward``, default to ``forward`` :statuscode 200: success :statuscode 400: invalid query string provided :statuscode 404: starting node cannot be found .. sourcecode:: http - HTTP/1.1 200 OK - Content-Type: application/json + HTTP/1.1 200 OK + Content-Type: application/json - [ - "swh:1:rev:f39d7d78b70e0f39facb1e4fab77ad3df5c52a35", - "swh:1:rev:52c90f2d32bfa7d6eccd66a56c44ace1f78fbadd", - "swh:1:rev:cea92e843e40452c08ba313abc39f59efbb4c29c", - "swh:1:rev:8d517bdfb57154b8a11d7f1682ecc0f79abf8e02", - ... - ] + [ + "swh:1:rev:f39d7d78b70e0f39facb1e4fab77ad3df5c52a35", + "swh:1:rev:52c90f2d32bfa7d6eccd66a56c44ace1f78fbadd", + "swh:1:rev:cea92e843e40452c08ba313abc39f59efbb4c29c", + "swh:1:rev:8d517bdfb57154b8a11d7f1682ecc0f79abf8e02", + ... + ] Visit ----- .. http:get:: /graph/visit/nodes/:src .. http:get:: /graph/visit/paths/:src Performs a graph traversal and returns explored nodes or paths (in the order of the traversal). :param string src: starting node specified as a SWH PID :query string edges: edges types the traversal can follow; default to - ``"*"`` + ``"*"`` :query string direction: direction in which graph edges will be followed; - can be either ``forward`` or ``backward``, default to ``forward`` + can be either ``forward`` or ``backward``, default to ``forward`` :statuscode 200: success :statuscode 400: invalid query string provided :statuscode 404: starting node cannot be found .. sourcecode:: http - GET /graph/visit/nodes/ - HTTP/1.1 200 OK - Content-Type: application/json - - [ - "swh:1:rev:f39d7d78b70e0f39facb1e4fab77ad3df5c52a35", - "swh:1:rev:52c90f2d32bfa7d6eccd66a56c44ace1f78fbadd", - ... - "swh:1:rev:a31e58e129f73ab5b04016330b13ed51fde7a961", - ... - ] - - .. sourcecode:: http - - GET /graph/visit/paths/ - HTTP/1.1 200 OK - Content-Type: application/json + GET /graph/visit/nodes/ + HTTP/1.1 200 OK + Content-Type: application/json - [ [ "swh:1:rev:f39d7d78b70e0f39facb1e4fab77ad3df5c52a35", "swh:1:rev:52c90f2d32bfa7d6eccd66a56c44ace1f78fbadd", ... - ], - [ - "swh:1:rev:f39d7d78b70e0f39facb1e4fab77ad3df5c52a35", "swh:1:rev:a31e58e129f73ab5b04016330b13ed51fde7a961", ... - ], - ... - ] + ] + + .. sourcecode:: http + + GET /graph/visit/paths/ + HTTP/1.1 200 OK + Content-Type: application/json + + [ + [ + "swh:1:rev:f39d7d78b70e0f39facb1e4fab77ad3df5c52a35", + "swh:1:rev:52c90f2d32bfa7d6eccd66a56c44ace1f78fbadd", + ... + ], + [ + "swh:1:rev:f39d7d78b70e0f39facb1e4fab77ad3df5c52a35", + "swh:1:rev:a31e58e129f73ab5b04016330b13ed51fde7a961", + ... + ], + ... + ] Stats ----- .. http:get:: /graph/stats Returns statistics on the compressed graph. :statuscode 200: success .. sourcecode:: http - HTTP/1.1 200 OK - Content-Type: application/json - - { - "counts": { - "nodes": 16222788, - "edges": 9907464 - }, - "ratios": { - "compression": 0.367, - "bits_per_node": 5.846, - "bits_per_edge": 9.573, - "avg_locality": 270.369 - }, - "indegree": { - "min": 0, - "max": 12382, - "avg": 0.6107127825377487 - }, - "outdegree": { - "min": 0, - "max": 1, - "avg": 0.6107127825377487 + HTTP/1.1 200 OK + Content-Type: application/json + + { + "counts": { + "nodes": 16222788, + "edges": 9907464 + }, + "ratios": { + "compression": 0.367, + "bits_per_node": 5.846, + "bits_per_edge": 9.573, + "avg_locality": 270.369 + }, + "indegree": { + "min": 0, + "max": 12382, + "avg": 0.6107127825377487 + }, + "outdegree": { + "min": 0, + "max": 1, + "avg": 0.6107127825377487 + } } - } diff --git a/docs/compression.rst b/docs/compression.rst index c0e4a92..13144a0 100644 --- a/docs/compression.rst +++ b/docs/compression.rst @@ -1,107 +1,109 @@ Graph compression ================= The compression process is based on the `WebGraph framework `_ and ecosystem libraries. References used: - Paolo Boldi, Sebastiano Vigna, *The webgraph framework I: compression techniques*, Proceedings of the 13th international conference on World Wide Web. ACM, 2004. `pdf `_ - Paolo Boldi, Marco Rosa, Massimo Santini, Sebastiano Vigna, *Layered label propagation: A multiresolution coordinate-free ordering for compressing social networks*. `arxiv `_ - Alberto Apostolico, Guido Drovandi, *Graph compression by BFS*, Algorithms 2009, 2(3), 1031-1044. `mdpi `_ .. figure:: images/compression_steps.png :align: center :alt: Compression steps Compression steps 1. MPH ------ A node in the `Software Heritage graph `_ is identified using its string `persistent identifier `_. However, WebGraph internally uses integers to refer to node ids. Mapping between the strings and longs ids is needed before compressing the graph. From the `Sux4J `_ utility tool, we use the `GOVMinimalPerfectHashFunction `_ -class, mapping with no collisions $n$ keys to $n$ consecutive integers. +class, mapping with no collisions N keys to N consecutive integers. The step produces a ``.mph`` file (MPH stands for *Minimal Perfect Hash-function*) storing the hash function taking as input a string and returning a unique integer. 2. BV compress -------------- This is the first actual compression step, building a compressed version of the input graph using WebGraph techniques presented in the framework paper. We use the `ScatteredArcsASCIIGraph `_ class, from WebGraph. The resulting BV graph is stored as a set of files: - ``.graph``: the compressed graph in the BV format - ``.offsets``: offsets values to read the bit stream graph file - ``.obl``: offsets cache to load the graph faster - ``.properties``: entries used to correctly decode graph and offset files 3. BFS ------- In the LLP paper, authors propose an empirical analysis linking node ordering and high compression ratio: it is important to use an ordering of nodes ids such that vertices from the same host are close to one another. Building on this insight, the previous compression results in the BV compress step are improved by re-ordering nodes ids using a BFS traversal order. We use -the `BFSBig <>`_ class from the `LAW `_ library. +the `BFS +`_ +class from the `LAW `_ library. The resulting ordering is stored in the ``.order`` file, listing nodes ids in order of traversal. 4. Permute ---------- Once the order is computed (BFS or another ordering technique), the final compressed graph is created based on the initial BV compress result, and using the new node order mapping. The permutation uses the `Transform `_ class from WebGraph framework. The final compressed graph is only stored in the resulting ``.graph``, ``.offsets``, ``.obl``, and ``.properties`` files. Extra steps ----------- 5. Stats ~~~~~~~~ Compute various statistics on the final compressed graph: - ``.stats``: entries such as number of nodes, edges, avg/min/max degree, average locality, etc. - ``.indegree``: graph indegree distribution - ``.outdegree``: graph outdegree distribution This step uses the `Stats `_ class from WebGraph. 6. Transpose ~~~~~~~~~~~~ Create a transposed graph to allow backward traversal, using the `Transform `_ class from WebGraph. diff --git a/docs/index.rst b/docs/index.rst index 07e79f8..eb3dd8c 100644 --- a/docs/index.rst +++ b/docs/index.rst @@ -1,26 +1,26 @@ .. _swh-graph: Software Heritage - graph service ================================= Tooling and service providing fast access to the graph representation of the Software Heritage archive. The service is in-memory, based on a compressed representation of the Software Heritage Merkle DAG (see :ref:`data-model`). Overview -------- -* :doc:`graph compression ` -* :doc:`graph service REST API ` -* :doc:`use cases ` -* :doc:`docker environment ` (for development purposes) +* :doc:`Graph compression ` +* :doc:`Graph service REST API ` +* :doc:`Use cases ` +* :doc:`Docker environment ` (for development purposes) Reference Documentation ----------------------- .. toctree:: :maxdepth: 2 /apidoc/swh.graph