diff --git a/docs/api.rst b/docs/api.rst
index 4ddbb44..30f04f9 100644
--- a/docs/api.rst
+++ b/docs/api.rst
@@ -1,282 +1,282 @@
-Graph traversal API
-===================
+Traversal REST API
+==================

Terminology
-----------

This API uses the following notions:

- **Node**: a node in the `Software Heritage graph `_, represented by a
  `persistent identifier `_ (abbreviated as *SWH PID*, or simply *PID*).

- **Node type**: the 3-letter specifier from the node PID (``cnt``, ``dir``,
  ``rel``, ``rev``, ``snp``), or ``*`` for all node types.

- **Edge type**: a comma-separated list of ``src:dst`` strings where ``src``
  and ``dst`` are node types, or ``*`` for all edge types.

Examples
~~~~~~~~

- ``swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2`` the PID of a node of
  type content containing the full text of the GPL3 license.

- ``swh:1:rev:f39d7d78b70e0f39facb1e4fab77ad3df5c52a35`` the PID of a node of
  type revision corresponding to the commit in Linux that merged the
  'x86/urgent' branch on 31 December 2017.

- ``"dir:dir,dir:cnt"`` edge types allowing edges from directory nodes to
  directory nodes, or from directory nodes to content nodes.

- ``"rev:rev,dir:*"`` edge types allowing edges from revision nodes to
  revision nodes, or from directory nodes to any node type.

- ``"*:rel"`` edge types allowing all edges to release nodes.

Metadata
--------

Extra metadata is returned in addition to the result:

- ``timings``: only when the server is configured to report them (see the
  server's README):

  - ``traversal``: time in seconds spent in the actual graph traversal.
  - ``pid2node``: time in seconds spent converting the input PID to a node id.
  - ``node2pid``: time in seconds spent converting output node ids to PIDs.

- ``nb_edges_accessed``: number of edges accessed during the traversal
  operation.

Leaves
------

.. http:get:: /graph/leaves/:src

    Performs a graph traversal and returns the leaves of the subgraph rooted at
    the specified source node.
    :param string src: source node specified as a SWH PID

    :query string edges: edge types the traversal can follow; defaults to
        ``"*"``
    :query string direction: direction in which graph edges will be followed;
        can be either ``forward`` or ``backward``; defaults to ``forward``

    :statuscode 200: success
    :statuscode 400: invalid query string provided
    :statuscode 404: starting node cannot be found

    .. sourcecode:: http

        HTTP/1.1 200 OK
        Content-Type: application/json

        {
            "result": [
                "swh:1:cnt:669ac7c32292798644b21dbb5a0dc657125f444d",
                "swh:1:cnt:da4cb28febe66172a9fdf1a235525ae6c00cde1d",
                ...
            ],
            "meta": {
                "timings": {
                    "traversal": 0.002942681,
                    "pid2node": 0.000178051,
                    "node2pid": 0.000956569
                },
                "nb_edges_accessed": 12
            }
        }


Neighbors
---------

.. http:get:: /graph/neighbors/:src

    Returns the direct neighbors of a node (i.e., the nodes linked to it by
    exactly one edge).

    :param string src: source node specified as a SWH PID

    :query string edges: edge types allowed to be listed as neighbors;
        defaults to ``"*"``
    :query string direction: direction in which graph edges will be followed;
        can be either ``forward`` or ``backward``; defaults to ``forward``

    :statuscode 200: success
    :statuscode 400: invalid query string provided
    :statuscode 404: starting node cannot be found

    .. sourcecode:: http

        HTTP/1.1 200 OK
        Content-Type: application/json

        {
            "result": [
                "swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2",
                "swh:1:dir:d198bc9d7a6bcf6db04f476d29314f157507d505",
                ...
            ],
            "meta": {
                "timings": {
                    "traversal": 0.002942681,
                    "pid2node": 0.000178051,
                    "node2pid": 0.000956569
                },
                "nb_edges_accessed": 12
            }
        }


Walk
----

.. http:get:: /graph/walk/:src/:dst

    Performs a graph traversal and returns the first path found from source to
    destination (the final destination node included).

    :param string src: starting node specified as a SWH PID
    :param string dst: destination node, either as a node PID or a node type.
        The traversal will stop at the first node encountered matching the
        desired destination.
    :query string edges: edge types the traversal can follow; defaults to
        ``"*"``
    :query string traversal: traversal algorithm; can be either ``dfs`` or
        ``bfs``; defaults to ``dfs``
    :query string direction: direction in which graph edges will be followed;
        can be either ``forward`` or ``backward``; defaults to ``forward``

    :statuscode 200: success
    :statuscode 400: invalid query string provided
    :statuscode 404: starting node cannot be found

    .. sourcecode:: http

        HTTP/1.1 200 OK
        Content-Type: application/json

        {
            "result": [
                "swh:1:rev:f39d7d78b70e0f39facb1e4fab77ad3df5c52a35",
                "swh:1:rev:52c90f2d32bfa7d6eccd66a56c44ace1f78fbadd",
                "swh:1:rev:cea92e843e40452c08ba313abc39f59efbb4c29c",
                "swh:1:rev:8d517bdfb57154b8a11d7f1682ecc0f79abf8e02",
                ...
            ],
            "meta": {
                "timings": {
                    "traversal": 0.002942681,
                    "pid2node": 0.000178051,
                    "node2pid": 0.000956569
                },
                "nb_edges_accessed": 12
            }
        }


Visit
-----

.. http:get:: /graph/visit/nodes/:src
.. http:get:: /graph/visit/paths/:src

    Performs a graph traversal and returns the explored nodes or paths (in the
    order of the traversal).

    :param string src: starting node specified as a SWH PID

    :query string edges: edge types the traversal can follow; defaults to
        ``"*"``
    :query string direction: direction in which graph edges will be followed;
        can be either ``forward`` or ``backward``; defaults to ``forward``

    :statuscode 200: success
    :statuscode 400: invalid query string provided
    :statuscode 404: starting node cannot be found

    .. sourcecode:: http

        GET /graph/visit/nodes/
        HTTP/1.1 200 OK
        Content-Type: application/json

        {
            "result": [
                "swh:1:rev:f39d7d78b70e0f39facb1e4fab77ad3df5c52a35",
                "swh:1:rev:52c90f2d32bfa7d6eccd66a56c44ace1f78fbadd",
                ...
                "swh:1:rev:a31e58e129f73ab5b04016330b13ed51fde7a961",
                ...
            ],
            "meta": {
                "timings": {
                    "traversal": 0.002942681,
                    "pid2node": 0.000178051,
                    "node2pid": 0.000956569
                },
                "nb_edges_accessed": 12
            }
        }

    .. sourcecode:: http

        GET /graph/visit/paths/
        HTTP/1.1 200 OK
        Content-Type: application/json

        {
            "result": [
                [
                    "swh:1:rev:f39d7d78b70e0f39facb1e4fab77ad3df5c52a35",
                    "swh:1:rev:52c90f2d32bfa7d6eccd66a56c44ace1f78fbadd",
                    ...
                ],
                [
                    "swh:1:rev:f39d7d78b70e0f39facb1e4fab77ad3df5c52a35",
                    "swh:1:rev:a31e58e129f73ab5b04016330b13ed51fde7a961",
                    ...
                ],
                ...
            ],
            "meta": {
                "timings": {
                    "traversal": 0.002942681,
                    "pid2node": 0.000178051,
                    "node2pid": 0.000956569
                },
                "nb_edges_accessed": 12
            }
        }


Stats
-----

.. http:get:: /graph/stats

    Returns statistics on the compressed graph.

    :statuscode 200: success

    .. sourcecode:: http

        HTTP/1.1 200 OK
        Content-Type: application/json

        {
            "counts": {
                "nodes": 16222788,
                "edges": 9907464
            },
            "ratios": {
                "compression": 0.367,
                "bits_per_node": 5.846,
                "bits_per_edge": 9.573,
                "avg_locality": 270.369
            },
            "indegree": {
                "min": 0,
                "max": 12382,
                "avg": 0.6107127825377487
            },
            "outdegree": {
                "min": 0,
                "max": 1,
                "avg": 0.6107127825377487
            }
        }
diff --git a/docs/docker.rst b/docs/docker.rst
index bd10b85..6e33249 100644
--- a/docs/docker.rst
+++ b/docs/docker.rst
@@ -1,79 +1,79 @@
-Graph Docker environment
-========================
+Docker environment
+==================

Build
-----

.. code:: bash

    $ git clone https://forge.softwareheritage.org/source/swh-graph.git
    $ cd swh-graph
    $ docker build --tag swh-graph dockerfiles

Run
---

Given a graph ``g`` specified by:

- ``g.edges.csv.gz``: gzip-compressed csv file with one edge per line, as a
  "SRC_ID SPACE DST_ID" string, where identifiers are the
  :ref:`persistent-identifiers` of each node.
- ``g.nodes.csv.gz``: sorted list of unique node identifiers appearing in the
  corresponding ``g.edges.csv.gz`` file. The format is a gzip-compressed csv
  file with one persistent identifier per line.

.. code:: bash

    $ docker run -ti \
        --volume /PATH/TO/GRAPH/:/srv/softwareheritage/graph/data \
        --publish 127.0.0.1:5009:5009 \
        swh-graph:latest \
        bash

Where ``/PATH/TO/GRAPH`` is a directory containing the ``g.edges.csv.gz`` and
``g.nodes.csv.gz`` files.
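The two input files can be produced with a few lines of Python. The sketch below is illustrative only: the edge list, the PIDs in it, and the output file names are made up for the example, and it assumes the whole edge list fits in memory.

```python
import gzip

# Hypothetical example edges, one (SRC_ID, DST_ID) pair of SWH PIDs per edge.
edges = [
    ("swh:1:dir:0000000000000000000000000000000000000002",
     "swh:1:cnt:0000000000000000000000000000000000000001"),
    ("swh:1:rev:0000000000000000000000000000000000000003",
     "swh:1:dir:0000000000000000000000000000000000000002"),
]

# g.edges.csv.gz: one "SRC_ID SPACE DST_ID" line per edge.
with gzip.open("g.edges.csv.gz", "wt") as f:
    for src, dst in edges:
        f.write(f"{src} {dst}\n")

# g.nodes.csv.gz: sorted list of the unique node identifiers appearing in
# the edge file, one persistent identifier per line.
nodes = sorted({pid for edge in edges for pid in edge})
with gzip.open("g.nodes.csv.gz", "wt") as f:
    for pid in nodes:
        f.write(pid + "\n")
```

The ``data/`` volume mounted above would then contain these two files (as ``data/g.edges.csv.gz`` and ``data/g.nodes.csv.gz``).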
By default, when entering the container the current working directory will be
``/srv/softwareheritage/graph``; all relative paths found below are intended
to be relative to that directory.

Graph compression
~~~~~~~~~~~~~~~~~

To compress the graph:

.. code:: bash

    $ app/scripts/compress_graph.sh --lib lib/ --input data/g

Warning: very large graphs may need a bigger batch size parameter for WebGraph
internals (you can specify a value when running the compression script using:
``--batch-size 1000000000``).

Node identifier mappings
~~~~~~~~~~~~~~~~~~~~~~~~

To dump the mapping files (i.e., various node id <-> other info mapping files,
in either ``.csv.gz`` or ad-hoc ``.map`` format):

.. code:: bash

    $ java -cp app/swh-graph.jar \
        org.softwareheritage.graph.backend.Setup \
        data/g.nodes.csv.gz data/compressed/g

Graph server
~~~~~~~~~~~~

To start the swh-graph server:

.. code:: bash

    $ java -cp app/swh-graph.jar \
        org.softwareheritage.graph.App data/compressed/g

To specify the port on which the server will run, use the ``--port`` or
``-p`` flag (the default is 5009).
diff --git a/docs/index.rst b/docs/index.rst
index eb3dd8c..ed16a6c 100644
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -1,26 +1,18 @@
.. _swh-graph:

Software Heritage - graph service
=================================

Tooling and service providing fast access to the graph representation of the
Software Heritage archive. The service is in-memory, based on a compressed
representation of the Software Heritage Merkle DAG (see :ref:`data-model`).

-
-Overview
---------
-
-* :doc:`Graph compression `
-* :doc:`Graph service REST API `
-* :doc:`Use cases `
-* :doc:`Docker environment ` (for development purposes)
-
-
-Reference Documentation
------------------------
-
.. toctree::
-   :maxdepth: 2
+   :maxdepth: 1
+   :caption: Overview
+
+   compression
+   api
+   use-cases
+   docker

   /apidoc/swh.graph
diff --git a/docs/use-cases.rst b/docs/use-cases.rst
index 925e497..983568a 100644
--- a/docs/use-cases.rst
+++ b/docs/use-cases.rst
@@ -1,162 +1,167 @@
+=========
+Use cases
+=========
+
+
This document lists use cases and benchmark scenarios for the Software
Heritage graph service.

Conventions
===========

- **Node identification**: in the following, nodes are always identified by
  their Software Heritage :ref:`persistent-identifiers` (SWH PIDs).

Use cases
=========

Browsing
--------

The following use cases require traversing the *forward graph*.

- **ls**: given a directory node, list (non-recursively) all linked nodes of
  type directory and content

  Implementation::

    /graph/neighbors/:DIR_ID?edges=dir:cnt,dir:dir

- **ls -R**: given a directory node, recursively list all linked nodes of
  type directory and content

  Implementation::

    /graph/visit/paths/:DIR_ID?edges=dir:cnt,dir:dir

- **git log**: given a revision node, recursively list all linked nodes of
  type revision

  Implementation::

    /graph/visit/nodes/:REV_ID?edges=rev:rev

Vault
-----

The following use cases require traversing the *forward graph*.

- **tarball** (same as *ls -R* above)

- **git bundle**: given a node, recursively list all linked nodes of any kind

  Implementation::

    /graph/visit/nodes/:NODE_ID?edges=*

Provenance
----------

The following use cases require traversing the *backward (transposed) graph*.
- **commit provenance**: given a content or directory node, return *a* commit
  whose directory (recursively) contains it

  Implementation::

    /graph/walk/:NODE_ID/rev?direction=backward&edges=dir:dir,cnt:dir,dir:rev

- **complete commit provenance**: given a content or directory node, return
  *all* commits whose directory (recursively) contains it

  Implementation::

    /graph/leaves/:NODE_ID?direction=backward&edges=dir:dir,cnt:dir,dir:rev

- **origin provenance**: given a content, directory, or commit node, return
  *an* origin that has at least one snapshot that (recursively) contains it

  Implementation::

    /graph/walk/:NODE_ID/ori?direction=backward&edges=*

- **complete origin provenance**: given a content, directory, or commit node,
  return *all* origins that have at least one snapshot that (recursively)
  contains it

  Implementation::

    /graph/leaves/:NODE_ID?direction=backward&edges=*

- *SLOC tracking*: left as future work

Provenance statistics
~~~~~~~~~~~~~~~~~~~~~

The following use cases require traversing the *backward (transposed) graph*.

- **content popularity across commits**: for each content, count the number
  of commits (its *commit popularity*) that link to a directory that
  (recursively) includes it. Plot the distribution of content popularity
  across commits.

  Implementation: apply *complete commit provenance* to each content node,
  count the returned commits, aggregate.

- **commit popularity across origins**: for each commit, count the number of
  origins (its *origin popularity*) that have a snapshot that (recursively)
  includes it. Plot the distribution of commit popularity across origins.

  Implementation: apply *complete origin provenance* to each commit node,
  count the returned origins, aggregate.

- *SLOC popularity across contents*: left as future work

The following use cases require traversing the *forward graph*.

- **revision size** (as n. of contents) distribution: for each revision,
  count the number of contents that are (recursively) reachable from it.
  Plot the distribution of revision sizes.

- **origin size** (as n. of revisions) distribution: for each origin, count
  the number of revisions that are (recursively) reachable from it. Plot the
  distribution of origin sizes.

Benchmarks
==========

Notes on how to benchmark graph access:

- separate pure-graph timings from timings related to using additional
  mappings (e.g., node types), no matter whether the mappings are in memory
  or on disk

- separate in-memory timings from on-disk timings; in particular, separate
  the timing of translating node identifiers between internal integers and
  SWH PIDs

- for each use case that requires a node as input, we will randomize the
  choice of the input node and repeat the experiment a suitable number of
  times; where possible we will aggregate results by computing basic
  statistics (average, standard deviation), as well as normalize results
  w.r.t. the "size" of the chosen node (e.g., number of nodes/path length in
  the resulting visit)

Basic benchmarks
----------------

- **Edge traversal**: given a node, retrieve the first node in its adjacency
  list. For reference: Apostolico, Drovandi in *Graph Compression by BFS*
  report times to retrieve the adjacency list of a node (and/or test whether
  an edge exists between two nodes) in the 2-3 us range, for the largest
  graph in their experiments (22 M nodes, 600 M edges).

Each use case is a benchmark
----------------------------

In addition to the abstract benchmarks above, we will use each use case above
as a scenario-based benchmark.
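The randomize-repeat-aggregate protocol described in the benchmark notes can be sketched as follows. This is only an outline, not part of the swh-graph codebase: ``run_use_case`` stands in for any of the API calls listed earlier, and ``fake_visit`` simulates one with fixed numbers purely for illustration.

```python
import random
import statistics

def benchmark(run_use_case, candidate_nodes, repetitions=30, seed=0):
    """Run a use case on randomly chosen input nodes and aggregate timings.

    Each per-run timing is normalized w.r.t. the "size" of the result
    (here, the number of visited nodes); returns (average, standard
    deviation) over all repetitions.
    """
    rng = random.Random(seed)  # fixed seed for reproducible node choices
    normalized = []
    for _ in range(repetitions):
        node = rng.choice(candidate_nodes)
        elapsed, nb_visited = run_use_case(node)
        normalized.append(elapsed / max(nb_visited, 1))
    return statistics.mean(normalized), statistics.stdev(normalized)

def fake_visit(node):
    # Simulated use case: (elapsed seconds, number of visited nodes).
    return 0.002, 100

avg, stdev = benchmark(fake_visit, ["swh:1:rev:" + "0" * 40], repetitions=5)
```

A real harness would replace ``fake_visit`` with a timed call to one of the endpoints above and draw ``candidate_nodes`` at random from the node list.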