diff --git a/docs/grpc-api.rst b/docs/grpc-api.rst index e487e51..1596d37 100644 --- a/docs/grpc-api.rst +++ b/docs/grpc-api.rst @@ -1,127 +1,553 @@ .. _swh-graph-grpc-api: +================== Using the GRPC API ================== The GRPC API is the core API used to query the graph remotely. It uses the `GRPC framework `_ to provide high-performance graph traversal methods with server streaming. It is more expressive than the :ref:`HTTP API ` (which itself uses the GRPC API under the hood to serve queries), however it can only be used internally or with a local setup, and is never exposed publicly. Its major features include: returning node and edge properties, performing BFS traversals, including traversals with more than one starting node, finding shortest paths, common ancestors, etc. Quickstart ----------- +========== Starting the server -~~~~~~~~~~~~~~~~~~~ +------------------- The GRPC server is automatically started on port 50091 when the HTTP server is started with ``swh graph rpc-serve``. It can also be started directly with Java, instead of going through the Python layer, by using the fat-jar shipped with swh-graph: .. code-block:: console $ java -cp swh-graph-XXX.jar org.softwareheritage.graph.rpc.GraphServer (See :ref:`swh-graph-java-api` and :ref:`swh-graph-memory` for more information on Java process options and JVM tuning.) Running queries -~~~~~~~~~~~~~~~ +--------------- The `gRPC command line tool `_ can be an easy way to query the GRPC API from the command line. It is invoked with the ``grpc_cli`` command. Of course, it is also possible to use a generated RPC client in any programming language supported by GRPC. All RPC methods are defined in the service ``swh.graph.TraversalService``. The available endpoints can be listed with ``ls``: .. code-block:: console $ grpc_cli ls localhost:50091 swh.graph.TraversalService Traverse FindPathTo FindPathBetween CountNodes CountEdges Stats GetNode -A RPC method can be called with ``call``: +A RPC method can be called with the ``call`` subcommand. .. code-block:: console - % grpc_cli --json_output call localhost:50091 swh.graph.TraversalService.Stats "" + $ grpc_cli call localhost:50091 swh.graph.TraversalService.Stats "" + connecting to localhost:50091 + num_nodes: 21 + num_edges: 23 + compression: 1.412 + bits_per_node: 8.524 + [...] + Rpc succeeded with OK status + +The ``--json-output`` flag can also be used to make the results easier to +parse. + +.. code-block:: console + + $ grpc_cli --json_output call localhost:50091 swh.graph.TraversalService.Stats "" connecting to localhost:50091 { "numNodes": "21", "numEdges": "23", [...] } Rpc succeeded with OK status -Endpoints ---------- +**Note**: grpc_cli's outputs in this document are slightly modified for +readability's sake. + +Simple queries +============== For a full documentation of all the endpoints, as well as the request and response messages, see :ref:`swh-graph-grpc-api-protobuf`. -Stats -~~~~~ - -Stats returns overall statistics on the entire compressed graph. +Querying a single node +---------------------- -TODO +The **GetNode** endpoint can be used to return information on a single +node of the graph, including all its node properties, from its SWHID. Here +are a few examples from the test graph: -GetNode +Content ~~~~~~~ -TODO +.. code-block:: console + + $ grpc_cli call localhost:50091 swh.graph.TraversalService.GetNode \ + 'swhid: "swh:1:cnt:0000000000000000000000000000000000000001"' + +.. code-block:: javascript -Traverse + swhid: "swh:1:cnt:0000000000000000000000000000000000000001" + cnt { + length: 42 + is_skipped: false + } + +Revision ~~~~~~~~ -TODO +.. code-block:: console + + $ grpc_cli call localhost:50091 swh.graph.TraversalService.GetNode \ + 'swhid: "swh:1:rev:0000000000000000000000000000000000000009"' + +.. code-block:: javascript + + swhid: "swh:1:rev:0000000000000000000000000000000000000009" + rev { + author: 2 + author_date: 1111140840 + author_date_offset: 120 + committer: 2 + committer_date: 1111151950 + committer_date_offset: 120 + message: "Add parser" + } + +Release +~~~~~~~ + +.. code-block:: console + + $ grpc_cli call localhost:50091 swh.graph.TraversalService.GetNode \ + 'swhid: "swh:1:rel:0000000000000000000000000000000000000010"' + +.. code-block:: javascript + + swhid: "swh:1:rel:0000000000000000000000000000000000000010" + rel { + author: 0 + author_date: 1234564290 + author_date_offset: 120 + message: "Version 1.0" + } + +Origin +~~~~~~ + +.. code-block:: console + + $ grpc_cli call localhost:50091 swh.graph.TraversalService.GetNode \ + 'swhid: "swh:1:ori:83404f995118bd25774f4ac14422a8f175e7a054"' + +.. code-block:: javascript + + swhid: "swh:1:ori:83404f995118bd25774f4ac14422a8f175e7a054" + ori { + url: "https://example.com/swh/graph" + } + + +Checking the presence of a node +------------------------------- + +The **GetNode** endpoint can also be used to check if a node exists in the +graph. The RPC will return the ``INVALID_ARGUMENT`` code, and a detailed error +message. + +.. code-block:: console + + $ grpc_cli call localhost:50091 swh.graph.TraversalService.GetNode \ + 'swhid: "swh:1:ori:ffffffffffffffffffffffffffffffffffffffff"' + Rpc failed with status code 3, error message: Unknown SWHID: swh:1:ori:ffffffffffffffffffffffffffffffffffffffff + + $ grpc_cli call localhost:50091 swh.graph.TraversalService.GetNode \ + 'swhid: "invalidswhid"' + Rpc failed with status code 3, error message: malformed SWHID: swh:1:ori:ffffffffffffffffffffffffffffffffffffffff + + +Selecting returned fields with FieldMask +---------------------------------------- + +Many endpoints, including **GetNode**, contain a ``mask`` field of type +`FieldMask +`_, +which can be used to select which fields should be returned in the response. + +This is particularly interesting for traversal queries that return a large +number of nodes, because property access is quite costly from the compressed +graph (at least compared to regular node access). It is therefore recommended +that clients systematically use FieldMasks to only request the properties that +they will consume. + +A FieldMask is represented as a set of "field paths" in dotted notation. For +instance, ``paths: ["swhid", "rev.message"]`` will only request the swhid and +the message of a given node. An empty mask will return an empty object. + +Example: + +.. code-block:: console + + $ grpc_cli call localhost:50091 swh.graph.TraversalService.GetNode \ + 'swhid: "swh:1:rev:0000000000000000000000000000000000000009", mask: {paths: ["swhid"]}' + swhid: "swh:1:rev:0000000000000000000000000000000000000009" + + $ grpc_cli call localhost:50091 swh.graph.TraversalService.GetNode \ + 'swhid: "swh:1:rev:0000000000000000000000000000000000000009", mask: {paths: ["swhid", "rev.message", "rev.author"]}' + swhid: "swh:1:rev:0000000000000000000000000000000000000009" + rev { + author: 2 + message: "Add parser" + } + + +Getting statistics on the graph +------------------------------- + +The **Stats** endpoint returns overall statistics on the entire compressed +graph. Most notably, the total number of nodes and edges, as well as the +range of indegrees and outdegrees, and some compression-related statistics. + +.. code-block:: console + + $ grpc_cli --json_output call localhost:50091 swh.graph.TraversalService.Stats "" + +.. code-block:: json + + { + "numNodes": "21", + "numEdges": "23", + "compression": 1.412, + "bitsPerNode": 8.524, + "bitsPerEdge": 7.783, + "avgLocality": 2.522, + "indegreeMax": "3", + "indegreeAvg": 1.0952380952380953, + "outdegreeMax": "3", + "outdegreeAvg": 1.0952380952380953 + } + + +Graph traversals +================ + +Breadth-first traversal +----------------------- + +The **Traverse** endpoint performs a breadth-first traversal from a set of +source nodes, and `streams +`_ all +the nodes it encounters on the way. All the node properties are stored in the +result nodes. Additionally, the *edge properties* (e.g., directory entry names +and permissions) are stored as a list in the ``successor`` field of each node. -FindPathTo -~~~~~~~~~~ +For instance, here we run a traversal from a directory that contains two +contents: + +.. code-block:: console + + $ grpc_cli call localhost:50091 swh.graph.TraversalService.Traverse \ + "src: 'swh:1:dir:0000000000000000000000000000000000000006'" + +We get the following stream of nodes: first, the source directory (including +its properties, successor list and their labels), then the contents themselves +and their respective properties. + +.. code-block:: javascript + + swhid: "swh:1:dir:0000000000000000000000000000000000000006" + successor { + swhid: "swh:1:cnt:0000000000000000000000000000000000000005" + label { + name: "parser.c" + permission: 33188 + } + } + successor { + swhid: "swh:1:cnt:0000000000000000000000000000000000000004" + label { + name: "README.md" + permission: 33188 + } + } + num_successors: 2 + +.. code-block:: javascript + + swhid: "swh:1:cnt:0000000000000000000000000000000000000005" + cnt { + length: 1337 + is_skipped: false + } + +.. code-block:: javascript + + swhid: "swh:1:cnt:0000000000000000000000000000000000000004" + cnt { + length: 404 + is_skipped: false + } + +Again, it is possible to use a FieldMask to restrict which fields get returned. +For instance, if we only care about the SWHIDs: + +.. code-block:: console + + $ grpc_cli call localhost:50091 swh.graph.TraversalService.Traverse \ + "src: 'swh:1:dir:0000000000000000000000000000000000000006', mask: {paths: ['swhid']}" + swhid: "swh:1:dir:0000000000000000000000000000000000000006" + swhid: "swh:1:cnt:0000000000000000000000000000000000000005" + swhid: "swh:1:cnt:0000000000000000000000000000000000000004" -TODO -FindPathBetween +Graph direction ~~~~~~~~~~~~~~~ -TODO +For many purposes, especially that of finding the provenance of software +artifacts, it is useful to query the backward (or transposed) graph instead, +which is the same as the forward graph except all the edges are reversed. +To achieve this, the ``direction`` field can be used to specify a direction +from the ``GraphDirection`` enum (either ``FORWARD`` or ``BACKWARD``). + +This query returns all the nodes reachable from a given directory in the +*transposed* (or "backward") graph: + +.. code-block:: console + + $ grpc_cli call localhost:50091 swh.graph.TraversalService.Traverse \ + "src: 'swh:1:dir:0000000000000000000000000000000000000006', direction: BACKWARD, mask: {paths: ['swhid']}" + swhid: "swh:1:dir:0000000000000000000000000000000000000006" + swhid: "swh:1:dir:0000000000000000000000000000000000000008" + swhid: "swh:1:dir:0000000000000000000000000000000000000012" + swhid: "swh:1:rev:0000000000000000000000000000000000000009" + swhid: "swh:1:rev:0000000000000000000000000000000000000013" + swhid: "swh:1:rel:0000000000000000000000000000000000000010" + swhid: "swh:1:snp:0000000000000000000000000000000000000020" + swhid: "swh:1:rev:0000000000000000000000000000000000000018" + swhid: "swh:1:ori:83404f995118bd25774f4ac14422a8f175e7a054" + swhid: "swh:1:rel:0000000000000000000000000000000000000019" + + +Edge restrictions +~~~~~~~~~~~~~~~~~ + +To constrain the kinds of edges can be followed during the graph traversal, it +is possible to specify an edge restriction string in the ``edge`` field. +It is a comma-separated list of edge types that will be followed (e.g. +``"rev:dir,dir:cnt"`` to only follow revision → directory and directory → +content edges). +By default (or when ``"*"`` is provided), all edges can be followed. + +This query traverses the parent revisions of a given revision only (i.e., it +outputs the *commit log* from a given commit): + +.. code-block:: console + + $ grpc_cli call localhost:50091 swh.graph.TraversalService.Traverse \ + "src: 'swh:1:rev:0000000000000000000000000000000000000018', edges: 'rev:rev', mask: {paths: ['swhid']}" + swhid: "swh:1:rev:0000000000000000000000000000000000000018" + swhid: "swh:1:rev:0000000000000000000000000000000000000013" + swhid: "swh:1:rev:0000000000000000000000000000000000000009" + swhid: "swh:1:rev:0000000000000000000000000000000000000003" + + +Limiting the traversal +~~~~~~~~~~~~~~~~~~~~~~ + +To avoid using up too much memory or resources, a traversal can be limited +in two different ways: + +- the ``max_depth`` attribute defines the maximum depth of the traversal. +- the ``max_edges`` attribute defines the maximum number of edges that can be + fetched by the traversal. + +When these limits are reached, the traversal will simply stop. While these +options have obvious use-cases for anti-abuse, they can also be semantically +useful: for instance, specifying ``max_depth: 1`` will only return the +*neighbors* of the source node. + + +Filtering returned nodes +~~~~~~~~~~~~~~~~~~~~~~~~ + +In many cases, clients might not want to get all the traversed nodes in the +response stream. With the ``return_nodes`` field (of type ``NodeFilter``), it +is possible to specify various *criteria* for which nodes should be sent to the +stream. By default, all nodes are returned. + +One common filter is to only want specific *node types* to be returned, which +can be done with the ``types`` field of ``NodeFilter``. This field contains a +node type restriction string (e.g. "dir,cnt,rev"), and defaults to "*" (all). +For instance, to find the list of origins in which a given directory can be +found: + +.. code-block:: console + + $ grpc_cli call localhost:50091 swh.graph.TraversalService.Traverse \ + "src: 'swh:1:dir:0000000000000000000000000000000000000006', return_nodes: {types: 'ori'}, direction: BACKWARD, mask: {paths: ['swhid']}" + swhid: "swh:1:ori:83404f995118bd25774f4ac14422a8f175e7a054" + + +Traversal from multiple sources +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Traversals can have multiple starting nodes, when multiple source nodes are +present in the ``src`` field. For instance, this BFS starts from two different +directories, and explores the graph in parallel from these multiple starting +points: + +.. code-block:: console + + $ grpc_cli call localhost:50091 swh.graph.TraversalService.Traverse \ + "src: ['swh:1:dir:0000000000000000000000000000000000000006', 'swh:1:dir:0000000000000000000000000000000000000017'], mask: {paths: ['swhid']}" + swhid: "swh:1:dir:0000000000000000000000000000000000000006" + swhid: "swh:1:dir:0000000000000000000000000000000000000017" + swhid: "swh:1:cnt:0000000000000000000000000000000000000005" + swhid: "swh:1:cnt:0000000000000000000000000000000000000004" + swhid: "swh:1:cnt:0000000000000000000000000000000000000014" + swhid: "swh:1:dir:0000000000000000000000000000000000000016" + swhid: "swh:1:cnt:0000000000000000000000000000000000000015" + + +Finding a path to a node matching a criteria +-------------------------------------------- + +The **FindPathTo** endpoint searches for a shortest path between a set of +source nodes and any node that matches a specific *criteria*. +It does so by performing a breadth-first search from the source node, +until any node that matches the given criteria is found, then follows +back its parents to return a shortest path from the source set to that +node. + +The criteria can be specified in the ``target`` field of the +``FindPathToRequest``, which is of type ``NodeFilter``. + +As an example, a common use-case for content provenance is to find the shortest +path of a content to an origin in the transposed graph. This query can be +run like this: + +.. code-block:: console + + $ grpc_cli call localhost:50091 swh.graph.TraversalService.FindPathTo \ + "src: 'swh:1:cnt:0000000000000000000000000000000000000001', target: {types: 'ori'}, direction: BACKWARD, mask: {paths: ['swhid']}" + swhid: "swh:1:cnt:0000000000000000000000000000000000000001" + swhid: "swh:1:dir:0000000000000000000000000000000000000008" + swhid: "swh:1:rev:0000000000000000000000000000000000000009" + swhid: "swh:1:snp:0000000000000000000000000000000000000020" + swhid: "swh:1:ori:83404f995118bd25774f4ac14422a8f175e7a054" + +As soon as the request finds an origin, it stops and returns the path from the +source set to this origin. + +Similar to the **Traverse** endpoint, it is possible to specify edge +restrictions, graph directions, as well as multiple source nodes. -CountNodes -~~~~~~~~~~ -TODO +Finding a path between two sets of nodes +---------------------------------------- -CountEdges -~~~~~~~~~~ +The **FindPathBetween** endpoint searches for a shortest path between a set of +source nodes and a set of destination nodes. + +It does so by performing a *bidirectional breadth-first search*, i.e., +two parallel breadth-first searches, one from the source set ("src-BFS") +and one from the destination set ("dst-BFS"), until both searches find a +common node that joins their visited sets. This node is called the +"midpoint node". +The path returned is the path src -> ... -> midpoint -> ... -> dst, +which is always a shortest path between src and dst. + +The graph direction of both BFS can be configured separately. By +default, the dst-BFS will use the graph in the opposite direction than +the src-BFS (if direction = FORWARD, by default direction_reverse = +BACKWARD, and vice-versa). The default behavior is thus to search for +a shortest path between two nodes in a given direction. However, one +can also specify FORWARD or BACKWARD for *both* the src-BFS and the +dst-BFS. This will search for a common descendant or a common ancestor +between the two sets, respectively. These will be the midpoints of the +returned path. + +Similar to the **Traverse** endpoint, it is also possible to specify edge +restrictions. + +**Example 1**: shortest path from a snapshot to a content (forward graph): + +.. code-block:: console + + $ grpc_cli call localhost:50091 swh.graph.TraversalService.FindPathBetween \ + "src: 'swh:1:snp:0000000000000000000000000000000000000020', dst: 'swh:1:cnt:0000000000000000000000000000000000000004', mask: {paths: ['swhid']}" + swhid: "swh:1:snp:0000000000000000000000000000000000000020" + swhid: "swh:1:rev:0000000000000000000000000000000000000009" + swhid: "swh:1:dir:0000000000000000000000000000000000000008" + swhid: "swh:1:dir:0000000000000000000000000000000000000006" + swhid: "swh:1:cnt:0000000000000000000000000000000000000004" + +**Example 2**: shortest path from a directory to a snapshot (backward graph): + +.. code-block:: console + + $ grpc_cli call localhost:50091 swh.graph.TraversalService.FindPathBetween \ + "src: 'swh:1:dir:0000000000000000000000000000000000000006', dst: 'swh:1:rel:0000000000000000000000000000000000000019', direction: BACKWARD, mask: {paths: ['swhid']}" + swhid: "swh:1:dir:0000000000000000000000000000000000000006" + swhid: "swh:1:dir:0000000000000000000000000000000000000008" + swhid: "swh:1:dir:0000000000000000000000000000000000000012" + swhid: "swh:1:rev:0000000000000000000000000000000000000013" + swhid: "swh:1:rev:0000000000000000000000000000000000000018" + swhid: "swh:1:rel:0000000000000000000000000000000000000019" + +**Example 3**: common ancestor of two contents: + +.. code-block:: console + + $ grpc_cli call localhost:50091 swh.graph.TraversalService.FindPathBetween \ + "src: 'swh:1:cnt:0000000000000000000000000000000000000004', dst: 'swh:1:cnt:0000000000000000000000000000000000000015', direction: BACKWARD, direction_reverse: BACKWARD, mask: {paths: ['swhid']}" + swhid: "swh:1:cnt:0000000000000000000000000000000000000004" + swhid: "swh:1:dir:0000000000000000000000000000000000000006" + swhid: "swh:1:dir:0000000000000000000000000000000000000008" + swhid: "swh:1:dir:0000000000000000000000000000000000000012" + swhid: "swh:1:rev:0000000000000000000000000000000000000013" + swhid: "swh:1:rev:0000000000000000000000000000000000000018" + swhid: "swh:1:dir:0000000000000000000000000000000000000017" + swhid: "swh:1:dir:0000000000000000000000000000000000000016" + swhid: "swh:1:cnt:0000000000000000000000000000000000000015" + middle_node_index: 5 + +Because ``middle_node_index = 5``, the common ancestor is +``swh:1:rev:0000000000000000000000000000000000000018``. -TODO .. _swh-graph-grpc-api-protobuf: Protobuf API Reference ----------------------- +====================== The GRPC API is specified in a single self-documenting `protobuf `_ file, reproduced here verbatim. .. literalinclude:: ../proto/swhgraph.proto :language: protobuf - :linenos: