diff --git a/docs/grpc-api.rst b/docs/grpc-api.rst index 1596d37..ab12323 100644 --- a/docs/grpc-api.rst +++ b/docs/grpc-api.rst @@ -1,553 +1,553 @@ .. _swh-graph-grpc-api: ================== Using the GRPC API ================== The GRPC API is the core API used to query the graph remotely. It uses the `GRPC framework `_ to provide high-performance graph traversal methods with server streaming. It is more expressive than the :ref:`HTTP API ` (which itself uses the GRPC API under the hood to serve queries), however it can only be used internally or with a local setup, and is never exposed publicly. Its major features include: returning node and edge properties, performing BFS traversals, including traversals with more than one starting node, finding shortest paths, common ancestors, etc. Quickstart ========== Starting the server ------------------- The GRPC server is automatically started on port 50091 when the HTTP server is started with ``swh graph rpc-serve``. It can also be started directly with Java, instead of going through the Python layer, by using the fat-jar shipped with swh-graph: .. code-block:: console $ java -cp swh-graph-XXX.jar org.softwareheritage.graph.rpc.GraphServer (See :ref:`swh-graph-java-api` and :ref:`swh-graph-memory` for more information on Java process options and JVM tuning.) Running queries --------------- The `gRPC command line tool `_ can be an easy way to query the GRPC API from the command line. It is invoked with the ``grpc_cli`` command. Of course, it is also possible to use a generated RPC client in any programming language supported by GRPC. All RPC methods are defined in the service ``swh.graph.TraversalService``. The available endpoints can be listed with ``ls``: .. code-block:: console $ grpc_cli ls localhost:50091 swh.graph.TraversalService Traverse FindPathTo FindPathBetween CountNodes CountEdges Stats GetNode A RPC method can be called with the ``call`` subcommand. .. code-block:: console $ grpc_cli call localhost:50091 swh.graph.TraversalService.Stats "" connecting to localhost:50091 num_nodes: 21 num_edges: 23 compression: 1.412 bits_per_node: 8.524 [...] Rpc succeeded with OK status The ``--json-output`` flag can also be used to make the results easier to parse. .. code-block:: console $ grpc_cli --json_output call localhost:50091 swh.graph.TraversalService.Stats "" connecting to localhost:50091 { "numNodes": "21", "numEdges": "23", [...] } Rpc succeeded with OK status **Note**: grpc_cli's outputs in this document are slightly modified for readability's sake. Simple queries ============== For a full documentation of all the endpoints, as well as the request and response messages, see :ref:`swh-graph-grpc-api-protobuf`. Querying a single node ---------------------- The **GetNode** endpoint can be used to return information on a single node of the graph, including all its node properties, from its SWHID. Here are a few examples from the test graph: Content ~~~~~~~ .. code-block:: console $ grpc_cli call localhost:50091 swh.graph.TraversalService.GetNode \ 'swhid: "swh:1:cnt:0000000000000000000000000000000000000001"' .. code-block:: javascript swhid: "swh:1:cnt:0000000000000000000000000000000000000001" cnt { length: 42 is_skipped: false } Revision ~~~~~~~~ .. code-block:: console $ grpc_cli call localhost:50091 swh.graph.TraversalService.GetNode \ 'swhid: "swh:1:rev:0000000000000000000000000000000000000009"' .. code-block:: javascript swhid: "swh:1:rev:0000000000000000000000000000000000000009" rev { author: 2 author_date: 1111140840 author_date_offset: 120 committer: 2 committer_date: 1111151950 committer_date_offset: 120 message: "Add parser" } Release ~~~~~~~ .. code-block:: console $ grpc_cli call localhost:50091 swh.graph.TraversalService.GetNode \ 'swhid: "swh:1:rel:0000000000000000000000000000000000000010"' .. code-block:: javascript swhid: "swh:1:rel:0000000000000000000000000000000000000010" rel { author: 0 author_date: 1234564290 author_date_offset: 120 message: "Version 1.0" } Origin ~~~~~~ .. code-block:: console $ grpc_cli call localhost:50091 swh.graph.TraversalService.GetNode \ 'swhid: "swh:1:ori:83404f995118bd25774f4ac14422a8f175e7a054"' .. code-block:: javascript swhid: "swh:1:ori:83404f995118bd25774f4ac14422a8f175e7a054" ori { url: "https://example.com/swh/graph" } Checking the presence of a node ------------------------------- The **GetNode** endpoint can also be used to check if a node exists in the graph. The RPC will return the ``INVALID_ARGUMENT`` code, and a detailed error message. .. code-block:: console $ grpc_cli call localhost:50091 swh.graph.TraversalService.GetNode \ 'swhid: "swh:1:ori:ffffffffffffffffffffffffffffffffffffffff"' Rpc failed with status code 3, error message: Unknown SWHID: swh:1:ori:ffffffffffffffffffffffffffffffffffffffff $ grpc_cli call localhost:50091 swh.graph.TraversalService.GetNode \ 'swhid: "invalidswhid"' Rpc failed with status code 3, error message: malformed SWHID: swh:1:ori:ffffffffffffffffffffffffffffffffffffffff Selecting returned fields with FieldMask ---------------------------------------- Many endpoints, including **GetNode**, contain a ``mask`` field of type `FieldMask `_, which can be used to select which fields should be returned in the response. This is particularly interesting for traversal queries that return a large number of nodes, because property access is quite costly from the compressed graph (at least compared to regular node access). It is therefore recommended that clients systematically use FieldMasks to only request the properties that they will consume. A FieldMask is represented as a set of "field paths" in dotted notation. For instance, ``paths: ["swhid", "rev.message"]`` will only request the swhid and the message of a given node. An empty mask will return an empty object. Example: .. code-block:: console $ grpc_cli call localhost:50091 swh.graph.TraversalService.GetNode \ 'swhid: "swh:1:rev:0000000000000000000000000000000000000009", mask: {paths: ["swhid"]}' swhid: "swh:1:rev:0000000000000000000000000000000000000009" $ grpc_cli call localhost:50091 swh.graph.TraversalService.GetNode \ 'swhid: "swh:1:rev:0000000000000000000000000000000000000009", mask: {paths: ["swhid", "rev.message", "rev.author"]}' swhid: "swh:1:rev:0000000000000000000000000000000000000009" rev { author: 2 message: "Add parser" } Getting statistics on the graph ------------------------------- The **Stats** endpoint returns overall statistics on the entire compressed graph. Most notably, the total number of nodes and edges, as well as the range of indegrees and outdegrees, and some compression-related statistics. .. code-block:: console $ grpc_cli --json_output call localhost:50091 swh.graph.TraversalService.Stats "" .. code-block:: json { "numNodes": "21", "numEdges": "23", "compression": 1.412, "bitsPerNode": 8.524, "bitsPerEdge": 7.783, "avgLocality": 2.522, "indegreeMax": "3", "indegreeAvg": 1.0952380952380953, "outdegreeMax": "3", "outdegreeAvg": 1.0952380952380953 } Graph traversals ================ Breadth-first traversal ----------------------- The **Traverse** endpoint performs a breadth-first traversal from a set of source nodes, and `streams `_ all the nodes it encounters on the way. All the node properties are stored in the result nodes. Additionally, the *edge properties* (e.g., directory entry names and permissions) are stored as a list in the ``successor`` field of each node. For instance, here we run a traversal from a directory that contains two contents: .. code-block:: console $ grpc_cli call localhost:50091 swh.graph.TraversalService.Traverse \ "src: 'swh:1:dir:0000000000000000000000000000000000000006'" We get the following stream of nodes: first, the source directory (including its properties, successor list and their labels), then the contents themselves and their respective properties. .. code-block:: javascript swhid: "swh:1:dir:0000000000000000000000000000000000000006" successor { swhid: "swh:1:cnt:0000000000000000000000000000000000000005" label { name: "parser.c" permission: 33188 } } successor { swhid: "swh:1:cnt:0000000000000000000000000000000000000004" label { name: "README.md" permission: 33188 } } num_successors: 2 .. code-block:: javascript swhid: "swh:1:cnt:0000000000000000000000000000000000000005" cnt { length: 1337 is_skipped: false } .. code-block:: javascript swhid: "swh:1:cnt:0000000000000000000000000000000000000004" cnt { length: 404 is_skipped: false } Again, it is possible to use a FieldMask to restrict which fields get returned. For instance, if we only care about the SWHIDs: .. code-block:: console $ grpc_cli call localhost:50091 swh.graph.TraversalService.Traverse \ "src: 'swh:1:dir:0000000000000000000000000000000000000006', mask: {paths: ['swhid']}" swhid: "swh:1:dir:0000000000000000000000000000000000000006" swhid: "swh:1:cnt:0000000000000000000000000000000000000005" swhid: "swh:1:cnt:0000000000000000000000000000000000000004" Graph direction ~~~~~~~~~~~~~~~ For many purposes, especially that of finding the provenance of software artifacts, it is useful to query the backward (or transposed) graph instead, which is the same as the forward graph except all the edges are reversed. To achieve this, the ``direction`` field can be used to specify a direction from the ``GraphDirection`` enum (either ``FORWARD`` or ``BACKWARD``). This query returns all the nodes reachable from a given directory in the -*transposed* (or "backward") graph: +*backward* (or "transposed") graph: .. code-block:: console $ grpc_cli call localhost:50091 swh.graph.TraversalService.Traverse \ "src: 'swh:1:dir:0000000000000000000000000000000000000006', direction: BACKWARD, mask: {paths: ['swhid']}" swhid: "swh:1:dir:0000000000000000000000000000000000000006" swhid: "swh:1:dir:0000000000000000000000000000000000000008" swhid: "swh:1:dir:0000000000000000000000000000000000000012" swhid: "swh:1:rev:0000000000000000000000000000000000000009" swhid: "swh:1:rev:0000000000000000000000000000000000000013" swhid: "swh:1:rel:0000000000000000000000000000000000000010" swhid: "swh:1:snp:0000000000000000000000000000000000000020" swhid: "swh:1:rev:0000000000000000000000000000000000000018" swhid: "swh:1:ori:83404f995118bd25774f4ac14422a8f175e7a054" swhid: "swh:1:rel:0000000000000000000000000000000000000019" Edge restrictions ~~~~~~~~~~~~~~~~~ -To constrain the kinds of edges can be followed during the graph traversal, it -is possible to specify an edge restriction string in the ``edge`` field. -It is a comma-separated list of edge types that will be followed (e.g. +To constrain the types of edges that can be followed during the graph +traversal, it is possible to specify an edge restriction string in the ``edge`` +field. It is a comma-separated list of edge types that will be followed (e.g. ``"rev:dir,dir:cnt"`` to only follow revision → directory and directory → content edges). By default (or when ``"*"`` is provided), all edges can be followed. This query traverses the parent revisions of a given revision only (i.e., it outputs the *commit log* from a given commit): .. code-block:: console $ grpc_cli call localhost:50091 swh.graph.TraversalService.Traverse \ "src: 'swh:1:rev:0000000000000000000000000000000000000018', edges: 'rev:rev', mask: {paths: ['swhid']}" swhid: "swh:1:rev:0000000000000000000000000000000000000018" swhid: "swh:1:rev:0000000000000000000000000000000000000013" swhid: "swh:1:rev:0000000000000000000000000000000000000009" swhid: "swh:1:rev:0000000000000000000000000000000000000003" Limiting the traversal ~~~~~~~~~~~~~~~~~~~~~~ To avoid using up too much memory or resources, a traversal can be limited in two different ways: - the ``max_depth`` attribute defines the maximum depth of the traversal. - the ``max_edges`` attribute defines the maximum number of edges that can be fetched by the traversal. When these limits are reached, the traversal will simply stop. While these options have obvious use-cases for anti-abuse, they can also be semantically useful: for instance, specifying ``max_depth: 1`` will only return the *neighbors* of the source node. Filtering returned nodes ~~~~~~~~~~~~~~~~~~~~~~~~ In many cases, clients might not want to get all the traversed nodes in the response stream. With the ``return_nodes`` field (of type ``NodeFilter``), it is possible to specify various *criteria* for which nodes should be sent to the stream. By default, all nodes are returned. One common filter is to only want specific *node types* to be returned, which can be done with the ``types`` field of ``NodeFilter``. This field contains a node type restriction string (e.g. "dir,cnt,rev"), and defaults to "*" (all). For instance, to find the list of origins in which a given directory can be found: .. code-block:: console $ grpc_cli call localhost:50091 swh.graph.TraversalService.Traverse \ "src: 'swh:1:dir:0000000000000000000000000000000000000006', return_nodes: {types: 'ori'}, direction: BACKWARD, mask: {paths: ['swhid']}" swhid: "swh:1:ori:83404f995118bd25774f4ac14422a8f175e7a054" Traversal from multiple sources ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Traversals can have multiple starting nodes, when multiple source nodes are present in the ``src`` field. For instance, this BFS starts from two different directories, and explores the graph in parallel from these multiple starting points: .. code-block:: console $ grpc_cli call localhost:50091 swh.graph.TraversalService.Traverse \ "src: ['swh:1:dir:0000000000000000000000000000000000000006', 'swh:1:dir:0000000000000000000000000000000000000017'], mask: {paths: ['swhid']}" swhid: "swh:1:dir:0000000000000000000000000000000000000006" swhid: "swh:1:dir:0000000000000000000000000000000000000017" swhid: "swh:1:cnt:0000000000000000000000000000000000000005" swhid: "swh:1:cnt:0000000000000000000000000000000000000004" swhid: "swh:1:cnt:0000000000000000000000000000000000000014" swhid: "swh:1:dir:0000000000000000000000000000000000000016" swhid: "swh:1:cnt:0000000000000000000000000000000000000015" Finding a path to a node matching a criteria -------------------------------------------- The **FindPathTo** endpoint searches for a shortest path between a set of source nodes and any node that matches a specific *criteria*. It does so by performing a breadth-first search from the source node, until any node that matches the given criteria is found, then follows back its parents to return a shortest path from the source set to that node. The criteria can be specified in the ``target`` field of the ``FindPathToRequest``, which is of type ``NodeFilter``. As an example, a common use-case for content provenance is to find the shortest path of a content to an origin in the transposed graph. This query can be run like this: .. code-block:: console $ grpc_cli call localhost:50091 swh.graph.TraversalService.FindPathTo \ "src: 'swh:1:cnt:0000000000000000000000000000000000000001', target: {types: 'ori'}, direction: BACKWARD, mask: {paths: ['swhid']}" swhid: "swh:1:cnt:0000000000000000000000000000000000000001" swhid: "swh:1:dir:0000000000000000000000000000000000000008" swhid: "swh:1:rev:0000000000000000000000000000000000000009" swhid: "swh:1:snp:0000000000000000000000000000000000000020" swhid: "swh:1:ori:83404f995118bd25774f4ac14422a8f175e7a054" As soon as the request finds an origin, it stops and returns the path from the source set to this origin. Similar to the **Traverse** endpoint, it is possible to specify edge restrictions, graph directions, as well as multiple source nodes. Finding a path between two sets of nodes ---------------------------------------- The **FindPathBetween** endpoint searches for a shortest path between a set of source nodes and a set of destination nodes. It does so by performing a *bidirectional breadth-first search*, i.e., two parallel breadth-first searches, one from the source set ("src-BFS") and one from the destination set ("dst-BFS"), until both searches find a common node that joins their visited sets. This node is called the "midpoint node". The path returned is the path src -> ... -> midpoint -> ... -> dst, which is always a shortest path between src and dst. The graph direction of both BFS can be configured separately. By default, the dst-BFS will use the graph in the opposite direction than the src-BFS (if direction = FORWARD, by default direction_reverse = BACKWARD, and vice-versa). The default behavior is thus to search for a shortest path between two nodes in a given direction. However, one can also specify FORWARD or BACKWARD for *both* the src-BFS and the dst-BFS. This will search for a common descendant or a common ancestor between the two sets, respectively. These will be the midpoints of the returned path. Similar to the **Traverse** endpoint, it is also possible to specify edge restrictions. **Example 1**: shortest path from a snapshot to a content (forward graph): .. code-block:: console $ grpc_cli call localhost:50091 swh.graph.TraversalService.FindPathBetween \ "src: 'swh:1:snp:0000000000000000000000000000000000000020', dst: 'swh:1:cnt:0000000000000000000000000000000000000004', mask: {paths: ['swhid']}" swhid: "swh:1:snp:0000000000000000000000000000000000000020" swhid: "swh:1:rev:0000000000000000000000000000000000000009" swhid: "swh:1:dir:0000000000000000000000000000000000000008" swhid: "swh:1:dir:0000000000000000000000000000000000000006" swhid: "swh:1:cnt:0000000000000000000000000000000000000004" **Example 2**: shortest path from a directory to a snapshot (backward graph): .. code-block:: console $ grpc_cli call localhost:50091 swh.graph.TraversalService.FindPathBetween \ "src: 'swh:1:dir:0000000000000000000000000000000000000006', dst: 'swh:1:rel:0000000000000000000000000000000000000019', direction: BACKWARD, mask: {paths: ['swhid']}" swhid: "swh:1:dir:0000000000000000000000000000000000000006" swhid: "swh:1:dir:0000000000000000000000000000000000000008" swhid: "swh:1:dir:0000000000000000000000000000000000000012" swhid: "swh:1:rev:0000000000000000000000000000000000000013" swhid: "swh:1:rev:0000000000000000000000000000000000000018" swhid: "swh:1:rel:0000000000000000000000000000000000000019" **Example 3**: common ancestor of two contents: .. code-block:: console $ grpc_cli call localhost:50091 swh.graph.TraversalService.FindPathBetween \ "src: 'swh:1:cnt:0000000000000000000000000000000000000004', dst: 'swh:1:cnt:0000000000000000000000000000000000000015', direction: BACKWARD, direction_reverse: BACKWARD, mask: {paths: ['swhid']}" swhid: "swh:1:cnt:0000000000000000000000000000000000000004" swhid: "swh:1:dir:0000000000000000000000000000000000000006" swhid: "swh:1:dir:0000000000000000000000000000000000000008" swhid: "swh:1:dir:0000000000000000000000000000000000000012" swhid: "swh:1:rev:0000000000000000000000000000000000000013" swhid: "swh:1:rev:0000000000000000000000000000000000000018" swhid: "swh:1:dir:0000000000000000000000000000000000000017" swhid: "swh:1:dir:0000000000000000000000000000000000000016" swhid: "swh:1:cnt:0000000000000000000000000000000000000015" middle_node_index: 5 Because ``middle_node_index = 5``, the common ancestor is ``swh:1:rev:0000000000000000000000000000000000000018``. .. _swh-graph-grpc-api-protobuf: Protobuf API Reference ====================== The GRPC API is specified in a single self-documenting `protobuf `_ file, reproduced here verbatim. .. literalinclude:: ../proto/swhgraph.proto :language: protobuf diff --git a/java/pom.xml b/java/pom.xml index 02e438f..0fa2200 100644 --- a/java/pom.xml +++ b/java/pom.xml @@ -1,403 +1,403 @@ 4.0.0 org.softwareheritage.graph swh-graph ${git.closest.tag.name} swh-graph https://forge.softwareheritage.org/source/swh-graph/ UTF-8 11 3.21.1 1.47.0 ch.qos.logback logback-classic 1.2.3 org.junit.jupiter junit-jupiter-api 5.7.0 test org.junit.jupiter junit-jupiter-engine 5.7.0 test org.slf4j slf4j-simple 1.7.26 it.unimi.dsi webgraph-big 3.6.7 it.unimi.dsi fastutil 8.5.8 it.unimi.dsi dsiutils 2.7.1 it.unimi.dsi sux4j 5.3.1 it.unimi.dsi law 2.7.2 org.apache.hadoop hadoop-common org.umlgraph umlgraph org.eclipse.jetty.aggregate jetty-all it.unimi.di mg4j it.unimi.di mg4j-big com.martiansoftware jsap 2.1 commons-codec commons-codec 1.15 com.github.luben zstd-jni 1.5.1-1 org.apache.orc orc-core 1.7.1 org.apache.hadoop hadoop-common 3.3.1 org.apache.hadoop hadoop-client-runtime 3.3.1 - - com.google.protobuf - protobuf-java - ${protobuf.version} - - - io.grpc - grpc-netty-shaded - ${grpc.version} - - - io.grpc - grpc-protobuf - ${grpc.version} - - - io.grpc - grpc-stub - ${grpc.version} - - - io.grpc - grpc-services - ${grpc.version} - - - io.grpc - grpc-testing - ${grpc.version} - - - javax.annotation - javax.annotation-api - 1.3.2 - - - com.google.protobuf - protobuf-java-util - ${protobuf.version} - + + com.google.protobuf + protobuf-java + ${protobuf.version} + + + io.grpc + grpc-netty-shaded + ${grpc.version} + + + io.grpc + grpc-protobuf + ${grpc.version} + + + io.grpc + grpc-stub + ${grpc.version} + + + io.grpc + grpc-services + ${grpc.version} + + + io.grpc + grpc-testing + ${grpc.version} + + + javax.annotation + javax.annotation-api + 1.3.2 + + + com.google.protobuf + protobuf-java-util + ${protobuf.version} + maven-clean-plugin 3.1.0 maven-resources-plugin 3.0.2 maven-compiler-plugin 3.8.0 11 11 -verbose -Xlint:all maven-surefire-plugin 2.22.2 maven-failsafe-plugin 2.22.2 maven-jar-plugin 3.0.2 maven-install-plugin 2.5.2 maven-deploy-plugin 2.8.2 maven-site-plugin 3.7.1 maven-project-info-reports-plugin 3.0.0 maven-dependency-plugin 3.1.2 maven-assembly-plugin 3.3.0 org.softwareheritage.graph.rpc.GraphServer jar-with-dependencies false make-assembly package single com.diffplug.spotless spotless-maven-plugin 2.22.1 *.md .gitignore true 4 4.16.0 .coding-style.xml pl.project13.maven git-commit-id-plugin 3.0.1 get-the-git-infos revision initialize true true true true v* git.closest.tag.name ^v true maven-source-plugin 2.1.1 bundle-sources package jar-no-fork test-jar-no-fork org.apache.maven.plugins maven-javadoc-plugin 3.3.1 resource-bundles package resource-bundle test-resource-bundle false javadoc-jar package jar true it.unimi.dsi:webgraph-big:* https://webgraph.di.unimi.it/docs-big/ https://dsiutils.di.unimi.it/docs/ https://fastutil.di.unimi.it/docs/ https://law.di.unimi.it/software/law-docs/ implSpec a Implementation Requirements: implNote a Implementation Note: org.xolstice.maven.plugins protobuf-maven-plugin 0.6.1 com.google.protobuf:protoc:${protobuf.version}:exe:${os.detected.classifier} grpc-java io.grpc:protoc-gen-grpc-java:${grpc.version}:exe:${os.detected.classifier} compile compile-custom test-compile test-compile-custom kr.motd.maven os-maven-plugin 1.6.2