diff --git a/PKG-INFO b/PKG-INFO index 26a7abd..c5989bd 100644 --- a/PKG-INFO +++ b/PKG-INFO @@ -1,52 +1,52 @@ Metadata-Version: 2.1 Name: swh.graph -Version: 1.0.1 +Version: 1.0.2 Summary: Software Heritage graph service Home-page: https://forge.softwareheritage.org/diffusion/DGRPH Author: Software Heritage developers Author-email: swh-devel@inria.fr Project-URL: Bug Reports, https://forge.softwareheritage.org/maniphest Project-URL: Funding, https://www.softwareheritage.org/donate Project-URL: Source, https://forge.softwareheritage.org/source/swh-graph Project-URL: Documentation, https://docs.softwareheritage.org/devel/swh-graph/ Classifier: Programming Language :: Python :: 3 Classifier: Intended Audience :: Developers Classifier: License :: OSI Approved :: GNU General Public License v3 (GPLv3) Classifier: Operating System :: OS Independent Classifier: Development Status :: 3 - Alpha Requires-Python: >=3.7 Description-Content-Type: text/x-rst Provides-Extra: testing License-File: LICENSE License-File: AUTHORS Software Heritage - graph service ================================= Tooling and services, collectively known as ``swh-graph``, providing fast access to the graph representation of the `Software Heritage `_ `archive `_. The service is in-memory, based on a compressed representation of the Software Heritage Merkle DAG. Bibliography ------------ In addition to accompanying technical documentation, ``swh-graph`` is also described in the following scientific paper. If you publish results based on ``swh-graph``, please acknowledge it by citing the paper as follows: .. note:: Paolo Boldi, Antoine Pietri, Sebastiano Vigna, Stefano Zacchiroli. `Ultra-Large-Scale Repository Analysis via Graph Compression `_. In proceedings of `SANER 2020 `_: The 27th IEEE International Conference on Software Analysis, Evolution and Reengineering, pages 184-194. IEEE 2020. Links: `preprint `_, `bibtex `_. 
diff --git a/docs/api.rst b/docs/api.rst index 3face8d..9d29c82 100644 --- a/docs/api.rst +++ b/docs/api.rst @@ -1,541 +1,473 @@ .. _swh-graph-api: Graph Querying HTTP API ======================= The Graph Querying API is a high-level HTTP API intended to run common, relatively simple traversal queries on the compressed graph. The client/server architecture allows it to load the graph in memory only once and then serve multiple requests. However, it is limited in expressivity; more complex or resource-intensive queries should rather use the :ref:`Low-level Java API ` to run them as standalone programs. Terminology ----------- This API uses the following notions: - **Node**: a node in the :ref:`Software Heritage graph `, represented by a :ref:`SWHID `. - **Node type**: the 3-letter specifier from the node SWHID (``cnt``, ``dir``, ``rel``, ``rev``, ``snp``, ``ori``), or ``*`` for all node types. - **Edge type**: a pair ``src:dst`` where ``src`` and ``dst`` are either node types, or ``*`` to denote all node types. - **Edge restrictions**: a textual specification of which edges can be followed during graph traversal. Either ``*`` to denote that all edges can be followed or a comma separated list of edge types to allow following only those edges. Note that when traversing the *backward* (i.e., transposed) graph, edge types are reversed too. So, for instance, ``ori:snp`` makes sense when traversing the forward graph, but is useless (due to lack of matching edges in the graph) when traversing the backward graph; conversely ``snp:ori`` is useful when traversing the backward graph, but not in the forward one. For the same reason ``dir:dir`` allows following edges from parent directories to sub-directories when traversing the forward graph, but the same restriction allows following edges from sub-directories to parent directories when traversing the backward graph. - **Node restrictions**: a textual specification of which type of nodes can be returned after a request. 
Either ``*`` to denote that all types of nodes can be returned or a comma separated list of node types to allow returning only those node types. Examples ~~~~~~~~ - ``swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2`` the SWHID of a node of type content containing the full text of the GPL3 license. - ``swh:1:rev:f39d7d78b70e0f39facb1e4fab77ad3df5c52a35`` the SWHID of a node of type revision corresponding to the commit in Linux that merged the 'x86/urgent' branch on 31 December 2017. - ``"dir:dir,dir:cnt"`` edge restrictions allowing edges from directory nodes to directory nodes, or from directory nodes to content nodes. - ``"rev:rev,dir:*"`` edge restrictions allowing edges from revision nodes to revision nodes, or from directory nodes to nodes of any type. - ``"*:rel"`` edge restrictions allowing all edges to release nodes. - ``"cnt,snp"`` accepted node types returned in the query results. Endpoints --------- Leaves ~~~~~~ .. http:get:: /graph/leaves/:src Performs a graph traversal and returns the leaves of the subgraph rooted at the specified source node. :param string src: source node specified as a SWHID :query string edges: edges types the traversal can follow; default to ``"*"`` :query string direction: direction in which graph edges will be followed; can be either ``forward`` or ``backward``, default to ``forward`` :query integer max_edges: how many edges can be traversed during the visit; default to 0 (not restricted) :query string return_types: only return the nodes matching this type; default to ``"*"`` :statuscode 200: success :statuscode 400: invalid query string provided :statuscode 404: starting node cannot be found **Example:** .. sourcecode:: http GET /graph/leaves/swh:1:dir:432d1b21c1256f7408a07c577b6974bbdbcc1323 HTTP/1.1 Content-Type: text/plain Transfer-Encoding: chunked .. 
sourcecode:: http HTTP/1.1 200 OK swh:1:cnt:540faad6b1e02e2db4f349a4845192db521ff2bd swh:1:cnt:630585fc6d34e5e121139e2aee0a64e83dc9aae6 swh:1:cnt:f8634ced669f0a9155c8cab1b2621d57d778215e swh:1:cnt:ba6daa801ad3ea587904b1abe9161dceedb2e0bd ... Neighbors ~~~~~~~~~ .. http:get:: /graph/neighbors/:src Returns node direct neighbors (linked with exactly one edge) in the graph. :param string src: source node specified as a SWHID :query string edges: edges types allowed to be listed as neighbors; default to ``"*"`` :query string direction: direction in which graph edges will be followed; can be either ``forward`` or ``backward``, default to ``forward`` :query integer max_edges: how many edges can be traversed during the visit; default to 0 (not restricted) :query string return_types: only return the nodes matching this type; default to ``"*"`` :statuscode 200: success :statuscode 400: invalid query string provided :statuscode 404: starting node cannot be found **Example:** .. sourcecode:: http GET /graph/neighbors/swh:1:rev:f39d7d78b70e0f39facb1e4fab77ad3df5c52a35 HTTP/1.1 Content-Type: text/plain Transfer-Encoding: chunked .. sourcecode:: http HTTP/1.1 200 OK swh:1:rev:a31e58e129f73ab5b04016330b13ed51fde7a961 swh:1:dir:b5d2aa0746b70300ebbca82a8132af386cc5986d swh:1:rev:52c90f2d32bfa7d6eccd66a56c44ace1f78fbadd ... Walk ~~~~ .. .. http:get:: /graph/walk/:src/:dst Performs a graph traversal and returns the first found path from source to destination (final destination node included). :param string src: starting node specified as a SWHID :param string dst: destination node, either as a node SWHID or a node type. The traversal will stop at the first node encountered matching the desired destination. 
:query string edges: edges types the traversal can follow; default to ``"*"`` :query string traversal: traversal algorithm; can be either ``dfs`` or ``bfs``, default to ``dfs`` :query string direction: direction in which graph edges will be followed; can be either ``forward`` or ``backward``, default to ``forward`` :query string return_types: types of nodes we want to be displayed; default to ``"*"`` :statuscode 200: success :statuscode 400: invalid query string provided :statuscode 404: starting node cannot be found **Example:** .. sourcecode:: http HTTP/1.1 200 OK swh:1:rev:f39d7d78b70e0f39facb1e4fab77ad3df5c52a35 swh:1:rev:52c90f2d32bfa7d6eccd66a56c44ace1f78fbadd swh:1:rev:cea92e843e40452c08ba313abc39f59efbb4c29c swh:1:rev:8d517bdfb57154b8a11d7f1682ecc0f79abf8e02 ... -.. http:get:: /graph/randomwalk/:src/:dst - - Performs a graph *random* traversal, i.e., picking one random successor - node at each hop, from source to destination (final destination node - included). - - :param string src: starting node specified as a SWHID - :param string dst: destination node, either as a node SWHID or a node type. - The traversal will stop at the first node encountered matching the - desired destination. - - :query string edges: edges types the traversal can follow; default to - ``"*"`` - :query string direction: direction in which graph edges will be followed; - can be either ``forward`` or ``backward``, default to ``forward`` - :query int limit: limit the number of nodes returned. You can use positive - numbers to get the first N results, or negative numbers to get the last - N results starting from the tail; - default to ``0``, meaning no limit. 
- :query integer max_edges: how many edges can be traversed during the visit; - default to 0 (not restricted) - :query string return_types: only return the nodes matching this type; - default to ``"*"`` - - :statuscode 200: success - :statuscode 400: invalid query string provided - :statuscode 404: starting node cannot be found - - **Example:** - - .. sourcecode:: http - - GET /graph/randomwalk/swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2/ori?direction=backward HTTP/1.1 - - Content-Type: text/plain - Transfer-Encoding: chunked - - .. sourcecode:: http - - HTTP/1.1 200 OK - - swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2 - swh:1:dir:8de8a8823a0780524529c94464ee6ef60b98e2ed - swh:1:dir:7146ea6cbd5ffbfec58cc8df5e0552da45e69cb7 - swh:1:rev:b12563e00026b48b817fd3532fc3df2db2a0f460 - swh:1:rev:13e8ebe80fb878bade776131e738d5772aa0ad1b - swh:1:rev:cb39b849f167c70c1f86d4356f02d1285d49ee13 - ... - swh:1:rev:ff70949f336593d6c59b18e4989edf24d7f0f254 - swh:1:snp:a511810642b7795e725033febdd82075064ed863 - swh:1:ori:98aa0e71f5c789b12673717a97f6e9fa20aa1161 - - **Limit example:** - - .. sourcecode:: http - - GET /graph/randomwalk/swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2/ori?direction=backward&limit=-2 HTTP/1.1 - - Content-Type: text/plain - Transfer-Encoding: chunked - - .. sourcecode:: http - - HTTP/1.1 200 OK - - swh:1:ori:98aa0e71f5c789b12673717a97f6e9fa20aa1161 - swh:1:snp:a511810642b7795e725033febdd82075064ed863 - Visit ~~~~~ .. http:get:: /graph/visit/nodes/:src .. http:get:: /graph/visit/edges/:src .. http:get:: /graph/visit/paths/:src Performs a graph traversal and returns explored nodes, edges or paths (in the order of the traversal). 
:param string src: starting node specified as a SWHID :query string edges: edges types the traversal can follow; default to ``"*"`` :query integer max_edges: how many edges can be traversed during the visit; default to 0 (not restricted) :query string return_types: only return the nodes matching this type; default to ``"*"`` :statuscode 200: success :statuscode 400: invalid query string provided :statuscode 404: starting node cannot be found **Example:** .. sourcecode:: http GET /graph/visit/nodes/swh:1:snp:40f9f177b8ab0b7b3d70ee14bbc8b214e2b2dcfc HTTP/1.1 Content-Type: text/plain Transfer-Encoding: chunked .. sourcecode:: http HTTP/1.1 200 OK swh:1:snp:40f9f177b8ab0b7b3d70ee14bbc8b214e2b2dcfc swh:1:rev:cfab784723a6c2d33468c9ed8a566fd5e2abd8c9 swh:1:rev:53e5df0e7a6b7bd4919074c081a173655c0da164 swh:1:rev:f85647f14b8243532283eff3e08f4ee96c35945f swh:1:rev:fe5f9ef854715fc59b9ec22f9878f11498cfcdbf swh:1:dir:644dd466d8ad527ea3a609bfd588a3244e6dafcb swh:1:cnt:c8cece50beae7a954f4ea27e3ae7bf941dc6d0c0 swh:1:dir:a358d0cf89821227d4c00b0ced5e0a8b3756b5db swh:1:cnt:cc407b7e24dd300d2e1a77d8f04af89b3f962a51 swh:1:cnt:701bd0a63e11b3390a547ce8515d28c6bab8a201 ... **Example:** .. sourcecode:: http GET /graph/visit/edges/swh:1:snp:40f9f177b8ab0b7b3d70ee14bbc8b214e2b2dcfc HTTP/1.1 Content-Type: text/plain Transfer-Encoding: chunked .. 
sourcecode:: http HTTP/1.1 200 OK swh:1:snp:40f9f177b8ab0b7b3d70ee14bbc8b214e2b2dcfc swh:1:rev:61f92a7db95f5a6d1fcb94d2b897ed3797584d7b swh:1:snp:40f9f177b8ab0b7b3d70ee14bbc8b214e2b2dcfc swh:1:rev:00e81c89c29ff3e58745fdaf7abb68daa1389e85 swh:1:snp:40f9f177b8ab0b7b3d70ee14bbc8b214e2b2dcfc swh:1:rev:7596fdc31c9aa00aed281ccb026a74cabf2383bb swh:1:snp:40f9f177b8ab0b7b3d70ee14bbc8b214e2b2dcfc swh:1:rev:ec7a2341ac3d9d8b571bbdfb90a089d4e54dea56 swh:1:snp:40f9f177b8ab0b7b3d70ee14bbc8b214e2b2dcfc swh:1:rev:1c5b5eac61eda2454034a43eb124ab490885ef3a swh:1:snp:40f9f177b8ab0b7b3d70ee14bbc8b214e2b2dcfc swh:1:rev:4dfa88ca55e04e8afe05e8543ddddee32dde7236 swh:1:snp:40f9f177b8ab0b7b3d70ee14bbc8b214e2b2dcfc swh:1:rev:d56ae79e43ff1b37534370911c8a78ec7f38d437 swh:1:snp:40f9f177b8ab0b7b3d70ee14bbc8b214e2b2dcfc swh:1:rev:19ba5d6203a040a39ecc4a77b165d3f097c1e662 swh:1:snp:40f9f177b8ab0b7b3d70ee14bbc8b214e2b2dcfc swh:1:rev:9c56102eefea23c95405533e1de23da4b873ecc4 swh:1:snp:40f9f177b8ab0b7b3d70ee14bbc8b214e2b2dcfc swh:1:rev:3f54e816b46c2e179cd164e17fea93b3013a9db4 ... **Example:** .. sourcecode:: http GET /graph/visit/paths/swh:1:dir:644dd466d8ad527ea3a609bfd588a3244e6dafcb HTTP/1.1 Content-Type: application/x-ndjson Transfer-Encoding: chunked .. sourcecode:: http HTTP/1.1 200 OK ["swh:1:dir:644dd466d8ad527ea3a609bfd588a3244e6dafcb", "swh:1:cnt:acfb7cabd63b368a03a9df87670ece1488c8bce0"] ["swh:1:dir:644dd466d8ad527ea3a609bfd588a3244e6dafcb", "swh:1:cnt:2a0837708151d76edf28fdbb90dc3eabc676cff3"] ["swh:1:dir:644dd466d8ad527ea3a609bfd588a3244e6dafcb", "swh:1:cnt:eaf025ad54b94b2fdda26af75594cfae3491ec75"] ... ["swh:1:dir:644dd466d8ad527ea3a609bfd588a3244e6dafcb", "swh:1:dir:2ebd4b96fa5665ff74f2b27ae41aecdc43af4463", "swh:1:cnt:1d3b6575fb7bf2a147d228e78ffd77ea193c3639"] ... 
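As a rough illustration of how a client might consume the three ``visit`` response formats shown above, the following Python sketch parses each of them from an iterable of text lines. The helper names (``parse_nodes``, ``parse_edges``, ``parse_paths``) are hypothetical and are not part of ``swh-graph``. The sketch assumes that ``visit/nodes`` emits one SWHID per line, that ``visit/edges`` emits one whitespace-separated source/destination pair per line, and that ``visit/paths`` emits one JSON array per line, as the examples suggest.

```python
import json
from typing import Iterable, Iterator, List, Tuple


def parse_nodes(body: Iterable[str]) -> Iterator[str]:
    """Yield one SWHID per non-empty line of a visit/nodes response body."""
    for line in body:
        line = line.strip()
        if line:
            yield line


def parse_edges(body: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (source, destination) pairs from a visit/edges response body.

    Assumption: each line holds exactly two whitespace-separated SWHIDs.
    """
    for line in parse_nodes(body):
        src, dst = line.split()
        yield src, dst


def parse_paths(body: Iterable[str]) -> Iterator[List[str]]:
    """Yield one path (a list of SWHIDs) per NDJSON line of visit/paths."""
    for line in parse_nodes(body):
        yield json.loads(line)
```

Because the helpers only need an iterable of lines, they could be fed from a streamed (chunked) HTTP response without holding the whole result in memory.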
Counting results ~~~~~~~~~~~~~~~~ The following method variants, with a ``/count`` path component added, behave like their already discussed counterparts but, instead of returning results, return the *number* of results that would have been returned: .. http:get:: /graph/leaves/count/:src Return the number of :http:get:`/graph/leaves/:src` results .. http:get:: /graph/neighbors/count/:src Return the number of :http:get:`/graph/neighbors/:src` results .. http:get:: /graph/visit/nodes/count/:src Return the number of :http:get:`/graph/visit/nodes/:src` results Stats ~~~~~ .. http:get:: /graph/stats Returns statistics on the compressed graph. :statuscode 200: success **Example:** .. sourcecode:: http GET /graph/stats HTTP/1.1 Content-Type: application/json .. sourcecode:: http HTTP/1.1 200 OK { "counts": { "nodes": 16222788, "edges": 9907464 }, "ratios": { "compression": 0.367, "bits_per_node": 5.846, "bits_per_edge": 9.573, "avg_locality": 270.369 }, "indegree": { "min": 0, "max": 12382, "avg": 0.6107127825377487 }, "outdegree": { "min": 0, "max": 1, "avg": 0.6107127825377487 } } Use-case examples ----------------- This section showcases how to leverage the endpoints of the HTTP API described above for some common use-cases. Browsing ~~~~~~~~ The following use cases require traversing the *forward graph*. - **ls**: given a directory node, list (non recursively) all linked nodes of type directory and content Endpoint:: /graph/neighbors/:DIR_ID?edges=dir:cnt,dir:dir - **ls -R**: given a directory node, recursively list all linked nodes of type directory and content Endpoint:: /graph/visit/paths/:DIR_ID?edges=dir:cnt,dir:dir - **git log**: given a revision node, recursively list all linked nodes of type revision Endpoint:: /graph/visit/nodes/:REV_ID?edges=rev:rev Vault ~~~~~ The following use cases require traversing the *forward graph*. 
- **tarball** (same as *ls -R* above) - **git bundle**: given a node, recursively list all linked nodes of any kind Endpoint:: /graph/visit/nodes/:NODE_ID?edges=* Provenance ~~~~~~~~~~ The following use cases require traversing the *backward (transposed) graph*. - **commit provenance**: given a content or directory node, return *a* commit whose directory (recursively) contains it Endpoint:: /graph/walk/:NODE_ID/rev?direction=backward&edges=dir:dir,cnt:dir,dir:rev - **complete commit provenance**: given a content or directory node, return *all* commits whose directory (recursively) contains it Endpoint:: /graph/leaves/:NODE_ID?direction=backward&edges=dir:dir,cnt:dir,dir:rev - **origin provenance**: given a content, directory, or commit node, return *an* origin that has at least one snapshot that (recursively) contains it Endpoint:: /graph/walk/:NODE_ID/ori?direction=backward&edges=* - **complete origin provenance**: given a content, directory, or commit node, return *all* origins that have at least one snapshot that (recursively) contains it Endpoint:: /graph/leaves/:NODE_ID?direction=backward&edges=* Provenance statistics ~~~~~~~~~~~~~~~~~~~~~ The following use cases require traversing the *backward (transposed) graph*. - **content popularity across commits**: count the number of commits (or *commit popularity*) that link to a directory that (recursively) includes a given content. Endpoint:: /graph/leaves/count/:NODE_ID?direction=backward&edges=cnt:dir,dir:dir,dir:rev - **commit popularity across origins**: count the number of origins (or *origin popularity*) that have a snapshot that (recursively) includes a given commit. Endpoint:: /graph/leaves/count/:NODE_ID?direction=backward&edges=* The following use cases require traversing the *forward graph*. - **revision size** (as n. of contents) distribution: count the number of contents that are (recursively) reachable from a given revision. Endpoint:: /graph/leaves/count/:NODE_ID?edges=* - **origin size** (as n. 
of revisions) distribution: count the number of revisions that are (recursively) reachable from a given origin. Endpoint:: /graph/leaves/count/:NODE_ID?edges=ori:snp,snp:rel,snp:rev,rel:rev,rev:rev diff --git a/swh.graph.egg-info/PKG-INFO b/swh.graph.egg-info/PKG-INFO index 26a7abd..c5989bd 100644 --- a/swh.graph.egg-info/PKG-INFO +++ b/swh.graph.egg-info/PKG-INFO @@ -1,52 +1,52 @@ Metadata-Version: 2.1 Name: swh.graph -Version: 1.0.1 +Version: 1.0.2 Summary: Software Heritage graph service Home-page: https://forge.softwareheritage.org/diffusion/DGRPH Author: Software Heritage developers Author-email: swh-devel@inria.fr Project-URL: Bug Reports, https://forge.softwareheritage.org/maniphest Project-URL: Funding, https://www.softwareheritage.org/donate Project-URL: Source, https://forge.softwareheritage.org/source/swh-graph Project-URL: Documentation, https://docs.softwareheritage.org/devel/swh-graph/ Classifier: Programming Language :: Python :: 3 Classifier: Intended Audience :: Developers Classifier: License :: OSI Approved :: GNU General Public License v3 (GPLv3) Classifier: Operating System :: OS Independent Classifier: Development Status :: 3 - Alpha Requires-Python: >=3.7 Description-Content-Type: text/x-rst Provides-Extra: testing License-File: LICENSE License-File: AUTHORS Software Heritage - graph service ================================= Tooling and services, collectively known as ``swh-graph``, providing fast access to the graph representation of the `Software Heritage `_ `archive `_. The service is in-memory, based on a compressed representation of the Software Heritage Merkle DAG. Bibliography ------------ In addition to accompanying technical documentation, ``swh-graph`` is also described in the following scientific paper. If you publish results based on ``swh-graph``, please acknowledge it by citing the paper as follows: .. note:: Paolo Boldi, Antoine Pietri, Sebastiano Vigna, Stefano Zacchiroli. 
`Ultra-Large-Scale Repository Analysis via Graph Compression `_. In proceedings of `SANER 2020 `_: The 27th IEEE International Conference on Software Analysis, Evolution and Reengineering, pages 184-194. IEEE 2020. Links: `preprint `_, `bibtex `_. diff --git a/swh.graph.egg-info/SOURCES.txt b/swh.graph.egg-info/SOURCES.txt index 9511cb7..bf77475 100644 --- a/swh.graph.egg-info/SOURCES.txt +++ b/swh.graph.egg-info/SOURCES.txt @@ -1,256 +1,258 @@ .git-blame-ignore-revs .gitignore .pre-commit-config.yaml AUTHORS CODE_OF_CONDUCT.md CONTRIBUTORS LICENSE MANIFEST.in Makefile Makefile.local README.rst mypy.ini pyproject.toml pytest.ini requirements-swh.txt requirements-test.txt requirements.txt setup.cfg setup.py tox.ini docker/Dockerfile docker/build.sh docker/run.sh docs/.gitignore docs/Makefile docs/Makefile.local docs/README.rst docs/api.rst docs/cli.rst docs/compression.rst docs/conf.py docs/docker.rst docs/git2graph.md docs/grpc-api.rst docs/index.rst docs/java-api.rst docs/memory.rst docs/quickstart.rst docs/_static/.placeholder docs/_templates/.placeholder docs/images/.gitignore docs/images/Makefile docs/images/compression_steps.dot java/.coding-style.xml java/.gitignore java/AUTHORS java/LICENSE java/README.md java/pom.xml java/.mvn/jvm.config java/src/main/proto java/src/main/java/org/softwareheritage/graph/AllowedEdges.java java/src/main/java/org/softwareheritage/graph/AllowedNodes.java java/src/main/java/org/softwareheritage/graph/SWHID.java java/src/main/java/org/softwareheritage/graph/Subgraph.java java/src/main/java/org/softwareheritage/graph/SwhBidirectionalGraph.java java/src/main/java/org/softwareheritage/graph/SwhGraph.java java/src/main/java/org/softwareheritage/graph/SwhGraphProperties.java java/src/main/java/org/softwareheritage/graph/SwhType.java java/src/main/java/org/softwareheritage/graph/SwhUnidirectionalGraph.java java/src/main/java/org/softwareheritage/graph/compress/CSVEdgeDataset.java 
java/src/main/java/org/softwareheritage/graph/compress/ComposePermutations.java java/src/main/java/org/softwareheritage/graph/compress/ExtractNodes.java java/src/main/java/org/softwareheritage/graph/compress/ExtractPersons.java java/src/main/java/org/softwareheritage/graph/compress/GraphDataset.java java/src/main/java/org/softwareheritage/graph/compress/LabelMapBuilder.java java/src/main/java/org/softwareheritage/graph/compress/NodeMapBuilder.java java/src/main/java/org/softwareheritage/graph/compress/ORCGraphDataset.java java/src/main/java/org/softwareheritage/graph/compress/ScatteredArcsORCGraph.java java/src/main/java/org/softwareheritage/graph/compress/WriteNodeProperties.java java/src/main/java/org/softwareheritage/graph/experiments/forks/ForkCC.java java/src/main/java/org/softwareheritage/graph/experiments/forks/ForkCliques.java java/src/main/java/org/softwareheritage/graph/experiments/forks/ListEmptyOrigins.java java/src/main/java/org/softwareheritage/graph/experiments/topology/AveragePaths.java java/src/main/java/org/softwareheritage/graph/experiments/topology/ClusteringCoefficient.java java/src/main/java/org/softwareheritage/graph/experiments/topology/ConnectedComponents.java java/src/main/java/org/softwareheritage/graph/experiments/topology/InOutDegree.java java/src/main/java/org/softwareheritage/graph/experiments/topology/SubdatasetSizeFunction.java java/src/main/java/org/softwareheritage/graph/labels/DirEntry.java java/src/main/java/org/softwareheritage/graph/labels/SwhLabel.java java/src/main/java/org/softwareheritage/graph/maps/NodeIdMap.java java/src/main/java/org/softwareheritage/graph/maps/NodeTypesMap.java java/src/main/java/org/softwareheritage/graph/rpc/GraphServer.java java/src/main/java/org/softwareheritage/graph/rpc/NodePropertyBuilder.java java/src/main/java/org/softwareheritage/graph/rpc/Traversal.java java/src/main/java/org/softwareheritage/graph/utils/DumpProperties.java 
java/src/main/java/org/softwareheritage/graph/utils/ExportSubdataset.java java/src/main/java/org/softwareheritage/graph/utils/FindEarliestRevision.java java/src/main/java/org/softwareheritage/graph/utils/ForkJoinBigQuickSort2.java java/src/main/java/org/softwareheritage/graph/utils/ForkJoinQuickSort3.java java/src/main/java/org/softwareheritage/graph/utils/MPHTranslate.java java/src/main/java/org/softwareheritage/graph/utils/ReadGraph.java java/src/main/java/org/softwareheritage/graph/utils/ReadLabelledGraph.java java/src/main/java/org/softwareheritage/graph/utils/Sort.java java/src/test/java/org/softwareheritage/graph/AllowedEdgesTest.java java/src/test/java/org/softwareheritage/graph/AllowedNodesTest.java java/src/test/java/org/softwareheritage/graph/GraphTest.java java/src/test/java/org/softwareheritage/graph/SubgraphTest.java java/src/test/java/org/softwareheritage/graph/compress/ExtractNodesTest.java java/src/test/java/org/softwareheritage/graph/compress/ExtractPersonsTest.java java/src/test/java/org/softwareheritage/graph/rpc/CountEdgesTest.java java/src/test/java/org/softwareheritage/graph/rpc/CountNodesTest.java java/src/test/java/org/softwareheritage/graph/rpc/FindPathBetweenTest.java java/src/test/java/org/softwareheritage/graph/rpc/FindPathToTest.java java/src/test/java/org/softwareheritage/graph/rpc/GetNodeTest.java java/src/test/java/org/softwareheritage/graph/rpc/StatsTest.java java/src/test/java/org/softwareheritage/graph/rpc/TraversalServiceTest.java java/src/test/java/org/softwareheritage/graph/rpc/TraverseLeavesTest.java java/src/test/java/org/softwareheritage/graph/rpc/TraverseNeighborsTest.java java/src/test/java/org/softwareheritage/graph/rpc/TraverseNodesPropertiesTest.java java/src/test/java/org/softwareheritage/graph/rpc/TraverseNodesTest.java java/src/test/java/org/softwareheritage/graph/utils/ForkJoinBigQuickSort2Test.java java/src/test/java/org/softwareheritage/graph/utils/ForkJoinQuickSort3Test.java -java/target/swh-graph-1.0.1.jar 
+java/target/swh-graph-1.0.2.jar proto/swhgraph.proto reports/.gitignore reports/benchmarks/Makefile reports/benchmarks/benchmarks.tex reports/experiments/Makefile reports/experiments/experiments.tex reports/linux_log/LinuxLog.java reports/linux_log/Makefile reports/linux_log/linux_log.tex reports/node_mapping/Makefile reports/node_mapping/NodeIdMapHaloDB.java reports/node_mapping/NodeIdMapRocksDB.java reports/node_mapping/node_mapping.tex swh/__init__.py swh.graph.egg-info/PKG-INFO swh.graph.egg-info/SOURCES.txt swh.graph.egg-info/dependency_links.txt swh.graph.egg-info/entry_points.txt swh.graph.egg-info/requires.txt swh.graph.egg-info/top_level.txt swh/graph/__init__.py swh/graph/cli.py swh/graph/client.py swh/graph/config.py swh/graph/http_client.py swh/graph/http_naive_client.py swh/graph/http_server.py swh/graph/naive_client.py swh/graph/py.typed swh/graph/rpc_server.py swh/graph/webgraph.py swh/graph/rpc/swhgraph.proto swh/graph/rpc/swhgraph_pb2.py swh/graph/rpc/swhgraph_pb2.pyi swh/graph/rpc/swhgraph_pb2_grpc.py swh/graph/tests/__init__.py swh/graph/tests/conftest.py swh/graph/tests/test_cli.py +swh/graph/tests/test_grpc.py swh/graph/tests/test_http_client.py +swh/graph/tests/test_http_server_down.py swh/graph/tests/dataset/generate_dataset.py swh/graph/tests/dataset/compressed/example-labelled.labeloffsets swh/graph/tests/dataset/compressed/example-labelled.labels swh/graph/tests/dataset/compressed/example-labelled.properties swh/graph/tests/dataset/compressed/example-transposed-labelled.labeloffsets swh/graph/tests/dataset/compressed/example-transposed-labelled.labels swh/graph/tests/dataset/compressed/example-transposed-labelled.properties swh/graph/tests/dataset/compressed/example-transposed.graph swh/graph/tests/dataset/compressed/example-transposed.obl swh/graph/tests/dataset/compressed/example-transposed.offsets swh/graph/tests/dataset/compressed/example-transposed.properties swh/graph/tests/dataset/compressed/example.edges.count.txt 
swh/graph/tests/dataset/compressed/example.edges.stats.txt swh/graph/tests/dataset/compressed/example.graph swh/graph/tests/dataset/compressed/example.indegree swh/graph/tests/dataset/compressed/example.labels.count.txt swh/graph/tests/dataset/compressed/example.labels.csv.zst swh/graph/tests/dataset/compressed/example.labels.fcl.bytearray swh/graph/tests/dataset/compressed/example.labels.fcl.pointers swh/graph/tests/dataset/compressed/example.labels.fcl.properties swh/graph/tests/dataset/compressed/example.labels.mph swh/graph/tests/dataset/compressed/example.mph swh/graph/tests/dataset/compressed/example.node2swhid.bin swh/graph/tests/dataset/compressed/example.node2type.map swh/graph/tests/dataset/compressed/example.nodes.count.txt swh/graph/tests/dataset/compressed/example.nodes.csv.zst swh/graph/tests/dataset/compressed/example.nodes.stats.txt swh/graph/tests/dataset/compressed/example.obl swh/graph/tests/dataset/compressed/example.offsets swh/graph/tests/dataset/compressed/example.order swh/graph/tests/dataset/compressed/example.outdegree swh/graph/tests/dataset/compressed/example.persons.count.txt swh/graph/tests/dataset/compressed/example.persons.csv.zst swh/graph/tests/dataset/compressed/example.persons.mph swh/graph/tests/dataset/compressed/example.properties swh/graph/tests/dataset/compressed/example.property.author_id.bin swh/graph/tests/dataset/compressed/example.property.author_timestamp.bin swh/graph/tests/dataset/compressed/example.property.author_timestamp_offset.bin swh/graph/tests/dataset/compressed/example.property.committer_id.bin swh/graph/tests/dataset/compressed/example.property.committer_timestamp.bin swh/graph/tests/dataset/compressed/example.property.committer_timestamp_offset.bin swh/graph/tests/dataset/compressed/example.property.content.is_skipped.bin swh/graph/tests/dataset/compressed/example.property.content.length.bin swh/graph/tests/dataset/compressed/example.property.message.bin 
swh/graph/tests/dataset/compressed/example.property.message.offset.bin swh/graph/tests/dataset/compressed/example.property.tag_name.bin swh/graph/tests/dataset/compressed/example.property.tag_name.offset.bin swh/graph/tests/dataset/compressed/example.stats swh/graph/tests/dataset/edges/content/graph-all.edges.csv.zst swh/graph/tests/dataset/edges/content/graph-all.nodes.csv.zst swh/graph/tests/dataset/edges/directory/graph-all.edges.csv.zst swh/graph/tests/dataset/edges/directory/graph-all.nodes.csv.zst swh/graph/tests/dataset/edges/origin/graph-all.edges.csv.zst swh/graph/tests/dataset/edges/origin/graph-all.nodes.csv.zst swh/graph/tests/dataset/edges/release/graph-all.edges.csv.zst swh/graph/tests/dataset/edges/release/graph-all.nodes.csv.zst swh/graph/tests/dataset/edges/revision/graph-all.edges.csv.zst swh/graph/tests/dataset/edges/revision/graph-all.nodes.csv.zst swh/graph/tests/dataset/edges/snapshot/graph-all.edges.csv.zst swh/graph/tests/dataset/edges/snapshot/graph-all.nodes.csv.zst swh/graph/tests/dataset/img/.gitignore swh/graph/tests/dataset/img/Makefile swh/graph/tests/dataset/img/example.dot swh/graph/tests/dataset/orc/content/content-all.orc swh/graph/tests/dataset/orc/directory/directory-all.orc swh/graph/tests/dataset/orc/directory_entry/directory_entry-all.orc swh/graph/tests/dataset/orc/origin/origin-all.orc swh/graph/tests/dataset/orc/origin_visit/origin_visit-all.orc swh/graph/tests/dataset/orc/origin_visit_status/origin_visit_status-all.orc swh/graph/tests/dataset/orc/release/release-all.orc swh/graph/tests/dataset/orc/revision/revision-all.orc swh/graph/tests/dataset/orc/revision_extra_headers/revision_extra_headers-all.orc swh/graph/tests/dataset/orc/revision_history/revision_history-all.orc swh/graph/tests/dataset/orc/skipped_content/skipped_content-all.orc swh/graph/tests/dataset/orc/snapshot/snapshot-all.orc swh/graph/tests/dataset/orc/snapshot_branch/snapshot_branch-all.orc tools/dir2graph tools/swhid2int2int2swhid.sh 
tools/git2graph/.gitignore tools/git2graph/Makefile tools/git2graph/README.md tools/git2graph/git2graph.c tools/git2graph/tests/edge-filters.bats tools/git2graph/tests/full-graph.bats tools/git2graph/tests/node-filters.bats tools/git2graph/tests/repo_helper.bash tools/git2graph/tests/data/sample-repo.tgz tools/git2graph/tests/data/graphs/dir-nodes/edges.csv tools/git2graph/tests/data/graphs/dir-nodes/nodes.csv tools/git2graph/tests/data/graphs/from-dir-edges/edges.csv tools/git2graph/tests/data/graphs/from-dir-edges/nodes.csv tools/git2graph/tests/data/graphs/from-rel-edges/edges.csv tools/git2graph/tests/data/graphs/from-rel-edges/nodes.csv tools/git2graph/tests/data/graphs/fs-nodes/edges.csv tools/git2graph/tests/data/graphs/fs-nodes/nodes.csv tools/git2graph/tests/data/graphs/full/edges.csv tools/git2graph/tests/data/graphs/full/nodes.csv tools/git2graph/tests/data/graphs/rev-edges/edges.csv tools/git2graph/tests/data/graphs/rev-edges/nodes.csv tools/git2graph/tests/data/graphs/rev-nodes/edges.csv tools/git2graph/tests/data/graphs/rev-nodes/nodes.csv tools/git2graph/tests/data/graphs/to-rev-edges/edges.csv tools/git2graph/tests/data/graphs/to-rev-edges/nodes.csv \ No newline at end of file diff --git a/swh/graph/http_server.py b/swh/graph/http_server.py index 84192ac..53cf1bc 100644 --- a/swh/graph/http_server.py +++ b/swh/graph/http_server.py @@ -1,349 +1,372 @@ -# Copyright (C) 2019-2020 The Software Heritage developers +# Copyright (C) 2019-2022 The Software Heritage developers # See the AUTHORS file at the top-level directory of this distribution # License: GNU General Public License version 3, or any later version # See top-level LICENSE file for more information """ A proxy HTTP server for swh-graph, talking to the Java code via py4j, and using FIFO as a transport to stream integers between the two languages. 
""" import json import os from typing import Optional import aiohttp.test_utils import aiohttp.web from google.protobuf import json_format from google.protobuf.field_mask_pb2 import FieldMask import grpc from swh.core.api.asynchronous import RPCServerApp from swh.core.config import read as config_read from swh.graph.rpc.swhgraph_pb2 import ( GetNodeRequest, NodeFilter, StatsRequest, TraversalRequest, ) from swh.graph.rpc.swhgraph_pb2_grpc import TraversalServiceStub from swh.graph.rpc_server import spawn_java_rpc_server, stop_java_rpc_server from swh.model.swhids import EXTENDED_SWHID_TYPES try: from contextlib import asynccontextmanager except ImportError: # Compatibility with 3.6 backport from async_generator import asynccontextmanager # type: ignore # maximum number of retries for random walks RANDOM_RETRIES = 10 # TODO make this configurable via rpc-serve configuration +async def _aiorpcerror_middleware(app, handler): + async def middleware_handler(request): + try: + return await handler(request) + except grpc.aio.AioRpcError as e: + # The default error handler of the RPC framework tries to serialize this + # with msgpack; which for some unknown reason causes it to raise + # ValueError("recursion limit exceeded") with a lot of context, causing + # Sentry to be overflowed with gigabytes of logs (160KB per event, with + # potentially hundreds of thousands of events per day). + # Instead, we simply serialize the exception to a string. 
+ # https://sentry.softwareheritage.org/share/issue/d6d4db971e4b47728a6c1dd06cb9b8a5/ + raise aiohttp.web.HTTPServiceUnavailable(text=str(e)) + + return middleware_handler + + class GraphServerApp(RPCServerApp): - def __init__(self, *args, **kwargs): - super().__init__(*args, **kwargs) + def __init__(self, *args, middlewares=(), **kwargs): + middlewares = (_aiorpcerror_middleware,) + middlewares + super().__init__(*args, middlewares=middlewares, **kwargs) self.on_startup.append(self._start) self.on_shutdown.append(self._stop) @staticmethod async def _start(app): app["channel"] = grpc.aio.insecure_channel(app["rpc_url"]) await app["channel"].__aenter__() app["rpc_client"] = TraversalServiceStub(app["channel"]) await app["rpc_client"].Stats(StatsRequest(), wait_for_ready=True) @staticmethod async def _stop(app): await app["channel"].__aexit__(None, None, None) if app.get("local_server"): stop_java_rpc_server(app["local_server"]) async def index(request): return aiohttp.web.Response( content_type="text/html", body=""" Software Heritage graph server

You have reached the Software Heritage graph API server.

See its API documentation for more information.

""", ) class GraphView(aiohttp.web.View): """Base class for views working on the graph, with utility functions""" def __init__(self, *args, **kwargs): super().__init__(*args, **kwargs) self.rpc_client: TraversalServiceStub = self.request.app["rpc_client"] def get_direction(self): """Validate HTTP query parameter `direction`""" s = self.request.query.get("direction", "forward") if s not in ("forward", "backward"): raise aiohttp.web.HTTPBadRequest(text=f"invalid direction: {s}") return s.upper() def get_edges(self): """Validate HTTP query parameter `edges`, i.e., edge restrictions""" s = self.request.query.get("edges", "*") if any( [ node_type != "*" and node_type not in EXTENDED_SWHID_TYPES for edge in s.split(":") for node_type in edge.split(",", maxsplit=1) ] ): raise aiohttp.web.HTTPBadRequest(text=f"invalid edge restriction: {s}") return s def get_return_types(self): """Validate HTTP query parameter 'return types', i.e, a set of types which we will filter the query results with""" s = self.request.query.get("return_types", "*") if any( node_type != "*" and node_type not in EXTENDED_SWHID_TYPES for node_type in s.split(",") ): raise aiohttp.web.HTTPBadRequest( text=f"invalid type for filtering res: {s}" ) # if the user puts a star, # then we filter nothing, we don't need the other information if "*" in s: return "*" else: return s def get_traversal(self): """Validate HTTP query parameter `traversal`, i.e., visit order""" s = self.request.query.get("traversal", "dfs") if s not in ("bfs", "dfs"): raise aiohttp.web.HTTPBadRequest(text=f"invalid traversal order: {s}") return s def get_limit(self): """Validate HTTP query parameter `limit`, i.e., number of results""" s = self.request.query.get("limit", "0") try: return int(s) except ValueError: raise aiohttp.web.HTTPBadRequest(text=f"invalid limit value: {s}") def get_max_edges(self): """Validate HTTP query parameter 'max_edges', i.e., the limit of the number of edges that can be visited""" s = 
self.request.query.get("max_edges", "0") try: return int(s) except ValueError: raise aiohttp.web.HTTPBadRequest(text=f"invalid max_edges value: {s}") async def check_swhid(self, swhid): """Validate that the given SWHID exists in the graph""" try: await self.rpc_client.GetNode( GetNodeRequest(swhid=swhid, mask=FieldMask(paths=["swhid"])) ) except grpc.aio.AioRpcError as e: if e.code() == grpc.StatusCode.INVALID_ARGUMENT: raise aiohttp.web.HTTPBadRequest(text=str(e.details())) class StreamingGraphView(GraphView): """Base class for views streaming their response line by line.""" content_type = "text/plain" @asynccontextmanager async def response_streamer(self, *args, **kwargs): """Context manager to prepare then close a StreamResponse""" response = aiohttp.web.StreamResponse(*args, **kwargs) response.content_type = self.content_type await response.prepare(self.request) yield response await response.write_eof() async def get(self): await self.prepare_response() async with self.response_streamer() as self.response_stream: self._buf = [] try: await self.stream_response() finally: await self._flush_buffer() return self.response_stream async def prepare_response(self): """This can be overridden with some setup to be run before the response actually starts streaming. """ pass async def stream_response(self): """Override this to perform the response streaming. Implementations of this should await self.stream_line(line) to write each line. 
""" raise NotImplementedError async def stream_line(self, line): """Write a line in the response stream.""" self._buf.append(line) if len(self._buf) > 100: await self._flush_buffer() async def _flush_buffer(self): await self.response_stream.write("\n".join(self._buf).encode() + b"\n") self._buf = [] class StatsView(GraphView): """View showing some statistics on the graph""" async def get(self): res = await self.rpc_client.Stats(StatsRequest()) stats = json_format.MessageToDict( res, including_default_value_fields=True, preserving_proto_field_name=True ) # Int64 fields are serialized as strings by default. for descriptor in res.DESCRIPTOR.fields: if descriptor.type == descriptor.TYPE_INT64: try: stats[descriptor.name] = int(stats[descriptor.name]) except KeyError: pass json_body = json.dumps(stats, indent=4, sort_keys=True) return aiohttp.web.Response(body=json_body, content_type="application/json") class SimpleTraversalView(StreamingGraphView): """Base class for views of simple traversals""" async def prepare_response(self): src = self.request.match_info["src"] self.traversal_request = TraversalRequest( src=[src], edges=self.get_edges(), direction=self.get_direction(), return_nodes=NodeFilter(types=self.get_return_types()), mask=FieldMask(paths=["swhid"]), ) if self.get_max_edges(): self.traversal_request.max_edges = self.get_max_edges() await self.check_swhid(src) self.configure_request() + self.nodes_stream = self.rpc_client.Traverse(self.traversal_request) + + # Force gRPC to query the server and fetch the first nodes; so errors + # are raised early, so we can return HTTP 503 before HTTP 200 + await self.nodes_stream.wait_for_connection() def configure_request(self): pass async def stream_response(self): - async for node in self.rpc_client.Traverse(self.traversal_request): + async for node in self.nodes_stream: await self.stream_line(node.swhid) class LeavesView(SimpleTraversalView): def configure_request(self): 
self.traversal_request.return_nodes.max_traversal_successors = 0 class NeighborsView(SimpleTraversalView): def configure_request(self): self.traversal_request.min_depth = 1 self.traversal_request.max_depth = 1 class VisitNodesView(SimpleTraversalView): pass class VisitEdgesView(SimpleTraversalView): def configure_request(self): self.traversal_request.mask.paths.extend(["successor", "successor.swhid"]) # self.traversal_request.return_fields.successor = True async def stream_response(self): async for node in self.rpc_client.Traverse(self.traversal_request): for succ in node.successor: await self.stream_line(node.swhid + " " + succ.swhid) class CountView(GraphView): """Base class for counting views.""" count_type: Optional[str] = None async def get(self): src = self.request.match_info["src"] self.traversal_request = TraversalRequest( src=[src], edges=self.get_edges(), direction=self.get_direction(), return_nodes=NodeFilter(types=self.get_return_types()), mask=FieldMask(paths=["swhid"]), ) if self.get_max_edges(): self.traversal_request.max_edges = self.get_max_edges() self.configure_request() res = await self.rpc_client.CountNodes(self.traversal_request) return aiohttp.web.Response( body=str(res.count), content_type="application/json" ) def configure_request(self): pass class CountNeighborsView(CountView): def configure_request(self): self.traversal_request.min_depth = 1 self.traversal_request.max_depth = 1 class CountLeavesView(CountView): def configure_request(self): self.traversal_request.return_nodes.max_traversal_successors = 0 class CountVisitNodesView(CountView): pass def make_app(config=None, rpc_url=None, spawn_rpc_port=50091, **kwargs): app = GraphServerApp(**kwargs) if rpc_url is None: app["local_server"], port = spawn_java_rpc_server(config, port=spawn_rpc_port) rpc_url = f"localhost:{port}" app.add_routes( [ aiohttp.web.get("/", index), aiohttp.web.get("/graph", index), aiohttp.web.view("/graph/stats", StatsView), aiohttp.web.view("/graph/leaves/{src}", 
LeavesView), aiohttp.web.view("/graph/neighbors/{src}", NeighborsView), aiohttp.web.view("/graph/visit/nodes/{src}", VisitNodesView), aiohttp.web.view("/graph/visit/edges/{src}", VisitEdgesView), aiohttp.web.view("/graph/neighbors/count/{src}", CountNeighborsView), aiohttp.web.view("/graph/leaves/count/{src}", CountLeavesView), aiohttp.web.view("/graph/visit/nodes/count/{src}", CountVisitNodesView), ] ) app["rpc_url"] = rpc_url return app def make_app_from_configfile(): """Load configuration and then build application to run""" config_file = os.environ.get("SWH_CONFIG_FILENAME") config = config_read(config_file) return make_app(config=config) diff --git a/swh/graph/rpc_server.py b/swh/graph/rpc_server.py index 540fc5d..f6e1b4b 100644 --- a/swh/graph/rpc_server.py +++ b/swh/graph/rpc_server.py @@ -1,47 +1,48 @@ # Copyright (C) 2021 The Software Heritage developers # See the AUTHORS file at the top-level directory of this distribution # License: GNU General Public License version 3, or any later version # See top-level LICENSE file for more information """ A simple tool to start the swh-graph GRPC server in Java. 
""" import logging +import shlex import subprocess import aiohttp.test_utils import aiohttp.web from swh.graph.config import check_config def spawn_java_rpc_server(config, port=None): if port is None: port = aiohttp.test_utils.unused_port() config = check_config(config or {}) cmd = [ "java", "-cp", config["classpath"], *config["java_tool_options"].split(), "org.softwareheritage.graph.rpc.GraphServer", "--port", str(port), str(config["graph"]["path"]), ] print(cmd) # XXX: shlex.join() is in 3.8 # logging.info("Starting RPC server: %s", shlex.join(cmd)) - logging.info("Starting RPC server: %s", str(cmd)) + logging.info("Starting RPC server: %s", " ".join(shlex.quote(x) for x in cmd)) server = subprocess.Popen(cmd) return server, port def stop_java_rpc_server(server: subprocess.Popen, timeout: int = 15): server.terminate() try: server.wait(timeout=timeout) except subprocess.TimeoutExpired: logging.warning("Server did not terminate, sending kill signal...") server.kill() diff --git a/swh/graph/tests/conftest.py b/swh/graph/tests/conftest.py index 3d86602..6e832af 100644 --- a/swh/graph/tests/conftest.py +++ b/swh/graph/tests/conftest.py @@ -1,70 +1,107 @@ -# Copyright (C) 2019-2021 The Software Heritage developers +# Copyright (C) 2019-2022 The Software Heritage developers # See the AUTHORS file at the top-level directory of this distribution # License: GNU General Public License version 3, or any later version # See top-level LICENSE file for more information import multiprocessing from pathlib import Path import subprocess from aiohttp.test_utils import TestClient, TestServer, loop_context +import grpc import pytest from swh.graph.http_client import RemoteGraphClient from swh.graph.http_naive_client import NaiveClient +from swh.graph.rpc.swhgraph_pb2_grpc import TraversalServiceStub SWH_GRAPH_TESTS_ROOT = Path(__file__).parents[0] TEST_GRAPH_PATH = SWH_GRAPH_TESTS_ROOT / "dataset/compressed/example" class GraphServerProcess(multiprocessing.Process): - def 
__init__(self, q, *args, **kwargs): - self.q = q + def __init__(self, *args, **kwargs): + self.q = multiprocessing.Queue() super().__init__(*args, **kwargs) def run(self): # Lazy import to allow debian packaging from swh.graph.http_server import make_app try: config = {"graph": {"path": TEST_GRAPH_PATH}} with loop_context() as loop: app = make_app(config=config, debug=True, spawn_rpc_port=None) client = TestClient(TestServer(app), loop=loop) loop.run_until_complete(client.start_server()) url = client.make_url("/graph/") - self.q.put(url) + self.q.put( + { + "server_url": url, + "rpc_url": app["rpc_url"], + "pid": app["local_server"].pid, + } + ) loop.run_forever() except Exception as e: self.q.put(e) + def start(self, *args, **kwargs): + super().start() + self.result = self.q.get() + + +@pytest.fixture(scope="module") +def graph_grpc_server_process(): + server = GraphServerProcess() + + yield server + + server.kill() + + +@pytest.fixture(scope="module") +def graph_grpc_server(graph_grpc_server_process): + server = graph_grpc_server_process + server.start() + if isinstance(server.result, Exception): + raise server.result + grpc_url = server.result["rpc_url"] + yield grpc_url + server.kill() + + +@pytest.fixture(scope="module") +def graph_grpc_stub(graph_grpc_server): + with grpc.insecure_channel(graph_grpc_server) as channel: + stub = TraversalServiceStub(channel) + yield stub + @pytest.fixture(scope="module", params=["remote", "naive"]) def graph_client(request): if request.param == "remote": - queue = multiprocessing.Queue() - server = GraphServerProcess(queue) + server = request.getfixturevalue("graph_grpc_server_process") server.start() - res = queue.get() - if isinstance(res, Exception): - raise res - yield RemoteGraphClient(str(res)) - server.terminate() + if isinstance(server.result, Exception): + raise server.result + yield RemoteGraphClient(str(server.result["server_url"])) + server.kill() else: def zstdcat(*files): p = subprocess.run(["zstdcat", *files], 
stdout=subprocess.PIPE) return p.stdout.decode() edges_dataset = SWH_GRAPH_TESTS_ROOT / "dataset/edges" edge_files = edges_dataset.glob("*/*.edges.csv.zst") node_files = edges_dataset.glob("*/*.nodes.csv.zst") nodes = set(zstdcat(*node_files).strip().split("\n")) edge_lines = [line.split() for line in zstdcat(*edge_files).strip().split("\n")] edges = [(src, dst) for src, dst, *_ in edge_lines] for src, dst in edges: nodes.add(src) nodes.add(dst) yield NaiveClient(nodes=list(nodes), edges=edges) diff --git a/swh/graph/tests/test_grpc.py b/swh/graph/tests/test_grpc.py new file mode 100644 index 0000000..2cef192 --- /dev/null +++ b/swh/graph/tests/test_grpc.py @@ -0,0 +1,129 @@ +# Copyright (c) 2022 The Software Heritage developers +# See the AUTHORS file at the top-level directory of this distribution +# License: GNU General Public License version 3, or any later version +# See top-level LICENSE file for more information + +import hashlib + +from google.protobuf.field_mask_pb2 import FieldMask + +from swh.graph.rpc.swhgraph_pb2 import ( + GraphDirection, + NodeFilter, + StatsRequest, + TraversalRequest, +) + +TEST_ORIGIN_ID = "swh:1:ori:{}".format( + hashlib.sha1(b"https://example.com/swh/graph").hexdigest() +) + + +def test_stats(graph_grpc_stub): + stats = graph_grpc_stub.Stats(StatsRequest()) + assert stats.num_nodes == 21 + assert stats.num_edges == 23 + assert isinstance(stats.compression_ratio, float) + assert isinstance(stats.bits_per_node, float) + assert isinstance(stats.bits_per_edge, float) + assert isinstance(stats.avg_locality, float) + assert stats.indegree_min == 0 + assert stats.indegree_max == 3 + assert isinstance(stats.indegree_avg, float) + assert stats.outdegree_min == 0 + assert stats.outdegree_max == 3 + assert isinstance(stats.outdegree_avg, float) + + +def test_leaves(graph_grpc_stub): + request = graph_grpc_stub.Traverse( + TraversalRequest( + src=[TEST_ORIGIN_ID], + mask=FieldMask(paths=["swhid"]), + return_nodes=NodeFilter(types="cnt"), + 
) + ) + actual = [node.swhid for node in request] + expected = [ + "swh:1:cnt:0000000000000000000000000000000000000001", + "swh:1:cnt:0000000000000000000000000000000000000004", + "swh:1:cnt:0000000000000000000000000000000000000005", + "swh:1:cnt:0000000000000000000000000000000000000007", + ] + assert set(actual) == set(expected) + + +def test_neighbors(graph_grpc_stub): + request = graph_grpc_stub.Traverse( + TraversalRequest( + src=["swh:1:rev:0000000000000000000000000000000000000009"], + direction=GraphDirection.BACKWARD, + mask=FieldMask(paths=["swhid"]), + min_depth=1, + max_depth=1, + ) + ) + actual = [node.swhid for node in request] + expected = [ + "swh:1:snp:0000000000000000000000000000000000000020", + "swh:1:rel:0000000000000000000000000000000000000010", + "swh:1:rev:0000000000000000000000000000000000000013", + ] + assert set(actual) == set(expected) + + +def test_visit_nodes(graph_grpc_stub): + request = graph_grpc_stub.Traverse( + TraversalRequest( + src=["swh:1:rel:0000000000000000000000000000000000000010"], + mask=FieldMask(paths=["swhid"]), + edges="rel:rev,rev:rev", + ) + ) + actual = [node.swhid for node in request] + expected = [ + "swh:1:rel:0000000000000000000000000000000000000010", + "swh:1:rev:0000000000000000000000000000000000000009", + "swh:1:rev:0000000000000000000000000000000000000003", + ] + assert set(actual) == set(expected) + + +def test_visit_nodes_filtered(graph_grpc_stub): + request = graph_grpc_stub.Traverse( + TraversalRequest( + src=["swh:1:rel:0000000000000000000000000000000000000010"], + mask=FieldMask(paths=["swhid"]), + return_nodes=NodeFilter(types="dir"), + ) + ) + actual = [node.swhid for node in request] + expected = [ + "swh:1:dir:0000000000000000000000000000000000000002", + "swh:1:dir:0000000000000000000000000000000000000008", + "swh:1:dir:0000000000000000000000000000000000000006", + ] + assert set(actual) == set(expected) + + +def test_visit_nodes_filtered_star(graph_grpc_stub): + request = graph_grpc_stub.Traverse( + 
TraversalRequest( + src=["swh:1:rel:0000000000000000000000000000000000000010"], + mask=FieldMask(paths=["swhid"]), + ) + ) + actual = [node.swhid for node in request] + expected = [ + "swh:1:rel:0000000000000000000000000000000000000010", + "swh:1:rev:0000000000000000000000000000000000000009", + "swh:1:rev:0000000000000000000000000000000000000003", + "swh:1:dir:0000000000000000000000000000000000000002", + "swh:1:cnt:0000000000000000000000000000000000000001", + "swh:1:dir:0000000000000000000000000000000000000008", + "swh:1:cnt:0000000000000000000000000000000000000007", + "swh:1:dir:0000000000000000000000000000000000000006", + "swh:1:cnt:0000000000000000000000000000000000000004", + "swh:1:cnt:0000000000000000000000000000000000000005", + ] + assert set(actual) == set(expected) diff --git a/swh/graph/tests/test_http_server_down.py b/swh/graph/tests/test_http_server_down.py new file mode 100644 index 0000000..d6cb3fb --- /dev/null +++ b/swh/graph/tests/test_http_server_down.py @@ -0,0 +1,38 @@ +# Copyright (C) 2022 The Software Heritage developers +# See the AUTHORS file at the top-level directory of this distribution +# License: GNU General Public License version 3, or any later version +# See top-level LICENSE file for more information + +import os +import signal + +import pytest + +from swh.core.api import TransientRemoteException +from swh.graph.http_client import RemoteGraphClient +from swh.graph.http_naive_client import NaiveClient + +from .test_http_client import TEST_ORIGIN_ID + + +def test_leaves(graph_client, graph_grpc_server_process): + if isinstance(graph_client, RemoteGraphClient): + pass + elif isinstance(graph_client, NaiveClient): + pytest.skip("test irrelevant for naive graph client") + else: + assert False, f"unexpected graph_client class: {graph_client.__class__}" + + list(graph_client.leaves(TEST_ORIGIN_ID)) + + server = graph_grpc_server_process + pid = server.result["pid"] + os.kill(pid, signal.SIGKILL) + try: + os.waitpid(pid, os.WNOHANG) + except 
ChildProcessError: + pass + + it = graph_client.leaves(TEST_ORIGIN_ID) + with pytest.raises(TransientRemoteException, match="failed to connect"): + list(it)
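
The error handling that `test_http_server_down.py` exercises can be sketched without aiohttp or gRPC: a wrapper around each request handler catches the backend failure and re-raises it as a 503-style error carrying only the string form of the original exception, which is what `_aiorpcerror_middleware` does with `str(e)`. The names below (`BackendUnavailable`, `ServiceUnavailable`, `translate_backend_errors`, the two handlers) are hypothetical stand-ins for `grpc.aio.AioRpcError`, `aiohttp.web.HTTPServiceUnavailable` and the real middleware; this is a minimal illustration of the pattern, not the code from the patch.

```python
import asyncio


class BackendUnavailable(Exception):
    """Hypothetical stand-in for grpc.aio.AioRpcError."""


class ServiceUnavailable(Exception):
    """Hypothetical stand-in for aiohttp.web.HTTPServiceUnavailable."""

    status = 503

    def __init__(self, text):
        super().__init__(text)
        self.text = text


def translate_backend_errors(handler):
    """Wrap an async handler so backend failures surface as a 503-style error."""

    async def wrapped(request):
        try:
            return await handler(request)
        except BackendUnavailable as e:
            # Keep only the string form of the error, mirroring str(e) in the diff.
            raise ServiceUnavailable(text=str(e))

    return wrapped


async def healthy_handler(request):
    # Illustrative handler that succeeds.
    return f"ok: {request}"


async def broken_handler(request):
    # Illustrative handler whose backend is down.
    raise BackendUnavailable("failed to connect to all addresses")


async def demo():
    # A successful call passes through the wrapper unchanged.
    ok = await translate_backend_errors(healthy_handler)("/graph/stats")
    try:
        await translate_backend_errors(broken_handler)("/graph/leaves/x")
    except ServiceUnavailable as e:
        return ok, e.status, e.text
    return ok, None, None
```

Running `asyncio.run(demo())` exercises both paths. In the patch itself, the translation only helps if the error is raised before the streamed response starts, which is presumably why `prepare_response` awaits `wait_for_connection()` before any line is written.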