diff --git a/docs/api.rst b/docs/api.rst index 0b7f1a2..3face8d 100644 --- a/docs/api.rst +++ b/docs/api.rst @@ -1,407 +1,541 @@ .. _swh-graph-api: -Graph RPC API -============= +Graph Querying HTTP API +======================= + +The Graph Querying API is a high-level HTTP API intended to run common, +relatively simple traversal queries on the compressed graph. + +The client/server architecture allows it to only load the graph in memory once +then serve multiple different requests. However, it is limited in expressivity; +more complex or resource-intensive queries should rather use the +:ref:`Low-level Java API ` to run them as standalone +programs. Terminology ----------- This API uses the following notions: - **Node**: a node in the :ref:`Software Heritage graph `, represented by a :ref:`SWHID `. - **Node type**: the 3-letter specifier from the node SWHID (``cnt``, ``dir``, ``rel``, ``rev``, ``snp``, ``ori``), or ``*`` for all node types. - **Edge type**: a pair ``src:dst`` where ``src`` and ``dst`` are either node types, or ``*`` to denote all node types. - **Edge restrictions**: a textual specification of which edges can be followed during graph traversal. Either ``*`` to denote that all edges can be followed or a comma separated list of edge types to allow following only those edges. Note that when traversing the *backward* (i.e., transposed) graph, edge types are reversed too. So, for instance, ``ori:snp`` makes sense when traversing the forward graph, but useless (due to lack of matching edges in the graph) when traversing the backward graph; conversely ``snp:ori`` is useful when traversing the backward graph, but not in the forward one. For the same reason ``dir:dir`` allows following edges from parent directories to sub-directories when traversing the forward graph, but the same restriction allows following edges from sub-directories to parent directories. - **Node restrictions**: a textual specification of which type of nodes can be returned after a request. Either ``*`` to denote that all types of nodes can be returned or a comma separated list of node types to allow returning only those node types. Examples ~~~~~~~~ - ``swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2`` the SWHID of a node of type content containing the full text of the GPL3 license. - ``swh:1:rev:f39d7d78b70e0f39facb1e4fab77ad3df5c52a35`` the SWHID of a node of type revision corresponding to the commit in Linux that merged the 'x86/urgent' branch on 31 December 2017. - ``"dir:dir,dir:cnt"`` node types allowing edges from directories to directories nodes, or directories to contents nodes. - ``"rev:rev,dir:*"`` node types allowing edges from revisions to revisions nodes, or from directories nodes. - ``"*:rel"`` node types allowing all edges to releases. - ``"cnt,snp"`` accepted node types returned in the query results. +Endpoints +--------- + Leaves ------- +~~~~~~ .. http:get:: /graph/leaves/:src Performs a graph traversal and returns the leaves of the subgraph rooted at the specified source node. 
:param string src: source node specified as a SWHID :query string edges: edges types the traversal can follow; default to ``"*"`` :query string direction: direction in which graph edges will be followed; can be either ``forward`` or ``backward``, default to ``forward`` :query integer max_edges: how many edges can be traversed during the visit; default to 0 (not restricted) :query string return_types: only return the nodes matching this type; default to ``"*"`` :statuscode 200: success :statuscode 400: invalid query string provided :statuscode 404: starting node cannot be found **Example:** .. sourcecode:: http GET /graph/leaves/swh:1:dir:432d1b21c1256f7408a07c577b6974bbdbcc1323 HTTP/1.1 Content-Type: text/plain Transfer-Encoding: chunked .. sourcecode:: http HTTP/1.1 200 OK swh:1:cnt:540faad6b1e02e2db4f349a4845192db521ff2bd swh:1:cnt:630585fc6d34e5e121139e2aee0a64e83dc9aae6 swh:1:cnt:f8634ced669f0a9155c8cab1b2621d57d778215e swh:1:cnt:ba6daa801ad3ea587904b1abe9161dceedb2e0bd ... Neighbors ---------- +~~~~~~~~~ .. http:get:: /graph/neighbors/:src Returns node direct neighbors (linked with exactly one edge) in the graph. :param string src: source node specified as a SWHID :query string edges: edges types allowed to be listed as neighbors; default to ``"*"`` :query string direction: direction in which graph edges will be followed; can be either ``forward`` or ``backward``, default to ``forward`` :query integer max_edges: how many edges can be traversed during the visit; default to 0 (not restricted) :query string return_types: only return the nodes matching this type; default to ``"*"`` :statuscode 200: success :statuscode 400: invalid query string provided :statuscode 404: starting node cannot be found **Example:** .. sourcecode:: http GET /graph/neighbors/swh:1:rev:f39d7d78b70e0f39facb1e4fab77ad3df5c52a35 HTTP/1.1 Content-Type: text/plain Transfer-Encoding: chunked .. sourcecode:: http HTTP/1.1 200 OK swh:1:rev:a31e58e129f73ab5b04016330b13ed51fde7a961 swh:1:dir:b5d2aa0746b70300ebbca82a8132af386cc5986d swh:1:rev:52c90f2d32bfa7d6eccd66a56c44ace1f78fbadd ... Walk ----- +~~~~ .. .. http:get:: /graph/walk/:src/:dst Performs a graph traversal and returns the first found path from source to destination (final destination node included). :param string src: starting node specified as a SWHID :param string dst: destination node, either as a node SWHID or a node type. The traversal will stop at the first node encountered matching the desired destination. :query string edges: edges types the traversal can follow; default to ``"*"`` :query string traversal: traversal algorithm; can be either ``dfs`` or ``bfs``, default to ``dfs`` :query string direction: direction in which graph edges will be followed; can be either ``forward`` or ``backward``, default to ``forward`` :query string return_types: types of nodes we want to be displayed; default to ``"*"`` :statuscode 200: success :statuscode 400: invalid query string provided :statuscode 404: starting node cannot be found **Example:** .. sourcecode:: http HTTP/1.1 200 OK swh:1:rev:f39d7d78b70e0f39facb1e4fab77ad3df5c52a35 swh:1:rev:52c90f2d32bfa7d6eccd66a56c44ace1f78fbadd swh:1:rev:cea92e843e40452c08ba313abc39f59efbb4c29c swh:1:rev:8d517bdfb57154b8a11d7f1682ecc0f79abf8e02 ... .. http:get:: /graph/randomwalk/:src/:dst Performs a graph *random* traversal, i.e., picking one random successor node at each hop, from source to destination (final destination node included). 
:param string src: starting node specified as a SWHID :param string dst: destination node, either as a node SWHID or a node type. The traversal will stop at the first node encountered matching the desired destination. :query string edges: edges types the traversal can follow; default to ``"*"`` :query string direction: direction in which graph edges will be followed; can be either ``forward`` or ``backward``, default to ``forward`` :query int limit: limit the number of nodes returned. You can use positive numbers to get the first N results, or negative numbers to get the last N results starting from the tail; default to ``0``, meaning no limit. :query integer max_edges: how many edges can be traversed during the visit; default to 0 (not restricted) :query string return_types: only return the nodes matching this type; default to ``"*"`` :statuscode 200: success :statuscode 400: invalid query string provided :statuscode 404: starting node cannot be found **Example:** .. sourcecode:: http GET /graph/randomwalk/swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2/ori?direction=backward HTTP/1.1 Content-Type: text/plain Transfer-Encoding: chunked .. sourcecode:: http HTTP/1.1 200 OK swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2 swh:1:dir:8de8a8823a0780524529c94464ee6ef60b98e2ed swh:1:dir:7146ea6cbd5ffbfec58cc8df5e0552da45e69cb7 swh:1:rev:b12563e00026b48b817fd3532fc3df2db2a0f460 swh:1:rev:13e8ebe80fb878bade776131e738d5772aa0ad1b swh:1:rev:cb39b849f167c70c1f86d4356f02d1285d49ee13 ... swh:1:rev:ff70949f336593d6c59b18e4989edf24d7f0f254 swh:1:snp:a511810642b7795e725033febdd82075064ed863 swh:1:ori:98aa0e71f5c789b12673717a97f6e9fa20aa1161 **Limit example:** .. sourcecode:: http GET /graph/randomwalk/swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2/ori?direction=backward&limit=-2 HTTP/1.1 Content-Type: text/plain Transfer-Encoding: chunked .. sourcecode:: http HTTP/1.1 200 OK swh:1:ori:98aa0e71f5c789b12673717a97f6e9fa20aa1161 swh:1:snp:a511810642b7795e725033febdd82075064ed863 Visit ------ +~~~~~ .. http:get:: /graph/visit/nodes/:src .. http:get:: /graph/visit/edges/:src .. http:get:: /graph/visit/paths/:src Performs a graph traversal and returns explored nodes, edges or paths (in the order of the traversal). :param string src: starting node specified as a SWHID :query string edges: edges types the traversal can follow; default to ``"*"`` :query integer max_edges: how many edges can be traversed during the visit; default to 0 (not restricted) :query string return_types: only return the nodes matching this type; default to ``"*"`` :statuscode 200: success :statuscode 400: invalid query string provided :statuscode 404: starting node cannot be found **Example:** .. sourcecode:: http GET /graph/visit/nodes/swh:1:snp:40f9f177b8ab0b7b3d70ee14bbc8b214e2b2dcfc HTTP/1.1 Content-Type: text/plain Transfer-Encoding: chunked .. sourcecode:: http HTTP/1.1 200 OK swh:1:snp:40f9f177b8ab0b7b3d70ee14bbc8b214e2b2dcfc swh:1:rev:cfab784723a6c2d33468c9ed8a566fd5e2abd8c9 swh:1:rev:53e5df0e7a6b7bd4919074c081a173655c0da164 swh:1:rev:f85647f14b8243532283eff3e08f4ee96c35945f swh:1:rev:fe5f9ef854715fc59b9ec22f9878f11498cfcdbf swh:1:dir:644dd466d8ad527ea3a609bfd588a3244e6dafcb swh:1:cnt:c8cece50beae7a954f4ea27e3ae7bf941dc6d0c0 swh:1:dir:a358d0cf89821227d4c00b0ced5e0a8b3756b5db swh:1:cnt:cc407b7e24dd300d2e1a77d8f04af89b3f962a51 swh:1:cnt:701bd0a63e11b3390a547ce8515d28c6bab8a201 ... **Example:** .. 
sourcecode:: http GET /graph/visit/edges/swh:1:snp:40f9f177b8ab0b7b3d70ee14bbc8b214e2b2dcfc HTTP/1.1 Content-Type: text/plain Transfer-Encoding: chunked .. sourcecode:: http HTTP/1.1 200 OK swh:1:snp:40f9f177b8ab0b7b3d70ee14bbc8b214e2b2dcfc swh:1:rev:61f92a7db95f5a6d1fcb94d2b897ed3797584d7b swh:1:snp:40f9f177b8ab0b7b3d70ee14bbc8b214e2b2dcfc swh:1:rev:00e81c89c29ff3e58745fdaf7abb68daa1389e85 swh:1:snp:40f9f177b8ab0b7b3d70ee14bbc8b214e2b2dcfc swh:1:rev:7596fdc31c9aa00aed281ccb026a74cabf2383bb swh:1:snp:40f9f177b8ab0b7b3d70ee14bbc8b214e2b2dcfc swh:1:rev:ec7a2341ac3d9d8b571bbdfb90a089d4e54dea56 swh:1:snp:40f9f177b8ab0b7b3d70ee14bbc8b214e2b2dcfc swh:1:rev:1c5b5eac61eda2454034a43eb124ab490885ef3a swh:1:snp:40f9f177b8ab0b7b3d70ee14bbc8b214e2b2dcfc swh:1:rev:4dfa88ca55e04e8afe05e8543ddddee32dde7236 swh:1:snp:40f9f177b8ab0b7b3d70ee14bbc8b214e2b2dcfc swh:1:rev:d56ae79e43ff1b37534370911c8a78ec7f38d437 swh:1:snp:40f9f177b8ab0b7b3d70ee14bbc8b214e2b2dcfc swh:1:rev:19ba5d6203a040a39ecc4a77b165d3f097c1e662 swh:1:snp:40f9f177b8ab0b7b3d70ee14bbc8b214e2b2dcfc swh:1:rev:9c56102eefea23c95405533e1de23da4b873ecc4 swh:1:snp:40f9f177b8ab0b7b3d70ee14bbc8b214e2b2dcfc swh:1:rev:3f54e816b46c2e179cd164e17fea93b3013a9db4 ... **Example:** .. sourcecode:: http GET /graph/visit/paths/swh:1:dir:644dd466d8ad527ea3a609bfd588a3244e6dafcb HTTP/1.1 Content-Type: application/x-ndjson Transfer-Encoding: chunked .. sourcecode:: http HTTP/1.1 200 OK ["swh:1:dir:644dd466d8ad527ea3a609bfd588a3244e6dafcb", "swh:1:cnt:acfb7cabd63b368a03a9df87670ece1488c8bce0"] ["swh:1:dir:644dd466d8ad527ea3a609bfd588a3244e6dafcb", "swh:1:cnt:2a0837708151d76edf28fdbb90dc3eabc676cff3"] ["swh:1:dir:644dd466d8ad527ea3a609bfd588a3244e6dafcb", "swh:1:cnt:eaf025ad54b94b2fdda26af75594cfae3491ec75"] ... ["swh:1:dir:644dd466d8ad527ea3a609bfd588a3244e6dafcb", "swh:1:dir:2ebd4b96fa5665ff74f2b27ae41aecdc43af4463", "swh:1:cnt:1d3b6575fb7bf2a147d228e78ffd77ea193c3639"] ... Counting results ----------------- +~~~~~~~~~~~~~~~~ The following method variants, with trailing `/count` added, behave like their already discussed counterparts but, instead of returning results, return the *amount* of results that would have been returned: .. http:get:: /graph/leaves/count/:src Return the amount of :http:get:`/graph/leaves/:src` results .. http:get:: /graph/neighbors/count/:src Return the amount of :http:get:`/graph/neighbors/:src` results .. http:get:: /graph/visit/nodes/count/:src Return the amount of :http:get:`/graph/visit/nodes/:src` results Stats ------ +~~~~~ .. http:get:: /graph/stats Returns statistics on the compressed graph. :statuscode 200: success **Example** .. sourcecode:: http GET /graph/stats HTTP/1.1 Content-Type: application/json .. sourcecode:: http HTTP/1.1 200 OK { "counts": { "nodes": 16222788, "edges": 9907464 }, "ratios": { "compression": 0.367, "bits_per_node": 5.846, "bits_per_edge": 9.573, "avg_locality": 270.369 }, "indegree": { "min": 0, "max": 12382, "avg": 0.6107127825377487 }, "outdegree": { "min": 0, "max": 1, "avg": 0.6107127825377487 } } + + +Use-case examples +----------------- + +This section showcases how to leverage the endpoints of the HTTP API described +above for some common use-cases. + + +Browsing +~~~~~~~~ + +The following use cases require traversing the *forward graph*. 
+ +- **ls**: given a directory node, list (non recursively) all linked nodes of + type directory and content + + Endpoint:: + + /graph/neighbors/:DIR_ID?edges=dir:cnt,dir:dir + +- **ls -R**: given a directory node, recursively list all linked nodes of type + directory and content + + Endpoint:: + + /graph/visit/paths/:DIR_ID?edges=dir:cnt,dir:dir + +- **git log**: given a revision node, recursively list all linked nodes of type + revision + + Endpoint:: + + /graph/visit/nodes/:REV_ID?edges=rev:rev + + +Vault +~~~~~ + +The following use cases require traversing the *forward graph*. + +- **tarball** (same as *ls -R* above) + +- **git bundle**: given a node, recursively list all linked nodes of any kind + + Endpoint:: + + /graph/visit/nodes/:NODE_ID?edges=* + + +Provenance +~~~~~~~~~~ + +The following use cases require traversing the *backward (transposed) +graph*. + +- **commit provenance**: given a content or directory node, return *a* commit + whose directory (recursively) contains it + + Endpoint:: + + /graph/walk/:NODE_ID/rev?direction=backward&edges=dir:dir,cnt:dir,dir:rev + +- **complete commit provenance**: given a content or directory node, return + *all* commits whose directory (recursively) contains it + + Endpoint:: + + /graph/leaves/:NODE_ID?direction=backward&edges=dir:dir,cnt:dir,dir:rev + +- **origin provenance**: given a content, directory, or commit node, return + *an* origin that has at least one snapshot that (recursively) contains it + + Endpoint:: + + /graph/walk/:NODE_ID/ori?direction=backward&edges=* + +- **complete origin provenance**: given a content, directory, or commit node, + return *all* origins that have at least one snapshot that (recursively) + contains it + + Endpoint:: + + /graph/leaves/:NODE_ID?direction=backward&edges=* + + +Provenance statistics +~~~~~~~~~~~~~~~~~~~~~ + +The following use cases require traversing the *backward (transposed) +graph*. + +- **content popularity across commits**: count the number of commits (or + *commit popularity*) that link to a directory that (recursively) includes a + given content. + + Endpoint:: + + /graph/count/leaves/:NODE_ID?direction=backward&edges=cnt:dir,dir:dir,dir:rev + +- **commit popularity across origins**: count the number of origins (or *origin + popularity*) that have a snapshot that (recursively) includes a given commit. + + Endpoint:: + + /graph/count/leaves/:NODE_ID?direction=backward&edges=* + +The following use cases require traversing the *forward graph*. + +- **revision size** (as n. of contents) distribution: the number of contents + that are (recursively) reachable from a given revision. + + Endpoint:: + + /graph/count/leaves/:NODE_ID?edges=* + +- **origin size** (as n. of revisions) distribution: count the number of + revisions that are (recursively) reachable from a given origin. + + Endpoint:: + + /graph/count/leaves/:NODE_ID?edges=ori:snp,snp:rel,snp:rev,rel:rev,rev:rev diff --git a/docs/compression.rst b/docs/compression.rst index edca8a7..5641ff1 100644 --- a/docs/compression.rst +++ b/docs/compression.rst @@ -1,125 +1,596 @@ .. _graph-compression: +================= Graph compression ================= -The compression process is a pipeline implemented for the most part on top of -the `WebGraph framework `_ and ecosystem -libraries. The compression pipeline consists of the following steps: +The compression pipeline is implemented on top of the `WebGraph framework +`_. 
It takes an ORC Graph Dataset as an input, +such as the ones found in the :ref:`Graph Dataset List `, +and generates a compressed graph suitable for high intensity analyses on +large servers. -.. figure:: images/compression_steps.png - :align: center - :alt: Compression steps - Compression steps +Running the compression pipeline +================================ + +Dependencies +------------ + +To compress a graph, you will need to install the ``swh.graph`` tool as well as +a recent JRE, as described in the :ref:`swh-graph-quickstart` page. + +You will also need the zstd_ compression tool:: + + $ sudo apt install zstd + +.. _zstd: https://facebook.github.io/zstd/ + + +Hardware Requirements +--------------------- + +The compression pipeline is even more demanding than the graph server in terms +of hardware requirements, especially RAM. Notably, the BFS compression step +loads a graph compressed in random order in memory, which is usually more than +a TiB for the full graph. While it is possible to do this step with a memory +mapping, our experiments show that this could take a very long time (several +months) on hard drives. + +The LLP compression step requires 13 bytes of RAM per node, which could amount +to storing hundreds of gigabytes in RAM in addition to loading the graph +itself. + +Some steps also involve sorting the entire set of edges and their labels, by +using large on-disk buffer files, sometimes reaching the size of the input +dataself itself. + +The machine we used to compress the entire graph (dataset version 2022-04-25) +has the following hardware specs: + +- 2 TiB of RAM (DDR4 ECC 2400Mhz) +- 64 vCPUs (Dual AMD EPYC 7302 16-Core) +- 24 TiB of SSD (NVMe) + +The server we rented is from the +`HGR-HCI-4 `_ +series from OVH. + + +Input dataset +------------- + +First, you need to retrieve a graph to compress, in ORC format. The :ref:`Graph +Dataset List ` has a list of datasets made available by the +Software Heritage archive, including "teaser" subdatasets which have a more +manageable size and are thus very useful for prototyping with less hardware +resources. + +The datasets can be retrieved from S3 or the annex, in a similar fashion to +what is described in :ref:`swh-graph-retrieving-compressed`, by simply +replacing "compressed" by "orc": + +.. code:: console + + (venv) $ mkdir -p 2021-03-23-popular-3k-python/orc + (venv) $ cd 2021-03-23-popular-3k-python/ + (venv) $ aws s3 cp --recursive s3://softwareheritage/graph/2021-03-23-popular-3k-python/orc/ orc + +Alternatively, any custom ORC dataset can be used as long as it respects +:ref:`the schema ` of the Software Heritage Graph Dataset. + +**Note:** for testing purposes, a fake test dataset is available in the +``swh-graph`` repository, with just a few dozen nodes. The ORC tables are +available in ``swh-graph/swh/graph/tests/dataset/orc/``. + + +Compression +----------- + +You can compress your dataset by using the ``swh graph compress`` command. It +will run all the various steps of the pipeline in the right order. + +.. code:: console -Each of these steps is briefly described below. For more details see the -following paper: -.. note:: + (venv) $ swh graph compress --input-dataset orc/ --outdir compressed/ + [...] + (venv) $ ls compressed/ + graph.edges.count.txt + graph.edges.stats.txt + graph.graph + graph.indegree + graph-labelled.labeloffsets + graph-labelled.labels + [...] 
+ graph-transposed.obl
+ graph-transposed.offsets
+ graph-transposed.properties
-    Paolo Boldi, Antoine Pietri, Sebastiano Vigna, Stefano Zacchiroli.
-    `Ultra-Large-Scale Repository Analysis via Graph Compression
-    `_. In
-    proceedings of `SANER 2020 `_: The 27th IEEE
-    International Conference on Software Analysis, Evolution and
-    Reengineering. IEEE 2020.
-    Links: `preprint
-    `_,
-    `bibtex
-    `_.
+(The purpose of each of these files is detailed in the
+:ref:`swh-graph-java-api` page.)
-In order to practically perform graph compression, install the ``swh.graph``
-module and use the ``swh graph compress`` command line interface of the
-compression driver, that will conduct the various steps in the right order.
-See ``swh graph compress --help`` for usage details.
+For sufficiently large graphs, this command can take entire weeks. It is highly
+recommended to run it in a systemd service or in a tmux session.
+It is also possible to run single steps or step ranges from the CLI:
-1. MPH
+.. code:: bash
+
+   swh graph compress -i orc/ -o compressed/ --steps mph-bfs
+
+See ``swh graph compress --help`` for syntax and usage details.
+
+
+Compression steps
+=================
+
+The compression pipeline consists of the following steps:
+
+.. figure:: images/compression_steps.png
+   :align: center
+   :alt: Compression steps
+   :target: _images/compression_steps.png
+
+   Compression steps
+
+Each of these steps is briefly described below. For more details see the
+original Software Heritage graph compression paper [SWHGraphCompression2020]_,
+as well as chapters 9 and 10 of Antoine Pietri's PhD thesis
+[PietriThesis2021]_.
+
+.. [SWHGraphCompression2020]
+   | Paolo Boldi, Antoine Pietri, Sebastiano Vigna, Stefano Zacchiroli.
+   | `Ultra-Large-Scale Repository Analysis via Graph Compression
+     `_.
+   | In proceedings of `SANER 2020 `_: The 27th
+     IEEE International Conference on Software Analysis, Evolution and
+     Reengineering. IEEE 2020.
+   | Links: `preprint
+     `_,
+     `bibtex
+     `_.
+
+
+
+.. [PietriThesis2021]
+   | Antoine Pietri
+   | `Organizing the graph of public software development for large-scale mining
+     `_.
+   | Doctoral dissertation. Inria, 2021.
+
+
+1. EXTRACT_NODES
+----------------
+
+This step reads a graph dataset and extracts all the unique node SWHIDs it
+contains, including the ones that are not stored as actual objects in the
+graph, but only *referred to* by the edges. Additionally, it extracts the set
+of all unique edge labels in the graph.
+
+**Rationale:** Because the graph can contain holes, loose objects and dangling
+objects, some nodes that are referred to as destinations in the edge
+relationships might not actually be stored in the graph itself. However, to
+compress the graph using a graph compression library, it is necessary to have a
+list of *all* the nodes in the graph, including the ones that are simply
+referred to by the edges but not actually stored as concrete objects.
+
+This step reads the entire graph dataset, and uses ``sort -u`` to extract the
+set of all the unique nodes and unique labels that will be needed as an input
+for the compression process. It also writes object count statistics in various
+files:
+
+- The set of nodes is written in ``graph.nodes.csv.zst``, as a zst-compressed
+  sorted list of SWHIDs, one per line.
+- The set of edge labels is written in ``graph.labels.csv.zst``, as a
+  zst-compressed sorted list of labels encoded in base64, one per line.
+- The number of unique nodes referred to in the graph is written in a text + file, ``graph.nodes.count.txt`` +- The number of unique edges referred to in the graph is written in a text + file, ``graph.edges.count.txt`` +- The number of unique edge labels is written in a text file, + ``graph.labels.count.txt`` +- Statistics on the number of nodes of each type are written in a text file, + ``graph.nodes.stats.txt`` +- Statistics on the number of edges of each type are written in a text file, + ``graph.edges.stats.txt`` + + +2. MPH ------ -A node in the Software Heritage :ref:`data model ` is identified -using its SWHID (see :ref:`persistent identifiers -`). However, WebGraph internally uses integers to refer -to node ids. +As discussed in :ref:`swh-graph-java-basics`, a node in the Software Heritage +:ref:`data model ` is identified by its SWHID (see :ref:`persistent +identifiers `), but WebGraph internally uses integers +to refer to node ids. -Mapping between the strings and longs ids is needed before compressing the -graph. From the `Sux4J `_ utility tool, we use the +To create a mapping between integer node IDs and SWHIDs, we use the `GOVMinimalPerfectHashFunction `_ -class, mapping with no collisions N keys to N consecutive integers. +class of the `Sux4J `_ library, which maps N keys to N +consecutive integers. -The step produces a ``.mph`` file (MPH stands for *Minimal Perfect -Hash-function*) storing the hash function taking as input a string and returning -a unique integer. +We run this function on the list of SWHIDs stored in the +``graph.nodes.csv.zst`` file generated in the previous step. +This allows us to generate a bijection from the set of all the *n* SWHIDs in the +graph to the set of integers :math:`[0, n - 1]`. +The step produces a ``graph.mph`` file (MPH stands for *Minimal Perfect +Hash-function*), containing a function which takes a SWHID (as a bytestring) +and returns its associated node ID. -2. BV compress + +3. BV compress -------------- -This is the first actual compression step, building a compressed version of the -input graph using WebGraph techniques presented in the framework paper. We use -the `ScatteredArcsASCIIGraph +This is the first actual compression step, where we build a compressed version +of the input graph dataset. + +We use a ScatteredArcsORCGraph to load the dataset +(implementation inspired of the `ScatteredArcsASCIIGraph `_ -class, from WebGraph. +class in WebGraph). +This class wraps the ORC Graph dataset and exposes a *virtual* ImmutableGraph, +whose nodes and edges can be iterated sequentially as if it was any other +standard graph. To do so, it puts all the edges in batches and sorts them in an +aggressively parallel fashion, then stores them as ``.bitstream`` files, and +returns a `BatchGraph +` +created from these batches. + +Finally, it uses the ``BVGraph.store()`` method, which compresses the input +graph as a `BVGraph +`_, +using the compression techniques described in the article *The WebGraph +Framework I: Compression Techniques* cited above. 
The resulting BV graph is stored as a set of files: -- ``.graph``: the compressed graph in the BV format -- ``.offsets``: offsets values to read the bit stream graph file -- ``.obl``: offsets cache to load the graph faster -- ``.properties``: entries used to correctly decode graph and offset files +- ``graph-base.graph``: the compressed graph in the BV format +- ``graph-base.offsets``: offsets values to read the bit stream graph file +- ``graph-base.properties``: entries used to correctly decode graph and offset + files -3. BFS -------- +4. BFS +------ -In the LLP paper, authors propose an empirical analysis linking node ordering -and high compression ratio: it is important to use an ordering of nodes ids such -that vertices from the same host are close to one another. +In [LLP]_, the paper authors empirically demonstrate that a high graph +compression ratio can be achieved for the graph of the Web by ordering nodes +such that vertices from the same host are close to each other. -Building on this insight, the previous compression results in the BV compress -step are improved by re-ordering nodes ids using a BFS traversal order. We use -the `BFS +In Software Heritage, there is no notion of "host" that can be used to generate +these compression-friendly orderings, because the identifiers are just +uniformly random cryptographic hashes. However, we can generate these orderings +by running algorithms to inform us on which nodes are close to each other. + +In this step, we run a BFS traversal on the entire graph to get a more +compression-friendly ordering of nodes. We use the `BFS `_ class from the `LAW `_ library. -The resulting ordering is stored in the ``.order`` file, listing nodes ids in -order of traversal. +The resulting ordering is stored in a ``graph-bfs.order`` file, which contains +all the node IDs in the order of traversal. -4. Permute ----------- +5. PERMUTE_BFS +-------------- -Once the order is computed (BFS or another ordering technique), the final -compressed graph is created based on the initial BV compress result, and using -the new node order mapping. The permutation uses the `Transform +Once the BFS order is computed, we permute the initial "base" graph using the +this new ordering. The permutation uses the `Transform `_ class from WebGraph framework. -The final compressed graph is only stored in the resulting ``.graph``, -``.offsets``, ``.obl``, and ``.properties`` files. +The BFS-compressed graph is stored in the files +``graph-bfs.{graph,offsets,properties}``. +6. TRANSPOSE_BFS +---------------- -5. Stats --------- +We transpose the BFS-compressed graph, using the `Transform +`_ +class from WebGraph. +This step is a prerequisite for LLP compression. + +7. SIMPLIFY +----------- + +This step creates a loopless and symmetric version of the BFS-compressed graph, +using the `Transform +`_ +class from WebGraph. +This step is a prerequisite for LLP compression. + +8. LLP +------ + +Better compression ratios can be achieved by the Layered Label Propagation +(LLP) algorithm to reorder nodes. This algorithm is described in [LLP]_. +The LLP algorithm finds locality-preserving orders by clustering together nodes +in close proximity. Similar to the BFS, this algorithm is particularly +interesting for our use case as it is unsupervised, and does not rely on prior +information on the clusters present in the graph. 
The idea behind the
+clustering algorithm is to randomly distribute communities to the nodes in the
+graph, then iteratively assign to each node the community most represented in
+its neighbors.
+
+.. [LLP] Paolo Boldi, Marco Rosa, Massimo Santini, Sebastiano Vigna.
+   *Layered label propagation: a multiresolution coordinate-free ordering for compressing social networks.*
+   WWW 2011: 587-596
+   DOI: https://doi.org/10.1145/1963405.1963488
+   preprint: https://arxiv.org/abs/1011.5425
+
+LLP is more costly than simple BFS-based compression in both time and memory.
+Even though the algorithm has a linear time complexity, it does multiple
+iterations on the graph and is significantly slower than the BFS which is just
+one single traversal. Moreover, keeping track of the communities requires a
+total of 13 bytes per node, which increases the RAM requirements.
+Because of these constraints, it is unrealistic to run the LLP algorithm on the
+uncompressed version of the graph; this is why we do an intermediate
+compression with the BFS ordering first, then compress the entire graph *again*
+with an even better ordering.
+
+The LLP algorithm takes a simplified (loopless, symmetric) graph as an input,
+which we already computed in the previous steps.
+
+The algorithm is also parameterized by a list of γ values, a "resolution" parameter
+which defines the shapes of the clustering it produces: either small, but
+denser pieces, or larger, but unavoidably sparser pieces. The algorithm then
+combines the different clusterings together to generate the output reordering.
+γ values are given to the algorithm in the form :math:`\frac{j}{2^k}`; by
+default, 12 different values of γ are used. However, the combination procedure
+is very slow, and using that many γ values could take several months in our
+case.
+We thus narrowed down a smaller set of γ values that empirically give good
+compression results, which are used by default in the pipeline. In general,
+smaller values of γ seem to generate better compression ratios. The effect of a
+given γ is that the density of the sparsest cluster is at least
+:math:`\frac{\gamma}{\gamma + 1}`, so large γ values imply smaller, denser
+clusters. It is reasonable to assume that since the graph is very sparse to
+start with, such clusters are not that useful.
+
+The resulting ordering is stored in a ``graph-llp.order`` file.
+
+9. PERMUTE_LLP
+--------------
-Compute various statistics on the final compressed graph:
+Once the LLP order is computed, we permute the BFS-compressed graph using
+this new ordering. The LLP-compressed graph, which is our final compressed
+graph, is stored in the files ``graph.{graph,offsets,properties}``.
-- ``.stats``: entries such as number of nodes, edges, avg/min/max degree,
+10. OBL
+-------
+
+Cache the BVGraph offsets of the forward graph to make loading faster. The
+resulting offset big list is stored in the ``graph.obl`` file.
+
+11. COMPOSE_ORDERS
+------------------
+
+To be able to translate the initial MPH inputs to their resulting rank in the
+LLP-compressed graph, we need to use the two order permutations successively:
+the base → BFS permutation, then the BFS → LLP permutation.
+
+To make this less wasteful, we *compose* the two permutations into a single
+one. We use the `composePermutationsInPlace
+`_
+function of the dsiutils library. The resulting permutation is stored as a
+``graph.order`` file.
Hashing a SWHID with the ``graph.mph`` function, then +permuting the result using the ``graph.order`` permutation yields the integer +node ID matching the input SWHID in the graph. + +12. STATS +--------- + +This step computes various statistics on the compressed graph: + +- ``.stats``: statistics such as number of nodes, edges, avg/min/max degree, average locality, etc. - ``.indegree``: graph indegree distribution - ``.outdegree``: graph outdegree distribution This step uses the `Stats `_ class from WebGraph. -6. Transpose ------------- +13. TRANSPOSE +------------- -Create a transposed graph to allow backward traversal, using the `Transform +Transpose the graph to allow backward traversal, using the `Transform `_ -class from WebGraph. +class from WebGraph. The resulting transposed graph is stored as the +``graph-transposed.{graph,offsets,properties}`` files. + + +14. TRANSPOSE_OBL +----------------- + +Same as OBL, but for the transposed graph. The resulting offset big list is +stored in the ``graph-transposed.obl`` file. + + +15. MAPS +-------- + +This steps generates the *node mappings* described in +:ref:`swh-graph-java-node-mappings`. In particular, it generates: + +- ``graph.node2swhid.bin``: a compact binary representation of all the + SWHIDs in the graph, ordered by their rank in the graph file. +- ``graph.node2type.bin``: a `LongBigArrayBitVector + `_ + which stores the type of each node. + +It does so by reading all the SWHIDs in the ``graph.nodes.csv.zst`` file generated in the +EXTRACT_NODES step, then getting their corresponding node IDs (using the +``.mph`` and ``.order`` files), then sorting all the SWHIDs according to +their node ID. It then writes these SWHIDs in order, in a compact but seekable +binary format, which can be used to return the SWHID corresponding to any given +node in O(1). + + +16. EXTRACT_PERSONS +------------------- + +This step reads the ORC graph dataset and extracts all the unique persons it +contains. Here, "persons" are defined as the set of unique pairs of name + +email, potentially pseudonymized, found either as revision authors, revision +committers or release authors. + +The ExtractPersons class reads all the persons from revision and release +tables, then uses ``sort -u`` to get a sorted list without any duplicates. The +resulting sorted list of authors is stored in the ``graph.persons.csv.zst`` +file. + + +17. MPH_PERSONS +--------------- + +This step computes a Minimal Perfect Hash function on the set of all the unique +persons extracted in the EXTRACT_PERSONS step. Each individual person is mapped +to a unique integer in :math:`[0, n-1]` where *n* is the total number of +persons. The resulting function is serialized and stored in the +``graph.persons.mph`` file. + + +18. NODE_PROPERTIES +------------------- + +This step generates the *node property files*, as described in +:ref:`swh-graph-java-node-properties`. +The nodes in the Software Heritage Graph each have associated *properties* +(e.g., commit timestamps, authors, messages, ...). The values of these +properties for each node in the graph are compressed and stored in files +alongside the compressed graph. + +The WriteNodeProperties class reads all the properties from the ORC Graph +Dataset, then serializes each of them in a representation suitable for +efficient random access (e.g., large binary arrays) and stores them on disk. 
+ +For persons (authors, committers etc), the MPH computed in the MPH_PERSONS step +is used to store them as a single pseudonymized integer ID, which uniquely +represents a full name + email. + +The results are stored in the following list of files: + +- ``graph.property.author_id.bin`` +- ``graph.property.author_timestamp.bin`` +- ``graph.property.author_timestamp_offset.bin`` +- ``graph.property.committer_id.bin`` +- ``graph.property.committer_timestamp.bin`` +- ``graph.property.committer_timestamp_offset.bin`` +- ``graph.property.content.is_skipped.bin`` +- ``graph.property.content.length.bin`` +- ``graph.property.message.bin`` +- ``graph.property.message.offset.bin`` +- ``graph.property.tag_name.bin`` +- ``graph.property.tag_name.offset.bin`` + + +19. MPH_LABELS +-------------- + +This step computes a **monotone** Minimal Perfect Hash function on the set of +all the unique *arc label names* extracted in the EXTRACT_NODES step. Each +individual arc label name (i.e., directory entry names and snapshot branch +names) is monotonely mapped to a unique integer in :math:`[0, n-1]`, where *n* +is the total number of unique arc label names, which corresponds to their +**lexical rank** in the set of all arc labels. + +In other words, this MPH being monotone means that the hash of the *k*-th item +in the sorted input list of arc labels will always be *k*. +We use the `LcpMonotoneMinimalPerfectHashFunction +`_ +of Sux4J to generate this function. + +The rationale for using a monotone function here is that it will allow us to +quickly get back the arc label from its hash without having to store an +additional permutation. +The resulting MPH function is serialized and stored in the ``graph.labels.mph`` +file. + + +20. FCL_LABELS +-------------- + +This step computes a *reverse-mapping* for arc labels, i.e., a way to +efficiently get the arc label name from its hash computed with the monotone MPH +of the MPH_LABELS step. + +Thanks to the MPH being monotone, this boils down to storing all the labels in +lexicographic order in a string list format that allows O(1) access to its +elements. For this purpose, we use the `MappedFrontCodedStringBigList +`_ +class from the dsiutils library, using the ``graph.labels.csv.zst`` file as its +input. It stores the label names in a compact way by using front-coding +compression, which is particularly efficient here because the strings are +already in lexicographic order. The resulting FCL files are stored as +``graph.labels.fcl.*``, and they can be loaded using memory mapping. + + +21. EDGE_LABELS +--------------- + + +This step generates the *edge property files*, as described in +:ref:`swh-graph-java-edge-properties`. These files allow us to get the *edge +labels* as we iterate on the edges of the graph. The files essentially contain +compressed sorted triplets of the form (source, destination, label), with +additional offsets to allow random access. + +To generate these files, the LabelMapBuilder class starts by reading in +parallel the labelled edges in the ORC dataset, which can be thought of as +quadruplets containing the source SWHID, the destination SWHID, the label name +and the entry permission if applicable: + +.. code-block:: text + + swh:1:snp:4548a5… swh:1:rev:0d6834… cmVmcy9oZWFkcy9tYXN0ZXI= + swh:1:dir:05faa1… swh:1:cnt:a35136… dGVzdC5j 33188 + swh:1:dir:05faa1… swh:1:dir:d0ff82… dGVzdA== 16384 + ... + +Using the ``graph.mph`` and the ``graph.order`` files, we hash and permute the +source and destination nodes. 
We also monotonically hash the labels using the +``graph.labels.mph`` function to obtain the arc label identifiers. The +permissions are normalized as one of the 6 possible values in the +``DirEntry.Permission.Type`` enum, and are then stored in the 3 lowest bits of +the label field. + +.. code-block:: text + + 4421 14773 154 + 1877 21441 1134 + 1877 14143 1141 + ... + +These hashed edges and their compact-form labels are then put in large batches +sorted in an aggressively parallel fashion, which are then stored as +``.bitstream`` files. These batch files are put in a heap structure to perform +a merge sort on the fly on all the batches. + +Then, the LabelMapBuilder loads the graph and starts iterating on its edges. It +synchronizes the stream of edges read from the graph with the stream of sorted +edges and labels read from the bitstreams in the heap. At this point, it writes +the labels to the following output files: + +- ``graph-labelled.properties``: a property file describing the graph, notably + containing the basename of the wrapped graph. +- ``graph-labelled.labels``: the compressed labels +- ``graph-labelled.labeloffsets``: the offsets used to access the labels in + random order. + +It then does the same with backward edge batches to get the transposed +equivalent of these files: +``graph-transposed-labelled.{properties,labels,labeloffsets}``. + + + +22. CLEAN_TMP +------------- + +This step reclaims space by deleting the temporary directory, as well as all +the intermediate outputs that are no longer necessary now that the final graph +has been compressed (shown in gray in the step diagram). diff --git a/docs/images/Makefile b/docs/images/Makefile index 01fbfa2..9cb29d6 100644 --- a/docs/images/Makefile +++ b/docs/images/Makefile @@ -1,13 +1,13 @@ all: compression_steps.png compression_steps.svg %.png: %.dot - dot -Gdpi=300 -Tpng $< -o $@ + dot -Gdpi=150 -Tpng $< -o $@ %.svg: %.dot dot -Tsvg $< -o $@ .PHONY: clean clean: rm -f compression_steps.png rm -f compression_steps.svg diff --git a/docs/images/compression_steps.dot b/docs/images/compression_steps.dot index 7156f62..f10774d 100644 --- a/docs/images/compression_steps.dot +++ b/docs/images/compression_steps.dot @@ -1,51 +1,105 @@ digraph "Compression steps" { - // Horizontal graph - rankdir=LR; + node [shape = none]; + + orc_dataset [label="ORC Graph\nDataset"]; + nodes_csv [label="graph.nodes.csv.zst"]; + labels_csv [label="graph.labels.csv.zst"]; + graph_mph [label="graph.mph"]; subgraph { - input_edges [label="swh.edges.csv.gz", fontsize=9, shape=none]; - input_nodes [label="swh.nodes.csv.gz", fontsize=9, shape=none]; - {rank=same; input_edges; input_nodes;} + node [fontcolor=darkgray]; + graph_base [label="graph-base.graph"] + graph_bfs_order [label="graph-bfs.order"] + graph_bfs [label="graph-bfs.graph"] + graph_bfs_transposed [label="graph-bfs-transposed.graph"] + graph_bfs_simplified [label="graph-bfs-simplified.graph"] + graph_llp_order [label="graph-llp.order"] } - mph [label="MPH", shape=box]; - mph_out [label="swh.mph", fontsize=9, shape=none]; - - bv_compress [label="BV compress", shape=box]; - bv_compress_out - [label="swh-bv.graph\lswh-bv.offsets\lswh-bv.obl\lswh-bv.properties", - fontsize=9, shape=none]; - - bfs [label="BFS", shape=box]; - bfs_out [label="swh.order", fontsize=9, shape=none]; + graph_llp [label="graph.graph"] + graph_llp_transposed [label="graph-transposed.graph"] + graph_order [label="graph.order"] + graph_obl [label="graph.obl"] + graph_transposed_obl [label="graph-transposed.obl"] + stats 
[label="graph.stats"] + swhidmap [label="graph.node2swhid.bin"] + typemap [label="graph.node2type.bin"] + persons_csv [label="graph.persons.csv.zst"]; + persons_mph [label="graph.persons.mph"]; + node_properties [label="graph.property.*"]; + labels_mph [label="graph.labels.mph"]; + labels_fcl [label="graph.labels.fcl"]; + graph_labelled [label="graph-labelled.*"]; + graph_transposed_labelled [label="graph-transposed-labelled.*"]; - permute [label="Permute", shape=box]; - permute_out - [label="swh.graph\lswh.offsets\lswh.obl\lswh.properties", - fontsize=9, shape=none]; - - stats [label="Stats", shape=box]; - stats_out - [label="swh.stats\lswh.indegree\lswh.outdegree", - fontsize=9, shape=none]; + subgraph { + node [shape=box, fontname="Courier New"]; + EXTRACT_NODES; + MPH; + BV; + BFS; + PERMUTE_BFS; + TRANSPOSE_BFS; + SIMPLIFY; + LLP; + PERMUTE_LLP; + COMPOSE_ORDERS; + STATS; + TRANSPOSE; + OBL; + TRANSPOSE_OBL; + NODE_MAP; + EXTRACT_PERSONS; + MPH_PERSONS; + NODE_PROPERTIES; + MPH_LABELS; + FCL_LABELS; + EDGE_LABELS; + } - transpose [label="Transpose", shape=box]; - transpose_out - [label="swh-transposed.graph\lswh-transposed.offsets\lswh-transposed.obl\lswh-transposed.properties", - fontsize=9, shape=none]; - input_nodes -> mph; - input_edges -> bv_compress; - mph -> mph_out; - mph_out -> bv_compress; - bv_compress -> bv_compress_out; - bv_compress_out-> bfs; - bv_compress_out-> permute; - bfs -> bfs_out; - bfs_out -> permute; - permute -> permute_out; - permute_out -> stats; - permute_out -> transpose; - stats -> stats_out; - transpose -> transpose_out; + orc_dataset -> EXTRACT_NODES; + EXTRACT_NODES -> nodes_csv; + EXTRACT_NODES -> labels_csv; + nodes_csv -> MPH -> graph_mph; + graph_mph -> BV; + orc_dataset -> BV -> graph_base; + graph_base -> BFS -> graph_bfs_order; + graph_bfs_order -> PERMUTE_BFS; + graph_base -> PERMUTE_BFS -> graph_bfs; + graph_bfs -> TRANSPOSE_BFS -> graph_bfs_transposed; + graph_bfs_transposed -> SIMPLIFY; + graph_bfs -> SIMPLIFY -> graph_bfs_simplified; + graph_bfs_simplified -> LLP -> graph_llp_order; + graph_llp_order -> PERMUTE_LLP; + graph_bfs -> PERMUTE_LLP -> graph_llp; + graph_bfs_order -> COMPOSE_ORDERS; + graph_llp_order -> COMPOSE_ORDERS -> graph_order; + graph_llp -> TRANSPOSE -> graph_llp_transposed; + graph_llp -> OBL -> graph_obl; + graph_llp_transposed -> TRANSPOSE_OBL -> graph_transposed_obl; + graph_llp -> STATS -> stats; + graph_llp -> NODE_MAP; + nodes_csv -> NODE_MAP; + graph_mph -> NODE_MAP; + graph_order -> NODE_MAP; + NODE_MAP -> swhidmap; + NODE_MAP -> typemap; + orc_dataset -> EXTRACT_PERSONS -> persons_csv; + persons_csv -> MPH_PERSONS -> persons_mph; + orc_dataset -> NODE_PROPERTIES; + persons_mph -> NODE_PROPERTIES; + graph_mph -> NODE_PROPERTIES; + graph_order -> NODE_PROPERTIES; + NODE_PROPERTIES -> node_properties; + labels_csv -> MPH_LABELS -> labels_mph; + labels_mph -> FCL_LABELS; + labels_csv -> FCL_LABELS -> labels_fcl; + orc_dataset -> EDGE_LABELS; + labels_mph -> EDGE_LABELS; + graph_llp -> EDGE_LABELS; + graph_mph -> EDGE_LABELS; + graph_order -> EDGE_LABELS; + EDGE_LABELS -> graph_labelled; + EDGE_LABELS -> graph_transposed_labelled; } diff --git a/docs/index.rst b/docs/index.rst index 9bf477d..ec4b098 100644 --- a/docs/index.rst +++ b/docs/index.rst @@ -1,17 +1,18 @@ .. _swh-graph: .. include:: README.rst .. 
toctree::
   :maxdepth: 1
   :caption: Overview

   quickstart
+   api
+   java-api
+   memory
   compression
   cli
-   api
-   use-cases
   docker
   git2graph
   /apidoc/swh.graph
diff --git a/docs/java-api.rst b/docs/java-api.rst
new file mode 100644
index 0000000..07d0041
--- /dev/null
+++ b/docs/java-api.rst
@@ -0,0 +1,744 @@
+.. _swh-graph-java-api:
+
+Using the Java API
+==================
+
+.. highlight:: java
+
+While the :ref:`HTTP API ` is useful for many common use-cases,
+it is often not sufficient to implement more complex algorithms. This section
+describes the low-level Java API that ``swh-graph`` provides on top of the
+WebGraph framework to manipulate the compressed graph of Software Heritage.
+
+A cursory understanding of the `WebGraph framework
+`_ and its API is helpful to understand the
+notions detailed here.
+
+.. _swh-graph-java-basics:
+
+Basics
+------
+
+In the WebGraph framework, graphs are generally subclasses of
+`ImmutableGraph
+`_,
+the abstract class providing the core API to manipulate and iterate on graphs.
+Under the hood, compressed graphs are stored as the `BVGraph
+`_
+class, which contains the actual codec used to compress and decompress
+adjacency lists.
+
+Graph **nodes** are mapped to a contiguous set of integers :math:`[0, n - 1]`
+where *n* is the total number of nodes in the graph.
+Each node has an associated *adjacency list*, i.e., a list of the destination
+nodes of the arcs that leave that source node. This list represents the
+**edges** (or **arcs**) of the graph.
+
+**Note**: edges are always directed. Undirected graphs are internally stored as
+a pair of directed edges (src → dst, dst → src), and are called "symmetric"
+graphs.
+
+On disk, a simple BVGraph with the basename ``graph`` would be represented as
+the following set of files:
+
+- ``graph.graph``: contains the compressed adjacency lists of each node, which
+  can be decompressed by the BVGraph codec.
+- ``graph.properties``: contains metadata on the graph, such as the number of
+  nodes and arcs, as well as additional loading information needed by the
+  BVGraph codec.
+- ``graph.offsets``: a list of offsets of where the adjacency list of each node
+  is stored in the main graph file.
+- ``graph.obl``: optionally, an "offset big-list file" which can be used to
+  load graphs faster.
+
+An ImmutableGraph can be loaded using different *load methods*, which each have
+different performance implications:
+
+- ``load()``: the entire graph is loaded in RAM and supports random access.
+- ``loadMapped()``: the graph is loaded by memory-mapping it from disk (see
+  ``mmap(1)``), at the cost of being potentially slower, especially when doing
+  random access on slow storage.
+- ``loadOffline()``: no data is actually loaded in memory, only sequential
+  iteration is possible.
+
+The following code loads a graph stored on disk under the ``compressed/graph``
+basename, using the memory-mapped loading mode, and stores it as a generic
+ImmutableGraph:
+
+.. code-block:: java
+
+    ImmutableGraph graph = ImmutableGraph.loadMapped("compressed/graph");
+
+Note that most of the time you will want to use the SWH-specific subclass
+**SwhUnidirectionalGraph** instead, which has the same API as an ImmutableGraph
+except it adds other SWH-specific methods. It is described later in the
+:ref:`swh-graph-java-node-mappings` section.
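+
+Putting the pieces above together, a minimal standalone program could look like
+the following sketch. The ``MyAlgo`` class name is only a placeholder (it is
+reused in the commands of the next section), and the import assumes the "big"
+WebGraph variant (``it.unimi.dsi.big.webgraph``) shipped with ``swh-graph``:
+
+.. code-block:: java
+
+    import it.unimi.dsi.big.webgraph.ImmutableGraph;
+
+    public class MyAlgo {
+        public static void main(String[] args) throws Exception {
+            // Memory-map the compressed graph stored under the given basename
+            ImmutableGraph graph = ImmutableGraph.loadMapped("compressed/graph");
+            System.out.println("nodes: " + graph.numNodes());
+            System.out.println("arcs: " + graph.numArcs());
+        }
+    }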
+
+
+Running the code
+----------------
+
+To run a piece of Java code written using the Java API, you need to run it with
+all the dependencies in your classpath (the WebGraph libraries and the
+swh-graph library). The easiest way to do so is to add the *fat jar*
+shipped with the ``swh-graph`` package on PyPI, which contains all the required
+dependencies.
+
+.. code-block:: console
+
+    $ java -cp venv/share/swh-graph/swh-graph-0.5.2.jar MyAlgo.java
+
+
+Note that to load bigger graphs, the default heap size of the JVM is likely to
+be insufficient to load entire graphs in memory. It is advised to increase this
+heap size with the ``-Xmx`` flag:
+
+.. code-block:: console
+
+    $ java -Xmx300G -cp venv/share/swh-graph/swh-graph-0.5.2.jar MyAlgo.java
+
+For more information on performance tuning and memory considerations, you
+should also read the :ref:`swh-graph-memory` page, in which we recommend
+additional JVM options for loading large graphs.
+
+
+Simple traversal
+----------------
+
+The ImmutableGraph class provides primitives to iterate and traverse graphs. It
+contains the following methods:
+
+- ``graph.numNodes()`` returns the number of nodes in the graph (*n*).
+- ``graph.numArcs()`` returns the number of arcs in the graph.
+
+And, given a node ID :math:`k \in [0, n - 1]`:
+
+- ``graph.successors(k)`` returns a LazyLongIterator on the nodes that are
+  *adjacent* to *k* (i.e., its outgoing *neighbors*).
+- ``graph.outdegree(k)`` returns the number of outgoing neighbors of *k*.
+
+
+Example: Average outdegree
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The following code can be used to compute the average
+outdegree of a graph, which is a useful measure of its density:
+
+.. code-block:: java
+
+    public static long averageOutdegree(ImmutableGraph graph) {
+        return ((long) graph.numArcs()) / graph.numNodes();
+    }
+
+
+Example: Degree distributions
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Using the ``outdegree()`` primitive, we can compute the outdegree distribution
+of the graph by iterating on all its nodes. The distribution will be returned
+as a map that associates to each degree *d* the number of nodes with outdegree
+*d*.
+
+.. code-block:: java
+
+    public static Map<Long, Long> outdegreeDistribution(ImmutableGraph graph) {
+        HashMap<Long, Long> distribution = new HashMap<>();
+        for (long k = 0; k < graph.numNodes(); ++k) {
+            distribution.merge(graph.outdegree(k), 1L, Long::sum);
+        }
+        return distribution;
+    }
+
+
+Example: Depth-First Traversal
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The ``successors`` primitive can be used to write a simple stack-based DFS
+traversal on the graph which starts from a given node and prints all the
+descendant nodes in its transitive closure:
+
+.. code-block:: java
+    :emphasize-lines: 10
+
+    public static void visitNodesDFS(ImmutableGraph graph, long srcNodeId) {
+        Stack<Long> stack = new Stack<>();
+        HashSet<Long> visited = new HashSet<Long>();
+        stack.push(srcNodeId);
+        visited.add(srcNodeId);
+
+        while (!stack.isEmpty()) {
+            long currentNodeId = stack.pop();
+            System.out.println(currentNodeId);
+
+            LazyLongIterator it = graph.successors(currentNodeId);
+            for (long neighborNodeId; (neighborNodeId = it.nextLong()) != -1;) {
+                if (!visited.contains(neighborNodeId)) {
+                    stack.push(neighborNodeId);
+                    visited.add(neighborNodeId);
+                }
+            }
+        }
+    }
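+
+The same DFS pattern can be adapted to other traversal-based queries. As an
+illustrative sketch (not a method provided by the library), the following
+counts the *leaves* reachable from a node, i.e., the reachable nodes without
+any outgoing arc, which mirrors the ``/graph/leaves`` endpoint of the HTTP API:
+
+.. code-block:: java
+
+    public static long countReachableLeaves(ImmutableGraph graph, long srcNodeId) {
+        Stack<Long> stack = new Stack<>();
+        HashSet<Long> visited = new HashSet<Long>();
+        stack.push(srcNodeId);
+        visited.add(srcNodeId);
+        long leaves = 0;
+
+        while (!stack.isEmpty()) {
+            long currentNodeId = stack.pop();
+            if (graph.outdegree(currentNodeId) == 0) {
+                leaves++;  // no successors: a leaf of the subgraph rooted at srcNodeId
+            }
+            LazyLongIterator it = graph.successors(currentNodeId);
+            for (long neighborNodeId; (neighborNodeId = it.nextLong()) != -1;) {
+                if (!visited.contains(neighborNodeId)) {
+                    stack.push(neighborNodeId);
+                    visited.add(neighborNodeId);
+                }
+            }
+        }
+        return leaves;
+    }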
+
+Example: Breadth-First Traversal
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Swapping the stack for a queue changes the traversal order from depth-first
+to breadth-first:
+
+.. code-block:: java
+    :emphasize-lines: 2
+
+    public static void visitNodesBFS(ImmutableGraph graph, long srcNodeId) {
+        Queue<Long> queue = new ArrayDeque<>();
+        HashSet<Long> visited = new HashSet<Long>();
+        queue.add(srcNodeId);
+        visited.add(srcNodeId);
+
+        while (!queue.isEmpty()) {
+            long currentNodeId = queue.poll();
+            System.out.println(currentNodeId);
+
+            LazyLongIterator it = graph.successors(currentNodeId);
+            for (long neighborNodeId; (neighborNodeId = it.nextLong()) != -1;) {
+                if (!visited.contains(neighborNodeId)) {
+                    queue.add(neighborNodeId);
+                    visited.add(neighborNodeId);
+                }
+            }
+        }
+    }
+
+
+.. _swh-graph-java-node-mappings:
+
+Node types and SWHIDs
+---------------------
+
+In the Software Heritage archive, nodes are not represented by a simple
+integer, but by a :ref:`SWHID `, which contains both the
+*type* of the node (revision, directory, blob...) and its unique identifier. We
+use **node mappings** which allow us to translate between SWHIDs and the
+compact node IDs used in the compressed graph.
+
+Most notably, we use a MPH (Minimal Perfect Hash) function implemented in the
+`GOVMinimalPerfectHashFunction
+`_
+class of the Sux4J library, which maps N keys to N consecutive integers with no
+collisions.
+
+The following files are used to store the mappings between the nodes and their
+types:
+
+- ``graph.mph``: contains a serialized minimal perfect hash function computed
+  on the list of all the SWHIDs in the graph.
+- ``graph.order``: contains the permutation that associates with each output of
+  the MPH the node ID to which it corresponds
+- ``graph.node2swhid.bin``: contains a compact binary representation of all the
+  SWHIDs in the graph, ordered by their rank in the graph file.
+- ``graph.node2type.bin``: contains a `LongBigArrayBitVector
+  `_
+  which stores the type of each node.
+
+To use these mappings easily, we provide the class **SwhUnidirectionalGraph**,
+an ImmutableGraph which wraps the underlying graph and adds a few
+utility methods to obtain SWH-specific information on the graph.
+
+A SwhUnidirectionalGraph can be loaded in a similar way to any ImmutableGraph,
+as long as the mapping files listed above are present::
+
+    SwhUnidirectionalGraph graph = SwhUnidirectionalGraph.load(basename);
+
+This class exposes the same graph primitives as an ImmutableGraph, but it
+additionally contains the following methods:
+
+- ``SWHID getSWHID(long nodeId)``: returns the SWHID associated with a given
+  node ID. This function does a lookup of the SWHID at offset *i* in the file
+  ``graph.node2swhid.bin``.
+
+- ``long getNodeID(SWHID swhid)``: returns the node ID associated with a given
+  SWHID. It works by hashing the SWHID with the function stored in
+  ``graph.mph``, then permuting it using the permutation stored in
+  ``graph.order``. It does additional domain-checking by calling ``getSWHID()``
+  on its own result to check that the input SWHID was valid.
+
+- ``Node.Type getNodeType(long nodeID)``: returns the type of a given node, as
+  an enum of all the different object types in the Software Heritage data
+  model. It does so by looking up the value at offset *i* in the bit vector
+  stored in ``graph.node2type.bin``.
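+
+For instance, the following sketch (reusing the revision SWHID from the earlier
+examples and the ``basename`` variable from the loading snippet above) converts
+a SWHID into its node ID and back, and checks the node type::
+
+    SwhUnidirectionalGraph graph = SwhUnidirectionalGraph.load(basename);
+    SWHID swhid = new SWHID("swh:1:rev:f39d7d78b70e0f39facb1e4fab77ad3df5c52a35");
+    long nodeId = graph.getNodeID(swhid);
+    assert graph.getNodeType(nodeId) == Node.Type.REV;
+    System.out.println(graph.getSWHID(nodeId));  // prints the same SWHID back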
We
+iterate on the successors of that node, and return the SWHID of the first
+destination node that has the "directory" type.
+
+
+.. code-block:: java
+    :emphasize-lines: 2
+
+    public SWHID findDirectoryOfRevision(SwhUnidirectionalGraph graph, SWHID revSwhid) {
+        long src = graph.getNodeID(revSwhid);
+        assert graph.getNodeType(src) == Node.Type.REV;
+        LazyLongIterator it = graph.successors(src);
+        for (long dst; (dst = it.nextLong()) != -1;) {
+            if (graph.getNodeType(dst) == Node.Type.DIR) {
+                return graph.getSWHID(dst);
+            }
+        }
+        throw new RuntimeException("Revision has no target directory");
+    }
+
+.. _swh-graph-java-node-properties:
+
+Node properties
+---------------
+
+The Software Heritage graph is a *property graph*, which means it has various
+properties associated with its nodes and edges (e.g., commit timestamps,
+authors, messages, ...). We compress these properties and store them in files
+alongside the compressed graph. This allows you to write traversal algorithms
+that depend on these properties.
+
+By default, properties are not assumed to be present and are not loaded when
+the graph itself is loaded. If you want to use a property, you need to
+explicitly load it first. As an example, this is how you load the "content
+length" property to get the length of a given blob::
+
+    SwhUnidirectionalGraph graph = SwhUnidirectionalGraph.load(basename);
+    graph.loadContentLength();
+    long blobSize = graph.getContentLength(graph.getNodeID(swhid));
+
+The documentation of the SwhGraphProperties class (**TODO: link**) lists all
+the different properties, their types, and the methods used to load them and
+to get their value for a specific node.
+
+A few things of note:
+
+- A single loading call can load multiple properties at once; this is because
+  they are stored in the same file to be more space efficient.
+
+- Persons (authors, committers, etc.) are exported as a single pseudonymized
+  integer ID that uniquely represents a full name + email.
+
+- Timestamps are stored as a long integer (for the timestamp itself) and a
+  short integer (for the UTC offset).
+
+
+.. _swh-graph-java-edge-properties:
+
+Edge labels
+-----------
+
+While looking up graph properties on the *nodes* of the graph is relatively
+straightforward, doing so for labels on the *arcs* is comparatively more
+difficult. These include the names and permissions of directory entries, as
+well as the branch names of snapshots.
+
+The `ArcLabelledImmutableGraph
+`_
+class in WebGraph wraps an ImmutableGraph, but augments its iterators by making
+them *labelled iterators*, which essentially allow us to look up the label of
+the arcs while iterating on them.
+
+This labelled graph is stored in the following files:
+
+- ``graph-labelled.properties``: a property file describing the graph, notably
+  containing the basename of the wrapped graph.
+- ``graph-labelled.labels``: the compressed labels.
+- ``graph-labelled.labeloffsets``: the offsets used to access the labels in
+  random order.
+
+The SwhUnidirectionalGraph class contains *labelled* loading methods
+(``loadLabelled()``, ``loadLabelledMapped()``, ...). When these loading methods
+are used instead of the standard non-labelled ones, the graph is loaded as an
+ArcLabelledImmutableGraph instead of an ImmutableGraph.
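+For instance, loading could look like the following minimal sketch, which uses
+the ``loadLabelledMapped()`` method mentioned above; ``basename`` is assumed to
+be the path prefix of the graph files, as in the earlier examples::
+
+    // Memory-map the graph together with its arc labels
+    SwhUnidirectionalGraph graph = SwhUnidirectionalGraph.loadLabelledMapped(basename);
+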
The following methods
+can then be used:
+
+- ``labelledSuccessors(k)`` returns a `LabelledArcIterator
+  `_
+  which is used in the same way as a LazyLongIterator, except it also contains
+  a ``label()`` method to get the label of the currently traversed arc.
+- ``labelledNodeIterator()`` returns an `ArcLabelledNodeIterator
+  `_
+  of all the nodes in the graph, which replaces the LazyLongIterator of the
+  ``successors()`` function by a LabelledArcIterator similar to the above.
+
+
+Label format
+~~~~~~~~~~~~
+
+The labels of each arc are returned as a ``DirEntry[]`` array. They encode
+both the name of a directory entry and its permissions. For snapshot branches,
+only the "name" field is useful.
+
+Arc label names are encoded as an integer ID representing each unique
+entry/branch name present in the graph. To retrieve the actual name associated
+with a given label ID, one needs to load the reverse mapping, similarly to how
+you would do it for a normal property::
+
+    SwhUnidirectionalGraph graph = SwhUnidirectionalGraph.loadLabelled(basename);
+    graph.loadLabelNames();
+
+The byte array representing the actual label name can then be loaded with::
+
+    byte[] name = graph.getLabelName(label.filenameId);
+
+
+Multiedges
+~~~~~~~~~~
+
+The Software Heritage graph is not a *simple graph*, where at most one edge can
+exist between two vertices, but a *multigraph*, where multiple edges can be
+incident to the same two vertices. Consider for instance the case of a single
+directory ``test/`` containing twice the same file blob (e.g., the empty file),
+under two different names (e.g., ``ISSUES.txt`` and ``TODO.txt``, both
+completely empty). The simple graph view of this directory will represent it as
+a single edge ``test`` → *empty file*, while the multigraph view will represent
+it as *two* edges between the same nodes.
+
+Due to the copy-list model of compression, WebGraph only stores simple graphs,
+and thus stores multiedges as single edges, to which we cannot associate a
+single label name (in our example, we need to associate both names
+``ISSUES.txt`` and ``TODO.txt``). To represent the possibility of having
+multiple file names on a single arc, each arc label is therefore stored as an
+*array* of DirEntry, with one DirEntry record per relationship between the two
+nodes.
+
+
+Example: Printing all the entries of a directory
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The following code showcases how one can print all the entries (name,
+permission and target SWHID) of a given directory, using the labelled methods
+seen above.
+
+.. code-block:: java
+
+    public static void printEntries(SwhUnidirectionalGraph g, long dirNode) {
+        LabelledArcIterator s = g.labelledSuccessors(dirNode);
+        for (long dst; (dst = s.nextLong()) >= 0;) {
+            DirEntry[] labels = (DirEntry[]) s.label().get();
+            for (DirEntry label : labels) {
+                System.out.format(
+                    "%s %s %d\n",
+                    g.getSWHID(dst),
+                    new String(g.getLabelName(label.filenameId)),
+                    label.permission
+                );
+            }
+        }
+    }
+
+    // Usage: $PROGRAM <graph_basename> <directory_swhid>
+    public static void main(String[] args) {
+        SwhUnidirectionalGraph g = SwhUnidirectionalGraph.loadLabelledMapped(args[0]);
+        g.loadLabelNames();
+        long dirNode = g.getNodeID(new SWHID(args[1]));
+        printEntries(g, dirNode);
+    }
+
+
+Transposed graph
+----------------
+
+Up until now, we have only looked at how to traverse the *forward* graph, i.e.,
+the directed graph whose edges are in the same direction as the Merkle DAG of
+the Software Heritage archive.
+
+For many purposes, especially that of finding the *provenance* of software
+artifacts, it is useful to query the *backward* (or *transposed*) graph
+instead, which is the same as the forward graph except all the edges are
+reversed.
+
+The transposed graph has its own set of files, counterparts to the files needed
+for the forward graph:
+
+- ``graph-transposed.graph``
+- ``graph-transposed.properties``
+- ``graph-transposed.offsets``
+- ``graph-transposed.obl``
+- ``graph-transposed-labelled.labels``
+- ``graph-transposed-labelled.labeloffsets``
+- ``graph-transposed-labelled.properties``
+
+However, because node IDs are the same in the forward and the backward graph,
+all the files that pertain to mappings between the node IDs and various
+properties (SWHIDs, property data, node permutations, etc.) remain the same.
+
+
+Example: Earliest revision containing a given blob
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The following code loads all the committer timestamps of the revisions in the
+graph, then walks the *transposed* graph to return the earliest revision
+containing a given object.
+
+.. code-block:: java
+
+    public static long findEarliestRevisionContaining(SwhUnidirectionalGraph g, long src) {
+        long oldestRev = -1;
+        long oldestRevTs = Long.MAX_VALUE;
+
+        Stack<Long> stack = new Stack<>();
+        HashSet<Long> visited = new HashSet<>();
+        stack.push(src);
+        visited.add(src);
+        while (!stack.isEmpty()) {
+            long currentNodeId = stack.pop();
+            LazyLongIterator it = g.successors(currentNodeId);
+            for (long neighborNodeId; (neighborNodeId = it.nextLong()) != -1;) {
+                if (!visited.contains(neighborNodeId)) {
+                    stack.push(neighborNodeId);
+                    visited.add(neighborNodeId);
+                    if (g.getNodeType(neighborNodeId) == Node.Type.REV) {
+                        Long ts = g.getCommitterTimestamp(neighborNodeId);
+                        if (ts != null && ts < oldestRevTs) {
+                            oldestRev = neighborNodeId;
+                            oldestRevTs = ts;
+                        }
+                    }
+                }
+            }
+        }
+        return oldestRev;
+    }
+
+    // Usage: $PROGRAM <graph_basename> <object_swhid>
+    public static void main(String[] args) {
+        // Load the backward (= transposed) graph as a SwhUnidirectionalGraph
+        SwhUnidirectionalGraph g = SwhUnidirectionalGraph.loadMapped(args[0] + "-transposed");
+        g.loadCommitterTimestamps();
+        long node = g.getNodeID(new SWHID(args[1]));
+        long oldestRev = findEarliestRevisionContaining(g, node);
+        System.out.println(g.getSWHID(oldestRev));
+    }
+
+
+
+
+Bidirectional Graph
+-------------------
+
+
+BidirectionalImmutableGraph
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+While ``graph-transposed`` can be loaded as a simple SwhUnidirectionalGraph and
+then manipulated just like the forward graph, it is often convenient to have
+*both* the forward and the backward graph in memory. Some traversal algorithms
+require first going down in the forward graph to select some nodes, then going
+up to find their provenance.
+
+To achieve that, we use the `BidirectionalImmutableGraph
+`_
+class from WebGraph, which stores both a graph and its transpose.
+This class provides the following methods to iterate on the **backward** graph,
+shown here with their counterparts for the forward graph:
+
+..
list-table:: + :header-rows: 1 + + * - Forward graph operation + - Backward graph operation + + * - ``outdegree(k)`` + - ``indegree(k)`` + + * - ``successors(k)`` + - ``predecessors(k)`` + +In addition, the class offers a few convenience methods which are generally +useful when you have both a graph and its transpose: + +- ``transpose()`` returns the transpose of the BidirectionalImmutableGraph by + inverting the references to the forward and the backward graphs. Successors + become predecessors, and vice-versa. +- ``symmetrize()`` returns the symmetric (= undirected) version of the + bidirectional graph. It is implemented by a union between the forward and the + backward graph, which basically boils down to removing the directionality of + the edges (the successors of a node are also its predecessors). + + +SwhBidirectionalGraph +~~~~~~~~~~~~~~~~~~~~~ + +Like for ImmutableGraph, we extend the BidirectionalImmutableGraph with +SWH-specific methods, in the subclass ``SwhBidirectionalGraph``. Notably, it +contains the method ``labelledPredecessors()``, the equivalent of +``labelledSuccessors()`` but on the backward graph. + +Because SwhUnidirectionalGraph inherits from ImmutableGraph, and +SwhBidirectionalGraph inherits from BidirectionalImmutableGraph, we put the +common behavior between the two classes in a SwhGraph interface, which can +represent either an unidirectional or a bidirectional graph. + +To avoid loading the node properties two times (once for each direction), they +are stored in a separate class called SwhGraphProperties. In a +SwhBidirectionalGraph, the two SwhUnidirectionalGraph share their node +properties in memory by storing references to the same SwhGraphProperty +object. + +.. code-block:: text + + + ┌──────────────┐ + │ImmutableGraph◄────────┐ + └────▲─────────┘ │extends + │ │ + │ ┌──────────┴────────────────┐ + extends│ │BidirectionalImmutableGraph│ + │ └────────────▲──────────────┘ + │ │extends + ┌──────────────┴───────┐ ┌──────┴──────────────┐ + │SwhUnidirectionalGraph│◄────┤SwhBidirectionalGraph│ + └──┬──────────────┬────┘ └────────┬───────────┬┘ + │ │ contains x2 │ │ + │ │ │ │ + │ implements│ │implements │ + │ ┌▼──────────┐ │ │ + │ │SwhGraph(I)◄────────┘ │ + contains│ └───────────┘ │contains + │ │ + │ ┌──────────────────┐ │ + └────────────►SwhGraphProperties◄──────────────┘ + └──────────────────┘ + + +Example: Find all the shared-commit forks of a given origin +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +It is possible to define the *forks* of an origin as being the set of origins +which share at least one revision with that origin. + +The following code loads the graph in both directions using a +SwhBidirectionalGraph. Given an origin SWHID, it first walks the *forward* +graph to find all its root revisions. It then walks the *backward* graph to +find all the origins containing these root revisions, i.e., its *forks*. + +.. 
code-block:: java
+
+    public static void findSharedCommitForks(SwhBidirectionalGraph g, long srcOrigin) {
+        Stack<Long> forwardStack = new Stack<>();
+        HashSet<Long> forwardVisited = new HashSet<>();
+        Stack<Long> backwardStack = new Stack<>();
+        HashSet<Long> backwardVisited = new HashSet<>();
+
+        // First traversal (forward graph): find all the root revisions of the
+        // origin
+        forwardStack.push(srcOrigin);
+        forwardVisited.add(srcOrigin);
+        while (!forwardStack.isEmpty()) {
+            long curr = forwardStack.pop();
+            LazyLongIterator it = g.successors(curr);
+            boolean isRootRevision = true;
+            for (long succ; (succ = it.nextLong()) != -1;) {
+                Node.Type nt = g.getNodeType(succ);
+                if (!forwardVisited.contains(succ)
+                        && nt != Node.Type.DIR && nt != Node.Type.CNT) {
+                    forwardStack.push(succ);
+                    forwardVisited.add(succ);
+                    isRootRevision = false;
+                }
+            }
+            if (g.getNodeType(curr) == Node.Type.REV && isRootRevision) {
+                // Found a root revision, add it to the second stack
+                backwardStack.push(curr);
+                backwardVisited.add(curr);
+            }
+        }
+
+        // Second traversal (backward graph): find all the origins containing
+        // any of these root revisions and print them
+        while (!backwardStack.isEmpty()) {
+            long curr = backwardStack.pop();
+            LazyLongIterator it = g.predecessors(curr);
+            for (long succ; (succ = it.nextLong()) != -1;) {
+                Node.Type nt = g.getNodeType(succ);
+                if (!backwardVisited.contains(succ)) {
+                    backwardStack.push(succ);
+                    backwardVisited.add(succ);
+                    if (nt == Node.Type.ORI) {
+                        // Found an origin, print it.
+                        System.out.println(g.getSWHID(succ));
+                    }
+                }
+            }
+        }
+    }
+
+    // Usage: $PROGRAM <graph_basename> <origin_swhid>
+    public static void main(String[] args) {
+        // Load both forward and backward graphs as a SwhBidirectionalGraph
+        SwhBidirectionalGraph g = SwhBidirectionalGraph.loadMapped(args[0]);
+        long node = g.getNodeID(new SWHID(args[1]));
+        findSharedCommitForks(g, node);
+    }
+
+
+Large-scale processing
+----------------------
+
+Multithreading
+~~~~~~~~~~~~~~
+
+ImmutableGraph is not thread-safe. When writing multithreaded algorithms,
+calling ``successors()`` on the same graph from multiple threads will return
+garbage.
+
+Instead, each thread should create its own "lightweight copy" of the graph by
+calling ``.copy()``. This will not actually copy the entire graph data, which
+will remain shared across threads, but it will create new instances of the
+iterators so that each thread can independently iterate on the graph data.
+
+
+Data structures for large traversals
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When doing very large traversals, such as a BFS on the entire graph, the
+usual data structures (HashSet, Stack, ArrayDeque, etc.) will be quite
+inefficient. If you know you are going to traverse large parts of the graph,
+it's better to use more appropriate data structures, a lot of which can be
+found in the dsiutils library. In particular:
+
+- `LongArrayBitVector
+  `_
+  is an efficient bit-vector implementation, which can be used to store the
+  nodes that have already been seen in the visit. Its memory footprint is too
+  big to use for small traversals, but it is very efficient to traverse the
+  full graph, as every node only takes a single bit (see the sketch after this
+  list).
+
+- `ByteDiskQueue
+  `_ can
+  be used to efficiently store the queue of nodes to visit on disk, when it is
+  too large to fit in RAM.
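+
+As an example of the first point, a whole-graph BFS could look like the
+following minimal sketch, which marks visited nodes in a ``LongArrayBitVector``
+(from ``it.unimi.dsi.bits``) and keeps the frontier in a fastutil
+``LongArrayFIFOQueue`` (from ``it.unimi.dsi.fastutil.longs``); the method name
+and the queue choice are illustrative assumptions, not part of the swh-graph
+API:
+
+.. code-block:: java
+
+    public static void visitAllBFS(ImmutableGraph graph) {
+        long numNodes = graph.numNodes();
+        // One bit per node: the visited set takes about numNodes / 8 bytes in total.
+        LongArrayBitVector visited = LongArrayBitVector.ofLength(numNodes);
+        // Primitive long queue: avoids boxing every node ID.
+        LongArrayFIFOQueue queue = new LongArrayFIFOQueue();
+        for (long start = 0; start < numNodes; start++) {
+            if (visited.getBoolean(start)) {
+                continue;
+            }
+            visited.set(start);
+            queue.enqueue(start);
+            while (!queue.isEmpty()) {
+                long curr = queue.dequeueLong();
+                LazyLongIterator it = graph.successors(curr);
+                for (long succ; (succ = it.nextLong()) != -1;) {
+                    if (!visited.getBoolean(succ)) {
+                        visited.set(succ);
+                        queue.enqueue(succ);
+                    }
+                }
+            }
+        }
+    }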
+
+Other types in dsiutils and fastutil can save significant memory:
+``LongArrayList`` saves at least 8 bytes per entry over ``ArrayList``,
+and ``Long2LongOpenHashMap`` saves at least 16 bytes for every entry over
+``HashMap``. We strongly recommend reading the documentation of the
+unimi libraries and looking at the code for usage examples.
+
+
+BigArrays
+~~~~~~~~~
+
+When working with the Software Heritage graph, it is often necessary to store
+large arrays of values, with a size exceeding 2^32 items. Unfortunately,
+standard Java arrays do not support this.
+
+To circumvent this, WebGraph uses the `BigArrays scheme
+`_ from
+the fastutil library: "big arrays" are stored as arrays of arrays, supporting
+quadrillions of records.
+
+A BigArray ``long[][] a`` can be used with the following methods:
+
+- ``BigArrays.get(a, i)`` to get the value at index *i*.
+- ``BigArrays.set(a, i, v)`` to set the value at index *i* to *v*.
+- ``BigArrays.length(a)`` to get the total length of the bigarray.
diff --git a/docs/memory.rst b/docs/memory.rst
new file mode 100644
index 0000000..f30f9c4
--- /dev/null
+++ b/docs/memory.rst
@@ -0,0 +1,130 @@
+.. _swh-graph-memory:
+
+Memory & Performance tuning
+===========================
+
+This page discusses various considerations related to memory usage and
+performance tuning when using the ``swh-graph`` library to load large
+compressed graphs.
+
+JVM options
+-----------
+
+In production, we tend to use very large servers which have enough RAM to load
+the entire graph in RAM. In these setups, the default JVM options are often
+suboptimal. We recommend starting the JVM with the following options, which
+tend to significantly improve performance::
+
+    java \
+        -ea \
+        -server \
+        -XX:PretenureSizeThreshold=512M \
+        -XX:MaxNewSize=4G \
+        -XX:+UseLargePages \
+        -XX:+UseTransparentHugePages \
+        -XX:+UseNUMA \
+        -XX:+UseTLAB \
+        -XX:+ResizeTLAB
+
+These options are documented in the manual of ``java(1)`` and in the Oracle
+documentation.
+
+
+Temporary directory
+-------------------
+
+Many of the graph algorithms (either for compression or traversal) tend to
+offload some of their run-time memory to disk. For instance, the `BFS
+`_
+algorithm in the LAW library uses a temporary directory to write its queue of
+nodes to visit.
+
+Because these can be quite large and sometimes overflow the default ``/tmp``
+partition, it is advised to systematically specify a path to a local temporary
+directory with enough space to accommodate the needs of the Java programs. This
+can be done using the ``-Djava.io.tmpdir`` parameter on the Java CLI::
+
+    java -Djava.io.tmpdir=/srv/softwareheritage/ssd/tmp
+
+
+Memory mapping vs Direct loading
+--------------------------------
+
+The main dial you can use to manage your memory usage is to choose between
+memory-mapping and direct-loading the graph data. The different loading modes
+available when loading the graph are documented in :ref:`swh-graph-java-api`.
+
+Loading in mapped mode will not load any extra data in RAM, but will instead
+use the ``mmap(2)`` syscall to put the graph file located on disk in the
+virtual address space. The Linux kernel will then be free to arbitrarily cache
+the file, either partially or in its entirety, depending on the available
+memory space.
+
+In our experiments, memory-mapping a small graph from an SSD only incurs a
+relatively small slowdown (about 15-20%).
However, when the graph is too big to +fit in RAM, the kernel has to constantly invalidate pages to cache newly +accessed sections, which incurs a very large performance penalty. A full +traversal of a large graph that usually takes about 20 hours when loaded in +main memory could take more than a year when mapped from a hard drive! + +When deciding what to direct-load and what to memory-map, here are a few rules +of thumb: + +- If you don't need random access to the graph edges, you can consider using + the "offline" loading mode. The offsets won't be loaded which will save + dozens of gigabytes of RAM. + +- If you only need to query some specific nodes or run trivial traversals, + memory-mapping the graph from a HDD should be a reasonable solution that + doesn't take an inordinate amount of time. It might be bad for your disks, + though. + +- If you are constrained in available RAM, memory-mapping the graph from an SSD + offers reasonable performance for reasonably complex algorithms. + +- If you have a heavy workload (i.e. running a full traversal of the entire + graph) and you can afford the RAM, direct loading will be orders of magnitude + faster than all the above options. + + +Sharing mapped data across processes +------------------------------------ + +Often, multiple processes can be working on the same data (mappings or the +graph itself), for instance when running different experiments on the same +graph. This is problematic in terms of RAM usage when using direct memory +loading, as the same data of potentially hundreds of gigabytes is loaded in +memory twice. +As we have seen, memory-mapping can be used to avoid storing redundant data in +RAM, but comes at the cost of potentially slower I/O as the data is no longer +guaranteed to be loaded in main memory and is reliant on kernel heuristics. + +To efficiently share data across two different compressed graph processes, +another option is to copy graph data to a ``tmpfs`` not backed by a disk swap, +which forces the kernel to load it entirely in RAM. Subsequent memory-mappings +of the files stored in the tmpfs will simply map the data stored in RAM to +virtual memory pages, and return a pointer to the in-memory structure. + +To do so, we create a directory in ``/dev/shm`` in which we **copy** all the +files that we want to direct-load in RAM, and **symlink** all the others. Then, +we load the graph using the memory-mapped loading mode, which makes it use the +shared memory stored in the tmpfs under the hood. + +Here is a systemd service that can be used to perform this task automatically: + +.. code-block:: ini + + [Unit] + Description=swh-graph memory sharing in tmpfs + + [Service] + Type=oneshot + RemainAfterExit=yes + ExecStart=mkdir -p /dev/shm/swh-graph/default + ExecStart=sh -c "ln -s /.../compressed/* /dev/shm/swh-graph/default" + ExecStart=cp --remove-destination /.../compressed/graph.graph /dev/shm/swh-graph/default + ExecStart=cp --remove-destination /.../compressed/graph-transposed.graph /dev/shm/swh-graph/default + ExecStop=rm -rf /dev/shm/swh-graph/default + + [Install] + WantedBy=multi-user.target diff --git a/docs/quickstart.rst b/docs/quickstart.rst index 7ac51bd..425a547 100644 --- a/docs/quickstart.rst +++ b/docs/quickstart.rst @@ -1,174 +1,132 @@ +.. _swh-graph-quickstart: + Quickstart ========== -This quick tutorial shows how to compress and browse a graph using ``swh.graph``. - -It does not cover the technical details behind the graph compression techniques -(refer to :ref:`graph-compression`). 
- +This quick tutorial shows how to start the ``swh.graph`` service to query +an existing compressed graph with the high-level HTTP API. Dependencies ------------ In order to run the ``swh.graph`` tool, you will need Python (>= 3.7) and Java -JRE, you do not need the JDK if you install the package from pypi, but may want -to install it if you want to hack the code or install it from this git -repository. To compress a graph, you will need zstd_ compression tools. - -It is highly recommended to install this package in a virtualenv. - -On a Debian stable (buster) system: - -.. code:: bash - - $ sudo apt install python3-virtualenv default-jre zstd +JRE. On a Debian system: +.. code:: console -.. _zstd: https://facebook.github.io/zstd/ + $ sudo apt install python3 python3-venv default-jre - -Install -------- +Installing swh.graph +-------------------- Create a virtualenv and activate it: -.. code:: bash +.. code:: console - ~/tmp$ mkdir swh-graph-tests - ~/tmp$ cd swh-graph-tests - ~/t/swh-graph-tests$ virtualenv swhenv - ~/t/swh-graph-tests$ . swhenv/bin/activate + $ python3 -m venv .venv + $ source .venv/bin/activate Install the ``swh.graph`` python package: -.. code:: bash +.. code:: console - (swhenv) ~/t/swh-graph-tests$ pip install swh.graph + (venv) $ pip install swh.graph [...] - (swhenv) ~/t/swh-graph-tests swh graph --help + (venv) $ swh graph --help Usage: swh graph [OPTIONS] COMMAND [ARGS]... Software Heritage graph tools. Options: -C, --config-file FILE YAML configuration file -h, --help Show this message and exit. Commands: - api-client client for the graph RPC service - cachemount Cache the mmapped files of the compressed graph in a tmpfs. compress Compress a graph using WebGraph Input: a pair of files... - map Manage swh-graph on-disk maps rpc-serve run the graph RPC service -Compression ------------ - -Existing datasets -^^^^^^^^^^^^^^^^^ - -You can directly use compressed graph datasets provided by Software Heritage. -Here is a small and realistic dataset (3.1GB): - - https://annex.softwareheritage.org/public/dataset/graph/latest/popular-3k-python/python3kcompress.tar - -.. code:: bash - (swhenv) ~/t/swh-graph-tests$ curl -O https://annex.softwareheritage.org/public/dataset/graph/latest/popular-3k-python/python3kcompress.tar - (swhenv) ~/t/swh-graph-tests$ tar xvf python3kcompress.tar - (swhenv) ~/t/swh-graph-tests$ touch python3kcompress/*.obl # fix the mtime of cached offset files to allow faster loading +.. _swh-graph-retrieving-compressed: -Note: not for the faint heart, but the full dataset is available at: +Retrieving a compressed graph +----------------------------- - https://annex.softwareheritage.org/public/dataset/graph/latest/compressed/ +Software Heritage provides a list of off-the-shelf datasets that can be used +for various research or prototyping purposes. Most of them are available in +*compressed* representation, i.e., in a format suitable to be loaded and +queried by the ``swh-graph`` library. -Own datasets -^^^^^^^^^^^^ +All the publicly available datasets are documented on this page: +https://docs.softwareheritage.org/devel/swh-dataset/graph/dataset.html -A graph is described as both its adjacency list and the set of nodes -identifiers in plain text format. Such graph example can be found in the -``swh/graph/tests/dataset/`` folder. +A good way of retrieving these datasets is to use the `AWS S3 CLI +`_. 
-You can compress the example graph on the command line like this: +Here is an example with the dataset ``2021-03-23-popular-3k-python``, which has +a relatively reasonable size (~15 GiB including property data, with +the compressed graph itself being less than 700 MiB): -.. code:: bash +.. code:: console + (venv) $ pip install awscli + [...] + (venv) $ mkdir -p 2021-03-23-popular-3k-python/compressed + (venv) $ cd 2021-03-23-popular-3k-python/ + (venv) $ aws s3 cp --recursive s3://softwareheritage/graph/2021-03-23-popular-3k-python/compressed/ compressed - (swhenv) ~/t/swh-graph-tests$ swh graph compress --graph swh/graph/tests/dataset/example --outdir output/ - [...] +You can also retrieve larger graphs, but note that these graphs are generally +intended to be loaded fully in RAM, and do not fit on ordinary desktop +machines. The server we use in production to run the graph service has more +than 700 GiB of RAM. These memory considerations are discussed in more details +in :ref:`swh-graph-memory`. - (swhenv) ~/t/swh-graph-tests$ ls output/ - example-bv.properties example.mph example.obl example.outdegree example.swhid2node.bin example-transposed.offsets - example.graph example.node2swhid.bin example.offsets example.properties example-transposed.graph example-transposed.properties - example.indegree example.node2type.map example.order example.stats example-transposed.obl +**Note:** for testing purposes, a fake test dataset is available in the +``swh-graph`` repository, with just a few dozen nodes. Its basename is +``swh-graph/swh/graph/tests/dataset/compressed/example``. API server ---------- -To start a ``swh.graph`` API server of a compressed graph dataset, run: +To start a ``swh.graph`` API server of a compressed graph dataset, you need to +use the ``rpc-serve`` command with the basename of the graph, which is the path prefix +of all the graph files (e.g., with the basename ``compressed/graph``, it will +attempt to load the files located at +``compressed/graph.{graph,properties,offsets,...}``. -.. code:: bash +In our example: + +.. code:: console - (swhenv) ~/t/swh-graph-tests$ swh graph rpc-serve -g output/example - Loading graph output/example ... + (venv) $ swh graph rpc-serve -g compressed/graph + Loading graph compressed/graph ... Graph loaded. ======== Running on http://0.0.0.0:5009 ======== (Press CTRL+C to quit) From there you can use this endpoint to query the compressed graph, for example -with httpie_ (``sudo apt install``) from another terminal: +with httpie_ (``sudo apt install httpie``): .. _httpie: https://httpie.org .. code:: bash - ~/tmp$ http :5009/graph/visit/nodes/swh:1:rel:0000000000000000000000000000000000000010 - HTTP/1.1 200 OK - Content-Type: text/plain - Date: Tue, 15 Sep 2020 08:33:25 GMT - Server: Python/3.8 aiohttp/3.6.2 - Transfer-Encoding: chunked - - swh:1:rel:0000000000000000000000000000000000000010 - swh:1:rev:0000000000000000000000000000000000000009 - swh:1:rev:0000000000000000000000000000000000000003 - swh:1:dir:0000000000000000000000000000000000000002 - swh:1:cnt:0000000000000000000000000000000000000001 - swh:1:dir:0000000000000000000000000000000000000008 - swh:1:dir:0000000000000000000000000000000000000006 - swh:1:cnt:0000000000000000000000000000000000000004 - swh:1:cnt:0000000000000000000000000000000000000005 - swh:1:cnt:0000000000000000000000000000000000000007 - - -Running the existing ``python3kcompress`` dataset: - -.. 
code:: bash - - (swhenv) ~/t/swh-graph-tests$ swh graph rpc-serve -g python3kcompress/python3k - Loading graph python3kcompress/python3k ... - Graph loaded. - ======== Running on http://0.0.0.0:5009 ======== - (Press CTRL+C to quit) - - ~/tmp$ http :5009/graph/leaves/swh:1:dir:432d1b21c1256f7408a07c577b6974bbdbcc1323 HTTP/1.1 200 OK Content-Type: text/plain Date: Tue, 15 Sep 2020 08:35:19 GMT Server: Python/3.8 aiohttp/3.6.2 Transfer-Encoding: chunked swh:1:cnt:33af56e02dd970873d8058154bf016ec73b35dfb swh:1:cnt:b03b4ffd7189ae5457d8e1c2ee0490b1938fd79f swh:1:cnt:74d127c2186f7f0e8b14a27249247085c49d548a swh:1:cnt:c0139aa8e79b338e865a438326629fa22fa8f472 [...] swh:1:cnt:a6b60e797063fef707bbaa4f90cfb4a2cbbddd4a swh:1:cnt:cc0a1deca559c1dd2240c08156d31cde1d8ed406 - -See the documentation of the :ref:`API ` for more details. +See the documentation of the :ref:`API ` for more details on how +to use the HTTP graph querying API. diff --git a/docs/use-cases.rst b/docs/use-cases.rst deleted file mode 100644 index ce01d8c..0000000 --- a/docs/use-cases.rst +++ /dev/null @@ -1,167 +0,0 @@ -========= -Use cases -========= - - -This document lists use cases and benchmark scenarios for the Software Heritage -graph service. - - -Conventions -=========== - -- **Node identification**: in the following, nodes are always identified by - their :ref:`SWHIDs `. - - -Use cases -========= - - -Browsing --------- - -The following use cases require traversing the *forward graph*. - -- **ls**: given a directory node, list (non recursively) all linked nodes of - type directory and content - - Implementation:: - - /graph/neighbors/:DIR_ID?edges=dir:cnt,dir:dir - -- **ls -R**: given a directory node, recursively list all linked nodes of type - directory and content - - Implementation:: - - /graph/visit/paths/:DIR_ID?edges=dir:cnt,dir:dir - -- **git log**: given a revision node, recursively list all linked nodes of type - revision - - Implementation:: - - /graph/visit/nodes/:REV_ID?edges=rev:rev - - -Vault ------ - -The following use cases require traversing the *forward graph*. - -- **tarball** (same as *ls -R* above) - -- **git bundle**: given a node, recursively list all linked nodes of any kind - - Implementation:: - - /graph/visit/nodes/:NODE_ID?edges=* - - -Provenance ----------- - -The following use cases require traversing the *backward (transposed) -graph*. - -- **commit provenance**: given a content or directory node, return *a* commit - whose directory (recursively) contains it - - Implementation:: - - /graph/walk/:NODE_ID/rev?direction=backward&edges=dir:dir,cnt:dir,dir:rev - -- **complete commit provenance**: given a content or directory node, return - *all* commits whose directory (recursively) contains it - - Implementation:: - - /graph/leaves/:NODE_ID?direction=backward&edges=dir:dir,cnt:dir,dir:rev - -- **origin provenance**: given a content, directory, or commit node, return - *an* origin that has at least one snapshot that (recursively) contains it - - Implementation:: - - /graph/walk/:NODE_ID/ori?direction=backward&edges=* - -- **complete origin provenance**: given a content, directory, or commit node, - return *all* origins that have at least one snapshot that (recursively) - contains it - - Implementation:: - - /graph/leaves/:NODE_ID?direction=backward&edges=* - -- *SLOC tracking*: left as future work - - -Provenance statistics -~~~~~~~~~~~~~~~~~~~~~ - -The following use cases require traversing the *backward (transposed) -graph*. 
- -- **content popularity across commits**: for each content, count the number of - commits (or *commit popularity*) that link to a directory that (recursively) - includes it. Plot the distribution of content popularity across commits. - - Implementation: apply *complete commit provenance* to each content node, - count the returned commits, aggregate. - -- **commit popularity across origins**: for each commit, count the number of - origins (or *origin popularity*) that have a snapshot that (recursively) - includes it. Plot the distribution of commit popularity across origins. - - Implementation: apply *complete origin provenance* to each commit node, count - the returned origins, aggregate. - -- *SLOC popularity across contents*: left as future work - -The following use cases require traversing the *forward graph*. - -- **revision size** (as n. of contents) distribution: for each revision, count - the number of contents that are (recursively) reachable from it. Plot the - distribution of revision sizes. - -- **origin size** (as n. of revisions) distribution: for each origin, count the - number of revisions that are (recursively) reachable from it. Plot the - distribution of origin sizes. - - -Benchmarks -========== - -Notes on how to benchmark graph access: - -- separate pure-graph timings from timings related to use additional mappings - (e.g., node types), no matter if the mappings are in-memory or on-disk - -- separate in-memory timings from on-disk timings; in particular, separate the - timing of translating node identifiers between internal integers and SWHIDs - -- for each use case that requires a node as input, we will randomize the choice - of the input node and repeat the experiment a suitable number of times; where - possible we will aggregate results computing basic statistics (average, - standard deviation), as well as normalize results w.r.t. the “size” of the - chosen node (e.g., number of nodes/path length in the resulting visit) - - -Basic benchmarks ----------------- - -- **Edge traversal**: given a node, retrieve the first node in its adjacency - list. - - For reference: Apostolico, Drovandi in *Graph Compression by BFS* report - times to retrieve the adjacency list of a node (and/or test if an edge exists - between two nodes) in the 2-3 us range, for the largest graph in their - experiments (22 M nodes, 600 M edges). - - -Each use case is a benchmark ----------------------------- - -In addition to abstract benchmark, we will use each use case above as a -scenario-based benchmark. 
diff --git a/java/src/main/java/org/softwareheritage/graph/compress/NodeMapBuilder.java b/java/src/main/java/org/softwareheritage/graph/compress/NodeMapBuilder.java index 88efd47..80e0c7e 100644 --- a/java/src/main/java/org/softwareheritage/graph/compress/NodeMapBuilder.java +++ b/java/src/main/java/org/softwareheritage/graph/compress/NodeMapBuilder.java @@ -1,195 +1,194 @@ package org.softwareheritage.graph.compress; import com.github.luben.zstd.ZstdInputStream; import it.unimi.dsi.bits.LongArrayBitVector; import it.unimi.dsi.fastutil.BigArrays; import it.unimi.dsi.fastutil.Size64; import it.unimi.dsi.fastutil.io.BinIO; import it.unimi.dsi.fastutil.longs.LongBigArrays; import it.unimi.dsi.fastutil.longs.LongBigList; import it.unimi.dsi.fastutil.objects.Object2LongFunction; import it.unimi.dsi.io.FastBufferedReader; import it.unimi.dsi.io.LineIterator; import it.unimi.dsi.logging.ProgressLogger; import org.slf4j.Logger; import org.slf4j.LoggerFactory; import org.softwareheritage.graph.Node; import org.softwareheritage.graph.SWHID; import org.softwareheritage.graph.maps.NodeIdMap; import org.softwareheritage.graph.maps.NodeTypesMap; import java.io.*; import java.nio.charset.StandardCharsets; import java.util.Scanner; import java.util.concurrent.TimeUnit; /** * Create maps needed at runtime by the graph service, in particular: *

 * <ul>
 * <li>SWHID → WebGraph long node id</li>
-* <li>WebGraph long node id → SWHID (converse of the former)</li>
+* <li>WebGraph long node id → SWHID</li>
 * <li>WebGraph long node id → SWH node type (enum)</li>
 * </ul>
* * @author The Software Heritage developers */ public class NodeMapBuilder { final static String SORT_BUFFER_SIZE = "40%"; final static Logger logger = LoggerFactory.getLogger(NodeMapBuilder.class); /** * Main entrypoint. * * @param args command line arguments */ public static void main(String[] args) throws IOException { if (args.length != 2) { logger.error("Usage: COMPRESSED_GRAPH_BASE_NAME TEMP_DIR < NODES_CSV"); System.exit(1); } String graphPath = args[0]; String tmpDir = args[1]; logger.info("starting maps generation..."); precomputeNodeIdMap(graphPath, tmpDir); logger.info("maps generation completed"); } /** * Computes and dumps on disk mapping files. * * @param graphPath path of the compressed graph */ static void precomputeNodeIdMap(String graphPath, String tmpDir) throws IOException { ProgressLogger plSWHID2Node = new ProgressLogger(logger, 10, TimeUnit.SECONDS); ProgressLogger plNode2SWHID = new ProgressLogger(logger, 10, TimeUnit.SECONDS); plSWHID2Node.itemsName = "nodes"; plNode2SWHID.itemsName = "nodes"; // first half of SWHID->node mapping: SWHID -> WebGraph MPH (long) Object2LongFunction mphMap = NodeIdMap.loadMph(graphPath + ".mph"); long nbIds = (mphMap instanceof Size64) ? ((Size64) mphMap).size64() : mphMap.size(); plSWHID2Node.expectedUpdates = nbIds; plNode2SWHID.expectedUpdates = nbIds; // second half of SWHID->node mapping: WebGraph MPH (long) -> BFS order (long) long[][] bfsMap = LongBigArrays.newBigArray(nbIds); logger.info("loading BFS order file..."); long loaded = BinIO.loadLongs(graphPath + ".order", bfsMap); logger.info("BFS order file loaded"); if (loaded != nbIds) { logger.error("graph contains " + nbIds + " nodes, but read " + loaded); System.exit(2); } /* * Read on stdin a list of SWHIDs, hash them with MPH, then permute them according to the .order * file */ FastBufferedReader buffer = new FastBufferedReader( new InputStreamReader(new ZstdInputStream(new BufferedInputStream(System.in)))); LineIterator swhidIterator = new LineIterator(buffer); /* * The WebGraph node id -> SWHID mapping can be obtained from the SWHID->node one by numerically * sorting on node id and sequentially writing obtained SWHIDs to a binary map. Delegates the * sorting job to /usr/bin/sort via pipes */ ProcessBuilder processBuilder = new ProcessBuilder(); processBuilder.command("sort", "--numeric-sort", "--key", "2", "--buffer-size", SORT_BUFFER_SIZE, "--temporary-directory", tmpDir); Process sort = processBuilder.start(); BufferedOutputStream sort_stdin = new BufferedOutputStream(sort.getOutputStream()); BufferedInputStream sort_stdout = new BufferedInputStream(sort.getInputStream()); // for the binary format of nodeToSwhidMap, see Python module swh.graph.swhid:IntToSwhidMap try (BufferedOutputStream nodeToSwhidMap = new BufferedOutputStream( new FileOutputStream(graphPath + NodeIdMap.NODE_TO_SWHID))) { /* * background handler for sort output, it will be fed SWHID/node pairs, and will itself fill * nodeToSwhidMap as soon as data from sort is ready. */ SortOutputHandler outputHandler = new SortOutputHandler(sort_stdout, nodeToSwhidMap, plNode2SWHID); outputHandler.start(); /* * Type map from WebGraph node ID to SWH type. Used at runtime by pure Java graph traversals to * efficiently check edge restrictions. 
*/ final int nbBitsPerNodeType = (int) Math.ceil(Math.log(Node.Type.values().length) / Math.log(2)); LongArrayBitVector nodeTypesBitVector = LongArrayBitVector.ofLength(nbBitsPerNodeType * nbIds); LongBigList nodeTypesMap = nodeTypesBitVector.asLongBigList(nbBitsPerNodeType); plSWHID2Node.start("Hashing SWHIDs to fill sort input"); for (long iNode = 0; iNode < nbIds && swhidIterator.hasNext(); iNode++) { String swhidStr = swhidIterator.next().toString(); SWHID swhid = new SWHID(swhidStr); long mphId = mphMap.getLong(swhidStr.getBytes(StandardCharsets.US_ASCII)); long nodeId = BigArrays.get(bfsMap, mphId); sort_stdin.write((swhidStr + "\t" + nodeId + "\n").getBytes(StandardCharsets.US_ASCII)); nodeTypesMap.set(nodeId, swhid.getType().ordinal()); plSWHID2Node.lightUpdate(); } plSWHID2Node.done(); sort_stdin.close(); // write type map logger.info("storing type map"); BinIO.storeObject(nodeTypesMap, graphPath + NodeTypesMap.NODE_TO_TYPE); logger.info("type map stored"); // wait for nodeToSwhidMap filling try { logger.info("waiting for node2swhid map..."); int sortExitCode = sort.waitFor(); if (sortExitCode != 0) { logger.error("sort returned non-zero exit code: " + sortExitCode); System.exit(2); } outputHandler.join(); } catch (InterruptedException e) { logger.error("processing of sort output failed with: " + e); System.exit(2); } } } private static class SortOutputHandler extends Thread { private final Scanner input; private final OutputStream output; private final ProgressLogger pl; SortOutputHandler(InputStream input, OutputStream output, ProgressLogger pl) { this.input = new Scanner(input, StandardCharsets.US_ASCII); this.output = output; this.pl = pl; } public void run() { boolean sortDone = false; logger.info("node2swhid: waiting for sort output..."); while (input.hasNextLine()) { if (!sortDone) { sortDone = true; this.pl.start("filling node2swhid map"); } String line = input.nextLine(); // format: SWHID NODE_ID SWHID swhid = new SWHID(line.split("\\t")[0]); // get SWHID try { output.write(swhid.toBytes()); } catch (IOException e) { logger.error("writing to node->SWHID map failed with: " + e); } this.pl.lightUpdate(); } this.pl.done(); } } } diff --git a/swh/graph/webgraph.py b/swh/graph/webgraph.py index 9c765fd..4a6e9b6 100644 --- a/swh/graph/webgraph.py +++ b/swh/graph/webgraph.py @@ -1,356 +1,356 @@ # Copyright (C) 2019 The Software Heritage developers # See the AUTHORS file at the top-level directory of this distribution # License: GNU General Public License version 3, or any later version # See top-level LICENSE file for more information """WebGraph driver """ from datetime import datetime from enum import Enum import logging import os from pathlib import Path import subprocess from typing import Dict, List, Set from swh.graph.config import check_config_compress logger = logging.getLogger(__name__) class CompressionStep(Enum): EXTRACT_NODES = 1 MPH = 2 BV = 3 BFS = 4 PERMUTE_BFS = 5 TRANSPOSE_BFS = 6 SIMPLIFY = 7 LLP = 8 PERMUTE_LLP = 9 OBL = 10 COMPOSE_ORDERS = 11 STATS = 12 TRANSPOSE = 13 TRANSPOSE_OBL = 14 MAPS = 15 EXTRACT_PERSONS = 16 MPH_PERSONS = 17 NODE_PROPERTIES = 18 MPH_LABELS = 19 FCL_LABELS = 20 EDGE_LABELS = 21 CLEAN_TMP = 22 def __str__(self): return self.name # full compression pipeline COMP_SEQ = list(CompressionStep) # Mapping from compression steps to shell commands implementing them. Commands # will be executed by the shell, so be careful with meta characters. 
They are # specified here as lists of tokens that will be joined together only for ease # of line splitting. In commands, {tokens} will be interpolated with # configuration values, see :func:`compress`. STEP_ARGV: Dict[CompressionStep, List[str]] = { CompressionStep.EXTRACT_NODES: [ "{java}", "org.softwareheritage.graph.compress.ExtractNodes", "--format", "orc", "--temp-dir", "{tmp_dir}", "{in_dir}", "{out_dir}/{graph_name}", ], CompressionStep.MPH: [ "{java}", "it.unimi.dsi.sux4j.mph.GOVMinimalPerfectHashFunction", "--byte-array", "--temp-dir", "{tmp_dir}", "--decompressor", "com.github.luben.zstd.ZstdInputStream", "{out_dir}/{graph_name}.mph", "{out_dir}/{graph_name}.nodes.csv.zst", ], CompressionStep.BV: [ "{java}", "org.softwareheritage.graph.compress.ScatteredArcsORCGraph", "--temp-dir", "{tmp_dir}", "--function", "{out_dir}/{graph_name}.mph", "{in_dir}", "{out_dir}/{graph_name}-base", ], CompressionStep.BFS: [ "{java}", "it.unimi.dsi.law.big.graph.BFS", "{out_dir}/{graph_name}-base", "{out_dir}/{graph_name}-bfs.order", ], CompressionStep.PERMUTE_BFS: [ "{java}", "it.unimi.dsi.big.webgraph.Transform", "mapOffline", "{out_dir}/{graph_name}-base", "{out_dir}/{graph_name}-bfs", "{out_dir}/{graph_name}-bfs.order", "{batch_size}", "{tmp_dir}", ], CompressionStep.TRANSPOSE_BFS: [ "{java}", "it.unimi.dsi.big.webgraph.Transform", "transposeOffline", "{out_dir}/{graph_name}-bfs", "{out_dir}/{graph_name}-bfs-transposed", "{batch_size}", "{tmp_dir}", ], CompressionStep.SIMPLIFY: [ "{java}", "it.unimi.dsi.big.webgraph.Transform", "simplify", "{out_dir}/{graph_name}-bfs", "{out_dir}/{graph_name}-bfs-transposed", "{out_dir}/{graph_name}-bfs-simplified", ], CompressionStep.LLP: [ "{java}", "it.unimi.dsi.law.big.graph.LayeredLabelPropagation", "-g", "{llp_gammas}", "{out_dir}/{graph_name}-bfs-simplified", "{out_dir}/{graph_name}-llp.order", ], CompressionStep.PERMUTE_LLP: [ "{java}", "it.unimi.dsi.big.webgraph.Transform", "mapOffline", "{out_dir}/{graph_name}-bfs", "{out_dir}/{graph_name}", "{out_dir}/{graph_name}-llp.order", "{batch_size}", "{tmp_dir}", ], CompressionStep.OBL: [ "{java}", "it.unimi.dsi.big.webgraph.BVGraph", "--list", "{out_dir}/{graph_name}", ], CompressionStep.COMPOSE_ORDERS: [ "{java}", "org.softwareheritage.graph.compress.ComposePermutations", "{out_dir}/{graph_name}-bfs.order", "{out_dir}/{graph_name}-llp.order", "{out_dir}/{graph_name}.order", ], CompressionStep.STATS: [ "{java}", "it.unimi.dsi.big.webgraph.Stats", "{out_dir}/{graph_name}", ], CompressionStep.TRANSPOSE: [ "{java}", "it.unimi.dsi.big.webgraph.Transform", "transposeOffline", "{out_dir}/{graph_name}", "{out_dir}/{graph_name}-transposed", "{batch_size}", "{tmp_dir}", ], CompressionStep.TRANSPOSE_OBL: [ "{java}", "it.unimi.dsi.big.webgraph.BVGraph", "--list", "{out_dir}/{graph_name}-transposed", ], CompressionStep.MAPS: [ "{java}", "org.softwareheritage.graph.compress.NodeMapBuilder", "{out_dir}/{graph_name}", "{tmp_dir}", "< {out_dir}/{graph_name}.nodes.csv.zst", ], + CompressionStep.EXTRACT_PERSONS: [ + "{java}", + "org.softwareheritage.graph.compress.ExtractPersons", + "--temp-dir", + "{tmp_dir}", + "{in_dir}", + "{out_dir}/{graph_name}", + ], CompressionStep.MPH_PERSONS: [ "{java}", "it.unimi.dsi.sux4j.mph.GOVMinimalPerfectHashFunction", "--byte-array", "--decompressor", "com.github.luben.zstd.ZstdInputStream", "--temp-dir", "{tmp_dir}", "{out_dir}/{graph_name}.persons.mph", "{out_dir}/{graph_name}.persons.csv.zst", ], - CompressionStep.EXTRACT_PERSONS: [ - "{java}", - 
"org.softwareheritage.graph.compress.ExtractPersons", - "--temp-dir", - "{tmp_dir}", - "{in_dir}", - "{out_dir}/{graph_name}", - ], CompressionStep.NODE_PROPERTIES: [ "{java}", "org.softwareheritage.graph.compress.WriteNodeProperties", "{in_dir}", "{out_dir}/{graph_name}", ], CompressionStep.MPH_LABELS: [ "{java}", "it.unimi.dsi.sux4j.mph.LcpMonotoneMinimalPerfectHashFunction", "--byte-array", "--temp-dir", "{tmp_dir}", "--decompressor", "com.github.luben.zstd.ZstdInputStream", "{out_dir}/{graph_name}.labels.mph", "{out_dir}/{graph_name}.labels.csv.zst", ], CompressionStep.FCL_LABELS: [ "{java}", "it.unimi.dsi.big.util.MappedFrontCodedStringBigList", "--decompressor", "com.github.luben.zstd.ZstdInputStream", "{out_dir}/{graph_name}.labels.fcl", "< {out_dir}/{graph_name}.labels.csv.zst", ], CompressionStep.EDGE_LABELS: [ "{java}", "org.softwareheritage.graph.compress.LabelMapBuilder", "--temp-dir", "{tmp_dir}", "{in_dir}", "{out_dir}/{graph_name}", ], CompressionStep.CLEAN_TMP: [ "rm", "-rf", "{out_dir}/{graph_name}-base.graph", "{out_dir}/{graph_name}-base.offsets", "{out_dir}/{graph_name}-base.properties", "{out_dir}/{graph_name}-bfs-simplified.graph", "{out_dir}/{graph_name}-bfs-simplified.offsets", "{out_dir}/{graph_name}-bfs-simplified.properties", "{out_dir}/{graph_name}-bfs-transposed.graph", "{out_dir}/{graph_name}-bfs-transposed.offsets", "{out_dir}/{graph_name}-bfs-transposed.properties", "{out_dir}/{graph_name}-bfs.graph", "{out_dir}/{graph_name}-bfs.offsets", "{out_dir}/{graph_name}-bfs.order", "{out_dir}/{graph_name}-bfs.properties", "{out_dir}/{graph_name}-llp.order", "{tmp_dir}", ], } def do_step(step, conf): log_dir = Path(conf["out_dir"]) / "logs" log_dir.mkdir(exist_ok=True) step_logger = logger.getChild(f"steps.{step.name.lower()}") step_handler = logging.FileHandler( log_dir / ( f"{conf['graph_name']}" f"-{int(datetime.now().timestamp() * 1000)}" f"-{str(step).lower()}.log" ) ) step_logger.addHandler(step_handler) step_start_time = datetime.now() step_logger.info("Starting compression step %s at %s", step, step_start_time) cmd = " ".join(STEP_ARGV[step]).format(**conf) cmd_env = os.environ.copy() cmd_env["JAVA_TOOL_OPTIONS"] = conf["java_tool_options"] cmd_env["CLASSPATH"] = conf["classpath"] process = subprocess.Popen( ["/bin/bash", "-c", cmd], env=cmd_env, encoding="utf8", stdout=subprocess.PIPE, stderr=subprocess.STDOUT, ) step_logger.info("Running: %s", cmd) with process.stdout as stdout: for line in stdout: step_logger.info(line.rstrip()) rc = process.wait() if rc != 0: raise RuntimeError(f"Compression step {step} returned non-zero exit code {rc}") step_end_time = datetime.now() step_duration = step_end_time - step_start_time step_logger.info( "Compression step %s finished at %s (in %s)", step, step_end_time, step_duration, ) step_logger.removeHandler(step_handler) step_handler.close() return rc def compress( graph_name: str, in_dir: Path, out_dir: Path, steps: Set[CompressionStep] = set(COMP_SEQ), conf: Dict[str, str] = {}, ): """graph compression pipeline driver from nodes/edges files to compressed on-disk representation Args: graph_name: graph base name, relative to in_dir in_dir: input directory, where the uncompressed graph can be found out_dir: output directory, where the compressed graph will be stored steps: compression steps to run (default: all steps) conf: compression configuration, supporting the following keys (all are optional, so an empty configuration is fine and is the default) - batch_size: batch size for `WebGraph transformations `_; defaults to 1 
billion - classpath: java classpath, defaults to swh-graph JAR only - java: command to run java VM, defaults to "java" - java_tool_options: value for JAVA_TOOL_OPTIONS environment variable; defaults to various settings for high memory machines - logback: path to a logback.xml configuration file; if not provided a temporary one will be created and used - max_ram: maximum RAM to use for compression; defaults to available virtual memory - tmp_dir: temporary directory, defaults to the "tmp" subdir of out_dir """ if not steps: steps = set(COMP_SEQ) conf = check_config_compress(conf, graph_name, in_dir, out_dir) compression_start_time = datetime.now() logger.info("Starting compression at %s", compression_start_time) seq_no = 0 for step in COMP_SEQ: if step not in steps: logger.debug("Skipping compression step %s", step) continue seq_no += 1 logger.info("Running compression step %s (%s/%s)", step, seq_no, len(steps)) do_step(step, conf) compression_end_time = datetime.now() compression_duration = compression_end_time - compression_start_time logger.info("Completed compression in %s", compression_duration)