diff --git a/docs/api.rst b/docs/api.rst --- a/docs/api.rst +++ b/docs/api.rst @@ -1,7 +1,16 @@ .. _swh-graph-api: -Graph RPC API -============= +Graph Querying HTTP API +======================= + +The Graph Querying API is a high-level HTTP API intended to run common, +relatively simple traversal queries on the compressed graph. + +The client/server architecture allows it to only load the graph in memory once +then serve multiple different requests. However, it is limited in expressivity; +more complex or resource-intensive queries should rather use the +:ref:`Low-level Java API ` to run them as standalone +programs. Terminology @@ -53,8 +62,11 @@ - ``"cnt,snp"`` accepted node types returned in the query results. +Endpoints +--------- + Leaves ------- +~~~~~~ .. http:get:: /graph/leaves/:src @@ -97,7 +109,7 @@ Neighbors ---------- +~~~~~~~~~ .. http:get:: /graph/neighbors/:src @@ -138,7 +150,7 @@ Walk ----- +~~~~ .. .. http:get:: /graph/walk/:src/:dst @@ -246,7 +258,7 @@ Visit ------ +~~~~~ .. http:get:: /graph/visit/nodes/:src .. http:get:: /graph/visit/edges/:src @@ -340,7 +352,7 @@ Counting results ----------------- +~~~~~~~~~~~~~~~~ The following method variants, with trailing `/count` added, behave like their already discussed counterparts but, instead of returning results, return the @@ -363,7 +375,7 @@ Stats ------ +~~~~~ .. http:get:: /graph/stats @@ -405,3 +417,125 @@ "avg": 0.6107127825377487 } } + + +Use-case examples +----------------- + +This section showcases how to leverage the endpoints of the HTTP API described +above for some common use-cases. + + +Browsing +~~~~~~~~ + +The following use cases require traversing the *forward graph*. + +- **ls**: given a directory node, list (non recursively) all linked nodes of + type directory and content + + Endpoint:: + + /graph/neighbors/:DIR_ID?edges=dir:cnt,dir:dir + +- **ls -R**: given a directory node, recursively list all linked nodes of type + directory and content + + Endpoint:: + + /graph/visit/paths/:DIR_ID?edges=dir:cnt,dir:dir + +- **git log**: given a revision node, recursively list all linked nodes of type + revision + + Endpoint:: + + /graph/visit/nodes/:REV_ID?edges=rev:rev + + +Vault +~~~~~ + +The following use cases require traversing the *forward graph*. + +- **tarball** (same as *ls -R* above) + +- **git bundle**: given a node, recursively list all linked nodes of any kind + + Endpoint:: + + /graph/visit/nodes/:NODE_ID?edges=* + + +Provenance +~~~~~~~~~~ + +The following use cases require traversing the *backward (transposed) +graph*. 
+ +- **commit provenance**: given a content or directory node, return *a* commit + whose directory (recursively) contains it + + Endpoint:: + + /graph/walk/:NODE_ID/rev?direction=backward&edges=dir:dir,cnt:dir,dir:rev + +- **complete commit provenance**: given a content or directory node, return + *all* commits whose directory (recursively) contains it + + Endpoint:: + + /graph/leaves/:NODE_ID?direction=backward&edges=dir:dir,cnt:dir,dir:rev + +- **origin provenance**: given a content, directory, or commit node, return + *an* origin that has at least one snapshot that (recursively) contains it + + Endpoint:: + + /graph/walk/:NODE_ID/ori?direction=backward&edges=* + +- **complete origin provenance**: given a content, directory, or commit node, + return *all* origins that have at least one snapshot that (recursively) + contains it + + Endpoint:: + + /graph/leaves/:NODE_ID?direction=backward&edges=* + + +Provenance statistics +~~~~~~~~~~~~~~~~~~~~~ + +The following use cases require traversing the *backward (transposed) +graph*. + +- **content popularity across commits**: count the number of commits (or + *commit popularity*) that link to a directory that (recursively) includes a + given content. + + Endpoint:: + + /graph/count/leaves/:NODE_ID?direction=backward&edges=cnt:dir,dir:dir,dir:rev + +- **commit popularity across origins**: count the number of origins (or *origin + popularity*) that have a snapshot that (recursively) includes a given commit. + + Endpoint:: + + /graph/count/leaves/:NODE_ID?direction=backward&edges=* + +The following use cases require traversing the *forward graph*. + +- **revision size** (as n. of contents) distribution: the number of contents + that are (recursively) reachable from a given revision. + + Endpoint:: + + /graph/count/leaves/:NODE_ID?edges=* + +- **origin size** (as n. of revisions) distribution: count the number of + revisions that are (recursively) reachable from a given origin. + + Endpoint:: + + /graph/count/leaves/:NODE_ID?edges=ori:snp,snp:rel,snp:rev,rel:rev,rev:rev diff --git a/docs/compression.rst b/docs/compression.rst --- a/docs/compression.rst +++ b/docs/compression.rst @@ -1,15 +1,135 @@ .. _graph-compression: +================= Graph compression ================= -The compression process is a pipeline implemented for the most part on top of -the `WebGraph framework `_ and ecosystem -libraries. The compression pipeline consists of the following steps: +The compression pipeline is implemented on top of the `WebGraph framework +`_. It takes an ORC Graph Dataset as an input, +such as the ones found in the :ref:`Graph Dataset List `, +and generates a compressed graph suitable for high intensity analyses on +large servers. + + +Running the compression pipeline +================================ + +Dependencies +------------ + +To compress a graph, you will need to install the ``swh.graph`` tool as well as +a recent JRE, as described in the :ref:`swh-graph-quickstart` page. + +You will also need the zstd_ compression tool:: + + $ sudo apt install zstd + +.. _zstd: https://facebook.github.io/zstd/ + + +Hardware Requirements +--------------------- + +The compression pipeline is even more demanding than the graph server in terms +of hardware requirements, especially RAM. Notably, the BFS compression step +loads a graph compressed in random order in memory, which is usually more than +a TiB for the full graph. While it is possible to do this step with a memory +mapping, our experiments show that this could take more than a year on hard +drives. 
+ +The LLP compression step requires 13 bytes of RAM per node, which could amount +to storing hundreds of gigabytes in RAM in addition to loading the graph +itself. + +Some steps also involve sorting the entire set of edges and their labels, by +using large on-disk buffer files, sometimes reaching the size of the input +dataself itself. + +The machine we used to compress the entire graph (dataset version 2022-04-25) +has the following hardware specs: + +- 2 TiB of RAM (DDR4 ECC 2400Mhz) +- 64 vCPUs (Dual AMD EPYC 7302 16-Core) +- 24 TiB of SSD (NVMe) + +OVH rents high grade dedicated servers that do fit these hardware requirements +on a per-month basis for a relatively reasonable price and a 1 month +commitment. They are available here: +`HGR-HCI-4 `_ + + +Input dataset +------------- + +First, you need to retrieve a graph to compress, in ORC format. The :ref:`Graph +Dataset List ` has a list of datasets made available by the +Software Heritage archive, including "teaser" subdatasets which have a more +manageable size and are thus very useful for prototyping with less hardware +resources. + +The datasets can be retrieved from S3 or the annex, in a similar fashion to +what is described in :ref:`swh-graph-retrieving-compressed`, by simply +replacing "compressed" by "orc": + +.. code:: console + + (venv) $ mkdir -p 2021-03-23-popular-3k-python/orc + (venv) $ cd 2021-03-23-popular-3k-python/ + (venv) $ aws s3 cp --recursive s3://softwareheritage/graph/2021-03-23-popular-3k-python/orc/ orc + +Alternatively, any custom ORC dataset can be used as long as it respects +:ref:`the schema ` of the Software Heritage Graph Dataset. + +**Note:** for testing purposes, a fake test dataset is available in the +``swh-graph`` repository, with just a few dozen nodes. The ORC tables are +available in ``swh-graph/swh/graph/tests/dataset/orc/``. + + +Compression +----------- + +You can compress your dataset by using the ``swh graph compress`` command. It +will run all the various steps of the pipeline in the right order. + +.. code:: console + + + (venv) $ swh graph compress --input-dataset orc/ --outdir compressed/ + [...] + (venv) $ ls compressed/ + graph.edges.count.txt + graph.edges.stats.txt + graph.graph + graph.indegree + graph-labelled.labeloffsets + graph-labelled.labels + [...] + graph-transposed.obl + graph-transposed.offsets + graph-transposed.properties + + +For sufficiently large graphs, this command can take entire weeks. It is highly +recommended to run it in a systemd service or in a tmux session. + +It is also possible to run single steps or step ranges from the CLI: + +.. code:: bash + + swh graph compress -i orc/ -o compressed/ --steps mph-bfs + +See ``swh graph compress --help`` for syntax and usage details. + + +Compression steps +================= + +The compression pipeline consists of the following steps: .. figure:: images/compression_steps.png :align: center :alt: Compression steps + :target: _images/compression_steps.png Compression steps @@ -30,84 +150,234 @@ `bibtex `_. -In order to practically perform graph compression, install the ``swh.graph`` -module and use the ``swh graph compress`` command line interface of the -compression driver, that will conduct the various steps in the right order. -See ``swh graph compress --help`` for usage details. +As well as chapters 9 and 10 of Antoine Pietri's PhD thesis: + +.. note:: -1. MPH + Antoine Pietri + `Organizing the graph of public software development for large-scale mining + `_. + Doctoral dissertation. Inria, 2021. + + +1. 
EXTRACT_NODES +---------------- + +This step reads a graph dataset and extract all the unique node SWHIDs it +contains, including the ones that are not stored as actual objects in the +graph, but only *referred to* by the edges. Additionally, it extracts the set +of all unique edge labels in the graph. + +**Rationale:** Because the graph can contain holes, loose objects and dangling +objects, some nodes that are referred to as destinations in the edge +relationships might not actually be stored in the graph itself. However, to +compress the graph using a graph compressio library, it is necessary to have a +list of *all* the nodes in the graph, including the ones that are simply +referred to by the edges but not actually stored as concrete objects. + +This step reads the entire graph dataset, and uses ``sort -u`` to extract the +set of all the unique nodes and unique labels that will be needed as an input +for the compression process. It also write object count statistics in various +files: + +- The set of nodes is written in ``graph.nodes.csv.zst``, as a zst-compressed + sorted list of SWHIDs, one per line. +- The set of edge labels is written in ``graph.labels.csv.zst``, as a + zst-compressed sorted list of labels encoded in base64, one per line. +- The number of unique nodes referred to in the graph is written in a text + file, ``graph.nodes.count.txt`` +- The number of unique edges referred to in the graph is written in a text + file, ``graph.edges.count.txt`` +- The number of unique edge labels is written in a text file, + ``graph.labels.count.txt`` +- Statistics on the number of nodes of each type are written in a text file, + ``graph.nodes.stats.txt`` +- Statistics on the number of edges of each type are written in a text file, + ``graph.edges.stats.txt`` + + +2. MPH ------ -A node in the Software Heritage :ref:`data model ` is identified -using its SWHID (see :ref:`persistent identifiers -`). However, WebGraph internally uses integers to refer -to node ids. +As discussed in :ref:`swh-graph-java-basics`, a node in the Software Heritage +:ref:`data model ` is identified by its SWHID (see :ref:`persistent +identifiers `), but WebGraph internally uses integers +to refer to node ids. -Mapping between the strings and longs ids is needed before compressing the -graph. From the `Sux4J `_ utility tool, we use the +To create a mapping between integer node IDs and SWHIDs, we use the `GOVMinimalPerfectHashFunction `_ -class, mapping with no collisions N keys to N consecutive integers. +class of the `Sux4J `_ library, which maps N keys to N +consecutive integers. -The step produces a ``.mph`` file (MPH stands for *Minimal Perfect -Hash-function*) storing the hash function taking as input a string and returning -a unique integer. +We run this function on the list of SWHIDs stored in the +``graph.nodes.csv.zst`` file generated in the previous step. +This allows us to generate a bijection from the set of all the *n* SWHIDs in the +graph to the set of integers :math:`[0, n - 1]`. +The step produces a ``graph.mph`` file (MPH stands for *Minimal Perfect +Hash-function*), containing a function which takes a SWHID (as a bytestring) +and returns its associated node ID. -2. BV compress + +3. BV compress -------------- -This is the first actual compression step, building a compressed version of the -input graph using WebGraph techniques presented in the framework paper. 
We use -the `ScatteredArcsASCIIGraph +This is the first actual compression step, where we build a compressed version +of the input graph dataset. + +We use a ScatteredArcsORCGraph to load the dataset +(implementation inspired of the `ScatteredArcsASCIIGraph `_ -class, from WebGraph. +class in WebGraph). +This class wraps the ORC Graph dataset and exposes a *virtual* ImmutableGraph, +whose nodes and edges can be iterated sequentially as if it was any other +standard graph. To do so, it puts all the edges in batches and sorts them in an +aggressively parallel fashion, then stores them as ``.bitstream`` files, and +returns a `BatchGraph +` +created from these batches. + +Finally, it uses the ``BVGraph.store()`` method, which compresses the input +graph as a `BVGraph +`_, +using the compression techniques described in the article *The WebGraph +Framework I: Compression Techniques* cited above. The resulting BV graph is stored as a set of files: -- ``.graph``: the compressed graph in the BV format -- ``.offsets``: offsets values to read the bit stream graph file -- ``.obl``: offsets cache to load the graph faster -- ``.properties``: entries used to correctly decode graph and offset files +- ``graph-base.graph``: the compressed graph in the BV format +- ``graph-base.offsets``: offsets values to read the bit stream graph file +- ``graph-base.properties``: entries used to correctly decode graph and offset + files -3. BFS -------- +4. BFS +------ + +In *Layered Label Propagation: A MultiResolution Coordinate-Free +Ordering for Compressing Social Networks* (Boldi, Rosa, Santini, Vigna 2010), +the paper authors empirically demonstrate that a high graph compression ratio +can be achieved for the graph of the Web by ordering nodes such that vertices +from the same host are close to each other. -In the LLP paper, authors propose an empirical analysis linking node ordering -and high compression ratio: it is important to use an ordering of nodes ids such -that vertices from the same host are close to one another. +In Software Heritage, there is no notion of "host" that can be used to generate +these compression-friendly orderings, because the identifiers are just +uniformly random cryptographic hashes. However, we can generate these orderings +by running algorithms to inform us on which nodes are close to each other. -Building on this insight, the previous compression results in the BV compress -step are improved by re-ordering nodes ids using a BFS traversal order. We use -the `BFS +In this step, we run a BFS traversal on the entire graph to get a more +compression-friendly ordering of nodes. We use the `BFS `_ class from the `LAW `_ library. -The resulting ordering is stored in the ``.order`` file, listing nodes ids in -order of traversal. +The resulting ordering is stored in a ``graph-bfs.order`` file, which contains +all the node IDs in the order of traversal. -4. Permute ----------- +5. PERMUTE_BFS +-------------- -Once the order is computed (BFS or another ordering technique), the final -compressed graph is created based on the initial BV compress result, and using -the new node order mapping. The permutation uses the `Transform +Once the BFS order is computed, we permute the initial "base" graph using the +this new ordering. The permutation uses the `Transform `_ class from WebGraph framework. -The final compressed graph is only stored in the resulting ``.graph``, -``.offsets``, ``.obl``, and ``.properties`` files. 
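To give a concrete idea of this step, here is a minimal, hypothetical sketch of
applying a node permutation with the WebGraph big API. It is *not* the actual
``swh-graph`` driver code: the exact ``mapOffline`` overload, the batch size,
and the assumption that ``perm`` already maps each old node ID to its new one
(the ``.order`` file may store the visit order instead, in which case it has to
be inverted first) are all assumptions made for illustration.

.. code-block:: java

    import it.unimi.dsi.big.webgraph.BVGraph;
    import it.unimi.dsi.big.webgraph.ImmutableGraph;
    import it.unimi.dsi.big.webgraph.Transform;

    public class PermuteSketch {
        /** Applies a node permutation (perm[old] = new) to a BV graph and stores the result. */
        public static void permute(String srcBasename, String dstBasename, long[][] perm)
                throws Exception {
            ImmutableGraph base = ImmutableGraph.loadMapped(srcBasename);
            // mapOffline() relabels the arcs and sorts them in on-disk batches
            // instead of keeping everything in RAM.
            ImmutableGraph permuted = Transform.mapOffline(base, perm, 1_000_000);
            BVGraph.store(permuted, dstBasename);
        }
    }

With the basenames used by the pipeline, this would be called as
``permute("graph-base", "graph-bfs", perm)``.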
+The BFS-compressed graph is stored in the files +``graph-bfs.{graph,offsets,properties}``. + +6. TRANSPOSE_BFS +---------------- + +We transpose the BFS-compressed graph, using the `Transform +`_ +class from WebGraph. +This step is a prerequisite for LLP compression. + +7. SIMPLIFY +----------- + +This step creates a loopless and symmetric version of the BFS-compressed graph, +using the `Transform +`_ +class from WebGraph. +This step is a prerequisite for LLP compression. + +8. LLP +------ + +Better compression ratios can be achieved by the Layered Label Propagation +(LLP) algorithm to reorder nodes. This algorithm is described in *Layered Label +Propagation: A MultiResolution Coordinate-Free Ordering for Compressing Social +Networks* (Boldi, Rosa, Santini, Vigna 2010). +The LLP algorithm finds locality-preserving orders by clustering together nodes +in close proximity. Similar to the BFS, this algorithm is particularly +interesting for our use case as it is unsupervised, and does not rely on prior +information on the clusters present in the graph. The idea behind the +clustering algorithm is to randomly distribute communities to the nodes in the +graph, then iteratively assign to each node the community most represented in +its neighbors. + +LLP is more costly than simple BFS-based compression in both time and memory. +Even though the algorithm has a linear time complexity, it does multiple +iterations on the graph and is significantly slower than the BFS which is just +one single traversal. Moreover, keeping track of the communities requires a +total of 13 bytes per node, which increases the RAM requirements. +Because of these constraints, it is unrealistic to run the LLP algorithm on the +uncompressed version of the graph; this is why we do an intermediate +compression with the BFS ordering first, then compress the entire graph *again* +with an even better ordering. + +The LLP algorithm takes a simplified (loopless, symmetric) graph as an input, +which we already computed in the previous steps. + +The algorithm is also parameterized by a list of γ values, a "resolution" parameter +which defines the shapes of the clustering it produces: either small, but +denser pieces, or larger, but unavoidably sparser pieces. The algorithm then +combines the different clusterings together to generate the output reordering. +γ values are given to the algorithm in the form :math:`\frac{j}{2^k}`; by +default, 12 different values of γ are used. However, the combination procedure +is very slow, and using that many γ values could take several months in our +case. +We thus narrowed down a smaller set of γ values that empirically give good +compression results, which are used by default in the pipeline. In general, +smaller values of γ seem to generate better compression ratios. The effect of a +given γ is that the density of the sparsest cluster is at least γ γ+1, so large +γ values imply small, more dense clusters. It is reasonable to assume that +since the graph is very sparse to start with, such clusters are not that +useful. + +The resulting ordering is stored in a ``graph-llp.order`` file. + +9. PERMUTE_LLP +-------------- +Once the LLP order is computed, we permute the BFS-compressed graph using the +this new ordering. The LLP-compressed graph, which is our final compressed +graph, is stored in the files ``graph.{graph,offsets,properties}``. -5. Stats --------- +10. 
COMPOSE_ORDERS +------------------ -Compute various statistics on the final compressed graph: +To be able to translate the initial MPH inputs to their resulting rank in the +LLP-compressed graph, we need to use the two order permutations successively: +the base → BFS permutation, then the BFS → LLP permutation. -- ``.stats``: entries such as number of nodes, edges, avg/min/max degree, +To make this less wasteful, we *compose* the two permutations into a single +one. We use the `composePermutationsInPlace +`_ +function of the dsiutils library. The resulting permutation is stored as a +``graph.order`` file. Hashing a SWHID with the ``graph.mph`` function, then +permuting the result using the ``graph.order`` permutation yields the integer +node ID matching the input SWHID in the graph. + +11. STATS +--------- + +This step computes various statistics on the compressed graph: + +- ``.stats``: statistics such as number of nodes, edges, avg/min/max degree, average locality, etc. - ``.indegree``: graph indegree distribution - ``.outdegree``: graph outdegree distribution @@ -117,9 +387,209 @@ class from WebGraph. -6. Transpose ------------- +12. TRANSPOSE +------------- -Create a transposed graph to allow backward traversal, using the `Transform +Transpose the graph to allow backward traversal, using the `Transform `_ -class from WebGraph. +class from WebGraph. The resulting transposed graph is stored as the +``graph-transposed.{graph,offsets,properties}`` files. + + +13. OBL +------- + +Cache the BVGraph offsets of the forward graph to make loading faster. The +resulting offset big list is stored in the ``graph.obl`` file. + + +14. TRANSPOSE_OBL +----------------- + +Same as OBL, but for the transposed graph. The resulting offset big list is +stored in the ``graph-transposed.obl`` file. + + +15. NODE_MAPS +------------- + +This steps generates the *node mappings* described in +:ref:`swh-graph-java-node-mappings`. In particular, it generates: + +- ``graph.node2swhid.bin``: a compact binary representation of all the + SWHIDs in the graph, ordered by their rank in the graph file. +- ``graph.node2type.bin``: a `LongBigArrayBitVector + `_ + which stores the type of each node. + +It does so by reading all the SWHIDs in the ``graph.nodes.csv.zst`` file generated in the +EXTRACT_NODES step, then getting their corresponding node IDs (using the +``.mph`` and ``.order`` files), then sorting all the SWHIDs according to +their node ID. It then writes these SWHIDs in order, in a compact but seekable +binary format, which can be used to return the SWHID corresponding to any given +node in O(1). + + +16. EXTRACT_PERSONS +------------------- + +This step reads the ORC graph dataset and extracts all the unique persons it +contains. Here, "persons" are defined as the set of unique pairs of name + +email, potentially pseudonymized, found either as revision authors, revision +committers or release authors. + +The ExtractPersons class reads all the persons from revision and release +tables, then uses ``sort -u`` to get a sorted list without any duplicates. The +resulting sorted list of authors is stored in the ``graph.persons.csv.zst`` +file. + + +17. MPH_PERSONS +--------------- + +This step computes a Minimal Perfect Hash function on the set of all the unique +persons extracted in the EXTRACT_PERSONS step. Each individual person is mapped +to a unique integer in :math:`[0, n-1]` where *n* is the total number of +persons. The resulting function is serialized and stored in the +``graph.persons.mph`` file. + + +18. 
NODE_PROPERTIES +------------------- + +This step generates the *node property files*, as described in +:ref:`swh-graph-java-node-properties`. +The nodes in the Software Heritage Graph each have associated *properties* +(e.g., commit timestamps, authors, messages, ...). The values of these +properties for each node in the graph are compressed and stored in files +alongside the compressed graph. + +The WriteNodeProperties class reads all the properties from the ORC Graph +Dataset, then serializes each of them in a representation suitable for +efficient random access (e.g., large binary arrays) and stores them on disk. + +For persons (authors, committers etc), the MPH computed in the MPH_PERSONS step +is used to store them as a single pseudonymized integer ID, which uniquely +represents a full name + email. + +The results are stored in the following list of files: + +- ``graph.property.author_id.bin`` +- ``graph.property.author_timestamp.bin`` +- ``graph.property.author_timestamp_offset.bin`` +- ``graph.property.committer_id.bin`` +- ``graph.property.committer_timestamp.bin`` +- ``graph.property.committer_timestamp_offset.bin`` +- ``graph.property.content.is_skipped.bin`` +- ``graph.property.content.length.bin`` +- ``graph.property.message.bin`` +- ``graph.property.message.offset.bin`` +- ``graph.property.tag_name.bin`` +- ``graph.property.tag_name.offset.bin`` + + +19. MPH_LABELS +-------------- + +This step computes a **monotone** Minimal Perfect Hash function on the set of +all the unique *arc label names* extracted in the EXTRACT_NODES step. Each +individual arc label name (i.e., directory entry names and snapshot branch +names) is monotonely mapped to a unique integer in :math:`[0, n-1]`, where *n* +is the total number of unique arc label names, which corresponds to their +**lexical rank** in the set of all arc labels. + +In other words, this MPH being monotone means that the hash of the *k*-th item +in the sorted input list of arc labels will always be *k*. +We use the `LcpMonotoneMinimalPerfectHashFunction +`_ +of Sux4J to generate this function. + +The rationale for using a monotone function here is that it will allow us to +quickly get back the arc label from its hash without having to store an +additional permutation. +The resulting MPH function is serialized and stored in the ``graph.labels.mph`` +file. + + +20. FCL_LABELS +-------------- + +This step computes a *reverse-mapping* for arc labels, i.e., a way to +efficiently get the arc label name from its hash computed with the monotone MPH +of the MPH_LABELS step. + +Thanks to the MPH being monotone, this boils down to storing all the labels in +lexicographic order in a string list format that allows O(1) access to its +elements. For this purpose, we use the `MappedFrontCodedStringBigList +`_ +class from the dsiutils library, using the ``graph.labels.csv.zst`` file as its +input. It stores the label names in a compact way by using front-coding +compression, which is particularly efficient here because the strings are +already in lexicographic order. The resulting FCL files are stored as +``graph.labels.fcl.*``, and they can be loaded using memory mapping. + + +21. EDGE_LABELS +--------------- + + +This step generates the *edge property files*, as described in +:ref:`swh-graph-java-edge-properties`. These files allow us to get the *edge +labels* as iterate on the edges of the graph. The files essentially contain +compressed sorted triplets of the form (source, destination, label), with +additional offsets to allow random access. 
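At query time, these files back the *labelled iterators* of the Java API. As a
quick, illustrative preview of what this step's output enables (the full
walk-through, including how to load the labelled graph, is in
:ref:`swh-graph-java-edge-properties`), here is a sketch in the style of the
Java API examples; it assumes the graph was loaded with ``loadLabelledMapped()``
and that ``loadLabelNames()`` has already been called.

.. code-block:: java

    // Sketch: list the labelled successors of a directory node, reading the
    // graph-labelled.* files produced by this step.
    public static void printDirectoryEntries(SwhUnidirectionalGraph g, long dirNodeId) {
        LabelledArcIterator it = g.labelledSuccessors(dirNodeId);
        for (long dst; (dst = it.nextLong()) >= 0;) {
            for (DirEntry entry : (DirEntry[]) it.label().get()) {
                // Each label carries a filename ID and, for directory entries, a permission.
                System.out.format("%s %s%n",
                        g.getSWHID(dst),
                        new String(g.getLabelName(entry.filenameId)));
            }
        }
    }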
+ +To generate these files, the LabelMapBuilder class starts by reading in +parallel the labelled edges in the ORC dataset, which can be thought of as +quadruplets containing the source SWHID, the destination SWHID, the label name +and the entry permission if applicable: + +.. code-block:: text + + swh:1:snp:4548a5… swh:1:rev:0d6834… cmVmcy9oZWFkcy9tYXN0ZXI= + swh:1:dir:05faa1… swh:1:cnt:a35136… dGVzdC5j 33188 + swh:1:dir:05faa1… swh:1:dir:d0ff82… dGVzdA== 16384 + ... + +Using the ``graph.mph`` and the ``graph.order`` files, we hash and permute the +source and destination nodes. We also monotonically hash the labels using the +``graph.labels.mph`` function to obtain the arc label identifiers. The +permissions are normalized as one of the 6 possible values in the +``DirEntry.Permission.Type`` enum, and are then stored in the 3 lowest bits of +the label field. + +.. code-block:: text + + 4421 14773 154 + 1877 21441 1134 + 1877 14143 1141 + ... + +These hashed edges and their compact-form labels are then put in large batches +sorted in an aggressively parallel fashion, which are then stored as +``.bitstream`` files. These batch files are put in a heap structure to perform +a merge sort on the fly on all the batches. + +Then, the LabelMapBuilder loads the graph and starts iterating on its edges. It +synchronizes the stream of edges read from the graph with the stream of sorted +edges and labels read from the bitstreams in the heap. At this point, it writes +the labels to the following output files: + +- ``graph-labelled.properties``: a property file describing the graph, notably + containing the basename of the wrapped graph. +- ``graph-labelled.labels``: the compressed labels +- ``graph-labelled.labeloffsets``: the offsets used to access the labels in + random order. + +It then does the same with backward edge batches to get the transposed +equivalent of these files: +``graph-transposed-labelled.{properties,labels,labeloffsets}``. + + + +22. CLEAN_TMP +------------- + +This step reclaims space by deleting the temporary directory, as well as all +the intermediate outputs that are no longer necessary now that the final graph +has been compressed (shown in gray in the step diagram). 
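
After this final step, the output directory contains only the files needed at
query time. A quick way to sanity-check the ``graph.mph`` and ``graph.order``
outputs is to resolve a SWHID to its node ID by hand, exactly as described in
the COMPOSE_ORDERS step. The sketch below is hypothetical: in practice the
``getNodeID()`` method of the :ref:`Java API <swh-graph-java-api>` wraps this
logic, and the sketch assumes the MPH was serialized as a
``GOVMinimalPerfectHashFunction`` over SWHID bytestrings and the order file as
a big array of longs.

.. code-block:: java

    import it.unimi.dsi.fastutil.BigArrays;
    import it.unimi.dsi.fastutil.io.BinIO;
    import it.unimi.dsi.sux4j.mph.GOVMinimalPerfectHashFunction;

    public class NodeIdLookup {
        // Usage: $PROGRAM <graph basename> <SWHID>
        public static void main(String[] args) throws Exception {
            @SuppressWarnings("unchecked")
            GOVMinimalPerfectHashFunction<byte[]> mph =
                    (GOVMinimalPerfectHashFunction<byte[]>) BinIO.loadObject(args[0] + ".mph");
            long[][] order = BinIO.loadLongsBig(args[0] + ".order");

            // Hash the SWHID bytestring, then permute the hash to get the node ID.
            long hash = mph.getLong(args[1].getBytes());
            System.out.println(BigArrays.get(order, hash));
        }
    }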
diff --git a/docs/images/Makefile b/docs/images/Makefile --- a/docs/images/Makefile +++ b/docs/images/Makefile @@ -1,7 +1,7 @@ all: compression_steps.png compression_steps.svg %.png: %.dot - dot -Gdpi=300 -Tpng $< -o $@ + dot -Gdpi=150 -Tpng $< -o $@ %.svg: %.dot dot -Tsvg $< -o $@ diff --git a/docs/images/compression_steps.dot b/docs/images/compression_steps.dot --- a/docs/images/compression_steps.dot +++ b/docs/images/compression_steps.dot @@ -1,3 +1,108 @@ +digraph "Compression steps" { + // Horizontal graph + // rankdir=LR; + node [shape = none]; + + + orc_dataset [label="ORC Graph\nDataset"]; + nodes_csv [label="graph.nodes.csv.zst"]; + labels_csv [label="graph.labels.csv.zst"]; + graph_mph [label="graph.mph"]; + + subgraph { + node [fontcolor=darkgray]; + graph_base [label="graph-base.graph"] + graph_bfs_order [label="graph-bfs.order"] + graph_bfs [label="graph-bfs.graph"] + graph_bfs_transposed [label="graph-bfs-transposed.graph"] + graph_bfs_simplified [label="graph-bfs-simplified.graph"] + graph_llp_order [label="graph-llp.order"] + } + + graph_llp [label="graph.graph"] + graph_llp_transposed [label="graph-transposed.graph"] + graph_order [label="graph.order"] + stats [label="graph.stats"] + swhidmap [label="graph.node2swhid.bin"] + typemap [label="graph.node2type.bin"] + persons_csv [label="graph.persons.csv.zst"]; + persons_mph [label="graph.persons.mph"]; + node_properties [label="graph.property.*"]; + labels_mph [label="graph.labels.mph"]; + labels_fcl [label="graph.labels.fcl"]; + graph_labelled [label="graph-labelled.*"]; + graph_transposed_labelled [label="graph-transposed-labelled.*"]; + + subgraph { + node [shape=box]; + EXTRACT_NODES; + MPH; + BV; + BFS; + PERMUTE_BFS; + TRANSPOSE_BFS; + SIMPLIFY; + LLP; + PERMUTE_LLP; + COMPOSE_ORDERS; + TRANSPOSE; + STATS; + NODE_MAP; + EXTRACT_PERSONS; + MPH_PERSONS; + NODE_PROPERTIES; + MPH_LABELS; + FCL_LABELS; + EDGE_PROPERTIES; + } + + + orc_dataset -> EXTRACT_NODES; + EXTRACT_NODES -> nodes_csv; + EXTRACT_NODES -> labels_csv; + nodes_csv -> MPH -> graph_mph; + graph_mph -> BV; + orc_dataset -> BV -> graph_base; + graph_base -> BFS -> graph_bfs_order; + graph_bfs_order -> PERMUTE_BFS; + graph_base -> PERMUTE_BFS -> graph_bfs; + graph_bfs -> TRANSPOSE_BFS -> graph_bfs_transposed; + graph_bfs_transposed -> SIMPLIFY; + graph_bfs -> SIMPLIFY -> graph_bfs_simplified; + graph_bfs_simplified -> LLP -> graph_llp_order; + graph_llp_order -> PERMUTE_LLP; + graph_bfs -> PERMUTE_LLP -> graph_llp; + graph_bfs_order -> COMPOSE_ORDERS; + graph_llp_order -> COMPOSE_ORDERS -> graph_order; + graph_llp -> TRANSPOSE -> graph_llp_transposed; + graph_llp -> STATS -> stats; + graph_llp -> NODE_MAP; + nodes_csv -> NODE_MAP; + graph_mph -> NODE_MAP; + graph_order -> NODE_MAP; + NODE_MAP -> swhidmap; + NODE_MAP -> typemap; + orc_dataset -> EXTRACT_PERSONS -> persons_csv; + persons_csv -> MPH_PERSONS -> persons_mph; + orc_dataset -> NODE_PROPERTIES; + persons_mph -> NODE_PROPERTIES; + graph_mph -> NODE_PROPERTIES; + graph_order -> NODE_PROPERTIES; + NODE_PROPERTIES -> node_properties; + labels_csv -> MPH_LABELS -> labels_mph; + labels_mph -> FCL_LABELS; + labels_csv -> FCL_LABELS -> labels_fcl; + orc_dataset -> EDGE_PROPERTIES; + labels_mph -> EDGE_PROPERTIES; + graph_llp -> EDGE_PROPERTIES; + graph_mph -> EDGE_PROPERTIES; + graph_order -> EDGE_PROPERTIES; + EDGE_PROPERTIES -> graph_labelled; + EDGE_PROPERTIES -> graph_transposed_labelled; +} + + +/* digraph "Compression steps" { // Horizontal graph rankdir=LR; @@ -49,3 +154,4 @@ stats -> stats_out; 
transpose -> transpose_out; } +*/ diff --git a/docs/index.rst b/docs/index.rst --- a/docs/index.rst +++ b/docs/index.rst @@ -8,10 +8,11 @@ :caption: Overview quickstart + api + java-api + memory compression cli - api - use-cases docker git2graph /apidoc/swh.graph diff --git a/docs/java-api.rst b/docs/java-api.rst new file mode 100644 --- /dev/null +++ b/docs/java-api.rst @@ -0,0 +1,674 @@ +.. _swh-graph-java-api: + +Using the Java API +================== + +.. highlight:: java + +While the :ref:`HTTP API ` is useful for many common use-cases, +it is often not sufficient to implement more complex algorithms. This section +describes the low-level Java API that ``swh-graph`` provides on top of the +WebGraph framework to manipulate the compressed graph of Software Heritage. + +A cursory understanding of the `WebGraph framework +`_ and its API is helpful to understand the +notions detailed here. + +.. _swh-graph-java-basics: + +Basics +------ + +In the WebGraph framework, graphs are generally subclasses of +`ImmutableGraph +`_, +the abstract class providing the core API to manipulate and iterate on graphs. +Under the hood, compressed graphs are stored as the `BVGraph +`_ +class, which contains the actual codec used to compress and decompress +adjacency lists. + +Graphs **nodes** are mapped to a contiguous set of integers :math:`[0, n - 1]` +where *n* is the total number of nodes in the graph). +Each node has an associated *adjacency list*, i.e., a list of nodes going from +that source node to a destination node. This list represents the **edges** (or +**arcs**) of the graph. + +**Note**: edges are always directed. Undirected graphs are internally stored as +a pair of directed edges (src → dst, dst → src). + +On disk, a simple BVGraph with the basename ``graph`` would be represented as +the following set of files: + +- ``graph.graph``: contains the compressed adjacency lists of each node, which + can be decompressed by the BVGraph codec. +- ``graph.properties``: contains metadata on the graph, such as the number of + nodes and arcs, as well as additional loading information needed by the + BVGraph codec. +- ``graph.offsets``: a list of offsets of where the adjacency list of each node + is stored in the main graph file. +- ``graph.obl``: optionally, an "offset big-list file" which can be used to + load graphs faster. + +An ImmutableGraph can be loaded using different *load methods*, which have each +different performance implications: + +- ``load()``: the entire graph is loaded in RAM and supports random access. +- ``loadMapped()``: the graph is loaded by memory-mapping it from disk (see + ``mmap(1)``), at the cost of being potentially slower, especially when doing + random access on slow storage. +- ``loadOffline()``: no data is actually loaded is memory, only sequential + iteration is possible. + +The following code loads a graph stored on disk under the ``compressed/graph`` +basename, using the memory-mapped loading mode, and stores it as a generic +ImmutableGraph: + +.. code-block:: java + + ImmutableGraph graph = ImmutableGraph.loadMapped("compressed/graph"); + + +Running the code +---------------- + +To run a piece of Java code written using the Java API, you need to run it with +all the dependencies in your classpath (the WebGraph libraries and the +swh-graph library). The easiest way to do it is to add the *fat jar* +shipped in the swh-graph library on PyPI, which contains all the required +dependencies. + +.. 
code-block:: console + + $ java -cp venv/share/swh-graph/swh-graph-0.5.2.jar MyAlgo.java + + +Note that to load bigger graphs, the default heap size of the JVM is likely to +be insufficient to load entire graphs in memory. It is advised to increase this +heap size with the ``-Xmx`` flag: + +.. code-block:: console + + $ java -Xmx300G -cp venv/share/swh-graph/swh-graph-0.5.2.jar MyAlgo.java + +For more information on performance tuning and memory consideration, you should +also read the :ref:`swh-graph-memory` page. + + +Simple traversal +---------------- + +The ImmutableGraph class provides primitives to iterate and traverse graphs. It +contains the following methods: + +- ``graph.numNodes()`` returns the number of nodes in the graph (*n*). +- ``graph.numArcs()`` returns the number of arcs in the graph. + +And, given a node ID :math:`k \in [0, n - 1]`: + +- ``graph.successors(k)`` returns an LazyLong on the nodes that are *adjacent* + to *k* (i.e., its *neighbors*). +- ``graph.outdegree(k)`` returns the number of neighbors of *k*. + + +Example: Average outdegree +~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The following code can be used to compute the average +outdegree of a graph, which is a useful measure of its density: + +.. code-block:: java + + public static long averageOutdegree(ImmutableGraph graph) { + return ((long) graph.numArcs()) / graph.numNodes(); + } + + +Example: Degree distributions +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Using the ``outdegree()`` primitive, we can compute the outdegree distribution +of the graph by iterating on all its nodes. The distribution will be returned +as a map that associates to each degree *d* the number of nodes with outdegree +*d*. + +.. code-block:: java + + public static Map outdegreeDistribution(ImmutableGraph graph) { + HashMap distribution = new HashMap(); + for (long k = 0; k < graph.numNodes(); ++k) { + distribution.merge(graph.outdegree(k), 1L, Long::sum); + } + return distribution; + } + + +Example: Depth-First Traversal +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The ``successors`` primitive can be used to write a simple stack-based DFS +traversal on the graph which starts from a given node and prints all the +descendant nodes in its transitive closure: + +.. code-block:: java + :emphasize-lines: 10 + + public static void visitNodesDFS(ImmutableGraph graph, long srcNodeId) { + Stack stack = new Stack<>(); + HashSet visited = new HashSet(); + stack.push(srcNodeId); + visited.add(srcNodeId); + + while (!stack.isEmpty()) { + long currentNodeId = stack.pop(); + System.out.println(currentNodeId); + + LazyLongIterator it = graph.successors(currentNodeId); + for (long neighborNodeId; (neighborNodeId = it.nextLong()) != -1;) { + if (!visited.contains(neighborNodeId)) { + stack.push(neighborNodeId); + visited.add(neighborNodeId); + } + } + } + } + +Example: Breadth-First Traversal +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Swapping the stack for a queue changes the traversal order from depth-first +to breadth-first: + +.. 
code-block:: java + :emphasize-lines: 2 + + public static void visitNodesBFS(ImmutableGraph graph, long srcNodeId) { + Queue queue = new ArrayDeque<>(); + HashSet visited = new HashSet(); + queue.add(srcNodeId); + visited.add(srcNodeId); + + while (!queue.isEmpty()) { + long currentNodeId = queue.poll(); + System.out.println(currentNodeId); + + LazyLongIterator it = graph.successors(currentNodeId); + for (long neighborNodeId; (neighborNodeId = it.nextLong()) != -1;) { + if (!visited.contains(neighborNodeId)) { + queue.add(neighborNodeId); + visited.add(neighborNodeId); + } + } + } + } + + +.. _swh-graph-java-node-mappings: + +Node types and SWHIDs +--------------------- + +In the Software Heritage archive, nodes are not represented by a simple +integer, but by a :ref:`SWHID `, which contain both the +*type* of the node (revision, directory, blob...) and its unique identifier. We +use **node mappings** which allow us to translate between SWHIDs and the +compact node IDs used in the compressed graph. + +Most notably, we use a MPH (Minimal Perfect Hash) function implemented in the +`GOVMinimalPerfectHashFunction +`_ +class of the Sux4J library, which maps N keys to N consecutive integers with no +collisions. + +The following files are used to store the mappings between the nodes and their +types: + +- ``graph.mph``: contains a serialized minimal perfect hash function computed + on the list of all the SWHIDs in the graph. +- ``graph.order``: contains the permutation that associates with each output of + the MPH the node ID to which it corresponds +- ``graph.node2swhid.bin``: contains a compact binary representation of all the + SWHIDs in the graph, ordered by their rank in the graph file. +- ``graph.node2type.bin``: contains a `LongBigArrayBitVector + `_ + which stores the type of each node. + +To use these mappings easily, we provide the class **SwhUnidirectionalGraph**, +an ImmutableGraph which wraps the underlying graph and adds a few +utility methods to obtain SWH-specific information on the graph. + +A SwhUnidirectionalGraph can be loaded in a similar way to any ImmutableGraph, +as long as the mapping files listed above are present:: + + SwhUnidirectionalGraph graph = SwhUnidirectionalGraph.load(basename); + +This class exposes the same graph primitives as an ImmutableGraph, but it +additionally contains the following methods: + +- ``SWHID getSWHID(long nodeId)``: returns the SWHID associated with a given + node ID. This function does a lookup of the SWHID at offset *i* in the file + ``graph.node2swhid.bin``. + +- ``long getNodeID(SWHID swhid)``: returns the node ID associated with a given + SWHID. It works by hashing the SWHID with the function stored in + ``graph.mph``, then permuting it using the permutation stored in + ``graph.order``. It does additional domain-checking by calling ``getSWHID()`` + on its own result to check that the input SWHID was valid. + +- ``Node.Type getNodeType(long nodeID)``: returns the type of a given node, as + an enum of all the different object types in the Software Heritage data + model. It does so by looking up the value at offset *i* in the bit vector + stored in ``graph.node2type.bin``. + + +Example: Find the target directory of a revision +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +As an example, we use the methods mentioned above to perform the +following task: "given a revision, return its target directory". To do so, we +first look up the node ID of the given revision in the compressed graph. 
We +iterate on the successors of that node, and return the SWHID of the first +destination node that has the "directory" type. + + +.. code-block:: java + :emphasize-lines: 2 + + public SWHID findDirectoryOfRevision(SwhUnidirectionalGraph graph, SWHID revSwhid) { + long src = graph.getNodeId(revSwhid); + assert graph.getNodeType(src) == Node.Type.REV; + LazyLongIterator it = graph.successors(currentNodeId); + for (long dst; (dst = it.nextLong()) != -1;) { + if (graph.getNodeType(dst) == Node.Type.DIR) { + return graph.getSWHID(dst); + } + } + throw new RuntimeError("Revision has no target directory"); + } + +.. _swh-graph-java-node-properties: + +Node properties +--------------- + +The Software Heritage Graph is a *property graph*, which means it has various +properties associated with its nodes and edges (e.g., commit timestamps, +authors, messages, ...). We compress these properties and store them in files +alongside the compressed graph. This allows you to write traversal algorithms +that depend on these properties. + +By default, properties are not assumed to be present are are not loaded when +the graph itself is loaded. If you want to use a property, you need to +explicitly load it first. As an example, this is how you load the "content +length" property to get the length of a given blob:: + + SwhUnidirectionalGraph graph = SwhUnidirectionalGraph.load(basename); + graph.loadContentLength(); + long blobSize = graph.getContentLength(graph.getNodeID(swhid)); + +The documentation of the SwhGraphProperties class (**TODO: link**) lists all +the different properties, their types, and the methods used to load them and to get +their value for a specific node. + +A few things of note: + +- A single loading call can load multiple properties at once; this is because + they are stored in the same file to be more space efficient. + +- Persons (authors, committers etc) are exported as a single pseudonymized + integer ID, that uniquely represents a full name + email. + +- Timestamps are stored as a long integer (for the timestamp itself) and a + short integer (for the UTC offset). + + +.. _swh-graph-java-edge-properties: + +Edge properties +--------------- + +While looking up graph properties on the *nodes* of the graph is relatively +straightforward, doing so for properties on the *arcs* is comparatively more +difficult. These include the names and permissions of directory entries, as +well as the branch names of snapshots. + +The `ArcLabelledImmutableGraph +`_ +class in WebGraph wraps an ImmutableGraph, but augments its iterators by making them +*labelled iterators*, which essentially allow us to look up the label of the +arcs while iterating on them. + +This labelled graph is stored in the following files: + +- ``graph-labelled.properties``: a property file describing the graph, notably + containing the basename of the wrapped graph. +- ``graph-labelled.labels``: the compressed labels +- ``graph-labelled.labeloffsets``: the offsets used to access the labels in + random order. + +The SwhUnidirectionalGraph class contains *labelled* loading methods +(``loadLabelled()``, ``loadLabelledMapped()``, ...). When these loading methods +are used instead of the standard non-labelled ones, the graph is loaded as an +ArcLabelledImmutableGraph instead of an ImmutableGraph. 
The following methods +can then be used: + +- ``labelledSuccessors(k)`` returns a `LabelledArcIterator + `_ + which is used in the same way as a LazyLongIterator except it also contains a + ``label()`` method to get the label of the currently traversed arc. +- ``labelledNodeIterator()`` returns an `ArcLabelledNodeIterator + `_ + of all the nodes in the graph, which replaces the LazyLongIterator of the + ``successor()`` function by a LabelledArcIterator similar to above. + + +Label format +~~~~~~~~~~~~ + +The labels of each arc are returned as a ``DirEntry[]`` array. They encode +both the name of a directory entry and its permissions. For snapshot branches, +only the "name" field is useful. + +Arc label names are encoded as an integer ID representing each unique +entry/branch name present in the graph. To retrieve the actual name associated +with a given label ID, one needs to load the reverse mapping similar to how you +would do for a normal property:: + + SwhUnidirectionalGraph graph = SwhUnidirectionalGraph.loadLabelled(basename); + graph.loadLabelNames(); + +The byte array representing the actual label name can then be loaded with:: + + byte[] name = graph.getLabelName(label.filenameId); + + +Multiedges +~~~~~~~~~~ + +The Software Heritage is not a *simple graph*, where at most one edge can exist +between two vertices, but a *multigraph*, where multiple edges can be incident +to the same two vertices. Consider for instance the case of a single directory +``test/`` containing twice the same file blob (e.g., the empty file), under two +different names (e.g., ``a`` and ``b``). The simple graph view of this +directory will represent it as a single edge ``test`` → *empty file*, while the +multigraph view will represent it as *two* edges between the same nodes. + +Due to the copy-list model of compression, WebGraph only stores simple graphs, +and thus stores multiedges as single edges, to which we cannot associate a +single file name. +To represent this possibility of having multiple file names for a single arc, +in the case of multiple relationships between two identical nodes, each arc label is +stored as an *array* of DirEntry, each record representing one relationship +between two nodes. + + +Example: Printing all the entries of a directory +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The following code showcases how one can print all the entries (name, +permission and target SWHID) of a given directory, using the labelled methods +seen above. + +.. code-block:: java + + public static void printEntries(ImmutableGraph g, long dirNode) { + LabelledArcIterator s = g.labelledSuccessors(dirNode); + for (long dst; (dst = it.nextLong()) >= 0;) { + DirEntry[] labels = (DirEntry[]) s.label().get(); + for (DirEntry label : labels) { + System.out.format( + "%s %s %d\n", + graph.getSWHID(dst); + new String(graph.getLabelName(label.filenameId)), + label.permission + ); + } + } + } + + // Usage: $PROGRAM + public static void main(String[] args) { + SwhUnidirectionalGraph g = SwhUnidirectionalGraph.loadLabelledMapped(args[0]); + g.loadLabelNames(); + long dirNode = g.getNodeID(new SWHID(args[1])); + printEntries(g, dirNode); + } + + +Transposed graph +---------------- + +Up until now, we have only looked at how to traverse the *forward* graph, i.e., +the directed graph whose edges are in the same direction as the Merkle DAG of +the Software Heritage archive. 
+For many purposes, especially that of finding the *provenance* of software +artifacts, it is useful to query the *backward* (or *transposed*) graph +instead, which is the same as the forward graph except all the edges are +reversed. + +The transposed graph has its own set of files, counterparts to the files needed +for the forward graph: + +- ``graph-transposed.graph`` +- ``graph-transposed.properties`` +- ``graph-transposed.offsets`` +- ``graph-transposed.obl`` +- ``graph-transposed-labelled.labels`` +- ``graph-transposed-labelled.labeloffsets`` +- ``graph-transposed-labelled.properties`` + +However, because node IDs are the same in the forward and the backward graph, +all the files that pertain to mappings between the node IDs and various +properties (SWHIDs, property data, node permutations etc) remain the same. + + +Example: Earliest revision containing a given blob +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The following code loads all the committer timestamps of the revisions in the +graph, then walks the *transposed* graph to return the earliest revision +containing a given object. + +.. code-block:: java + + public static long findEarliestRevisionContaining(SwhUnidirectionalGraph g, long src) { + long oldestRev = -1; + long oldestRevTs = Long.MAX_VALUE; + + Stack stack = new Stack<>(); + HashSet visited = new HashSet(); + stack.push(src); + visited.add(src); + while (!stack.isEmpty()) { + long currentNodeId = stack.pop(); + LazyLongIterator it = graph.successors(currentNodeId); + for (long neighborNodeId; (neighborNodeId = it.nextLong()) != -1;) { + if (!visited.contains(neighborNodeId)) { + stack.push(neighborNodeId); + visited.add(neighborNodeId); + if (g.getNodeType(neighborNodeId) == Node.Type.REV) { + Long ts = g.getCommitterTimestamp(neighborNodeId); + if (ts != null && ts < oldestRevTs) { + oldestRev = neighborNodeId; + oldestRevTs = ts; + } + } + } + } + } + return oldestRev; + } + + // Usage: $PROGRAM + public static void main(String[] args) { + // Load the backward (= transposed) graph as a SwhUnidirectionalGraph + SwhUnidirectionalGraph g = SwhUnidirectionalGraph.loadMapped(args[0] + "-transposed"); + g.loadCommitterTimestamps(); + long node = g.getNodeID(new SWHID(args[1])); + long oldestRev = findEarliestRevisionContaining(g, node); + System.out.println(g.getSWHID(oldestRev)); + } + + + + +Bidirectional Graph +------------------- + + +BidirectionalImmutableGraph +~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +While ``graph-transposed`` can be loaded as a simple SwhUnidirectionalGraph and +then manipulated just like the forward graph, it is often convenient to have +*both* the forward and the backward graph in memory. Some traversal algorithms +require first going down in the forward graph to select some nodes, then going +up to find their provenance. + +To achieve that, we use the `BidirectionalImmutableGraph +`_ +class from WebGraph, which stores both a graph and its transpose. +This class provides the following methods to iterate on the **backward** graph, +shown here with their counterparts for the forward graph: + +.. 
list-table:: + :header-rows: 1 + + * - Forward graph operation + - Backward graph operation + + * - ``outdegree(k)`` + - ``indegree(k)`` + + * - ``successors(k)`` + - ``predecessors(k)`` + +In addition, the class offers a few convenience methods which are generally +useful when you have both a graph and its transpose: + +- ``transpose()`` returns the transpose of the BidirectionalImmutableGraph by + inverting the references to the forward and the backward graphs. Successors + become predecessors, and vice-versa. +- ``symmetrize()`` returns the symmetric (= undirected) version of the + bidirectional graph. It is implemented by a union between the forward and the + backward graph, which basically boils down to removing the directionality of + the edges (the successors of a node are also its predecessors). + + +SwhBidirectionalGraph +~~~~~~~~~~~~~~~~~~~~~ + +Like for ImmutableGraph, we extend the BidirectionalImmutableGraph with +SWH-specific methods, in the subclass ``SwhBidirectionalGraph``. Notably, it +contains the method ``labelledPredecessors()``, the equivalent of +``labelledSuccessors()`` but on the backward graph. + +Because SwhUnidirectionalGraph inherits from ImmutableGraph, and +SwhBidirectionalGraph inherits from BidirectionalImmutableGraph, we put the +common behavior between the two classes in a SwhGraph interface, which can +represent either an unidirectional or a bidirectional graph. + +To avoid loading the node properties two times (once for each direction), they +are stored in a separate class called SwhGraphProperties. In a +SwhBidirectionalGraph, the two SwhUnidirectionalGraph share their node +properties in memory by storing references to the same SwhGraphProperty +object. + +.. code-block:: text + + + ┌──────────────┐ + │ImmutableGraph◄────────┐ + └────▲─────────┘ │extends + │ │ + │ ┌──────────┴────────────────┐ + extends│ │BidirectionalImmutableGraph│ + │ └────────────▲──────────────┘ + │ │extends + ┌──────────────┴───────┐ ┌──────┴──────────────┐ + │SwhUnidirectionalGraph│◄────┤SwhBidirectionalGraph│ + └──┬──────────────┬────┘ └────────┬───────────┬┘ + │ │ contains x2 │ │ + │ │ │ │ + │ implements│ │implements │ + │ ┌▼──────────┐ │ │ + │ │SwhGraph(I)◄────────┘ │ + contains│ └───────────┘ │contains + │ │ + │ ┌──────────────────┐ │ + └────────────►SwhGraphProperties◄──────────────┘ + └──────────────────┘ + + +Example: Find all the shared-commit forks of a given origin +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +It is possible to define the *forks* of an origin as being the set of origins +which share at least one revision with that origin. + +The following code loads the graph in both directions using a +SwhBidirectionalGraph. Given an origin SWHID, it first walks the *forward* +graph to find all its root revisions. It then walks the *backward* graph to +find all the origins containing these root revisions, i.e., its *forks*. + +.. 
code-block:: java + + public static void findSharedCommitForks(SwhUnidirectionalGraph g, long srcOrigin) { + Stack forwardStack = new Stack<>(); + HashSet forwardVisited = new HashSet(); + Stack backwardStack = new Stack<>(); + HashSet backwardVisited = new HashSet(); + + // First traversal (forward graph): find all the root revisions of the + // origin + forwardStack.push(srcOrigin); + forwardVisited.add(srcOrigin); + while (!forwardStack.isEmpty()) { + long curr = forwardStack.pop(); + LazyLongIterator it = graph.successors(curr); + boolean isRootRevision = true; + for (long succ; (succ = it.nextLong()) != -1;) { + Node.Type nt = g.getNodeType(succ); + if (!forwardVisited.contains(succ) + && nt != Node.Type.DIR && nt != Node.Type.CNT) { + forwardStack.push(succ); + forwardVisited.add(succ); + isRootRevision = false; + } + } + if (g.getNodeType(curr) == Node.Type.REV && isRootRevision) { + // Found a root revision, add it to the second stack + backwardStack.push(curr); + backwardVisited.add(curr); + } + } + + // Second traversal (backward graph): find all the origins containing + // any of these root revisions and print them + while (!backwardStack.isEmpty()) { + long curr = backwardStack.pop(); + LazyLongIterator it = graph.predecessors(curr); + boolean isRootRevision = true; + for (long succ; (succ = it.nextLong()) != -1;) { + Node.Type nt = g.getNodeType(succ); + if (!backwardVisited.contains(succ)) { + backwardStack.push(succ); + backwardVisited.add(succ); + if (nt == Node.Type.ORI) { + // Found an origin, print it. + System.out.println(g.getSWHID(succ)); + } + } + } + } + } + + // Usage: $PROGRAM + public static void main(String[] args) { + // Load both forward and backward graphs as a SwhBidirectionalGraph + SwhBidirectionalGraph g = SwhBidirectionalGraph.loadMapped(args[0]); + long node = g.getNodeID(new SWHID(args[1])); + findSharedCommitForks(g, node); + } + + diff --git a/docs/memory.rst b/docs/memory.rst new file mode 100644 --- /dev/null +++ b/docs/memory.rst @@ -0,0 +1,129 @@ +.. _swh-graph-memory: + +Memory & Performance tuning +=========================== + +This page discusses various considerations related to memory usage and +performance tuning when using the ``swh-graph`` library to load large +compressed graphs. + +JVM options +----------- + +In production, we tend to use very large servers which have enough RAM to load +the entire graph in RAM. In these setups, the default JVM options are often +suboptimal. We recommend to start the JVM with the following options, which +tend to significantly improve performance:: + + java \ + -server \ + -XX:PretenureSizeThreshold=512M \ + -XX:MaxNewSize=4G \ + -XX:+UseLargePages \ + -XX:+UseTransparentHugePages \ + -XX:+UseNUMA \ + -XX:+UseTLAB \ + -XX:+ResizeTLAB \ + +These options are documented in the manual of ``java(1)`` the Oracle +documentation. + + +Temporary directory +------------------- + +Many of the graph algorithms (either for compression or traversal) tend to +offload some of their run-time memory to disk. For instance, the `BFS +`_ +algorithm in the LAW library uses a temporary directory to write its queue of +nodes to visit. + +Because these can be quite large and sometimes overflow the default ``/tmp`` +partition, it is advised to systematically specify a path to a local temporary +directory with enough space to accommodate the needs of the Java programs. 
+
+
+Sharing mapped data across processes
+------------------------------------
+
+Often, multiple processes can be working on the same data (mappings or the
+graph itself), for instance when running different experiments on the same
+graph. This is problematic in terms of RAM usage when using direct memory
+loading, as the same data of potentially hundreds of gigabytes is loaded in
+memory twice.
+As we have seen, memory-mapping can be used to avoid storing redundant data in
+RAM, but comes at the cost of potentially slower I/O as the data is no longer
+guaranteed to be loaded in main memory and is reliant on kernel heuristics.
+
+To efficiently share data across two different compressed graph processes,
+another option is to copy graph data to a ``tmpfs`` not backed by a disk swap,
+which forces the kernel to load it entirely in RAM. Subsequent memory-mappings
+of the files stored in the tmpfs will simply map the data stored in RAM to
+virtual memory pages, and return a pointer to the in-memory structure.
+
+To do so, we create a directory in ``/dev/shm`` in which we **copy** all the
+files that we want to direct-load in RAM, and **symlink** all the others. Then,
+we load the graph using the memory-mapped loading mode, which makes it use the
+shared memory stored in the tmpfs under the hood.
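+
+Once the files are in place under ``/dev/shm``, a process can then memory-map
+the graph from that location like any other basename. The following minimal
+sketch assumes the ``/dev/shm/swh-graph/default`` basename used by the systemd
+unit shown below, and the package name is an assumption:
+
+.. code-block:: java
+
+    import org.softwareheritage.graph.SwhBidirectionalGraph;
+
+    public class TmpfsLoad {
+        public static void main(String[] args) throws Exception {
+            // The graph files were copied or symlinked into the tmpfs
+            // beforehand; mmap-ing them resolves to pages that are already
+            // held in RAM, so several processes share a single copy.
+            SwhBidirectionalGraph g =
+                SwhBidirectionalGraph.loadMapped("/dev/shm/swh-graph/default/graph");
+            System.out.println("Loaded " + g.numNodes() + " nodes from tmpfs");
+        }
+    }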
+
+Here is a systemd service that can be used to perform this task automatically:
+
+.. code-block:: ini
+
+    [Unit]
+    Description=swh-graph memory sharing in tmpfs
+
+    [Service]
+    Type=oneshot
+    RemainAfterExit=yes
+    ExecStart=mkdir -p /dev/shm/swh-graph/default
+    ExecStart=sh -c "ln -s /.../compressed/* /dev/shm/swh-graph/default"
+    ExecStart=sh -c "cp --remove-destination /.../compressed/graph.graph /dev/shm/swh-graph/default"
+    ExecStart=sh -c "cp --remove-destination /.../compressed/graph-transposed.graph /dev/shm/swh-graph/default"
+    ExecStop=rm -rf /dev/shm/swh-graph/default
+
+    [Install]
+    WantedBy=multi-user.target
diff --git a/docs/quickstart.rst b/docs/quickstart.rst
--- a/docs/quickstart.rst
+++ b/docs/quickstart.rst
@@ -1,51 +1,38 @@
+.. _swh-graph-quickstart:
+
 Quickstart
 ==========
 
-This quick tutorial shows how to compress and browse a graph using ``swh.graph``.
-
-It does not cover the technical details behind the graph compression techniques
-(refer to :ref:`graph-compression`).
-
+This quick tutorial shows how to start the ``swh.graph`` service to query
+an existing compressed graph with the high-level HTTP API.
 
 Dependencies
 ------------
 
 In order to run the ``swh.graph`` tool, you will need Python (>= 3.7) and Java
-JRE, you do not need the JDK if you install the package from pypi, but may want
-to install it if you want to hack the code or install it from this git
-repository. To compress a graph, you will need zstd_ compression tools.
-
-It is highly recommended to install this package in a virtualenv.
-
-On a Debian stable (buster) system:
-
-.. code:: bash
-
-  $ sudo apt install python3-virtualenv default-jre zstd
+JRE. On a Debian system:
 
+.. code:: console
 
-.. _zstd: https://facebook.github.io/zstd/
+   $ sudo apt install python3 default-jre
 
-
-Install
--------
+Installing swh.graph
+--------------------
 
 Create a virtualenv and activate it:
 
-.. code:: bash
+.. code:: console
 
-   ~/tmp$ mkdir swh-graph-tests
-   ~/tmp$ cd swh-graph-tests
-   ~/t/swh-graph-tests$ virtualenv swhenv
-   ~/t/swh-graph-tests$ . swhenv/bin/activate
+   $ python3 -m venv .venv
+   $ source .venv/bin/activate
 
 Install the ``swh.graph`` python package:
 
-.. code:: bash
+.. code:: console
 
-   (swhenv) ~/t/swh-graph-tests$ pip install swh.graph
+   (venv) $ pip install swh.graph
    [...]
-   (swhenv) ~/t/swh-graph-tests swh graph --help
+   (venv) $ swh graph --help
    Usage: swh graph [OPTIONS] COMMAND [ARGS]...
 
      Software Heritage graph tools.
 
@@ -55,106 +42,77 @@
   -h, --help  Show this message and exit.
 
 Commands:
-  api-client  client for the graph RPC service
-  cachemount  Cache the mmapped files of the compressed graph in a tmpfs.
   compress    Compress a graph using WebGraph Input: a pair of files...
-  map         Manage swh-graph on-disk maps
   rpc-serve   run the graph RPC service
 
-Compression
------------
-
-Existing datasets
-^^^^^^^^^^^^^^^^^
-
-You can directly use compressed graph datasets provided by Software Heritage.
-Here is a small and realistic dataset (3.1GB):
-
-  https://annex.softwareheritage.org/public/dataset/graph/latest/popular-3k-python/python3kcompress.tar
-
-.. code:: bash
-
-    (swhenv) ~/t/swh-graph-tests$ curl -O https://annex.softwareheritage.org/public/dataset/graph/latest/popular-3k-python/python3kcompress.tar
-    (swhenv) ~/t/swh-graph-tests$ tar xvf python3kcompress.tar
-    (swhenv) ~/t/swh-graph-tests$ touch python3kcompress/*.obl # fix the mtime of cached offset files to allow faster loading
+.. 
_swh-graph-retrieving-compressed: -Note: not for the faint heart, but the full dataset is available at: +Retrieving a compressed graph +----------------------------- - https://annex.softwareheritage.org/public/dataset/graph/latest/compressed/ +Software Heritage provides a list of off-the-shelf datasets that can be used +for various research or prototyping purposes. Most of them are available in +*compressed* representation, i.e., in a format suitable to be loaded and +queried by the ``swh-graph`` library. -Own datasets -^^^^^^^^^^^^ +All the publicly available datasets are documented on this page: +https://docs.softwareheritage.org/devel/swh-dataset/graph/dataset.html -A graph is described as both its adjacency list and the set of nodes -identifiers in plain text format. Such graph example can be found in the -``swh/graph/tests/dataset/`` folder. +A good way of retrieving these datasets is to use the `AWS S3 CLI +`_. -You can compress the example graph on the command line like this: +Here is an example with the dataset ``2021-03-23-popular-3k-python``, which has +a relatively reasonable size (~15 GiB including property data, with +the compressed graph itself being less than 700 MiB): -.. code:: bash +.. code:: console + (venv) $ pip install awscli + [...] + (venv) $ mkdir -p 2021-03-23-popular-3k-python/compressed + (venv) $ cd 2021-03-23-popular-3k-python/ + (venv) $ aws s3 cp --recursive s3://softwareheritage/graph/2021-03-23-popular-3k-python/compressed/ compressed - (swhenv) ~/t/swh-graph-tests$ swh graph compress --graph swh/graph/tests/dataset/example --outdir output/ - [...] +You can also retrieve larger graphs, but note that these graphs are generally +intended to be loaded fully in RAM, and do not fit on ordinary desktop +machines. The server we use in production to run the graph service has more +than 700 GiB of RAM. These memory considerations are discussed in more details +in **TODO**. - (swhenv) ~/t/swh-graph-tests$ ls output/ - example-bv.properties example.mph example.obl example.outdegree example.swhid2node.bin example-transposed.offsets - example.graph example.node2swhid.bin example.offsets example.properties example-transposed.graph example-transposed.properties - example.indegree example.node2type.map example.order example.stats example-transposed.obl +**Note:** for testing purposes, a fake test dataset is available in the +``swh-graph`` repository, with just a few dozen nodes. Its basename is +``swh-graph/swh/graph/tests/dataset/compressed/example``. API server ---------- -To start a ``swh.graph`` API server of a compressed graph dataset, run: +To start a ``swh.graph`` API server of a compressed graph dataset, you need to +use the ``rpc-serve`` command with the basename of the graph, which is the path prefix +of all the graph files (e.g., with the basename ``compressed/graph``, it will +attempt to load the files located at +``compressed/graph.{graph,properties,offsets,...}``. -.. code:: bash +In our example: + +.. code:: console - (swhenv) ~/t/swh-graph-tests$ swh graph rpc-serve -g output/example - Loading graph output/example ... + (venv) $ swh graph rpc-serve -g compressed/graph + Loading graph compressed/graph ... Graph loaded. ======== Running on http://0.0.0.0:5009 ======== (Press CTRL+C to quit) From there you can use this endpoint to query the compressed graph, for example -with httpie_ (``sudo apt install``) from another terminal: +with httpie_ (``sudo apt install httpie``): .. _httpie: https://httpie.org .. 
code:: bash - ~/tmp$ http :5009/graph/visit/nodes/swh:1:rel:0000000000000000000000000000000000000010 - HTTP/1.1 200 OK - Content-Type: text/plain - Date: Tue, 15 Sep 2020 08:33:25 GMT - Server: Python/3.8 aiohttp/3.6.2 - Transfer-Encoding: chunked - - swh:1:rel:0000000000000000000000000000000000000010 - swh:1:rev:0000000000000000000000000000000000000009 - swh:1:rev:0000000000000000000000000000000000000003 - swh:1:dir:0000000000000000000000000000000000000002 - swh:1:cnt:0000000000000000000000000000000000000001 - swh:1:dir:0000000000000000000000000000000000000008 - swh:1:dir:0000000000000000000000000000000000000006 - swh:1:cnt:0000000000000000000000000000000000000004 - swh:1:cnt:0000000000000000000000000000000000000005 - swh:1:cnt:0000000000000000000000000000000000000007 - - -Running the existing ``python3kcompress`` dataset: - -.. code:: bash - - (swhenv) ~/t/swh-graph-tests$ swh graph rpc-serve -g python3kcompress/python3k - Loading graph python3kcompress/python3k ... - Graph loaded. - ======== Running on http://0.0.0.0:5009 ======== - (Press CTRL+C to quit) - - ~/tmp$ http :5009/graph/leaves/swh:1:dir:432d1b21c1256f7408a07c577b6974bbdbcc1323 HTTP/1.1 200 OK Content-Type: text/plain @@ -170,5 +128,5 @@ swh:1:cnt:a6b60e797063fef707bbaa4f90cfb4a2cbbddd4a swh:1:cnt:cc0a1deca559c1dd2240c08156d31cde1d8ed406 - -See the documentation of the :ref:`API ` for more details. +See the documentation of the :ref:`API ` for more details on how +to use the HTTP graph querying API. diff --git a/docs/use-cases.rst b/docs/use-cases.rst deleted file mode 100644 --- a/docs/use-cases.rst +++ /dev/null @@ -1,167 +0,0 @@ -========= -Use cases -========= - - -This document lists use cases and benchmark scenarios for the Software Heritage -graph service. - - -Conventions -=========== - -- **Node identification**: in the following, nodes are always identified by - their :ref:`SWHIDs `. - - -Use cases -========= - - -Browsing --------- - -The following use cases require traversing the *forward graph*. - -- **ls**: given a directory node, list (non recursively) all linked nodes of - type directory and content - - Implementation:: - - /graph/neighbors/:DIR_ID?edges=dir:cnt,dir:dir - -- **ls -R**: given a directory node, recursively list all linked nodes of type - directory and content - - Implementation:: - - /graph/visit/paths/:DIR_ID?edges=dir:cnt,dir:dir - -- **git log**: given a revision node, recursively list all linked nodes of type - revision - - Implementation:: - - /graph/visit/nodes/:REV_ID?edges=rev:rev - - -Vault ------ - -The following use cases require traversing the *forward graph*. - -- **tarball** (same as *ls -R* above) - -- **git bundle**: given a node, recursively list all linked nodes of any kind - - Implementation:: - - /graph/visit/nodes/:NODE_ID?edges=* - - -Provenance ----------- - -The following use cases require traversing the *backward (transposed) -graph*. 
- -- **commit provenance**: given a content or directory node, return *a* commit - whose directory (recursively) contains it - - Implementation:: - - /graph/walk/:NODE_ID/rev?direction=backward&edges=dir:dir,cnt:dir,dir:rev - -- **complete commit provenance**: given a content or directory node, return - *all* commits whose directory (recursively) contains it - - Implementation:: - - /graph/leaves/:NODE_ID?direction=backward&edges=dir:dir,cnt:dir,dir:rev - -- **origin provenance**: given a content, directory, or commit node, return - *an* origin that has at least one snapshot that (recursively) contains it - - Implementation:: - - /graph/walk/:NODE_ID/ori?direction=backward&edges=* - -- **complete origin provenance**: given a content, directory, or commit node, - return *all* origins that have at least one snapshot that (recursively) - contains it - - Implementation:: - - /graph/leaves/:NODE_ID?direction=backward&edges=* - -- *SLOC tracking*: left as future work - - -Provenance statistics -~~~~~~~~~~~~~~~~~~~~~ - -The following use cases require traversing the *backward (transposed) -graph*. - -- **content popularity across commits**: for each content, count the number of - commits (or *commit popularity*) that link to a directory that (recursively) - includes it. Plot the distribution of content popularity across commits. - - Implementation: apply *complete commit provenance* to each content node, - count the returned commits, aggregate. - -- **commit popularity across origins**: for each commit, count the number of - origins (or *origin popularity*) that have a snapshot that (recursively) - includes it. Plot the distribution of commit popularity across origins. - - Implementation: apply *complete origin provenance* to each commit node, count - the returned origins, aggregate. - -- *SLOC popularity across contents*: left as future work - -The following use cases require traversing the *forward graph*. - -- **revision size** (as n. of contents) distribution: for each revision, count - the number of contents that are (recursively) reachable from it. Plot the - distribution of revision sizes. - -- **origin size** (as n. of revisions) distribution: for each origin, count the - number of revisions that are (recursively) reachable from it. Plot the - distribution of origin sizes. - - -Benchmarks -========== - -Notes on how to benchmark graph access: - -- separate pure-graph timings from timings related to use additional mappings - (e.g., node types), no matter if the mappings are in-memory or on-disk - -- separate in-memory timings from on-disk timings; in particular, separate the - timing of translating node identifiers between internal integers and SWHIDs - -- for each use case that requires a node as input, we will randomize the choice - of the input node and repeat the experiment a suitable number of times; where - possible we will aggregate results computing basic statistics (average, - standard deviation), as well as normalize results w.r.t. the “size” of the - chosen node (e.g., number of nodes/path length in the resulting visit) - - -Basic benchmarks ----------------- - -- **Edge traversal**: given a node, retrieve the first node in its adjacency - list. - - For reference: Apostolico, Drovandi in *Graph Compression by BFS* report - times to retrieve the adjacency list of a node (and/or test if an edge exists - between two nodes) in the 2-3 us range, for the largest graph in their - experiments (22 M nodes, 600 M edges). 
-
-
-Each use case is a benchmark
-----------------------------
-
-In addition to abstract benchmark, we will use each use case above as a
-scenario-based benchmark.
diff --git a/java/src/main/java/org/softwareheritage/graph/compress/NodeMapBuilder.java b/java/src/main/java/org/softwareheritage/graph/compress/NodeMapBuilder.java
--- a/java/src/main/java/org/softwareheritage/graph/compress/NodeMapBuilder.java
+++ b/java/src/main/java/org/softwareheritage/graph/compress/NodeMapBuilder.java
@@ -27,8 +27,7 @@
  * Create maps needed at runtime by the graph service, in particular:
  * <p>
  * <ul>
- * <li>SWHID → WebGraph long node id</li>
- * <li>WebGraph long node id → SWHID (converse of the former)</li>
+ * <li>WebGraph long node id → SWHID</li>
  * <li>WebGraph long node id → SWH node type (enum)</li>
  * </ul>
  *
diff --git a/swh/graph/webgraph.py b/swh/graph/webgraph.py
--- a/swh/graph/webgraph.py
+++ b/swh/graph/webgraph.py
@@ -179,6 +179,14 @@
         "{tmp_dir}",
         "< {out_dir}/{graph_name}.nodes.csv.zst",
     ],
+    CompressionStep.EXTRACT_PERSONS: [
+        "{java}",
+        "org.softwareheritage.graph.compress.ExtractPersons",
+        "--temp-dir",
+        "{tmp_dir}",
+        "{in_dir}",
+        "{out_dir}/{graph_name}",
+    ],
     CompressionStep.MPH_PERSONS: [
         "{java}",
         "it.unimi.dsi.sux4j.mph.GOVMinimalPerfectHashFunction",
@@ -190,14 +198,6 @@
         "{out_dir}/{graph_name}.persons.mph",
         "{out_dir}/{graph_name}.persons.csv.zst",
     ],
-    CompressionStep.EXTRACT_PERSONS: [
-        "{java}",
-        "org.softwareheritage.graph.compress.ExtractPersons",
-        "--temp-dir",
-        "{tmp_dir}",
-        "{in_dir}",
-        "{out_dir}/{graph_name}",
-    ],
     CompressionStep.NODE_PROPERTIES: [
         "{java}",
         "org.softwareheritage.graph.compress.WriteNodeProperties",