diff --git a/docs/api.rst b/docs/api.rst
--- a/docs/api.rst
+++ b/docs/api.rst
@@ -1,7 +1,16 @@
.. _swh-graph-api:
-Graph RPC API
-=============
+Graph Querying HTTP API
+=======================
+
+The Graph Querying API is a high-level HTTP API intended to run common,
+relatively simple traversal queries on the compressed graph.
+
+The client/server architecture allows it to only load the graph in memory once
+then serve multiple different requests. However, it is limited in expressivity;
+more complex or resource-intensive queries should rather use the
+:ref:`Low-level Java API <swh-graph-java-api>` to run them as standalone
+programs.
Terminology
@@ -53,8 +62,11 @@
- ``"cnt,snp"`` accepted node types returned in the query results.
+Endpoints
+---------
+
Leaves
-------
+~~~~~~
.. http:get:: /graph/leaves/:src
@@ -97,7 +109,7 @@
Neighbors
----------
+~~~~~~~~~
.. http:get:: /graph/neighbors/:src
@@ -138,7 +150,7 @@
Walk
-----
+~~~~
..
.. http:get:: /graph/walk/:src/:dst
@@ -246,7 +258,7 @@
Visit
------
+~~~~~
.. http:get:: /graph/visit/nodes/:src
.. http:get:: /graph/visit/edges/:src
@@ -340,7 +352,7 @@
Counting results
-----------------
+~~~~~~~~~~~~~~~~
The following method variants, with trailing `/count` added, behave like their
already discussed counterparts but, instead of returning results, return the
@@ -363,7 +375,7 @@
Stats
------
+~~~~~
.. http:get:: /graph/stats
@@ -405,3 +417,125 @@
"avg": 0.6107127825377487
}
}
+
+
+Use-case examples
+-----------------
+
+This section showcases how to leverage the endpoints of the HTTP API described
+above for some common use-cases.
+
+
+Browsing
+~~~~~~~~
+
+The following use cases require traversing the *forward graph*.
+
+- **ls**: given a directory node, list (non recursively) all linked nodes of
+ type directory and content
+
+ Endpoint::
+
+ /graph/neighbors/:DIR_ID?edges=dir:cnt,dir:dir
+
+- **ls -R**: given a directory node, recursively list all linked nodes of type
+ directory and content
+
+ Endpoint::
+
+ /graph/visit/paths/:DIR_ID?edges=dir:cnt,dir:dir
+
+- **git log**: given a revision node, recursively list all linked nodes of type
+ revision
+
+ Endpoint::
+
+ /graph/visit/nodes/:REV_ID?edges=rev:rev
+
+
+Vault
+~~~~~
+
+The following use cases require traversing the *forward graph*.
+
+- **tarball** (same as *ls -R* above)
+
+- **git bundle**: given a node, recursively list all linked nodes of any kind
+
+ Endpoint::
+
+ /graph/visit/nodes/:NODE_ID?edges=*
+
+
+Provenance
+~~~~~~~~~~
+
+The following use cases require traversing the *backward (transposed)
+graph*.
+
+- **commit provenance**: given a content or directory node, return *a* commit
+ whose directory (recursively) contains it
+
+ Endpoint::
+
+ /graph/walk/:NODE_ID/rev?direction=backward&edges=dir:dir,cnt:dir,dir:rev
+
+- **complete commit provenance**: given a content or directory node, return
+ *all* commits whose directory (recursively) contains it
+
+ Endpoint::
+
+ /graph/leaves/:NODE_ID?direction=backward&edges=dir:dir,cnt:dir,dir:rev
+
+- **origin provenance**: given a content, directory, or commit node, return
+ *an* origin that has at least one snapshot that (recursively) contains it
+
+ Endpoint::
+
+ /graph/walk/:NODE_ID/ori?direction=backward&edges=*
+
+- **complete origin provenance**: given a content, directory, or commit node,
+ return *all* origins that have at least one snapshot that (recursively)
+ contains it
+
+ Endpoint::
+
+ /graph/leaves/:NODE_ID?direction=backward&edges=*
+
+
+Provenance statistics
+~~~~~~~~~~~~~~~~~~~~~
+
+The following use cases require traversing the *backward (transposed)
+graph*.
+
+- **content popularity across commits**: count the number of commits (or
+ *commit popularity*) that link to a directory that (recursively) includes a
+ given content.
+
+ Endpoint::
+
+ /graph/count/leaves/:NODE_ID?direction=backward&edges=cnt:dir,dir:dir,dir:rev
+
+- **commit popularity across origins**: count the number of origins (or *origin
+ popularity*) that have a snapshot that (recursively) includes a given commit.
+
+ Endpoint::
+
+ /graph/count/leaves/:NODE_ID?direction=backward&edges=*
+
+The following use cases require traversing the *forward graph*.
+
+- **revision size** (as n. of contents) distribution: the number of contents
+ that are (recursively) reachable from a given revision.
+
+ Endpoint::
+
+ /graph/count/leaves/:NODE_ID?edges=*
+
+- **origin size** (as n. of revisions) distribution: count the number of
+ revisions that are (recursively) reachable from a given origin.
+
+ Endpoint::
+
+ /graph/count/leaves/:NODE_ID?edges=ori:snp,snp:rel,snp:rev,rel:rev,rev:rev
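+
+
+Example: querying an endpoint
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+All the endpoints above are plain HTTP GET requests and can be queried with any
+HTTP client. As a minimal sketch (assuming a graph server listening on
+``http://localhost:5009/``, to be adjusted to your deployment, and a
+placeholder directory SWHID), the *ls* query could be issued from Java as
+follows:
+
+.. code-block:: java
+
+    import java.net.URI;
+    import java.net.http.HttpClient;
+    import java.net.http.HttpRequest;
+    import java.net.http.HttpResponse;
+
+    public class GraphApiSketch {
+        public static void main(String[] args) throws Exception {
+            String swhid = "swh:1:dir:0000000000000000000000000000000000000000";
+            HttpRequest request = HttpRequest.newBuilder()
+                    .uri(URI.create("http://localhost:5009/graph/neighbors/" + swhid
+                            + "?edges=dir:cnt,dir:dir"))
+                    .GET()
+                    .build();
+            // Print the raw response body returned by the server.
+            HttpResponse<String> response = HttpClient.newHttpClient()
+                    .send(request, HttpResponse.BodyHandlers.ofString());
+            System.out.print(response.body());
+        }
+    }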
diff --git a/docs/compression.rst b/docs/compression.rst
--- a/docs/compression.rst
+++ b/docs/compression.rst
@@ -1,113 +1,391 @@
.. _graph-compression:
+=================
Graph compression
=================
-The compression process is a pipeline implemented for the most part on top of
-the `WebGraph framework <http://webgraph.di.unimi.it/>`_ and ecosystem
-libraries. The compression pipeline consists of the following steps:
+The compression pipeline is implemented on top of the `WebGraph framework
+<http://webgraph.di.unimi.it/>`_. It takes an ORC Graph Dataset as an input,
+such as the ones found in the :ref:`Graph Dataset List <swh-dataset-list>`,
+and generates a compressed graph suitable for high intensity analyses on
+large servers.
-.. figure:: images/compression_steps.png
- :align: center
- :alt: Compression steps
- Compression steps
+Running the compression pipeline
+================================
+
+Dependencies
+------------
+
+To compress a graph, you will need to install the ``swh.graph`` tool as well as
+a recent JRE, as described in the :ref:`swh-graph-quickstart` page.
+
+You will also need the zstd_ compression tool::
+
+ $ sudo apt install zstd
+
+.. _zstd: https://facebook.github.io/zstd/
+
+
+Hardware Requirements
+---------------------
+
+The compression pipeline is even more demanding than the graph server in terms
+of hardware requirements, especially RAM. Notably, the BFS compression step
+loads a graph compressed in random order in memory, which is usually more than
+a TiB for the full graph. While it is possible to do this step with a memory
+mapping, our experiments show that this could take a very long time (several
+months) on hard drives.
+
+The LLP compression step requires 13 bytes of RAM per node, which could amount
+to storing hundreds of gigabytes in RAM in addition to loading the graph
+itself.
+
+Some steps also involve sorting the entire set of edges and their labels,
+using large on-disk buffer files that can sometimes reach the size of the
+input dataset itself.
+
+The machine we used to compress the entire graph (dataset version 2022-04-25)
+has the following hardware specs:
+
+- 2 TiB of RAM (DDR4 ECC 2400Mhz)
+- 64 vCPUs (Dual AMD EPYC 7302 16-Core)
+- 24 TiB of SSD (NVMe)
+
+The server we rented is from the
+`HGR-HCI-4 <https://www.ovhcloud.com/en/bare-metal/high-grade/hgr-hci-4/>`_
+series from OVH.
+
+
+Input dataset
+-------------
+
+First, you need to retrieve a graph to compress, in ORC format. The :ref:`Graph
+Dataset List <swh-dataset-list>` has a list of datasets made available by the
+Software Heritage archive, including "teaser" subdatasets which have a more
+manageable size and are thus very useful for prototyping with less hardware
+resources.
+
+The datasets can be retrieved from S3 or the annex, in a similar fashion to
+what is described in :ref:`swh-graph-retrieving-compressed`, by simply
+replacing "compressed" by "orc":
+
+.. code:: console
+
+ (venv) $ mkdir -p 2021-03-23-popular-3k-python/orc
+ (venv) $ cd 2021-03-23-popular-3k-python/
+ (venv) $ aws s3 cp --recursive s3://softwareheritage/graph/2021-03-23-popular-3k-python/orc/ orc
+
+Alternatively, any custom ORC dataset can be used as long as it respects
+:ref:`the schema <swh-dataset-schema>` of the Software Heritage Graph Dataset.
+
+**Note:** for testing purposes, a fake test dataset is available in the
+``swh-graph`` repository, with just a few dozen nodes. The ORC tables are
+available in ``swh-graph/swh/graph/tests/dataset/orc/``.
+
+
+Compression
+-----------
+
+You can compress your dataset by using the ``swh graph compress`` command. It
+will run all the various steps of the pipeline in the right order.
+
+.. code:: console
-Each of these steps is briefly described below. For more details see the
-following paper:
-.. note::
+ (venv) $ swh graph compress --input-dataset orc/ --outdir compressed/
+ [...]
+ (venv) $ ls compressed/
+ graph.edges.count.txt
+ graph.edges.stats.txt
+ graph.graph
+ graph.indegree
+ graph-labelled.labeloffsets
+ graph-labelled.labels
+ [...]
+ graph-transposed.obl
+ graph-transposed.offsets
+ graph-transposed.properties
- Paolo Boldi, Antoine Pietri, Sebastiano Vigna, Stefano Zacchiroli.
- `Ultra-Large-Scale Repository Analysis via Graph Compression
- <https://upsilon.cc/~zack/research/publications/saner-2020-swh-graph.pdf>`_. In
- proceedings of `SANER 2020 <https://saner2020.csd.uwo.ca/>`_: The 27th IEEE
- International Conference on Software Analysis, Evolution and
- Reengineering. IEEE 2020.
- Links: `preprint
- <https://upsilon.cc/~zack/research/publications/saner-2020-swh-graph.pdf>`_,
- `bibtex
- <https://upsilon.cc/~zack/research/publications/saner-2020-swh-graph.bib>`_.
+(The purpose of each of these files is detailed in the
+:ref:`swh-graph-java-api` page.)
-In order to practically perform graph compression, install the ``swh.graph``
-module and use the ``swh graph compress`` command line interface of the
-compression driver, that will conduct the various steps in the right order.
-See ``swh graph compress --help`` for usage details.
+For sufficiently large graphs, this command can take several weeks. It is highly
+recommended to run it in a systemd service or in a tmux session.
+It is also possible to run single steps or step ranges from the CLI:
-1. MPH
+.. code:: bash
+
+ swh graph compress -i orc/ -o compressed/ --steps mph-bfs
+
+See ``swh graph compress --help`` for syntax and usage details.
+
+
+Compression steps
+=================
+
+The compression pipeline consists of the following steps:
+
+.. figure:: images/compression_steps.png
+ :align: center
+ :alt: Compression steps
+ :target: _images/compression_steps.png
+
+ Compression steps
+
+Each of these steps is briefly described below. For more details see the
+original Software Heritage graph compression paper [SWHGraphCompression2020]_,
+as well as chapters 9 and 10 of Antoine Pietri's PhD thesis
+[PietriThesis2021]_.
+
+.. [SWHGraphCompression2020]
+ | Paolo Boldi, Antoine Pietri, Sebastiano Vigna, Stefano Zacchiroli.
+ | `Ultra-Large-Scale Repository Analysis via Graph Compression
+ <https://upsilon.cc/~zack/research/publications/saner-2020-swh-graph.pdf>`_.
+ | In proceedings of `SANER 2020 <https://saner2020.csd.uwo.ca/>`_: The 27th
+ IEEE International Conference on Software Analysis, Evolution and
+ Reengineering. IEEE 2020.
+ | Links: `preprint
+ <https://upsilon.cc/~zack/research/publications/saner-2020-swh-graph.pdf>`_,
+ `bibtex
+ <https://upsilon.cc/~zack/research/publications/saner-2020-swh-graph.bib>`_.
+
+
+
+.. [PietriThesis2021]
+ | Antoine Pietri
+ | `Organizing the graph of public software development for large-scale mining
+ <https://hal.archives-ouvertes.fr/tel-03515795/>`_.
+ | Doctoral dissertation. Inria, 2021.
+
+
+1. EXTRACT_NODES
+----------------
+
+This step reads a graph dataset and extracts all the unique node SWHIDs it
+contains, including the ones that are not stored as actual objects in the
+graph, but only *referred to* by the edges. Additionally, it extracts the set
+of all unique edge labels in the graph.
+
+**Rationale:** Because the graph can contain holes, loose objects and dangling
+objects, some nodes that are referred to as destinations in the edge
+relationships might not actually be stored in the graph itself. However, to
+compress the graph using a graph compression library, it is necessary to have a
+list of *all* the nodes in the graph, including the ones that are simply
+referred to by the edges but not actually stored as concrete objects.
+
+This step reads the entire graph dataset, and uses ``sort -u`` to extract the
+set of all the unique nodes and unique labels that will be needed as an input
+for the compression process. It also writes object count statistics in various
+files:
+
+- The set of nodes is written in ``graph.nodes.csv.zst``, as a zst-compressed
+ sorted list of SWHIDs, one per line.
+- The set of edge labels is written in ``graph.labels.csv.zst``, as a
+ zst-compressed sorted list of labels encoded in base64, one per line.
+- The number of unique nodes referred to in the graph is written in a text
+ file, ``graph.nodes.count.txt``
+- The number of unique edges referred to in the graph is written in a text
+ file, ``graph.edges.count.txt``
+- The number of unique edge labels is written in a text file,
+ ``graph.labels.count.txt``
+- Statistics on the number of nodes of each type are written in a text file,
+ ``graph.nodes.stats.txt``
+- Statistics on the number of edges of each type are written in a text file,
+ ``graph.edges.stats.txt``
+
+
+2. MPH
------
-A node in the Software Heritage :ref:`data model <data-model>` is identified
-using its SWHID (see :ref:`persistent identifiers
-<persistent-identifiers>`). However, WebGraph internally uses integers to refer
-to node ids.
+As discussed in :ref:`swh-graph-java-basics`, a node in the Software Heritage
+:ref:`data model <data-model>` is identified by its SWHID (see :ref:`persistent
+identifiers <persistent-identifiers>`), but WebGraph internally uses integers
+to refer to node ids.
-Mapping between the strings and longs ids is needed before compressing the
-graph. From the `Sux4J <http://sux.di.unimi.it/>`_ utility tool, we use the
+To create a mapping between integer node IDs and SWHIDs, we use the
`GOVMinimalPerfectHashFunction
<http://sux.di.unimi.it/docs/it/unimi/dsi/sux4j/mph/GOVMinimalPerfectHashFunction.html>`_
-class, mapping with no collisions N keys to N consecutive integers.
+class of the `Sux4J <http://sux.di.unimi.it/>`_ library, which maps N keys to N
+consecutive integers.
-The step produces a ``.mph`` file (MPH stands for *Minimal Perfect
-Hash-function*) storing the hash function taking as input a string and returning
-a unique integer.
+We run this function on the list of SWHIDs stored in the
+``graph.nodes.csv.zst`` file generated in the previous step.
+This allows us to generate a bijection from the set of all the *n* SWHIDs in the
+graph to the set of integers :math:`[0, n - 1]`.
+The step produces a ``graph.mph`` file (MPH stands for *Minimal Perfect
+Hash-function*), containing a function which takes a SWHID (as a bytestring)
+and returns its associated node ID.
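+
+As a rough illustration, the resulting function could be used from Java along
+these lines (a minimal sketch: it assumes the ``graph.mph`` file was written
+with standard Java serialization and that keys are SWHID bytestrings; the
+exact key encoding used by the pipeline may differ):
+
+.. code-block:: java
+
+    import it.unimi.dsi.fastutil.io.BinIO;
+    import it.unimi.dsi.sux4j.mph.GOVMinimalPerfectHashFunction;
+
+    import java.nio.charset.StandardCharsets;
+
+    public class MphSketch {
+        public static void main(String[] args) throws Exception {
+            // Load the serialized minimal perfect hash function (assumed to be
+            // BinIO-serialized).
+            @SuppressWarnings("unchecked")
+            GOVMinimalPerfectHashFunction<byte[]> mph =
+                    (GOVMinimalPerfectHashFunction<byte[]>) BinIO.loadObject("graph.mph");
+
+            // Hash a (placeholder) SWHID bytestring to its integer in [0, n - 1].
+            byte[] key = "swh:1:rev:0000000000000000000000000000000000000000"
+                    .getBytes(StandardCharsets.US_ASCII);
+            System.out.println(mph.getLong(key));
+        }
+    }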
-2. BV compress
+
+3. BV compress
--------------
-This is the first actual compression step, building a compressed version of the
-input graph using WebGraph techniques presented in the framework paper. We use
-the `ScatteredArcsASCIIGraph
+This is the first actual compression step, where we build a compressed version
+of the input graph dataset.
+
+We use a ScatteredArcsORCGraph to load the dataset
+(an implementation inspired by the `ScatteredArcsASCIIGraph
<http://webgraph.di.unimi.it/docs-big/it/unimi/dsi/big/webgraph/ScatteredArcsASCIIGraph.html>`_
-class, from WebGraph.
+class in WebGraph).
+This class wraps the ORC Graph dataset and exposes a *virtual* ImmutableGraph,
+whose nodes and edges can be iterated sequentially as if it was any other
+standard graph. To do so, it puts all the edges in batches and sorts them in an
+aggressively parallel fashion, then stores them as ``.bitstream`` files, and
+returns a `BatchGraph
+<https://webgraph.di.unimi.it/docs-big/it/unimi/dsi/big/webgraph/Transform.BatchGraph.html>`_
+created from these batches.
+
+Finally, it uses the ``BVGraph.store()`` method, which compresses the input
+graph as a `BVGraph
+<https://webgraph.di.unimi.it/docs-big/it/unimi/dsi/big/webgraph/BVGraph.html>`_,
+using the compression techniques described in the article *The WebGraph
+Framework I: Compression Techniques*.
The resulting BV graph is stored as a set of files:
-- ``.graph``: the compressed graph in the BV format
-- ``.offsets``: offsets values to read the bit stream graph file
-- ``.obl``: offsets cache to load the graph faster
-- ``.properties``: entries used to correctly decode graph and offset files
+- ``graph-base.graph``: the compressed graph in the BV format
+- ``graph-base.offsets``: offsets values to read the bit stream graph file
+- ``graph-base.properties``: entries used to correctly decode graph and offset
+ files
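+
+For reference, ``BVGraph.store()`` itself can be applied to any ImmutableGraph
+(a minimal sketch with placeholder basenames; the actual pipeline wraps the ORC
+dataset in a ScatteredArcsORCGraph before this call, as described above):
+
+.. code-block:: java
+
+    import it.unimi.dsi.big.webgraph.BVGraph;
+    import it.unimi.dsi.big.webgraph.ImmutableGraph;
+
+    public class BvStoreSketch {
+        public static void main(String[] args) throws Exception {
+            // Sequentially iterate over an existing graph and re-compress it in
+            // the BV format; "input-graph" and "graph-base" are placeholders.
+            ImmutableGraph input = ImmutableGraph.loadOffline("input-graph");
+            BVGraph.store(input, "graph-base");
+        }
+    }
+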
-3. BFS
--------
+4. BFS
+------
-In the LLP paper, authors propose an empirical analysis linking node ordering
-and high compression ratio: it is important to use an ordering of nodes ids such
-that vertices from the same host are close to one another.
+In [LLP]_, the authors empirically demonstrate that a high graph
+compression ratio can be achieved for the graph of the Web by ordering nodes
+such that vertices from the same host are close to each other.
-Building on this insight, the previous compression results in the BV compress
-step are improved by re-ordering nodes ids using a BFS traversal order. We use
-the `BFS
+In Software Heritage, there is no notion of "host" that can be used to generate
+these compression-friendly orderings, because the identifiers are just
+uniformly random cryptographic hashes. However, we can generate these orderings
+by running algorithms that tell us which nodes are close to each other.
+
+In this step, we run a BFS traversal on the entire graph to get a more
+compression-friendly ordering of nodes. We use the `BFS
<http://law.di.unimi.it/software/law-docs/it/unimi/dsi/law/big/graph/BFS.html>`_
class from the `LAW <http://law.di.unimi.it/>`_ library.
-The resulting ordering is stored in the ``.order`` file, listing nodes ids in
-order of traversal.
+The resulting ordering is stored in a ``graph-bfs.order`` file, which contains
+all the node IDs in the order of traversal.
-4. Permute
-----------
+5. PERMUTE_BFS
+--------------
-Once the order is computed (BFS or another ordering technique), the final
-compressed graph is created based on the initial BV compress result, and using
-the new node order mapping. The permutation uses the `Transform
+Once the BFS order is computed, we permute the initial "base" graph using
+this new ordering. The permutation uses the `Transform
<http://webgraph.di.unimi.it/docs-big/it/unimi/dsi/big/webgraph/Transform.html>`_
class from WebGraph framework.
-The final compressed graph is only stored in the resulting ``.graph``,
-``.offsets``, ``.obl``, and ``.properties`` files.
+The BFS-compressed graph is stored in the files
+``graph-bfs.{graph,offsets,properties}``.
+6. TRANSPOSE_BFS
+----------------
-5. Stats
---------
+We transpose the BFS-compressed graph, using the `Transform
+<http://webgraph.di.unimi.it/docs-big/it/unimi/dsi/big/webgraph/Transform.html>`_
+class from WebGraph.
+This step is a prerequisite for LLP compression.
+
+7. SIMPLIFY
+-----------
+
+This step creates a loopless and symmetric version of the BFS-compressed graph,
+using the `Transform
+<http://webgraph.di.unimi.it/docs-big/it/unimi/dsi/big/webgraph/Transform.html>`_
+class from WebGraph.
+This step is a prerequisite for LLP compression.
+
+8. LLP
+------
+
+Better compression ratios can be achieved by using the Layered Label Propagation
+(LLP) algorithm to reorder nodes. This algorithm is described in [LLP]_.
+The LLP algorithm finds locality-preserving orders by clustering together nodes
+in close proximity. Similar to the BFS, this algorithm is particularly
+interesting for our use case as it is unsupervised, and does not rely on prior
+information on the clusters present in the graph. The idea behind the
+clustering algorithm is to randomly distribute communities to the nodes in the
+graph, then iteratively assign to each node the community most represented in
+its neighbors.
+
+.. [LLP] Paolo Boldi, Marco Rosa, Massimo Santini, Sebastiano Vigna.
+ *Layered label propagation: a multiresolution coordinate-free ordering for compressing social networks.*
+ WWW 2011: 587-596
+ DOI: https://doi.org/10.1145/1963405.1963488
+ preprint: https://arxiv.org/abs/1011.5425
+
+LLP is more costly than simple BFS-based compression in both time and memory.
+Even though the algorithm has a linear time complexity, it does multiple
+iterations on the graph and is significantly slower than the BFS, which is a
+single traversal. Moreover, keeping track of the communities requires a
+total of 13 bytes per node, which increases the RAM requirements.
+Because of these constraints, it is unrealistic to run the LLP algorithm on the
+uncompressed version of the graph; this is why we do an intermediate
+compression with the BFS ordering first, then compress the entire graph *again*
+with an even better ordering.
+
+The LLP algorithm takes a simplified (loopless, symmetric) graph as an input,
+which we already computed in the previous steps.
+
+The algorithm is also parameterized by a list of γ values, a "resolution"
+parameter which defines the shape of the clusterings it produces: either
+smaller but denser pieces, or larger but unavoidably sparser pieces. The algorithm then
+combines the different clusterings together to generate the output reordering.
+γ values are given to the algorithm in the form :math:`\frac{j}{2^k}`; by
+default, 12 different values of γ are used. However, the combination procedure
+is very slow, and using that many γ values could take several months in our
+case.
+We thus narrowed down a smaller set of γ values that empirically give good
+compression results, which are used by default in the pipeline. In general,
+smaller values of γ seem to generate better compression ratios. The effect of a
+given γ is that the density of the sparsest cluster is at least :math:`\frac{\gamma}{\gamma+1}`,
+so large γ values imply smaller, denser clusters. It is reasonable to assume that
+since the graph is very sparse to start with, such clusters are not that
+useful.
+
+The resulting ordering is stored in a ``graph-llp.order`` file.
+
+9. PERMUTE_LLP
+--------------
-Compute various statistics on the final compressed graph:
+Once the LLP order is computed, we permute the BFS-compressed graph using
+this new ordering. The LLP-compressed graph, which is our final compressed
+graph, is stored in the files ``graph.{graph,offsets,properties}``.
-- ``.stats``: entries such as number of nodes, edges, avg/min/max degree,
+10. OBL
+-------
+
+Cache the BVGraph offsets of the forward graph to make loading faster. The
+resulting offset big list is stored in the ``graph.obl`` file.
+
+11. COMPOSE_ORDERS
+------------------
+
+To be able to translate the initial MPH inputs to their resulting rank in the
+LLP-compressed graph, we need to use the two order permutations successively:
+the base → BFS permutation, then the BFS → LLP permutation.
+
+To make this less wasteful, we *compose* the two permutations into a single
+one. We use the `composePermutationsInPlace
+<https://dsiutils.di.unimi.it/docs/it/unimi/dsi/Util.html#composePermutationsInPlace(long%5B%5D%5B%5D,long%5B%5D%5B%5D)>`_
+function of the dsiutils library. The resulting permutation is stored as a
+``graph.order`` file. Hashing a SWHID with the ``graph.mph`` function, then
+permuting the result using the ``graph.order`` permutation yields the integer
+node ID matching the input SWHID in the graph.
+
+12. STATS
+---------
+
+This step computes various statistics on the compressed graph:
+
+- ``.stats``: statistics such as number of nodes, edges, avg/min/max degree,
average locality, etc.
- ``.indegree``: graph indegree distribution
- ``.outdegree``: graph outdegree distribution
@@ -117,9 +395,202 @@
class from WebGraph.
-6. Transpose
-------------
+13. TRANSPOSE
+-------------
-Create a transposed graph to allow backward traversal, using the `Transform
+Transpose the graph to allow backward traversal, using the `Transform
<http://webgraph.di.unimi.it/docs-big/it/unimi/dsi/big/webgraph/Transform.html>`_
-class from WebGraph.
+class from WebGraph. The resulting transposed graph is stored as the
+``graph-transposed.{graph,offsets,properties}`` files.
+
+
+14. TRANSPOSE_OBL
+-----------------
+
+Same as OBL, but for the transposed graph. The resulting offset big list is
+stored in the ``graph-transposed.obl`` file.
+
+
+15. MAPS
+--------
+
+This step generates the *node mappings* described in
+:ref:`swh-graph-java-node-mappings`. In particular, it generates:
+
+- ``graph.node2swhid.bin``: a compact binary representation of all the
+ SWHIDs in the graph, ordered by their rank in the graph file.
+- ``graph.node2type.bin``: a `LongBigArrayBitVector
+ <https://dsiutils.di.unimi.it/docs/it/unimi/dsi/bits/LongBigArrayBitVector.html>`_
+ which stores the type of each node.
+
+It does so by reading all the SWHIDs in the ``graph.nodes.csv.zst`` file generated in the
+EXTRACT_NODES step, then getting their corresponding node IDs (using the
+``.mph`` and ``.order`` files), then sorting all the SWHIDs according to
+their node ID. It then writes these SWHIDs in order, in a compact but seekable
+binary format, which can be used to return the SWHID corresponding to any given
+node in O(1).
+
+
+16. EXTRACT_PERSONS
+-------------------
+
+This step reads the ORC graph dataset and extracts all the unique persons it
+contains. Here, "persons" are defined as the set of unique pairs of name +
+email, potentially pseudonymized, found either as revision authors, revision
+committers or release authors.
+
+The ExtractPersons class reads all the persons from revision and release
+tables, then uses ``sort -u`` to get a sorted list without any duplicates. The
+resulting sorted list of authors is stored in the ``graph.persons.csv.zst``
+file.
+
+
+17. MPH_PERSONS
+---------------
+
+This step computes a Minimal Perfect Hash function on the set of all the unique
+persons extracted in the EXTRACT_PERSONS step. Each individual person is mapped
+to a unique integer in :math:`[0, n-1]` where *n* is the total number of
+persons. The resulting function is serialized and stored in the
+``graph.persons.mph`` file.
+
+
+18. NODE_PROPERTIES
+-------------------
+
+This step generates the *node property files*, as described in
+:ref:`swh-graph-java-node-properties`.
+The nodes in the Software Heritage Graph each have associated *properties*
+(e.g., commit timestamps, authors, messages, ...). The values of these
+properties for each node in the graph are compressed and stored in files
+alongside the compressed graph.
+
+The WriteNodeProperties class reads all the properties from the ORC Graph
+Dataset, then serializes each of them in a representation suitable for
+efficient random access (e.g., large binary arrays) and stores them on disk.
+
+For persons (authors, committers etc), the MPH computed in the MPH_PERSONS step
+is used to store them as a single pseudonymized integer ID, which uniquely
+represents a full name + email.
+
+The results are stored in the following list of files:
+
+- ``graph.property.author_id.bin``
+- ``graph.property.author_timestamp.bin``
+- ``graph.property.author_timestamp_offset.bin``
+- ``graph.property.committer_id.bin``
+- ``graph.property.committer_timestamp.bin``
+- ``graph.property.committer_timestamp_offset.bin``
+- ``graph.property.content.is_skipped.bin``
+- ``graph.property.content.length.bin``
+- ``graph.property.message.bin``
+- ``graph.property.message.offset.bin``
+- ``graph.property.tag_name.bin``
+- ``graph.property.tag_name.offset.bin``
+
+
+19. MPH_LABELS
+--------------
+
+This step computes a **monotone** Minimal Perfect Hash function on the set of
+all the unique *arc label names* extracted in the EXTRACT_NODES step. Each
+individual arc label name (i.e., directory entry names and snapshot branch
+names) is monotonically mapped to a unique integer in :math:`[0, n-1]`, where *n*
+is the total number of unique arc label names, which corresponds to their
+**lexical rank** in the set of all arc labels.
+
+In other words, this MPH being monotone means that the hash of the *k*-th item
+in the sorted input list of arc labels will always be *k*.
+We use the `LcpMonotoneMinimalPerfectHashFunction
+<https://sux.di.unimi.it/docs/it/unimi/dsi/sux4j/mph/LcpMonotoneMinimalPerfectHashFunction.html>`_
+of Sux4J to generate this function.
+
+The rationale for using a monotone function here is that it will allow us to
+quickly get back the arc label from its hash without having to store an
+additional permutation.
+The resulting MPH function is serialized and stored in the ``graph.labels.mph``
+file.
+
+
+20. FCL_LABELS
+--------------
+
+This step computes a *reverse-mapping* for arc labels, i.e., a way to
+efficiently get the arc label name from its hash computed with the monotone MPH
+of the MPH_LABELS step.
+
+Thanks to the MPH being monotone, this boils down to storing all the labels in
+lexicographic order in a string list format that allows O(1) access to its
+elements. For this purpose, we use the `MappedFrontCodedStringBigList
+<https://dsiutils.di.unimi.it/docs/it/unimi/dsi/big/util/MappedFrontCodedStringBigList.html>`_
+class from the dsiutils library, using the ``graph.labels.csv.zst`` file as its
+input. It stores the label names in a compact way by using front-coding
+compression, which is particularly efficient here because the strings are
+already in lexicographic order. The resulting FCL files are stored as
+``graph.labels.fcl.*``, and they can be loaded using memory mapping.
+
+
+21. EDGE_LABELS
+---------------
+
+
+This step generates the *edge property files*, as described in
+:ref:`swh-graph-java-edge-properties`. These files allow us to get the *edge
+labels* as we iterate on the edges of the graph. The files essentially contain
+compressed sorted triplets of the form (source, destination, label), with
+additional offsets to allow random access.
+
+To generate these files, the LabelMapBuilder class starts by reading in
+parallel the labelled edges in the ORC dataset, which can be thought of as
+quadruplets containing the source SWHID, the destination SWHID, the label name
+and the entry permission if applicable:
+
+.. code-block:: text
+
+ swh:1:snp:4548a5… swh:1:rev:0d6834… cmVmcy9oZWFkcy9tYXN0ZXI=
+ swh:1:dir:05faa1… swh:1:cnt:a35136… dGVzdC5j 33188
+ swh:1:dir:05faa1… swh:1:dir:d0ff82… dGVzdA== 16384
+ ...
+
+Using the ``graph.mph`` and the ``graph.order`` files, we hash and permute the
+source and destination nodes. We also monotonically hash the labels using the
+``graph.labels.mph`` function to obtain the arc label identifiers. The
+permissions are normalized as one of the 6 possible values in the
+``DirEntry.Permission.Type`` enum, and are then stored in the 3 lowest bits of
+the label field.
+
+.. code-block:: text
+
+ 4421 14773 154
+ 1877 21441 1134
+ 1877 14143 1141
+ ...
+
+These hashed edges and their compact-form labels are then grouped into large
+batches, sorted in an aggressively parallel fashion, and stored as
+``.bitstream`` files. These batch files are put in a heap structure to perform
+an on-the-fly merge sort over all the batches.
+
+Then, the LabelMapBuilder loads the graph and starts iterating on its edges. It
+synchronizes the stream of edges read from the graph with the stream of sorted
+edges and labels read from the bitstreams in the heap. At this point, it writes
+the labels to the following output files:
+
+- ``graph-labelled.properties``: a property file describing the graph, notably
+ containing the basename of the wrapped graph.
+- ``graph-labelled.labels``: the compressed labels
+- ``graph-labelled.labeloffsets``: the offsets used to access the labels in
+ random order.
+
+It then does the same with backward edge batches to get the transposed
+equivalent of these files:
+``graph-transposed-labelled.{properties,labels,labeloffsets}``.
+
+
+
+22. CLEAN_TMP
+-------------
+
+This step reclaims space by deleting the temporary directory, as well as all
+the intermediate outputs that are no longer necessary now that the final graph
+has been compressed (shown in gray in the step diagram).
diff --git a/docs/images/Makefile b/docs/images/Makefile
--- a/docs/images/Makefile
+++ b/docs/images/Makefile
@@ -1,7 +1,7 @@
all: compression_steps.png compression_steps.svg
%.png: %.dot
- dot -Gdpi=300 -Tpng $< -o $@
+ dot -Gdpi=150 -Tpng $< -o $@
%.svg: %.dot
dot -Tsvg $< -o $@
diff --git a/docs/images/compression_steps.dot b/docs/images/compression_steps.dot
--- a/docs/images/compression_steps.dot
+++ b/docs/images/compression_steps.dot
@@ -1,51 +1,105 @@
digraph "Compression steps" {
- // Horizontal graph
- rankdir=LR;
+ node [shape = none];
+
+ orc_dataset [label="ORC Graph\nDataset"];
+ nodes_csv [label="graph.nodes.csv.zst"];
+ labels_csv [label="graph.labels.csv.zst"];
+ graph_mph [label="graph.mph"];
subgraph {
- input_edges [label="swh.edges.csv.gz", fontsize=9, shape=none];
- input_nodes [label="swh.nodes.csv.gz", fontsize=9, shape=none];
- {rank=same; input_edges; input_nodes;}
+ node [fontcolor=darkgray];
+ graph_base [label="graph-base.graph"]
+ graph_bfs_order [label="graph-bfs.order"]
+ graph_bfs [label="graph-bfs.graph"]
+ graph_bfs_transposed [label="graph-bfs-transposed.graph"]
+ graph_bfs_simplified [label="graph-bfs-simplified.graph"]
+ graph_llp_order [label="graph-llp.order"]
}
- mph [label="MPH", shape=box];
- mph_out [label="swh.mph", fontsize=9, shape=none];
-
- bv_compress [label="BV compress", shape=box];
- bv_compress_out
- [label="swh-bv.graph\lswh-bv.offsets\lswh-bv.obl\lswh-bv.properties",
- fontsize=9, shape=none];
-
- bfs [label="BFS", shape=box];
- bfs_out [label="swh.order", fontsize=9, shape=none];
+ graph_llp [label="graph.graph"]
+ graph_llp_transposed [label="graph-transposed.graph"]
+ graph_order [label="graph.order"]
+ graph_obl [label="graph.obl"]
+ graph_transposed_obl [label="graph-transposed.obl"]
+ stats [label="graph.stats"]
+ swhidmap [label="graph.node2swhid.bin"]
+ typemap [label="graph.node2type.bin"]
+ persons_csv [label="graph.persons.csv.zst"];
+ persons_mph [label="graph.persons.mph"];
+ node_properties [label="graph.property.*"];
+ labels_mph [label="graph.labels.mph"];
+ labels_fcl [label="graph.labels.fcl"];
+ graph_labelled [label="graph-labelled.*"];
+ graph_transposed_labelled [label="graph-transposed-labelled.*"];
- permute [label="Permute", shape=box];
- permute_out
- [label="swh.graph\lswh.offsets\lswh.obl\lswh.properties",
- fontsize=9, shape=none];
-
- stats [label="Stats", shape=box];
- stats_out
- [label="swh.stats\lswh.indegree\lswh.outdegree",
- fontsize=9, shape=none];
+ subgraph {
+ node [shape=box, fontname="Courier New"];
+ EXTRACT_NODES;
+ MPH;
+ BV;
+ BFS;
+ PERMUTE_BFS;
+ TRANSPOSE_BFS;
+ SIMPLIFY;
+ LLP;
+ PERMUTE_LLP;
+ COMPOSE_ORDERS;
+ STATS;
+ TRANSPOSE;
+ OBL;
+ TRANSPOSE_OBL;
+ NODE_MAP;
+ EXTRACT_PERSONS;
+ MPH_PERSONS;
+ NODE_PROPERTIES;
+ MPH_LABELS;
+ FCL_LABELS;
+ EDGE_LABELS;
+ }
- transpose [label="Transpose", shape=box];
- transpose_out
- [label="swh-transposed.graph\lswh-transposed.offsets\lswh-transposed.obl\lswh-transposed.properties",
- fontsize=9, shape=none];
- input_nodes -> mph;
- input_edges -> bv_compress;
- mph -> mph_out;
- mph_out -> bv_compress;
- bv_compress -> bv_compress_out;
- bv_compress_out-> bfs;
- bv_compress_out-> permute;
- bfs -> bfs_out;
- bfs_out -> permute;
- permute -> permute_out;
- permute_out -> stats;
- permute_out -> transpose;
- stats -> stats_out;
- transpose -> transpose_out;
+ orc_dataset -> EXTRACT_NODES;
+ EXTRACT_NODES -> nodes_csv;
+ EXTRACT_NODES -> labels_csv;
+ nodes_csv -> MPH -> graph_mph;
+ graph_mph -> BV;
+ orc_dataset -> BV -> graph_base;
+ graph_base -> BFS -> graph_bfs_order;
+ graph_bfs_order -> PERMUTE_BFS;
+ graph_base -> PERMUTE_BFS -> graph_bfs;
+ graph_bfs -> TRANSPOSE_BFS -> graph_bfs_transposed;
+ graph_bfs_transposed -> SIMPLIFY;
+ graph_bfs -> SIMPLIFY -> graph_bfs_simplified;
+ graph_bfs_simplified -> LLP -> graph_llp_order;
+ graph_llp_order -> PERMUTE_LLP;
+ graph_bfs -> PERMUTE_LLP -> graph_llp;
+ graph_bfs_order -> COMPOSE_ORDERS;
+ graph_llp_order -> COMPOSE_ORDERS -> graph_order;
+ graph_llp -> TRANSPOSE -> graph_llp_transposed;
+ graph_llp -> OBL -> graph_obl;
+ graph_llp_transposed -> TRANSPOSE_OBL -> graph_transposed_obl;
+ graph_llp -> STATS -> stats;
+ graph_llp -> NODE_MAP;
+ nodes_csv -> NODE_MAP;
+ graph_mph -> NODE_MAP;
+ graph_order -> NODE_MAP;
+ NODE_MAP -> swhidmap;
+ NODE_MAP -> typemap;
+ orc_dataset -> EXTRACT_PERSONS -> persons_csv;
+ persons_csv -> MPH_PERSONS -> persons_mph;
+ orc_dataset -> NODE_PROPERTIES;
+ persons_mph -> NODE_PROPERTIES;
+ graph_mph -> NODE_PROPERTIES;
+ graph_order -> NODE_PROPERTIES;
+ NODE_PROPERTIES -> node_properties;
+ labels_csv -> MPH_LABELS -> labels_mph;
+ labels_mph -> FCL_LABELS;
+ labels_csv -> FCL_LABELS -> labels_fcl;
+ orc_dataset -> EDGE_LABELS;
+ labels_mph -> EDGE_LABELS;
+ graph_llp -> EDGE_LABELS;
+ graph_mph -> EDGE_LABELS;
+ graph_order -> EDGE_LABELS;
+ EDGE_LABELS -> graph_labelled;
+ EDGE_LABELS -> graph_transposed_labelled;
}
diff --git a/docs/index.rst b/docs/index.rst
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -8,10 +8,11 @@
:caption: Overview
quickstart
+ api
+ java-api
+ memory
compression
cli
- api
- use-cases
docker
git2graph
/apidoc/swh.graph
diff --git a/docs/java-api.rst b/docs/java-api.rst
new file mode 100644
--- /dev/null
+++ b/docs/java-api.rst
@@ -0,0 +1,744 @@
+.. _swh-graph-java-api:
+
+Using the Java API
+==================
+
+.. highlight:: java
+
+While the :ref:`HTTP API <swh-graph-api>` is useful for many common use-cases,
+it is often not sufficient to implement more complex algorithms. This section
+describes the low-level Java API that ``swh-graph`` provides on top of the
+WebGraph framework to manipulate the compressed graph of Software Heritage.
+
+A cursory understanding of the `WebGraph framework
+<https://webgraph.di.unimi.it/>`_ and its API is helpful to understand the
+notions detailed here.
+
+.. _swh-graph-java-basics:
+
+Basics
+------
+
+In the WebGraph framework, graphs are generally subclasses of
+`ImmutableGraph
+<https://webgraph.di.unimi.it/docs/it/unimi/dsi/webgraph/ImmutableGraph.html>`_,
+the abstract class providing the core API to manipulate and iterate on graphs.
+Under the hood, compressed graphs are stored as the `BVGraph
+<https://webgraph.di.unimi.it/docs/it/unimi/dsi/webgraph/BVGraph.html>`_
+class, which contains the actual codec used to compress and decompress
+adjacency lists.
+
+Graph **nodes** are mapped to a contiguous set of integers :math:`[0, n - 1]`
+where *n* is the total number of nodes in the graph.
+Each node has an associated *adjacency list*, i.e., the list of nodes reached
+by the arcs going out of it. These lists represent the **edges** (or **arcs**)
+of the graph.
+
+**Note**: edges are always directed. Undirected graphs are internally stored as
+a pair of directed edges (src → dst, dst → src), and are called "symmetric"
+graphs.
+
+On disk, a simple BVGraph with the basename ``graph`` would be represented as
+the following set of files:
+
+- ``graph.graph``: contains the compressed adjacency lists of each node, which
+ can be decompressed by the BVGraph codec.
+- ``graph.properties``: contains metadata on the graph, such as the number of
+ nodes and arcs, as well as additional loading information needed by the
+ BVGraph codec.
+- ``graph.offsets``: a list of offsets of where the adjacency list of each node
+ is stored in the main graph file.
+- ``graph.obl``: optionally, an "offset big-list file" which can be used to
+ load graphs faster.
+
+An ImmutableGraph can be loaded using different *load methods*, which each have
+different performance implications:
+
+- ``load()``: the entire graph is loaded in RAM and supports random access.
+- ``loadMapped()``: the graph is loaded by memory-mapping it from disk (see
+ ``mmap(1)``), at the cost of being potentially slower, especially when doing
+ random access on slow storage.
+- ``loadOffline()``: no data is actually loaded in memory; only sequential
+ iteration is possible.
+
+The following code loads a graph stored on disk under the ``compressed/graph``
+basename, using the memory-mapped loading mode, and stores it as a generic
+ImmutableGraph:
+
+.. code-block:: java
+
+ ImmutableGraph graph = ImmutableGraph.loadMapped("compressed/graph");
+
+Note that most of the time you will want to use the SWH-specific subclass
+**SwhUnidirectionalGraph** instead, which has the same API as an ImmutableGraph
+except it adds other SWH-specific methods. It is described later in the
+:ref:`swh-graph-java-node-mappings` section.
+
+
+Running the code
+----------------
+
+To run a piece of Java code written using the Java API, you need to run it with
+all the dependencies in your classpath (the WebGraph libraries and the
+swh-graph library). The easiest way to do so is to use the *fat jar*
+shipped with the swh-graph package on PyPI, which contains all the required
+dependencies.
+
+.. code-block:: console
+
+ $ java -cp venv/share/swh-graph/swh-graph-0.5.2.jar MyAlgo.java
+
+
+Note that for larger graphs, the default heap size of the JVM is likely to be
+insufficient to load the entire graph in memory. It is advised to increase the
+heap size with the ``-Xmx`` flag:
+
+.. code-block:: console
+
+ $ java -Xmx300G -cp venv/share/swh-graph/swh-graph-0.5.2.jar MyAlgo.java
+
+For more information on performance tuning and memory considerations, you
+should also read the :ref:`swh-graph-memory` page, in which we recommend
+additional JVM options for loading large graphs.
+
+
+Simple traversal
+----------------
+
+The ImmutableGraph class provides primitives to iterate and traverse graphs. It
+contains the following methods:
+
+- ``graph.numNodes()`` returns the number of nodes in the graph (*n*).
+- ``graph.numArcs()`` returns the number of arcs in the graph.
+
+And, given a node ID :math:`k \in [0, n - 1]`:
+
+- ``graph.successors(k)`` returns a LazyLongIterator on the nodes that are
+ *adjacent* to *k* (i.e., its outgoing *neighbors*).
+- ``graph.outdegree(k)`` returns the number of outgoing neighbors of *k*.
+
+
+Example: Average outdegree
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The following code can be used to compute the average
+outdegree of a graph, which is a useful measure of its density:
+
+.. code-block:: java
+
+ public static long averageOutdegree(ImmutableGraph graph) {
+ return ((long) graph.numArcs()) / graph.numNodes();
+ }
+
+
+Example: Degree distributions
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Using the ``outdegree()`` primitive, we can compute the outdegree distribution
+of the graph by iterating on all its nodes. The distribution will be returned
+as a map that associates to each degree *d* the number of nodes with outdegree
+*d*.
+
+.. code-block:: java
+
+ public static Map<Long, Long> outdegreeDistribution(ImmutableGraph graph) {
+ HashMap<Long, Long> distribution = new HashMap<Long, Long>();
+ for (long k = 0; k < graph.numNodes(); ++k) {
+ distribution.merge(graph.outdegree(k), 1L, Long::sum);
+ }
+ return distribution;
+ }
+
+
+Example: Depth-First Traversal
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The ``successors`` primitive can be used to write a simple stack-based DFS
+traversal on the graph which starts from a given node and prints all the
+descendant nodes in its transitive closure:
+
+.. code-block:: java
+ :emphasize-lines: 10
+
+ public static void visitNodesDFS(ImmutableGraph graph, long srcNodeId) {
+ Stack<Long> stack = new Stack<>();
+ HashSet<Long> visited = new HashSet<Long>();
+ stack.push(srcNodeId);
+ visited.add(srcNodeId);
+
+ while (!stack.isEmpty()) {
+ long currentNodeId = stack.pop();
+ System.out.println(currentNodeId);
+
+ LazyLongIterator it = graph.successors(currentNodeId);
+ for (long neighborNodeId; (neighborNodeId = it.nextLong()) != -1;) {
+ if (!visited.contains(neighborNodeId)) {
+ stack.push(neighborNodeId);
+ visited.add(neighborNodeId);
+ }
+ }
+ }
+ }
+
+Example: Breadth-First Traversal
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Swapping the stack for a queue changes the traversal order from depth-first
+to breadth-first:
+
+.. code-block:: java
+ :emphasize-lines: 2
+
+ public static void visitNodesBFS(ImmutableGraph graph, long srcNodeId) {
+ Queue<Long> queue = new ArrayDeque<>();
+ HashSet<Long> visited = new HashSet<Long>();
+ queue.add(srcNodeId);
+ visited.add(srcNodeId);
+
+ while (!queue.isEmpty()) {
+ long currentNodeId = queue.poll();
+ System.out.println(currentNodeId);
+
+ LazyLongIterator it = graph.successors(currentNodeId);
+ for (long neighborNodeId; (neighborNodeId = it.nextLong()) != -1;) {
+ if (!visited.contains(neighborNodeId)) {
+ queue.add(neighborNodeId);
+ visited.add(neighborNodeId);
+ }
+ }
+ }
+ }
+
+
+.. _swh-graph-java-node-mappings:
+
+Node types and SWHIDs
+---------------------
+
+In the Software Heritage archive, nodes are not represented by a simple
+integer, but by a :ref:`SWHID <persistent-identifiers>`, which contains both the
+*type* of the node (revision, directory, blob...) and its unique identifier. We
+use **node mappings** which allow us to translate between SWHIDs and the
+compact node IDs used in the compressed graph.
+
+Most notably, we use a MPH (Minimal Perfect Hash) function implemented in the
+`GOVMinimalPerfectHashFunction
+<http://sux.di.unimi.it/docs/it/unimi/dsi/sux4j/mph/GOVMinimalPerfectHashFunction.html>`_
+class of the Sux4J library, which maps N keys to N consecutive integers with no
+collisions.
+
+The following files are used to store the mappings between the nodes and their
+types:
+
+- ``graph.mph``: contains a serialized minimal perfect hash function computed
+ on the list of all the SWHIDs in the graph.
+- ``graph.order``: contains the permutation that associates with each output of
+ the MPH the node ID to which it corresponds
+- ``graph.node2swhid.bin``: contains a compact binary representation of all the
+ SWHIDs in the graph, ordered by their rank in the graph file.
+- ``graph.node2type.bin``: contains a `LongBigArrayBitVector
+ <https://dsiutils.di.unimi.it/docs/it/unimi/dsi/bits/LongBigArrayBitVector.html>`_
+ which stores the type of each node.
+
+To use these mappings easily, we provide the class **SwhUnidirectionalGraph**,
+an ImmutableGraph which wraps the underlying graph and adds a few
+utility methods to obtain SWH-specific information on the graph.
+
+A SwhUnidirectionalGraph can be loaded in a similar way to any ImmutableGraph,
+as long as the mapping files listed above are present::
+
+ SwhUnidirectionalGraph graph = SwhUnidirectionalGraph.load(basename);
+
+This class exposes the same graph primitives as an ImmutableGraph, but it
+additionally contains the following methods:
+
+- ``SWHID getSWHID(long nodeId)``: returns the SWHID associated with a given
+ node ID. This function does a lookup of the SWHID at offset *i* in the file
+ ``graph.node2swhid.bin``.
+
+- ``long getNodeID(SWHID swhid)``: returns the node ID associated with a given
+ SWHID. It works by hashing the SWHID with the function stored in
+ ``graph.mph``, then permuting it using the permutation stored in
+ ``graph.order``. It does additional domain-checking by calling ``getSWHID()``
+ on its own result to check that the input SWHID was valid.
+
+- ``Node.Type getNodeType(long nodeID)``: returns the type of a given node, as
+ an enum of all the different object types in the Software Heritage data
+ model. It does so by looking up the value at offset *i* in the bit vector
+ stored in ``graph.node2type.bin``.
+
+
+Example: Find the target directory of a revision
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+As an example, we use the methods mentioned above to perform the
+following task: "given a revision, return its target directory". To do so, we
+first look up the node ID of the given revision in the compressed graph. We
+iterate on the successors of that node, and return the SWHID of the first
+destination node that has the "directory" type.
+
+
+.. code-block:: java
+ :emphasize-lines: 2
+
+ public SWHID findDirectoryOfRevision(SwhUnidirectionalGraph graph, SWHID revSwhid) {
+        long src = graph.getNodeID(revSwhid);
+        assert graph.getNodeType(src) == Node.Type.REV;
+        LazyLongIterator it = graph.successors(src);
+        for (long dst; (dst = it.nextLong()) != -1;) {
+            if (graph.getNodeType(dst) == Node.Type.DIR) {
+                return graph.getSWHID(dst);
+            }
+        }
+        throw new RuntimeException("Revision has no target directory");
+ }
+
+.. _swh-graph-java-node-properties:
+
+Node properties
+---------------
+
+The Software Heritage Graph is a *property graph*, which means it has various
+properties associated with its nodes and edges (e.g., commit timestamps,
+authors, messages, ...). We compress these properties and store them in files
+alongside the compressed graph. This allows you to write traversal algorithms
+that depend on these properties.
+
+By default, properties are not assumed to be present and are not loaded when
+the graph itself is loaded. If you want to use a property, you need to
+explicitly load it first. As an example, this is how you load the "content
+length" property to get the length of a given blob::
+
+ SwhUnidirectionalGraph graph = SwhUnidirectionalGraph.load(basename);
+ graph.loadContentLength();
+ long blobSize = graph.getContentLength(graph.getNodeID(swhid));
+
+The documentation of the SwhGraphProperties class (**TODO: link**) lists all
+the different properties, their types, and the methods used to load them and to get
+their value for a specific node.
+
+A few things of note:
+
+- A single loading call can load multiple properties at once; this is because
+ they are stored in the same file to be more space efficient.
+
+- Persons (authors, committers etc) are exported as a single pseudonymized
+ integer ID, that uniquely represents a full name + email.
+
+- Timestamps are stored as a long integer (for the timestamp itself) and a
+ short integer (for the UTC offset).
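+
+For example, committer timestamps (whose loading methods also appear in the
+provenance example further below) are loaded and read in the same way;
+``basename`` and ``swhid`` are placeholders here::
+
+    SwhUnidirectionalGraph graph = SwhUnidirectionalGraph.load(basename);
+    graph.loadCommitterTimestamps();
+    // getCommitterTimestamp() may return null, as in the provenance example below.
+    Long ts = graph.getCommitterTimestamp(graph.getNodeID(swhid));
+    if (ts != null) {
+        System.out.println("committer timestamp: " + ts);
+    }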
+
+
+.. _swh-graph-java-edge-properties:
+
+Edge labels
+-----------
+
+While looking up graph properties on the *nodes* of the graph is relatively
+straightforward, doing so for labels on the *arcs* is comparatively more
+difficult. These include the names and permissions of directory entries, as
+well as the branch names of snapshots.
+
+The `ArcLabelledImmutableGraph
+<https://webgraph.di.unimi.it/docs/it/unimi/dsi/webgraph/labelling/ArcLabelledImmutableGraph.html>`_
+class in WebGraph wraps an ImmutableGraph, but augments its iterators by making them
+*labelled iterators*, which essentially allow us to look up the label of the
+arcs while iterating on them.
+
+This labelled graph is stored in the following files:
+
+- ``graph-labelled.properties``: a property file describing the graph, notably
+ containing the basename of the wrapped graph.
+- ``graph-labelled.labels``: the compressed labels
+- ``graph-labelled.labeloffsets``: the offsets used to access the labels in
+ random order.
+
+The SwhUnidirectionalGraph class contains *labelled* loading methods
+(``loadLabelled()``, ``loadLabelledMapped()``, ...). When these loading methods
+are used instead of the standard non-labelled ones, the graph is loaded as an
+ArcLabelledImmutableGraph instead of an ImmutableGraph. The following methods
+can then be used:
+
+- ``labelledSuccessors(k)`` returns a `LabelledArcIterator
+ <https://webgraph.di.unimi.it/docs/it/unimi/dsi/webgraph/labelling/ArcLabelledNodeIterator.LabelledArcIterator.html>`_
+ which is used in the same way as a LazyLongIterator except it also contains a
+ ``label()`` method to get the label of the currently traversed arc.
+- ``labelledNodeIterator()`` returns an `ArcLabelledNodeIterator
+ <https://webgraph.di.unimi.it/docs/it/unimi/dsi/webgraph/labelling/ArcLabelledNodeIterator.html>`_
+ of all the nodes in the graph, which replaces the LazyLongIterator of the
+  ``successors()`` function by a LabelledArcIterator similar to the one above.
+
+
+Label format
+~~~~~~~~~~~~
+
+The labels of each arc are returned as a ``DirEntry[]`` array. They encode
+both the name of a directory entry and its permissions. For snapshot branches,
+only the "name" field is useful.
+
+Arc label names are encoded as an integer ID representing each unique
+entry/branch name present in the graph. To retrieve the actual name associated
+with a given label ID, one needs to load the reverse mapping similar to how you
+would do for a normal property::
+
+ SwhUnidirectionalGraph graph = SwhUnidirectionalGraph.loadLabelled(basename);
+ graph.loadLabelNames();
+
+The byte array representing the actual label name can then be loaded with::
+
+ byte[] name = graph.getLabelName(label.filenameId);
+
+
+Multiedges
+~~~~~~~~~~
+
+The Software Heritage graph is not a *simple graph*, where at most one edge can exist
+between two vertices, but a *multigraph*, where multiple edges can be incident
+to the same two vertices. Consider for instance the case of a single directory
+``test/`` containing twice the same file blob (e.g., the empty file), under two
+different names (e.g., ``ISSUES.txt`` and ``TODO.txt``, both completely empty).
+The simple graph view of this directory will represent it as a single edge
+``test`` → *empty file*, while the multigraph view will represent it as *two*
+edges between the same nodes.
+
+Due to the copy-list model of compression, WebGraph only stores simple graphs,
+and thus stores multiedges as single edges, to which we cannot associate
+a single label name (in our example, we need to associate both names
+``ISSUES.txt`` and ``TODO.txt``).
+To represent this possibility of having multiple file names for a single arc,
+in the case of multiple relationships between two identical nodes, each arc label is
+stored as an *array* of DirEntry, each record representing one relationship
+between two nodes.
+
+
+Example: Printing all the entries of a directory
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The following code showcases how one can print all the entries (name,
+permission and target SWHID) of a given directory, using the labelled methods
+seen above.
+
+.. code-block:: java
+
+    public static void printEntries(SwhUnidirectionalGraph g, long dirNode) {
+        LabelledArcIterator s = g.labelledSuccessors(dirNode);
+        for (long dst; (dst = s.nextLong()) >= 0;) {
+            DirEntry[] labels = (DirEntry[]) s.label().get();
+            for (DirEntry label : labels) {
+                System.out.format(
+                    "%s %s %d\n",
+                    g.getSWHID(dst),
+                    new String(g.getLabelName(label.filenameId)),
+                    label.permission
+                );
+            }
+        }
+    }
+
+ // Usage: $PROGRAM <GRAPH_BASENAME> <DIR_SWHID>
+ public static void main(String[] args) {
+ SwhUnidirectionalGraph g = SwhUnidirectionalGraph.loadLabelledMapped(args[0]);
+ g.loadLabelNames();
+ long dirNode = g.getNodeID(new SWHID(args[1]));
+ printEntries(g, dirNode);
+ }
+
+
+Transposed graph
+----------------
+
+Up until now, we have only looked at how to traverse the *forward* graph, i.e.,
+the directed graph whose edges are in the same direction as the Merkle DAG of
+the Software Heritage archive.
+For many purposes, especially that of finding the *provenance* of software
+artifacts, it is useful to query the *backward* (or *transposed*) graph
+instead, which is the same as the forward graph except all the edges are
+reversed.
+
+The transposed graph has its own set of files, counterparts to the files needed
+for the forward graph:
+
+- ``graph-transposed.graph``
+- ``graph-transposed.properties``
+- ``graph-transposed.offsets``
+- ``graph-transposed.obl``
+- ``graph-transposed-labelled.labels``
+- ``graph-transposed-labelled.labeloffsets``
+- ``graph-transposed-labelled.properties``
+
+However, because node IDs are the same in the forward and the backward graph,
+all the files that pertain to mappings between the node IDs and various
+properties (SWHIDs, property data, node permutations etc) remain the same.
+
+
+Example: Earliest revision containing a given blob
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The following code loads all the committer timestamps of the revisions in the
+graph, then walks the *transposed* graph to return the earliest revision
+containing a given object.
+
+.. code-block:: java
+
+ public static long findEarliestRevisionContaining(SwhUnidirectionalGraph g, long src) {
+ long oldestRev = -1;
+ long oldestRevTs = Long.MAX_VALUE;
+
+ Stack<Long> stack = new Stack<>();
+ HashSet<Long> visited = new HashSet<Long>();
+ stack.push(src);
+ visited.add(src);
+ while (!stack.isEmpty()) {
+ long currentNodeId = stack.pop();
+            LazyLongIterator it = g.successors(currentNodeId);
+ for (long neighborNodeId; (neighborNodeId = it.nextLong()) != -1;) {
+ if (!visited.contains(neighborNodeId)) {
+ stack.push(neighborNodeId);
+ visited.add(neighborNodeId);
+ if (g.getNodeType(neighborNodeId) == Node.Type.REV) {
+ Long ts = g.getCommitterTimestamp(neighborNodeId);
+ if (ts != null && ts < oldestRevTs) {
+ oldestRev = neighborNodeId;
+ oldestRevTs = ts;
+ }
+ }
+ }
+ }
+ }
+ return oldestRev;
+ }
+
+ // Usage: $PROGRAM <GRAPH_BASENAME> <BLOB_SWHID>
+ public static void main(String[] args) {
+ // Load the backward (= transposed) graph as a SwhUnidirectionalGraph
+ SwhUnidirectionalGraph g = SwhUnidirectionalGraph.loadMapped(args[0] + "-transposed");
+ g.loadCommitterTimestamps();
+ long node = g.getNodeID(new SWHID(args[1]));
+ long oldestRev = findEarliestRevisionContaining(g, node);
+ System.out.println(g.getSWHID(oldestRev));
+ }
+
+
+
+
+Bidirectional Graph
+-------------------
+
+
+BidirectionalImmutableGraph
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+While ``graph-transposed`` can be loaded as a simple SwhUnidirectionalGraph and
+then manipulated just like the forward graph, it is often convenient to have
+*both* the forward and the backward graph in memory. Some traversal algorithms
+require first going down in the forward graph to select some nodes, then going
+up to find their provenance.
+
+To achieve that, we use the `BidirectionalImmutableGraph
+<https://webgraph.di.unimi.it/docs-big/it/unimi/dsi/big/webgraph/BidirectionalImmutableGraph.html>`_
+class from WebGraph, which stores both a graph and its transpose.
+This class provides the following methods to iterate on the **backward** graph,
+shown here with their counterparts for the forward graph:
+
+.. list-table::
+ :header-rows: 1
+
+ * - Forward graph operation
+ - Backward graph operation
+
+ * - ``outdegree(k)``
+ - ``indegree(k)``
+
+ * - ``successors(k)``
+ - ``predecessors(k)``
+
+In addition, the class offers a few convenience methods which are generally
+useful when you have both a graph and its transpose:
+
+- ``transpose()`` returns the transpose of the BidirectionalImmutableGraph by
+ inverting the references to the forward and the backward graphs. Successors
+ become predecessors, and vice-versa.
+- ``symmetrize()`` returns the symmetric (= undirected) version of the
+  bidirectional graph. It is implemented as the union of the forward and the
+  backward graph, which effectively removes the directionality of the edges
+  (the successors of a node are also its predecessors).
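+
+As a minimal illustration of these operations (``compareDirections`` is a
+hypothetical helper written for this page, not a library method), the
+following sketch contrasts the two directions of a given node:
+
+.. code-block:: java
+
+    public static void compareDirections(BidirectionalImmutableGraph g, long nodeId) {
+        // Arcs going out of the node (forward direction) vs. arcs pointing
+        // to it (backward direction).
+        long out = g.outdegree(nodeId);
+        long in = g.indegree(nodeId);
+
+        // successors() iterates on the forward graph, predecessors() on the
+        // backward graph.
+        LazyLongIterator it = g.predecessors(nodeId);
+        for (long pred; (pred = it.nextLong()) != -1;) {
+            System.out.println(pred + " -> " + nodeId);
+        }
+
+        // transpose() swaps the two directions: successors become
+        // predecessors, and the degrees are exchanged accordingly.
+        BidirectionalImmutableGraph reversed = g.transpose();
+        assert reversed.outdegree(nodeId) == in;
+        assert reversed.indegree(nodeId) == out;
+    }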
+
+
+SwhBidirectionalGraph
+~~~~~~~~~~~~~~~~~~~~~
+
+As with ImmutableGraph, we extend BidirectionalImmutableGraph with
+SWH-specific methods in the subclass ``SwhBidirectionalGraph``. Notably, it
+contains the method ``labelledPredecessors()``, the equivalent of
+``labelledSuccessors()`` but on the backward graph.
+
+Because SwhUnidirectionalGraph inherits from ImmutableGraph, and
+SwhBidirectionalGraph inherits from BidirectionalImmutableGraph, we put the
+behavior common to the two classes in a SwhGraph interface, which can
+represent either a unidirectional or a bidirectional graph.
+
+To avoid loading the node properties twice (once for each direction), they
+are stored in a separate class called SwhGraphProperties. In a
+SwhBidirectionalGraph, the two SwhUnidirectionalGraph instances share their
+node properties in memory by storing references to the same
+SwhGraphProperties object.
+
+.. code-block:: text
+
+
+ ┌──────────────┐
+ │ImmutableGraph◄────────┐
+ └────▲─────────┘ │extends
+ │ │
+ │ ┌──────────┴────────────────┐
+ extends│ │BidirectionalImmutableGraph│
+ │ └────────────▲──────────────┘
+ │ │extends
+ ┌──────────────┴───────┐ ┌──────┴──────────────┐
+ │SwhUnidirectionalGraph│◄────┤SwhBidirectionalGraph│
+ └──┬──────────────┬────┘ └────────┬───────────┬┘
+ │ │ contains x2 │ │
+ │ │ │ │
+ │ implements│ │implements │
+ │ ┌▼──────────┐ │ │
+ │ │SwhGraph(I)◄────────┘ │
+ contains│ └───────────┘ │contains
+ │ │
+ │ ┌──────────────────┐ │
+ └────────────►SwhGraphProperties◄──────────────┘
+ └──────────────────┘
+
+
+Example: Find all the shared-commit forks of a given origin
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The *forks* of an origin can be defined as the set of origins that share at
+least one revision with it.
+
+The following code loads the graph in both directions using a
+SwhBidirectionalGraph. Given an origin SWHID, it first walks the *forward*
+graph to find all its root revisions. It then walks the *backward* graph to
+find all the origins containing these root revisions, i.e., its *forks*.
+
+.. code-block:: java
+
+    public static void findSharedCommitForks(SwhBidirectionalGraph g, long srcOrigin) {
+ Stack<Long> forwardStack = new Stack<>();
+ HashSet<Long> forwardVisited = new HashSet<Long>();
+ Stack<Long> backwardStack = new Stack<>();
+ HashSet<Long> backwardVisited = new HashSet<Long>();
+
+ // First traversal (forward graph): find all the root revisions of the
+ // origin
+ forwardStack.push(srcOrigin);
+ forwardVisited.add(srcOrigin);
+ while (!forwardStack.isEmpty()) {
+ long curr = forwardStack.pop();
+            LazyLongIterator it = g.successors(curr);
+ boolean isRootRevision = true;
+ for (long succ; (succ = it.nextLong()) != -1;) {
+ Node.Type nt = g.getNodeType(succ);
+ if (!forwardVisited.contains(succ)
+ && nt != Node.Type.DIR && nt != Node.Type.CNT) {
+ forwardStack.push(succ);
+ forwardVisited.add(succ);
+ isRootRevision = false;
+ }
+ }
+ if (g.getNodeType(curr) == Node.Type.REV && isRootRevision) {
+ // Found a root revision, add it to the second stack
+ backwardStack.push(curr);
+ backwardVisited.add(curr);
+ }
+ }
+
+ // Second traversal (backward graph): find all the origins containing
+ // any of these root revisions and print them
+ while (!backwardStack.isEmpty()) {
+ long curr = backwardStack.pop();
+            LazyLongIterator it = g.predecessors(curr);
+            for (long pred; (pred = it.nextLong()) != -1;) {
+                Node.Type nt = g.getNodeType(pred);
+                if (!backwardVisited.contains(pred)) {
+                    backwardStack.push(pred);
+                    backwardVisited.add(pred);
+                    if (nt == Node.Type.ORI) {
+                        // Found an origin, print it.
+                        System.out.println(g.getSWHID(pred));
+                    }
+                }
+            }
+ }
+ }
+
+ // Usage: $PROGRAM <GRAPH_BASENAME> <ORI_SWHID>
+ public static void main(String[] args) {
+ // Load both forward and backward graphs as a SwhBidirectionalGraph
+ SwhBidirectionalGraph g = SwhBidirectionalGraph.loadMapped(args[0]);
+ long node = g.getNodeID(new SWHID(args[1]));
+ findSharedCommitForks(g, node);
+ }
+
+
+Large-scale processing
+----------------------
+
+Multithreading
+~~~~~~~~~~~~~~
+
+ImmutableGraph is not thread-safe. When writing multithreaded algorithms,
+calling ``successors()`` on the same graph from multiple threads will return
+garbage.
+
+Instead, each thread should create its own "lightweight copy" of the graph by
+calling ``.copy()``. This will not actually copy the entire graph data, which
+will remain shared across threads, but it will create new instances of the
+iterators so that each thread can independently iterate on the graph data.
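+
+Here is a sketch of that pattern (the number of threads and the per-thread
+workload are arbitrary, purely for illustration):
+
+.. code-block:: java
+
+    public static void countArcsInParallel(SwhUnidirectionalGraph g) throws InterruptedException {
+        int numThreads = 4; // arbitrary
+        Thread[] threads = new Thread[numThreads];
+        for (int t = 0; t < numThreads; t++) {
+            final int threadId = t;
+            threads[t] = new Thread(() -> {
+                // Lightweight copy: the underlying graph data stays shared,
+                // but this thread gets its own iterators.
+                ImmutableGraph localGraph = g.copy();
+                long count = 0;
+                for (long node = threadId; node < localGraph.numNodes(); node += numThreads) {
+                    count += localGraph.outdegree(node);
+                }
+                System.out.println("Thread " + threadId + " counted " + count + " arcs");
+            });
+            threads[t].start();
+        }
+        for (Thread thread : threads) {
+            thread.join();
+        }
+    }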
+
+
+Data structures for large traversals
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When doing very large traversals, such as a BFS on the entire graph, the
+usual data structures (HashSet, Stack, ArrayDeque, etc.) will be quite
+inefficient. If you know you are going to traverse large parts of the graph,
+it's better to use more appropriate data structures, many of which can be
+found in the dsiutils library. In particular:
+
+- `LongArrayBitVector
+ <https://dsiutils.di.unimi.it/docs/it/unimi/dsi/bits/LongArrayBitVector.html>`_
+ is an efficient bit-vector implementation, which can be used to store the
+  nodes that have already been seen during the visit. Its memory footprint is
+  too large for small traversals, but it is very efficient when traversing the
+  full graph, as every node only takes a single bit.
+
+- `ByteDiskQueue
+ <https://dsiutils.di.unimi.it/docs/it/unimi/dsi/io/ByteDiskQueue.html>`_ can
+ be used to efficiently store the queue of nodes to visit on disk, when it is
+ too large to fit in RAM.
+
+Other types in dsiutils and fastutil can save significant memory:
+``LongArrayList`` saves at least 8 bytes per entry over ``ArrayList<Long>``,
+and ``Long2LongOpenHashMap`` saves at least 16 bytes for every entry over
+``HashMap<Long, Long>``. We strongly recommend reading the documentation of the
+unimi libraries and looking at the code for usage examples.
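+
+As an illustration, here is a sketch of a full-graph reachability count using
+``LongArrayBitVector`` as the visited set and a fastutil ``LongArrayList`` as
+the traversal stack (the method itself is hypothetical, written in the style
+of the earlier examples):
+
+.. code-block:: java
+
+    public static long countReachableNodes(SwhUnidirectionalGraph g, long srcNode) {
+        // One bit per node of the graph: far more compact than a HashSet<Long>
+        // when a large fraction of the graph will be visited.
+        LongArrayBitVector visited = LongArrayBitVector.ofLength(g.numNodes());
+        LongArrayList stack = new LongArrayList();
+
+        stack.push(srcNode);
+        visited.set(srcNode);
+        long count = 0;
+        while (!stack.isEmpty()) {
+            long currentNodeId = stack.popLong();
+            count++;
+            LazyLongIterator it = g.successors(currentNodeId);
+            for (long neighborNodeId; (neighborNodeId = it.nextLong()) != -1;) {
+                if (!visited.getBoolean(neighborNodeId)) {
+                    visited.set(neighborNodeId);
+                    stack.push(neighborNodeId);
+                }
+            }
+        }
+        return count;
+    }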
+
+
+BigArrays
+~~~~~~~~~
+
+When working with the Software Heritage graph, it is often necessary to store
+large arrays of values, with a size exceeding 2^32 items. Unfortunately,
+standard Java arrays do not support this, as they are limited to 2^31 - 1
+elements.
+
+To circumvent this, WebGraph uses the `BigArrays scheme
+<https://fastutil.di.unimi.it/docs/it/unimi/dsi/fastutil/BigArrays.html>`_ from
+the fastutil library: "big arrays" are stored as arrays of arrays, supporting
+quadrillions of records.
+
+A BigArray ``long[][] a`` can be used with the following methods:
+
+- ``BigArrays.get(a, i)`` to get the value at index *i*.
+- ``BigArrays.set(a, i, v)`` to set the value at index *i* to *v*.
+- ``BigArrays.length(a)`` to get the total length of the bigarray.
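+
+For instance (a small, self-contained snippet; the allocation goes through the
+fastutil helper ``LongBigArrays.newBigArray``, and a deliberately modest size
+is used so that it runs anywhere):
+
+.. code-block:: java
+
+    // A big array may exceed the 2^31 - 1 length limit of standard Java
+    // arrays; a small size is used here for illustration only.
+    long size = 1_000_000L;
+    long[][] a = LongBigArrays.newBigArray(size);
+
+    BigArrays.set(a, size - 1, 42L);                 // write at index size - 1
+    System.out.println(BigArrays.get(a, size - 1));  // prints 42
+    System.out.println(BigArrays.length(a));         // prints 1000000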
diff --git a/docs/memory.rst b/docs/memory.rst
new file mode 100644
--- /dev/null
+++ b/docs/memory.rst
@@ -0,0 +1,130 @@
+.. _swh-graph-memory:
+
+Memory & Performance tuning
+===========================
+
+This page discusses various considerations related to memory usage and
+performance tuning when using the ``swh-graph`` library to load large
+compressed graphs.
+
+JVM options
+-----------
+
+In production, we tend to use very large servers with enough RAM to hold the
+entire graph. In these setups, the default JVM options are often suboptimal.
+We recommend starting the JVM with the following options, which tend to
+significantly improve performance::
+
+ java \
+ -ea \
+ -server \
+ -XX:PretenureSizeThreshold=512M \
+ -XX:MaxNewSize=4G \
+ -XX:+UseLargePages \
+ -XX:+UseTransparentHugePages \
+ -XX:+UseNUMA \
+ -XX:+UseTLAB \
+ -XX:+ResizeTLAB \
+
+These options are documented in the ``java(1)`` manual page and in the Oracle
+documentation.
+
+
+Temporary directory
+-------------------
+
+Many of the graph algorithms (either for compression or traversal) tend to
+offload some of their run-time memory to disk. For instance, the `BFS
+<https://law.di.unimi.it/software/law-docs/it/unimi/dsi/law/big/graph/BFS.html>`_
+algorithm in the LAW library uses a temporary directory to write its queue of
+nodes to visit.
+
+Because these temporary files can be quite large and sometimes overflow the
+default ``/tmp`` partition, it is advised to systematically specify a path to a
+local temporary directory with enough space to accommodate the needs of the
+Java programs. This can be done using the ``-Djava.io.tmpdir`` parameter on the
+Java CLI::
+
+ java -Djava.io.tmpdir=/srv/softwareheritage/ssd/tmp
+
+
+Memory mapping vs Direct loading
+--------------------------------
+
+The main dial you can use to manage your memory usage is to choose between
+memory-mapping and direct-loading the graph data. The different loading modes
+available when loading the graph are documented in :ref:`swh-graph-java-api`.
+
+Loading in mapped mode will not load any extra data in RAM, but will instead
+use the ``mmap(2)`` syscall to map the on-disk graph file into the virtual
+address space. The Linux kernel is then free to cache the file, either
+partially or in its entirety, depending on the available memory space.
+
+In our experiments, memory-mapping a small graph from an SSD only incurs a
+relatively small slowdown (about 15-20%). However, when the graph is too big to
+fit in RAM, the kernel has to constantly invalidate pages to cache newly
+accessed sections, which incurs a very large performance penalty. A full
+traversal of a large graph that usually takes about 20 hours when loaded in
+main memory could take more than a year when mapped from a hard drive!
+
+When deciding what to direct-load and what to memory-map, here are a few rules
+of thumb:
+
+- If you don't need random access to the graph edges, you can consider using
+ the "offline" loading mode. The offsets won't be loaded which will save
+ dozens of gigabytes of RAM.
+
+- If you only need to query some specific nodes or run trivial traversals,
+  memory-mapping the graph from an HDD should be a reasonable solution that
+ doesn't take an inordinate amount of time. It might be bad for your disks,
+ though.
+
+- If you are constrained in available RAM, memory-mapping the graph from an SSD
+ offers reasonable performance for reasonably complex algorithms.
+
+- If you have a heavy workload (e.g. running a full traversal of the entire
+ graph) and you can afford the RAM, direct loading will be orders of magnitude
+ faster than all the above options.
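+
+To make these trade-offs concrete, here is a short sketch contrasting the two
+loading styles. ``loadMapped`` is the method used throughout the examples of
+:ref:`swh-graph-java-api`; the direct-loading counterpart is assumed here to
+be named ``load``, so refer to that page for the exact list of available
+loading methods:
+
+.. code-block:: java
+
+    String basename = "compressed/graph";  // path prefix of the graph files
+
+    // Memory-mapped: almost no RAM used up front; performance depends on the
+    // kernel page cache and on the underlying storage (SSD vs. HDD).
+    SwhUnidirectionalGraph mapped = SwhUnidirectionalGraph.loadMapped(basename);
+
+    // Direct loading (assumed method name): the whole graph is read into RAM
+    // before returning; much faster for heavy workloads, at the cost of RAM.
+    SwhUnidirectionalGraph direct = SwhUnidirectionalGraph.load(basename);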
+
+
+Sharing mapped data across processes
+------------------------------------
+
+Often, multiple processes can be working on the same data (mappings or the
+graph itself), for instance when running different experiments on the same
+graph. This is problematic in terms of RAM usage when using direct memory
+loading, as the same data, potentially hundreds of gigabytes, is loaded in
+memory twice.
+As we have seen, memory-mapping can be used to avoid storing redundant data in
+RAM, but comes at the cost of potentially slower I/O as the data is no longer
+guaranteed to be loaded in main memory and is reliant on kernel heuristics.
+
+To efficiently share data across two different compressed graph processes,
+another option is to copy graph data to a ``tmpfs`` not backed by a disk swap,
+which forces the kernel to load it entirely in RAM. Subsequent memory-mappings
+of the files stored in the tmpfs will simply map the data stored in RAM to
+virtual memory pages, and return a pointer to the in-memory structure.
+
+To do so, we create a directory in ``/dev/shm`` in which we **copy** all the
+files that we want to direct-load in RAM, and **symlink** all the others. Then,
+we load the graph using the memory-mapped loading mode, which makes it use the
+shared memory stored in the tmpfs under the hood.
+
+Here is a systemd service that can be used to perform this task automatically:
+
+.. code-block:: ini
+
+ [Unit]
+ Description=swh-graph memory sharing in tmpfs
+
+ [Service]
+ Type=oneshot
+ RemainAfterExit=yes
+ ExecStart=mkdir -p /dev/shm/swh-graph/default
+ ExecStart=sh -c "ln -s /.../compressed/* /dev/shm/swh-graph/default"
+ ExecStart=cp --remove-destination /.../compressed/graph.graph /dev/shm/swh-graph/default
+ ExecStart=cp --remove-destination /.../compressed/graph-transposed.graph /dev/shm/swh-graph/default
+ ExecStop=rm -rf /dev/shm/swh-graph/default
+
+ [Install]
+ WantedBy=multi-user.target
diff --git a/docs/quickstart.rst b/docs/quickstart.rst
--- a/docs/quickstart.rst
+++ b/docs/quickstart.rst
@@ -1,51 +1,38 @@
+.. _swh-graph-quickstart:
+
Quickstart
==========
-This quick tutorial shows how to compress and browse a graph using ``swh.graph``.
-
-It does not cover the technical details behind the graph compression techniques
-(refer to :ref:`graph-compression`).
-
+This quick tutorial shows how to start the ``swh.graph`` service to query
+an existing compressed graph with the high-level HTTP API.
Dependencies
------------
In order to run the ``swh.graph`` tool, you will need Python (>= 3.7) and Java
-JRE, you do not need the JDK if you install the package from pypi, but may want
-to install it if you want to hack the code or install it from this git
-repository. To compress a graph, you will need zstd_ compression tools.
-
-It is highly recommended to install this package in a virtualenv.
-
-On a Debian stable (buster) system:
-
-.. code:: bash
-
- $ sudo apt install python3-virtualenv default-jre zstd
+JRE. On a Debian system:
+.. code:: console
-.. _zstd: https://facebook.github.io/zstd/
+ $ sudo apt install python3 python3-venv default-jre
-
-Install
--------
+Installing swh.graph
+--------------------
Create a virtualenv and activate it:
-.. code:: bash
+.. code:: console
- ~/tmp$ mkdir swh-graph-tests
- ~/tmp$ cd swh-graph-tests
- ~/t/swh-graph-tests$ virtualenv swhenv
- ~/t/swh-graph-tests$ . swhenv/bin/activate
+ $ python3 -m venv .venv
+ $ source .venv/bin/activate
Install the ``swh.graph`` python package:
-.. code:: bash
+.. code:: console
- (swhenv) ~/t/swh-graph-tests$ pip install swh.graph
+ (venv) $ pip install swh.graph
[...]
- (swhenv) ~/t/swh-graph-tests swh graph --help
+ (venv) $ swh graph --help
Usage: swh graph [OPTIONS] COMMAND [ARGS]...
Software Heritage graph tools.
@@ -55,106 +42,77 @@
-h, --help Show this message and exit.
Commands:
- api-client client for the graph RPC service
- cachemount Cache the mmapped files of the compressed graph in a tmpfs.
compress Compress a graph using WebGraph Input: a pair of files...
- map Manage swh-graph on-disk maps
rpc-serve run the graph RPC service
-Compression
------------
-
-Existing datasets
-^^^^^^^^^^^^^^^^^
-
-You can directly use compressed graph datasets provided by Software Heritage.
-Here is a small and realistic dataset (3.1GB):
-
- https://annex.softwareheritage.org/public/dataset/graph/latest/popular-3k-python/python3kcompress.tar
-
-.. code:: bash
- (swhenv) ~/t/swh-graph-tests$ curl -O https://annex.softwareheritage.org/public/dataset/graph/latest/popular-3k-python/python3kcompress.tar
- (swhenv) ~/t/swh-graph-tests$ tar xvf python3kcompress.tar
- (swhenv) ~/t/swh-graph-tests$ touch python3kcompress/*.obl # fix the mtime of cached offset files to allow faster loading
+.. _swh-graph-retrieving-compressed:
-Note: not for the faint heart, but the full dataset is available at:
+Retrieving a compressed graph
+-----------------------------
- https://annex.softwareheritage.org/public/dataset/graph/latest/compressed/
+Software Heritage provides a list of off-the-shelf datasets that can be used
+for various research or prototyping purposes. Most of them are available in
+*compressed* representation, i.e., in a format suitable to be loaded and
+queried by the ``swh-graph`` library.
-Own datasets
-^^^^^^^^^^^^
+All the publicly available datasets are documented on this page:
+https://docs.softwareheritage.org/devel/swh-dataset/graph/dataset.html
-A graph is described as both its adjacency list and the set of nodes
-identifiers in plain text format. Such graph example can be found in the
-``swh/graph/tests/dataset/`` folder.
+A good way of retrieving these datasets is to use the `AWS S3 CLI
+<https://docs.aws.amazon.com/cli/latest/reference/s3/>`_.
-You can compress the example graph on the command line like this:
+Here is an example with the dataset ``2021-03-23-popular-3k-python``, which has
+a relatively reasonable size (~15 GiB including property data, with
+the compressed graph itself being less than 700 MiB):
-.. code:: bash
+.. code:: console
+ (venv) $ pip install awscli
+ [...]
+ (venv) $ mkdir -p 2021-03-23-popular-3k-python/compressed
+ (venv) $ cd 2021-03-23-popular-3k-python/
+ (venv) $ aws s3 cp --recursive s3://softwareheritage/graph/2021-03-23-popular-3k-python/compressed/ compressed
- (swhenv) ~/t/swh-graph-tests$ swh graph compress --graph swh/graph/tests/dataset/example --outdir output/
- [...]
+You can also retrieve larger graphs, but note that these graphs are generally
+intended to be loaded fully in RAM, and do not fit on ordinary desktop
+machines. The server we use in production to run the graph service has more
+than 700 GiB of RAM. These memory considerations are discussed in more detail
+in :ref:`swh-graph-memory`.
- (swhenv) ~/t/swh-graph-tests$ ls output/
- example-bv.properties example.mph example.obl example.outdegree example.swhid2node.bin example-transposed.offsets
- example.graph example.node2swhid.bin example.offsets example.properties example-transposed.graph example-transposed.properties
- example.indegree example.node2type.map example.order example.stats example-transposed.obl
+**Note:** for testing purposes, a fake test dataset is available in the
+``swh-graph`` repository, with just a few dozen nodes. Its basename is
+``swh-graph/swh/graph/tests/dataset/compressed/example``.
API server
----------
-To start a ``swh.graph`` API server of a compressed graph dataset, run:
+To start a ``swh.graph`` API server of a compressed graph dataset, you need to
+use the ``rpc-serve`` command with the basename of the graph, which is the path prefix
+of all the graph files (e.g., with the basename ``compressed/graph``, it will
+attempt to load the files located at
+``compressed/graph.{graph,properties,offsets,...}``).
-.. code:: bash
+In our example:
+
+.. code:: console
- (swhenv) ~/t/swh-graph-tests$ swh graph rpc-serve -g output/example
- Loading graph output/example ...
+ (venv) $ swh graph rpc-serve -g compressed/graph
+ Loading graph compressed/graph ...
Graph loaded.
======== Running on http://0.0.0.0:5009 ========
(Press CTRL+C to quit)
From there you can use this endpoint to query the compressed graph, for example
-with httpie_ (``sudo apt install``) from another terminal:
+with httpie_ (``sudo apt install httpie``):
.. _httpie: https://httpie.org
.. code:: bash
- ~/tmp$ http :5009/graph/visit/nodes/swh:1:rel:0000000000000000000000000000000000000010
- HTTP/1.1 200 OK
- Content-Type: text/plain
- Date: Tue, 15 Sep 2020 08:33:25 GMT
- Server: Python/3.8 aiohttp/3.6.2
- Transfer-Encoding: chunked
-
- swh:1:rel:0000000000000000000000000000000000000010
- swh:1:rev:0000000000000000000000000000000000000009
- swh:1:rev:0000000000000000000000000000000000000003
- swh:1:dir:0000000000000000000000000000000000000002
- swh:1:cnt:0000000000000000000000000000000000000001
- swh:1:dir:0000000000000000000000000000000000000008
- swh:1:dir:0000000000000000000000000000000000000006
- swh:1:cnt:0000000000000000000000000000000000000004
- swh:1:cnt:0000000000000000000000000000000000000005
- swh:1:cnt:0000000000000000000000000000000000000007
-
-
-Running the existing ``python3kcompress`` dataset:
-
-.. code:: bash
-
- (swhenv) ~/t/swh-graph-tests$ swh graph rpc-serve -g python3kcompress/python3k
- Loading graph python3kcompress/python3k ...
- Graph loaded.
- ======== Running on http://0.0.0.0:5009 ========
- (Press CTRL+C to quit)
-
-
~/tmp$ http :5009/graph/leaves/swh:1:dir:432d1b21c1256f7408a07c577b6974bbdbcc1323
HTTP/1.1 200 OK
Content-Type: text/plain
@@ -170,5 +128,5 @@
swh:1:cnt:a6b60e797063fef707bbaa4f90cfb4a2cbbddd4a
swh:1:cnt:cc0a1deca559c1dd2240c08156d31cde1d8ed406
-
-See the documentation of the :ref:`API <swh-graph-api>` for more details.
+See the documentation of the :ref:`API <swh-graph-api>` for more details on how
+to use the HTTP graph querying API.
diff --git a/docs/use-cases.rst b/docs/use-cases.rst
deleted file mode 100644
--- a/docs/use-cases.rst
+++ /dev/null
@@ -1,167 +0,0 @@
-=========
-Use cases
-=========
-
-
-This document lists use cases and benchmark scenarios for the Software Heritage
-graph service.
-
-
-Conventions
-===========
-
-- **Node identification**: in the following, nodes are always identified by
- their :ref:`SWHIDs <persistent-identifiers>`.
-
-
-Use cases
-=========
-
-
-Browsing
---------
-
-The following use cases require traversing the *forward graph*.
-
-- **ls**: given a directory node, list (non recursively) all linked nodes of
- type directory and content
-
- Implementation::
-
- /graph/neighbors/:DIR_ID?edges=dir:cnt,dir:dir
-
-- **ls -R**: given a directory node, recursively list all linked nodes of type
- directory and content
-
- Implementation::
-
- /graph/visit/paths/:DIR_ID?edges=dir:cnt,dir:dir
-
-- **git log**: given a revision node, recursively list all linked nodes of type
- revision
-
- Implementation::
-
- /graph/visit/nodes/:REV_ID?edges=rev:rev
-
-
-Vault
------
-
-The following use cases require traversing the *forward graph*.
-
-- **tarball** (same as *ls -R* above)
-
-- **git bundle**: given a node, recursively list all linked nodes of any kind
-
- Implementation::
-
- /graph/visit/nodes/:NODE_ID?edges=*
-
-
-Provenance
-----------
-
-The following use cases require traversing the *backward (transposed)
-graph*.
-
-- **commit provenance**: given a content or directory node, return *a* commit
- whose directory (recursively) contains it
-
- Implementation::
-
- /graph/walk/:NODE_ID/rev?direction=backward&edges=dir:dir,cnt:dir,dir:rev
-
-- **complete commit provenance**: given a content or directory node, return
- *all* commits whose directory (recursively) contains it
-
- Implementation::
-
- /graph/leaves/:NODE_ID?direction=backward&edges=dir:dir,cnt:dir,dir:rev
-
-- **origin provenance**: given a content, directory, or commit node, return
- *an* origin that has at least one snapshot that (recursively) contains it
-
- Implementation::
-
- /graph/walk/:NODE_ID/ori?direction=backward&edges=*
-
-- **complete origin provenance**: given a content, directory, or commit node,
- return *all* origins that have at least one snapshot that (recursively)
- contains it
-
- Implementation::
-
- /graph/leaves/:NODE_ID?direction=backward&edges=*
-
-- *SLOC tracking*: left as future work
-
-
-Provenance statistics
-~~~~~~~~~~~~~~~~~~~~~
-
-The following use cases require traversing the *backward (transposed)
-graph*.
-
-- **content popularity across commits**: for each content, count the number of
- commits (or *commit popularity*) that link to a directory that (recursively)
- includes it. Plot the distribution of content popularity across commits.
-
- Implementation: apply *complete commit provenance* to each content node,
- count the returned commits, aggregate.
-
-- **commit popularity across origins**: for each commit, count the number of
- origins (or *origin popularity*) that have a snapshot that (recursively)
- includes it. Plot the distribution of commit popularity across origins.
-
- Implementation: apply *complete origin provenance* to each commit node, count
- the returned origins, aggregate.
-
-- *SLOC popularity across contents*: left as future work
-
-The following use cases require traversing the *forward graph*.
-
-- **revision size** (as n. of contents) distribution: for each revision, count
- the number of contents that are (recursively) reachable from it. Plot the
- distribution of revision sizes.
-
-- **origin size** (as n. of revisions) distribution: for each origin, count the
- number of revisions that are (recursively) reachable from it. Plot the
- distribution of origin sizes.
-
-
-Benchmarks
-==========
-
-Notes on how to benchmark graph access:
-
-- separate pure-graph timings from timings related to use additional mappings
- (e.g., node types), no matter if the mappings are in-memory or on-disk
-
-- separate in-memory timings from on-disk timings; in particular, separate the
- timing of translating node identifiers between internal integers and SWHIDs
-
-- for each use case that requires a node as input, we will randomize the choice
- of the input node and repeat the experiment a suitable number of times; where
- possible we will aggregate results computing basic statistics (average,
- standard deviation), as well as normalize results w.r.t. the “size” of the
- chosen node (e.g., number of nodes/path length in the resulting visit)
-
-
-Basic benchmarks
-----------------
-
-- **Edge traversal**: given a node, retrieve the first node in its adjacency
- list.
-
- For reference: Apostolico, Drovandi in *Graph Compression by BFS* report
- times to retrieve the adjacency list of a node (and/or test if an edge exists
- between two nodes) in the 2-3 us range, for the largest graph in their
- experiments (22 M nodes, 600 M edges).
-
-
-Each use case is a benchmark
-----------------------------
-
-In addition to abstract benchmark, we will use each use case above as a
-scenario-based benchmark.
diff --git a/java/src/main/java/org/softwareheritage/graph/compress/NodeMapBuilder.java b/java/src/main/java/org/softwareheritage/graph/compress/NodeMapBuilder.java
--- a/java/src/main/java/org/softwareheritage/graph/compress/NodeMapBuilder.java
+++ b/java/src/main/java/org/softwareheritage/graph/compress/NodeMapBuilder.java
@@ -27,8 +27,7 @@
* Create maps needed at runtime by the graph service, in particular:
* <p>
* <ul>
- * <li>SWHID → WebGraph long node id</li>
- * <li>WebGraph long node id → SWHID (converse of the former)</li>
+ * <li>WebGraph long node id → SWHID</li>
* <li>WebGraph long node id → SWH node type (enum)</li>
* </ul>
*
diff --git a/swh/graph/webgraph.py b/swh/graph/webgraph.py
--- a/swh/graph/webgraph.py
+++ b/swh/graph/webgraph.py
@@ -179,6 +179,14 @@
"{tmp_dir}",
"< {out_dir}/{graph_name}.nodes.csv.zst",
],
+ CompressionStep.EXTRACT_PERSONS: [
+ "{java}",
+ "org.softwareheritage.graph.compress.ExtractPersons",
+ "--temp-dir",
+ "{tmp_dir}",
+ "{in_dir}",
+ "{out_dir}/{graph_name}",
+ ],
CompressionStep.MPH_PERSONS: [
"{java}",
"it.unimi.dsi.sux4j.mph.GOVMinimalPerfectHashFunction",
@@ -190,14 +198,6 @@
"{out_dir}/{graph_name}.persons.mph",
"{out_dir}/{graph_name}.persons.csv.zst",
],
- CompressionStep.EXTRACT_PERSONS: [
- "{java}",
- "org.softwareheritage.graph.compress.ExtractPersons",
- "--temp-dir",
- "{tmp_dir}",
- "{in_dir}",
- "{out_dir}/{graph_name}",
- ],
CompressionStep.NODE_PROPERTIES: [
"{java}",
"org.softwareheritage.graph.compress.WriteNodeProperties",