diff --git a/docs/api.rst b/docs/api.rst
--- a/docs/api.rst
+++ b/docs/api.rst
@@ -1,7 +1,16 @@
.. _swh-graph-api:
-Graph RPC API
-=============
+Graph Querying HTTP API
+=======================
+
+The Graph Querying API is a high-level HTTP API intended to run common,
+relatively simple traversal queries on the compressed graph.
+
+The client/server architecture allows it to only load the graph in memory once
+then serve multiple different requests. However, it is limited in expressivity;
+more complex or resource-intensive queries should rather use the
+:ref:`Low-level Java API <swh-graph-java-api>` to run them as standalone
+programs.
Terminology
@@ -53,8 +62,11 @@
- ``"cnt,snp"`` accepted node types returned in the query results.
+Endpoints
+---------
+
Leaves
-------
+~~~~~~
.. http:get:: /graph/leaves/:src
@@ -97,7 +109,7 @@
Neighbors
----------
+~~~~~~~~~
.. http:get:: /graph/neighbors/:src
@@ -138,7 +150,7 @@
Walk
-----
+~~~~
..
.. http:get:: /graph/walk/:src/:dst
@@ -246,7 +258,7 @@
Visit
------
+~~~~~
.. http:get:: /graph/visit/nodes/:src
.. http:get:: /graph/visit/edges/:src
@@ -340,7 +352,7 @@
Counting results
-----------------
+~~~~~~~~~~~~~~~~
The following method variants, with trailing `/count` added, behave like their
already discussed counterparts but, instead of returning results, return the
@@ -363,7 +375,7 @@
Stats
------
+~~~~~
.. http:get:: /graph/stats
@@ -405,3 +417,125 @@
"avg": 0.6107127825377487
}
}
+
+
+Use-case examples
+-----------------
+
+This section showcases how to leverage the endpoints of the HTTP API described
+above for some common use-cases.
+
+
+Browsing
+~~~~~~~~
+
+The following use cases require traversing the *forward graph*.
+
+- **ls**: given a directory node, list (non recursively) all linked nodes of
+ type directory and content
+
+ Endpoint::
+
+ /graph/neighbors/:DIR_ID?edges=dir:cnt,dir:dir
+
+- **ls -R**: given a directory node, recursively list all linked nodes of type
+ directory and content
+
+ Endpoint::
+
+ /graph/visit/paths/:DIR_ID?edges=dir:cnt,dir:dir
+
+- **git log**: given a revision node, recursively list all linked nodes of type
+ revision
+
+ Endpoint::
+
+ /graph/visit/nodes/:REV_ID?edges=rev:rev
+
+
+Vault
+~~~~~
+
+The following use cases require traversing the *forward graph*.
+
+- **tarball** (same as *ls -R* above)
+
+- **git bundle**: given a node, recursively list all linked nodes of any kind
+
+ Endpoint::
+
+ /graph/visit/nodes/:NODE_ID?edges=*
+
+
+Provenance
+~~~~~~~~~~
+
+The following use cases require traversing the *backward (transposed)
+graph*.
+
+- **commit provenance**: given a content or directory node, return *a* commit
+ whose directory (recursively) contains it
+
+ Endpoint::
+
+ /graph/walk/:NODE_ID/rev?direction=backward&edges=dir:dir,cnt:dir,dir:rev
+
+- **complete commit provenance**: given a content or directory node, return
+ *all* commits whose directory (recursively) contains it
+
+ Endpoint::
+
+ /graph/leaves/:NODE_ID?direction=backward&edges=dir:dir,cnt:dir,dir:rev
+
+- **origin provenance**: given a content, directory, or commit node, return
+ *an* origin that has at least one snapshot that (recursively) contains it
+
+ Endpoint::
+
+ /graph/walk/:NODE_ID/ori?direction=backward&edges=*
+
+- **complete origin provenance**: given a content, directory, or commit node,
+ return *all* origins that have at least one snapshot that (recursively)
+ contains it
+
+ Endpoint::
+
+ /graph/leaves/:NODE_ID?direction=backward&edges=*
+
+
+Provenance statistics
+~~~~~~~~~~~~~~~~~~~~~
+
+The following use cases require traversing the *backward (transposed)
+graph*.
+
+- **content popularity across commits**: count the number of commits (or
+ *commit popularity*) that link to a directory that (recursively) includes a
+ given content.
+
+ Endpoint::
+
+ /graph/count/leaves/:NODE_ID?direction=backward&edges=cnt:dir,dir:dir,dir:rev
+
+- **commit popularity across origins**: count the number of origins (or *origin
+ popularity*) that have a snapshot that (recursively) includes a given commit.
+
+ Endpoint::
+
+ /graph/count/leaves/:NODE_ID?direction=backward&edges=*
+
+The following use cases require traversing the *forward graph*.
+
+- **revision size** (as n. of contents) distribution: the number of contents
+ that are (recursively) reachable from a given revision.
+
+ Endpoint::
+
+ /graph/count/leaves/:NODE_ID?edges=*
+
+- **origin size** (as n. of revisions) distribution: count the number of
+ revisions that are (recursively) reachable from a given origin.
+
+ Endpoint::
+
+ /graph/count/leaves/:NODE_ID?edges=ori:snp,snp:rel,snp:rev,rel:rev,rev:rev
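+
+
+Example: querying an endpoint
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+All the endpoints above are plain HTTP GET requests and can be queried with any
+HTTP client. As a minimal sketch (assuming a graph server listening on
+``http://localhost:5009/``, to be adjusted to your deployment, and a
+placeholder directory SWHID), the *ls* query could be issued from Java as
+follows:
+
+.. code-block:: java
+
+    import java.net.URI;
+    import java.net.http.HttpClient;
+    import java.net.http.HttpRequest;
+    import java.net.http.HttpResponse;
+
+    public class GraphApiSketch {
+        public static void main(String[] args) throws Exception {
+            String swhid = "swh:1:dir:0000000000000000000000000000000000000000";
+            HttpRequest request = HttpRequest.newBuilder()
+                    .uri(URI.create("http://localhost:5009/graph/neighbors/" + swhid
+                            + "?edges=dir:cnt,dir:dir"))
+                    .GET()
+                    .build();
+            // Print the raw response body returned by the server.
+            HttpResponse<String> response = HttpClient.newHttpClient()
+                    .send(request, HttpResponse.BodyHandlers.ofString());
+            System.out.print(response.body());
+        }
+    }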
diff --git a/docs/compression.rst b/docs/compression.rst
--- a/docs/compression.rst
+++ b/docs/compression.rst
@@ -1,113 +1,391 @@
.. _graph-compression:
+=================
Graph compression
=================
-The compression process is a pipeline implemented for the most part on top of
-the `WebGraph framework <http://webgraph.di.unimi.it/>`_ and ecosystem
-libraries. The compression pipeline consists of the following steps:
+The compression pipeline is implemented on top of the `WebGraph framework
+<http://webgraph.di.unimi.it/>`_. It takes an ORC Graph Dataset as an input,
+such as the ones found in the :ref:`Graph Dataset List <swh-dataset-list>`,
+and generates a compressed graph suitable for high intensity analyses on
+large servers.
-.. figure:: images/compression_steps.png
- :align: center
- :alt: Compression steps
- Compression steps
+Running the compression pipeline
+================================
+
+Dependencies
+------------
+
+To compress a graph, you will need to install the ``swh.graph`` tool as well as
+a recent JRE, as described in the :ref:`swh-graph-quickstart` page.
+
+You will also need the zstd_ compression tool::
+
+ $ sudo apt install zstd
+
+.. _zstd: https://facebook.github.io/zstd/
+
+
+Hardware Requirements
+---------------------
+
+The compression pipeline is even more demanding than the graph server in terms
+of hardware requirements, especially RAM. Notably, the BFS compression step
+loads a graph compressed in random order in memory, which is usually more than
+a TiB for the full graph. While it is possible to do this step with a memory
+mapping, our experiments show that this could take a very long time (several
+months) on hard drives.
+
+The LLP compression step requires 13 bytes of RAM per node, which could amount
+to storing hundreds of gigabytes in RAM in addition to loading the graph
+itself.
+
+Some steps also involve sorting the entire set of edges and their labels,
+using large on-disk buffer files that can sometimes reach the size of the
+input dataset itself.
+
+The machine we used to compress the entire graph (dataset version 2022-04-25)
+has the following hardware specs:
+
+- 2 TiB of RAM (DDR4 ECC 2400Mhz)
+- 64 vCPUs (Dual AMD EPYC 7302 16-Core)
+- 24 TiB of SSD (NVMe)
+
+The server we rented is from the
+`HGR-HCI-4 <https://www.ovhcloud.com/en/bare-metal/high-grade/hgr-hci-4/>`_
+series from OVH.
+
+
+Input dataset
+-------------
+
+First, you need to retrieve a graph to compress, in ORC format. The :ref:`Graph
+Dataset List <swh-dataset-list>` has a list of datasets made available by the
+Software Heritage archive, including "teaser" subdatasets which have a more
+manageable size and are thus very useful for prototyping with less hardware
+resources.
+
+The datasets can be retrieved from S3 or the annex, in a similar fashion to
+what is described in :ref:`swh-graph-retrieving-compressed`, by simply
+replacing "compressed" by "orc":
+
+.. code:: console
+
+ (venv) $ mkdir -p 2021-03-23-popular-3k-python/orc
+ (venv) $ cd 2021-03-23-popular-3k-python/
+ (venv) $ aws s3 cp --recursive s3://softwareheritage/graph/2021-03-23-popular-3k-python/orc/ orc
+
+Alternatively, any custom ORC dataset can be used as long as it respects
+:ref:`the schema <swh-dataset-schema>` of the Software Heritage Graph Dataset.
+
+**Note:** for testing purposes, a fake test dataset is available in the
+``swh-graph`` repository, with just a few dozen nodes. The ORC tables are
+available in ``swh-graph/swh/graph/tests/dataset/orc/``.
+
+
+Compression
+-----------
+
+You can compress your dataset by using the ``swh graph compress`` command. It
+will run all the various steps of the pipeline in the right order.
+
+.. code:: console
-Each of these steps is briefly described below. For more details see the
-following paper:
-.. note::
+ (venv) $ swh graph compress --input-dataset orc/ --outdir compressed/
+ [...]
+ (venv) $ ls compressed/
+ graph.edges.count.txt
+ graph.edges.stats.txt
+ graph.graph
+ graph.indegree
+ graph-labelled.labeloffsets
+ graph-labelled.labels
+ [...]
+ graph-transposed.obl
+ graph-transposed.offsets
+ graph-transposed.properties
- Paolo Boldi, Antoine Pietri, Sebastiano Vigna, Stefano Zacchiroli.
- `Ultra-Large-Scale Repository Analysis via Graph Compression
- <https://upsilon.cc/~zack/research/publications/saner-2020-swh-graph.pdf>`_. In
- proceedings of `SANER 2020 <https://saner2020.csd.uwo.ca/>`_: The 27th IEEE
- International Conference on Software Analysis, Evolution and
- Reengineering. IEEE 2020.
- Links: `preprint
- <https://upsilon.cc/~zack/research/publications/saner-2020-swh-graph.pdf>`_,
- `bibtex
- <https://upsilon.cc/~zack/research/publications/saner-2020-swh-graph.bib>`_.
+(The purpose of each of these files is detailed in the
+:ref:`swh-graph-java-api` page.)
-In order to practically perform graph compression, install the ``swh.graph``
-module and use the ``swh graph compress`` command line interface of the
-compression driver, that will conduct the various steps in the right order.
-See ``swh graph compress --help`` for usage details.
+For sufficiently large graphs, this command can take several weeks. It is highly
+recommended to run it in a systemd service or in a tmux session.
+It is also possible to run single steps or step ranges from the CLI:
-1. MPH
+.. code:: bash
+
+ swh graph compress -i orc/ -o compressed/ --steps mph-bfs
+
+See ``swh graph compress --help`` for syntax and usage details.
+
+
+Compression steps
+=================
+
+The compression pipeline consists of the following steps:
+
+.. figure:: images/compression_steps.png
+ :align: center
+ :alt: Compression steps
+ :target: _images/compression_steps.png
+
+ Compression steps
+
+Each of these steps is briefly described below. For more details see the
+original Software Heritage graph compression paper [SWHGraphCompression2020]_,
+as well as chapters 9 and 10 of Antoine Pietri's PhD thesis
+[PietriThesis2021]_.
+
+.. [SWHGraphCompression2020]
+ | Paolo Boldi, Antoine Pietri, Sebastiano Vigna, Stefano Zacchiroli.
+ | `Ultra-Large-Scale Repository Analysis via Graph Compression
+ <https://upsilon.cc/~zack/research/publications/saner-2020-swh-graph.pdf>`_.
+ | In proceedings of `SANER 2020 <https://saner2020.csd.uwo.ca/>`_: The 27th
+ IEEE International Conference on Software Analysis, Evolution and
+ Reengineering. IEEE 2020.
+ | Links: `preprint
+ <https://upsilon.cc/~zack/research/publications/saner-2020-swh-graph.pdf>`_,
+ `bibtex
+ <https://upsilon.cc/~zack/research/publications/saner-2020-swh-graph.bib>`_.
+
+
+
+.. [PietriThesis2021]
+ | Antoine Pietri
+ | `Organizing the graph of public software development for large-scale mining
+ <https://hal.archives-ouvertes.fr/tel-03515795/>`_.
+ | Doctoral dissertation. Inria, 2021.
+
+
+1. EXTRACT_NODES
+----------------
+
+This step reads a graph dataset and extracts all the unique node SWHIDs it
+contains, including the ones that are not stored as actual objects in the
+graph, but only *referred to* by the edges. Additionally, it extracts the set
+of all unique edge labels in the graph.
+
+**Rationale:** Because the graph can contain holes, loose objects and dangling
+objects, some nodes that are referred to as destinations in the edge
+relationships might not actually be stored in the graph itself. However, to
+compress the graph using a graph compression library, it is necessary to have a
+list of *all* the nodes in the graph, including the ones that are simply
+referred to by the edges but not actually stored as concrete objects.
+
+This step reads the entire graph dataset, and uses ``sort -u`` to extract the
+set of all the unique nodes and unique labels that will be needed as an input
+for the compression process. It also writes object count statistics in various
+files:
+
+- The set of nodes is written in ``graph.nodes.csv.zst``, as a zst-compressed
+ sorted list of SWHIDs, one per line.
+- The set of edge labels is written in ``graph.labels.csv.zst``, as a
+ zst-compressed sorted list of labels encoded in base64, one per line.
+- The number of unique nodes referred to in the graph is written in a text
+ file, ``graph.nodes.count.txt``
+- The number of unique edges referred to in the graph is written in a text
+ file, ``graph.edges.count.txt``
+- The number of unique edge labels is written in a text file,
+ ``graph.labels.count.txt``
+- Statistics on the number of nodes of each type are written in a text file,
+ ``graph.nodes.stats.txt``
+- Statistics on the number of edges of each type are written in a text file,
+ ``graph.edges.stats.txt``
+
+
+2. MPH
------
-A node in the Software Heritage :ref:`data model <data-model>` is identified
-using its SWHID (see :ref:`persistent identifiers
-<persistent-identifiers>`). However, WebGraph internally uses integers to refer
-to node ids.
+As discussed in :ref:`swh-graph-java-basics`, a node in the Software Heritage
+:ref:`data model <data-model>` is identified by its SWHID (see :ref:`persistent
+identifiers <persistent-identifiers>`), but WebGraph internally uses integers
+to refer to node ids.
-Mapping between the strings and longs ids is needed before compressing the
-graph. From the `Sux4J <http://sux.di.unimi.it/>`_ utility tool, we use the
+To create a mapping between integer node IDs and SWHIDs, we use the
`GOVMinimalPerfectHashFunction
<http://sux.di.unimi.it/docs/it/unimi/dsi/sux4j/mph/GOVMinimalPerfectHashFunction.html>`_
-class, mapping with no collisions N keys to N consecutive integers.
+class of the `Sux4J <http://sux.di.unimi.it/>`_ library, which maps N keys to N
+consecutive integers.
-The step produces a ``.mph`` file (MPH stands for *Minimal Perfect
-Hash-function*) storing the hash function taking as input a string and returning
-a unique integer.
+We run this function on the list of SWHIDs stored in the
+``graph.nodes.csv.zst`` file generated in the previous step.
+This allows us to generate a bijection from the set of all the *n* SWHIDs in the
+graph to the set of integers :math:`[0, n - 1]`.
+The step produces a ``graph.mph`` file (MPH stands for *Minimal Perfect
+Hash-function*), containing a function which takes a SWHID (as a bytestring)
+and returns its associated node ID.
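+
+As a rough illustration, the resulting function could be used from Java along
+these lines (a minimal sketch: it assumes the ``graph.mph`` file was written
+with standard Java serialization and that keys are SWHID bytestrings; the
+exact key encoding used by the pipeline may differ):
+
+.. code-block:: java
+
+    import it.unimi.dsi.fastutil.io.BinIO;
+    import it.unimi.dsi.sux4j.mph.GOVMinimalPerfectHashFunction;
+
+    import java.nio.charset.StandardCharsets;
+
+    public class MphSketch {
+        public static void main(String[] args) throws Exception {
+            // Load the serialized minimal perfect hash function (assumed to be
+            // BinIO-serialized).
+            @SuppressWarnings("unchecked")
+            GOVMinimalPerfectHashFunction<byte[]> mph =
+                    (GOVMinimalPerfectHashFunction<byte[]>) BinIO.loadObject("graph.mph");
+
+            // Hash a (placeholder) SWHID bytestring to its integer in [0, n - 1].
+            byte[] key = "swh:1:rev:0000000000000000000000000000000000000000"
+                    .getBytes(StandardCharsets.US_ASCII);
+            System.out.println(mph.getLong(key));
+        }
+    }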
-2. BV compress
+
+3. BV compress
--------------
-This is the first actual compression step, building a compressed version of the
-input graph using WebGraph techniques presented in the framework paper. We use
-the `ScatteredArcsASCIIGraph
+This is the first actual compression step, where we build a compressed version
+of the input graph dataset.
+
+We use a ScatteredArcsORCGraph to load the dataset
+(an implementation inspired by the `ScatteredArcsASCIIGraph
<http://webgraph.di.unimi.it/docs-big/it/unimi/dsi/big/webgraph/ScatteredArcsASCIIGraph.html>`_
-class, from WebGraph.
+class in WebGraph).
+This class wraps the ORC Graph dataset and exposes a *virtual* ImmutableGraph,
+whose nodes and edges can be iterated sequentially as if it was any other
+standard graph. To do so, it puts all the edges in batches and sorts them in an
+aggressively parallel fashion, then stores them as ``.bitstream`` files, and
+returns a `BatchGraph
+<https://webgraph.di.unimi.it/docs-big/it/unimi/dsi/big/webgraph/Transform.BatchGraph.html>`_
+created from these batches.
+
+Finally, it uses the ``BVGraph.store()`` method, which compresses the input
+graph as a `BVGraph
+<https://webgraph.di.unimi.it/docs-big/it/unimi/dsi/big/webgraph/BVGraph.html>`_,
+using the compression techniques described in the article *The WebGraph
+Framework I: Compression Techniques*.
The resulting BV graph is stored as a set of files:
-- ``.graph``: the compressed graph in the BV format
-- ``.offsets``: offsets values to read the bit stream graph file
-- ``.obl``: offsets cache to load the graph faster
-- ``.properties``: entries used to correctly decode graph and offset files
+- ``graph-base.graph``: the compressed graph in the BV format
+- ``graph-base.offsets``: offsets values to read the bit stream graph file
+- ``graph-base.properties``: entries used to correctly decode graph and offset
+ files
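+
+For reference, ``BVGraph.store()`` itself can be applied to any ImmutableGraph
+(a minimal sketch with placeholder basenames; the actual pipeline wraps the ORC
+dataset in a ScatteredArcsORCGraph before this call, as described above):
+
+.. code-block:: java
+
+    import it.unimi.dsi.big.webgraph.BVGraph;
+    import it.unimi.dsi.big.webgraph.ImmutableGraph;
+
+    public class BvStoreSketch {
+        public static void main(String[] args) throws Exception {
+            // Sequentially iterate over an existing graph and re-compress it in
+            // the BV format; "input-graph" and "graph-base" are placeholders.
+            ImmutableGraph input = ImmutableGraph.loadOffline("input-graph");
+            BVGraph.store(input, "graph-base");
+        }
+    }
+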
-3. BFS
--------
+4. BFS
+------
-In the LLP paper, authors propose an empirical analysis linking node ordering
-and high compression ratio: it is important to use an ordering of nodes ids such
-that vertices from the same host are close to one another.
+In [LLP]_, the authors empirically demonstrate that a high graph
+compression ratio can be achieved for the graph of the Web by ordering nodes
+such that vertices from the same host are close to each other.
-Building on this insight, the previous compression results in the BV compress
-step are improved by re-ordering nodes ids using a BFS traversal order. We use
-the `BFS
+In Software Heritage, there is no notion of "host" that can be used to generate
+these compression-friendly orderings, because the identifiers are just
+uniformly random cryptographic hashes. However, we can generate these orderings
+by running algorithms that tell us which nodes are close to each other.
+
+In this step, we run a BFS traversal on the entire graph to get a more
+compression-friendly ordering of nodes. We use the `BFS
<http://law.di.unimi.it/software/law-docs/it/unimi/dsi/law/big/graph/BFS.html>`_
class from the `LAW <http://law.di.unimi.it/>`_ library.
-The resulting ordering is stored in the ``.order`` file, listing nodes ids in
-order of traversal.
+The resulting ordering is stored in a ``graph-bfs.order`` file, which contains
+all the node IDs in the order of traversal.
-4. Permute
-----------
+5. PERMUTE_BFS
+--------------
-Once the order is computed (BFS or another ordering technique), the final
-compressed graph is created based on the initial BV compress result, and using
-the new node order mapping. The permutation uses the `Transform
+Once the BFS order is computed, we permute the initial "base" graph using
+this new ordering. The permutation uses the `Transform
<http://webgraph.di.unimi.it/docs-big/it/unimi/dsi/big/webgraph/Transform.html>`_
class from WebGraph framework.
-The final compressed graph is only stored in the resulting ``.graph``,
-``.offsets``, ``.obl``, and ``.properties`` files.
+The BFS-compressed graph is stored in the files
+``graph-bfs.{graph,offsets,properties}``.
+6. TRANSPOSE_BFS
+----------------
-5. Stats
---------
+We transpose the BFS-compressed graph, using the `Transform
+<http://webgraph.di.unimi.it/docs-big/it/unimi/dsi/big/webgraph/Transform.html>`_
+class from WebGraph.
+This step is a prerequisite for LLP compression.
+
+7. SIMPLIFY
+-----------
+
+This step creates a loopless and symmetric version of the BFS-compressed graph,
+using the `Transform
+<http://webgraph.di.unimi.it/docs-big/it/unimi/dsi/big/webgraph/Transform.html>`_
+class from WebGraph.
+This step is a prerequisite for LLP compression.
+
+8. LLP
+------
+
+Better compression ratios can be achieved by using the Layered Label Propagation
+(LLP) algorithm to reorder nodes. This algorithm is described in [LLP]_.
+The LLP algorithm finds locality-preserving orders by clustering together nodes
+in close proximity. Similar to the BFS, this algorithm is particularly
+interesting for our use case as it is unsupervised, and does not rely on prior
+information on the clusters present in the graph. The idea behind the
+clustering algorithm is to randomly distribute communities to the nodes in the
+graph, then iteratively assign to each node the community most represented in
+its neighbors.
+
+.. [LLP] Paolo Boldi, Marco Rosa, Massimo Santini, Sebastiano Vigna.
+ *Layered label propagation: a multiresolution coordinate-free ordering for compressing social networks.*
+ WWW 2011: 587-596
+ DOI: https://doi.org/10.1145/1963405.1963488
+ preprint: https://arxiv.org/abs/1011.5425
+
+LLP is more costly than simple BFS-based compression in both time and memory.
+Even though the algorithm has a linear time complexity, it does multiple
+iterations on the graph and is significantly slower than the BFS, which is a
+single traversal. Moreover, keeping track of the communities requires a
+total of 13 bytes per node, which increases the RAM requirements.
+Because of these constraints, it is unrealistic to run the LLP algorithm on the
+uncompressed version of the graph; this is why we do an intermediate
+compression with the BFS ordering first, then compress the entire graph *again*
+with an even better ordering.
+
+The LLP algorithm takes a simplified (loopless, symmetric) graph as an input,
+which we already computed in the previous steps.
+
+The algorithm is also parameterized by a list of γ values, a "resolution"
+parameter which defines the shape of the clusterings it produces: either
+smaller but denser pieces, or larger but unavoidably sparser pieces. The algorithm then
+combines the different clusterings together to generate the output reordering.
+γ values are given to the algorithm in the form :math:`\frac{j}{2^k}`; by
+default, 12 different values of γ are used. However, the combination procedure
+is very slow, and using that many γ values could take several months in our
+case.
+We thus narrowed down a smaller set of γ values that empirically give good
+compression results, which are used by default in the pipeline. In general,
+smaller values of γ seem to generate better compression ratios. The effect of a
+given γ is that the density of the sparsest cluster is at least :math:`\frac{\gamma}{\gamma+1}`,
+so large γ values imply smaller, denser clusters. It is reasonable to assume that
+since the graph is very sparse to start with, such clusters are not that
+useful.
+
+The resulting ordering is stored in a ``graph-llp.order`` file.
+
+9. PERMUTE_LLP
+--------------
-Compute various statistics on the final compressed graph:
+Once the LLP order is computed, we permute the BFS-compressed graph using
+this new ordering. The LLP-compressed graph, which is our final compressed
+graph, is stored in the files ``graph.{graph,offsets,properties}``.
-- ``.stats``: entries such as number of nodes, edges, avg/min/max degree,
+10. OBL
+-------
+
+Cache the BVGraph offsets of the forward graph to make loading faster. The
+resulting offset big list is stored in the ``graph.obl`` file.
+
+11. COMPOSE_ORDERS
+------------------
+
+To be able to translate the initial MPH inputs to their resulting rank in the
+LLP-compressed graph, we need to use the two order permutations successively:
+the base → BFS permutation, then the BFS → LLP permutation.
+
+To make this less wasteful, we *compose* the two permutations into a single
+one. We use the `composePermutationsInPlace
+<https://dsiutils.di.unimi.it/docs/it/unimi/dsi/Util.html#composePermutationsInPlace(long%5B%5D%5B%5D,long%5B%5D%5B%5D)>`_
+function of the dsiutils library. The resulting permutation is stored as a
+``graph.order`` file. Hashing a SWHID with the ``graph.mph`` function, then
+permuting the result using the ``graph.order`` permutation yields the integer
+node ID matching the input SWHID in the graph.
+
+12. STATS
+---------
+
+This step computes various statistics on the compressed graph:
+
+- ``.stats``: statistics such as number of nodes, edges, avg/min/max degree,
average locality, etc.
- ``.indegree``: graph indegree distribution
- ``.outdegree``: graph outdegree distribution
@@ -117,9 +395,202 @@
class from WebGraph.
-6. Transpose
-------------
+13. TRANSPOSE
+-------------
-Create a transposed graph to allow backward traversal, using the `Transform
+Transpose the graph to allow backward traversal, using the `Transform
<http://webgraph.di.unimi.it/docs-big/it/unimi/dsi/big/webgraph/Transform.html>`_
-class from WebGraph.
+class from WebGraph. The resulting transposed graph is stored as the
+``graph-transposed.{graph,offsets,properties}`` files.
+
+
+14. TRANSPOSE_OBL
+-----------------
+
+Same as OBL, but for the transposed graph. The resulting offset big list is
+stored in the ``graph-transposed.obl`` file.
+
+
+15. MAPS
+--------
+
+This step generates the *node mappings* described in
+:ref:`swh-graph-java-node-mappings`. In particular, it generates:
+
+- ``graph.node2swhid.bin``: a compact binary representation of all the
+ SWHIDs in the graph, ordered by their rank in the graph file.
+- ``graph.node2type.bin``: a `LongBigArrayBitVector
+ <https://dsiutils.di.unimi.it/docs/it/unimi/dsi/bits/LongBigArrayBitVector.html>`_
+ which stores the type of each node.
+
+It does so by reading all the SWHIDs in the ``graph.nodes.csv.zst`` file generated in the
+EXTRACT_NODES step, then getting their corresponding node IDs (using the
+``.mph`` and ``.order`` files), then sorting all the SWHIDs according to
+their node ID. It then writes these SWHIDs in order, in a compact but seekable
+binary format, which can be used to return the SWHID corresponding to any given
+node in O(1).
+
+
+16. EXTRACT_PERSONS
+-------------------
+
+This step reads the ORC graph dataset and extracts all the unique persons it
+contains. Here, "persons" are defined as the set of unique pairs of name +
+email, potentially pseudonymized, found either as revision authors, revision
+committers or release authors.
+
+The ExtractPersons class reads all the persons from revision and release
+tables, then uses ``sort -u`` to get a sorted list without any duplicates. The
+resulting sorted list of authors is stored in the ``graph.persons.csv.zst``
+file.
+
+
+17. MPH_PERSONS
+---------------
+
+This step computes a Minimal Perfect Hash function on the set of all the unique
+persons extracted in the EXTRACT_PERSONS step. Each individual person is mapped
+to a unique integer in :math:`[0, n-1]` where *n* is the total number of
+persons. The resulting function is serialized and stored in the
+``graph.persons.mph`` file.
+
+
+18. NODE_PROPERTIES
+-------------------
+
+This step generates the *node property files*, as described in
+:ref:`swh-graph-java-node-properties`.
+The nodes in the Software Heritage Graph each have associated *properties*
+(e.g., commit timestamps, authors, messages, ...). The values of these
+properties for each node in the graph are compressed and stored in files
+alongside the compressed graph.
+
+The WriteNodeProperties class reads all the properties from the ORC Graph
+Dataset, then serializes each of them in a representation suitable for
+efficient random access (e.g., large binary arrays) and stores them on disk.
+
+For persons (authors, committers etc), the MPH computed in the MPH_PERSONS step
+is used to store them as a single pseudonymized integer ID, which uniquely
+represents a full name + email.
+
+The results are stored in the following list of files:
+
+- ``graph.property.author_id.bin``
+- ``graph.property.author_timestamp.bin``
+- ``graph.property.author_timestamp_offset.bin``
+- ``graph.property.committer_id.bin``
+- ``graph.property.committer_timestamp.bin``
+- ``graph.property.committer_timestamp_offset.bin``
+- ``graph.property.content.is_skipped.bin``
+- ``graph.property.content.length.bin``
+- ``graph.property.message.bin``
+- ``graph.property.message.offset.bin``
+- ``graph.property.tag_name.bin``
+- ``graph.property.tag_name.offset.bin``
+
+
+19. MPH_LABELS
+--------------
+
+This step computes a **monotone** Minimal Perfect Hash function on the set of
+all the unique *arc label names* extracted in the EXTRACT_NODES step. Each
+individual arc label name (i.e., directory entry names and snapshot branch
+names) is monotonically mapped to a unique integer in :math:`[0, n-1]`, where *n*
+is the total number of unique arc label names, which corresponds to their
+**lexical rank** in the set of all arc labels.
+
+In other words, this MPH being monotone means that the hash of the *k*-th item
+in the sorted input list of arc labels will always be *k*.
+We use the `LcpMonotoneMinimalPerfectHashFunction
+<https://sux.di.unimi.it/docs/it/unimi/dsi/sux4j/mph/LcpMonotoneMinimalPerfectHashFunction.html>`_
+of Sux4J to generate this function.
+
+The rationale for using a monotone function here is that it will allow us to
+quickly get back the arc label from its hash without having to store an
+additional permutation.
+The resulting MPH function is serialized and stored in the ``graph.labels.mph``
+file.
+
+
+20. FCL_LABELS
+--------------
+
+This step computes a *reverse-mapping* for arc labels, i.e., a way to
+efficiently get the arc label name from its hash computed with the monotone MPH
+of the MPH_LABELS step.
+
+Thanks to the MPH being monotone, this boils down to storing all the labels in
+lexicographic order in a string list format that allows O(1) access to its
+elements. For this purpose, we use the `MappedFrontCodedStringBigList
+<https://dsiutils.di.unimi.it/docs/it/unimi/dsi/big/util/MappedFrontCodedStringBigList.html>`_
+class from the dsiutils library, using the ``graph.labels.csv.zst`` file as its
+input. It stores the label names in a compact way by using front-coding
+compression, which is particularly efficient here because the strings are
+already in lexicographic order. The resulting FCL files are stored as
+``graph.labels.fcl.*``, and they can be loaded using memory mapping.
+
+
+21. EDGE_LABELS
+---------------
+
+
+This step generates the *edge property files*, as described in
+:ref:`swh-graph-java-edge-properties`. These files allow us to get the *edge
+labels* as we iterate on the edges of the graph. The files essentially contain
+compressed sorted triplets of the form (source, destination, label), with
+additional offsets to allow random access.
+
+To generate these files, the LabelMapBuilder class starts by reading in
+parallel the labelled edges in the ORC dataset, which can be thought of as
+quadruplets containing the source SWHID, the destination SWHID, the label name
+and the entry permission if applicable:
+
+.. code-block:: text
+
+ swh:1:snp:4548a5… swh:1:rev:0d6834… cmVmcy9oZWFkcy9tYXN0ZXI=
+ swh:1:dir:05faa1… swh:1:cnt:a35136… dGVzdC5j 33188
+ swh:1:dir:05faa1… swh:1:dir:d0ff82… dGVzdA== 16384
+ ...
+
+Using the ``graph.mph`` and the ``graph.order`` files, we hash and permute the
+source and destination nodes. We also monotonically hash the labels using the
+``graph.labels.mph`` function to obtain the arc label identifiers. The
+permissions are normalized as one of the 6 possible values in the
+``DirEntry.Permission.Type`` enum, and are then stored in the 3 lowest bits of
+the label field.
+
+.. code-block:: text
+
+ 4421 14773 154
+ 1877 21441 1134
+ 1877 14143 1141
+ ...
+
+These hashed edges and their compact-form labels are then grouped into large
+batches, sorted in an aggressively parallel fashion, and stored as
+``.bitstream`` files. These batch files are put in a heap structure to perform
+an on-the-fly merge sort over all the batches.
+
+Then, the LabelMapBuilder loads the graph and starts iterating on its edges. It
+synchronizes the stream of edges read from the graph with the stream of sorted
+edges and labels read from the bitstreams in the heap. At this point, it writes
+the labels to the following output files:
+
+- ``graph-labelled.properties``: a property file describing the graph, notably
+ containing the basename of the wrapped graph.
+- ``graph-labelled.labels``: the compressed labels
+- ``graph-labelled.labeloffsets``: the offsets used to access the labels in
+ random order.
+
+It then does the same with backward edge batches to get the transposed
+equivalent of these files:
+``graph-transposed-labelled.{properties,labels,labeloffsets}``.
+
+
+
+22. CLEAN_TMP
+-------------
+
+This step reclaims space by deleting the temporary directory, as well as all
+the intermediate outputs that are no longer necessary now that the final graph
+has been compressed (shown in gray in the step diagram).
diff --git a/docs/images/Makefile b/docs/images/Makefile
--- a/docs/images/Makefile
+++ b/docs/images/Makefile
@@ -1,7 +1,7 @@
all: compression_steps.png compression_steps.svg
%.png: %.dot
- dot -Gdpi=300 -Tpng $< -o $@
+ dot -Gdpi=150 -Tpng $< -o $@
%.svg: %.dot
dot -Tsvg $< -o $@
diff --git a/docs/images/compression_steps.dot b/docs/images/compression_steps.dot
--- a/docs/images/compression_steps.dot
+++ b/docs/images/compression_steps.dot
@@ -1,51 +1,105 @@
digraph "Compression steps" {
- // Horizontal graph
- rankdir=LR;
+ node [shape = none];
+
+ orc_dataset [label="ORC Graph\nDataset"];
+ nodes_csv [label="graph.nodes.csv.zst"];
+ labels_csv [label="graph.labels.csv.zst"];
+ graph_mph [label="graph.mph"];
subgraph {
- input_edges [label="swh.edges.csv.gz", fontsize=9, shape=none];
- input_nodes [label="swh.nodes.csv.gz", fontsize=9, shape=none];
- {rank=same; input_edges; input_nodes;}
+ node [fontcolor=darkgray];
+ graph_base [label="graph-base.graph"]
+ graph_bfs_order [label="graph-bfs.order"]
+ graph_bfs [label="graph-bfs.graph"]
+ graph_bfs_transposed [label="graph-bfs-transposed.graph"]
+ graph_bfs_simplified [label="graph-bfs-simplified.graph"]
+ graph_llp_order [label="graph-llp.order"]
}
- mph [label="MPH", shape=box];
- mph_out [label="swh.mph", fontsize=9, shape=none];
-
- bv_compress [label="BV compress", shape=box];
- bv_compress_out
- [label="swh-bv.graph\lswh-bv.offsets\lswh-bv.obl\lswh-bv.properties",
- fontsize=9, shape=none];
-
- bfs [label="BFS", shape=box];
- bfs_out [label="swh.order", fontsize=9, shape=none];
+ graph_llp [label="graph.graph"]
+ graph_llp_transposed [label="graph-transposed.graph"]
+ graph_order [label="graph.order"]
+ graph_obl [label="graph.obl"]
+ graph_transposed_obl [label="graph-transposed.obl"]
+ stats [label="graph.stats"]
+ swhidmap [label="graph.node2swhid.bin"]
+ typemap [label="graph.node2type.bin"]
+ persons_csv [label="graph.persons.csv.zst"];
+ persons_mph [label="graph.persons.mph"];
+ node_properties [label="graph.property.*"];
+ labels_mph [label="graph.labels.mph"];
+ labels_fcl [label="graph.labels.fcl"];
+ graph_labelled [label="graph-labelled.*"];
+ graph_transposed_labelled [label="graph-transposed-labelled.*"];
- permute [label="Permute", shape=box];
- permute_out
- [label="swh.graph\lswh.offsets\lswh.obl\lswh.properties",
- fontsize=9, shape=none];
-
- stats [label="Stats", shape=box];
- stats_out
- [label="swh.stats\lswh.indegree\lswh.outdegree",
- fontsize=9, shape=none];
+ subgraph {
+ node [shape=box, fontname="Courier New"];
+ EXTRACT_NODES;
+ MPH;
+ BV;
+ BFS;
+ PERMUTE_BFS;
+ TRANSPOSE_BFS;
+ SIMPLIFY;
+ LLP;
+ PERMUTE_LLP;
+ COMPOSE_ORDERS;
+ STATS;
+ TRANSPOSE;
+ OBL;
+ TRANSPOSE_OBL;
+ NODE_MAP;
+ EXTRACT_PERSONS;
+ MPH_PERSONS;
+ NODE_PROPERTIES;
+ MPH_LABELS;
+ FCL_LABELS;
+ EDGE_LABELS;
+ }
- transpose [label="Transpose", shape=box];
- transpose_out
- [label="swh-transposed.graph\lswh-transposed.offsets\lswh-transposed.obl\lswh-transposed.properties",
- fontsize=9, shape=none];
- input_nodes -> mph;
- input_edges -> bv_compress;
- mph -> mph_out;
- mph_out -> bv_compress;
- bv_compress -> bv_compress_out;
- bv_compress_out-> bfs;
- bv_compress_out-> permute;
- bfs -> bfs_out;
- bfs_out -> permute;
- permute -> permute_out;
- permute_out -> stats;
- permute_out -> transpose;
- stats -> stats_out;
- transpose -> transpose_out;
+ orc_dataset -> EXTRACT_NODES;
+ EXTRACT_NODES -> nodes_csv;
+ EXTRACT_NODES -> labels_csv;
+ nodes_csv -> MPH -> graph_mph;
+ graph_mph -> BV;
+ orc_dataset -> BV -> graph_base;
+ graph_base -> BFS -> graph_bfs_order;
+ graph_bfs_order -> PERMUTE_BFS;
+ graph_base -> PERMUTE_BFS -> graph_bfs;
+ graph_bfs -> TRANSPOSE_BFS -> graph_bfs_transposed;
+ graph_bfs_transposed -> SIMPLIFY;
+ graph_bfs -> SIMPLIFY -> graph_bfs_simplified;
+ graph_bfs_simplified -> LLP -> graph_llp_order;
+ graph_llp_order -> PERMUTE_LLP;
+ graph_bfs -> PERMUTE_LLP -> graph_llp;
+ graph_bfs_order -> COMPOSE_ORDERS;
+ graph_llp_order -> COMPOSE_ORDERS -> graph_order;
+ graph_llp -> TRANSPOSE -> graph_llp_transposed;
+ graph_llp -> OBL -> graph_obl;
+ graph_llp_transposed -> TRANSPOSE_OBL -> graph_transposed_obl;
+ graph_llp -> STATS -> stats;
+ graph_llp -> NODE_MAP;
+ nodes_csv -> NODE_MAP;
+ graph_mph -> NODE_MAP;
+ graph_order -> NODE_MAP;
+ NODE_MAP -> swhidmap;
+ NODE_MAP -> typemap;
+ orc_dataset -> EXTRACT_PERSONS -> persons_csv;
+ persons_csv -> MPH_PERSONS -> persons_mph;
+ orc_dataset -> NODE_PROPERTIES;
+ persons_mph -> NODE_PROPERTIES;
+ graph_mph -> NODE_PROPERTIES;
+ graph_order -> NODE_PROPERTIES;
+ NODE_PROPERTIES -> node_properties;
+ labels_csv -> MPH_LABELS -> labels_mph;
+ labels_mph -> FCL_LABELS;
+ labels_csv -> FCL_LABELS -> labels_fcl;
+ orc_dataset -> EDGE_LABELS;
+ labels_mph -> EDGE_LABELS;
+ graph_llp -> EDGE_LABELS;
+ graph_mph -> EDGE_LABELS;
+ graph_order -> EDGE_LABELS;
+ EDGE_LABELS -> graph_labelled;
+ EDGE_LABELS -> graph_transposed_labelled;
}
diff --git a/docs/index.rst b/docs/index.rst
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -8,10 +8,11 @@
:caption: Overview
quickstart
+ api
+ java-api
+ memory
compression
cli
- api
- use-cases
docker
git2graph
/apidoc/swh.graph
diff --git a/docs/java-api.rst b/docs/java-api.rst
new file mode 100644
--- /dev/null
+++ b/docs/java-api.rst
@@ -0,0 +1,744 @@
+.. _swh-graph-java-api:
+
+Using the Java API
+==================
+
+.. highlight:: java
+
+While the :ref:`HTTP API <swh-graph-api>` is useful for many common use-cases,
+it is often not sufficient to implement more complex algorithms. This section
+describes the low-level Java API that ``swh-graph`` provides on top of the
+WebGraph framework to manipulate the compressed graph of Software Heritage.
+
+A cursory understanding of the `WebGraph framework
+<https://webgraph.di.unimi.it/>`_ and its API is helpful to understand the
+notions detailed here.
+
+.. _swh-graph-java-basics:
+
+Basics
+------
+
+In the WebGraph framework, graphs are generally subclasses of
+`ImmutableGraph
+<https://webgraph.di.unimi.it/docs/it/unimi/dsi/webgraph/ImmutableGraph.html>`_,
+the abstract class providing the core API to manipulate and iterate on graphs.
+Under the hood, compressed graphs are stored as the `BVGraph
+<https://webgraph.di.unimi.it/docs/it/unimi/dsi/webgraph/BVGraph.html>`_
+class, which contains the actual codec used to compress and decompress
+adjacency lists.
+
+Graph **nodes** are mapped to a contiguous set of integers :math:`[0, n - 1]`
+where *n* is the total number of nodes in the graph.
+Each node has an associated *adjacency list*, i.e., the list of nodes reached
+by the arcs going out of it. These lists represent the **edges** (or **arcs**)
+of the graph.
+
+**Note**: edges are always directed. Undirected graphs are internally stored as
+a pair of directed edges (src → dst, dst → src), and are called "symmetric"
+graphs.
+
+On disk, a simple BVGraph with the basename ``graph`` would be represented as
+the following set of files:
+
+- ``graph.graph``: contains the compressed adjacency lists of each node, which
+ can be decompressed by the BVGraph codec.
+- ``graph.properties``: contains metadata on the graph, such as the number of
+ nodes and arcs, as well as additional loading information needed by the
+ BVGraph codec.
+- ``graph.offsets``: a list of offsets of where the adjacency list of each node
+ is stored in the main graph file.
+- ``graph.obl``: optionally, an "offset big-list file" which can be used to
+ load graphs faster.
+
+An ImmutableGraph can be loaded using different *load methods*, which each have
+different performance implications:
+
+- ``load()``: the entire graph is loaded in RAM and supports random access.
+- ``loadMapped()``: the graph is loaded by memory-mapping it from disk (see
+ ``mmap(1)``), at the cost of being potentially slower, especially when doing
+ random access on slow storage.
+- ``loadOffline()``: no data is actually loaded in memory; only sequential
+ iteration is possible.
+
+The following code loads a graph stored on disk under the ``compressed/graph``
+basename, using the memory-mapped loading mode, and stores it as a generic
+ImmutableGraph:
+
+.. code-block:: java
+
+ ImmutableGraph graph = ImmutableGraph.loadMapped("compressed/graph");
+
+Note that most of the time you will want to use the SWH-specific subclass
+**SwhUnidirectionalGraph** instead, which has the same API as an ImmutableGraph
+except it adds other SWH-specific methods. It is described later in the
+:ref:`swh-graph-java-node-mappings` section.
+
+
+Running the code
+----------------
+
+To run a piece of Java code written using the Java API, you need to run it with
+all the dependencies in your classpath (the WebGraph libraries and the
+swh-graph library). The easiest way to do so is to use the *fat jar*
+shipped with the swh-graph package on PyPI, which contains all the required
+dependencies.
+
+.. code-block:: console
+
+ $ java -cp venv/share/swh-graph/swh-graph-0.5.2.jar MyAlgo.java
+
+
+Note that for larger graphs, the default heap size of the JVM is likely to be
+insufficient to load the entire graph in memory. It is advised to increase the
+heap size with the ``-Xmx`` flag:
+
+.. code-block:: console
+
+ $ java -Xmx300G -cp venv/share/swh-graph/swh-graph-0.5.2.jar MyAlgo.java
+
+For more information on performance tuning and memory considerations, you
+should also read the :ref:`swh-graph-memory` page, in which we recommend
+additional JVM options for loading large graphs.
+
+
+Simple traversal
+----------------
+
+The ImmutableGraph class provides primitives to iterate and traverse graphs. It
+contains the following methods:
+
+- ``graph.numNodes()`` returns the number of nodes in the graph (*n*).
+- ``graph.numArcs()`` returns the number of arcs in the graph.
+
+And, given a node ID :math:`k \in [0, n - 1]`:
+
+- ``graph.successors(k)`` returns a LazyLongIterator on the nodes that are
+ *adjacent* to *k* (i.e., its outgoing *neighbors*).
+- ``graph.outdegree(k)`` returns the number of outgoing neighbors of *k*.
+
+
+Example: Average outdegree
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The following code can be used to compute the average
+outdegree of a graph, which is a useful measure of its density:
+
+.. code-block:: java
+
+ public static long averageOutdegree(ImmutableGraph graph) {
+ return ((long) graph.numArcs()) / graph.numNodes();
+ }
+
+
+Example: Degree distributions
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Using the ``outdegree()`` primitive, we can compute the outdegree distribution
+of the graph by iterating on all its nodes. The distribution will be returned
+as a map that associates to each degree *d* the number of nodes with outdegree
+*d*.
+
+.. code-block:: java
+
+ public static Map<Long, Long> outdegreeDistribution(ImmutableGraph graph) {
+ HashMap<Long, Long> distribution = new HashMap<Long, Long>();
+ for (long k = 0; k < graph.numNodes(); ++k) {
+ distribution.merge(graph.outdegree(k), 1L, Long::sum);
+ }
+ return distribution;
+ }
+
+
+Example: Depth-First Traversal
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The ``successors`` primitive can be used to write a simple stack-based DFS
+traversal on the graph which starts from a given node and prints all the
+descendant nodes in its transitive closure:
+
+.. code-block:: java
+ :emphasize-lines: 10
+
+ public static void visitNodesDFS(ImmutableGraph graph, long srcNodeId) {
+ Stack<Long> stack = new Stack<>();
+ HashSet<Long> visited = new HashSet<Long>();
+ stack.push(srcNodeId);
+ visited.add(srcNodeId);
+
+ while (!stack.isEmpty()) {
+ long currentNodeId = stack.pop();
+ System.out.println(currentNodeId);
+
+ LazyLongIterator it = graph.successors(currentNodeId);
+ for (long neighborNodeId; (neighborNodeId = it.nextLong()) != -1;) {
+ if (!visited.contains(neighborNodeId)) {
+ stack.push(neighborNodeId);
+ visited.add(neighborNodeId);
+ }
+ }
+ }
+ }
+
+Example: Breadth-First Traversal
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Swapping the stack for a queue changes the traversal order from depth-first
+to breadth-first:
+
+.. code-block:: java
+ :emphasize-lines: 2
+
+ public static void visitNodesBFS(ImmutableGraph graph, long srcNodeId) {
+ Queue<Long> queue = new ArrayDeque<>();
+ HashSet<Long> visited = new HashSet<Long>();
+ queue.add(srcNodeId);
+ visited.add(srcNodeId);
+
+ while (!queue.isEmpty()) {
+ long currentNodeId = queue.poll();
+ System.out.println(currentNodeId);
+
+ LazyLongIterator it = graph.successors(currentNodeId);
+ for (long neighborNodeId; (neighborNodeId = it.nextLong()) != -1;) {
+ if (!visited.contains(neighborNodeId)) {
+ queue.add(neighborNodeId);
+ visited.add(neighborNodeId);
+ }
+ }
+ }
+ }
+
+
+.. _swh-graph-java-node-mappings:
+
+Node types and SWHIDs
+---------------------
+
+In the Software Heritage archive, nodes are not represented by a simple
+integer, but by a :ref:`SWHID <persistent-identifiers>`, which contains both the
+*type* of the node (revision, directory, blob...) and its unique identifier. We
+use **node mappings** which allow us to translate between SWHIDs and the
+compact node IDs used in the compressed graph.
+
+Most notably, we use a MPH (Minimal Perfect Hash) function implemented in the
+`GOVMinimalPerfectHashFunction
+<http://sux.di.unimi.it/docs/it/unimi/dsi/sux4j/mph/GOVMinimalPerfectHashFunction.html>`_
+class of the Sux4J library, which maps N keys to N consecutive integers with no
+collisions.
+
+The following files are used to store the mappings between the nodes and their
+types:
+
+- ``graph.mph``: contains a serialized minimal perfect hash function computed
+ on the list of all the SWHIDs in the graph.
+- ``graph.order``: contains the permutation that associates with each output of
+ the MPH the node ID to which it corresponds
+- ``graph.node2swhid.bin``: contains a compact binary representation of all the
+ SWHIDs in the graph, ordered by their rank in the graph file.
+- ``graph.node2type.bin``: contains a `LongBigArrayBitVector
+ <https://dsiutils.di.unimi.it/docs/it/unimi/dsi/bits/LongBigArrayBitVector.html>`_
+ which stores the type of each node.
+
+To use these mappings easily, we provide the class **SwhUnidirectionalGraph**,
+an ImmutableGraph which wraps the underlying graph and adds a few
+utility methods to obtain SWH-specific information on the graph.
+
+A SwhUnidirectionalGraph can be loaded in a similar way to any ImmutableGraph,
+as long as the mapping files listed above are present::
+
+ SwhUnidirectionalGraph graph = SwhUnidirectionalGraph.load(basename);
+
+This class exposes the same graph primitives as an ImmutableGraph, but it
+additionally contains the following methods:
+
+- ``SWHID getSWHID(long nodeId)``: returns the SWHID associated with a given
+ node ID. This function does a lookup of the SWHID at offset *i* in the file
+ ``graph.node2swhid.bin``.
+
+- ``long getNodeID(SWHID swhid)``: returns the node ID associated with a given
+ SWHID. It works by hashing the SWHID with the function stored in
+ ``graph.mph``, then permuting it using the permutation stored in
+ ``graph.order``. It does additional domain-checking by calling ``getSWHID()``
+ on its own result to check that the input SWHID was valid.
+
+- ``Node.Type getNodeType(long nodeID)``: returns the type of a given node, as
+ an enum of all the different object types in the Software Heritage data
+ model. It does so by looking up the value at offset *i* in the bit vector
+ stored in ``graph.node2type.bin``.
+
+
+Example: Find the target directory of a revision
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+As an example, we use the methods mentioned above to perform the
+following task: "given a revision, return its target directory". To do so, we
+first look up the node ID of the given revision in the compressed graph. We
+iterate on the successors of that node, and return the SWHID of the first
+destination node that has the "directory" type.
+
+
+.. code-block:: java
+ :emphasize-lines: 2
+
+ public SWHID findDirectoryOfRevision(SwhUnidirectionalGraph graph, SWHID revSwhid) {
+        long src = graph.getNodeID(revSwhid);
+        assert graph.getNodeType(src) == Node.Type.REV;
+        LazyLongIterator it = graph.successors(src);
+        for (long dst; (dst = it.nextLong()) != -1;) {
+            if (graph.getNodeType(dst) == Node.Type.DIR) {
+                return graph.getSWHID(dst);
+            }
+        }
+        throw new RuntimeException("Revision has no target directory");
+ }
+
+.. _swh-graph-java-node-properties:
+
+Node properties
+---------------
+
+The Software Heritage Graph is a *property graph*, which means it has various
+properties associated with its nodes and edges (e.g., commit timestamps,
+authors, messages, ...). We compress these properties and store them in files
+alongside the compressed graph. This allows you to write traversal algorithms
+that depend on these properties.
+
+By default, properties are not assumed to be present and are not loaded when
+the graph itself is loaded. If you want to use a property, you need to
+explicitly load it first. As an example, this is how you load the "content
+length" property to get the length of a given blob::
+
+ SwhUnidirectionalGraph graph = SwhUnidirectionalGraph.load(basename);
+ graph.loadContentLength();
+ long blobSize = graph.getContentLength(graph.getNodeID(swhid));
+
+The documentation of the SwhGraphProperties class (**TODO: link**) lists all
+the different properties, their types, and the methods used to load them and to get
+their value for a specific node.
+
+A few things of note:
+
+- A single loading call can load multiple properties at once; this is because
+ they are stored in the same file to be more space efficient.
+
+- Persons (authors, committers etc) are exported as a single pseudonymized
+ integer ID, that uniquely represents a full name + email.
+
+- Timestamps are stored as a long integer (for the timestamp itself) and a
+ short integer (for the UTC offset).
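+
+For example, committer timestamps (whose loading methods also appear in the
+provenance example further below) are loaded and read in the same way;
+``basename`` and ``swhid`` are placeholders here::
+
+    SwhUnidirectionalGraph graph = SwhUnidirectionalGraph.load(basename);
+    graph.loadCommitterTimestamps();
+    // getCommitterTimestamp() may return null, as in the provenance example below.
+    Long ts = graph.getCommitterTimestamp(graph.getNodeID(swhid));
+    if (ts != null) {
+        System.out.println("committer timestamp: " + ts);
+    }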
+
+
+.. _swh-graph-java-edge-properties:
+
+Edge labels
+-----------
+
+While looking up graph properties on the *nodes* of the graph is relatively
+straightforward, doing so for labels on the *arcs* is comparatively more
+difficult. These include the names and permissions of directory entries, as
+well as the branch names of snapshots.
+
+The `ArcLabelledImmutableGraph
+<https://webgraph.di.unimi.it/docs/it/unimi/dsi/webgraph/labelling/ArcLabelledImmutableGraph.html>`_
+class in WebGraph wraps an ImmutableGraph, but augments its iterators by making them
+*labelled iterators*, which essentially allow us to look up the label of the
+arcs while iterating on them.
+
+This labelled graph is stored in the following files:
+
+- ``graph-labelled.properties``: a property file describing the graph, notably
+ containing the basename of the wrapped graph.
+- ``graph-labelled.labels``: the compressed labels
+- ``graph-labelled.labeloffsets``: the offsets used to access the labels in
+ random order.
+
+The SwhUnidirectionalGraph class contains *labelled* loading methods
+(``loadLabelled()``, ``loadLabelledMapped()``, ...). When these loading methods
+are used instead of the standard non-labelled ones, the graph is loaded as an
+ArcLabelledImmutableGraph instead of an ImmutableGraph. The following methods
+can then be used:
+
+- ``labelledSuccessors(k)`` returns a `LabelledArcIterator
+ <https://webgraph.di.unimi.it/docs/it/unimi/dsi/webgraph/labelling/ArcLabelledNodeIterator.LabelledArcIterator.html>`_
+ which is used in the same way as a LazyLongIterator except it also contains a
+ ``label()`` method to get the label of the currently traversed arc.
+- ``labelledNodeIterator()`` returns an `ArcLabelledNodeIterator
+ <https://webgraph.di.unimi.it/docs/it/unimi/dsi/webgraph/labelling/ArcLabelledNodeIterator.html>`_
+ of all the nodes in the graph, which replaces the LazyLongIterator of the
+  ``successors()`` function by a LabelledArcIterator similar to the one above.
+
+
+Label format
+~~~~~~~~~~~~
+
+The labels of each arc are returned as a ``DirEntry[]`` array. They encode
+both the name of a directory entry and its permissions. For snapshot branches,
+only the "name" field is useful.
+
+Arc label names are encoded as an integer ID representing each unique
+entry/branch name present in the graph. To retrieve the actual name associated
+with a given label ID, one needs to load the reverse mapping similar to how you
+would do for a normal property::
+
+ SwhUnidirectionalGraph graph = SwhUnidirectionalGraph.loadLabelled(basename);
+ graph.loadLabelNames();
+
+The byte array representing the actual label name can then be loaded with::
+
+ byte[] name = graph.getLabelName(label.filenameId);
+
+
+Multiedges
+~~~~~~~~~~
+
+The Software Heritage graph is not a *simple graph*, where at most one edge can exist
+between two vertices, but a *multigraph*, where multiple edges can be incident
+to the same two vertices. Consider for instance the case of a single directory
+``test/`` containing twice the same file blob (e.g., the empty file), under two
+different names (e.g., ``ISSUES.txt`` and ``TODO.txt``, both completely empty).
+The simple graph view of this directory will represent it as a single edge
+``test`` → *empty file*, while the multigraph view will represent it as *two*
+edges between the same nodes.
+
+Due to the copy-list model of compression, WebGraph only stores simple graphs,
+and thus stores multiedges as single edges, to which we cannot associate
+a single label name (in our example, we need to associate both names
+``ISSUES.txt`` and ``TODO.txt``).
+To represent this possibility of having multiple file names for a single arc,
+in the case of multiple relationships between two identical nodes, each arc label is
+stored as an *array* of DirEntry, each record representing one relationship
+between two nodes.
+
+
+Example: Printing all the entries of a directory
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The following code showcases how one can print all the entries (name,
+permission and target SWHID) of a given directory, using the labelled methods
+seen above.
+
+.. code-block:: java
+
+    public static void printEntries(SwhUnidirectionalGraph g, long dirNode) {
+        LabelledArcIterator s = g.labelledSuccessors(dirNode);
+        for (long dst; (dst = s.nextLong()) >= 0;) {
+            DirEntry[] labels = (DirEntry[]) s.label().get();
+            for (DirEntry label : labels) {
+                System.out.format(
+                    "%s %s %d\n",
+                    g.getSWHID(dst),
+                    new String(g.getLabelName(label.filenameId)),
+                    label.permission
+                );
+            }
+        }
+    }
+
+ // Usage: $PROGRAM <GRAPH_BASENAME> <DIR_SWHID>
+ public static void main(String[] args) {
+ SwhUnidirectionalGraph g = SwhUnidirectionalGraph.loadLabelledMapped(args[0]);
+ g.loadLabelNames();
+ long dirNode = g.getNodeID(new SWHID(args[1]));
+ printEntries(g, dirNode);
+ }
+
+
+Transposed graph
+----------------
+
+Up until now, we have only looked at how to traverse the *forward* graph, i.e.,
+the directed graph whose edges are in the same direction as the Merkle DAG of
+the Software Heritage archive.
+For many purposes, especially that of finding the *provenance* of software
+artifacts, it is useful to query the *backward* (or *transposed*) graph
+instead, which is the same as the forward graph except all the edges are
+reversed.
+
+The transposed graph has its own set of files, counterparts to the files needed
+for the forward graph:
+
+- ``graph-transposed.graph``
+- ``graph-transposed.properties``
+- ``graph-transposed.offsets``
+- ``graph-transposed.obl``
+- ``graph-transposed-labelled.labels``
+- ``graph-transposed-labelled.labeloffsets``
+- ``graph-transposed-labelled.properties``
+
+However, because node IDs are the same in the forward and the backward graph,
+all the files that pertain to mappings between the node IDs and various
+properties (SWHIDs, property data, node permutations etc) remain the same.
+
+
+Example: Earliest revision containing a given blob
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The following code loads all the committer timestamps of the revisions in the
+graph, then walks the *transposed* graph to return the earliest revision
+containing a given object.
+
+.. code-block:: java
+
+ public static long findEarliestRevisionContaining(SwhUnidirectionalGraph g, long src) {
+ long oldestRev = -1;
+ long oldestRevTs = Long.MAX_VALUE;
+
+ Stack<Long> stack = new Stack<>();
+ HashSet<Long> visited = new HashSet<Long>();
+ stack.push(src);
+ visited.add(src);
+ while (!stack.isEmpty()) {
+ long currentNodeId = stack.pop();
+            LazyLongIterator it = g.successors(currentNodeId);
+ for (long neighborNodeId; (neighborNodeId = it.nextLong()) != -1;) {
+ if (!visited.contains(neighborNodeId)) {
+ stack.push(neighborNodeId);
+ visited.add(neighborNodeId);
+ if (g.getNodeType(neighborNodeId) == Node.Type.REV) {
+ Long ts = g.getCommitterTimestamp(neighborNodeId);
+ if (ts != null && ts < oldestRevTs) {
+ oldestRev = neighborNodeId;
+ oldestRevTs = ts;
+ }
+ }
+ }
+ }
+ }
+ return oldestRev;
+ }
+
+ // Usage: $PROGRAM <GRAPH_BASENAME> <BLOB_SWHID>
+ public static void main(String[] args) {
+ // Load the backward (= transposed) graph as a SwhUnidirectionalGraph
+ SwhUnidirectionalGraph g = SwhUnidirectionalGraph.loadMapped(args[0] + "-transposed");
+ g.loadCommitterTimestamps();
+ long node = g.getNodeID(new SWHID(args[1]));
+ long oldestRev = findEarliestRevisionContaining(g, node);
+ System.out.println(g.getSWHID(oldestRev));
+ }
+
+
+
+
+Bidirectional Graph
+-------------------
+
+
+BidirectionalImmutableGraph
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+While ``graph-transposed`` can be loaded as a simple SwhUnidirectionalGraph and
+then manipulated just like the forward graph, it is often convenient to have
+*both* the forward and the backward graph in memory. Some traversal algorithms
+require first going down in the forward graph to select some nodes, then going
+up to find their provenance.
+
+To achieve that, we use the `BidirectionalImmutableGraph
+<https://webgraph.di.unimi.it/docs-big/it/unimi/dsi/big/webgraph/BidirectionalImmutableGraph.html>`_
+class from WebGraph, which stores both a graph and its transpose.
+This class provides the following methods to iterate on the **backward** graph,
+shown here with their counterparts for the forward graph:
+
+.. list-table::
+ :header-rows: 1
+
+ * - Forward graph operation
+ - Backward graph operation
+
+ * - ``outdegree(k)``
+ - ``indegree(k)``
+
+ * - ``successors(k)``
+ - ``predecessors(k)``
+
+In addition, the class offers a few convenience methods which are generally
+useful when you have both a graph and its transpose:
+
+- ``transpose()`` returns the transpose of the BidirectionalImmutableGraph by
+ inverting the references to the forward and the backward graphs. Successors
+ become predecessors, and vice-versa.
+- ``symmetrize()`` returns the symmetric (= undirected) version of the
+  bidirectional graph. It is implemented as the union of the forward and the
+  backward graph, which effectively removes the directionality of the edges
+  (the successors of a node are also its predecessors).
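+
+As a minimal illustration of these operations (``compareDirections`` is a
+hypothetical helper written for this page, not a library method), the
+following sketch contrasts the two directions of a given node:
+
+.. code-block:: java
+
+    public static void compareDirections(BidirectionalImmutableGraph g, long nodeId) {
+        // Arcs going out of the node (forward direction) vs. arcs pointing
+        // to it (backward direction).
+        long out = g.outdegree(nodeId);
+        long in = g.indegree(nodeId);
+
+        // successors() iterates on the forward graph, predecessors() on the
+        // backward graph.
+        LazyLongIterator it = g.predecessors(nodeId);
+        for (long pred; (pred = it.nextLong()) != -1;) {
+            System.out.println(pred + " -> " + nodeId);
+        }
+
+        // transpose() swaps the two directions: successors become
+        // predecessors, and the degrees are exchanged accordingly.
+        BidirectionalImmutableGraph reversed = g.transpose();
+        assert reversed.outdegree(nodeId) == in;
+        assert reversed.indegree(nodeId) == out;
+    }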
+
+
+SwhBidirectionalGraph
+~~~~~~~~~~~~~~~~~~~~~
+
+As with ImmutableGraph, we extend BidirectionalImmutableGraph with
+SWH-specific methods in the subclass ``SwhBidirectionalGraph``. Notably, it
+contains the method ``labelledPredecessors()``, the equivalent of
+``labelledSuccessors()`` but on the backward graph.
+
+Because SwhUnidirectionalGraph inherits from ImmutableGraph, and
+SwhBidirectionalGraph inherits from BidirectionalImmutableGraph, we put the
+behavior common to the two classes in a SwhGraph interface, which can
+represent either a unidirectional or a bidirectional graph.
+
+To avoid loading the node properties twice (once for each direction), they
+are stored in a separate class called SwhGraphProperties. In a
+SwhBidirectionalGraph, the two SwhUnidirectionalGraph instances share their
+node properties in memory by storing references to the same
+SwhGraphProperties object.
+
+.. code-block:: text
+
+
+ ┌──────────────┐
+ │ImmutableGraph◄────────┐
+ └────▲─────────┘ │extends
+ │ │
+ │ ┌──────────┴────────────────┐
+ extends│ │BidirectionalImmutableGraph│
+ │ └────────────▲──────────────┘
+ │ │extends
+ ┌──────────────┴───────┐ ┌──────┴──────────────┐
+ │SwhUnidirectionalGraph│◄────┤SwhBidirectionalGraph│
+ └──┬──────────────┬────┘ └────────┬───────────┬┘
+ │ │ contains x2 │ │
+ │ │ │ │
+ │ implements│ │implements │
+ │ ┌▼──────────┐ │ │
+ │ │SwhGraph(I)◄────────┘ │
+ contains│ └───────────┘ │contains
+ │ │
+ │ ┌──────────────────┐ │
+ └────────────►SwhGraphProperties◄──────────────┘
+ └──────────────────┘
+
+
+Example: Find all the shared-commit forks of a given origin
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The *forks* of an origin can be defined as the set of origins that share at
+least one revision with it.
+
+The following code loads the graph in both directions using a
+SwhBidirectionalGraph. Given an origin SWHID, it first walks the *forward*
+graph to find all its root revisions. It then walks the *backward* graph to
+find all the origins containing these root revisions, i.e., its *forks*.
+
+.. code-block:: java
+
+    public static void findSharedCommitForks(SwhBidirectionalGraph g, long srcOrigin) {
+ Stack<Long> forwardStack = new Stack<>();
+ HashSet<Long> forwardVisited = new HashSet<Long>();
+ Stack<Long> backwardStack = new Stack<>();
+ HashSet<Long> backwardVisited = new HashSet<Long>();
+
+ // First traversal (forward graph): find all the root revisions of the
+ // origin
+ forwardStack.push(srcOrigin);
+ forwardVisited.add(srcOrigin);
+ while (!forwardStack.isEmpty()) {
+ long curr = forwardStack.pop();
+            LazyLongIterator it = g.successors(curr);
+ boolean isRootRevision = true;
+ for (long succ; (succ = it.nextLong()) != -1;) {
+ Node.Type nt = g.getNodeType(succ);
+ if (!forwardVisited.contains(succ)
+ && nt != Node.Type.DIR && nt != Node.Type.CNT) {
+ forwardStack.push(succ);
+ forwardVisited.add(succ);
+ isRootRevision = false;
+ }
+ }
+ if (g.getNodeType(curr) == Node.Type.REV && isRootRevision) {
+ // Found a root revision, add it to the second stack
+ backwardStack.push(curr);
+ backwardVisited.add(curr);
+ }
+ }
+
+ // Second traversal (backward graph): find all the origins containing
+ // any of these root revisions and print them
+ while (!backwardStack.isEmpty()) {
+ long curr = backwardStack.pop();
+            LazyLongIterator it = g.predecessors(curr);
+            for (long pred; (pred = it.nextLong()) != -1;) {
+                Node.Type nt = g.getNodeType(pred);
+                if (!backwardVisited.contains(pred)) {
+                    backwardStack.push(pred);
+                    backwardVisited.add(pred);
+                    if (nt == Node.Type.ORI) {
+                        // Found an origin, print it.
+                        System.out.println(g.getSWHID(pred));
+                    }
+                }
+            }
+ }
+ }
+
+ // Usage: $PROGRAM <GRAPH_BASENAME> <ORI_SWHID>
+ public static void main(String[] args) {
+ // Load both forward and backward graphs as a SwhBidirectionalGraph
+ SwhBidirectionalGraph g = SwhBidirectionalGraph.loadMapped(args[0]);
+ long node = g.getNodeID(new SWHID(args[1]));
+ findSharedCommitForks(g, node);
+ }
+
+
+Large-scale processing
+----------------------
+
+Multithreading
+~~~~~~~~~~~~~~
+
+ImmutableGraph is not thread-safe. When writing multithreaded algorithms,
+calling ``successors()`` on the same graph from multiple threads will return
+garbage.
+
+Instead, each thread should create its own "lightweight copy" of the graph by
+calling ``.copy()``. This will not actually copy the entire graph data, which
+will remain shared across threads, but it will create new instances of the
+iterators so that each thread can independently iterate on the graph data.
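+
+Here is a sketch of that pattern (the number of threads and the per-thread
+workload are arbitrary, purely for illustration):
+
+.. code-block:: java
+
+    public static void countArcsInParallel(SwhUnidirectionalGraph g) throws InterruptedException {
+        int numThreads = 4; // arbitrary
+        Thread[] threads = new Thread[numThreads];
+        for (int t = 0; t < numThreads; t++) {
+            final int threadId = t;
+            threads[t] = new Thread(() -> {
+                // Lightweight copy: the underlying graph data stays shared,
+                // but this thread gets its own iterators.
+                ImmutableGraph localGraph = g.copy();
+                long count = 0;
+                for (long node = threadId; node < localGraph.numNodes(); node += numThreads) {
+                    count += localGraph.outdegree(node);
+                }
+                System.out.println("Thread " + threadId + " counted " + count + " arcs");
+            });
+            threads[t].start();
+        }
+        for (Thread thread : threads) {
+            thread.join();
+        }
+    }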
+
+
+Data structures for large traversals
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When doing very large traversals, such as a BFS on the entire graph, the
+usual data structures (HashSet, Stack, ArrayDeque, etc.) will be quite
+inefficient. If you know you are going to traverse large parts of the graph,
+it's better to use more appropriate data structures, many of which can be
+found in the dsiutils library. In particular:
+
+- `LongArrayBitVector
+ <https://dsiutils.di.unimi.it/docs/it/unimi/dsi/bits/LongArrayBitVector.html>`_
+ is an efficient bit-vector implementation, which can be used to store the
+  nodes that have already been seen during the visit. Its memory footprint is
+  too large for small traversals, but it is very efficient when traversing the
+  full graph, as every node only takes a single bit.
+
+- `ByteDiskQueue
+ <https://dsiutils.di.unimi.it/docs/it/unimi/dsi/io/ByteDiskQueue.html>`_ can
+ be used to efficiently store the queue of nodes to visit on disk, when it is
+ too large to fit in RAM.
+
+Other types in dsiutils and fastutil can save significant memory:
+``LongArrayList`` saves at least 8 bytes per entry over ``ArrayList<Long>``,
+and ``Long2LongOpenHashMap`` saves at least 16 bytes for every entry over
+``HashMap<Long, Long>``. We strongly recommend reading the documentation of the
+unimi libraries and looking at the code for usage examples.
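+
+As an illustration, here is a sketch of a full-graph reachability count using
+``LongArrayBitVector`` as the visited set and a fastutil ``LongArrayList`` as
+the traversal stack (the method itself is hypothetical, written in the style
+of the earlier examples):
+
+.. code-block:: java
+
+    public static long countReachableNodes(SwhUnidirectionalGraph g, long srcNode) {
+        // One bit per node of the graph: far more compact than a HashSet<Long>
+        // when a large fraction of the graph will be visited.
+        LongArrayBitVector visited = LongArrayBitVector.ofLength(g.numNodes());
+        LongArrayList stack = new LongArrayList();
+
+        stack.push(srcNode);
+        visited.set(srcNode);
+        long count = 0;
+        while (!stack.isEmpty()) {
+            long currentNodeId = stack.popLong();
+            count++;
+            LazyLongIterator it = g.successors(currentNodeId);
+            for (long neighborNodeId; (neighborNodeId = it.nextLong()) != -1;) {
+                if (!visited.getBoolean(neighborNodeId)) {
+                    visited.set(neighborNodeId);
+                    stack.push(neighborNodeId);
+                }
+            }
+        }
+        return count;
+    }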
+
+
+BigArrays
+~~~~~~~~~
+
+When working with the Software Heritage graph, it is often necessary to store
+large arrays of values, with a size exceeding 2^32 items. Unfortunately,
+standard Java arrays do not support this, as they are limited to 2^31 - 1
+elements.
+
+To circumvent this, WebGraph uses the `BigArrays scheme
+<https://fastutil.di.unimi.it/docs/it/unimi/dsi/fastutil/BigArrays.html>`_ from
+the fastutil library: "big arrays" are stored as arrays of arrays, supporting
+quadrillions of records.
+
+A BigArray ``long[][] a`` can be used with the following methods:
+
+- ``BigArrays.get(a, i)`` to get the value at index *i*.
+- ``BigArrays.set(a, i, v)`` to set the value at index *i* to *v*.
+- ``BigArrays.length(a)`` to get the total length of the bigarray.
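+
+For instance (a small, self-contained snippet; the allocation goes through the
+fastutil helper ``LongBigArrays.newBigArray``, and a deliberately modest size
+is used so that it runs anywhere):
+
+.. code-block:: java
+
+    // A big array may exceed the 2^31 - 1 length limit of standard Java
+    // arrays; a small size is used here for illustration only.
+    long size = 1_000_000L;
+    long[][] a = LongBigArrays.newBigArray(size);
+
+    BigArrays.set(a, size - 1, 42L);                 // write at index size - 1
+    System.out.println(BigArrays.get(a, size - 1));  // prints 42
+    System.out.println(BigArrays.length(a));         // prints 1000000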
diff --git a/docs/memory.rst b/docs/memory.rst
new file mode 100644
--- /dev/null
+++ b/docs/memory.rst
@@ -0,0 +1,130 @@
+.. _swh-graph-memory:
+
+Memory & Performance tuning
+===========================
+
+This page discusses various considerations related to memory usage and
+performance tuning when using the ``swh-graph`` library to load large
+compressed graphs.
+
+JVM options
+-----------
+
+In production, we tend to use very large servers with enough RAM to hold the
+entire graph. In these setups, the default JVM options are often suboptimal.
+We recommend starting the JVM with the following options, which tend to
+significantly improve performance::
+
+ java \
+ -ea \
+ -server \
+ -XX:PretenureSizeThreshold=512M \
+ -XX:MaxNewSize=4G \
+ -XX:+UseLargePages \
+ -XX:+UseTransparentHugePages \
+ -XX:+UseNUMA \
+ -XX:+UseTLAB \
+ -XX:+ResizeTLAB \
+
+These options are documented in the ``java(1)`` manual page and in the Oracle
+documentation.
+
+
+Temporary directory
+-------------------
+
+Many of the graph algorithms (either for compression or traversal) tend to
+offload some of their run-time memory to disk. For instance, the `BFS
+<https://law.di.unimi.it/software/law-docs/it/unimi/dsi/law/big/graph/BFS.html>`_
+algorithm in the LAW library uses a temporary directory to write its queue of
+nodes to visit.
+
+Because these temporary files can be quite large and sometimes overflow the
+default ``/tmp`` partition, it is advised to systematically specify a path to a
+local temporary directory with enough space to accommodate the needs of the
+Java programs. This can be done using the ``-Djava.io.tmpdir`` parameter on the
+Java CLI::
+
+ java -Djava.io.tmpdir=/srv/softwareheritage/ssd/tmp
+
+
+Memory mapping vs Direct loading
+--------------------------------
+
+The main dial you can use to manage your memory usage is to choose between
+memory-mapping and direct-loading the graph data. The different loading modes
+available when loading the graph are documented in :ref:`swh-graph-java-api`.
+
+Loading in mapped mode will not load any extra data in RAM, but will instead
+use the ``mmap(2)`` syscall to map the on-disk graph file into the virtual
+address space. The Linux kernel is then free to cache the file, either
+partially or in its entirety, depending on the available memory space.
+
+In our experiments, memory-mapping a small graph from an SSD only incurs a
+relatively small slowdown (about 15-20%). However, when the graph is too big to
+fit in RAM, the kernel has to constantly invalidate pages to cache newly
+accessed sections, which incurs a very large performance penalty. A full
+traversal of a large graph that usually takes about 20 hours when loaded in
+main memory could take more than a year when mapped from a hard drive!
+
+When deciding what to direct-load and what to memory-map, here are a few rules
+of thumb:
+
+- If you don't need random access to the graph edges, you can consider using
+ the "offline" loading mode. The offsets won't be loaded which will save
+ dozens of gigabytes of RAM.
+
+- If you only need to query some specific nodes or run trivial traversals,
+  memory-mapping the graph from an HDD should be a reasonable solution that
+ doesn't take an inordinate amount of time. It might be bad for your disks,
+ though.
+
+- If you are constrained in available RAM, memory-mapping the graph from an SSD
+ offers reasonable performance for reasonably complex algorithms.
+
+- If you have a heavy workload (e.g. running a full traversal of the entire
+ graph) and you can afford the RAM, direct loading will be orders of magnitude
+ faster than all the above options.
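+
+To make these trade-offs concrete, here is a short sketch contrasting the two
+loading styles. ``loadMapped`` is the method used throughout the examples of
+:ref:`swh-graph-java-api`; the direct-loading counterpart is assumed here to
+be named ``load``, so refer to that page for the exact list of available
+loading methods:
+
+.. code-block:: java
+
+    String basename = "compressed/graph";  // path prefix of the graph files
+
+    // Memory-mapped: almost no RAM used up front; performance depends on the
+    // kernel page cache and on the underlying storage (SSD vs. HDD).
+    SwhUnidirectionalGraph mapped = SwhUnidirectionalGraph.loadMapped(basename);
+
+    // Direct loading (assumed method name): the whole graph is read into RAM
+    // before returning; much faster for heavy workloads, at the cost of RAM.
+    SwhUnidirectionalGraph direct = SwhUnidirectionalGraph.load(basename);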
+
+
+Sharing mapped data across processes
+------------------------------------
+
+Often, multiple processes can be working on the same data (mappings or the
+graph itself), for instance when running different experiments on the same
+graph. This is problematic in terms of RAM usage when using direct memory
+loading, as the same data, potentially hundreds of gigabytes, is loaded in
+memory twice.
+As we have seen, memory-mapping can be used to avoid storing redundant data in
+RAM, but comes at the cost of potentially slower I/O as the data is no longer
+guaranteed to be loaded in main memory and is reliant on kernel heuristics.
+
+To efficiently share data across two different compressed graph processes,
+another option is to copy graph data to a ``tmpfs`` not backed by a disk swap,
+which forces the kernel to load it entirely in RAM. Subsequent memory-mappings
+of the files stored in the tmpfs will simply map the data stored in RAM to
+virtual memory pages, and return a pointer to the in-memory structure.
+
+To do so, we create a directory in ``/dev/shm`` in which we **copy** all the
+files that we want to direct-load in RAM, and **symlink** all the others. Then,
+we load the graph using the memory-mapped loading mode, which makes it use the
+shared memory stored in the tmpfs under the hood.
+
+Here is a systemd service that can be used to perform this task automatically:
+
+.. code-block:: ini
+
+ [Unit]
+ Description=swh-graph memory sharing in tmpfs
+
+ [Service]
+ Type=oneshot
+ RemainAfterExit=yes
+ ExecStart=mkdir -p /dev/shm/swh-graph/default
+ ExecStart=sh -c "ln -s /.../compressed/* /dev/shm/swh-graph/default"
+ ExecStart=cp --remove-destination /.../compressed/graph.graph /dev/shm/swh-graph/default
+ ExecStart=cp --remove-destination /.../compressed/graph-transposed.graph /dev/shm/swh-graph/default
+ ExecStop=rm -rf /dev/shm/swh-graph/default
+
+ [Install]
+ WantedBy=multi-user.target
diff --git a/docs/quickstart.rst b/docs/quickstart.rst
--- a/docs/quickstart.rst
+++ b/docs/quickstart.rst
@@ -1,51 +1,38 @@
+.. _swh-graph-quickstart:
+
Quickstart
==========
-This quick tutorial shows how to compress and browse a graph using ``swh.graph``.
-
-It does not cover the technical details behind the graph compression techniques
-(refer to :ref:`graph-compression`).
-
+This quick tutorial shows how to start the ``swh.graph`` service to query
+an existing compressed graph with the high-level HTTP API.
Dependencies
------------
In order to run the ``swh.graph`` tool, you will need Python (>= 3.7) and Java
-JRE, you do not need the JDK if you install the package from pypi, but may want
-to install it if you want to hack the code or install it from this git
-repository. To compress a graph, you will need zstd_ compression tools.
-
-It is highly recommended to install this package in a virtualenv.
-
-On a Debian stable (buster) system:
-
-.. code:: bash
-
- $ sudo apt install python3-virtualenv default-jre zstd
+JRE. On a Debian system:
+.. code:: console
-.. _zstd: https://facebook.github.io/zstd/
+ $ sudo apt install python3 python3-venv default-jre
-
-Install
--------
+Installing swh.graph
+--------------------
Create a virtualenv and activate it:
-.. code:: bash
+.. code:: console
- ~/tmp$ mkdir swh-graph-tests
- ~/tmp$ cd swh-graph-tests
- ~/t/swh-graph-tests$ virtualenv swhenv
- ~/t/swh-graph-tests$ . swhenv/bin/activate
+ $ python3 -m venv .venv
+ $ source .venv/bin/activate
Install the ``swh.graph`` python package:
-.. code:: bash
+.. code:: console
- (swhenv) ~/t/swh-graph-tests$ pip install swh.graph
+ (venv) $ pip install swh.graph
[...]
- (swhenv) ~/t/swh-graph-tests swh graph --help
+ (venv) $ swh graph --help
Usage: swh graph [OPTIONS] COMMAND [ARGS]...
Software Heritage graph tools.
@@ -55,106 +42,77 @@
-h, --help Show this message and exit.
Commands:
- api-client client for the graph RPC service
- cachemount Cache the mmapped files of the compressed graph in a tmpfs.
compress Compress a graph using WebGraph Input: a pair of files...
- map Manage swh-graph on-disk maps
rpc-serve run the graph RPC service
-Compression
------------
-
-Existing datasets
-^^^^^^^^^^^^^^^^^
-
-You can directly use compressed graph datasets provided by Software Heritage.
-Here is a small and realistic dataset (3.1GB):
-
- https://annex.softwareheritage.org/public/dataset/graph/latest/popular-3k-python/python3kcompress.tar
-
-.. code:: bash
- (swhenv) ~/t/swh-graph-tests$ curl -O https://annex.softwareheritage.org/public/dataset/graph/latest/popular-3k-python/python3kcompress.tar
- (swhenv) ~/t/swh-graph-tests$ tar xvf python3kcompress.tar
- (swhenv) ~/t/swh-graph-tests$ touch python3kcompress/*.obl # fix the mtime of cached offset files to allow faster loading
+.. _swh-graph-retrieving-compressed:
-Note: not for the faint heart, but the full dataset is available at:
+Retrieving a compressed graph
+-----------------------------
- https://annex.softwareheritage.org/public/dataset/graph/latest/compressed/
+Software Heritage provides a list of off-the-shelf datasets that can be used
+for various research or prototyping purposes. Most of them are available in
+*compressed* representation, i.e., in a format suitable to be loaded and
+queried by the ``swh-graph`` library.
-Own datasets
-^^^^^^^^^^^^
+All the publicly available datasets are documented on this page:
+https://docs.softwareheritage.org/devel/swh-dataset/graph/dataset.html
-A graph is described as both its adjacency list and the set of nodes
-identifiers in plain text format. Such graph example can be found in the
-``swh/graph/tests/dataset/`` folder.
+A good way of retrieving these datasets is to use the `AWS S3 CLI
+<https://docs.aws.amazon.com/cli/latest/reference/s3/>`_.
-You can compress the example graph on the command line like this:
+Here is an example with the dataset ``2021-03-23-popular-3k-python``, which has
+a relatively reasonable size (~15 GiB including property data, with
+the compressed graph itself being less than 700 MiB):
-.. code:: bash
+.. code:: console
+ (venv) $ pip install awscli
+ [...]
+ (venv) $ mkdir -p 2021-03-23-popular-3k-python/compressed
+ (venv) $ cd 2021-03-23-popular-3k-python/
+ (venv) $ aws s3 cp --recursive s3://softwareheritage/graph/2021-03-23-popular-3k-python/compressed/ compressed
- (swhenv) ~/t/swh-graph-tests$ swh graph compress --graph swh/graph/tests/dataset/example --outdir output/
- [...]
+You can also retrieve larger graphs, but note that these graphs are generally
+intended to be loaded fully in RAM, and do not fit on ordinary desktop
+machines. The server we use in production to run the graph service has more
+than 700 GiB of RAM. These memory considerations are discussed in more detail
+in :ref:`swh-graph-memory`.
- (swhenv) ~/t/swh-graph-tests$ ls output/
- example-bv.properties example.mph example.obl example.outdegree example.swhid2node.bin example-transposed.offsets
- example.graph example.node2swhid.bin example.offsets example.properties example-transposed.graph example-transposed.properties
- example.indegree example.node2type.map example.order example.stats example-transposed.obl
+**Note:** for testing purposes, a fake test dataset is available in the
+``swh-graph`` repository, with just a few dozen nodes. Its basename is
+``swh-graph/swh/graph/tests/dataset/compressed/example``.
API server
----------
-To start a ``swh.graph`` API server of a compressed graph dataset, run:
+To start a ``swh.graph`` API server of a compressed graph dataset, you need to
+use the ``rpc-serve`` command with the basename of the graph, which is the path prefix
+of all the graph files (e.g., with the basename ``compressed/graph``, it will
+attempt to load the files located at
+``compressed/graph.{graph,properties,offsets,...}``).
-.. code:: bash
+In our example:
+
+.. code:: console
- (swhenv) ~/t/swh-graph-tests$ swh graph rpc-serve -g output/example
- Loading graph output/example ...
+ (venv) $ swh graph rpc-serve -g compressed/graph
+ Loading graph compressed/graph ...
Graph loaded.
======== Running on http://0.0.0.0:5009 ========
(Press CTRL+C to quit)
From there you can use this endpoint to query the compressed graph, for example
-with httpie_ (``sudo apt install``) from another terminal:
+with httpie_ (``sudo apt install httpie``):
.. _httpie: https://httpie.org
.. code:: bash
- ~/tmp$ http :5009/graph/visit/nodes/swh:1:rel:0000000000000000000000000000000000000010
- HTTP/1.1 200 OK
- Content-Type: text/plain
- Date: Tue, 15 Sep 2020 08:33:25 GMT
- Server: Python/3.8 aiohttp/3.6.2
- Transfer-Encoding: chunked
-
- swh:1:rel:0000000000000000000000000000000000000010
- swh:1:rev:0000000000000000000000000000000000000009
- swh:1:rev:0000000000000000000000000000000000000003
- swh:1:dir:0000000000000000000000000000000000000002
- swh:1:cnt:0000000000000000000000000000000000000001
- swh:1:dir:0000000000000000000000000000000000000008
- swh:1:dir:0000000000000000000000000000000000000006
- swh:1:cnt:0000000000000000000000000000000000000004
- swh:1:cnt:0000000000000000000000000000000000000005
- swh:1:cnt:0000000000000000000000000000000000000007
-
-
-Running the existing ``python3kcompress`` dataset:
-
-.. code:: bash
-
- (swhenv) ~/t/swh-graph-tests$ swh graph rpc-serve -g python3kcompress/python3k
- Loading graph python3kcompress/python3k ...
- Graph loaded.
- ======== Running on http://0.0.0.0:5009 ========
- (Press CTRL+C to quit)
-
-
~/tmp$ http :5009/graph/leaves/swh:1:dir:432d1b21c1256f7408a07c577b6974bbdbcc1323
HTTP/1.1 200 OK
Content-Type: text/plain
@@ -170,5 +128,5 @@
swh:1:cnt:a6b60e797063fef707bbaa4f90cfb4a2cbbddd4a
swh:1:cnt:cc0a1deca559c1dd2240c08156d31cde1d8ed406
-
-See the documentation of the :ref:`API <swh-graph-api>` for more details.
+See the documentation of the :ref:`API <swh-graph-api>` for more details on how
+to use the HTTP graph querying API.
diff --git a/docs/use-cases.rst b/docs/use-cases.rst
deleted file mode 100644
--- a/docs/use-cases.rst
+++ /dev/null
@@ -1,167 +0,0 @@
-=========
-Use cases
-=========
-
-
-This document lists use cases and benchmark scenarios for the Software Heritage
-graph service.
-
-
-Conventions
-===========
-
-- **Node identification**: in the following, nodes are always identified by
- their :ref:`SWHIDs <persistent-identifiers>`.
-
-
-Use cases
-=========
-
-
-Browsing
---------
-
-The following use cases require traversing the *forward graph*.
-
-- **ls**: given a directory node, list (non recursively) all linked nodes of
- type directory and content
-
- Implementation::
-
- /graph/neighbors/:DIR_ID?edges=dir:cnt,dir:dir
-
-- **ls -R**: given a directory node, recursively list all linked nodes of type
- directory and content
-
- Implementation::
-
- /graph/visit/paths/:DIR_ID?edges=dir:cnt,dir:dir
-
-- **git log**: given a revision node, recursively list all linked nodes of type
- revision
-
- Implementation::
-
- /graph/visit/nodes/:REV_ID?edges=rev:rev
-
-
-Vault
------
-
-The following use cases require traversing the *forward graph*.
-
-- **tarball** (same as *ls -R* above)
-
-- **git bundle**: given a node, recursively list all linked nodes of any kind
-
- Implementation::
-
- /graph/visit/nodes/:NODE_ID?edges=*
-
-
-Provenance
-----------
-
-The following use cases require traversing the *backward (transposed)
-graph*.
-
-- **commit provenance**: given a content or directory node, return *a* commit
- whose directory (recursively) contains it
-
- Implementation::
-
- /graph/walk/:NODE_ID/rev?direction=backward&edges=dir:dir,cnt:dir,dir:rev
-
-- **complete commit provenance**: given a content or directory node, return
- *all* commits whose directory (recursively) contains it
-
- Implementation::
-
- /graph/leaves/:NODE_ID?direction=backward&edges=dir:dir,cnt:dir,dir:rev
-
-- **origin provenance**: given a content, directory, or commit node, return
- *an* origin that has at least one snapshot that (recursively) contains it
-
- Implementation::
-
- /graph/walk/:NODE_ID/ori?direction=backward&edges=*
-
-- **complete origin provenance**: given a content, directory, or commit node,
- return *all* origins that have at least one snapshot that (recursively)
- contains it
-
- Implementation::
-
- /graph/leaves/:NODE_ID?direction=backward&edges=*
-
-- *SLOC tracking*: left as future work
-
-
-Provenance statistics
-~~~~~~~~~~~~~~~~~~~~~
-
-The following use cases require traversing the *backward (transposed)
-graph*.
-
-- **content popularity across commits**: for each content, count the number of
- commits (or *commit popularity*) that link to a directory that (recursively)
- includes it. Plot the distribution of content popularity across commits.
-
- Implementation: apply *complete commit provenance* to each content node,
- count the returned commits, aggregate.
-
-- **commit popularity across origins**: for each commit, count the number of
- origins (or *origin popularity*) that have a snapshot that (recursively)
- includes it. Plot the distribution of commit popularity across origins.
-
- Implementation: apply *complete origin provenance* to each commit node, count
- the returned origins, aggregate.
-
-- *SLOC popularity across contents*: left as future work
-
-The following use cases require traversing the *forward graph*.
-
-- **revision size** (as n. of contents) distribution: for each revision, count
- the number of contents that are (recursively) reachable from it. Plot the
- distribution of revision sizes.
-
-- **origin size** (as n. of revisions) distribution: for each origin, count the
- number of revisions that are (recursively) reachable from it. Plot the
- distribution of origin sizes.
-
-
-Benchmarks
-==========
-
-Notes on how to benchmark graph access:
-
-- separate pure-graph timings from timings related to use additional mappings
- (e.g., node types), no matter if the mappings are in-memory or on-disk
-
-- separate in-memory timings from on-disk timings; in particular, separate the
- timing of translating node identifiers between internal integers and SWHIDs
-
-- for each use case that requires a node as input, we will randomize the choice
- of the input node and repeat the experiment a suitable number of times; where
- possible we will aggregate results computing basic statistics (average,
- standard deviation), as well as normalize results w.r.t. the “size” of the
- chosen node (e.g., number of nodes/path length in the resulting visit)
-
-
-Basic benchmarks
-----------------
-
-- **Edge traversal**: given a node, retrieve the first node in its adjacency
- list.
-
- For reference: Apostolico, Drovandi in *Graph Compression by BFS* report
- times to retrieve the adjacency list of a node (and/or test if an edge exists
- between two nodes) in the 2-3 us range, for the largest graph in their
- experiments (22 M nodes, 600 M edges).
-
-
-Each use case is a benchmark
-----------------------------
-
-In addition to abstract benchmark, we will use each use case above as a
-scenario-based benchmark.
diff --git a/java/src/main/java/org/softwareheritage/graph/compress/NodeMapBuilder.java b/java/src/main/java/org/softwareheritage/graph/compress/NodeMapBuilder.java
--- a/java/src/main/java/org/softwareheritage/graph/compress/NodeMapBuilder.java
+++ b/java/src/main/java/org/softwareheritage/graph/compress/NodeMapBuilder.java
@@ -27,8 +27,7 @@
* Create maps needed at runtime by the graph service, in particular:
* <p>
* <ul>
- * <li>SWHID → WebGraph long node id</li>
- * <li>WebGraph long node id → SWHID (converse of the former)</li>
+ * <li>WebGraph long node id → SWHID</li>
* <li>WebGraph long node id → SWH node type (enum)</li>
* </ul>
*
diff --git a/swh/graph/webgraph.py b/swh/graph/webgraph.py
--- a/swh/graph/webgraph.py
+++ b/swh/graph/webgraph.py
@@ -179,6 +179,14 @@
"{tmp_dir}",
"< {out_dir}/{graph_name}.nodes.csv.zst",
],
+ CompressionStep.EXTRACT_PERSONS: [
+ "{java}",
+ "org.softwareheritage.graph.compress.ExtractPersons",
+ "--temp-dir",
+ "{tmp_dir}",
+ "{in_dir}",
+ "{out_dir}/{graph_name}",
+ ],
CompressionStep.MPH_PERSONS: [
"{java}",
"it.unimi.dsi.sux4j.mph.GOVMinimalPerfectHashFunction",
@@ -190,14 +198,6 @@
"{out_dir}/{graph_name}.persons.mph",
"{out_dir}/{graph_name}.persons.csv.zst",
],
- CompressionStep.EXTRACT_PERSONS: [
- "{java}",
- "org.softwareheritage.graph.compress.ExtractPersons",
- "--temp-dir",
- "{tmp_dir}",
- "{in_dir}",
- "{out_dir}/{graph_name}",
- ],
CompressionStep.NODE_PROPERTIES: [
"{java}",
"org.softwareheritage.graph.compress.WriteNodeProperties",