D7839.diff

diff --git a/docs/api.rst b/docs/api.rst
--- a/docs/api.rst
+++ b/docs/api.rst
@@ -1,7 +1,16 @@
.. _swh-graph-api:
-Graph RPC API
-=============
+Graph Querying HTTP API
+=======================
+
+The Graph Querying API is a high-level HTTP API intended to run common,
+relatively simple traversal queries on the compressed graph.
+
+The client/server architecture allows it to only load the graph in memory once
+then serve multiple different requests. However, it is limited in expressivity;
+more complex or resource-intensive queries should rather use the
+:ref:`Low-level Java API <swh-graph-java-api>` to run them as standalone
+programs.
Terminology
@@ -53,8 +62,11 @@
- ``"cnt,snp"`` accepted node types returned in the query results.
+Endpoints
+---------
+
Leaves
-------
+~~~~~~
.. http:get:: /graph/leaves/:src
@@ -97,7 +109,7 @@
Neighbors
----------
+~~~~~~~~~
.. http:get:: /graph/neighbors/:src
@@ -138,7 +150,7 @@
Walk
-----
+~~~~
..
.. http:get:: /graph/walk/:src/:dst
@@ -246,7 +258,7 @@
Visit
------
+~~~~~
.. http:get:: /graph/visit/nodes/:src
.. http:get:: /graph/visit/edges/:src
@@ -340,7 +352,7 @@
Counting results
-----------------
+~~~~~~~~~~~~~~~~
The following method variants, with trailing `/count` added, behave like their
already discussed counterparts but, instead of returning results, return the
@@ -363,7 +375,7 @@
Stats
------
+~~~~~
.. http:get:: /graph/stats
@@ -405,3 +417,125 @@
"avg": 0.6107127825377487
}
}
+
+
+Use-case examples
+-----------------
+
+This section showcases how to leverage the endpoints of the HTTP API described
+above for some common use-cases.
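+
+All the endpoints below can be queried with any HTTP client. As an
+illustration, here is a minimal Java sketch (using the standard
+``java.net.http`` client) that runs the *ls* query shown below on a directory
+node; the base URL of the graph server and the command-line SWHID argument are
+assumptions to adapt to your own setup:
+
+.. code-block:: java
+
+    import java.net.URI;
+    import java.net.http.HttpClient;
+    import java.net.http.HttpRequest;
+    import java.net.http.HttpResponse;
+
+    public class GraphApiExample {
+        public static void main(String[] args) throws Exception {
+            // Assumed base URL of a locally running graph server.
+            String api = "http://localhost:5009/graph";
+            // List the direct directory/content children of a directory node,
+            // whose SWHID is passed as the first command-line argument.
+            HttpRequest request = HttpRequest.newBuilder()
+                    .uri(URI.create(api + "/neighbors/" + args[0] + "?edges=dir:cnt,dir:dir"))
+                    .GET()
+                    .build();
+            HttpResponse<String> response = HttpClient.newHttpClient()
+                    .send(request, HttpResponse.BodyHandlers.ofString());
+            System.out.println(response.body());
+        }
+    }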
+
+
+Browsing
+~~~~~~~~
+
+The following use cases require traversing the *forward graph*.
+
+- **ls**: given a directory node, list (non recursively) all linked nodes of
+ type directory and content
+
+ Endpoint::
+
+ /graph/neighbors/:DIR_ID?edges=dir:cnt,dir:dir
+
+- **ls -R**: given a directory node, recursively list all linked nodes of type
+ directory and content
+
+ Endpoint::
+
+ /graph/visit/paths/:DIR_ID?edges=dir:cnt,dir:dir
+
+- **git log**: given a revision node, recursively list all linked nodes of type
+ revision
+
+ Endpoint::
+
+ /graph/visit/nodes/:REV_ID?edges=rev:rev
+
+
+Vault
+~~~~~
+
+The following use cases require traversing the *forward graph*.
+
+- **tarball** (same as *ls -R* above)
+
+- **git bundle**: given a node, recursively list all linked nodes of any kind
+
+ Endpoint::
+
+ /graph/visit/nodes/:NODE_ID?edges=*
+
+
+Provenance
+~~~~~~~~~~
+
+The following use cases require traversing the *backward (transposed)
+graph*.
+
+- **commit provenance**: given a content or directory node, return *a* commit
+ whose directory (recursively) contains it
+
+ Endpoint::
+
+ /graph/walk/:NODE_ID/rev?direction=backward&edges=dir:dir,cnt:dir,dir:rev
+
+- **complete commit provenance**: given a content or directory node, return
+ *all* commits whose directory (recursively) contains it
+
+ Endpoint::
+
+ /graph/leaves/:NODE_ID?direction=backward&edges=dir:dir,cnt:dir,dir:rev
+
+- **origin provenance**: given a content, directory, or commit node, return
+ *an* origin that has at least one snapshot that (recursively) contains it
+
+ Endpoint::
+
+ /graph/walk/:NODE_ID/ori?direction=backward&edges=*
+
+- **complete origin provenance**: given a content, directory, or commit node,
+ return *all* origins that have at least one snapshot that (recursively)
+ contains it
+
+ Endpoint::
+
+ /graph/leaves/:NODE_ID?direction=backward&edges=*
+
+
+Provenance statistics
+~~~~~~~~~~~~~~~~~~~~~
+
+The following use cases require traversing the *backward (transposed)
+graph*.
+
+- **content popularity across commits**: count the number of commits (or
+ *commit popularity*) that link to a directory that (recursively) includes a
+ given content.
+
+ Endpoint::
+
+ /graph/count/leaves/:NODE_ID?direction=backward&edges=cnt:dir,dir:dir,dir:rev
+
+- **commit popularity across origins**: count the number of origins (or *origin
+ popularity*) that have a snapshot that (recursively) includes a given commit.
+
+ Endpoint::
+
+ /graph/count/leaves/:NODE_ID?direction=backward&edges=*
+
+The following use cases require traversing the *forward graph*.
+
+- **revision size** (as n. of contents) distribution: the number of contents
+ that are (recursively) reachable from a given revision.
+
+ Endpoint::
+
+ /graph/count/leaves/:NODE_ID?edges=*
+
+- **origin size** (as n. of revisions) distribution: count the number of
+ revisions that are (recursively) reachable from a given origin.
+
+ Endpoint::
+
+ /graph/count/leaves/:NODE_ID?edges=ori:snp,snp:rel,snp:rev,rel:rev,rev:rev
diff --git a/docs/compression.rst b/docs/compression.rst
--- a/docs/compression.rst
+++ b/docs/compression.rst
@@ -1,113 +1,391 @@
.. _graph-compression:
+=================
Graph compression
=================
-The compression process is a pipeline implemented for the most part on top of
-the `WebGraph framework <http://webgraph.di.unimi.it/>`_ and ecosystem
-libraries. The compression pipeline consists of the following steps:
+The compression pipeline is implemented on top of the `WebGraph framework
+<http://webgraph.di.unimi.it/>`_. It takes an ORC Graph Dataset as an input,
+such as the ones found in the :ref:`Graph Dataset List <swh-dataset-list>`,
+and generates a compressed graph suitable for high intensity analyses on
+large servers.
-.. figure:: images/compression_steps.png
- :align: center
- :alt: Compression steps
- Compression steps
+Running the compression pipeline
+================================
+
+Dependencies
+------------
+
+To compress a graph, you will need to install the ``swh.graph`` tool as well as
+a recent JRE, as described in the :ref:`swh-graph-quickstart` page.
+
+You will also need the zstd_ compression tool::
+
+ $ sudo apt install zstd
+
+.. _zstd: https://facebook.github.io/zstd/
+
+
+Hardware Requirements
+---------------------
+
+The compression pipeline is even more demanding than the graph server in terms
+of hardware requirements, especially RAM. Notably, the BFS compression step
+loads a graph compressed in random order in memory, which is usually more than
+a TiB for the full graph. While it is possible to do this step with a memory
+mapping, our experiments show that this could take a very long time (several
+months) on hard drives.
+
+The LLP compression step requires 13 bytes of RAM per node, which could amount
+to storing hundreds of gigabytes in RAM in addition to loading the graph
+itself.
+
+Some steps also involve sorting the entire set of edges and their labels,
+using large on-disk buffer files that can sometimes reach the size of the
+input dataset itself.
+
+The machine we used to compress the entire graph (dataset version 2022-04-25)
+has the following hardware specs:
+
+- 2 TiB of RAM (DDR4 ECC 2400Mhz)
+- 64 vCPUs (Dual AMD EPYC 7302 16-Core)
+- 24 TiB of SSD (NVMe)
+
+The server we rented is from the
+`HGR-HCI-4 <https://www.ovhcloud.com/en/bare-metal/high-grade/hgr-hci-4/>`_
+series from OVH.
+
+
+Input dataset
+-------------
+
+First, you need to retrieve a graph to compress, in ORC format. The :ref:`Graph
+Dataset List <swh-dataset-list>` has a list of datasets made available by the
+Software Heritage archive, including "teaser" subdatasets which have a more
+manageable size and are thus very useful for prototyping with fewer hardware
+resources.
+
+The datasets can be retrieved from S3 or the annex, in a similar fashion to
+what is described in :ref:`swh-graph-retrieving-compressed`, by simply
+replacing "compressed" by "orc":
+
+.. code:: console
+
+ (venv) $ mkdir -p 2021-03-23-popular-3k-python/orc
+ (venv) $ cd 2021-03-23-popular-3k-python/
+ (venv) $ aws s3 cp --recursive s3://softwareheritage/graph/2021-03-23-popular-3k-python/orc/ orc
+
+Alternatively, any custom ORC dataset can be used as long as it respects
+:ref:`the schema <swh-dataset-schema>` of the Software Heritage Graph Dataset.
+
+**Note:** for testing purposes, a fake test dataset is available in the
+``swh-graph`` repository, with just a few dozen nodes. The ORC tables are
+available in ``swh-graph/swh/graph/tests/dataset/orc/``.
+
+
+Compression
+-----------
+
+You can compress your dataset by using the ``swh graph compress`` command. It
+will run all the various steps of the pipeline in the right order.
+
+.. code:: console
-Each of these steps is briefly described below. For more details see the
-following paper:
-.. note::
+ (venv) $ swh graph compress --input-dataset orc/ --outdir compressed/
+ [...]
+ (venv) $ ls compressed/
+ graph.edges.count.txt
+ graph.edges.stats.txt
+ graph.graph
+ graph.indegree
+ graph-labelled.labeloffsets
+ graph-labelled.labels
+ [...]
+ graph-transposed.obl
+ graph-transposed.offsets
+ graph-transposed.properties
- Paolo Boldi, Antoine Pietri, Sebastiano Vigna, Stefano Zacchiroli.
- `Ultra-Large-Scale Repository Analysis via Graph Compression
- <https://upsilon.cc/~zack/research/publications/saner-2020-swh-graph.pdf>`_. In
- proceedings of `SANER 2020 <https://saner2020.csd.uwo.ca/>`_: The 27th IEEE
- International Conference on Software Analysis, Evolution and
- Reengineering. IEEE 2020.
- Links: `preprint
- <https://upsilon.cc/~zack/research/publications/saner-2020-swh-graph.pdf>`_,
- `bibtex
- <https://upsilon.cc/~zack/research/publications/saner-2020-swh-graph.bib>`_.
+(The purpose of each of these files is detailed in the
+:ref:`swh-graph-java-api` page.)
-In order to practically perform graph compression, install the ``swh.graph``
-module and use the ``swh graph compress`` command line interface of the
-compression driver, that will conduct the various steps in the right order.
-See ``swh graph compress --help`` for usage details.
+For sufficiently large graphs, this command can take several weeks. It is highly
+recommended to run it in a systemd service or in a tmux session.
+It is also possible to run single steps or step ranges from the CLI:
-1. MPH
+.. code:: bash
+
+ swh graph compress -i orc/ -o compressed/ --steps mph-bfs
+
+See ``swh graph compress --help`` for syntax and usage details.
+
+
+Compression steps
+=================
+
+The compression pipeline consists of the following steps:
+
+.. figure:: images/compression_steps.png
+ :align: center
+ :alt: Compression steps
+ :target: _images/compression_steps.png
+
+ Compression steps
+
+Each of these steps is briefly described below. For more details see the
+original Software Heritage graph compression paper [SWHGraphCompression2020]_,
+as well as chapters 9 and 10 of Antoine Pietri's PhD thesis
+[PietriThesis2021]_.
+
+.. [SWHGraphCompression2020]
+ | Paolo Boldi, Antoine Pietri, Sebastiano Vigna, Stefano Zacchiroli.
+ | `Ultra-Large-Scale Repository Analysis via Graph Compression
+ <https://upsilon.cc/~zack/research/publications/saner-2020-swh-graph.pdf>`_.
+ | In proceedings of `SANER 2020 <https://saner2020.csd.uwo.ca/>`_: The 27th
+ IEEE International Conference on Software Analysis, Evolution and
+ Reengineering. IEEE 2020.
+ | Links: `preprint
+ <https://upsilon.cc/~zack/research/publications/saner-2020-swh-graph.pdf>`_,
+ `bibtex
+ <https://upsilon.cc/~zack/research/publications/saner-2020-swh-graph.bib>`_.
+
+
+
+.. [PietriThesis2021]
+ | Antoine Pietri
+ | `Organizing the graph of public software development for large-scale mining
+ <https://hal.archives-ouvertes.fr/tel-03515795/>`_.
+ | Doctoral dissertation. Inria, 2021.
+
+
+1. EXTRACT_NODES
+----------------
+
+This step reads a graph dataset and extracts all the unique node SWHIDs it
+contains, including the ones that are not stored as actual objects in the
+graph, but only *referred to* by the edges. Additionally, it extracts the set
+of all unique edge labels in the graph.
+
+**Rationale:** Because the graph can contain holes, loose objects and dangling
+objects, some nodes that are referred to as destinations in the edge
+relationships might not actually be stored in the graph itself. However, to
+compress the graph using a graph compression library, it is necessary to have a
+list of *all* the nodes in the graph, including the ones that are simply
+referred to by the edges but not actually stored as concrete objects.
+
+This step reads the entire graph dataset, and uses ``sort -u`` to extract the
+set of all the unique nodes and unique labels that will be needed as an input
+for the compression process. It also writes object count statistics in various
+files:
+
+- The set of nodes is written in ``graph.nodes.csv.zst``, as a zst-compressed
+ sorted list of SWHIDs, one per line.
+- The set of edge labels is written in ``graph.labels.csv.zst``, as a
+ zst-compressed sorted list of labels encoded in base64, one per line.
+- The number of unique nodes referred to in the graph is written in a text
+ file, ``graph.nodes.count.txt``
+- The number of unique edges referred to in the graph is written in a text
+ file, ``graph.edges.count.txt``
+- The number of unique edge labels is written in a text file,
+ ``graph.labels.count.txt``
+- Statistics on the number of nodes of each type are written in a text file,
+ ``graph.nodes.stats.txt``
+- Statistics on the number of edges of each type are written in a text file,
+ ``graph.edges.stats.txt``
+
+
+2. MPH
------
-A node in the Software Heritage :ref:`data model <data-model>` is identified
-using its SWHID (see :ref:`persistent identifiers
-<persistent-identifiers>`). However, WebGraph internally uses integers to refer
-to node ids.
+As discussed in :ref:`swh-graph-java-basics`, a node in the Software Heritage
+:ref:`data model <data-model>` is identified by its SWHID (see :ref:`persistent
+identifiers <persistent-identifiers>`), but WebGraph internally uses integers
+to refer to node ids.
-Mapping between the strings and longs ids is needed before compressing the
-graph. From the `Sux4J <http://sux.di.unimi.it/>`_ utility tool, we use the
+To create a mapping between integer node IDs and SWHIDs, we use the
`GOVMinimalPerfectHashFunction
<http://sux.di.unimi.it/docs/it/unimi/dsi/sux4j/mph/GOVMinimalPerfectHashFunction.html>`_
-class, mapping with no collisions N keys to N consecutive integers.
+class of the `Sux4J <http://sux.di.unimi.it/>`_ library, which maps N keys to N
+consecutive integers.
-The step produces a ``.mph`` file (MPH stands for *Minimal Perfect
-Hash-function*) storing the hash function taking as input a string and returning
-a unique integer.
+We run this function on the list of SWHIDs stored in the
+``graph.nodes.csv.zst`` file generated in the previous step.
+This allows us to generate a bijection from the set of all the *n* SWHIDs in the
+graph to the set of integers :math:`[0, n - 1]`.
+The step produces a ``graph.mph`` file (MPH stands for *Minimal Perfect
+Hash-function*), containing a function which takes a SWHID (as a bytestring)
+and returns its associated node ID.
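+
+As an illustration, here is a minimal Java sketch of how this serialized
+function could be loaded and queried. It assumes, as described above, that the
+keys are SWHIDs encoded as ASCII byte strings; the exact key encoding used by
+the pipeline is an implementation detail that may differ:
+
+.. code-block:: java
+
+    import it.unimi.dsi.fastutil.io.BinIO;
+    import it.unimi.dsi.fastutil.objects.Object2LongFunction;
+    import java.nio.charset.StandardCharsets;
+
+    public class HashSwhid {
+        public static void main(String[] args) throws Exception {
+            // Deserialize the minimal perfect hash function stored in graph.mph.
+            @SuppressWarnings("unchecked")
+            Object2LongFunction<byte[]> mph =
+                    (Object2LongFunction<byte[]>) BinIO.loadObject("compressed/graph.mph");
+            // Hash a SWHID passed on the command line. The result still has to be
+            // permuted with graph.order to obtain the final node ID (see the
+            // COMPOSE_ORDERS step below).
+            byte[] swhid = args[0].getBytes(StandardCharsets.US_ASCII);
+            System.out.println(mph.getLong(swhid));
+        }
+    }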
-2. BV compress
+
+3. BV compress
--------------
-This is the first actual compression step, building a compressed version of the
-input graph using WebGraph techniques presented in the framework paper. We use
-the `ScatteredArcsASCIIGraph
+This is the first actual compression step, where we build a compressed version
+of the input graph dataset.
+
+We use a ScatteredArcsORCGraph to load the dataset
+(an implementation inspired by the `ScatteredArcsASCIIGraph
<http://webgraph.di.unimi.it/docs-big/it/unimi/dsi/big/webgraph/ScatteredArcsASCIIGraph.html>`_
-class, from WebGraph.
+class in WebGraph).
+This class wraps the ORC Graph dataset and exposes a *virtual* ImmutableGraph,
+whose nodes and edges can be iterated sequentially as if it were any other
+standard graph. To do so, it puts all the edges in batches, sorts them in an
+aggressively parallel fashion, stores them as ``.bitstream`` files, and
+returns a `BatchGraph
+<https://webgraph.di.unimi.it/docs-big/it/unimi/dsi/big/webgraph/Transform.BatchGraph.html>`_
+created from these batches.
+
+Finally, it uses the ``BVGraph.store()`` method, which compresses the input
+graph as a `BVGraph
+<https://webgraph.di.unimi.it/docs-big/it/unimi/dsi/big/webgraph/BVGraph.html>`_,
+using the compression techniques described in the article *The WebGraph
+Framework I: Compression Techniques*.
The resulting BV graph is stored as a set of files:
-- ``.graph``: the compressed graph in the BV format
-- ``.offsets``: offsets values to read the bit stream graph file
-- ``.obl``: offsets cache to load the graph faster
-- ``.properties``: entries used to correctly decode graph and offset files
+- ``graph-base.graph``: the compressed graph in the BV format
+- ``graph-base.offsets``: offsets values to read the bit stream graph file
+- ``graph-base.properties``: entries used to correctly decode graph and offset
+ files
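+
+At its core, this step boils down to a call to ``BVGraph.store()`` on an
+ImmutableGraph. A minimal sketch of that call (the actual pipeline drives it
+through the ScatteredArcsORCGraph wrapper described above and with additional
+tuning parameters):
+
+.. code-block:: java
+
+    import it.unimi.dsi.big.webgraph.BVGraph;
+    import it.unimi.dsi.big.webgraph.ImmutableGraph;
+
+    public class BVCompress {
+        // Usage: $PROGRAM <INPUT_BASENAME> <OUTPUT_BASENAME>
+        public static void main(String[] args) throws Exception {
+            // Sequential (offline) iteration over the input is enough here.
+            ImmutableGraph g = ImmutableGraph.loadOffline(args[0]);
+            // Writes <OUTPUT_BASENAME>.{graph,offsets,properties} in the BV format.
+            BVGraph.store(g, args[1]);
+        }
+    }
+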
-3. BFS
--------
+4. BFS
+------
-In the LLP paper, authors propose an empirical analysis linking node ordering
-and high compression ratio: it is important to use an ordering of nodes ids such
-that vertices from the same host are close to one another.
+In [LLP]_, the authors empirically demonstrate that a high graph
+compression ratio can be achieved for the graph of the Web by ordering nodes
+such that vertices from the same host are close to each other.
-Building on this insight, the previous compression results in the BV compress
-step are improved by re-ordering nodes ids using a BFS traversal order. We use
-the `BFS
+In Software Heritage, there is no notion of "host" that can be used to generate
+these compression-friendly orderings, because the identifiers are just
+uniformly random cryptographic hashes. However, we can generate these orderings
+by running algorithms that tell us which nodes are close to each other.
+
+In this step, we run a BFS traversal on the entire graph to get a more
+compression-friendly ordering of nodes. We use the `BFS
<http://law.di.unimi.it/software/law-docs/it/unimi/dsi/law/big/graph/BFS.html>`_
class from the `LAW <http://law.di.unimi.it/>`_ library.
-The resulting ordering is stored in the ``.order`` file, listing nodes ids in
-order of traversal.
+The resulting ordering is stored in a ``graph-bfs.order`` file, which contains
+all the node IDs in the order of traversal.
-4. Permute
-----------
+5. PERMUTE_BFS
+--------------
-Once the order is computed (BFS or another ordering technique), the final
-compressed graph is created based on the initial BV compress result, and using
-the new node order mapping. The permutation uses the `Transform
+Once the BFS order is computed, we permute the initial "base" graph using
+this new ordering. The permutation uses the `Transform
<http://webgraph.di.unimi.it/docs-big/it/unimi/dsi/big/webgraph/Transform.html>`_
class from WebGraph framework.
-The final compressed graph is only stored in the resulting ``.graph``,
-``.offsets``, ``.obl``, and ``.properties`` files.
+The BFS-compressed graph is stored in the files
+``graph-bfs.{graph,offsets,properties}``.
+6. TRANSPOSE_BFS
+----------------
-5. Stats
---------
+We transpose the BFS-compressed graph, using the `Transform
+<http://webgraph.di.unimi.it/docs-big/it/unimi/dsi/big/webgraph/Transform.html>`_
+class from WebGraph.
+This step is a prerequisite for LLP compression.
+
+7. SIMPLIFY
+-----------
+
+This step creates a loopless and symmetric version of the BFS-compressed graph,
+using the `Transform
+<http://webgraph.di.unimi.it/docs-big/it/unimi/dsi/big/webgraph/Transform.html>`_
+class from WebGraph.
+This step is a prerequisite for LLP compression.
+
+8. LLP
+------
+
+Better compression ratios can be achieved by using the Layered Label
+Propagation (LLP) algorithm to reorder nodes. This algorithm is described in
+[LLP]_.
+The LLP algorithm finds locality-preserving orders by clustering together nodes
+in close proximity. Similar to the BFS, this algorithm is particularly
+interesting for our use case as it is unsupervised, and does not rely on prior
+information on the clusters present in the graph. The idea behind the
+clustering algorithm is to randomly distribute communities to the nodes in the
+graph, then iteratively assign to each node the community most represented in
+its neighbors.
+
+.. [LLP] Paolo Boldi, Marco Rosa, Massimo Santini, Sebastiano Vigna.
+ *Layered label propagation: a multiresolution coordinate-free ordering for compressing social networks.*
+ WWW 2011: 587-596
+ DOI: https://doi.org/10.1145/1963405.1963488
+ preprint: https://arxiv.org/abs/1011.5425
+
+LLP is more costly than simple BFS-based compression in both time and memory.
+Even though the algorithm has a linear time complexity, it does multiple
+iterations on the graph and is significantly slower than the BFS, which is a
+single traversal. Moreover, keeping track of the communities requires a
+total of 13 bytes per node, which increases the RAM requirements.
+Because of these constraints, it is unrealistic to run the LLP algorithm on the
+uncompressed version of the graph; this is why we do an intermediate
+compression with the BFS ordering first, then compress the entire graph *again*
+with an even better ordering.
+
+The LLP algorithm takes a simplified (loopless, symmetric) graph as an input,
+which we already computed in the previous steps.
+
+The algorithm is also parameterized by a list of γ values, a "resolution" parameter
+which defines the shapes of the clustering it produces: either small, but
+denser pieces, or larger, but unavoidably sparser pieces. The algorithm then
+combines the different clusterings together to generate the output reordering.
+γ values are given to the algorithm in the form :math:`\frac{j}{2^k}`; by
+default, 12 different values of γ are used. However, the combination procedure
+is very slow, and using that many γ values could take several months in our
+case.
+We thus narrowed down a smaller set of γ values that empirically give good
+compression results, which are used by default in the pipeline. In general,
+smaller values of γ seem to generate better compression ratios. The effect of a
+given γ is that the density of the sparsest cluster is at least
+:math:`\frac{\gamma}{\gamma+1}`, so large γ values imply smaller, denser
+clusters. It is reasonable to assume that, since the graph is very sparse to
+start with, such clusters are not that useful.
+
+The resulting ordering is stored in a ``graph-llp.order`` file.
+
+9. PERMUTE_LLP
+--------------
-Compute various statistics on the final compressed graph:
+Once the LLP order is computed, we permute the BFS-compressed graph using
+this new ordering. The LLP-compressed graph, which is our final compressed
+graph, is stored in the files ``graph.{graph,offsets,properties}``.
-- ``.stats``: entries such as number of nodes, edges, avg/min/max degree,
+10. OBL
+-------
+
+Cache the BVGraph offsets of the forward graph to make loading faster. The
+resulting offset big list is stored in the ``graph.obl`` file.
+
+11. COMPOSE_ORDERS
+------------------
+
+To be able to translate the initial MPH inputs to their resulting rank in the
+LLP-compressed graph, we need to use the two order permutations successively:
+the base → BFS permutation, then the BFS → LLP permutation.
+
+To make this less wasteful, we *compose* the two permutations into a single
+one. We use the `composePermutationsInPlace
+<https://dsiutils.di.unimi.it/docs/it/unimi/dsi/Util.html#composePermutationsInPlace(long%5B%5D%5B%5D,long%5B%5D%5B%5D)>`_
+function of the dsiutils library. The resulting permutation is stored as a
+``graph.order`` file. Hashing a SWHID with the ``graph.mph`` function, then
+permuting the result using the ``graph.order`` permutation yields the integer
+node ID matching the input SWHID in the graph.
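+
+In practice, this composed lookup is what the high-level Java API performs in
+its ``getNodeID()`` method (see :ref:`swh-graph-java-api`); it is roughly
+equivalent to this fragment::
+
+    // Hash the SWHID with graph.mph, then permute the hash with graph.order.
+    SwhUnidirectionalGraph graph = SwhUnidirectionalGraph.loadMapped(basename);
+    long nodeId = graph.getNodeID(swhid);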
+
+12. STATS
+---------
+
+This step computes various statistics on the compressed graph:
+
+- ``.stats``: statistics such as number of nodes, edges, avg/min/max degree,
average locality, etc.
- ``.indegree``: graph indegree distribution
- ``.outdegree``: graph outdegree distribution
@@ -117,9 +395,202 @@
class from WebGraph.
-6. Transpose
-------------
+13. TRANSPOSE
+-------------
-Create a transposed graph to allow backward traversal, using the `Transform
+Transpose the graph to allow backward traversal, using the `Transform
<http://webgraph.di.unimi.it/docs-big/it/unimi/dsi/big/webgraph/Transform.html>`_
-class from WebGraph.
+class from WebGraph. The resulting transposed graph is stored as the
+``graph-transposed.{graph,offsets,properties}`` files.
+
+
+14. TRANSPOSE_OBL
+-----------------
+
+Same as OBL, but for the transposed graph. The resulting offset big list is
+stored in the ``graph-transposed.obl`` file.
+
+
+15. MAPS
+--------
+
+This step generates the *node mappings* described in
+:ref:`swh-graph-java-node-mappings`. In particular, it generates:
+
+- ``graph.node2swhid.bin``: a compact binary representation of all the
+ SWHIDs in the graph, ordered by their rank in the graph file.
+- ``graph.node2type.bin``: a `LongBigArrayBitVector
+ <https://dsiutils.di.unimi.it/docs/it/unimi/dsi/bits/LongBigArrayBitVector.html>`_
+ which stores the type of each node.
+
+It does so by reading all the SWHIDs in the ``graph.nodes.csv.zst`` file generated in the
+EXTRACT_NODES step, then getting their corresponding node IDs (using the
+``.mph`` and ``.order`` files), then sorting all the SWHIDs according to
+their node ID. It then writes these SWHIDs in order, in a compact but seekable
+binary format, which can be used to return the SWHID corresponding to any given
+node in O(1).
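+
+These two maps are what back the ``getSWHID()`` and ``getNodeType()`` methods
+of the Java API (see :ref:`swh-graph-java-node-mappings`); roughly::
+
+    // Constant-time lookups in graph.node2swhid.bin and graph.node2type.bin.
+    SWHID swhid = graph.getSWHID(nodeId);
+    Node.Type type = graph.getNodeType(nodeId);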
+
+
+16. EXTRACT_PERSONS
+-------------------
+
+This step reads the ORC graph dataset and extracts all the unique persons it
+contains. Here, "persons" are defined as the set of unique pairs of name +
+email, potentially pseudonymized, found either as revision authors, revision
+committers or release authors.
+
+The ExtractPersons class reads all the persons from revision and release
+tables, then uses ``sort -u`` to get a sorted list without any duplicates. The
+resulting sorted list of authors is stored in the ``graph.persons.csv.zst``
+file.
+
+
+17. MPH_PERSONS
+---------------
+
+This step computes a Minimal Perfect Hash function on the set of all the unique
+persons extracted in the EXTRACT_PERSONS step. Each individual person is mapped
+to a unique integer in :math:`[0, n-1]` where *n* is the total number of
+persons. The resulting function is serialized and stored in the
+``graph.persons.mph`` file.
+
+
+18. NODE_PROPERTIES
+-------------------
+
+This step generates the *node property files*, as described in
+:ref:`swh-graph-java-node-properties`.
+The nodes in the Software Heritage Graph each have associated *properties*
+(e.g., commit timestamps, authors, messages, ...). The values of these
+properties for each node in the graph are compressed and stored in files
+alongside the compressed graph.
+
+The WriteNodeProperties class reads all the properties from the ORC Graph
+Dataset, then serializes each of them in a representation suitable for
+efficient random access (e.g., large binary arrays) and stores them on disk.
+
+For persons (authors, committers etc), the MPH computed in the MPH_PERSONS step
+is used to store them as a single pseudonymized integer ID, which uniquely
+represents a full name + email.
+
+The results are stored in the following list of files:
+
+- ``graph.property.author_id.bin``
+- ``graph.property.author_timestamp.bin``
+- ``graph.property.author_timestamp_offset.bin``
+- ``graph.property.committer_id.bin``
+- ``graph.property.committer_timestamp.bin``
+- ``graph.property.committer_timestamp_offset.bin``
+- ``graph.property.content.is_skipped.bin``
+- ``graph.property.content.length.bin``
+- ``graph.property.message.bin``
+- ``graph.property.message.offset.bin``
+- ``graph.property.tag_name.bin``
+- ``graph.property.tag_name.offset.bin``
+
+
+19. MPH_LABELS
+--------------
+
+This step computes a **monotone** Minimal Perfect Hash function on the set of
+all the unique *arc label names* extracted in the EXTRACT_NODES step. Each
+individual arc label name (i.e., directory entry names and snapshot branch
+names) is monotonically mapped to a unique integer in :math:`[0, n-1]`, where *n*
+is the total number of unique arc label names, which corresponds to their
+**lexical rank** in the set of all arc labels.
+
+In other words, this MPH being monotone means that the hash of the *k*-th item
+in the sorted input list of arc labels will always be *k*.
+We use the `LcpMonotoneMinimalPerfectHashFunction
+<https://sux.di.unimi.it/docs/it/unimi/dsi/sux4j/mph/LcpMonotoneMinimalPerfectHashFunction.html>`_
+of Sux4J to generate this function.
+
+The rationale for using a monotone function here is that it will allow us to
+quickly get back the arc label from its hash without having to store an
+additional permutation.
+The resulting MPH function is serialized and stored in the ``graph.labels.mph``
+file.
+
+
+20. FCL_LABELS
+--------------
+
+This step computes a *reverse-mapping* for arc labels, i.e., a way to
+efficiently get the arc label name from its hash computed with the monotone MPH
+of the MPH_LABELS step.
+
+Thanks to the MPH being monotone, this boils down to storing all the labels in
+lexicographic order in a string list format that allows O(1) access to its
+elements. For this purpose, we use the `MappedFrontCodedStringBigList
+<https://dsiutils.di.unimi.it/docs/it/unimi/dsi/big/util/MappedFrontCodedStringBigList.html>`_
+class from the dsiutils library, using the ``graph.labels.csv.zst`` file as its
+input. It stores the label names in a compact way by using front-coding
+compression, which is particularly efficient here because the strings are
+already in lexicographic order. The resulting FCL files are stored as
+``graph.labels.fcl.*``, and they can be loaded using memory mapping.
+
+
+21. EDGE_LABELS
+---------------
+
+
+This step generates the *edge property files*, as described in
+:ref:`swh-graph-java-edge-properties`. These files allow us to get the *edge
+labels* as we iterate on the edges of the graph. The files essentially contain
+compressed sorted triplets of the form (source, destination, label), with
+additional offsets to allow random access.
+
+To generate these files, the LabelMapBuilder class starts by reading in
+parallel the labelled edges in the ORC dataset, which can be thought of as
+quadruplets containing the source SWHID, the destination SWHID, the label name
+and the entry permission if applicable:
+
+.. code-block:: text
+
+ swh:1:snp:4548a5… swh:1:rev:0d6834… cmVmcy9oZWFkcy9tYXN0ZXI=
+ swh:1:dir:05faa1… swh:1:cnt:a35136… dGVzdC5j 33188
+ swh:1:dir:05faa1… swh:1:dir:d0ff82… dGVzdA== 16384
+ ...
+
+Using the ``graph.mph`` and the ``graph.order`` files, we hash and permute the
+source and destination nodes. We also monotonically hash the labels using the
+``graph.labels.mph`` function to obtain the arc label identifiers. The
+permissions are normalized as one of the 6 possible values in the
+``DirEntry.Permission.Type`` enum, and are then stored in the 3 lowest bits of
+the label field.
+
+.. code-block:: text
+
+ 4421 14773 154
+ 1877 21441 1134
+ 1877 14143 1141
+ ...
+
+These hashed edges and their compact-form labels are then put in large
+batches, sorted in an aggressively parallel fashion, and stored as
+``.bitstream`` files. These batch files are put in a heap structure to perform
+an on-the-fly merge sort over all the batches.
+
+Then, the LabelMapBuilder loads the graph and starts iterating on its edges. It
+synchronizes the stream of edges read from the graph with the stream of sorted
+edges and labels read from the bitstreams in the heap. At this point, it writes
+the labels to the following output files:
+
+- ``graph-labelled.properties``: a property file describing the graph, notably
+ containing the basename of the wrapped graph.
+- ``graph-labelled.labels``: the compressed labels
+- ``graph-labelled.labeloffsets``: the offsets used to access the labels in
+ random order.
+
+It then does the same with backward edge batches to get the transposed
+equivalent of these files:
+``graph-transposed-labelled.{properties,labels,labeloffsets}``.
+
+
+
+22. CLEAN_TMP
+-------------
+
+This step reclaims space by deleting the temporary directory, as well as all
+the intermediate outputs that are no longer necessary now that the final graph
+has been compressed (shown in gray in the step diagram).
diff --git a/docs/images/Makefile b/docs/images/Makefile
--- a/docs/images/Makefile
+++ b/docs/images/Makefile
@@ -1,7 +1,7 @@
all: compression_steps.png compression_steps.svg
%.png: %.dot
- dot -Gdpi=300 -Tpng $< -o $@
+ dot -Gdpi=150 -Tpng $< -o $@
%.svg: %.dot
dot -Tsvg $< -o $@
diff --git a/docs/images/compression_steps.dot b/docs/images/compression_steps.dot
--- a/docs/images/compression_steps.dot
+++ b/docs/images/compression_steps.dot
@@ -1,51 +1,105 @@
digraph "Compression steps" {
- // Horizontal graph
- rankdir=LR;
+ node [shape = none];
+
+ orc_dataset [label="ORC Graph\nDataset"];
+ nodes_csv [label="graph.nodes.csv.zst"];
+ labels_csv [label="graph.labels.csv.zst"];
+ graph_mph [label="graph.mph"];
subgraph {
- input_edges [label="swh.edges.csv.gz", fontsize=9, shape=none];
- input_nodes [label="swh.nodes.csv.gz", fontsize=9, shape=none];
- {rank=same; input_edges; input_nodes;}
+ node [fontcolor=darkgray];
+ graph_base [label="graph-base.graph"]
+ graph_bfs_order [label="graph-bfs.order"]
+ graph_bfs [label="graph-bfs.graph"]
+ graph_bfs_transposed [label="graph-bfs-transposed.graph"]
+ graph_bfs_simplified [label="graph-bfs-simplified.graph"]
+ graph_llp_order [label="graph-llp.order"]
}
- mph [label="MPH", shape=box];
- mph_out [label="swh.mph", fontsize=9, shape=none];
-
- bv_compress [label="BV compress", shape=box];
- bv_compress_out
- [label="swh-bv.graph\lswh-bv.offsets\lswh-bv.obl\lswh-bv.properties",
- fontsize=9, shape=none];
-
- bfs [label="BFS", shape=box];
- bfs_out [label="swh.order", fontsize=9, shape=none];
+ graph_llp [label="graph.graph"]
+ graph_llp_transposed [label="graph-transposed.graph"]
+ graph_order [label="graph.order"]
+ graph_obl [label="graph.obl"]
+ graph_transposed_obl [label="graph-transposed.obl"]
+ stats [label="graph.stats"]
+ swhidmap [label="graph.node2swhid.bin"]
+ typemap [label="graph.node2type.bin"]
+ persons_csv [label="graph.persons.csv.zst"];
+ persons_mph [label="graph.persons.mph"];
+ node_properties [label="graph.property.*"];
+ labels_mph [label="graph.labels.mph"];
+ labels_fcl [label="graph.labels.fcl"];
+ graph_labelled [label="graph-labelled.*"];
+ graph_transposed_labelled [label="graph-transposed-labelled.*"];
- permute [label="Permute", shape=box];
- permute_out
- [label="swh.graph\lswh.offsets\lswh.obl\lswh.properties",
- fontsize=9, shape=none];
-
- stats [label="Stats", shape=box];
- stats_out
- [label="swh.stats\lswh.indegree\lswh.outdegree",
- fontsize=9, shape=none];
+ subgraph {
+ node [shape=box, fontname="Courier New"];
+ EXTRACT_NODES;
+ MPH;
+ BV;
+ BFS;
+ PERMUTE_BFS;
+ TRANSPOSE_BFS;
+ SIMPLIFY;
+ LLP;
+ PERMUTE_LLP;
+ COMPOSE_ORDERS;
+ STATS;
+ TRANSPOSE;
+ OBL;
+ TRANSPOSE_OBL;
+ NODE_MAP;
+ EXTRACT_PERSONS;
+ MPH_PERSONS;
+ NODE_PROPERTIES;
+ MPH_LABELS;
+ FCL_LABELS;
+ EDGE_LABELS;
+ }
- transpose [label="Transpose", shape=box];
- transpose_out
- [label="swh-transposed.graph\lswh-transposed.offsets\lswh-transposed.obl\lswh-transposed.properties",
- fontsize=9, shape=none];
- input_nodes -> mph;
- input_edges -> bv_compress;
- mph -> mph_out;
- mph_out -> bv_compress;
- bv_compress -> bv_compress_out;
- bv_compress_out-> bfs;
- bv_compress_out-> permute;
- bfs -> bfs_out;
- bfs_out -> permute;
- permute -> permute_out;
- permute_out -> stats;
- permute_out -> transpose;
- stats -> stats_out;
- transpose -> transpose_out;
+ orc_dataset -> EXTRACT_NODES;
+ EXTRACT_NODES -> nodes_csv;
+ EXTRACT_NODES -> labels_csv;
+ nodes_csv -> MPH -> graph_mph;
+ graph_mph -> BV;
+ orc_dataset -> BV -> graph_base;
+ graph_base -> BFS -> graph_bfs_order;
+ graph_bfs_order -> PERMUTE_BFS;
+ graph_base -> PERMUTE_BFS -> graph_bfs;
+ graph_bfs -> TRANSPOSE_BFS -> graph_bfs_transposed;
+ graph_bfs_transposed -> SIMPLIFY;
+ graph_bfs -> SIMPLIFY -> graph_bfs_simplified;
+ graph_bfs_simplified -> LLP -> graph_llp_order;
+ graph_llp_order -> PERMUTE_LLP;
+ graph_bfs -> PERMUTE_LLP -> graph_llp;
+ graph_bfs_order -> COMPOSE_ORDERS;
+ graph_llp_order -> COMPOSE_ORDERS -> graph_order;
+ graph_llp -> TRANSPOSE -> graph_llp_transposed;
+ graph_llp -> OBL -> graph_obl;
+ graph_llp_transposed -> TRANSPOSE_OBL -> graph_transposed_obl;
+ graph_llp -> STATS -> stats;
+ graph_llp -> NODE_MAP;
+ nodes_csv -> NODE_MAP;
+ graph_mph -> NODE_MAP;
+ graph_order -> NODE_MAP;
+ NODE_MAP -> swhidmap;
+ NODE_MAP -> typemap;
+ orc_dataset -> EXTRACT_PERSONS -> persons_csv;
+ persons_csv -> MPH_PERSONS -> persons_mph;
+ orc_dataset -> NODE_PROPERTIES;
+ persons_mph -> NODE_PROPERTIES;
+ graph_mph -> NODE_PROPERTIES;
+ graph_order -> NODE_PROPERTIES;
+ NODE_PROPERTIES -> node_properties;
+ labels_csv -> MPH_LABELS -> labels_mph;
+ labels_mph -> FCL_LABELS;
+ labels_csv -> FCL_LABELS -> labels_fcl;
+ orc_dataset -> EDGE_LABELS;
+ labels_mph -> EDGE_LABELS;
+ graph_llp -> EDGE_LABELS;
+ graph_mph -> EDGE_LABELS;
+ graph_order -> EDGE_LABELS;
+ EDGE_LABELS -> graph_labelled;
+ EDGE_LABELS -> graph_transposed_labelled;
}
diff --git a/docs/index.rst b/docs/index.rst
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -8,10 +8,11 @@
:caption: Overview
quickstart
+ api
+ java-api
+ memory
compression
cli
- api
- use-cases
docker
git2graph
/apidoc/swh.graph
diff --git a/docs/java-api.rst b/docs/java-api.rst
new file mode 100644
--- /dev/null
+++ b/docs/java-api.rst
@@ -0,0 +1,744 @@
+.. _swh-graph-java-api:
+
+Using the Java API
+==================
+
+.. highlight:: java
+
+While the :ref:`HTTP API <swh-graph-api>` is useful for many common use-cases,
+it is often not sufficient to implement more complex algorithms. This section
+describes the low-level Java API that ``swh-graph`` provides on top of the
+WebGraph framework to manipulate the compressed graph of Software Heritage.
+
+A cursory understanding of the `WebGraph framework
+<https://webgraph.di.unimi.it/>`_ and its API is helpful to understand the
+notions detailed here.
+
+.. _swh-graph-java-basics:
+
+Basics
+------
+
+In the WebGraph framework, graphs are generally subclasses of
+`ImmutableGraph
+<https://webgraph.di.unimi.it/docs/it/unimi/dsi/webgraph/ImmutableGraph.html>`_,
+the abstract class providing the core API to manipulate and iterate on graphs.
+Under the hood, compressed graphs are stored as the `BVGraph
+<https://webgraph.di.unimi.it/docs/it/unimi/dsi/webgraph/BVGraph.html>`_
+class, which contains the actual codec used to compress and decompress
+adjacency lists.
+
+Graph **nodes** are mapped to a contiguous set of integers :math:`[0, n - 1]`
+where *n* is the total number of nodes in the graph.
+Each node has an associated *adjacency list*, i.e., the list of nodes that can
+be reached from it through an outgoing arc. These lists represent the **edges**
+(or **arcs**) of the graph.
+
+**Note**: edges are always directed. Undirected graphs are internally stored as
+a pair of directed edges (src → dst, dst → src), and are called "symmetric"
+graphs.
+
+On disk, a simple BVGraph with the basename ``graph`` would be represented as
+the following set of files:
+
+- ``graph.graph``: contains the compressed adjacency lists of each node, which
+ can be decompressed by the BVGraph codec.
+- ``graph.properties``: contains metadata on the graph, such as the number of
+ nodes and arcs, as well as additional loading information needed by the
+ BVGraph codec.
+- ``graph.offsets``: a list of offsets of where the adjacency list of each node
+ is stored in the main graph file.
+- ``graph.obl``: optionally, an "offset big-list file" which can be used to
+ load graphs faster.
+
+An ImmutableGraph can be loaded using different *load methods*, each of which
+has different performance implications:
+
+- ``load()``: the entire graph is loaded in RAM and supports random access.
+- ``loadMapped()``: the graph is loaded by memory-mapping it from disk (see
+ ``mmap(1)``), at the cost of being potentially slower, especially when doing
+ random access on slow storage.
+- ``loadOffline()``: no data is actually loaded in memory; only sequential
+ iteration is possible.
+
+The following code loads a graph stored on disk under the ``compressed/graph``
+basename, using the memory-mapped loading mode, and stores it as a generic
+ImmutableGraph:
+
+.. code-block:: java
+
+ ImmutableGraph graph = ImmutableGraph.loadMapped("compressed/graph");
+
+Note that most of the time you will want to use the SWH-specific subclass
+**SwhUnidirectionalGraph** instead, which has the same API as an ImmutableGraph
+except it adds other SWH-specific methods. It is described later in the
+:ref:`swh-graph-java-node-mappings` section.
+
+
+Running the code
+----------------
+
+To run a piece of Java code written using the Java API, you need to run it with
+all the dependencies in your classpath (the WebGraph libraries and the
+swh-graph library). The easiest way to do so is to use the *fat jar*
+shipped with the swh-graph package on PyPI, which contains all the required
+dependencies.
+
+.. code-block:: console
+
+ $ java -cp venv/share/swh-graph/swh-graph-0.5.2.jar MyAlgo.java
+
+
+Note that for bigger graphs, the default heap size of the JVM is likely to be
+insufficient to load them entirely in memory. It is advised to increase this
+heap size with the ``-Xmx`` flag:
+
+.. code-block:: console
+
+ $ java -Xmx300G -cp venv/share/swh-graph/swh-graph-0.5.2.jar MyAlgo.java
+
+For more information on performance tuning and memory considerations, you
+should also read the :ref:`swh-graph-memory` page, in which we recommend
+additional JVM options for loading large graphs.
+
+
+Simple traversal
+----------------
+
+The ImmutableGraph class provides primitives to iterate and traverse graphs. It
+contains the following methods:
+
+- ``graph.numNodes()`` returns the number of nodes in the graph (*n*).
+- ``graph.numArcs()`` returns the number of arcs in the graph.
+
+And, given a node ID :math:`k \in [0, n - 1]`:
+
+- ``graph.successors(k)`` returns a LazyLongIterator on the nodes that are
+ *adjacent* to *k* (i.e., its outgoing *neighbors*).
+- ``graph.outdegree(k)`` returns the number of outgoing neighbors of *k*.
+
+
+Example: Average outdegree
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The following code can be used to compute the average
+outdegree of a graph, which is a useful measure of its density:
+
+.. code-block:: java
+
+    public static long averageOutdegree(ImmutableGraph graph) {
+        return ((long) graph.numArcs()) / graph.numNodes();
+    }
+
+
+Example: Degree distributions
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Using the ``outdegree()`` primitive, we can compute the outdegree distribution
+of the graph by iterating on all its nodes. The distribution will be returned
+as a map that associates to each degree *d* the number of nodes with outdegree
+*d*.
+
+.. code-block:: java
+
+    public static Map<Long, Long> outdegreeDistribution(ImmutableGraph graph) {
+        HashMap<Long, Long> distribution = new HashMap<Long, Long>();
+        for (long k = 0; k < graph.numNodes(); ++k) {
+            distribution.merge(graph.outdegree(k), 1L, Long::sum);
+        }
+        return distribution;
+    }
+
+
+Example: Depth-First Traversal
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The ``successors`` primitive can be used to write a simple stack-based DFS
+traversal on the graph which starts from a given node and prints all the
+descendant nodes in its transitive closure:
+
+.. code-block:: java
+ :emphasize-lines: 10
+
+    public static void visitNodesDFS(ImmutableGraph graph, long srcNodeId) {
+        Stack<Long> stack = new Stack<>();
+        HashSet<Long> visited = new HashSet<Long>();
+        stack.push(srcNodeId);
+        visited.add(srcNodeId);
+
+        while (!stack.isEmpty()) {
+            long currentNodeId = stack.pop();
+            System.out.println(currentNodeId);
+
+            LazyLongIterator it = graph.successors(currentNodeId);
+            for (long neighborNodeId; (neighborNodeId = it.nextLong()) != -1;) {
+                if (!visited.contains(neighborNodeId)) {
+                    stack.push(neighborNodeId);
+                    visited.add(neighborNodeId);
+                }
+            }
+        }
+    }
+
+Example: Breadth-First Traversal
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Swapping the stack for a queue changes the traversal order from depth-first
+to breadth-first:
+
+.. code-block:: java
+ :emphasize-lines: 2
+
+    public static void visitNodesBFS(ImmutableGraph graph, long srcNodeId) {
+        Queue<Long> queue = new ArrayDeque<>();
+        HashSet<Long> visited = new HashSet<Long>();
+        queue.add(srcNodeId);
+        visited.add(srcNodeId);
+
+        while (!queue.isEmpty()) {
+            long currentNodeId = queue.poll();
+            System.out.println(currentNodeId);
+
+            LazyLongIterator it = graph.successors(currentNodeId);
+            for (long neighborNodeId; (neighborNodeId = it.nextLong()) != -1;) {
+                if (!visited.contains(neighborNodeId)) {
+                    queue.add(neighborNodeId);
+                    visited.add(neighborNodeId);
+                }
+            }
+        }
+    }
+
+
+.. _swh-graph-java-node-mappings:
+
+Node types and SWHIDs
+---------------------
+
+In the Software Heritage archive, nodes are not represented by a simple
+integer, but by a :ref:`SWHID <persistent-identifiers>`, which contains both the
+*type* of the node (revision, directory, blob...) and its unique identifier. We
+use **node mappings** which allow us to translate between SWHIDs and the
+compact node IDs used in the compressed graph.
+
+Most notably, we use a MPH (Minimal Perfect Hash) function implemented in the
+`GOVMinimalPerfectHashFunction
+<http://sux.di.unimi.it/docs/it/unimi/dsi/sux4j/mph/GOVMinimalPerfectHashFunction.html>`_
+class of the Sux4J library, which maps N keys to N consecutive integers with no
+collisions.
+
+The following files are used to store the mappings between the nodes and their
+types:
+
+- ``graph.mph``: contains a serialized minimal perfect hash function computed
+ on the list of all the SWHIDs in the graph.
+- ``graph.order``: contains the permutation that associates with each output of
+ the MPH the node ID to which it corresponds
+- ``graph.node2swhid.bin``: contains a compact binary representation of all the
+ SWHIDs in the graph, ordered by their rank in the graph file.
+- ``graph.node2type.bin``: contains a `LongBigArrayBitVector
+ <https://dsiutils.di.unimi.it/docs/it/unimi/dsi/bits/LongBigArrayBitVector.html>`_
+ which stores the type of each node.
+
+To use these mappings easily, we provide the class **SwhUnidirectionalGraph**,
+an ImmutableGraph which wraps the underlying graph and adds a few
+utility methods to obtain SWH-specific information on the graph.
+
+A SwhUnidirectionalGraph can be loaded in a similar way to any ImmutableGraph,
+as long as the mapping files listed above are present::
+
+ SwhUnidirectionalGraph graph = SwhUnidirectionalGraph.load(basename);
+
+This class exposes the same graph primitives as an ImmutableGraph, but it
+additionally contains the following methods:
+
+- ``SWHID getSWHID(long nodeId)``: returns the SWHID associated with a given
+ node ID. This function does a lookup of the SWHID at the offset given by the
+ node ID in the file ``graph.node2swhid.bin``.
+
+- ``long getNodeID(SWHID swhid)``: returns the node ID associated with a given
+ SWHID. It works by hashing the SWHID with the function stored in
+ ``graph.mph``, then permuting it using the permutation stored in
+ ``graph.order``. It does additional domain-checking by calling ``getSWHID()``
+ on its own result to check that the input SWHID was valid.
+
+- ``Node.Type getNodeType(long nodeID)``: returns the type of a given node, as
+ an enum of all the different object types in the Software Heritage data
+ model. It does so by looking up the value at the offset given by the node ID
+ in the bit vector stored in ``graph.node2type.bin``.
+
+
+Example: Find the target directory of a revision
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+As an example, we use the methods mentioned above to perform the
+following task: "given a revision, return its target directory". To do so, we
+first look up the node ID of the given revision in the compressed graph. We
+iterate on the successors of that node, and return the SWHID of the first
+destination node that has the "directory" type.
+
+
+.. code-block:: java
+ :emphasize-lines: 2
+
+    public SWHID findDirectoryOfRevision(SwhUnidirectionalGraph graph, SWHID revSwhid) {
+        long src = graph.getNodeID(revSwhid);
+        assert graph.getNodeType(src) == Node.Type.REV;
+        LazyLongIterator it = graph.successors(src);
+        for (long dst; (dst = it.nextLong()) != -1;) {
+            if (graph.getNodeType(dst) == Node.Type.DIR) {
+                return graph.getSWHID(dst);
+            }
+        }
+        throw new RuntimeException("Revision has no target directory");
+    }
+
+.. _swh-graph-java-node-properties:
+
+Node properties
+---------------
+
+The Software Heritage Graph is a *property graph*, which means it has various
+properties associated with its nodes and edges (e.g., commit timestamps,
+authors, messages, ...). We compress these properties and store them in files
+alongside the compressed graph. This allows you to write traversal algorithms
+that depend on these properties.
+
+By default, properties are not assumed to be present and are not loaded when
+the graph itself is loaded. If you want to use a property, you need to
+explicitly load it first. As an example, this is how you load the "content
+length" property to get the length of a given blob::
+
+ SwhUnidirectionalGraph graph = SwhUnidirectionalGraph.load(basename);
+ graph.loadContentLength();
+ long blobSize = graph.getContentLength(graph.getNodeID(swhid));
+
+The documentation of the SwhGraphProperties class (**TODO: link**) lists all
+the different properties, their types, and the methods used to load them and to get
+their value for a specific node.
+
+A few things of note:
+
+- A single loading call can load multiple properties at once; this is because
+ they are stored in the same file to be more space efficient.
+
+- Persons (authors, committers etc) are exported as a single pseudonymized
+ integer ID, that uniquely represents a full name + email.
+
+- Timestamps are stored as a long integer (for the timestamp itself) and a
+ short integer (for the UTC offset).
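+
+For instance, committer timestamps can be loaded and read with the same
+pattern as the content length example above::
+
+    SwhUnidirectionalGraph graph = SwhUnidirectionalGraph.load(basename);
+    graph.loadCommitterTimestamps();
+    // Returns null when the node has no committer timestamp.
+    Long committerTs = graph.getCommitterTimestamp(graph.getNodeID(swhid));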
+
+
+.. _swh-graph-java-edge-properties:
+
+Edge labels
+-----------
+
+While looking up graph properties on the *nodes* of the graph is relatively
+straightforward, doing so for labels on the *arcs* is comparatively more
+difficult. These include the names and permissions of directory entries, as
+well as the branch names of snapshots.
+
+The `ArcLabelledImmutableGraph
+<https://webgraph.di.unimi.it/docs/it/unimi/dsi/webgraph/labelling/ArcLabelledImmutableGraph.html>`_
+class in WebGraph wraps an ImmutableGraph, but augments its iterators by making them
+*labelled iterators*, which essentially allow us to look up the label of the
+arcs while iterating on them.
+
+This labelled graph is stored in the following files:
+
+- ``graph-labelled.properties``: a property file describing the graph, notably
+ containing the basename of the wrapped graph.
+- ``graph-labelled.labels``: the compressed labels
+- ``graph-labelled.labeloffsets``: the offsets used to access the labels in
+ random order.
+
+The SwhUnidirectionalGraph class contains *labelled* loading methods
+(``loadLabelled()``, ``loadLabelledMapped()``, ...). When these loading methods
+are used instead of the standard non-labelled ones, the graph is loaded as an
+ArcLabelledImmutableGraph instead of an ImmutableGraph. The following methods
+can then be used:
+
+- ``labelledSuccessors(k)`` returns a `LabelledArcIterator
+ <https://webgraph.di.unimi.it/docs/it/unimi/dsi/webgraph/labelling/ArcLabelledNodeIterator.LabelledArcIterator.html>`_
+ which is used in the same way as a LazyLongIterator except it also contains a
+ ``label()`` method to get the label of the currently traversed arc.
+- ``labelledNodeIterator()`` returns an `ArcLabelledNodeIterator
+ <https://webgraph.di.unimi.it/docs/it/unimi/dsi/webgraph/labelling/ArcLabelledNodeIterator.html>`_
+ of all the nodes in the graph, which replaces the LazyLongIterator of the
+ ``successors()`` function by a LabelledArcIterator similar to the one above.
+
+
+Label format
+~~~~~~~~~~~~
+
+The labels of each arc are returned as a ``DirEntry[]`` array. They encode
+both the name of a directory entry and its permissions. For snapshot branches,
+only the "name" field is useful.
+
+Arc label names are encoded as an integer ID representing each unique
+entry/branch name present in the graph. To retrieve the actual name associated
+with a given label ID, one needs to load the reverse mapping similar to how you
+would do for a normal property::
+
+ SwhUnidirectionalGraph graph = SwhUnidirectionalGraph.loadLabelled(basename);
+ graph.loadLabelNames();
+
+The byte array representing the actual label name can then be loaded with::
+
+ byte[] name = graph.getLabelName(label.filenameId);
+
+
+Multiedges
+~~~~~~~~~~
+
+The Software Heritage graph is not a *simple graph*, where at most one edge can
+exist between two vertices, but a *multigraph*, where multiple edges can be incident
+to the same two vertices. Consider for instance the case of a single directory
+``test/`` containing twice the same file blob (e.g., the empty file), under two
+different names (e.g., ``ISSUES.txt`` and ``TODO.txt``, both completely empty).
+The simple graph view of this directory will represent it as a single edge
+``test`` → *empty file*, while the multigraph view will represent it as *two*
+edges between the same nodes.
+
+Due to the copy-list model of compression, WebGraph only stores simple graphs,
+and thus stores multiedges as single edges, to which we cannot associate
+a single label name (in our example, we need to associate both names
+``ISSUES.txt`` and ``TODO.txt``).
+To represent the possibility of having multiple file names for a single arc,
+each arc label is stored as an *array* of DirEntry, with one entry per
+relationship between the two nodes.
+
+
+Example: Printing all the entries of a directory
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The following code showcases how one can print all the entries (name,
+permission and target SWHID) of a given directory, using the labelled methods
+seen above.
+
+.. code-block:: java
+
+    public static void printEntries(SwhUnidirectionalGraph g, long dirNode) {
+        LabelledArcIterator s = g.labelledSuccessors(dirNode);
+        for (long dst; (dst = s.nextLong()) >= 0;) {
+            DirEntry[] labels = (DirEntry[]) s.label().get();
+            for (DirEntry label : labels) {
+                System.out.format(
+                    "%s %s %d\n",
+                    g.getSWHID(dst),
+                    new String(g.getLabelName(label.filenameId)),
+                    label.permission
+                );
+            }
+        }
+    }
+
+    // Usage: $PROGRAM <GRAPH_BASENAME> <DIR_SWHID>
+    public static void main(String[] args) throws Exception {
+        SwhUnidirectionalGraph g = SwhUnidirectionalGraph.loadLabelledMapped(args[0]);
+        g.loadLabelNames();
+        long dirNode = g.getNodeID(new SWHID(args[1]));
+        printEntries(g, dirNode);
+    }
+
+
+Transposed graph
+----------------
+
+Up until now, we have only looked at how to traverse the *forward* graph, i.e.,
+the directed graph whose edges are in the same direction as the Merkle DAG of
+the Software Heritage archive.
+For many purposes, especially that of finding the *provenance* of software
+artifacts, it is useful to query the *backward* (or *transposed*) graph
+instead, which is the same as the forward graph except all the edges are
+reversed.
+
+The transposed graph has its own set of files, counterparts to the files needed
+for the forward graph:
+
+- ``graph-transposed.graph``
+- ``graph-transposed.properties``
+- ``graph-transposed.offsets``
+- ``graph-transposed.obl``
+- ``graph-transposed-labelled.labels``
+- ``graph-transposed-labelled.labeloffsets``
+- ``graph-transposed-labelled.properties``
+
+However, because node IDs are the same in the forward and the backward graph,
+all the files that pertain to mappings between the node IDs and various
+properties (SWHIDs, property data, node permutations etc) remain the same.
+
+
+Example: Earliest revision containing a given blob
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The following code loads all the committer timestamps of the revisions in the
+graph, then walks the *transposed* graph to return the earliest revision
+containing a given object.
+
+.. code-block:: java
+
+    public static long findEarliestRevisionContaining(SwhUnidirectionalGraph g, long src) {
+        long oldestRev = -1;
+        long oldestRevTs = Long.MAX_VALUE;
+
+        Stack<Long> stack = new Stack<>();
+        HashSet<Long> visited = new HashSet<Long>();
+        stack.push(src);
+        visited.add(src);
+        while (!stack.isEmpty()) {
+            long currentNodeId = stack.pop();
+            LazyLongIterator it = g.successors(currentNodeId);
+            for (long neighborNodeId; (neighborNodeId = it.nextLong()) != -1;) {
+                if (!visited.contains(neighborNodeId)) {
+                    stack.push(neighborNodeId);
+                    visited.add(neighborNodeId);
+                    if (g.getNodeType(neighborNodeId) == Node.Type.REV) {
+                        Long ts = g.getCommitterTimestamp(neighborNodeId);
+                        if (ts != null && ts < oldestRevTs) {
+                            oldestRev = neighborNodeId;
+                            oldestRevTs = ts;
+                        }
+                    }
+                }
+            }
+        }
+        return oldestRev;
+    }
+
+    // Usage: $PROGRAM <GRAPH_BASENAME> <BLOB_SWHID>
+    public static void main(String[] args) throws Exception {
+        // Load the backward (= transposed) graph as a SwhUnidirectionalGraph
+        SwhUnidirectionalGraph g = SwhUnidirectionalGraph.loadMapped(args[0] + "-transposed");
+        g.loadCommitterTimestamps();
+        long node = g.getNodeID(new SWHID(args[1]));
+        long oldestRev = findEarliestRevisionContaining(g, node);
+        System.out.println(g.getSWHID(oldestRev));
+    }
+
+
+
+
+Bidirectional Graph
+-------------------
+
+
+BidirectionalImmutableGraph
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+While ``graph-transposed`` can be loaded as a simple SwhUnidirectionalGraph and
+then manipulated just like the forward graph, it is often convenient to have
+*both* the forward and the backward graph in memory. Some traversal algorithms
+require first going down in the forward graph to select some nodes, then going
+up to find their provenance.
+
+To achieve that, we use the `BidirectionalImmutableGraph
+<https://webgraph.di.unimi.it/docs-big/it/unimi/dsi/big/webgraph/BidirectionalImmutableGraph.html>`_
+class from WebGraph, which stores both a graph and its transpose.
+This class provides the following methods to iterate on the **backward** graph,
+shown here with their counterparts for the forward graph:
+
+.. list-table::
+ :header-rows: 1
+
+ * - Forward graph operation
+ - Backward graph operation
+
+ * - ``outdegree(k)``
+ - ``indegree(k)``
+
+ * - ``successors(k)``
+ - ``predecessors(k)``
+
+In addition, the class offers a few convenience methods which are generally
+useful when you have both a graph and its transpose:
+
+- ``transpose()`` returns the transpose of the BidirectionalImmutableGraph by
+ inverting the references to the forward and the backward graphs. Successors
+ become predecessors, and vice-versa.
+- ``symmetrize()`` returns the symmetric (= undirected) version of the
+ bidirectional graph. It is implemented by a union between the forward and the
+ backward graph, which basically boils down to removing the directionality of
+ the edges (the successors of a node are also its predecessors).
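+
+As a quick illustration (a sketch, not part of the upstream examples), the
+following method prints the degree and the neighbors of a node in both
+directions, using only the operations listed above; ``bg`` is assumed to be an
+already-loaded ``BidirectionalImmutableGraph``:
+
+.. code-block:: java
+
+    public static void printNeighborhood(BidirectionalImmutableGraph bg, long node) {
+        System.out.println("out-degree: " + bg.outdegree(node));
+        System.out.println("in-degree:  " + bg.indegree(node));
+
+        // Forward direction: successors of the node
+        LazyLongIterator succ = bg.successors(node);
+        for (long s; (s = succ.nextLong()) != -1;) {
+            System.out.println("successor:   " + s);
+        }
+
+        // Backward direction: predecessors of the node
+        LazyLongIterator pred = bg.predecessors(node);
+        for (long p; (p = pred.nextLong()) != -1;) {
+            System.out.println("predecessor: " + p);
+        }
+    }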
+
+
+SwhBidirectionalGraph
+~~~~~~~~~~~~~~~~~~~~~
+
+As with ImmutableGraph, we extend BidirectionalImmutableGraph with
+SWH-specific methods in the subclass ``SwhBidirectionalGraph``. Notably, it
+contains the method ``labelledPredecessors()``, the equivalent of
+``labelledSuccessors()`` but on the backward graph.
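+
+As a sketch (mirroring the ``printEntries`` example above, and assuming the
+graph was loaded in labelled mode with ``loadLabelNames()`` already called),
+``labelledPredecessors()`` can be used to print the names under which a node
+appears in its parent directories:
+
+.. code-block:: java
+
+    public static void printParentEntries(SwhBidirectionalGraph g, long node) {
+        LabelledArcIterator it = g.labelledPredecessors(node);
+        for (long parent; (parent = it.nextLong()) >= 0;) {
+            DirEntry[] labels = (DirEntry[]) it.label().get();
+            for (DirEntry label : labels) {
+                System.out.format(
+                    "contained in %s as %s\n",
+                    g.getSWHID(parent),
+                    new String(g.getLabelName(label.filenameId))
+                );
+            }
+        }
+    }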
+
+Because SwhUnidirectionalGraph inherits from ImmutableGraph, and
+SwhBidirectionalGraph inherits from BidirectionalImmutableGraph, we put the
+common behavior between the two classes in a SwhGraph interface, which can
+represent either a unidirectional or a bidirectional graph.
+
+To avoid loading the node properties twice (once for each direction), they
+are stored in a separate class called SwhGraphProperties. In a
+SwhBidirectionalGraph, the two SwhUnidirectionalGraph instances share their
+node properties in memory by storing references to the same
+SwhGraphProperties object.
+
+.. code-block:: text
+
+
+ ┌──────────────┐
+ │ImmutableGraph◄────────┐
+ └────▲─────────┘ │extends
+ │ │
+ │ ┌──────────┴────────────────┐
+ extends│ │BidirectionalImmutableGraph│
+ │ └────────────▲──────────────┘
+ │ │extends
+ ┌──────────────┴───────┐ ┌──────┴──────────────┐
+ │SwhUnidirectionalGraph│◄────┤SwhBidirectionalGraph│
+ └──┬──────────────┬────┘ └────────┬───────────┬┘
+ │ │ contains x2 │ │
+ │ │ │ │
+ │ implements│ │implements │
+ │ ┌▼──────────┐ │ │
+ │ │SwhGraph(I)◄────────┘ │
+ contains│ └───────────┘ │contains
+ │ │
+ │ ┌──────────────────┐ │
+ └────────────►SwhGraphProperties◄──────────────┘
+ └──────────────────┘
+
+
+Example: Find all the shared-commit forks of a given origin
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The *forks* of an origin can be defined as the set of origins that share at
+least one revision with it.
+
+The following code loads the graph in both directions using a
+SwhBidirectionalGraph. Given an origin SWHID, it first walks the *forward*
+graph to find all its root revisions. It then walks the *backward* graph to
+find all the origins containing these root revisions, i.e., its *forks*.
+
+.. code-block:: java
+
+    public static void findSharedCommitForks(SwhBidirectionalGraph g, long srcOrigin) {
+ Stack<Long> forwardStack = new Stack<>();
+ HashSet<Long> forwardVisited = new HashSet<Long>();
+ Stack<Long> backwardStack = new Stack<>();
+ HashSet<Long> backwardVisited = new HashSet<Long>();
+
+ // First traversal (forward graph): find all the root revisions of the
+ // origin
+ forwardStack.push(srcOrigin);
+ forwardVisited.add(srcOrigin);
+ while (!forwardStack.isEmpty()) {
+ long curr = forwardStack.pop();
+            LazyLongIterator it = g.successors(curr);
+ boolean isRootRevision = true;
+ for (long succ; (succ = it.nextLong()) != -1;) {
+ Node.Type nt = g.getNodeType(succ);
+ if (!forwardVisited.contains(succ)
+ && nt != Node.Type.DIR && nt != Node.Type.CNT) {
+ forwardStack.push(succ);
+ forwardVisited.add(succ);
+ isRootRevision = false;
+ }
+ }
+ if (g.getNodeType(curr) == Node.Type.REV && isRootRevision) {
+ // Found a root revision, add it to the second stack
+ backwardStack.push(curr);
+ backwardVisited.add(curr);
+ }
+ }
+
+ // Second traversal (backward graph): find all the origins containing
+ // any of these root revisions and print them
+ while (!backwardStack.isEmpty()) {
+ long curr = backwardStack.pop();
+            LazyLongIterator it = g.predecessors(curr);
+ for (long succ; (succ = it.nextLong()) != -1;) {
+ Node.Type nt = g.getNodeType(succ);
+ if (!backwardVisited.contains(succ)) {
+ backwardStack.push(succ);
+ backwardVisited.add(succ);
+ if (nt == Node.Type.ORI) {
+ // Found an origin, print it.
+ System.out.println(g.getSWHID(succ));
+ }
+ }
+ }
+ }
+ }
+
+ // Usage: $PROGRAM <GRAPH_BASENAME> <ORI_SWHID>
+ public static void main(String[] args) {
+ // Load both forward and backward graphs as a SwhBidirectionalGraph
+ SwhBidirectionalGraph g = SwhBidirectionalGraph.loadMapped(args[0]);
+ long node = g.getNodeID(new SWHID(args[1]));
+ findSharedCommitForks(g, node);
+ }
+
+
+Large-scale processing
+----------------------
+
+Multithreading
+~~~~~~~~~~~~~~
+
+ImmutableGraph is not thread-safe. When writing multithreaded algorithms,
+calling ``successors()`` on the same graph from multiple threads will return
+garbage.
+
+Instead, each thread should create its own "lightweight copy" of the graph by
+calling ``.copy()``. This will not actually copy the entire graph data, which
+will remain shared across threads, but it will create new instances of the
+iterators so that each thread can independently iterate on the graph data.
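+
+As a sketch of what this looks like in practice, each worker thread below
+copies the graph once and then only uses its own copy. ``processArc`` is a
+hypothetical per-edge callback (not a library method), and ``copy()`` is
+assumed to return a graph of the same type:
+
+.. code-block:: java
+
+    public static void processInParallel(SwhUnidirectionalGraph g, int numThreads)
+            throws InterruptedException {
+        ExecutorService service = Executors.newFixedThreadPool(numThreads);
+        for (int t = 0; t < numThreads; t++) {
+            final long threadId = t;
+            service.submit(() -> {
+                // Lightweight copy: the graph data stays shared, the iterators do not
+                SwhUnidirectionalGraph copy = g.copy();
+                for (long node = threadId; node < copy.numNodes(); node += numThreads) {
+                    LazyLongIterator it = copy.successors(node);
+                    for (long succ; (succ = it.nextLong()) != -1;) {
+                        processArc(node, succ);
+                    }
+                }
+            });
+        }
+        service.shutdown();
+        service.awaitTermination(Long.MAX_VALUE, TimeUnit.SECONDS);
+    }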
+
+
+Data structures for large traversals
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When doing very large traversals, such as a BFS on the entire graph, the
+usual data structures (HashSet, Stack, ArrayDeque, etc.) will be quite
+inefficient. If you know you are going to traverse large parts of the graph,
+it is better to use more appropriate data structures, many of which can be
+found in the dsiutils library. In particular:
+
+- `LongArrayBitVector
+ <https://dsiutils.di.unimi.it/docs/it/unimi/dsi/bits/LongArrayBitVector.html>`_
+  is an efficient bit-vector implementation, which can be used to store the
+  set of nodes already seen during the visit. Its memory footprint is wasteful
+  for small traversals, but it is very efficient for traversing the full
+  graph, as every node only takes a single bit.
+
+- `ByteDiskQueue
+ <https://dsiutils.di.unimi.it/docs/it/unimi/dsi/io/ByteDiskQueue.html>`_ can
+ be used to efficiently store the queue of nodes to visit on disk, when it is
+ too large to fit in RAM.
+
+Other types in dsiutils and fastutil can save significant memory:
+``LongArrayList`` saves at least 8 bytes per entry over ``ArrayList<Long>``,
+and ``Long2LongOpenHashMap`` saves at least 16 bytes for every entry over
+``HashMap<Long, Long>``. We strongly recommend reading the documentation of the
+unimi libraries and looking at the code for usage examples.
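+
+For illustration, here is a sketch of a whole-graph BFS that uses
+``LongArrayBitVector`` as the visited set. For brevity, the frontier is a
+plain in-memory ``LongArrayFIFOQueue`` from fastutil; when it does not fit in
+RAM, ``ByteDiskQueue`` can be substituted:
+
+.. code-block:: java
+
+    public static void bfs(SwhUnidirectionalGraph g, long startNode) {
+        // One bit per node: the whole visited set takes numNodes() / 8 bytes
+        LongArrayBitVector visited = LongArrayBitVector.ofLength(g.numNodes());
+        LongArrayFIFOQueue queue = new LongArrayFIFOQueue();
+
+        queue.enqueue(startNode);
+        visited.set(startNode);
+        while (!queue.isEmpty()) {
+            long curr = queue.dequeueLong();
+            LazyLongIterator it = g.successors(curr);
+            for (long succ; (succ = it.nextLong()) != -1;) {
+                if (!visited.getBoolean(succ)) {
+                    visited.set(succ);
+                    queue.enqueue(succ);
+                }
+            }
+        }
+    }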
+
+
+BigArrays
+~~~~~~~~~
+
+When working with the Software Heritage graph, it is often necessary to store
+large arrays of values, with a size exceeding 2^32 items. Unfortunately,
+standard Java arrays do not support this, as they are indexed by ``int`` and
+therefore limited to 2^31 - 1 elements.
+
+To circumvent this, WebGraph uses the `BigArrays scheme
+<https://fastutil.di.unimi.it/docs/it/unimi/dsi/fastutil/BigArrays.html>`_ from
+the fastutil library: "big arrays" are stored as arrays of arrays, supporting
+quadrillions of records.
+
+A BigArray ``long[][] a`` can be used with the following methods:
+
+- ``BigArrays.get(a, i)`` to get the value at index *i*
+- ``BigArrays.set(a, i, v)`` to set the value at index *i* to *v*.
+- ``BigArrays.length(a)`` to get the total length of the bigarray.
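+
+For instance, a big array of longs can be allocated and accessed as follows
+(a minimal sketch; the array is created with fastutil's
+``LongBigArrays.newBigArray``):
+
+.. code-block:: java
+
+    // Allocate more elements than a standard Java array can hold
+    long[][] timestamps = LongBigArrays.newBigArray(3_000_000_000L);
+
+    // Read and write through the BigArrays static helpers
+    BigArrays.set(timestamps, 2_500_000_000L, 1234567890L);
+    long ts = BigArrays.get(timestamps, 2_500_000_000L);
+
+    System.out.println(BigArrays.length(timestamps)); // 3000000000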
diff --git a/docs/memory.rst b/docs/memory.rst
new file mode 100644
--- /dev/null
+++ b/docs/memory.rst
@@ -0,0 +1,130 @@
+.. _swh-graph-memory:
+
+Memory & Performance tuning
+===========================
+
+This page discusses various considerations related to memory usage and
+performance tuning when using the ``swh-graph`` library to load large
+compressed graphs.
+
+JVM options
+-----------
+
+In production, we tend to use very large servers with enough RAM to load the
+entire graph in memory. In these setups, the default JVM options are often
+suboptimal. We recommend starting the JVM with the following options, which
+tend to significantly improve performance::
+
+ java \
+ -ea \
+ -server \
+ -XX:PretenureSizeThreshold=512M \
+ -XX:MaxNewSize=4G \
+ -XX:+UseLargePages \
+ -XX:+UseTransparentHugePages \
+ -XX:+UseNUMA \
+ -XX:+UseTLAB \
+ -XX:+ResizeTLAB \
+
+These options are documented in the ``java(1)`` manual page and in the Oracle
+documentation.
+
+
+Temporary directory
+-------------------
+
+Many of the graph algorithms (either for compression or traversal) tend to
+offload some of their run-time memory to disk. For instance, the `BFS
+<https://law.di.unimi.it/software/law-docs/it/unimi/dsi/law/big/graph/BFS.html>`_
+algorithm in the LAW library uses a temporary directory to write its queue of
+nodes to visit.
+
+Because these can be quite large and sometimes overflow the default ``/tmp``
+partition, it is advised to systematically specify a path to a local temporary
+directory with enough space to accommodate the needs of the Java programs. This
+can be done using the ``-Djava.io.tmpdir`` parameter on the Java CLI::
+
+ java -Djava.io.tmpdir=/srv/softwareheritage/ssd/tmp
+
+
+Memory mapping vs Direct loading
+--------------------------------
+
+The main dial you can use to manage your memory usage is the choice between
+memory-mapping and direct-loading the graph data. The different loading modes
+available when loading the graph are documented in :ref:`swh-graph-java-api`.
+
+Loading in mapped mode will not load any extra data in RAM, but will instead
+use the ``mmap(2)`` syscall to map the on-disk graph file into the virtual
+address space. The Linux kernel is then free to cache the file, either
+partially or in its entirety, depending on the available memory space.
+
+In our experiments, memory-mapping a small graph from an SSD only incurs a
+relatively small slowdown (about 15-20%). However, when the graph is too big to
+fit in RAM, the kernel has to constantly invalidate pages to cache newly
+accessed sections, which incurs a very large performance penalty. A full
+traversal of a large graph that usually takes about 20 hours when loaded in
+main memory could take more than a year when mapped from a hard drive!
+
+When deciding what to direct-load and what to memory-map, here are a few rules
+of thumb:
+
+- If you don't need random access to the graph edges, you can consider using
+  the "offline" loading mode. The offsets won't be loaded, which saves dozens
+  of gigabytes of RAM.
+
+- If you only need to query some specific nodes or run trivial traversals,
+  memory-mapping the graph from an HDD should be a reasonable solution that
+  doesn't take an inordinate amount of time. It might be bad for your disks,
+  though.
+
+- If you are constrained in available RAM, memory-mapping the graph from an SSD
+ offers reasonable performance for reasonably complex algorithms.
+
+- If you have a heavy workload (e.g., running a full traversal of the entire
+  graph) and you can afford the RAM, direct loading will be orders of magnitude
+  faster than all the above options.
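+
+As a sketch, choosing a loading strategy could look like the following.
+``loadMapped`` is used elsewhere in this documentation; the ``load`` and
+``loadOffline`` variants are assumptions here and should be checked against
+:ref:`swh-graph-java-api`:
+
+.. code-block:: java
+
+    public static SwhUnidirectionalGraph loadForWorkload(String basename, String mode)
+            throws IOException {
+        if (mode.equals("offline")) {
+            // No random access to the edges, minimal RAM usage (assumed variant)
+            return SwhUnidirectionalGraph.loadOffline(basename);
+        } else if (mode.equals("mapped")) {
+            // Memory-mapped from disk; RAM is only used through the page cache
+            return SwhUnidirectionalGraph.loadMapped(basename);
+        } else {
+            // Direct loading: everything in RAM, fastest for heavy workloads
+            // (assumed variant)
+            return SwhUnidirectionalGraph.load(basename);
+        }
+    }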
+
+
+Sharing mapped data across processes
+------------------------------------
+
+Often, multiple processes can be working on the same data (mappings or the
+graph itself), for instance when running different experiments on the same
+graph. This is problematic in terms of RAM usage when using direct memory
+loading, as the same data of potentially hundreds of gigabytes is loaded in
+memory twice.
+As we have seen, memory-mapping can be used to avoid storing redundant data in
+RAM, but comes at the cost of potentially slower I/O as the data is no longer
+guaranteed to be loaded in main memory and is reliant on kernel heuristics.
+
+To efficiently share data across different compressed graph processes,
+another option is to copy the graph data to a ``tmpfs`` not backed by disk swap,
+which forces the kernel to load it entirely in RAM. Subsequent memory-mappings
+of the files stored in the tmpfs will simply map the data stored in RAM to
+virtual memory pages, and return a pointer to the in-memory structure.
+
+To do so, we create a directory in ``/dev/shm`` in which we **copy** all the
+files that we want to direct-load in RAM, and **symlink** all the others. Then,
+we load the graph using the memory-mapped loading mode, which makes it use the
+shared memory stored in the tmpfs under the hood.
+
+Here is a systemd service that can be used to perform this task automatically:
+
+.. code-block:: ini
+
+ [Unit]
+ Description=swh-graph memory sharing in tmpfs
+
+ [Service]
+ Type=oneshot
+ RemainAfterExit=yes
+ ExecStart=mkdir -p /dev/shm/swh-graph/default
+ ExecStart=sh -c "ln -s /.../compressed/* /dev/shm/swh-graph/default"
+ ExecStart=cp --remove-destination /.../compressed/graph.graph /dev/shm/swh-graph/default
+ ExecStart=cp --remove-destination /.../compressed/graph-transposed.graph /dev/shm/swh-graph/default
+ ExecStop=rm -rf /dev/shm/swh-graph/default
+
+ [Install]
+ WantedBy=multi-user.target
diff --git a/docs/quickstart.rst b/docs/quickstart.rst
--- a/docs/quickstart.rst
+++ b/docs/quickstart.rst
@@ -1,51 +1,38 @@
+.. _swh-graph-quickstart:
+
Quickstart
==========
-This quick tutorial shows how to compress and browse a graph using ``swh.graph``.
-
-It does not cover the technical details behind the graph compression techniques
-(refer to :ref:`graph-compression`).
-
+This quick tutorial shows how to start the ``swh.graph`` service to query
+an existing compressed graph with the high-level HTTP API.
Dependencies
------------
In order to run the ``swh.graph`` tool, you will need Python (>= 3.7) and Java
-JRE, you do not need the JDK if you install the package from pypi, but may want
-to install it if you want to hack the code or install it from this git
-repository. To compress a graph, you will need zstd_ compression tools.
-
-It is highly recommended to install this package in a virtualenv.
-
-On a Debian stable (buster) system:
-
-.. code:: bash
-
- $ sudo apt install python3-virtualenv default-jre zstd
+JRE. On a Debian system:
+.. code:: console
-.. _zstd: https://facebook.github.io/zstd/
+ $ sudo apt install python3 python3-venv default-jre
-
-Install
--------
+Installing swh.graph
+--------------------
Create a virtualenv and activate it:
-.. code:: bash
+.. code:: console
- ~/tmp$ mkdir swh-graph-tests
- ~/tmp$ cd swh-graph-tests
- ~/t/swh-graph-tests$ virtualenv swhenv
- ~/t/swh-graph-tests$ . swhenv/bin/activate
+ $ python3 -m venv .venv
+ $ source .venv/bin/activate
Install the ``swh.graph`` python package:
-.. code:: bash
+.. code:: console
- (swhenv) ~/t/swh-graph-tests$ pip install swh.graph
+ (venv) $ pip install swh.graph
[...]
- (swhenv) ~/t/swh-graph-tests swh graph --help
+ (venv) $ swh graph --help
Usage: swh graph [OPTIONS] COMMAND [ARGS]...
Software Heritage graph tools.
@@ -55,106 +42,77 @@
-h, --help Show this message and exit.
Commands:
- api-client client for the graph RPC service
- cachemount Cache the mmapped files of the compressed graph in a tmpfs.
compress Compress a graph using WebGraph Input: a pair of files...
- map Manage swh-graph on-disk maps
rpc-serve run the graph RPC service
-Compression
------------
-
-Existing datasets
-^^^^^^^^^^^^^^^^^
-
-You can directly use compressed graph datasets provided by Software Heritage.
-Here is a small and realistic dataset (3.1GB):
-
- https://annex.softwareheritage.org/public/dataset/graph/latest/popular-3k-python/python3kcompress.tar
-
-.. code:: bash
- (swhenv) ~/t/swh-graph-tests$ curl -O https://annex.softwareheritage.org/public/dataset/graph/latest/popular-3k-python/python3kcompress.tar
- (swhenv) ~/t/swh-graph-tests$ tar xvf python3kcompress.tar
- (swhenv) ~/t/swh-graph-tests$ touch python3kcompress/*.obl # fix the mtime of cached offset files to allow faster loading
+.. _swh-graph-retrieving-compressed:
-Note: not for the faint heart, but the full dataset is available at:
+Retrieving a compressed graph
+-----------------------------
- https://annex.softwareheritage.org/public/dataset/graph/latest/compressed/
+Software Heritage provides a list of off-the-shelf datasets that can be used
+for various research or prototyping purposes. Most of them are available in
+*compressed* representation, i.e., in a format suitable to be loaded and
+queried by the ``swh-graph`` library.
-Own datasets
-^^^^^^^^^^^^
+All the publicly available datasets are documented on this page:
+https://docs.softwareheritage.org/devel/swh-dataset/graph/dataset.html
-A graph is described as both its adjacency list and the set of nodes
-identifiers in plain text format. Such graph example can be found in the
-``swh/graph/tests/dataset/`` folder.
+A good way of retrieving these datasets is to use the `AWS S3 CLI
+<https://docs.aws.amazon.com/cli/latest/reference/s3/>`_.
-You can compress the example graph on the command line like this:
+Here is an example with the dataset ``2021-03-23-popular-3k-python``, which has
+a relatively reasonable size (~15 GiB including property data, with
+the compressed graph itself being less than 700 MiB):
-.. code:: bash
+.. code:: console
+ (venv) $ pip install awscli
+ [...]
+ (venv) $ mkdir -p 2021-03-23-popular-3k-python/compressed
+ (venv) $ cd 2021-03-23-popular-3k-python/
+ (venv) $ aws s3 cp --recursive s3://softwareheritage/graph/2021-03-23-popular-3k-python/compressed/ compressed
- (swhenv) ~/t/swh-graph-tests$ swh graph compress --graph swh/graph/tests/dataset/example --outdir output/
- [...]
+You can also retrieve larger graphs, but note that these graphs are generally
+intended to be loaded fully in RAM, and do not fit on ordinary desktop
+machines. The server we use in production to run the graph service has more
+than 700 GiB of RAM. These memory considerations are discussed in more detail
+in :ref:`swh-graph-memory`.
- (swhenv) ~/t/swh-graph-tests$ ls output/
- example-bv.properties example.mph example.obl example.outdegree example.swhid2node.bin example-transposed.offsets
- example.graph example.node2swhid.bin example.offsets example.properties example-transposed.graph example-transposed.properties
- example.indegree example.node2type.map example.order example.stats example-transposed.obl
+**Note:** for testing purposes, a fake test dataset is available in the
+``swh-graph`` repository, with just a few dozen nodes. Its basename is
+``swh-graph/swh/graph/tests/dataset/compressed/example``.
API server
----------
-To start a ``swh.graph`` API server of a compressed graph dataset, run:
+To start a ``swh.graph`` API server of a compressed graph dataset, you need to
+use the ``rpc-serve`` command with the basename of the graph, which is the path prefix
+of all the graph files (e.g., with the basename ``compressed/graph``, it will
+attempt to load the files located at
+``compressed/graph.{graph,properties,offsets,...}``).
-.. code:: bash
+In our example:
+
+.. code:: console
- (swhenv) ~/t/swh-graph-tests$ swh graph rpc-serve -g output/example
- Loading graph output/example ...
+ (venv) $ swh graph rpc-serve -g compressed/graph
+ Loading graph compressed/graph ...
Graph loaded.
======== Running on http://0.0.0.0:5009 ========
(Press CTRL+C to quit)
From there you can use this endpoint to query the compressed graph, for example
-with httpie_ (``sudo apt install``) from another terminal:
+with httpie_ (``sudo apt install httpie``):
.. _httpie: https://httpie.org
.. code:: bash
- ~/tmp$ http :5009/graph/visit/nodes/swh:1:rel:0000000000000000000000000000000000000010
- HTTP/1.1 200 OK
- Content-Type: text/plain
- Date: Tue, 15 Sep 2020 08:33:25 GMT
- Server: Python/3.8 aiohttp/3.6.2
- Transfer-Encoding: chunked
-
- swh:1:rel:0000000000000000000000000000000000000010
- swh:1:rev:0000000000000000000000000000000000000009
- swh:1:rev:0000000000000000000000000000000000000003
- swh:1:dir:0000000000000000000000000000000000000002
- swh:1:cnt:0000000000000000000000000000000000000001
- swh:1:dir:0000000000000000000000000000000000000008
- swh:1:dir:0000000000000000000000000000000000000006
- swh:1:cnt:0000000000000000000000000000000000000004
- swh:1:cnt:0000000000000000000000000000000000000005
- swh:1:cnt:0000000000000000000000000000000000000007
-
-
-Running the existing ``python3kcompress`` dataset:
-
-.. code:: bash
-
- (swhenv) ~/t/swh-graph-tests$ swh graph rpc-serve -g python3kcompress/python3k
- Loading graph python3kcompress/python3k ...
- Graph loaded.
- ======== Running on http://0.0.0.0:5009 ========
- (Press CTRL+C to quit)
-
-
~/tmp$ http :5009/graph/leaves/swh:1:dir:432d1b21c1256f7408a07c577b6974bbdbcc1323
HTTP/1.1 200 OK
Content-Type: text/plain
@@ -170,5 +128,5 @@
swh:1:cnt:a6b60e797063fef707bbaa4f90cfb4a2cbbddd4a
swh:1:cnt:cc0a1deca559c1dd2240c08156d31cde1d8ed406
-
-See the documentation of the :ref:`API <swh-graph-api>` for more details.
+See the documentation of the :ref:`API <swh-graph-api>` for more details on how
+to use the HTTP graph querying API.
diff --git a/docs/use-cases.rst b/docs/use-cases.rst
deleted file mode 100644
--- a/docs/use-cases.rst
+++ /dev/null
@@ -1,167 +0,0 @@
-=========
-Use cases
-=========
-
-
-This document lists use cases and benchmark scenarios for the Software Heritage
-graph service.
-
-
-Conventions
-===========
-
-- **Node identification**: in the following, nodes are always identified by
- their :ref:`SWHIDs <persistent-identifiers>`.
-
-
-Use cases
-=========
-
-
-Browsing
---------
-
-The following use cases require traversing the *forward graph*.
-
-- **ls**: given a directory node, list (non recursively) all linked nodes of
- type directory and content
-
- Implementation::
-
- /graph/neighbors/:DIR_ID?edges=dir:cnt,dir:dir
-
-- **ls -R**: given a directory node, recursively list all linked nodes of type
- directory and content
-
- Implementation::
-
- /graph/visit/paths/:DIR_ID?edges=dir:cnt,dir:dir
-
-- **git log**: given a revision node, recursively list all linked nodes of type
- revision
-
- Implementation::
-
- /graph/visit/nodes/:REV_ID?edges=rev:rev
-
-
-Vault
------
-
-The following use cases require traversing the *forward graph*.
-
-- **tarball** (same as *ls -R* above)
-
-- **git bundle**: given a node, recursively list all linked nodes of any kind
-
- Implementation::
-
- /graph/visit/nodes/:NODE_ID?edges=*
-
-
-Provenance
-----------
-
-The following use cases require traversing the *backward (transposed)
-graph*.
-
-- **commit provenance**: given a content or directory node, return *a* commit
- whose directory (recursively) contains it
-
- Implementation::
-
- /graph/walk/:NODE_ID/rev?direction=backward&edges=dir:dir,cnt:dir,dir:rev
-
-- **complete commit provenance**: given a content or directory node, return
- *all* commits whose directory (recursively) contains it
-
- Implementation::
-
- /graph/leaves/:NODE_ID?direction=backward&edges=dir:dir,cnt:dir,dir:rev
-
-- **origin provenance**: given a content, directory, or commit node, return
- *an* origin that has at least one snapshot that (recursively) contains it
-
- Implementation::
-
- /graph/walk/:NODE_ID/ori?direction=backward&edges=*
-
-- **complete origin provenance**: given a content, directory, or commit node,
- return *all* origins that have at least one snapshot that (recursively)
- contains it
-
- Implementation::
-
- /graph/leaves/:NODE_ID?direction=backward&edges=*
-
-- *SLOC tracking*: left as future work
-
-
-Provenance statistics
-~~~~~~~~~~~~~~~~~~~~~
-
-The following use cases require traversing the *backward (transposed)
-graph*.
-
-- **content popularity across commits**: for each content, count the number of
- commits (or *commit popularity*) that link to a directory that (recursively)
- includes it. Plot the distribution of content popularity across commits.
-
- Implementation: apply *complete commit provenance* to each content node,
- count the returned commits, aggregate.
-
-- **commit popularity across origins**: for each commit, count the number of
- origins (or *origin popularity*) that have a snapshot that (recursively)
- includes it. Plot the distribution of commit popularity across origins.
-
- Implementation: apply *complete origin provenance* to each commit node, count
- the returned origins, aggregate.
-
-- *SLOC popularity across contents*: left as future work
-
-The following use cases require traversing the *forward graph*.
-
-- **revision size** (as n. of contents) distribution: for each revision, count
- the number of contents that are (recursively) reachable from it. Plot the
- distribution of revision sizes.
-
-- **origin size** (as n. of revisions) distribution: for each origin, count the
- number of revisions that are (recursively) reachable from it. Plot the
- distribution of origin sizes.
-
-
-Benchmarks
-==========
-
-Notes on how to benchmark graph access:
-
-- separate pure-graph timings from timings related to use additional mappings
- (e.g., node types), no matter if the mappings are in-memory or on-disk
-
-- separate in-memory timings from on-disk timings; in particular, separate the
- timing of translating node identifiers between internal integers and SWHIDs
-
-- for each use case that requires a node as input, we will randomize the choice
- of the input node and repeat the experiment a suitable number of times; where
- possible we will aggregate results computing basic statistics (average,
- standard deviation), as well as normalize results w.r.t. the “size” of the
- chosen node (e.g., number of nodes/path length in the resulting visit)
-
-
-Basic benchmarks
-----------------
-
-- **Edge traversal**: given a node, retrieve the first node in its adjacency
- list.
-
- For reference: Apostolico, Drovandi in *Graph Compression by BFS* report
- times to retrieve the adjacency list of a node (and/or test if an edge exists
- between two nodes) in the 2-3 us range, for the largest graph in their
- experiments (22 M nodes, 600 M edges).
-
-
-Each use case is a benchmark
-----------------------------
-
-In addition to abstract benchmark, we will use each use case above as a
-scenario-based benchmark.
diff --git a/java/src/main/java/org/softwareheritage/graph/compress/NodeMapBuilder.java b/java/src/main/java/org/softwareheritage/graph/compress/NodeMapBuilder.java
--- a/java/src/main/java/org/softwareheritage/graph/compress/NodeMapBuilder.java
+++ b/java/src/main/java/org/softwareheritage/graph/compress/NodeMapBuilder.java
@@ -27,8 +27,7 @@
* Create maps needed at runtime by the graph service, in particular:
* <p>
* <ul>
- * <li>SWHID → WebGraph long node id</li>
- * <li>WebGraph long node id → SWHID (converse of the former)</li>
+ * <li>WebGraph long node id → SWHID</li>
* <li>WebGraph long node id → SWH node type (enum)</li>
* </ul>
*
diff --git a/swh/graph/webgraph.py b/swh/graph/webgraph.py
--- a/swh/graph/webgraph.py
+++ b/swh/graph/webgraph.py
@@ -179,6 +179,14 @@
"{tmp_dir}",
"< {out_dir}/{graph_name}.nodes.csv.zst",
],
+ CompressionStep.EXTRACT_PERSONS: [
+ "{java}",
+ "org.softwareheritage.graph.compress.ExtractPersons",
+ "--temp-dir",
+ "{tmp_dir}",
+ "{in_dir}",
+ "{out_dir}/{graph_name}",
+ ],
CompressionStep.MPH_PERSONS: [
"{java}",
"it.unimi.dsi.sux4j.mph.GOVMinimalPerfectHashFunction",
@@ -190,14 +198,6 @@
"{out_dir}/{graph_name}.persons.mph",
"{out_dir}/{graph_name}.persons.csv.zst",
],
- CompressionStep.EXTRACT_PERSONS: [
- "{java}",
- "org.softwareheritage.graph.compress.ExtractPersons",
- "--temp-dir",
- "{tmp_dir}",
- "{in_dir}",
- "{out_dir}/{graph_name}",
- ],
CompressionStep.NODE_PROPERTIES: [
"{java}",
"org.softwareheritage.graph.compress.WriteNodeProperties",
