Page MenuHomeSoftware Heritage

D1687.id5675.diff
No OneTemporary

D1687.id5675.diff

diff --git a/docs/compression.rst b/docs/compression.rst
new file mode 100644
--- /dev/null
+++ b/docs/compression.rst
@@ -0,0 +1,94 @@
+Graph compression
+=================
+
+The compression process is based on the `WebGraph framework
+<http://webgraph.di.unimi.it/>`_ and ecosystem libraries.
+
+References used:
+
+- Paolo Boldi, Sebastiano Vigna, *The webgraph framework I: compression
+ techniques*, Proceedings of the 13th international conference on World Wide
+ Web. ACM, 2004. `pdf <http://vigna.di.unimi.it/ftp/papers/WebGraphI.pdf>`_
+- Paolo Boldi, Marco Rosa, Massimo Santini, Sebastiano Vigna, *Layered label
+ propagation: A multiresolution coordinate-free ordering for compressing social
+ networks*. `arxiv <https://arxiv.org/abs/1011.5425>`_
+- Alberto Apostolico, Guido Drovandi, *Graph compression by BFS*, Algorithms
+ 2009, 2(3), 1031-1044. `mdpi <https://www.mdpi.com/1999-4893/2/3/1031/pdf>`_
+
+.. figure:: images/compression_steps.png
+ :alt: Compression steps
+
+ Compression steps
+
+1. MPH
+------
+
+A node in the `Software Heritage graph
+<https://docs.softwareheritage.org/devel/swh-model/data-model.html>`_ is
+identified using its string `persistent identifier
+<https://docs.softwareheritage.org/devel/swh-model/persistent-identifiers.html#persistent-identifiers>`_.
+However, WebGraph internally uses integers to refer to node ids.
+
+Mapping between the strings and longs ids is needed before compressing the
+graph. From the `Sux4J <http://sux.di.unimi.it/>`_ utility tool, we use the
+`GOVMinimalPerfectHashFunction
+<http://sux.di.unimi.it/docs/it/unimi/dsi/sux4j/mph/GOVMinimalPerfectHashFunction.html>`_
+class, mapping with no collisions $n$ keys to $n$ consecutive integers.
+
+The step produces a ``.mph`` file (MPH stands for *Minimal Perfect
+Hash-function*) storing the hash function taking as input a string and returning
+a unique integer.
+
+2. BV compress
+--------------
+
+This is the first actual compression step, building a compressed version of the
+input graph using WebGraph techniques presented in the framework paper.
+
+The resulting BV graph is stored as a set of files:
+
+- ``.graph``: the compressed graph in the BV format
+- ``.offsets``: offsets values to read the bit stream graph file
+- ``.obl``: offsets cache to load the graph faster
+- ``.properties``: entries used to correctly decode graph and offset files
+
+3. BFS
+-------
+
+In the LLP paper, authors propose an empirical analysis linking node ordering
+and high compression ratio: it is important to use an ordering of nodes ids such
+that vertices from the same host are close to one another.
+
+Building on this insight, the previous compression results in the BV compress
+step are improved by re-ordering nodes ids using a BFS traversal order.
+
+The resulting ordering is stored in the ``.order`` file, listing nodes ids in
+order of traversal.
+
+4. Permute
+----------
+
+Once the order is computed (BFS or another ordering technique), the final
+compressed graph is created based on the initial BV compress result, and using
+the new node order mapping.
+
+The final compressed graph is only stored in the resulting ``.graph``,
+``.offsets``, ``.obl``, and ``.properties`` files.
+
+Extra steps
+-----------
+
+5. Stats
+~~~~~~~~
+
+Compute various statistics on the final compressed graph:
+
+- ``.stats``: entries such as number of nodes, edges, avg/min/max degree,
+ average locality, etc.
+- ``.indegree``: graph indegree distribution
+- ``.outdegree``: graph outdegree distribution
+
+6. Transpose
+~~~~~~~~~~~~
+
+Create a transposed graph to allow backward traversal.
diff --git a/docs/images/.gitignore b/docs/images/.gitignore
new file mode 100644
--- /dev/null
+++ b/docs/images/.gitignore
@@ -0,0 +1 @@
+compression_steps.png
diff --git a/docs/images/Makefile b/docs/images/Makefile
new file mode 100644
--- /dev/null
+++ b/docs/images/Makefile
@@ -0,0 +1,9 @@
+all: compression_steps.png
+
+%.png: %.dot
+ dot -Gdpi=300 -Tpng $< -o $@
+
+.PHONY: clean
+
+clean:
+ rm compression_steps.png
diff --git a/docs/images/compression_steps.dot b/docs/images/compression_steps.dot
new file mode 100644
--- /dev/null
+++ b/docs/images/compression_steps.dot
@@ -0,0 +1,51 @@
+digraph "Compression steps" {
+ // Horizontal graph
+ rankdir=LR;
+
+ subgraph {
+ input_edges [label="swh.edges.csv.gz", fontsize=9, shape=none];
+ input_nodes [label="swh.nodes.csv.gz", fontsize=9, shape=none];
+ {rank=same; input_edges; input_nodes;}
+ }
+
+ mph [label="MPH", shape=box];
+ mph_out [label="swh.mph", fontsize=9, shape=none];
+
+ bv_compress [label="BV compress", shape=box];
+ bv_compress_out
+ [label="swh-bv.graph\lswh-bv.offsets\lswh-bv.obl\lswh-bv.properties",
+ fontsize=9, shape=none];
+
+ bfs [label="BFS", shape=box];
+ bfs_out [label="swh.order", fontsize=9, shape=none];
+
+ permute [label="Permute", shape=box];
+ permute_out
+ [label="swh.graph\lswh.offsets\lswh.obl\lswh.properties",
+ fontsize=9, shape=none];
+
+ stats [label="Stats", shape=box];
+ stats_out
+ [label="swh.stats\lswh.indegree\lswh.outdegree",
+ fontsize=9, shape=none];
+
+ transpose [label="Transpose", shape=box];
+ transpose_out
+ [label="swh-transposed.graph\lswh-transposed.offsets\lswh-transposed.obl\lswh-transposed.properties",
+ fontsize=9, shape=none];
+
+ input_nodes -> mph;
+ input_edges -> bv_compress;
+ mph -> mph_out;
+ mph_out -> bv_compress;
+ bv_compress -> bv_compress_out;
+ bv_compress_out-> bfs;
+ bv_compress_out-> permute;
+ bfs -> bfs_out;
+ bfs_out -> permute;
+ permute -> permute_out;
+ permute_out -> stats;
+ permute_out -> transpose;
+ stats -> stats_out;
+ transpose -> transpose_out;
+}

File Metadata

Mime Type
text/plain
Expires
Nov 5 2024, 3:00 PM (11 w, 15 h ago)
Storage Engine
blob
Storage Format
Raw Data
Storage Handle
3220609

Event Timeline