Page Menu
Home
Software Heritage
Search
Configure Global Search
Log In
Files
F7066551
D1687.id5675.diff
No One
Temporary
Actions
View File
Edit File
Delete File
View Transforms
Subscribe
Mute Notifications
Award Token
Flag For Later
Size
5 KB
Subscribers
None
D1687.id5675.diff
View Options
diff --git a/docs/compression.rst b/docs/compression.rst
new file mode 100644
--- /dev/null
+++ b/docs/compression.rst
@@ -0,0 +1,94 @@
+Graph compression
+=================
+
+The compression process is based on the `WebGraph framework
+<http://webgraph.di.unimi.it/>`_ and ecosystem libraries.
+
+References used:
+
+- Paolo Boldi, Sebastiano Vigna, *The webgraph framework I: compression
+ techniques*, Proceedings of the 13th international conference on World Wide
+ Web. ACM, 2004. `pdf <http://vigna.di.unimi.it/ftp/papers/WebGraphI.pdf>`_
+- Paolo Boldi, Marco Rosa, Massimo Santini, Sebastiano Vigna, *Layered label
+ propagation: A multiresolution coordinate-free ordering for compressing social
+ networks*. `arxiv <https://arxiv.org/abs/1011.5425>`_
+- Alberto Apostolico, Guido Drovandi, *Graph compression by BFS*, Algorithms
+ 2009, 2(3), 1031-1044. `mdpi <https://www.mdpi.com/1999-4893/2/3/1031/pdf>`_
+
+.. figure:: images/compression_steps.png
+ :alt: Compression steps
+
+ Compression steps
+
+1. MPH
+------
+
+A node in the `Software Heritage graph
+<https://docs.softwareheritage.org/devel/swh-model/data-model.html>`_ is
+identified using its string `persistent identifier
+<https://docs.softwareheritage.org/devel/swh-model/persistent-identifiers.html#persistent-identifiers>`_.
+However, WebGraph internally uses integers to refer to node ids.
+
+Mapping between the strings and longs ids is needed before compressing the
+graph. From the `Sux4J <http://sux.di.unimi.it/>`_ utility tool, we use the
+`GOVMinimalPerfectHashFunction
+<http://sux.di.unimi.it/docs/it/unimi/dsi/sux4j/mph/GOVMinimalPerfectHashFunction.html>`_
+class, mapping with no collisions $n$ keys to $n$ consecutive integers.
+
+The step produces a ``.mph`` file (MPH stands for *Minimal Perfect
+Hash-function*) storing the hash function taking as input a string and returning
+a unique integer.
+
+2. BV compress
+--------------
+
+This is the first actual compression step, building a compressed version of the
+input graph using WebGraph techniques presented in the framework paper.
+
+The resulting BV graph is stored as a set of files:
+
+- ``.graph``: the compressed graph in the BV format
+- ``.offsets``: offsets values to read the bit stream graph file
+- ``.obl``: offsets cache to load the graph faster
+- ``.properties``: entries used to correctly decode graph and offset files
+
+3. BFS
+-------
+
+In the LLP paper, authors propose an empirical analysis linking node ordering
+and high compression ratio: it is important to use an ordering of nodes ids such
+that vertices from the same host are close to one another.
+
+Building on this insight, the previous compression results in the BV compress
+step are improved by re-ordering nodes ids using a BFS traversal order.
+
+The resulting ordering is stored in the ``.order`` file, listing nodes ids in
+order of traversal.
+
+4. Permute
+----------
+
+Once the order is computed (BFS or another ordering technique), the final
+compressed graph is created based on the initial BV compress result, and using
+the new node order mapping.
+
+The final compressed graph is only stored in the resulting ``.graph``,
+``.offsets``, ``.obl``, and ``.properties`` files.
+
+Extra steps
+-----------
+
+5. Stats
+~~~~~~~~
+
+Compute various statistics on the final compressed graph:
+
+- ``.stats``: entries such as number of nodes, edges, avg/min/max degree,
+ average locality, etc.
+- ``.indegree``: graph indegree distribution
+- ``.outdegree``: graph outdegree distribution
+
+6. Transpose
+~~~~~~~~~~~~
+
+Create a transposed graph to allow backward traversal.
diff --git a/docs/images/.gitignore b/docs/images/.gitignore
new file mode 100644
--- /dev/null
+++ b/docs/images/.gitignore
@@ -0,0 +1 @@
+compression_steps.png
diff --git a/docs/images/Makefile b/docs/images/Makefile
new file mode 100644
--- /dev/null
+++ b/docs/images/Makefile
@@ -0,0 +1,9 @@
+all: compression_steps.png
+
+%.png: %.dot
+ dot -Gdpi=300 -Tpng $< -o $@
+
+.PHONY: clean
+
+clean:
+ rm compression_steps.png
diff --git a/docs/images/compression_steps.dot b/docs/images/compression_steps.dot
new file mode 100644
--- /dev/null
+++ b/docs/images/compression_steps.dot
@@ -0,0 +1,51 @@
+digraph "Compression steps" {
+ // Horizontal graph
+ rankdir=LR;
+
+ subgraph {
+ input_edges [label="swh.edges.csv.gz", fontsize=9, shape=none];
+ input_nodes [label="swh.nodes.csv.gz", fontsize=9, shape=none];
+ {rank=same; input_edges; input_nodes;}
+ }
+
+ mph [label="MPH", shape=box];
+ mph_out [label="swh.mph", fontsize=9, shape=none];
+
+ bv_compress [label="BV compress", shape=box];
+ bv_compress_out
+ [label="swh-bv.graph\lswh-bv.offsets\lswh-bv.obl\lswh-bv.properties",
+ fontsize=9, shape=none];
+
+ bfs [label="BFS", shape=box];
+ bfs_out [label="swh.order", fontsize=9, shape=none];
+
+ permute [label="Permute", shape=box];
+ permute_out
+ [label="swh.graph\lswh.offsets\lswh.obl\lswh.properties",
+ fontsize=9, shape=none];
+
+ stats [label="Stats", shape=box];
+ stats_out
+ [label="swh.stats\lswh.indegree\lswh.outdegree",
+ fontsize=9, shape=none];
+
+ transpose [label="Transpose", shape=box];
+ transpose_out
+ [label="swh-transposed.graph\lswh-transposed.offsets\lswh-transposed.obl\lswh-transposed.properties",
+ fontsize=9, shape=none];
+
+ input_nodes -> mph;
+ input_edges -> bv_compress;
+ mph -> mph_out;
+ mph_out -> bv_compress;
+ bv_compress -> bv_compress_out;
+ bv_compress_out-> bfs;
+ bv_compress_out-> permute;
+ bfs -> bfs_out;
+ bfs_out -> permute;
+ permute -> permute_out;
+ permute_out -> stats;
+ permute_out -> transpose;
+ stats -> stats_out;
+ transpose -> transpose_out;
+}
File Metadata
Details
Attached
Mime Type
text/plain
Expires
Nov 5 2024, 3:00 PM (11 w, 15 h ago)
Storage Engine
blob
Storage Format
Raw Data
Storage Handle
3220609
Attached To
D1687: Add graph compression documentation
Event Timeline
Log In to Comment