diff --git a/docs/compression.rst b/docs/compression.rst new file mode 100644 index 0000000..3e91cec --- /dev/null +++ b/docs/compression.rst @@ -0,0 +1,106 @@ +Graph compression +================= + +The compression process is based on the `WebGraph framework +`_ and ecosystem libraries. + +References used: + +- Paolo Boldi, Sebastiano Vigna, *The webgraph framework I: compression + techniques*, Proceedings of the 13th international conference on World Wide + Web. ACM, 2004. `pdf `_ +- Paolo Boldi, Marco Rosa, Massimo Santini, Sebastiano Vigna, *Layered label + propagation: A multiresolution coordinate-free ordering for compressing social + networks*. `arxiv `_ +- Alberto Apostolico, Guido Drovandi, *Graph compression by BFS*, Algorithms + 2009, 2(3), 1031-1044. `mdpi `_ + +.. figure:: images/compression_steps.png + :alt: Compression steps + + Compression steps + +1. MPH +------ + +A node in the `Software Heritage graph +`_ is +identified using its string `persistent identifier +`_. +However, WebGraph internally uses integers to refer to node ids. + +Mapping between the strings and longs ids is needed before compressing the +graph. From the `Sux4J `_ utility tool, we use the +`GOVMinimalPerfectHashFunction +`_ +class, mapping with no collisions $n$ keys to $n$ consecutive integers. + +The step produces a ``.mph`` file (MPH stands for *Minimal Perfect +Hash-function*) storing the hash function taking as input a string and returning +a unique integer. + +2. BV compress +-------------- + +This is the first actual compression step, building a compressed version of the +input graph using WebGraph techniques presented in the framework paper. We use +the `ScatteredArcsASCIIGraph +`_ +class, from WebGraph. + +The resulting BV graph is stored as a set of files: + +- ``.graph``: the compressed graph in the BV format +- ``.offsets``: offsets values to read the bit stream graph file +- ``.obl``: offsets cache to load the graph faster +- ``.properties``: entries used to correctly decode graph and offset files + +3. BFS +------- + +In the LLP paper, authors propose an empirical analysis linking node ordering +and high compression ratio: it is important to use an ordering of nodes ids such +that vertices from the same host are close to one another. + +Building on this insight, the previous compression results in the BV compress +step are improved by re-ordering nodes ids using a BFS traversal order. We use +the `BFSBig <>`_ class from the `LAW `_ library. + +The resulting ordering is stored in the ``.order`` file, listing nodes ids in +order of traversal. + +4. Permute +---------- + +Once the order is computed (BFS or another ordering technique), the final +compressed graph is created based on the initial BV compress result, and using +the new node order mapping. The permutation uses the `Transform +`_ +class from WebGraph framework. + +The final compressed graph is only stored in the resulting ``.graph``, +``.offsets``, ``.obl``, and ``.properties`` files. + +Extra steps +----------- + +5. Stats +~~~~~~~~ + +Compute various statistics on the final compressed graph: + +- ``.stats``: entries such as number of nodes, edges, avg/min/max degree, + average locality, etc. +- ``.indegree``: graph indegree distribution +- ``.outdegree``: graph outdegree distribution + +This step uses the `Stats +`_ +class from WebGraph. + +6. Transpose +~~~~~~~~~~~~ + +Create a transposed graph to allow backward traversal, using the `Transform +`_ +class from WebGraph. diff --git a/docs/images/.gitignore b/docs/images/.gitignore new file mode 100644 index 0000000..ef17686 --- /dev/null +++ b/docs/images/.gitignore @@ -0,0 +1,2 @@ +compression_steps.png +compression_steps.svg diff --git a/docs/images/Makefile b/docs/images/Makefile new file mode 100644 index 0000000..01fbfa2 --- /dev/null +++ b/docs/images/Makefile @@ -0,0 +1,13 @@ +all: compression_steps.png compression_steps.svg + +%.png: %.dot + dot -Gdpi=300 -Tpng $< -o $@ + +%.svg: %.dot + dot -Tsvg $< -o $@ + +.PHONY: clean + +clean: + rm -f compression_steps.png + rm -f compression_steps.svg diff --git a/docs/images/compression_steps.dot b/docs/images/compression_steps.dot new file mode 100644 index 0000000..7156f62 --- /dev/null +++ b/docs/images/compression_steps.dot @@ -0,0 +1,51 @@ +digraph "Compression steps" { + // Horizontal graph + rankdir=LR; + + subgraph { + input_edges [label="swh.edges.csv.gz", fontsize=9, shape=none]; + input_nodes [label="swh.nodes.csv.gz", fontsize=9, shape=none]; + {rank=same; input_edges; input_nodes;} + } + + mph [label="MPH", shape=box]; + mph_out [label="swh.mph", fontsize=9, shape=none]; + + bv_compress [label="BV compress", shape=box]; + bv_compress_out + [label="swh-bv.graph\lswh-bv.offsets\lswh-bv.obl\lswh-bv.properties", + fontsize=9, shape=none]; + + bfs [label="BFS", shape=box]; + bfs_out [label="swh.order", fontsize=9, shape=none]; + + permute [label="Permute", shape=box]; + permute_out + [label="swh.graph\lswh.offsets\lswh.obl\lswh.properties", + fontsize=9, shape=none]; + + stats [label="Stats", shape=box]; + stats_out + [label="swh.stats\lswh.indegree\lswh.outdegree", + fontsize=9, shape=none]; + + transpose [label="Transpose", shape=box]; + transpose_out + [label="swh-transposed.graph\lswh-transposed.offsets\lswh-transposed.obl\lswh-transposed.properties", + fontsize=9, shape=none]; + + input_nodes -> mph; + input_edges -> bv_compress; + mph -> mph_out; + mph_out -> bv_compress; + bv_compress -> bv_compress_out; + bv_compress_out-> bfs; + bv_compress_out-> permute; + bfs -> bfs_out; + bfs_out -> permute; + permute -> permute_out; + permute_out -> stats; + permute_out -> transpose; + stats -> stats_out; + transpose -> transpose_out; +}