diff --git a/docs/compression.rst b/docs/compression.rst new file mode 100644 --- /dev/null +++ b/docs/compression.rst @@ -0,0 +1,74 @@ +Graph compression +================= + +The compression process is based on the `WebGraph framework +`_ and ecosystem libraries. + +References used: + +- Paolo Boldi, Sebastiano Vigna, *The webgraph framework I: compression + techniques*, Proceedings of the 13th international conference on World Wide + Web. ACM, 2004. `pdf `_ +- Paolo Boldi, Marco Rosa, Massimo Santini, Sebastiano Vigna, *Layered label + propagation: A multiresolution coordinate-free ordering for compressing social + networks*. `arxiv `_ + +1. MPH +------ + +A node in the `Software Heritage graph +`_ is +identified using its string `persistent identifier +`_. +However, WebGraph internally uses integers to refer to node ids. + +Mapping between the strings and longs ids is needed before compressing the +graph. From the `Sux4J `_ utility tool, we can use the +`GOVMinimalPerfectHashFunction +`_ +class, mapping with no collisions $n$ keys to $n$ consecutive integers. + +The step produces a ``.mph`` file storing the hash function taking as input a +string and returning a unique integer. + +2. BVGraph +---------- + +This is the first actual compression step, building a compressed version of the +input graph using WebGraph techniques presented in the framework paper. + +The graph is stored using multiple files: ``.graph``, ``.offsets``, ``.obl``, +and ``.properties``. + +3. BFS +------- + +In the LLP paper, authors propose an empirical analysis linking node ordering +and high compression ratio: it is important to use an ordering of nodes ids such +that vertices from the same host are close to one another. + +Building on this insight, the previous compression results in the BVGraph step +are improved by re-ordering nodes ids using a BFS traversal order. + +The ordering is stored in the ``.order`` file, listing nodes ids in order of +traversal. + +4. Permutation +-------------- + +Once the order is computed (BFS or another ordering technique), the final +compressed graph is created based on the initial BVGraph result, and using the +new node order mapping. + +Extra steps +----------- + +5. Stats +~~~~~~~~ + +Various statistics on the final compressed graph are computed in ``.stats``. + +6. Transpose +~~~~~~~~~~~~ + +A transposed graph is created to allow backward traversal.