diff --git a/common/images/webgraphs_compression.pdf b/common/images/webgraphs_compression.pdf
new file mode 100644
index 0000000..90f21ac
Binary files /dev/null and b/common/images/webgraphs_compression.pdf differ
diff --git a/common/modules/graph-compression.org b/common/modules/graph-compression.org
new file mode 100644
index 0000000..e551fd9
--- /dev/null
+++ b/common/modules/graph-compression.org
@@ -0,0 +1,193 @@
+#+COLUMNS: %40ITEM %10BEAMER_env(Env) %9BEAMER_envargs(Env Args) %10BEAMER_act(Act) %4BEAMER_col(Col) %10BEAMER_extra(Extra) %8BEAMER_opt(Opt)
+#+INCLUDE: "prelude.org" :minlevel 1
+
+# Depends: \usepackage{pdfpages}
+
+* Graph Compression
+  :PROPERTIES:
+  :CUSTOM_ID: main
+  :END:
+** Graph compression on the Software Heritage archive
+***
+    #+BEGIN_EXPORT latex
+    \vspace{-3mm}
+    \begin{thebibliography}{Foo Bar, 1969}
+    \bibitem{Boldi2020} Paolo Boldi, Antoine Pietri, Sebastiano Vigna, Stefano Zacchiroli
+    \newblock Ultra-Large-Scale Repository Analysis via Graph Compression
+    \newblock SANER 2020, 27th Intl. Conf. on Software Analysis, Evolution and Reengineering. IEEE
+    \end{thebibliography}
+    #+END_EXPORT
+*** Research question
+    Is it possible to efficiently perform software development history analyses
+    at ultra-large scale (i.e., the scale of the Software Heritage archive or
+    larger) on a single, relatively cheap machine?
+*** Idea
+    Apply state-of-the-art graph compression techniques from the field of Web
+    graph / social network analysis.
+
+** Background --- Web graph compression
+   Borrowing (great!) slides from:
+   #+BEGIN_EXPORT latex
+   \begin{thebibliography}{}
+   \bibitem{Pibiri2018} Giulio Ermanno Pibiri
+   \newblock Effective Web Graph Representations, 2018
+   \newblock \url{http://pages.di.unipi.it/pibiri/slides/webgraphs\_compression.pdf}
+   \end{thebibliography}
+   #+END_EXPORT
+
+** Background --- Web graph compression (imported slides)
+   :PROPERTIES:
+   :BEAMER_env: ignoreheading
+   :END:
+   #+BEGIN_EXPORT latex
+   {
+     \setbeamercolor{background canvas}{bg=}
+     \setbeamertemplate{background}{}
+     \includepdf[pages={4,11,12,13}]{webgraphs_compression.pdf}
+     \addtocounter{framenumber}{4}
+   }
+   #+END_EXPORT
+
+** Corpus
+*** Nodes
+    :PROPERTIES:
+    :BEAMER_env: block
+    :BEAMER_col: 0.5
+    :END:
+    | *Node type* | *N. of nodes* |
+    |-------------+---------------|
+    | origins     | 88 M          |
+    | snapshots   | 57 M          |
+    | releases    | 9.9 M         |
+    | revisions   | 1.1 B         |
+    | directories | 4.9 B         |
+    | contents    | 5.5 B         |
+    |-------------+---------------|
+    | Total nodes | 12 B          |
+*** Edges
+    :PROPERTIES:
+    :BEAMER_env: block
+    :BEAMER_col: 0.5
+    :END:
+    | *Edge type*           | *N. of edges* |
+    |-----------------------+---------------|
+    | origin → snapshot     | 195 M         |
+    | snapshot → revision   | 616 M         |
+    | snapshot → release    | 215 M         |
+    | release → revision    | 9.9 M         |
+    | revision → revision   | 1.2 B         |
+    | revision → directory  | 1.1 B         |
+    | directory → directory | 48 B          |
+    | directory → revision  | 482 M         |
+    | directory → content   | 112 B         |
+    |-----------------------+---------------|
+    | Total edges           | 165 B         |
+***
+    Archive snapshot 2018-09-25, from the Software Heritage graph dataset.\\
+    Growth rate: exponential, doubling every 22-30 months.
+
+** Graph compression pipeline
+   #+BEAMER: \hspace*{-0.1\linewidth} \includegraphics[width=1.2\linewidth]{compression/compression-steps}
+   #+BEAMER: \vspace{-1cm}
+***
+    - *MPH*: minimal perfect hash, mapping Merkle IDs to 0..N-1 integers
+    - *BV compress*: Boldi-Vigna compression (based on the MPH order)
+    - *BFS*: breadth-first visit to renumber nodes
+    - *Permute*: update the BV compression according to the BFS order
+*** (Re)establishing locality
+    - the key to good compression is a node ordering that ensures locality
+      and similarity
+    - which is very much /not/ the case with Merkle IDs...
+    - ...but is the case /again/ after the BFS renumbering
+
+** Compression experiment
+   | *Step*      | *Wall time* (hours) |
+   |-------------+---------------------|
+   | MPH         | 2                   |
+   | BV compress | 84                  |
+   | BFS         | 19                  |
+   | Permute     | 18                  |
+   | Transpose   | 15                  |
+   |-------------+---------------------|
+   | Total       | 138 (~6 days)       |
+
+   - server equipped with 24 CPUs and 750 GB of RAM
+   - RAM mostly used as I/O cache for the BFS step
+   - /minimum/ memory requirements are close to the RAM needed to load the
+     final compressed graph in memory
+
+** Compression efficiency (space)
+***
+    :PROPERTIES:
+    :BEAMER_env: block
+    :BEAMER_col: 0.4
+    :END:
+    | *Forward graph*   |        |
+    | total size        | 91 GiB |
+    | bits per edge     | 4.91   |
+    | compression ratio | 15.8%  |
+***
+    :PROPERTIES:
+    :BEAMER_env: block
+    :BEAMER_col: 0.4
+    :END:
+    | *Backward graph*  |        |
+    | total size        | 83 GiB |
+    | bits per edge     | 4.49   |
+    | compression ratio | 14.4%  |
+*** Operation cost
+    The structure of the full bidirectional archive graph fits in less than
+    200 GiB of RAM, for a hardware cost of ~300 USD.
+
+** Compression efficiency (time)
+*** Benchmark --- Full BFS visit
+    #+BEAMER: \begin{columns}\begin{column}{0.45\textwidth}
+    | *Forward graph* |                |
+    |-----------------+----------------|
+    | wall time       | 1h48m          |
+    | throughput      | 1.81 M nodes/s |
+    |                 | (553 ns/node)  |
+    #+BEAMER: \end{column}\begin{column}{0.45\textwidth}
+    | *Backward graph* |                |
+    |------------------+----------------|
+    | wall time        | 3h17m          |
+    | throughput       | 988 k nodes/s  |
+    |                  | (1.01 µs/node) |
+    #+BEAMER: \end{column}\end{columns}
+*** Benchmark --- Edge lookup
+    random sample: 1 B nodes (8.3% of the entire graph)
+    #+BEAMER: \begin{columns}\begin{column}{0.45\textwidth}
+    | *Forward graph* |                |
+    |-----------------+----------------|
+    | visited edges   | 13.6 B         |
+    | throughput      | 12.0 M edges/s |
+    |                 | (83 ns/edge)   |
+    #+BEAMER: \end{column}\begin{column}{0.45\textwidth}
+    | *Backward graph* |                |
+    |------------------+----------------|
+    | visited edges    | 13.6 B         |
+    | throughput       | 9.45 M edges/s |
+    |                  | (106 ns/edge)  |
+    #+BEAMER: \end{column}\end{columns}
+***
+    :PROPERTIES:
+    :BEAMER_env: ignoreheading
+    :END:
+    Note how the edge lookup time is close to DRAM random access time (50-60 ns).
+
+** Discussion
+*** Incrementality
+    compression is *not incremental*, due to the use of contiguous integer
+    ranges
+    - but the graph is append-only, so...
+    - ...based on the expected graph growth rate it should be possible to
+      pre-allocate enough free space in the integer ranges to support
+      *amortized incrementality* (future work)
+    #+BEAMER: \pause
+*** In-memory v. on-disk
+    the compressed in-memory graph structure has *no attributes*
+    - the usual design is to exploit the 0..N-1 integer range to *memory-map
+      node attributes* stored on disk for efficient access
+    - works well for queries that do graph traversal first and "join" node
+      attributes last; ping-ponging between the two is expensive
+    - edge attributes are more problematic
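
To make the locality argument from the compression pipeline slides concrete, here is a minimal Python sketch. It is an illustration only, not the WebGraph/swh-graph code: it gap-encodes sorted successor lists and counts the bits an Elias gamma code would spend, showing why Merkle-like (effectively random) node IDs compress poorly while BFS-renumbered IDs compress well. The node IDs and toy graphs below are made up.

#+BEGIN_SRC python
# Illustration only (not the WebGraph / swh-graph implementation).
# Successor lists are sorted and stored as gaps (deltas); small gaps cost
# few bits under a variable-length code (Elias gamma is used here).

def gamma_bits(n):
    """Bits needed to write the positive integer n in Elias gamma coding."""
    return 2 * n.bit_length() - 1

def encoded_bits(adjacency):
    """Total bits to gap-encode every successor list of the toy graph."""
    total = 0
    for node, successors in adjacency.items():
        prev = node
        for succ in sorted(successors):
            gap = abs(succ - prev) + 1  # +1 keeps the gap strictly positive
            total += gamma_bits(gap)
            prev = succ
    return total

# The same tiny graph under two numberings (all IDs are made up).
# Merkle-like IDs scatter neighbours far away from their source node...
merkle_like = {0: [7_104_233, 9_550_012], 1: [3_377_190, 8_420_551]}
# ...while after a BFS renumbering neighbours tend to be nearby.
bfs_order = {0: [1, 2], 3: [4, 5]}

print(encoded_bits(merkle_like))  # 176 bits for 4 edges
print(encoded_bits(bfs_order))    # 12 bits for 4 edges
#+END_SRC

Actual BV compression layers reference copying and interval encoding on top of plain gap encoding, but the effect of the node ordering is the same.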
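
The "memory-map node attributes" point from the discussion slide can be sketched in the same spirit. The snippet below is a hypothetical NumPy illustration, not the actual swh-graph attribute layout: the file name, attribute, and sizes are invented; the only assumption taken from the slides is that nodes are numbered 0..N-1, so a fixed-size per-node attribute fits in a flat file indexed by node ID.

#+BEGIN_SRC python
# Hypothetical sketch: per-node attributes memory-mapped by node ID.
import numpy as np

N = 1_000_000  # number of nodes (toy value)

# One-time preparation: write one attribute value per node ID, in ID order.
timestamps = np.memmap("commit_timestamps.bin", dtype=np.int64,
                       mode="w+", shape=(N,))
timestamps[:] = 0          # would be filled from archive metadata
timestamps.flush()

# Query time: traverse the compressed graph first (collecting node IDs),
# then "join" the attribute values last, one mmap read per visited node.
attr = np.memmap("commit_timestamps.bin", dtype=np.int64,
                 mode="r", shape=(N,))
visited_nodes = [42, 1_337, 999_999]   # stand-in for a BFS/DFS result
latest = max(int(attr[v]) for v in visited_nodes)
#+END_SRC

This follows the "traverse first, join attributes last" pattern: attribute reads happen in one final pass over the memory-mapped file, instead of interleaving traversal and attribute lookups for every node, which is the expensive ping-pong mentioned on the slide.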