diff --git a/common/modules/graph-compression.org b/common/modules/graph-compression.org
index 9e3bafa..c59fd5e 100644
--- a/common/modules/graph-compression.org
+++ b/common/modules/graph-compression.org
@@ -1,274 +1,279 @@
 #+COLUMNS: %40ITEM %10BEAMER_env(Env) %9BEAMER_envargs(Env Args) %10BEAMER_act(Act) %4BEAMER_col(Col) %10BEAMER_extra(Extra) %8BEAMER_opt(Opt)
 #+INCLUDE: "prelude.org" :minlevel 1

 # Depends: \usepackage{pdfpages}

 * Graph Compression
   :PROPERTIES:
   :CUSTOM_ID: main
   :END:

 ** Graph compression on the Software Heritage archive
 ***
 #+BEGIN_EXPORT latex
 \vspace{-3mm}
 \begin{thebibliography}{Foo Bar, 1969}
 \bibitem{Boldi2020}
 Paolo Boldi, Antoine Pietri, Sebastiano Vigna, Stefano Zacchiroli
 \newblock Ultra-Large-Scale Repository Analysis via Graph Compression
 \newblock SANER 2020, 27th Intl. Conf. on Software Analysis, Evolution and Reengineering. IEEE
 \end{thebibliography}
 #+END_EXPORT
 *** Research question
-Is it possible to efficiently perform software development history analyses
-at ultra large scale (= the scale of Software Heritage archive or more), on
-a single, relatively cheap machine?
+Is it possible to efficiently perform software development history analyses
+at ultra-large scale (= the scale of the Software Heritage archive or more)
+on a single, relatively cheap machine?
 *** Idea
 Apply state-of-the-art graph compression techniques from the field of Web
 graph / social network analysis.

 ** Background --- (Web) graph compression
 *** The graph of the Web
     :PROPERTIES:
     :BEAMER_env: definition
     :END:
 Directed graph that has Web pages as nodes and hyperlinks between them as
 edges.
 *** Properties (1)
-- **Locality:** pages links to pages whose URL is lexicographically
+- **Locality:** pages link to pages whose URLs are lexicographically
   similar. URLs share long common prefixes.
 → use *D-gap compression*
 *** Adjacency lists
     :PROPERTIES:
     :BEAMER_env: block
     :BEAMER_col: 0.51
     :END:
 #+BEAMER: \scriptsize
 | *Node* | *Outdegree* | *Successors*                         |
 |--------+-------------+--------------------------------------|
 | ...    | ...         | ...                                  |
 | 15     | 11          | 13,15,16,17,18,19,23,24,203,315,1034 |
 | 16     | 10          | 15,16,17,22,23,24,315,316,317,3041   |
 | 17     | 0           |                                      |
 | 18     | 5           | 13,15,16,17,50                       |
 | ...    | ...         | ...                                  |
 *** D-gapped adjacency lists
     :PROPERTIES:
     :BEAMER_env: block
     :BEAMER_col: 0.48
     :END:
 #+BEAMER: \scriptsize
 | *Node* | *Outdegree* | *Successors*                |
 |--------+-------------+-----------------------------|
 | ...    | ...         | ...                         |
 | 15     | 11          | 3,1,0,0,0,0,3,0,178,111,718 |
 | 16     | 10          | 1,0,0,4,0,0,290,0,0,2723    |
 | 17     | 0           |                             |
 | 18     | 5           | 9,1,0,0,32                  |
 | ...    | ...         | ...                         |

 ** Background --- (Web) graph compression (cont.)
 *** The graph of the Web
     :PROPERTIES:
     :BEAMER_env: definition
     :END:
 Directed graph that has Web pages as nodes and hyperlinks between them as
 edges.
 *** Properties (2)
 - **Similarity:** pages that are close together in lexicographic order
   tend to have many common successors.
 → use *reference compression*
 *** Adjacency lists
     :PROPERTIES:
     :BEAMER_env: block
     :BEAMER_col: 0.47
     :END:
 #+BEAMER: \scriptsize
 | *Node* | *Outd.* | *Successors*                         |
 |--------+---------+--------------------------------------|
 | ...    | ...     | ...                                  |
 | 15     | 11      | 13,15,16,17,18,19,23,24,203,315,1034 |
 | 16     | 10      | 15,16,17,22,23,24,315,316,317,3041   |
 | 17     | 0       |                                      |
 | 18     | 5       | 13,15,16,17,50                       |
 | ...    | ...     | ...                                  |
 *** Copy lists
     :PROPERTIES:
     :BEAMER_env: block
     :BEAMER_col: 0.60
     :END:
 #+BEAMER: \scriptsize
 | *Node* | *Ref.* | *Copy list* | *Extra nodes*                        |
 |--------+--------+-------------+--------------------------------------|
 | ...    | ...    | ...         | ...                                  |
 | 15     | 0      |             | 13,15,16,17,18,19,23,24,203,315,1034 |
 | 16     | 1      | 01110011010 | 22,316,317,3041                      |
 | 17     |        |             |                                      |
 | 18     | 3      | 11110000000 | 50                                   |
 | ...    | ...    | ...         | ...                                  |
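+
+** D-gap and copy-list decoding (sketch)                          :noexport:
+A minimal Python sketch of how the two encodings above are decoded,
+mirroring the toy tables on these slides. Function names and the
+first-gap integer mapping are illustrative assumptions, not WebGraph's
+actual code.
+#+BEGIN_SRC python
+def nat2int(n):
+    """Invert the natural -> integer mapping assumed for the first gap
+    (even values -> non-negative, odd values -> negative)."""
+    return n // 2 if n % 2 == 0 else -(n + 1) // 2
+
+def decode_dgaps(node, gaps):
+    """Rebuild a sorted successor list from its D-gap encoding."""
+    if not gaps:
+        return []
+    succ = [node + nat2int(gaps[0])]   # first gap is relative to the node
+    for g in gaps[1:]:
+        succ.append(succ[-1] + g + 1)  # stored gaps omit the implicit +1
+    return succ
+
+def decode_copy_list(ref_succ, copy_bits, extra):
+    """Merge successors copied from the reference list with extra nodes."""
+    copied = [s for s, bit in zip(ref_succ, copy_bits) if bit == "1"]
+    return sorted(copied + extra)
+
+# Node 15: the D-gaps from the slide decode back to its adjacency list.
+succ15 = decode_dgaps(15, [3, 1, 0, 0, 0, 0, 3, 0, 178, 111, 718])
+assert succ15 == [13, 15, 16, 17, 18, 19, 23, 24, 203, 315, 1034]
+
+# Node 16 copies from node 15 (reference = 1) via copy list 01110011010.
+succ16 = decode_copy_list(succ15, "01110011010", [22, 316, 317, 3041])
+assert succ16 == [15, 16, 17, 22, 23, 24, 315, 316, 317, 3041]
+#+END_SRC
+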
 ** Background --- Web graph compression (OLD)                     :noexport:
 Borrowing (great!) slides from:
 #+BEGIN_EXPORT latex
 \begin{thebibliography}{}
 \bibitem{Pibiri2018}
 Giulio Ermanno Pibiri
 \newblock Effective Web Graph Representations, 2018
 \newblock \url{http://pages.di.unipi.it/pibiri/slides/webgraphs\_compression.pdf}
 \end{thebibliography}
 #+END_EXPORT

 ** Background -- Web graph compression (imported slides) (OLD)    :noexport:
    :PROPERTIES:
    :BEAMER_env: ignoreheading
    :END:
 #+BEGIN_EXPORT latex
 {
   \setbeamercolor{background canvas}{bg=}
   \setbeamertemplate{background}{}
   \includepdf[pages={4,11,12,13}]{webgraphs_compression.pdf}
   \addtocounter{framenumber}{4}
 }
 #+END_EXPORT

 ** Corpus
 *** Nodes
     :PROPERTIES:
     :BEAMER_env: block
     :BEAMER_col: 0.5
     :END:
 | *Node type* | *N. of nodes* |
 |-------------+---------------|
 | origins     | 88 M          |
 | snapshots   | 57 M          |
 | releases    | 9.9 M         |
 | revisions   | 1.1 B         |
 | directories | 4.9 B         |
 | contents    | 5.5 B         |
 |-------------+---------------|
 | Total nodes | 12 B          |
 *** Edges
     :PROPERTIES:
     :BEAMER_env: block
     :BEAMER_col: 0.5
     :END:
 | *Edge type*           | *N. of edges* |
 |-----------------------+---------------|
 | origin → snapshot     | 195 M         |
 | snapshot → revision   | 616 M         |
 | snapshot → release    | 215 M         |
 | release → revision    | 9.9 M         |
 | revision → revision   | 1.2 B         |
 | revision → directory  | 1.1 B         |
 | directory → directory | 48 B          |
-| directory → revisiony | 482 M         |
+| directory → revision  | 482 M         |
 | directory → content   | 112 B         |
 |-----------------------+---------------|
 | Total edges           | 165 B         |
 ***
+    :PROPERTIES:
+    :BEAMER_env: ignoreheading
+    :END:
 Archive snapshot 2018-09-25, from the Software Heritage graph dataset.\\
-Growth rate: exponential, doubling every 22-30 months.
+Growth rate: exponential, doubling every 22-30 months (Rousseau, Di Cosmo,
+Zacchiroli; ESE 2020, to appear).

 ** Graph compression pipeline
 #+BEAMER: \hspace*{-0.1\linewidth} \includegraphics[width=1.2\linewidth]{compression/compression-steps}
 #+BEAMER: \vspace{-1cm}
 ***
 - *MPH*: minimal perfect hash, mapping Merkle IDs to 0..N-1 integers
 - *BV* compress: Boldi-Vigna compression (based on MPH order)
 - *BFS*: breadth-first visit to renumber
 - *Permute*: update BV compression according to BFS order
 *** (Re)establishing locality
-- key for good compression is a node ordering that ensures locality and
-  similarity
+- the key to good compression is a node ordering that ensures locality and
+  similarity
-  - which is very much /not/ the case with Merkle IDs...
-  - ...but is the case /again/ after BFS
+  - which is very much /not/ the case with Merkle IDs, but is the case
+    /again/ after BFS reordering
+# A toy sketch of this BFS renumbering appears in a noexport section below.

 ** Compression experiment
 | *Step*      | *Wall time* (hours) |
 |-------------+---------------------|
 | MPH         | 2                   |
 | BV Compress | 84                  |
 | BFS         | 19                  |
 | Permute     | 18                  |
 | Transpose   | 15                  |
 |-------------+---------------------|
 | Total       | 138 (6 days)        |
 - server equipped with 24 CPUs and 750 GB of RAM
 - RAM mostly used as I/O cache for the BFS step
 - /minimum/ memory requirements are close to the RAM needed to load the
   final compressed graph in memory

 ** Compression efficiency (space)
 ***
     :PROPERTIES:
     :BEAMER_env: block
     :BEAMER_col: 0.4
     :END:
 | *Forward graph*   |        |
 | total size        | 91 GiB |
 | bits per edge     | 4.91   |
 | compression ratio | 15.8%  |
 ***
     :PROPERTIES:
     :BEAMER_env: block
     :BEAMER_col: 0.4
     :END:
 | *Backward graph*  |        |
 | total size        | 83 GiB |
 | bits per edge     | 4.49   |
 | compression ratio | 14.4%  |
-*** Operation cost
+*** Operating cost
 The structure of a full bidirectional archive graph fits in less than
 200 GiB of RAM, for a hardware cost of ~300 USD.
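+
+** BFS renumbering (sketch)                                       :noexport:
+A toy illustration of the BFS and Permute steps of the pipeline above:
+renumber nodes by BFS visit rank, then rewrite the adjacency lists under
+the resulting permutation. All names are stand-ins for illustration, not
+the actual swh-graph / WebGraph code.
+#+BEGIN_SRC python
+from collections import deque
+
+def bfs_order(successors):
+    """Permutation old_id -> new_id given by BFS visit rank, so that
+    neighbouring nodes get nearby IDs (the locality destroyed by
+    Merkle hashing)."""
+    perm = [-1] * len(successors)
+    rank = 0
+    for root in range(len(successors)):  # restart to cover all components
+        if perm[root] != -1:
+            continue
+        perm[root] = rank
+        rank += 1
+        queue = deque([root])
+        while queue:
+            u = queue.popleft()
+            for v in successors[u]:
+                if perm[v] == -1:
+                    perm[v] = rank
+                    rank += 1
+                    queue.append(v)
+    return perm
+
+def renumber(successors, perm):
+    """Rewrite adjacency lists under the permutation (the Permute step)."""
+    out = [[] for _ in successors]
+    for u, succ in enumerate(successors):
+        out[perm[u]] = sorted(perm[v] for v in succ)
+    return out
+
+# A scattered, hash-like numbering becomes contiguous after BFS,
+# shrinking the D-gaps that the BV encoding then compresses.
+g = [[7, 9], [], [9], [1], [], [4], [5], [2], [6], [0]]
+print(renumber(g, bfs_order(g)))
+#+END_SRC
+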
 ** Compression efficiency (time)
-*** Benchmark --- Full BFS visit
+*** Benchmark --- Full BFS visit (single thread)
 #+BEAMER: \begin{columns}\begin{column}{0.45\textwidth}
 | *Forward graph* |                |
 |-----------------+----------------|
 | wall time       | 1h48m          |
 | throughput      | 1.81 M nodes/s |
 |                 | (553 ns/node)  |
 #+BEAMER: \end{column}\begin{column}{0.45\textwidth}
 | *Backward graph* |                |
 |------------------+----------------|
 | wall time        | 3h17m          |
-| throughput       | 988 M nodes/s  |
+| throughput       | 988 k nodes/s  |
 |                  | (1.01 µs/node) |
 #+BEAMER: \end{column}\end{columns}
 *** Benchmark --- Edge lookup
-random sample: 1 B nodes (8.3% of entire graph)
+random sample: 1 B nodes (8.3% of the entire graph), followed by
+enumeration of all their successors
 #+BEAMER: \begin{columns}\begin{column}{0.45\textwidth}
 | *Forward graph* |                |
 |-----------------+----------------|
 | visited edges   | 13.6 B         |
 | throughput      | 12.0 M edges/s |
 |                 | (83 ns/edge)   |
 #+BEAMER: \end{column}\begin{column}{0.45\textwidth}
 | *Backward graph* |                |
 |------------------+----------------|
 | visited edges    | 13.6 B         |
 | throughput       | 9.45 M edges/s |
 |                  | (106 ns/edge)  |
 #+BEAMER: \end{column}\end{columns}
 ***
    :PROPERTIES:
    :BEAMER_env: ignoreheading
    :END:
 Note how edge lookup time is close to DRAM random access time (50-60 ns).

 ** Discussion
 *** Incrementality
 compression is *not incremental*, due to the use of contiguous integer
 ranges
 - but the graph is append-only, so...
 - ...based on expected graph growth rate it should be possible to
   pre-allocate enough free space in the integer ranges to support
   *amortized incrementality* (future work)
 #+BEAMER: \pause
 *** In-memory v. on-disk
 the compressed in-memory graph structure has *no attributes*
 - usual design is to exploit the 0..N-1 integer ranges to *memory map
   node attributes* to disk for efficient access
-- works well for queries that does graph traversal first and "join" node
+- works well for queries that do graph traversal first and "join" node
   attributes last; ping-pong between the two is expensive
 - edge attributes are more problematic
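+
+** Memory-mapped node attributes (sketch)                         :noexport:
+A minimal sketch of the attribute layout discussed on the last slide:
+node attributes stored in flat files indexed by the compact 0..N-1 node
+IDs and memory-mapped from disk. The file name, the attribute, and the
+use of numpy are illustrative assumptions, not the project's actual
+tooling.
+#+BEGIN_SRC python
+import numpy as np
+
+N = 1_000                        # number of nodes (12 B in the real graph)
+
+# One fixed-width record per node: row i holds the attribute of node i,
+# so a node ID obtained during traversal doubles as a file offset.
+dates = np.memmap("revision_dates.bin", dtype=np.int64,
+                  mode="w+", shape=(N,))
+dates[:] = 0                     # populated once, offline
+dates[42] = 1_537_833_600        # hypothetical commit timestamp of node 42
+
+# Traversal first, attribute "join" last: collect node IDs during the
+# graph visit, then resolve their attributes with O(1) array lookups.
+visited = [42, 7, 99]
+print([int(dates[v]) for v in visited])
+#+END_SRC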