diff --git a/common/images/world-169.png b/common/images/world-169.png new file mode 100644 index 0000000..0f87a84 Binary files /dev/null and b/common/images/world-169.png differ diff --git a/talks-public/2020-02-19-saner-graph/2020-02-19-saner-graph.org b/talks-public/2020-02-19-saner-graph/2020-02-19-saner-graph.org new file mode 100644 index 0000000..d1a04cc --- /dev/null +++ b/talks-public/2020-02-19-saner-graph/2020-02-19-saner-graph.org @@ -0,0 +1,398 @@ +#+COLUMNS: %40ITEM %10BEAMER_env(Env) %9BEAMER_envargs(Env Args) %10BEAMER_act(Act) %4BEAMER_col(Col) %10BEAMER_extra(Extra) %8BEAMER_opt(Opt) +#+TITLE: Ultra-Large-Scale Repository Analysis via Graph Compression +#+BEAMER_HEADER: \date[19 Feb 2020, SANER]{19 February 2020\\SANER 2020: 27th Intl. Conf. on Software Analysis, Evolution and Reengineering\\London, ON, Canada} +#+AUTHOR: Stefano Zacchiroli +#+DATE: 19 February 2020 +#+EMAIL: zack@irif.fr + +#+INCLUDE: "../../common/modules/prelude.org" :minlevel 1 +#+INCLUDE: "../../common/modules/169.org" +#+BEAMER_HEADER: \pgfdeclareimage[height=90mm,width=160mm]{bgd}{world-169.png} +#+BEAMER_HEADER: \titlegraphic{} +#+BEAMER_HEADER: \institute[U-Paris \& Inria]{Université de Paris \& Inria, France} +#+BEAMER_HEADER: \author[Stefano Zacchiroli]{Stefano Zacchiroli\\ {\small\tt zack@irif.fr\\ @zacchiro}} +#+BEAMER_HEADER: \title[Large scale VCS analysis via graph compression]{Ultra-Large-Scale Repository Analysis via Graph Compression} + +* Introduction +** Motivations + - Free/Open Source Software (*FOSS*) + *social coding* (GitHub, GitLab, ...) + #+BEAMER: \linebreak + = massive amount of *data for empirical software engineering* (ESE) + - software evolution and clone detection have vastly benefited from it + + #+BEAMER: \vfill \pause +*** An ESE growth crisis? + - GitHub alone: ~100 M repositories + - exponential growth rate, doubling every ~2 years (Rousseau et al., 2009) + - possibly the tip of the iceberg w.r.t. 
the rise of distributed forges and
+      non-public collaborative development (cf. /inner source/)
+
+    #+BEAMER: \vfill \pause
+*** Current mitigation approaches
+    - *scale-out analysis:* not always applicable, expensive
+    - *sampling:* (e.g., top-starred repos) prone to selection bias and
+      external validity issues
+
+** Research question
+***
+    #+BEAMER: \Large
+    /Is it possible to efficiently perform software development history
+    analyses at ultra large scale, on a single, relatively cheap machine?/
+*** Details
+    :PROPERTIES:
+    :BEAMER_env: ignoreheading
+    :END:
+    #+BEAMER: \vfill \pause
+    - *development history:* all information captured by state-of-the-art
+      Version Control Systems (VCS)
+    #+BEAMER: \pause
+    - *cheap machine:* commodity hardware, desktop- or server-grade, few kUSD
+      of investment
+    #+BEAMER: \pause
+    - *ultra large scale:* in the ballpark of (the known extent of) all
+      publicly available software source code
+
+** Corpus --- Software Heritage
+    - our proxy for publicly available software:
+    #+BEAMER: \vspace{-3mm}
+    #+BEAMER: \begin{center}\includegraphics[width=0.6\textwidth]{SWH-logo+motto}\end{center}
+    #+BEAMER: \vspace{-3mm}
+    - both *source code* and its *development history* as captured by VCS
+    - coverage:
+      - all public repositories from GitHub and GitLab.com
+      - historical forges: Google Code, Gitorious
+      - package manager repositories: NPM, PyPI, Debian
+    - 90 M repositories, 5.5 B unique files, 1.1 B unique revisions (data dump:
+      2018-09-25)
+    - available as offline dataset
+    #+BEGIN_EXPORT latex
+      \begin{thebibliography}{Foo Bar, 1969}
+      \footnotesize
+      \bibitem{Pietri2019} Antoine Pietri, Diomidis Spinellis, Stefano Zacchiroli\newblock
+      The Software Heritage Graph Dataset: Public software development under one roof\newblock
+      MSR 2019: 16th Intl. Conf. on Mining Software Repositories.
IEEE
+      \end{thebibliography}
+    #+END_EXPORT
+    # \newblock preprint: \url{http://deb.li/swhmsr19}
+
+* Background
+** (Web) graph compression
+*** The graph of the Web
+    :PROPERTIES:
+    :BEAMER_env: definition
+    :END:
+    Directed graph that has Web pages as nodes and hyperlinks between them as
+    edges.
+*** Properties (1)
+    - **Locality:** pages link to pages whose URLs are lexicographically
+      similar. URLs share long common prefixes.
+
+      → use *D-gap compression*
+*** Adjacency lists
+    :PROPERTIES:
+    :BEAMER_env: block
+    :BEAMER_col: 0.5
+    :END:
+    #+BEAMER: \scriptsize
+    | *Node* | *Outdegree* | *Successors*                         |
+    |--------+-------------+--------------------------------------|
+    | ...    | ...         | ...                                  |
+    | 15     | 11          | 13,15,16,17,18,19,23,24,203,315,1034 |
+    | 16     | 10          | 15,16,17,22,23,24,315,316,317,3041   |
+    | 17     | 0           |                                      |
+    | 18     | 5           | 13,15,16,17,50                       |
+    | ...    | ...         | ...                                  |
+*** D-gapped adjacency lists
+    :PROPERTIES:
+    :BEAMER_env: block
+    :BEAMER_col: 0.5
+    :END:
+    #+BEAMER: \scriptsize
+    | *Node* | *Outdegree* | *Successors*                |
+    |--------+-------------+-----------------------------|
+    | ...    | ...         | ...                         |
+    | 15     | 11          | 3,1,0,0,0,0,3,0,178,111,718 |
+    | 16     | 10          | 1,0,0,4,0,0,290,0,0,2723    |
+    | 17     | 0           |                             |
+    | 18     | 5           | 9,1,0,0,32                  |
+    | ...    | ...         | ...                         |
+
+** (Web) graph compression (cont.)
+*** The graph of the Web
+    :PROPERTIES:
+    :BEAMER_env: definition
+    :END:
+    Directed graph that has Web pages as nodes and hyperlinks between them as
+    edges.
+*** Properties (2)
+    - **Similarity:** pages that are close together in lexicographic order tend
+      to have many common successors.
+      → use *reference compression*
+*** Adjacency lists
+    :PROPERTIES:
+    :BEAMER_env: block
+    :BEAMER_col: 0.5
+    :END:
+    #+BEAMER: \scriptsize
+    | *Node* | *Outd.* | *Successors*                         |
+    |--------+---------+--------------------------------------|
+    | ...    | ...     | ...
| + | 15 | 11 | 13,15,16,17,18,19,23,24,203,315,1034 | + | 16 | 10 | 15,16,17,22,23,24,315,316,317,3041 | + | 17 | 0 | | + | 18 | 5 | 13,15,16,17,50 | + | ... | ... | ... | +*** Copy lists + :PROPERTIES: + :BEAMER_env: block + :BEAMER_col: 0.55 + :END: + #+BEAMER: \scriptsize + | *Node* | *Ref.* | *Copy list* | *Extra nodes* | + |--------+--------+-------------+--------------------------------------| + | ... | ... | ... | ... | + | 15 | 0 | | 13,15,16,17,18,19,23,24,203,315,1034 | + | 16 | 1 | 01110011010 | 22,316,317,3041 | + | 17 | | | | + | 18 | 3 | 11110000000 | 50 | + | ... | ... | ... | | + +* Data model +** Data model + #+BEAMER: \centering \includegraphics[width=\textwidth]{swh-data-model-h} + +** Data model --- example :noexport: + #+BEAMER: \centering + #+BEAMER: \only<1>{\includegraphics[width=\textwidth]{swh-merkle-dag-small-visit1}} + #+BEAMER: \only<2>{\includegraphics[width=\textwidth]{swh-merkle-dag-small-visit2}} + #+BEAMER: \only<3>{\includegraphics[width=\textwidth]{swh-merkle-dag-small}} + +* VCS compression +** Corpus --- as a graph +*** Nodes + :PROPERTIES: + :BEAMER_env: block + :BEAMER_col: 0.5 + :END: + | *Node type* | *N. of nodes* | + |-------------+---------------| + | origins | 88 M | + | snapshots | 57 M | + | releases | 9.9 M | + | revisions | 1.1 B | + | directories | 4.9 B | + | contents | 5.5 B | + |-------------+---------------| + | Total nodes | 12 B | +*** Edges + :PROPERTIES: + :BEAMER_env: block + :BEAMER_col: 0.5 + :END: + | *Edge type* | *N. 
of edges* |
+    |-----------------------+---------------|
+    | origin → snapshot     | 195 M         |
+    | snapshot → revision   | 616 M         |
+    | snapshot → release    | 215 M         |
+    | release → revision    | 9.9 M         |
+    | revision → revision   | 1.2 B         |
+    | revision → directory  | 1.1 B         |
+    | directory → directory | 48 B          |
+    | directory → revision  | 482 M         |
+    | directory → content   | 112 B         |
+    |-----------------------+---------------|
+    | Total edges           | 165 B         |
+***
+    Archive snapshot 2018-09-25, from the Software Heritage graph dataset.\\
+    Growth rate: exponential, doubling every 22-30 months.
+
+** Compression pipeline
+*** Pipeline
+    :PROPERTIES:
+    :BEAMER_env: ignoreheading
+    :END:
+    #+BEAMER: \vspace{-8mm}
+    #+BEAMER: \hspace*{-0.1\linewidth} \includegraphics[width=1.2\linewidth]{compression/compression-steps}
+    #+BEAMER: \vspace{-1.2cm}
+    - *MPH:* minimal perfect hash, mapping Merkle IDs to 0..N-1 integers
+    - *BV compress:* Boldi-Vigna compression (based on MPH order)
+    - *BFS:* breadth-first visit to renumber
+    - *Permute:* update BV compression according to BFS order
+    #+BEAMER: \vfill
+*** (Re)establishing locality
+    - key for good compression is a node ordering that ensures locality and
+      similarity
+    - which is very much /not/ the case with Merkle IDs...
+    - ...but is the case /again/ after BFS
+
+** Compression time
+    We ran the compression pipeline on the input corpus using the WebGraph
+    framework
+    #+BEGIN_EXPORT latex
+      \begin{thebibliography}{Foo Bar, 1969}
+      \footnotesize
+      \bibitem{BoVWFI} Paolo Boldi and Sebastiano Vigna.
+      \newblock The WebGraph framework I: Compression techniques
+      \newblock WWW 2004: 13th Intl. World Wide Web Conference.
ACM + \end{thebibliography} + #+END_EXPORT + + | *Step* | *Wall time* (hours) | + |-------------+---------------------| + | MPH | 2 | + | BV Compress | 84 | + | BFS | 19 | + | Permute | 18 | + | Transpose | 15 | + |-------------+---------------------| + | Total | 138 (6 days) | + + - server equipped with 24 CPUs and 750 GB of RAM + - RAM mostly used as I/O cache for the BFS step + - /minimum/ memory requirements are close to the RAM needed to load the + final compressed graph in memory + +** Compression efficiency +*** Forward graph + :PROPERTIES: + :BEAMER_env: block + :BEAMER_col: 0.4 + :END: + | total size | 91 GiB | + | bits per edge | 4.91 | + # | compression ratio | 15.8% | +*** Backward graph + :PROPERTIES: + :BEAMER_env: block + :BEAMER_col: 0.4 + :END: + | total size | 83 GiB | + | bits per edge | 4.49 | + # | compression ratio | 14.4% | + #+BEAMER: \vfill +*** Operation cost + The structure of a full bidirectional archive graph fits in less than 200 + GiB of RAM, for a hardware cost of ~300 USD. 
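As an aside, the D-gap encoding illustrated in the background tables can be reproduced in a few lines. A minimal Python sketch (illustration only, not the framework's actual code), assuming WebGraph's convention of encoding the first successor as a sign-folded gap from the node id and each later successor as the difference from its predecessor minus one:

```python
def zigzag(v):
    # Fold a possibly negative gap into a non-negative integer:
    # 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, 2 -> 4, ...
    return 2 * v if v >= 0 else 2 * -v - 1

def dgap_encode(node, successors):
    """D-gap encode a sorted adjacency list, WebGraph-style: the first
    entry is the gap from the node id itself (zigzag-folded, since it can
    be negative); later entries are consecutive differences minus one."""
    if not successors:
        return []
    gaps = [zigzag(successors[0] - node)]
    gaps += [b - a - 1 for a, b in zip(successors, successors[1:])]
    return gaps

# Node 15 from the adjacency-list table:
print(dgap_encode(15, [13, 15, 16, 17, 18, 19, 23, 24, 203, 315, 1034]))
# → [3, 1, 0, 0, 0, 0, 3, 0, 178, 111, 718]
```

Applied to each row of the adjacency-list table, this reproduces exactly the d-gapped rows shown on the slide; the small non-negative gaps are then cheap to store with instantaneous codes.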
+
+* Exploitation
+** A domain-agnostic benchmark --- full corpus traversal
+*** Benchmark --- Full BFS visit
+    #+BEAMER: \begin{columns}\begin{column}{0.45\textwidth}
+    | *Forward graph*  |                |
+    |------------------+----------------|
+    | wall time        | 1h48m          |
+    | throughput       | 1.81 M nodes/s |
+    |                  | (553 ns/node)  |
+    #+BEAMER: \end{column}\begin{column}{0.45\textwidth}
+    | *Backward graph* |                |
+    |------------------+----------------|
+    | wall time        | 3h17m          |
+    | throughput       | 988 k nodes/s  |
+    |                  | (1.01 µs/node) |
+    #+BEAMER: \end{column}\end{columns}
+*** Benchmark --- Edge lookup
+    random sample: 1 B nodes (8.3% of entire graph)
+    #+BEAMER: \begin{columns}\begin{column}{0.45\textwidth}
+    | *Forward graph*  |                |
+    |------------------+----------------|
+    | visited edges    | 13.6 B         |
+    | throughput       | 12.0 M edges/s |
+    |                  | (83 ns/edge)   |
+    #+BEAMER: \end{column}\begin{column}{0.45\textwidth}
+    | *Backward graph* |                |
+    |------------------+----------------|
+    | visited edges    | 13.6 B         |
+    | throughput       | 9.45 M edges/s |
+    |                  | (106 ns/edge)  |
+    #+BEAMER: \end{column}\end{columns}
+***
+    :PROPERTIES:
+    :BEAMER_env: ignoreheading
+    :END:
+    Note how edge lookup time is close to DRAM random access time (50-60 ns).
+
+** Domain-specific benchmarks --- source code artifact multiplication
+***
+    Simple *clone detection* style experiments carried out by exploiting the
+    compressed corpus:
+    1. *file→commit multiplication:* how often identical source code files
+       re-occur in different commits
+    2.
*commit→origin multiplication:* how often identical commits re-occur in
+       different repositories
+*** Implementation
+    - for each node---content for (1), commit for (2)---*visit* the backward
+      graph and *count* all reachable nodes of the desired type---commit for
+      (1), origin for (2)
+    - naive approach, O(|V|x|E|) complexity
+** File→commit multiplication --- results
+    #+BEAMER: \vspace{-5mm}
+*** Multiplication factor
+    :PROPERTIES:
+    :BEAMER_env: block
+    :BEAMER_col: 0.5
+    :END:
+    #+BEAMER: \centering \includegraphics[width=0.9\textwidth]{this/content-revision-multiplication.png}
+*** Visit size
+    :PROPERTIES:
+    :BEAMER_env: block
+    :BEAMER_col: 0.5
+    :END:
+    #+BEAMER: \centering \includegraphics[width=0.9\textwidth]{this/content-revision-node-size.png}
+*** Benchmark figures
+    :PROPERTIES:
+    :BEAMER_env: ignoreheading
+    :END:
+    #+BEAMER: \vfill
+    - random sample of 953 M contents (17% of the full corpus)
+    - processing time: ~2.5 days (single machine with 20 x 2.4 GHz cores)
+    - /in spite of/ the naive O(|V|x|E|) approach, generally considered
+      intractable at this scale
+* Discussion
+** Limitations
+    #+BEAMER: \vspace{-2mm}
+*** Incrementality
+    - compression is inherently *not incremental*
+    - not an issue for most research use cases, because we analyze immutable
+      data dumps
+    - common workaround (e.g., for the Web and social networks) is to keep an
+      uncompressed *in-memory overlay* for graph updates, and periodically
+      recompress
+    #+BEAMER: \pause
+*** In-memory v.
on-disk
+    - the compressed in-memory graph structure has *no attributes*
+    - usual data design is to exploit the 0..N-1 integer ranges to *memory map
+      /node/ attributes* to secondary storage
+    - we have done this with a *node type map*; it weighs 4 GB (3 bits per
+      node)
+    - works well for queries that do graph traversal first and "join" node
+      attributes last; ping-pong between the two is expensive
+    - /edge/ attributes are more problematic
+* Conclusion
+** Wrapping up
+    #+BEAMER: \vspace{-2mm}
+***
+    - Graph compression is a viable technique to analyze the history of all
+      public source code, as captured by modern version control systems (VCS),
+      on a budget.
+    - It is a novel tool for VCS analyses that may expand the scope of our
+      experiments, reducing selection bias and improving external validity.
+    - More work is needed to make compression incremental and to allow
+      efficient querying of VCS properties during traversal.
+*** See full paper for more details
+    #+BEGIN_EXPORT latex
+      \begin{thebibliography}{Foo Bar, 1969}
+      \small \vspace{-2mm}
+      \bibitem{Boldi2020} Paolo Boldi, Antoine Pietri, Sebastiano Vigna, Stefano Zacchiroli
+      \newblock Ultra-Large-Scale Repository Analysis via Graph Compression
+      \newblock SANER 2020, 27th Intl. Conf. on Software Analysis, Evolution and Reengineering. IEEE
+      \end{thebibliography}
+    preprint: {\small \url{http://bit.ly/swh-graph-saner20}}
+    #+END_EXPORT
+*** Contacts
+    Stefano Zacchiroli / zack@irif.fr / @zacchiro / talk to me at SANER 2020!
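The naive visit-and-count used for the multiplication experiments boils down to a plain BFS over the backward graph. A minimal Python sketch, using a hypothetical dictionary-based adjacency representation (the actual implementation walks the compressed graph through the WebGraph API):

```python
from collections import deque

def count_reachable_of_type(start, succ, node_type, wanted):
    """Count nodes of type `wanted` reachable from `start` via BFS.

    `succ` maps a node id to its successor list -- here the *backward*
    graph, so the "successors" of a file content are the revisions
    pointing at it, and so on up to origins.  `node_type` plays the role
    of the node type map discussed under Limitations."""
    seen = {start}
    queue = deque([start])
    count = 0
    while queue:
        node = queue.popleft()
        if node_type[node] == wanted:
            count += 1
        for nxt in succ.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return count

# Toy backward graph: one file content reached by two revisions,
# which are in turn reached by two origins.
backward = {"file": ["rev1", "rev2"], "rev1": ["orig1"],
            "rev2": ["orig1", "orig2"]}
kind = {"file": "content", "rev1": "revision", "rev2": "revision",
        "orig1": "origin", "orig2": "origin"}
print(count_reachable_of_type("file", backward, kind, "revision"))  # → 2
```

Running this once per content (resp. commit) node gives the file→commit (resp. commit→origin) multiplication factors, which is exactly why the overall cost is on the order of |V| visits.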
+ +* Appendix :B_appendix: + :PROPERTIES: + :BEAMER_env: appendix + :END: diff --git a/talks-public/2020-02-19-saner-graph/Makefile b/talks-public/2020-02-19-saner-graph/Makefile new file mode 100644 index 0000000..68fbee7 --- /dev/null +++ b/talks-public/2020-02-19-saner-graph/Makefile @@ -0,0 +1 @@ +include ../Makefile.slides diff --git a/talks-public/2020-02-19-saner-graph/this/content-revision-multiplication.png b/talks-public/2020-02-19-saner-graph/this/content-revision-multiplication.png new file mode 100644 index 0000000..9d4ecb6 Binary files /dev/null and b/talks-public/2020-02-19-saner-graph/this/content-revision-multiplication.png differ diff --git a/talks-public/2020-02-19-saner-graph/this/content-revision-node-size.png b/talks-public/2020-02-19-saner-graph/this/content-revision-node-size.png new file mode 100644 index 0000000..ba8939b Binary files /dev/null and b/talks-public/2020-02-19-saner-graph/this/content-revision-node-size.png differ diff --git a/talks-public/2020-02-19-saner-graph/this/revision-origin-multiplication.png b/talks-public/2020-02-19-saner-graph/this/revision-origin-multiplication.png new file mode 100644 index 0000000..7d79d31 Binary files /dev/null and b/talks-public/2020-02-19-saner-graph/this/revision-origin-multiplication.png differ diff --git a/talks-public/2020-02-19-saner-graph/this/revision-origin-node-size.png b/talks-public/2020-02-19-saner-graph/this/revision-origin-node-size.png new file mode 100644 index 0000000..f9d533e Binary files /dev/null and b/talks-public/2020-02-19-saner-graph/this/revision-origin-node-size.png differ