diff --git a/talks-public/2020-02-19-saner-graph/2020-02-19-saner-graph.org b/talks-public/2020-02-19-saner-graph/2020-02-19-saner-graph.org
index d1a04cc..b693f68 100644
--- a/talks-public/2020-02-19-saner-graph/2020-02-19-saner-graph.org
+++ b/talks-public/2020-02-19-saner-graph/2020-02-19-saner-graph.org
@@ -1,398 +1,398 @@
#+COLUMNS: %40ITEM %10BEAMER_env(Env) %9BEAMER_envargs(Env Args) %10BEAMER_act(Act) %4BEAMER_col(Col) %10BEAMER_extra(Extra) %8BEAMER_opt(Opt)
#+TITLE: Ultra-Large-Scale Repository Analysis via Graph Compression
#+BEAMER_HEADER: \date[19 Feb 2020, SANER]{19 February 2020\\SANER 2020: 27th Intl. Conf. on Software Analysis, Evolution and Reengineering\\London, ON, Canada}
#+AUTHOR: Stefano Zacchiroli
#+DATE: 19 February 2020
#+EMAIL: zack@irif.fr
#+INCLUDE: "../../common/modules/prelude.org" :minlevel 1
#+INCLUDE: "../../common/modules/169.org"
#+BEAMER_HEADER: \pgfdeclareimage[height=90mm,width=160mm]{bgd}{world-169.png}
#+BEAMER_HEADER: \titlegraphic{}
#+BEAMER_HEADER: \institute[U-Paris \& Inria]{Université de Paris \& Inria, France}
-#+BEAMER_HEADER: \author[Stefano Zacchiroli]{Stefano Zacchiroli\\ {\small\tt zack@irif.fr\\ @zacchiro}}
+#+BEAMER_HEADER: \author[Stefano Zacchiroli]{Stefano Zacchiroli\\ {\small\tt zack@irif.fr\\ @zacchiro}\\ joint work with Paolo Boldi, Antoine Pietri, and Sebastiano Vigna}
#+BEAMER_HEADER: \title[Large scale VCS analysis via graph compression]{Ultra-Large-Scale Repository Analysis via Graph Compression}

* Introduction
** Motivations
- Free/Open Source Software (*FOSS*) + *social coding* (GitHub, GitLab, ...)
  #+BEAMER: \linebreak
  = massive amount of *data for empirical software engineering* (ESE)
- software evolution and clone detection have vastly benefited from it
#+BEAMER: \vfill \pause
*** An ESE growth crisis?
- GitHub alone: ~100 M repositories
- exponential growth rate, doubling every ~2 years (Rousseau et al., 2019)
- possibly just the tip of the iceberg, given the rise of distributed forges and non-public collaborative development (cf. /inner source/)
#+BEAMER: \vfill \pause
*** Current mitigation approaches
- *scale-out analysis:* not always applicable, expensive
- *sampling* (e.g., top-starred repos): prone to selection bias and external validity issues

** Research question
***
#+BEAMER: \Large
/Is it possible to efficiently perform software development history analyses at ultra large scale, on a single, relatively cheap machine?/
*** Details
:PROPERTIES:
:BEAMER_env: ignoreheading
:END:
#+BEAMER: \vfill \pause
- *development history:* all information captured by state-of-the-art Version Control Systems (VCS)
#+BEAMER: \pause
- *cheap machine:* commodity hardware, desktop- or server-grade, a few kUSD of investment
#+BEAMER: \pause
- *ultra large scale:* in the ballpark of (the known extent of) all publicly available software source code

** Corpus --- Software Heritage
- our proxy for publicly available software:
  #+BEAMER: \vspace{-3mm}
  #+BEAMER: \begin{center}\includegraphics[width=0.6\textwidth]{SWH-logo+motto}\end{center}
  #+BEAMER: \vspace{-3mm}
- both *source code* and its *development history* as captured by VCS
- coverage:
  - all public repositories from GitHub and GitLab.com
  - historical forges: Google Code, Gitorious
  - package manager repositories: NPM, PyPI, Debian
- 90 M repositories, 5.5 B unique files, 1.1 B unique commits (data dump: 2018-09-25)
- available as an offline dataset
#+BEGIN_EXPORT latex
\begin{thebibliography}{Foo Bar, 1969} \footnotesize
\bibitem{Pietri2019}
Antoine Pietri, Diomidis Spinellis, Stefano Zacchiroli.
\newblock The Software Heritage Graph Dataset: Public software development under one roof.
\newblock MSR 2019: 16th Intl. Conf. on Mining Software Repositories. IEEE
\end{thebibliography}
#+END_EXPORT
# \newblock preprint: \url{http://deb.li/swhmsr19}

* Background
** (Web) graph compression
*** The graph of the Web
:PROPERTIES:
:BEAMER_env: definition
:END:
Directed graph that has Web pages as nodes and hyperlinks between them as edges.
*** Properties (1)
- *Locality:* pages link to pages whose URLs are lexicographically similar; URLs share long common prefixes.
  → use *D-gap compression* (see the sketch on the next slide)
*** Adjacency lists
:PROPERTIES:
:BEAMER_env: block
:BEAMER_col: 0.5
:END:
#+BEAMER: \scriptsize
| *Node* | *Outdegree* | *Successors*                         |
|--------+-------------+--------------------------------------|
| ...    | ...         | ...                                  |
| 15     | 11          | 13,15,16,17,18,19,23,24,203,315,1034 |
| 16     | 10          | 15,16,17,22,23,24,315,316,317,3041   |
| 17     | 0           |                                      |
| 18     | 5           | 13,15,16,17,50                       |
| ...    | ...         | ...                                  |
*** D-gapped adjacency lists
:PROPERTIES:
:BEAMER_env: block
:BEAMER_col: 0.5
:END:
#+BEAMER: \scriptsize
| *Node* | *Outdegree* | *Successors*                |
|--------+-------------+-----------------------------|
| ...    | ...         | ...                         |
| 15     | 11          | 3,1,0,0,0,0,3,0,178,111,718 |
| 16     | 10          | 1,0,0,4,0,0,290,0,0,2723    |
| 17     | 0           |                             |
| 18     | 5           | 9,1,0,0,32                  |
| ...    | ...         | ...                         |
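** D-gap compression --- illustrative sketch
A minimal, self-contained sketch (not part of the paper's tooling) of the gap encoding shown above: the first successor is stored relative to the source node, the following ones as gaps minus one. The resulting small non-negative integers compress well with universal codes (e.g., γ or ζ codes).
#+BEGIN_SRC java
import java.util.ArrayList;
import java.util.List;

public class DGap {
    // Encode the sorted successor list of node x as d-gaps:
    // first value is the (sign-folded) offset from x, then consecutive gaps minus one.
    static List<Long> encode(long x, long[] successors) {
        List<Long> gaps = new ArrayList<>();
        for (int i = 0; i < successors.length; i++) {
            if (i == 0) {
                long d = successors[0] - x;
                gaps.add(d >= 0 ? 2 * d : 2 * (-d) - 1); // fold sign into a non-negative value
            } else {
                gaps.add(successors[i] - successors[i - 1] - 1);
            }
        }
        return gaps;
    }

    // Decode d-gaps back into the successor list of node x.
    static List<Long> decode(long x, List<Long> gaps) {
        List<Long> successors = new ArrayList<>();
        long prev = 0;
        for (int i = 0; i < gaps.size(); i++) {
            long g = gaps.get(i);
            prev = (i == 0) ? (g % 2 == 0 ? x + g / 2 : x - (g + 1) / 2) : prev + g + 1;
            successors.add(prev);
        }
        return successors;
    }

    public static void main(String[] args) {
        long[] succ15 = {13, 15, 16, 17, 18, 19, 23, 24, 203, 315, 1034};
        System.out.println(encode(15, succ15));             // [3, 1, 0, 0, 0, 0, 3, 0, 178, 111, 718]
        System.out.println(decode(15, encode(15, succ15))); // round-trips to the original list
    }
}
#+END_SRC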
** (Web) graph compression (cont.)
*** The graph of the Web
:PROPERTIES:
:BEAMER_env: definition
:END:
Directed graph that has Web pages as nodes and hyperlinks between them as edges.
*** Properties (2)
- *Similarity:* pages that are close together in lexicographic order tend to have many common successors.
  → use *reference compression*
*** Adjacency lists
:PROPERTIES:
:BEAMER_env: block
:BEAMER_col: 0.5
:END:
#+BEAMER: \scriptsize
| *Node* | *Outd.* | *Successors*                         |
|--------+---------+--------------------------------------|
| ...    | ...     | ...                                  |
| 15     | 11      | 13,15,16,17,18,19,23,24,203,315,1034 |
| 16     | 10      | 15,16,17,22,23,24,315,316,317,3041   |
| 17     | 0       |                                      |
| 18     | 5       | 13,15,16,17,50                       |
| ...    | ...     | ...                                  |
*** Copy lists
:PROPERTIES:
:BEAMER_env: block
:BEAMER_col: 0.55
:END:
#+BEAMER: \scriptsize
| *Node* | *Ref.* | *Copy list* | *Extra nodes*                        |
|--------+--------+-------------+--------------------------------------|
| ...    | ...    | ...         | ...                                  |
| 15     | 0      |             | 13,15,16,17,18,19,23,24,203,315,1034 |
| 16     | 1      | 01110011010 | 22,316,317,3041                      |
| 17     |        |             |                                      |
| 18     | 3      | 11110000000 | 50                                   |
| ...    | ...    | ...         |                                      |

* Data model
** Data model
#+BEAMER: \centering \includegraphics[width=\textwidth]{swh-data-model-h}
** Data model --- example :noexport:
#+BEAMER: \centering
#+BEAMER: \only<1>{\includegraphics[width=\textwidth]{swh-merkle-dag-small-visit1}}
#+BEAMER: \only<2>{\includegraphics[width=\textwidth]{swh-merkle-dag-small-visit2}}
#+BEAMER: \only<3>{\includegraphics[width=\textwidth]{swh-merkle-dag-small}}

* VCS compression
** Corpus --- as a graph
*** Nodes
:PROPERTIES:
:BEAMER_env: block
:BEAMER_col: 0.5
:END:
| *Node type* | *N. of nodes* |
|-------------+---------------|
| origins     | 88 M          |
| snapshots   | 57 M          |
| releases    | 9.9 M         |
| revisions   | 1.1 B         |
| directories | 4.9 B         |
| contents    | 5.5 B         |
|-------------+---------------|
| Total nodes | 12 B          |
*** Edges
:PROPERTIES:
:BEAMER_env: block
:BEAMER_col: 0.5
:END:
| *Edge type*           | *N. of edges* |
|-----------------------+---------------|
| origin → snapshot     | 195 M         |
| snapshot → revision   | 616 M         |
| snapshot → release    | 215 M         |
| release → revision    | 9.9 M         |
| revision → revision   | 1.2 B         |
| revision → directory  | 1.1 B         |
| directory → directory | 48 B          |
| directory → revision  | 482 M         |
| directory → content   | 112 B         |
|-----------------------+---------------|
| Total edges           | 165 B         |
***
Archive snapshot 2018-09-25, from the Software Heritage graph dataset.\\
Growth rate: exponential, doubling every 22-30 months.

** Compression pipeline
*** Pipeline
:PROPERTIES:
:BEAMER_env: ignoreheading
:END:
#+BEAMER: \vspace{-8mm}
#+BEAMER: \hspace*{-0.1\linewidth} \includegraphics[width=1.2\linewidth]{compression/compression-steps}
#+BEAMER: \vspace{-1.2cm}
- *MPH:* minimal perfect hash, mapping Merkle IDs to 0..N-1 integers
- *BV compress:* Boldi-Vigna compression (based on MPH order)
- *BFS:* breadth-first visit to renumber nodes
- *Permute:* update BV compression according to BFS order
#+BEAMER: \vfill
*** (Re)establishing locality
- the key to good compression is a node ordering that ensures locality and similarity
- which is very much /not/ the case with Merkle IDs...
- ...but is the case /again/ after a BFS reordering (see the sketch on the next slide)
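** Compression pipeline --- BFS + Permute sketch
A minimal sketch of the "BFS" and "Permute" steps, assuming the WebGraph Java API (=ImmutableGraph=, =Transform=, =BVGraph=); the basenames are placeholders, and the in-memory =int[]= permutation is a simplification: the real graph has ~12 B nodes, more than the int-based API can address, so the actual pipeline would need the long-based (big) variant of WebGraph.
#+BEGIN_SRC java
import it.unimi.dsi.webgraph.BVGraph;
import it.unimi.dsi.webgraph.ImmutableGraph;
import it.unimi.dsi.webgraph.LazyIntIterator;
import it.unimi.dsi.webgraph.Transform;

import java.util.ArrayDeque;
import java.util.Arrays;

public class BfsPermute {
    public static void main(String[] args) throws Exception {
        // Graph already BV-compressed in MPH (essentially random) order.
        ImmutableGraph g = ImmutableGraph.loadMapped("graph-mph-order");
        int n = g.numNodes();

        // BFS numbering over the whole (possibly disconnected) graph: perm[old] = new.
        int[] perm = new int[n];
        Arrays.fill(perm, -1);
        ArrayDeque<Integer> queue = new ArrayDeque<>();
        int next = 0;
        for (int root = 0; root < n; root++) {
            if (perm[root] != -1) continue;
            perm[root] = next++;
            queue.add(root);
            while (!queue.isEmpty()) {
                int u = queue.remove();
                LazyIntIterator succ = g.successors(u);
                for (int v; (v = succ.nextInt()) != -1; ) {
                    if (perm[v] == -1) {
                        perm[v] = next++;
                        queue.add(v);
                    }
                }
            }
        }

        // Renumber nodes according to the BFS order and re-store with BV compression.
        ImmutableGraph permuted = Transform.map(g, perm);
        BVGraph.store(permuted, "graph-bfs-order");
    }
}
#+END_SRC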
** Compression time
We ran the compression pipeline on the input corpus using the WebGraph framework.
#+BEGIN_EXPORT latex
\begin{thebibliography}{Foo Bar, 1969} \footnotesize
\bibitem{BoVWFI}
Paolo Boldi and Sebastiano Vigna.
\newblock The WebGraph framework I: Compression techniques.
\newblock WWW 2004: 13th Intl. World Wide Web Conference. ACM
\end{thebibliography}
#+END_EXPORT
| *Step*      | *Wall time* (hours) |
|-------------+---------------------|
| MPH         | 2                   |
| BV Compress | 84                  |
| BFS         | 19                  |
| Permute     | 18                  |
| Transpose   | 15                  |
|-------------+---------------------|
| Total       | 138 (~6 days)       |
- server equipped with 24 CPUs and 750 GB of RAM
- RAM mostly used as I/O cache for the BFS step
- /minimum/ memory requirements are close to the RAM needed to load the final compressed graph in memory

** Compression efficiency
*** Forward graph
:PROPERTIES:
:BEAMER_env: block
:BEAMER_col: 0.4
:END:
| total size    | 91 GiB |
| bits per edge | 4.91   |
# | compression ratio | 15.8% |
*** Backward graph
:PROPERTIES:
:BEAMER_env: block
:BEAMER_col: 0.4
:END:
| total size    | 83 GiB |
| bits per edge | 4.49   |
# | compression ratio | 14.4% |
#+BEAMER: \vfill
*** Operation cost
The structure of the full bidirectional archive graph fits in less than 200 GiB of RAM, for a hardware cost of ~300 USD.

* Exploitation
** A domain-agnostic benchmark --- full corpus traversal
*** Benchmark --- Full BFS visit
#+BEAMER: \begin{columns}\begin{column}{0.45\textwidth}
| *Forward graph* |                |
|-----------------+----------------|
| wall time       | 1h48m          |
| throughput      | 1.81 M nodes/s |
|                 | (553 ns/node)  |
#+BEAMER: \end{column}\begin{column}{0.45\textwidth}
| *Backward graph* |                |
|------------------+----------------|
| wall time        | 3h17m          |
| throughput       | 988 k nodes/s  |
|                  | (1.01 µs/node) |
#+BEAMER: \end{column}\end{columns}
*** Benchmark --- Edge lookup
random sample: 1 B nodes (8.3% of the entire graph)
#+BEAMER: \begin{columns}\begin{column}{0.45\textwidth}
| *Forward graph* |                |
|-----------------+----------------|
| visited edges   | 13.6 B         |
| throughput      | 12.0 M edges/s |
|                 | (83 ns/edge)   |
#+BEAMER: \end{column}\begin{column}{0.45\textwidth}
| *Backward graph* |                |
|------------------+----------------|
| visited edges    | 13.6 B         |
| throughput       | 9.45 M edges/s |
|                  | (106 ns/edge)  |
#+BEAMER: \end{column}\end{columns}
***
:PROPERTIES:
:BEAMER_env: ignoreheading
:END:
Note how edge lookup time is close to DRAM random access time (50-60 ns); a measurement sketch follows on the next slide.
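** Edge lookup --- measurement sketch
An illustrative sketch (not the benchmark code behind the figures above) of how successor-enumeration speed can be measured on the memory-mapped compressed graph, assuming the WebGraph Java API; the basename, sample size, and int-based node ids are placeholders and simplifications.
#+BEGIN_SRC java
import it.unimi.dsi.webgraph.ImmutableGraph;
import it.unimi.dsi.webgraph.LazyIntIterator;

import java.util.Random;

public class EdgeLookupBench {
    public static void main(String[] args) throws Exception {
        // Memory-map the compressed graph: only the pages actually touched are loaded.
        ImmutableGraph g = ImmutableGraph.loadMapped("swh-graph-bfs-order");
        Random rng = new Random(42);
        int samples = 1_000_000;   // number of sampled source nodes (placeholder)
        long edges = 0;

        long start = System.nanoTime();
        for (int i = 0; i < samples; i++) {
            int u = rng.nextInt(g.numNodes());
            LazyIntIterator succ = g.successors(u);
            while (succ.nextInt() != -1) edges++;   // enumerate and count outgoing edges
        }
        long elapsed = System.nanoTime() - start;

        System.out.printf("visited %d edges, %.1f ns/edge%n",
                          edges, (double) elapsed / Math.max(edges, 1));
    }
}
#+END_SRC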
** Domain-specific benchmarks --- source code artifact multiplication
***
Simple *clone detection* style experiments realized by exploiting the compressed corpus:
1. *file→commit multiplication:* how often identical source code files re-occur in different commits
2. *commit→origin multiplication:* how often identical commits re-occur in different repositories
*** Implementation
- for each node---content for (1), commit for (2)---*visit* the backward graph and *count* all reachable nodes of the desired type---commit for (1), origin for (2); see the sketch in the appendix
- naive approach, O(|V| × |E|) complexity

** File→commit multiplication --- results
#+BEAMER: \vspace{-5mm}
*** Multiplication factor
:PROPERTIES:
:BEAMER_env: block
:BEAMER_col: 0.5
:END:
#+BEAMER: \centering \includegraphics[width=0.9\textwidth]{this/content-revision-multiplication.png}
*** Visit size
:PROPERTIES:
:BEAMER_env: block
:BEAMER_col: 0.5
:END:
#+BEAMER: \centering \includegraphics[width=0.9\textwidth]{this/content-revision-node-size.png}
*** Benchmark figures
:PROPERTIES:
:BEAMER_env: ignoreheading
:END:
#+BEAMER: \vfill
- random sample of 953 M contents (17% of the full corpus)
- processing time: ~2.5 days (single machine with 20 × 2.4 GHz cores)
- /in spite of/ the naive O(|V| × |E|) approach, generally considered intractable at this scale

* Discussion
** Limitations
#+BEAMER: \vspace{-2mm}
*** Incrementality
- compression is inherently *not incremental*
- not an issue for most research use cases, because we analyze immutable data dumps
- a common workaround (e.g., for the Web and social networks) is to keep an uncompressed *in-memory overlay* for graph updates, and periodically recompress
#+BEAMER: \pause
*** In-memory v. on-disk
- the compressed in-memory graph structure has *no attributes*
- the usual data design is to exploit the 0..N-1 integer range to *memory map /node/ attributes* to secondary storage
- we have done this with a *node type map*; it weighs 4 GB (3 bits per node)
- works well for queries that do graph traversal first and "join" node attributes last; ping-ponging between the two is expensive
- /edge/ attributes are more problematic

* Conclusion
** Wrapping up
#+BEAMER: \vspace{-2mm}
***
- Graph compression is a viable technique to analyze the history of all public source code, as captured by modern version control systems (VCS), on a budget.
- It is a novel tool for VCS analyses that might allow us to expand the scope of our experiments, reducing selection bias and improving external validity.
- More work is needed to make compression incremental and to allow VCS properties to be queried efficiently during traversal.
*** See the full paper for more details
#+BEGIN_EXPORT latex
\begin{thebibliography}{Foo Bar, 1969} \small \vspace{-2mm}
\bibitem{Boldi2020}
Paolo Boldi, Antoine Pietri, Sebastiano Vigna, Stefano Zacchiroli.
\newblock Ultra-Large-Scale Repository Analysis via Graph Compression.
\newblock SANER 2020: 27th Intl. Conf. on Software Analysis, Evolution and Reengineering. IEEE
\end{thebibliography}
preprint: {\small \url{http://bit.ly/swh-graph-saner20}}
#+END_EXPORT
*** Contacts
Stefano Zacchiroli / zack@irif.fr / @zacchiro / talk to me at SANER 2020!

* Appendix                                                        :B_appendix:
:PROPERTIES:
:BEAMER_env: appendix
:END:
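** File→commit multiplication --- counting visit sketch
An illustrative sketch of the counting visit described in the "Implementation" slide, assuming the WebGraph Java API; the =nodeType= array stands in for the memory-mapped node type map mentioned under "Limitations", and all basenames, type codes, and node ids are placeholders rather than the actual implementation (which also needs the long-based variant of WebGraph at this scale).
#+BEGIN_SRC java
import it.unimi.dsi.webgraph.ImmutableGraph;
import it.unimi.dsi.webgraph.LazyIntIterator;

import java.util.ArrayDeque;
import java.util.BitSet;

public class Multiplication {
    // Count the nodes of type `target` reachable from `start` in the transposed graph.
    static long countReachable(ImmutableGraph backward, byte[] nodeType, int start, byte target) {
        BitSet seen = new BitSet(backward.numNodes());
        ArrayDeque<Integer> queue = new ArrayDeque<>();
        seen.set(start);
        queue.add(start);
        long count = 0;
        while (!queue.isEmpty()) {
            int u = queue.remove();
            if (nodeType[u] == target) count++;
            LazyIntIterator succ = backward.successors(u);
            for (int v; (v = succ.nextInt()) != -1; ) {
                if (!seen.get(v)) {
                    seen.set(v);
                    queue.add(v);
                }
            }
        }
        return count;
    }

    public static void main(String[] args) throws Exception {
        ImmutableGraph backward = ImmutableGraph.loadMapped("swh-graph-transposed");
        byte REVISION = 3;                                // placeholder type code
        byte[] nodeType = new byte[backward.numNodes()];  // placeholder for the memory-mapped node type map
        int someContent = 42;                             // placeholder id of a content node
        System.out.println(countReachable(backward, nodeType, someContent, REVISION));
    }
}
#+END_SRC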