diff --git a/talks-public/2020-02-07-lyon-univ/2020-02-07-lyon-univ.org b/talks-public/2020-02-07-lyon-univ/2020-02-07-lyon-univ.org
new file mode 100644
index 0000000..5812075
--- /dev/null
+++ b/talks-public/2020-02-07-lyon-univ/2020-02-07-lyon-univ.org
@@ -0,0 +1,145 @@
+#+COLUMNS: %40ITEM %10BEAMER_env(Env) %9BEAMER_envargs(Env Args) %10BEAMER_act(Act) %4BEAMER_col(Col) %10BEAMER_extra(Extra) %8BEAMER_opt(Opt)
+#+TITLE: Software Heritage
+#+SUBTITLE: Analyzing the Global Graph of Public Software Development
+#+BEAMER_HEADER: \date[7 Feb 2020, Lyon 1]{7 February 2020\\Université Lyon 1 --- Lyon, France}
+#+AUTHOR: Stefano Zacchiroli
+#+DATE: 7 February 2020
+#+EMAIL: zack@upsilon.cc
+
+#+INCLUDE: "../../common/modules/prelude-toc.org" :minlevel 1
+#+INCLUDE: "../../common/modules/169.org"
+#+BEAMER_HEADER: \institute[UPD \& Inria]{Université de Paris \& Inria --- {\tt zack@upsilon.cc, @zacchiro}}
+#+BEAMER_HEADER: \author{Stefano Zacchiroli}
+
+# Syntax highlighting setup
+#+LATEX_HEADER_EXTRA: \usepackage{minted}
+#+LaTeX_HEADER_EXTRA: \usemintedstyle{tango}
+#+LaTeX_HEADER_EXTRA: \newminted{sql}{fontsize=\scriptsize}
+#+name: setup-minted
+#+begin_src emacs-lisp :exports results :results silent
+  (setq org-latex-listings 'minted)
+  (setq org-latex-minted-options
+        '(("fontsize" "\\scriptsize")
+          ("linenos" "")))
+  (setq org-latex-to-pdf-process
+        '("pdflatex -shell-escape -interaction nonstopmode -output-directory %o %f"
+          "pdflatex -shell-escape -interaction nonstopmode -output-directory %o %f"
+          "pdflatex -shell-escape -interaction nonstopmode -output-directory %o %f"))
+#+end_src
+# End syntax highlighting setup
+
+* Software Heritage
+  #+INCLUDE: "../../common/modules/source-code-different-long.org::#everywhere" :minlevel 2
+  #+INCLUDE: "../../common/modules/swh-motivations-foss-iconic.org::#main" :only-contents t
+  #+INCLUDE: "../../common/modules/swh-overview-sourcecode.org::#mission" :minlevel 2
+  #+INCLUDE: "../../common/modules/principles-short.org::#principles" :minlevel 2
+  #+INCLUDE: "../../common/modules/status-extended.org::#archivinggoals" :minlevel 2
+  #+INCLUDE: "../../common/modules/status-extended.org::#architecture" :minlevel 2 :only-contents t
+  #+INCLUDE: "../../common/modules/status-extended.org::#merkletree" :minlevel 2
+  #+INCLUDE: "../../common/modules/status-extended.org::#datamodel" :only-contents t
+  #+INCLUDE: "../../common/modules/status-extended.org::#dagdetailsmall" :only-contents t
+  #+INCLUDE: "../../common/modules/status-extended.org::#archive" :minlevel 2
+
+* Querying the archive
+** Exploitation
+  #+BEAMER: \LARGE \centering
+  How do you query the Software Heritage archive?
+  #+BEAMER: \Large \\
+  (on a budget)
+
+** Use cases --- product needs
+  e.g., for https://archive.softwareheritage.org
+*** Browsing
+  - =ls=
+  - =git log= (Linux kernel: 800K+ commits)
+*** Wayback machine
+  - tarball
+  - =git bundle= (Linux kernel: 7M+ nodes)
+*** Provenance tracking
+  - commit provenance (one/all contexts)
+  - requires backtracking
+  - origin provenance (one/all contexts)
+*** Note
+  We therefore need both the Merkle DAG graph and its *transposed*
+
+** Use cases --- research questions
+*** For the sake of it
+  - local graph topology
+  - connected component size
+  - enabling question to identify the best approach (e.g., scale-up
+    v. scale-out) to conduct large-scale analyses
+  - any other emerging property
+*** Software Engineering topics
+  - software provenance analysis at this scale is pretty much unexplored yet
+  - industry frontier: increase granularity down to the individual line of
+    code
+  - replicate at this scale (famous) studies that have generally been
+    conducted on (much) smaller version control system samples to
+    confirm/refute their findings
+  - ...
+
+  #+INCLUDE: "../../common/modules/dataset.org::#main" :minlevel 2 :only-contents t
+  #+INCLUDE: "../../common/modules/dataset.org::#morequery" :minlevel 2 :only-contents t
+
+** Takeaways
+  - one /can/ query such a corpus SQL-style
+  - but relational representation shows its limits at this scale
+  - ...at least as deployed on commercial SQL offerings such as Athena
+  - note: (naive) sharding is ineffective, due to the pseudo-random
+    distribution of node identifiers
+  - experiments with Google BigQuery are ongoing
+  - (we broke it at the first import attempt..., due to very large arrays in
+    directory entry tables)
+
+* Graph compression
+  #+INCLUDE: "../../common/modules/graph-compression.org::#main" :minlevel 2 :only-contents t
+
+* Conclusion
+** We're hiring! (a postdoc)
+*** Paris-based postdoc on software provenance
+  - large-scale, big data *graph analysis*
+  - tracking the *provenance of source code* artifacts
+  - … at the *scale of the world* (what else?)
+  - in the context of *industrial partnerships* on open source license
+    compliance
+  - supervision: Stefano Zacchiroli, Roberto Di Cosmo
+
+*** Learn more and apply
+  - https://softwareheritage.org/jobs/
+  - ask me! zack@upsilon.cc
+
+** Wrapping up
+  #+latex: \vspace{-2mm}
+***
+  - Software Heritage archives all public source code as a huge Merkle DAG
+  - Querying and analyzing it pose scaling challenges (20/300 B nodes/edges)
+  - It is a gold mine of research leads for graph/database scholars. Wanna
+    join?
+  #+latex: \vspace{-2mm}
+*** References
+  #+latex: \vspace{-1mm}
+  #+BEGIN_EXPORT latex
+  \begin{thebibliography}{Foo Bar, 1969}
+  \scriptsize
+
+  \bibitem{Abramatic2018} Jean-François Abramatic, Roberto Di Cosmo, Stefano Zacchiroli
+  \newblock Building the Universal Archive of Source Code
+  \newblock Communications of the ACM, October 2018
+
+  \bibitem{Pietri2019} Antoine Pietri, Diomidis Spinellis, Stefano Zacchiroli
+  \newblock The Software Heritage graph dataset: public software development under one roof
+  \newblock MSR 2019: 16th Intl. Conf. on Mining Software Repositories. IEEE
+
+  \bibitem{Boldi2020} Paolo Boldi, Antoine Pietri, Sebastiano Vigna, Stefano Zacchiroli
+  \newblock Ultra-Large-Scale Repository Analysis via Graph Compression
+  \newblock SANER 2020, 27th Intl. Conf. on Software Analysis, Evolution and Reengineering. IEEE
+
+  \end{thebibliography}
+  #+END_EXPORT
+*** Contacts
+  Stefano Zacchiroli / zack@upsilon.cc / @zacchiro
+
+* Appendix :B_appendix:
+  :PROPERTIES:
+  :BEAMER_env: appendix
+  :END:
diff --git a/talks-public/2020-02-07-lyon-univ/Makefile b/talks-public/2020-02-07-lyon-univ/Makefile
new file mode 100644
index 0000000..68fbee7
--- /dev/null
+++ b/talks-public/2020-02-07-lyon-univ/Makefile
@@ -0,0 +1 @@
+include ../Makefile.slides