diff --git a/talks-public/2020-02-07-lyon-univ/2020-02-07-lyon-univ.org b/talks-public/2020-02-07-lyon-univ/2020-02-07-lyon-univ.org
index 2a7021d..8a1fa07 100644
--- a/talks-public/2020-02-07-lyon-univ/2020-02-07-lyon-univ.org
+++ b/talks-public/2020-02-07-lyon-univ/2020-02-07-lyon-univ.org
@@ -1,148 +1,148 @@
 #+COLUMNS: %40ITEM %10BEAMER_env(Env) %9BEAMER_envargs(Env Args) %10BEAMER_act(Act) %4BEAMER_col(Col) %10BEAMER_extra(Extra) %8BEAMER_opt(Opt)
 #+TITLE: Software Heritage
 #+SUBTITLE: Analyzing the Global Graph of Public Software Development
-#+BEAMER_HEADER: \date[7 Feb 2020, Lyon 1]{7 February 2020\\Université Lyon 1 --- Lyon, France}
+#+BEAMER_HEADER: \date[7 Feb 2020, LIRIS]{7 February 2020\\LIRIS --- Lyon, France}
 #+AUTHOR: Stefano Zacchiroli
 #+DATE: 7 February 2020
 #+EMAIL: zack@upsilon.cc
 #+INCLUDE: "../../common/modules/prelude-toc.org" :minlevel 1
 #+INCLUDE: "../../common/modules/169.org"
 #+BEAMER_HEADER: \institute[UPD \& Inria]{Université de Paris \& Inria --- {\tt zack@upsilon.cc, @zacchiro}}
 #+BEAMER_HEADER: \author{Stefano Zacchiroli}
 # Required by graph-compression.org module
 #+LATEX_HEADER_EXTRA: \usepackage{pdfpages}
 # Syntax highlighting setup
 #+LATEX_HEADER_EXTRA: \usepackage{minted}
 #+LaTeX_HEADER_EXTRA: \usemintedstyle{tango}
 #+LaTeX_HEADER_EXTRA: \newminted{sql}{fontsize=\scriptsize}
 #+name: setup-minted
 #+begin_src emacs-lisp :exports results :results silent
 (setq org-latex-listings 'minted)
 (setq org-latex-minted-options
       '(("fontsize" "\\scriptsize")
         ("linenos" "")))
 (setq org-latex-to-pdf-process
       '("pdflatex -shell-escape -interaction nonstopmode -output-directory %o %f"
         "pdflatex -shell-escape -interaction nonstopmode -output-directory %o %f"
         "pdflatex -shell-escape -interaction nonstopmode -output-directory %o %f"))
 #+end_src
 # End syntax highlighting setup
 * Software Heritage
 # #+INCLUDE: "../../common/modules/source-code-different-long.org::#everywhere" :minlevel 2
 #+INCLUDE: "../../common/modules/swh-motivations-foss-iconic.org::#main" :only-contents t
 #+INCLUDE: "../../common/modules/swh-overview-sourcecode.org::#mission" :minlevel 2
 # #+INCLUDE: "../../common/modules/principles-short.org::#principles" :minlevel 2
 #+INCLUDE: "../../common/modules/status-extended.org::#archivinggoals" :minlevel 2
 #+INCLUDE: "../../common/modules/status-extended.org::#architecture" :minlevel 2 :only-contents t
 #+INCLUDE: "../../common/modules/status-extended.org::#merkletree" :minlevel 2
 #+INCLUDE: "../../common/modules/status-extended.org::#datamodel" :only-contents t
 #+INCLUDE: "../../common/modules/status-extended.org::#dagdetailsmall" :only-contents t
 #+INCLUDE: "../../common/modules/status-extended.org::#archive" :minlevel 2
 * Querying the archive
 ** Exploitation
 #+BEAMER: \LARGE \centering
 How do you query the Software Heritage archive?
 #+BEAMER: \Large \\
 (on a budget)
 ** Use cases --- product needs
 e.g., for https://archive.softwareheritage.org
 *** Browsing
 - =ls=
 - =git log=
   (Linux kernel: 800K+ commits)
 *** Wayback machine
 - tarball
 - =git bundle=
   (Linux kernel: 7M+ nodes)
 *** Provenance tracking
 - commit provenance (one/all contexts)
 - requires backtracking
 - origin provenance (one/all contexts)
 *** Note
 We therefore need both the Merkle DAG graph and its *transposed*
 ** Use cases --- research questions
 *** For the sake of it
 - local graph topology
 - connected component size
 - enabling question to identify the best approach (e.g., scale-up v. scale-out) to conduct large-scale analyses
 - any other emerging property
 *** Software Engineering topics
 - software provenance analysis at this scale is pretty much unexplored yet
 - industry frontier: increase granularity down to the individual line of code
 - replicate at this scale (famous) studies that have generally been conducted on (much) smaller version control system samples to confirm/refute their findings
 - ...
 #+INCLUDE: "../../common/modules/dataset.org::#main" :minlevel 2 :only-contents t
 #+INCLUDE: "../../common/modules/dataset.org::#morequery" :minlevel 2 :only-contents t
 ** Discussion
 - one /can/ query such a corpus SQL-style
 - but relational representation shows its limits at this scale
 - ...at least as deployed on commercial SQL offerings such as Athena
 - note: (naive) sharding is ineffective, due to the pseudo-random distribution of node identifiers
 - experiments with Google BigQuery are ongoing
 - (we broke it at the first import attempt..., due to very large arrays in directory entry tables)
 * Graph compression
 #+INCLUDE: "../../common/modules/graph-compression.org::#main" :minlevel 2 :only-contents t
 * Conclusion
 ** We're hiring! (a postdoc)
 *** Paris-based postdoc on software provenance
 - large-scale, big data *graph analysis*
 - tracking the *provenance of source code* artifacts
 - … at the *scale of the world* (what else?)
 - in the context of *industrial partnerships* on open source license compliance
 - supervision: Stefano Zacchiroli, Roberto Di Cosmo
 *** Learn more and apply
 - https://softwareheritage.org/jobs/
 - ask me! zack@upsilon.cc
 ** Wrapping up
 #+latex: \vspace{-2mm}
 ***
 - Software Heritage archives all public source code as a huge Merkle DAG
 - Querying and analyzing it pose scaling challenges (20/300 B nodes/edges)
 - It is a gold mine of research leads for graph/database scholars. Wanna join?
 #+latex: \vspace{-2mm}
 *** References
 #+latex: \vspace{-1mm}
 #+BEGIN_EXPORT latex
 \begin{thebibliography}{Foo Bar, 1969}
 \scriptsize
 \bibitem{Abramatic2018} Jean-François Abramatic, Roberto Di Cosmo, Stefano Zacchiroli
 \newblock Building the Universal Archive of Source Code
 \newblock Communications of the ACM, October 2018
 \bibitem{Pietri2019} Antoine Pietri, Diomidis Spinellis, Stefano Zacchiroli
 \newblock The Software Heritage graph dataset: public software development under one roof
 \newblock MSR 2019: 16th Intl. Conf. on Mining Software Repositories. IEEE
 \bibitem{Boldi2020} Paolo Boldi, Antoine Pietri, Sebastiano Vigna, Stefano Zacchiroli
 \newblock Ultra-Large-Scale Repository Analysis via Graph Compression
 \newblock SANER 2020, 27th Intl. Conf. on Software Analysis, Evolution and Reengineering. IEEE
 \end{thebibliography}
 #+END_EXPORT
 *** Contacts
 Stefano Zacchiroli / zack@upsilon.cc / @zacchiro
 * Appendix :B_appendix:
 :PROPERTIES:
 :BEAMER_env: appendix
 :END: