diff --git a/talks-public/2020-05-28-lip6/2020-05-28-lip6.org b/talks-public/2020-05-28-lip6/2020-05-28-lip6.org index dae8587..7bfe6a0 100644 --- a/talks-public/2020-05-28-lip6/2020-05-28-lip6.org +++ b/talks-public/2020-05-28-lip6/2020-05-28-lip6.org @@ -1,169 +1,167 @@ #+COLUMNS: %40ITEM %10BEAMER_env(Env) %9BEAMER_envargs(Env Args) %10BEAMER_act(Act) %4BEAMER_col(Col) %10BEAMER_extra(Extra) %8BEAMER_opt(Opt) #+TITLE: Software Heritage #+SUBTITLE: Analyzing the Global Graph of Public Software Development #+BEAMER_HEADER: \date[28 May 2020, LIP6]{28 May 2020\\LIP6 --- Paris, France\\ (via conf call)\\[-2ex]} #+AUTHOR: Stefano Zacchiroli #+DATE: 28 May 2020 #+EMAIL: zack@upsilon.cc #+INCLUDE: "../../common/modules/prelude-toc.org" :minlevel 1 #+INCLUDE: "../../common/modules/169.org" #+BEAMER_HEADER: \institute[UParis \& Inria]{Université de Paris \& Inria --- {\tt zack@upsilon.cc, @zacchiro}} #+BEAMER_HEADER: \author{Stefano Zacchiroli} # Required by graph-compression.org module #+LATEX_HEADER_EXTRA: \usepackage{pdfpages} # Syntax highlighting setup #+LATEX_HEADER_EXTRA: \usepackage{minted} #+LaTeX_HEADER_EXTRA: \usemintedstyle{tango} #+LaTeX_HEADER_EXTRA: \newminted{sql}{fontsize=\scriptsize} #+name: setup-minted #+begin_src emacs-lisp :exports results :results silent (setq org-latex-listings 'minted) (setq org-latex-minted-options - '(("fontsize" "\\scriptsize") - ("linenos" ""))) + '(("fontsize" "\\scriptsize"))) (setq org-latex-to-pdf-process '("pdflatex -shell-escape -interaction nonstopmode -output-directory %o %f" "pdflatex -shell-escape -interaction nonstopmode -output-directory %o %f" "pdflatex -shell-escape -interaction nonstopmode -output-directory %o %f")) #+end_src # End syntax highlighting setup * About me :B_ignoreheading: :PROPERTIES: :BEAMER_env: ignoreheading :END: #+INCLUDE: "this/zack.org" :minlevel 2 * Software Heritage -** Software Heritage in a nutshell \hfill www.softwareheritage.org -#+BEAMER: \transdissolve -#+INCLUDE: 
"../../common/modules/swh-goals-oneslide-vertical.org::#goals" :only-contents t :minlevel 3 +** Software Heritage in a nutshell \hfill [[https://softwareheritage.org][softwareheritage.org]] + #+INCLUDE: "../../common/modules/swh-goals-oneslide-vertical.org::#goals" :only-contents t :minlevel 3 ** An international, non profit initiative\hfill built for the long term :PROPERTIES: :CUSTOM_ID: support :END: *** Sharing the vision :B_block: :PROPERTIES: :CUSTOM_ID: endorsement :BEAMER_COL: .5 :BEAMER_env: block :END: #+LATEX: \begin{center}{\includegraphics[width=\extblockscale{.4\linewidth}]{unesco_logo_en_285}}\end{center} #+LATEX: \vspace{-0.8cm} #+LATEX: \begin{center}\vskip 1em \includegraphics[width=\extblockscale{1.4\linewidth}]{support.pdf}\end{center} #+latex:\mbox{}~~~~~~~\tiny\url{www.softwareheritage.org/support/testimonials} *** Donors, members, sponsors :B_block: :PROPERTIES: :CUSTOM_ID: sponsors :BEAMER_COL: .5 :BEAMER_env: block :END: #+LATEX: \begin{center}\includegraphics[width=\extblockscale{.4\linewidth}]{inria-logo-new}\end{center} #+LATEX: \begin{center} #+LATEX: \colorbox{white}{\includegraphics[width=\extblockscale{1.4\linewidth}]{sponsors.pdf}} #+latex:\mbox{}~~~~~~~\tiny\url{www.softwareheritage.org/support/sponsors} #+LATEX: \end{center} ** Status :B_ignoreheading: :PROPERTIES: :BEAMER_env: ignoreheading :END: #+INCLUDE: "../../common/modules/status-extended.org::#archivinggoals" :minlevel 2 #+INCLUDE: "../../common/modules/status-extended.org::#architecture" :minlevel 2 :only-contents t #+INCLUDE: "../../common/modules/status-extended.org::#merkletree" :minlevel 2 #+INCLUDE: "../../common/modules/status-extended.org::#datamodel" :minlevel 2 :only-contents t #+INCLUDE: "../../common/modules/status-extended.org::#dagdetailsmall" :minlevel 2 :only-contents t #+INCLUDE: "../../common/modules/status-extended.org::#archive" :minlevel 2 * Querying the archive ** Exploitation #+BEAMER: \LARGE \centering How do you query the Software Heritage archive? 
#+BEAMER: \Large \\ (on a budget) ** Use cases --- product needs e.g., for https://archive.softwareheritage.org *** Browsing - =ls= - =git log= (Linux kernel: 800K+ commits) *** Wayback machine - tarball - =git bundle= (Linux kernel: 7M+ nodes) *** Provenance tracking - commit provenance (one/all contexts) - requires backtracking - origin provenance (one/all contexts) *** Note We therefore need both the Merkle DAG graph and its *transposed* ** Use cases --- research questions *** For the sake of it - - local graph topology - - connected component size + - local graph *topology* + - size distribution of *connected components/modules* - enabling question to identify the best approach (e.g., scale-up v. scale-out) to conduct large-scale analyses - - any other emerging property + - all other *emergent properties* *** Software Engineering topics - - software provenance analysis at this scale is pretty much unexplored yet - - industry frontier: increase granularity down to the individual line of - code - - replicate at this scale (famous) studies that have generally been - conducted on (much) smaller version control system samples to - confirm/refute their findings + - software *provenance analysis* at this scale remains pretty much + unexplored + - industry frontier: increase granularity down to the individual *line of + code* + - *replicate* at this scale (famous) studies that have been conducted at + (much) smaller scale to confirm/refute their findings - ... 
#+INCLUDE: "../../common/modules/dataset.org::#main" :minlevel 2 :only-contents t #+INCLUDE: "../../common/modules/dataset.org::#morequery" :minlevel 2 :only-contents t ** Discussion - one /can/ query such a corpus SQL-style - but relational representation shows its limits at this scale - - ...at least as deployed on commercial SQL offerings such as Athena - - note: (naive) sharding is ineffective, due to the pseudo-random - distribution of node identifiers + - ...at least as deployed on commercial SQL offerings such as Athena + - note: (naive) sharding is ineffective, due to the pseudo-random + distribution of node identifiers - experiments with Google BigQuery are ongoing - (we broke it at the first import attempt..., due to very large arrays in directory entry tables) * Graph compression #+INCLUDE: "../../common/modules/graph-compression.org::#main" :minlevel 2 :only-contents t * Conclusion #+INCLUDE: "this/roadmap.org" :minlevel 2 ** Wrapping up #+latex: \vspace{-2mm} *** - Software Heritage archives all public source code as a huge Merkle DAG - Querying and analyzing it at scale (20/300 B nodes/edges) is an open problem - Gold mine of research leads in sw. eng., complex networks, big code, reproducibility #+latex: \vspace{-2mm} *** References #+latex: \vspace{-1mm} #+BEGIN_EXPORT latex \begin{thebibliography}{} \scriptsize \bibitem{Abramatic2018} Jean-François Abramatic, Roberto Di Cosmo, Stefano Zacchiroli \newblock Building the Universal Archive of Source Code \newblock Communications of the ACM, October 2018 \bibitem{Pietri2019} Antoine Pietri, Diomidis Spinellis, Stefano Zacchiroli \newblock The Software Heritage graph dataset: public software development under one roof \newblock MSR 2019: 16th Intl. Conf. on Mining Software Repositories. IEEE \bibitem{Boldi2020} Paolo Boldi, Antoine Pietri, Sebastiano Vigna, Stefano Zacchiroli \newblock Ultra-Large-Scale Repository Analysis via Graph Compression \newblock SANER 2020, 27th Intl. Conf. 
on Software Analysis, Evolution and Reengineering. IEEE \end{thebibliography} #+END_EXPORT *** Contacts - Stefano Zacchiroli / [[https://upsilon.cc/~zack/][upsilon.cc]] / [[mailto:zack@upsilon.cc][zack@upsilon.cc]] / [[https://twitter.com/zacchiro][@zacchiro]] + [[https://upsilon.cc/~zack/][Stefano Zacchiroli]] / [[mailto:zack@upsilon.cc][zack@upsilon.cc]] / [[https://twitter.com/zacchiro][@zacchiro]] / [[https://mastodon.xyz/@zacchiro][@zacchiro@mastodon.xyz]] * Appendix :B_appendix: :PROPERTIES: :BEAMER_env: appendix :END: diff --git a/talks-public/2020-05-28-lip6/this/roadmap.org b/talks-public/2020-05-28-lip6/this/roadmap.org index e072b27..5bb8de1 100644 --- a/talks-public/2020-05-28-lip6/this/roadmap.org +++ b/talks-public/2020-05-28-lip6/this/roadmap.org @@ -1,62 +1,74 @@ ** A (brief) research roadmap --- 1 *** Graph compression - incremental, amortized compression → ongoing UniMi collaboration - graph query languages on top of the compressed representation → ongoing LIRIS collaboration #+BEAMER: \vfill \pause *** Complex networks - local topology of the global VCS graph - emergent properties (the "classics": scale-free, small world, etc.) - dynamic modeling of graph evolution over time → collab. with physics @ UParis *** #+BEGIN_EXPORT latex \vspace{-2mm} \begin{thebibliography}{} \small \bibitem{Pietri2019} Antoine Pietri, Guillaume Rousseau, Stefano Zacchiroli\newblock Determining the Intrinsic Structure of Public Software Development History\newblock MSR 2020: 17th Intl. Conf. on Mining Software Repositories. 
IEEE\newblock registered study protocol, to appear \end{thebibliography} #+END_EXPORT + ** A (brief) research roadmap --- 2 #+BEAMER: \vspace{-2mm} *** Very-large-scale "big code" + #+BEAMER: \vspace{-1mm} - /big code/ = apply ML/DL to source code and other development byproducts - current results are language-specific and limited in scale; even the simplest problems become challenging at this scale and heterogeneity - lead: scalable language detection → collaboration with UniBo - lead: project classification → collaboration with CELI - the VCS graph remains largely unexplored in big code - lead: apply GNN (Graph Neural Network) for VCS node classification #+BEAMER: \vspace{-1mm} *** :B_ignoreheading: :PROPERTIES: :BEAMER_env: ignoreheading :END: #+BEAMER: \vspace{-1mm} \pause -*** Very-large-scale source code indexing +*** Very-large-scale source code indexing :noexport: - common AST-based approaches for code indexing are not viable here due do maximum heterogeneity - alternative: treat code as text and full-text index it - previous exp.: 3-gram based indexing in Debsources, supporting regexp matching - goal: find a sweet spot between the two #+BEAMER: \vspace{-1mm} +*** Supporting large-scale, incremental static analysis + - observation: to popularize static analysis we need CI integration and + incrementality (cf. Semmle) + - idea: exploit Merkle properties to efficiently identify changes between + commits + - requirement: modular static analysis tools that can reuse past file-level + results + → grant proposal with CEA (for Frama-C) + #+BEAMER: \vspace{-1mm} + ** A (brief) research roadmap --- 3 *** Very-large-scale reproducibility in software engineering - most results in empirical software engineering are determined on corpuses significantly smaller than Software Heritage → external validity threat; do results generalize to the full body of public code? - 2-year research plan 1) identify impactful sw. eng. 
studies that can be reproduced using Software Heritage - selected topics (tentative): code reuse, code quality, project classification, technical debt, developer productivity 2) reproduce selected studies one-by-one, at Software Heritage scale 3) document findings, e.g., via RENE (REproducibility Studies and NEgative Results) scientific initiatives - collaboration with Microsoft Research (early stages)