diff --git a/talks-public/2021-05-19-telecom-paris/2021-05-19-telecom-paris.org b/talks-public/2021-05-19-telecom-paris/2021-05-19-telecom-paris.org new file mode 100644 index 0000000..834f69d --- /dev/null +++ b/talks-public/2021-05-19-telecom-paris/2021-05-19-telecom-paris.org @@ -0,0 +1,188 @@ +#+COLUMNS: %40ITEM %10BEAMER_env(Env) %9BEAMER_envargs(Env Args) %10BEAMER_act(Act) %4BEAMER_col(Col) %10BEAMER_extra(Extra) %8BEAMER_opt(Opt) +#+TITLE: Software Heritage +#+SUBTITLE: Analyzing the Global Graph of Public Software Development +#+BEAMER_HEADER: \date[2021-05-19, ACES]{19 May 2021\\Team ACES --- Télécom Paris\\ (online)\\[-2ex]} +#+AUTHOR: Stefano Zacchiroli +#+DATE: 19 May 2021 +#+EMAIL: zack@upsilon.cc + +#+INCLUDE: "../../common/modules/prelude-toc.org" :minlevel 1 +#+INCLUDE: "../../common/modules/169.org" +#+BEAMER_HEADER: \institute[UParis \& Inria]{Université de Paris \& Inria --- {\tt zack@upsilon.cc, @zacchiro}} +#+BEAMER_HEADER: \author{Stefano Zacchiroli} + +# Syntax highlighting setup +#+LATEX_HEADER_EXTRA: \usepackage{minted} +#+LaTeX_HEADER_EXTRA: \usemintedstyle{tango} +#+LaTeX_HEADER_EXTRA: \newminted{sql}{fontsize=\scriptsize} +#+name: setup-minted +#+begin_src emacs-lisp :exports results :results silent + (setq org-latex-listings 'minted) + (setq org-latex-minted-options + '(("fontsize" "\\scriptsize") + ("linenos" ""))) + (setq org-latex-to-pdf-process + '("pdflatex -shell-escape -interaction nonstopmode -output-directory %o %f" + "pdflatex -shell-escape -interaction nonstopmode -output-directory %o %f" + "pdflatex -shell-escape -interaction nonstopmode -output-directory %o %f")) +#+end_src +# End syntax highlighting setup + +* About me :B_ignoreheading: + :PROPERTIES: + :BEAMER_env: ignoreheading + :END: + #+INCLUDE: "this/zack.org" :minlevel 2 +* Software Heritage +#+INCLUDE: "../../common/modules/swh-goals-oneslide-vertical.org::#goals" :minlevel 2 +** An international, non profit initiative + :PROPERTIES: + :CUSTOM_ID: support + :END: 
+*** Sharing the vision :B_block: + :PROPERTIES: + :CUSTOM_ID: endorsement + :BEAMER_COL: .5 + :BEAMER_env: block + :END: + #+LATEX: \begin{center}{\includegraphics[width=\extblockscale{.4\linewidth}]{unesco_logo_en_285}}\end{center} + #+LATEX: \vspace{-0.8cm} + #+LATEX: \begin{center}\vskip 1em \includegraphics[width=\extblockscale{1.4\linewidth}]{support.pdf}\end{center} + #+latex:\mbox{}~~~~~~~\tiny\url{www.softwareheritage.org/support/testimonials} +*** Donors, members, sponsors :B_block: + :PROPERTIES: + :CUSTOM_ID: sponsors + :BEAMER_COL: .5 + :BEAMER_env: block + :END: + #+LATEX: \begin{center}\includegraphics[width=\extblockscale{.4\linewidth}]{inria-logo-new}\end{center} + #+LATEX: \begin{center} + #+LATEX: \colorbox{white}{\includegraphics[width=\extblockscale{1.4\linewidth}]{sponsors.pdf}} + #+latex:\mbox{}~~~~~~~\tiny\url{www.softwareheritage.org/support/sponsors} + #+LATEX: \end{center} +** Status :B_ignoreheading: + :PROPERTIES: + :BEAMER_env: ignoreheading + :END: +#+INCLUDE: "../../common/modules/status-extended.org::#archivinggoals" :minlevel 2 +#+INCLUDE: "../../common/modules/status-extended.org::#architecture" :minlevel 2 :only-contents t +#+INCLUDE: "../../common/modules/status-extended.org::#merkletree" :minlevel 2 +#+INCLUDE: "../../common/modules/data-model.org::#merklestruct" :minlevel 2 +#+INCLUDE: "../../common/modules/status-extended.org::#dagdetailsmall" :minlevel 2 :only-contents t +#+INCLUDE: "../../common/modules/status-extended.org::#archive" :minlevel 2 +* Querying the archive +** Use cases --- product needs + e.g., for https://archive.softwareheritage.org +*** Browsing + - =ls= + - =git log= (Linux kernel: 800K+ commits) +*** Wayback machine + - tarball + - =git bundle= (Linux kernel: 7M+ nodes) +*** Provenance tracking + - commit provenance (one/all contexts) \hfill note: requires backtracking + - origin provenance (one/all contexts) +*** Note :B_ignoreheading: + :PROPERTIES: + :BEAMER_env: ignoreheading + :END: + Note: we 
therefore need both the direct Merkle DAG and its + *transposed* + +** Use cases --- research questions +*** For the sake of it + - local graph topology + - connected component size + - enabling question to identify the best approach (e.g., scale-up + v. scale-out) to conduct large-scale analyses + - any other emerging property +*** Software Engineering topics + - software provenance analysis at this scale is still largely unexplored + - industry frontier: increase granularity down to the individual line of + code + - replicate at this scale (famous) studies that have generally been + conducted on (much) smaller version control system samples to + confirm/refute their findings + - ... +** Exploitation + #+BEAMER: \LARGE \centering + How do you query the Software Heritage archive? + #+BEAMER: \Large \\ + (on a budget) + +** The Software Heritage Graph Dataset :B_ignoreheading: + :PROPERTIES: + :BEAMER_env: ignoreheading + :END: + #+INCLUDE: "../../common/modules/dataset.org::#main" :minlevel 2 :only-contents t + #+INCLUDE: "../../common/modules/dataset.org::#morequery" :minlevel 2 :only-contents t + +** Sample study --- 50 years of gender differences in code contributions + - start from the Software Heritage graph dataset + - detect gender of author names using standard tooling (=gender-guesser=) + # - caveat: how to identify /first/ name? + - analyze both authors and commits over time, bucketing by commit timestamp + #+BEAMER: \begin{center} \includegraphics[height=0.45\textheight]{this/commits-pie.pdf} \includegraphics[height=0.45\textheight]{this/ratio-female-authors.pdf} \\ \scriptsize total commits by author gender (left), ratio of active female committers over time (right)\end{center} +*** + #+BEGIN_EXPORT latex + \vspace{-1mm} + \begin{thebibliography}{} \footnotesize + \bibitem{Zacchiroli2021} Stefano Zacchiroli + \newblock Gender Differences in Public Code Contributions: a 50-year Perspective + \newblock IEEE Softw.
38(2): 45-50 (2021) + \end{thebibliography} + #+END_EXPORT + +** Discussion + - one /can/ query such a corpus SQL-style + - but relational representation shows its limits at this scale + - ...at least as deployed on commercial SQL offerings such as Athena + - note: (naive) sharding is ineffective, due to the pseudo-random + distribution of node identifiers + - experiments with Google BigQuery are ongoing + - (we broke it at the first import attempt..., due to very large arrays in + directory entry tables) + +* Graph compression + #+INCLUDE: "../../common/modules/graph-compression.org::#main" :minlevel 2 :only-contents t + +* Security synergies and outlook + #+INCLUDE: "this/security.org" :minlevel 2 + #+INCLUDE: "this/roadmap.org" :minlevel 2 + +** Wrapping up + #+latex: \vspace{-1mm} +*** + - Software Heritage archives all public source code as a huge Merkle DAG + - Querying and analyzing it at scale (20/200 B nodes/edges) is an open + problem + - Gold mine of research leads in sw. eng., big code, reproducibility, + security + #+latex: \vspace{-2mm} +*** References (selected) + #+latex: \vspace{-1mm} + #+BEGIN_EXPORT latex + \begin{thebibliography}{} + \scriptsize + + \bibitem{Abramatic2018} Jean-François Abramatic, Roberto Di Cosmo, Stefano Zacchiroli + \newblock Building the Universal Archive of Source Code + \newblock Communications of the ACM, October 2018 + + \bibitem{Pietri2019} Antoine Pietri, Diomidis Spinellis, Stefano Zacchiroli + \newblock The Software Heritage graph dataset: public software development under one roof + \newblock MSR 2019: 16th Intl. Conf. on Mining Software Repositories. IEEE + + \bibitem{Boldi2020} Paolo Boldi, Antoine Pietri, Sebastiano Vigna, Stefano Zacchiroli + \newblock Ultra-Large-Scale Repository Analysis via Graph Compression + \newblock SANER 2020, 27th Intl. Conf. on Software Analysis, Evolution and Reengineering. 
IEEE + + \end{thebibliography} + #+END_EXPORT +*** Contacts + Stefano Zacchiroli / [[https://upsilon.cc/~zack/][upsilon.cc]] / [[mailto:zack@upsilon.cc][zack@upsilon.cc]] / [[https://twitter.com/zacchiro][@zacchiro]] + +* Appendix :B_appendix: + :PROPERTIES: + :BEAMER_env: appendix + :END: diff --git a/talks-public/2021-05-19-telecom-paris/Makefile b/talks-public/2021-05-19-telecom-paris/Makefile new file mode 100644 index 0000000..68fbee7 --- /dev/null +++ b/talks-public/2021-05-19-telecom-paris/Makefile @@ -0,0 +1 @@ +include ../Makefile.slides diff --git a/talks-public/2021-05-19-telecom-paris/this/commits-pie.pdf b/talks-public/2021-05-19-telecom-paris/this/commits-pie.pdf new file mode 100644 index 0000000..8b0e856 Binary files /dev/null and b/talks-public/2021-05-19-telecom-paris/this/commits-pie.pdf differ diff --git a/talks-public/2021-05-19-telecom-paris/this/ratio-female-authors.pdf b/talks-public/2021-05-19-telecom-paris/this/ratio-female-authors.pdf new file mode 100644 index 0000000..bcf325a Binary files /dev/null and b/talks-public/2021-05-19-telecom-paris/this/ratio-female-authors.pdf differ diff --git a/talks-public/2021-05-19-telecom-paris/this/ratio-female-commits.pdf b/talks-public/2021-05-19-telecom-paris/this/ratio-female-commits.pdf new file mode 100644 index 0000000..baab5bf Binary files /dev/null and b/talks-public/2021-05-19-telecom-paris/this/ratio-female-commits.pdf differ diff --git a/talks-public/2021-05-19-telecom-paris/this/roadmap.org b/talks-public/2021-05-19-telecom-paris/this/roadmap.org new file mode 100644 index 0000000..2db2f4a --- /dev/null +++ b/talks-public/2021-05-19-telecom-paris/this/roadmap.org @@ -0,0 +1,76 @@ +** A (brief) research roadmap --- 1 +*** Graph compression + - incremental, amortized compression → ongoing UniMi collaboration + - graph query languages on top of the compressed representation → LIRIS + collaboration (early stages) + #+BEAMER: \vfill \pause +*** Complex networks + - local topology of the global 
VCS graph + - emergent properties (the "classics": scale-free, small world, etc.) + - dynamic modeling of graph evolution over time → collab. with physics @ + UParis +*** + #+BEGIN_EXPORT latex + \vspace{-2mm} + \begin{thebibliography}{} + \small + \bibitem{Pietri2019} Antoine Pietri, Guillaume Rousseau, Stefano Zacchiroli\newblock + Determining the Intrinsic Structure of Public Software Development History\newblock + MSR 2020: 17th Intl. Conf. on Mining Software Repositories. IEEE\newblock + registered study protocol + \end{thebibliography} + #+END_EXPORT +** A (brief) research roadmap --- 2 + #+BEAMER: \vspace{-2mm} +*** Very-large-scale "big code" + - /big code/ = apply ML/DL to source code and other development byproducts + - current results are language-specific and limited in scale; even the + simplest problems become challenging at this scale and level of heterogeneity + - lead: scalable language detection → collaboration with UniBo + - lead: project classification → collaboration with CELI + - the VCS graph remains largely unexplored in big code + - lead: use GNN for VCS node classification → ANR COREOGRAPHIE + #+BEAMER: \vspace{-1mm} +*** :B_ignoreheading: + :PROPERTIES: + :BEAMER_env: ignoreheading + :END: + #+BEAMER: \vspace{-1mm} \pause +*** Very-large-scale source code indexing + - common AST-based approaches for code indexing are not viable here due to + maximum heterogeneity + - alternative: treat code as text and full-text index it + - previous exp.: 3-gram based indexing in Debsources, supporting regexp + matching + - goal: find a sweet spot between the two + #+BEAMER: \vspace{-1mm} +** Recent results on (visual) programming language identification :noexport: + #+BEGIN_EXPORT latex + \begin{thebibliography}{} + \small + + \bibitem{DelBonifro2021a} Francesca Del Bonifro, Maurizio Gabbrielli, Stefano Zacchiroli + \newblock Content-Based Textual File Type Detection at Scale + \newblock ICMLC 2021: The 13th International Conference on Machine Learning and
Computing. ACM, 2021 + + \bibitem{DelBonifro2021b} Francesca Del Bonifro, Maurizio Gabbrielli, Antonio Lategano, Stefano Zacchiroli + \newblock Image-based many-language programming language identification + \newblock (under review) + + \end{thebibliography} + #+END_EXPORT +** A (brief) research roadmap --- 3 +*** Very-large-scale reproducibility in software engineering + - most results in empirical software engineering are determined on corpuses + significantly smaller than Software Heritage\\ + → external validity threat; do results generalize to the full body of + public code? + - 2-year research plan + 1) identify impactful sw. eng. studies that can be reproduced using + Software Heritage + - selected topics (tentative): code reuse, code quality, project + classification, technical debt, developer productivity + 2) reproduce selected studies one-by-one, at Software Heritage scale + 3) document findings, e.g., via RENE (REproducibility Studies and + NEgative Results) scientific initiatives + - collaboration with Microsoft Research (just started) diff --git a/talks-public/2021-05-19-telecom-paris/this/security.org b/talks-public/2021-05-19-telecom-paris/this/security.org new file mode 100644 index 0000000..3182456 --- /dev/null +++ b/talks-public/2021-05-19-telecom-paris/this/security.org @@ -0,0 +1,117 @@ +** Securing the open source supply chain + + *Software supply chain attacks* are becoming more and more common and + rising in profile. → Cf. /SolarWinds attacks/ (2021), breaching several US + govt. branches + +*** Definition --- Reproducible Builds (R-B) + The build process of a software product is *reproducible* if, after + designating a specific version of its source code and all of its build + dependencies, every build produces *bit-for-bit identical artifacts*, no + matter the environment in which the build is performed.
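The bit-for-bit criterion in the definition above can be checked mechanically by comparing cryptographic digests of the artifacts produced by independent builds. The following is a minimal sketch, not the tooling used by the reproducible-builds.org project; the function names =sha256_of= and =is_reproducible= are hypothetical, and the sketch assumes one artifact file per build.

```python
import hashlib
from pathlib import Path
from typing import Iterable


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file, read in fixed-size chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 16), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_reproducible(artifacts: Iterable[Path]) -> bool:
    """True if all given artifacts (one per independent build) have the
    same digest, i.e., are bit-for-bit identical (hypothetical helper)."""
    digests = {sha256_of(p) for p in artifacts}
    return len(digests) == 1
```

Comparing digests rather than file contents means a rebuilder only needs the hash published by the original builder, not the artifact itself.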
+ +*** + - R-B makes it possible to *increase trust in binary executables* built from trusted + (open source) code by untrusted 3rd-party software vendors (e.g., app + stores, distros) + + - The *[[https://reproducible-builds.org/][reproducible-builds.org project]]* has popularized the notion, is + backed by major open source industry players, and has made large open + source software collections reproducible (e.g., 95% of Debian packages) + +*** References :B_ignoreheading: + :PROPERTIES: + :BEAMER_env: ignoreheading + :END: + #+BEGIN_EXPORT latex + \begin{thebibliography}{} + \footnotesize + \bibitem{Lamb2021RB} Chris Lamb, Stefano Zacchiroli + \newblock Reproducible Builds: Increasing the Integrity of Software Supply Chains + \newblock IEEE Software 2021 (to appear, DOI 10.1109/MS.2021.3073045) + \end{thebibliography} + #+END_EXPORT + +** Securing the open source supply chain (cont.) +*** + - Software Heritage provides key ingredients for R-B pipelines: on-demand + archival (e.g., of VCS commits referenced by build recipes) + long-term + availability + - We have implemented this by integrating the GNU Guix package manager with + Software Heritage + +*** :B_ignoreheading: + :PROPERTIES: + :BEAMER_env: ignoreheading + :END: + #+BEAMER: \begin{center}\hfill\includegraphics[height=0.4\textheight]{swh-guix-1}\hfill\includegraphics[height=0.4\textheight]{swh-guix-2}\hfill~\end{center} + #+BEAMER: \scriptsize + - \url{https://www.softwareheritage.org/2019/04/18/software-heritage-and-gnu-guix-join-forces-to-enable-long-term-reproducibility/} + - \url{https://guix.gnu.org/blog/2019/connecting-reproducible-deployment-to-a-long-term-source-code-archive/} + +** Tracking of vulnerable source code artifacts + +*** + Software Heritage provides a unique observatory on the (best approximation + of) the entire /Software Commons/, i.e., all software published in source + code form + +*** Software provenance tracking at the scale of the world + - by following the /transposed/ Software Heritage graph
we can locate *all + known public occurrences* of source code artifacts (individual source + files, entire source trees, commits) in other commits or repositories + + - we have developed two approaches to do that: + + 1. database-based (Rousseau et al. EMSE 2020): incremental, answers a + fixed set of queries, requires significant disk space + + 2. compressed-graph-based (Boldi et al. SANER 2020): non-incremental, + flexible graph-based querying, fits in RAM + + - current applications: "intellectual property"/prior art, open source + license compliance, software composition analysis (SCA) + +** Tracking of vulnerable source code artifacts (cont.) + +*** Adding in-memory commit timestamps (experimental) + Idea: in-memory timestamp array (us precision, 8 bytes each), indexed by + revision node id. This makes it possible to efficiently exploit timestamp + information during graph visits. + +*** Finding the /earliest/ commit referencing a source file/dir + Early experiment: finding the earliest revision containing a given file + using in-memory commit timestamps, on 10 M randomly selected blobs. + + Mean lookup time: 4.1 ms (avg on 95% percentile: 2.2 ms) + +*** Tracking vulnerable source code files/trees + Given a source file/tree affected by a known vulnerability (e.g., + identified by a CVE) we can efficiently identify /all/ commits (and + repositories, extending the traversals) that reference it, triggering + further inspection. Furthermore, we can efficiently select which commits to + filter out during visits, based on commit timestamps or other attributes + that can be made to fit in memory (or memory mapped to disk). + +** Tracking of vulnerable source code artifacts (cont.) + +*** v.
State-of-the-art industry offerings + Similar to what GitHub/GitLab offer as a service, but: + + - without having to rely on repository scanning, because the "big picture" + is already present in the Software Heritage archive by design + + - independent from the development platform vendor (e.g., a "vulnerable + file" primarily hosted on GitHub can be spotted in GitLab repositories + and vice-versa) + + - complementary and synergistic with analyses of vulnerable dependency + information (which are also available in Software Heritage via metadata + mining) + +*** Caveats + + - current granularity stops at the file level and traceability breaks with + even just whitespace changes. Increasing tracking granularity to the + snippet/line of code level is possible, but untested at this scale yet + (cf. research roadmap) diff --git a/talks-public/2021-05-19-telecom-paris/this/zack.org b/talks-public/2021-05-19-telecom-paris/this/zack.org new file mode 100644 index 0000000..25b290b --- /dev/null +++ b/talks-public/2021-05-19-telecom-paris/this/zack.org @@ -0,0 +1,25 @@ + +** About me + - Associate Professor (/Maître de conférences/), Université de Paris + - on leave (/délégation/) at Inria + - Free/Open Source Software activist (20+ years) + - Debian Developer & Former 3x Debian Project Leader + - Former Open Source Initiative (OSI) director + - Software Heritage co-founder & CTO + + #+BEAMER: \vfill \pause + +*** Research path + 1) Formal methods for ensuring the quality of software upgrades (Mancoosi + project) + + Industry adoption: Debian, OPAM, Eclipse P2 + 2) Formal methods for automated upgrade planning in the cloud (Aeolus + project) + + Industry adoption: Mandriva, Kyriba + 3) Large-scale software evolution analysis (Debsources platform) + 4) Very-large-scale source code analysis and preservation (Software + Heritage) + + → this talk