diff --git a/talks-public/2020-06-29-msr-topology/2020-06-29-msr-topology.org b/talks-public/2020-06-29-msr-topology/2020-06-29-msr-topology.org index 3b391ef..39b2d1a 100644 --- a/talks-public/2020-06-29-msr-topology/2020-06-29-msr-topology.org +++ b/talks-public/2020-06-29-msr-topology/2020-06-29-msr-topology.org @@ -1,184 +1,187 @@ #+COLUMNS: %40ITEM %10BEAMER_env(Env) %9BEAMER_envargs(Env Args) %10BEAMER_act(Act) %4BEAMER_col(Col) %10BEAMER_extra(Extra) %8BEAMER_opt(Opt) #+TITLE: Determining the Intrinsic Structure of Public Software Development History #+BEAMER_HEADER: \date[MSR 2020]{MSR 2020\\online conference} #+AUTHOR: Antoine Pietri, Guillaume Rousseau, Stefano Zacchiroli #+DATE: June 2020 #+EMAIL: zack@upsilon.cc #+INCLUDE: "../../common/modules/prelude.org" :minlevel 1 #+INCLUDE: "../../common/modules/169.org" #+BEAMER_HEADER: \titlegraphic{} #+BEAMER_HEADER: \setbeamertemplate{background}{\pgfuseimage{bgd-world}} #+BEAMER_HEADER: \institute[UParis \& Inria]{Université de Paris \& Inria --- {\tt zack@upsilon.cc, @zacchiro}} #+BEAMER_HEADER: \author[Antoine Pietri, Guillaume Rousseau, Stefano Zacchiroli]{Antoine Pietri, Guillaume Rousseau, \uline{Stefano Zacchiroli}} #+BEAMER_HEADER: \title[Intrinsic Structure of Public Software Development]{Determining the Intrinsic Structure\\ of Public Software Development History} * Main matter ** Motivations - success of Free/Open Source Software (*FOSS*) + success of collaborative development platforms (e.g., *GitHub*, *GitLab*) → a wealth of *public source code artifacts* for MSR & ESE research - *Version Control Systems* (VCS) have been particularly studied (from the 90s on) - only recently exhaustive studies of all publicly available VCS artifacts - have (1) gathered attention and (2) become possible, cf.: + have (1) gathered attention and (2) become possible (e.g., Rousseau, Di + Cosmo, Zacchiroli; ESE 2020), thanks to platforms like: - Software Heritage - World of Code *** Research goal Conduct the first 
systematic, exploratory study on the *intrinsic structure* of the *global graph* of source code artifacts stored in all publicly available *version control systems*. ** Corpus #+latex: \vspace{-2mm} *** #+ATTR_LATEX: :width \extblockscale{0.9\textwidth} file:SWH-logo+motto.pdf *** :B_ignoreheading: :PROPERTIES: :BEAMER_env: ignoreheading :END: #+latex: \vspace{-1mm} *** Platform coverage #+BEAMER: \includegraphics[width=0.12\linewidth]{coverage/github} #+BEAMER: \hfill #+BEAMER: \includegraphics[width=0.13\linewidth]{coverage/gitlab} #+BEAMER: \hfill #+BEAMER: \raisebox{2mm}{\includegraphics[width=0.14\linewidth]{coverage/bitbucket}} #+BEAMER: \hfill #+BEAMER: \includegraphics[width=0.14\linewidth]{coverage/googlecode} #+BEAMER: \hfill #+BEAMER: \includegraphics[width=0.14\linewidth]{coverage/gitorious} #+BEAMER: \hfill #+BEAMER: \includegraphics[width=0.12\linewidth]{coverage/framagit} #+BEAMER: \\ #+BEAMER: \includegraphics[width=0.10\linewidth]{coverage/hal} #+BEAMER: \hfill #+BEAMER: \raisebox{2mm}{\includegraphics[width=0.12\linewidth]{coverage/debian}} #+BEAMER: \hfill #+BEAMER: \raisebox{1mm}{\includegraphics[width=0.11\linewidth]{coverage/npm}} #+BEAMER: \hfill #+BEAMER: \includegraphics[width=0.06\linewidth]{coverage/cran} #+BEAMER: \hfill #+BEAMER: \includegraphics[width=0.12\linewidth]{coverage/gnu} #+BEAMER: \hfill #+BEAMER: \includegraphics[width=0.12\linewidth]{coverage/inria} #+BEAMER: \hfill #+BEAMER: \raisebox{-1mm}{\includegraphics[width=0.11\linewidth]{coverage/pypi}} *** :B_ignoreheading: :PROPERTIES: :BEAMER_env: ignoreheading :END: #+latex: \vspace{-1mm} *** Dataset - #+BEGIN_EXPORT latex - \vspace{-2mm} - \begin{thebibliography}{} - \footnotesize - \bibitem{Pietri2019} Antoine Pietri, Diomidis Spinellis, Stefano Zacchiroli - \newblock The Software Heritage graph dataset: public software development under one roof - \newblock MSR 2019: 16th Intl. Conf. on Mining Software Repositories. 
IEEE - \end{thebibliography} - #+END_EXPORT - 100+ M projects, 8+ B files, 1.5+ B commits (specifics vary with the - dataset version) + # #+BEGIN_EXPORT latex + # \vspace{-2mm} + # \begin{thebibliography}{} + # \footnotesize + # \bibitem{Pietri2019} Antoine Pietri, Diomidis Spinellis, Stefano Zacchiroli + # \newblock The Software Heritage graph dataset: public software development under one roof + # \newblock MSR 2019: 16th Intl. Conf. on Mining Software Repositories. IEEE + # \end{thebibliography} + # #+END_EXPORT + - The Software Heritage Graph Dataset (Pietri, Spinellis, Zacchiroli; + MSR 2019) + - 100+ M projects, 8+ B files, 1.5+ B commits (specifics vary with the + dataset version) ** Data model #+BEAMER: \centering \includegraphics[width=\textwidth]{swh-data-model-h} *** A *global graph* linking together fully *deduplicated* source code artifact (files, commits, directories, releases, etc.) to the places that distribute them (e.g., Git repositories), providing a *unified view* on the entire */Software Commons/*. ** Research questions *** General idea and analysis approach The global VCS graph is a *complex network* resulting from the human activity of software development. We will study it using classic techniques from *network theory*. *** Specific research questions 1. _Topology:_ What is the *distribution of indegrees, outdegrees* and *local clustering*? Which laws do they fit? 2. _Modularity:_ What is the distribution of *connected component sizes*? 3. _"Height":_ What is the distribution of *shortest path lengths* from roots to leaves? *** Variants We will answer for both the full graph and *relevant sub-graphs*. E.g., file system layer (files+directories) v. development history layer (commits+releases). ** Execution plan 1. obtain the most recent version of the Software Heritage Graph Dataset 2. *compress the graph* using webgraph compression techniques, so that the - graph structure fits in memory. 
We will build on top of related work: - #+BEGIN_EXPORT latex - \begin{thebibliography}{} - \small - \bibitem{Boldi2020} Paolo Boldi, Antoine Pietri, Sebastiano Vigna, Stefano Zacchiroli - \newblock Ultra-Large-Scale Repository Analysis via Graph Compression - \newblock SANER 2020, 27th Intl. Conf. on Software Analysis, Evolution and Reengineering. IEEE - \end{thebibliography} - #+END_EXPORT + VCS graph structure fits in memory (Boldi et al.; SANER 2020). + # #+BEGIN_EXPORT latex + # \begin{thebibliography}{} + # \small + # \bibitem{Boldi2020} Paolo Boldi, Antoine Pietri, Sebastiano Vigna, Stefano Zacchiroli + # \newblock Ultra-Large-Scale Repository Analysis via Graph Compression + # \newblock SANER 2020, 27th Intl. Conf. on Software Analysis, Evolution and Reengineering. IEEE + # \end{thebibliography} + # #+END_EXPORT All graph accesses from now on will be on the compressed graph representation via the WebGraph Java API. 3. compute *indegrees/outdegrees and local clustering* using standard algorithms (RQ1) 4. compute *connected components* using standard algorithms (RQ2) 5. create *shortest path spanning trees* to graph leaves, then measure lengths (RQ3) ** Significance of the findings 1. how does the global VCS graph compares to other naturally-occurring networks? - e.g., webgraph, social networks, etc. 2. determine the *best technological approaches* to perform *full-scale analyses* of the entire Software Commons with limited resources, e.g.: - if the global VCS graph is *modular* it will be possible to analyze it using distributed algorithms; much less so if it's dominated by one *giant component* - if degree distributions are fat-tailed, the VCS graph will compress better 3. 
determine the statistical properties of path lengths, which are limiting factors for *indexing* and *software provenance tracking* on the entire corpus of public code * Conclusion ** Wrapping up #+BEAMER: \vspace{-2mm} *** - - we will analyze the global version control system graph as a - naturally-occurring complex network and determine its intrinsic structure + - we will analyze the global version control system graph as a naturally\\ + occurring complex network and determine its intrinsic structure - to that end we will start from the Software Heritage Graph Dataset, compress it, and apply standard analysis techniques from network theory - findings will allow to compare the global VCS graph to other large networks and help determining how to analyze all public code in future MSR/ESE studies *** Learn more #+BEGIN_EXPORT latex \vspace{-2mm} \begin{thebibliography}{} \small \bibitem{Pietri2020b} Antoine Pietri, Guillaume Rousseau, Stefano Zacchiroli \newblock Determining the Intrinsic Structure of Public Software Development History \newblock MSR 2020: 17th Intl. Conf. on Mining Software Repositories. IEEE \newblock Registered report: \url{https://osf.io/7r2w4} \end{thebibliography} #+END_EXPORT *** Contacts [[https://upsilon.cc/~zack/][Stefano Zacchiroli]] / [[mailto:zack@upsilon.cc][zack@upsilon.cc]] / [[https://twitter.com/zacchiro][@zacchiro]] / [[https://mastodon.xyz/@zacchiro][@zacchiro@mastodon.xyz]] * Appendix :B_appendix: :PROPERTIES: :BEAMER_env: appendix :END:
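
Reviewer note: the execution plan in these slides (degree distributions for RQ1, connected components for RQ2, root-to-leaf shortest path lengths for RQ3) can be illustrated on a toy graph. The following is a minimal sketch only. The slides state that the real analysis runs on the compressed graph via the WebGraph Java API. This example instead uses plain Python on a hypothetical eight-node graph, and every node name and function name below is an assumption made for illustration. Local clustering is left out for brevity.

```python
from collections import Counter, deque

# Hypothetical toy graph (node -> successors); not taken from the real dataset.
TOY_GRAPH = {
    "rev2": ["rev1", "dir2"],
    "rev1": ["dir1"],
    "dir2": ["file_a", "file_b"],
    "dir1": ["file_a"],
    "file_a": [],
    "file_b": [],
    "orphan_dir": ["orphan_file"],
    "orphan_file": [],
}


def in_degrees(graph):
    """Return a dict mapping each node to its number of incoming edges."""
    indeg = {node: 0 for node in graph}
    for successors in graph.values():
        for succ in successors:
            indeg[succ] += 1
    return indeg


def degree_distributions(graph):
    """RQ1 sketch: histograms of indegree and outdegree values."""
    indeg_hist = Counter(in_degrees(graph).values())
    outdeg_hist = Counter(len(succs) for succs in graph.values())
    return indeg_hist, outdeg_hist


def component_sizes(graph):
    """RQ2 sketch: sizes of weakly connected components, largest first."""
    undirected = {node: set() for node in graph}
    for node, successors in graph.items():
        for succ in successors:
            undirected[node].add(succ)
            undirected[succ].add(node)
    seen, sizes = set(), []
    for start in graph:
        if start in seen:
            continue
        seen.add(start)
        queue, size = deque([start]), 0
        while queue:
            node = queue.popleft()
            size += 1
            for neighbor in undirected[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        sizes.append(size)
    return sorted(sizes, reverse=True)


def root_to_leaf_lengths(graph):
    """RQ3 sketch: BFS from each root (indegree 0), recording the
    shortest distance to every reachable leaf (outdegree 0)."""
    indeg = in_degrees(graph)
    lengths = []
    for root in (n for n, d in indeg.items() if d == 0):
        dist = {root: 0}
        queue = deque([root])
        while queue:
            node = queue.popleft()
            for succ in graph[node]:
                if succ not in dist:
                    dist[succ] = dist[node] + 1
                    queue.append(succ)
        lengths.extend(d for n, d in dist.items() if not graph[n])
    return sorted(lengths)


if __name__ == "__main__":
    print(degree_distributions(TOY_GRAPH))
    print(component_sizes(TOY_GRAPH))
    print(root_to_leaf_lengths(TOY_GRAPH))
```

On a graph with billions of nodes, these in-memory dictionaries would not be practical, which is why the plan compresses the graph first. The sketch only shows what each research question measures.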