diff --git a/talks-public/2020-06-29-msr-topology/2020-06-29-msr-topology.org b/talks-public/2020-06-29-msr-topology/2020-06-29-msr-topology.org new file mode 100644 index 0000000..4b02222 --- /dev/null +++ b/talks-public/2020-06-29-msr-topology/2020-06-29-msr-topology.org @@ -0,0 +1,182 @@ +#+COLUMNS: %40ITEM %10BEAMER_env(Env) %9BEAMER_envargs(Env Args) %10BEAMER_act(Act) %4BEAMER_col(Col) %10BEAMER_extra(Extra) %8BEAMER_opt(Opt) +#+TITLE: Determining the Intrinsic Structure of Public Software Development History +#+BEAMER_HEADER: \date[MSR 2020]{MSR 2020\\online conference} +#+AUTHOR: Antoine Pietri, Guillaume Rousseau, Stefano Zacchiroli +#+DATE: June 2020 +#+EMAIL: zack@upsilon.cc + +#+INCLUDE: "../../common/modules/prelude.org" :minlevel 1 +#+INCLUDE: "../../common/modules/169.org" +#+BEAMER_HEADER: \institute[UParis \& Inria]{Université de Paris \& Inria --- {\tt zack@upsilon.cc, @zacchiro}} +#+BEAMER_HEADER: \author[Antoine Pietri, Guillaume Rousseau, Stefano Zacchiroli]{Antoine Pietri, Guillaume Rousseau, \uline{Stefano Zacchiroli}} +#+BEAMER_HEADER: \title[Intrinsic Structure of Public Software Development]{Determining the Intrinsic Structure\\ of Public Software Development History} + +* Main matter +** Motivations + - success of Free/Open Source Software (*FOSS*) + success of collaborative + development platforms (e.g., *GitHub*, *GitLab*) + + → a wealth of *public source code artifacts* for MSR & ESE research + + - *Version Control Systems* (VCS) have been particularly studied (from the + 90s on) + + - only recently exhaustive studies of all publicly available VCS artifacts + have (1) gathered attention and (2) become possible, cf.: + - Software Heritage + - World of Code + +*** Research goal + Conduct the first systematic, exploratory study on the *intrinsic + structure* of *source code artifacts* stored in *[all] publicly available + version control systems*. 
+ +** Corpus + #+latex: \vspace{-2mm} +*** + #+ATTR_LATEX: :width \extblockscale{0.9\textwidth} + file:SWH-logo+motto.pdf + +*** :B_ignoreheading: + :PROPERTIES: + :BEAMER_env: ignoreheading + :END: + #+latex: \vspace{-1mm} + +*** Platform coverage + #+BEAMER: \includegraphics[width=0.12\linewidth]{coverage/github} + #+BEAMER: \hfill + #+BEAMER: \includegraphics[width=0.13\linewidth]{coverage/gitlab} + #+BEAMER: \hfill + #+BEAMER: \raisebox{2mm}{\includegraphics[width=0.14\linewidth]{coverage/bitbucket}} + #+BEAMER: \hfill + #+BEAMER: \includegraphics[width=0.14\linewidth]{coverage/googlecode} + #+BEAMER: \hfill + #+BEAMER: \includegraphics[width=0.14\linewidth]{coverage/gitorious} + #+BEAMER: \hfill + #+BEAMER: \includegraphics[width=0.12\linewidth]{coverage/framagit} + #+BEAMER: \\ + #+BEAMER: \includegraphics[width=0.10\linewidth]{coverage/hal} + #+BEAMER: \hfill + #+BEAMER: \raisebox{2mm}{\includegraphics[width=0.12\linewidth]{coverage/debian}} + #+BEAMER: \hfill + #+BEAMER: \raisebox{1mm}{\includegraphics[width=0.11\linewidth]{coverage/npm}} + #+BEAMER: \hfill + #+BEAMER: \includegraphics[width=0.06\linewidth]{coverage/cran} + #+BEAMER: \hfill + #+BEAMER: \includegraphics[width=0.12\linewidth]{coverage/gnu} + #+BEAMER: \hfill + #+BEAMER: \includegraphics[width=0.12\linewidth]{coverage/inria} + #+BEAMER: \hfill + #+BEAMER: \raisebox{-1mm}{\includegraphics[width=0.11\linewidth]{coverage/pypi}} + +*** :B_ignoreheading: + :PROPERTIES: + :BEAMER_env: ignoreheading + :END: + #+latex: \vspace{-1mm} + +*** Dataset + #+BEGIN_EXPORT latex + \vspace{-2mm} + \begin{thebibliography}{} + \footnotesize + \bibitem{Pietri2019} Antoine Pietri, Diomidis Spinellis, Stefano Zacchiroli + \newblock The Software Heritage graph dataset: public software development under one roof + \newblock MSR 2019: 16th Intl. Conf. on Mining Software Repositories. 
IEEE + \end{thebibliography} + #+END_EXPORT + 100+ M projects, 8+ B files, 1.5+ B commits (specifics vary with the + dataset version) + +** Data model + #+BEAMER: \centering \includegraphics[width=\textwidth]{swh-data-model-h} +*** + A *global graph* linking together fully *deduplicated* source code artifacts + (files, commits, directories, releases, etc.) to the places that distribute + them (e.g., Git repositories), providing a *unified view* of the entire + */Software Commons/*. + +** Research questions +*** General idea and analysis approach + The global VCS graph is a *complex network* resulting from the human + activity of software development. We will study it using classic techniques + from *network theory*. + +*** Specific research questions + 1. _Topology:_ What is the *distribution of indegrees, outdegrees* and + *local clustering*? Which laws do they fit? + 2. _Modularity:_ What is the distribution of *connected component sizes*? + 3. _"Height":_ What is the distribution of *shortest path lengths* from + roots to leaves? + +*** Variants + We will answer these questions for both the full graph and *relevant sub-graphs*. + + E.g., file system layer (files+directories) vs. development history layer + (commits+releases). + +** Execution plan + 1. obtain the most recent version of the Software Heritage Graph Dataset + 2. *compress the graph* using webgraph compression techniques, so that the + graph structure fits in memory. We will build on top of related work: + #+BEGIN_EXPORT latex + \begin{thebibliography}{} + \small + \bibitem{Boldi2020} Paolo Boldi, Antoine Pietri, Sebastiano Vigna, Stefano Zacchiroli + \newblock Ultra-Large-Scale Repository Analysis via Graph Compression + \newblock SANER 2020, 27th Intl. Conf. on Software Analysis, Evolution and Reengineering. IEEE + \end{thebibliography} + #+END_EXPORT + All graph accesses from now on will be on the compressed graph + representation via the WebGraph Java API. + 3. 
compute *indegrees/outdegrees and local clustering* using standard + algorithms (RQ1) + 4. compute *connected components* using standard algorithms (RQ2) + 5. create *shortest path spanning trees* to graph leaves, then measure + lengths (RQ3) + +** Significance of the findings + 1. how does the global VCS graph compare to other naturally-occurring + networks? + - e.g., webgraph, social networks, etc. + 2. determine the *best technological approaches* to perform *full-scale + analyses* of the entire Software Commons with limited resources, e.g.: + - if the global VCS graph is *modular* it will be possible to analyze it + using distributed algorithms; much less so if it's dominated by one + *giant component* + - if degree distributions are fat-tailed, the VCS graph will compress + better + 3. determine the statistical properties of path lengths, which are limiting + factors for *indexing* and *software provenance tracking* on the entire + corpus of public code + +* Conclusion +** Wrapping up + #+BEAMER: \vspace{-2mm} +*** + - we will analyze the global version control system graph as a + naturally-occurring complex network and determine its intrinsic structure + - to that end we will start from the Software Heritage Graph Dataset, + compress it, and apply standard analysis techniques from network theory + - findings will allow us to compare the global VCS graph to other large + networks and help determine how to analyze all public code in future + MSR/ESE studies +*** Learn more + #+BEGIN_EXPORT latex + \vspace{-2mm} + \begin{thebibliography}{} + \small + \bibitem{Pietri2020b} Antoine Pietri, Guillaume Rousseau, Stefano Zacchiroli + \newblock Determining the Intrinsic Structure of Public Software Development History + \newblock MSR 2020: 17th Intl. Conf. on Mining Software Repositories. 
IEEE + \newblock Registered report: \url{https://osf.io/7r2w4} + \end{thebibliography} + #+END_EXPORT +*** Contacts + [[https://upsilon.cc/~zack/][Stefano Zacchiroli]] / [[mailto:zack@upsilon.cc][zack@upsilon.cc]] / [[https://twitter.com/zacchiro][@zacchiro]] / [[https://mastodon.xyz/@zacchiro][@zacchiro@mastodon.xyz]] + +* Appendix :B_appendix: + :PROPERTIES: + :BEAMER_env: appendix + :END: diff --git a/talks-public/2020-06-29-msr-topology/Makefile b/talks-public/2020-06-29-msr-topology/Makefile new file mode 100644 index 0000000..68fbee7 --- /dev/null +++ b/talks-public/2020-06-29-msr-topology/Makefile @@ -0,0 +1 @@ +include ../Makefile.slides