diff --git a/talks-public/2022-09-28-ese-research/2022-09-28-ese-research.org b/common/modules/ese-research.org
similarity index 75%
copy from talks-public/2022-09-28-ese-research/2022-09-28-ese-research.org
copy to common/modules/ese-research.org
index 4c39892..88389d2 100644
--- a/talks-public/2022-09-28-ese-research/2022-09-28-ese-research.org
+++ b/common/modules/ese-research.org
@@ -1,115 +1,100 @@
 #+COLUMNS: %40ITEM %10BEAMER_env(Env) %9BEAMER_envargs(Env Args) %10BEAMER_act(Act) %4BEAMER_col(Col) %10BEAMER_extra(Extra) %8BEAMER_opt(Opt)
-#+TITLE: Empirical Software Engineering Research with Software Heritage
-#+BEAMER_HEADER: \date[2022-09-28]{28 September 2022}
-#+BEAMER_HEADER: \title[Empirical Software Eng. Research with Software Heritage]{Empirical Software Engineering Research with Software Heritage}
-#+AUTHOR: Stefano Zacchiroli
-#+DATE: 28 September 2022
-#+EMAIL: stefano.zacchiroli@telecom-paris.fr
+#+INCLUDE: "prelude.org" :minlevel 1
 
-#+INCLUDE: "../../common/modules/prelude-toc.org" :minlevel 1
-#+INCLUDE: "../../common/modules/169.org"
-#+BEAMER_HEADER: \institute[Télécom Paris]{Télécom Paris, Polytechnic Institute of Paris\\ {\tt stefano.zacchiroli@telecom-paris.fr}}
-#+BEAMER_HEADER: \author{Stefano Zacchiroli}
-
-* Datasets
-** Graph dataset
-#+INCLUDE: "../../common/modules/dataset.org::#graphdataset" :only-contents t
-** Graph dataset --- example
-#+INCLUDE: "../../common/modules/dataset.org::#graphquery1" :only-contents t
-** License dataset
-#+INCLUDE: "../../common/modules/dataset.org::#licensedataset" :only-contents t
-* Accessing source code artifacts
-** The Software Heritage Filesystem (SwhFS)
-#+INCLUDE: "../../common/modules/swh-fuse.org::#oneslide" :only-contents t
-** The Software Heritage Filesystem (SwhFS) --- example
-#+INCLUDE: "../../common/modules/swh-fuse.org::#examplemini" :only-contents t
-** Graph compression
-#+INCLUDE: "../../common/modules/graph-compression.org::#oneslide" :only-contents t
 * Software provenance and evolution
+  :PROPERTIES:
+  :CUSTOM_ID: provenance
+  :END:
 ** Software provenance and evolution
 #+BEAMER: \begin{center} \includegraphics[width=0.7\textwidth]{commit-time-distro} \end{center} \vspace{-2mm}
 *** Key findings
   - The amount of original commits in public code doubles every ~30 months
     and has been doing so for 20+ years; original source code files double
     every ~22 months
   - It is possible to trace the provenance of source code artifacts at this
     scale in a compact relational model via the notion of isochrone graphs.
 
   #+BEAMER: \vspace{-2mm}
 ***
   #+BEGIN_EXPORT latex
   \vspace{-2mm}
   \footnotesize
   \begin{thebibliography}{Foo Bar, 1969}
   \bibitem{Rousseau2020} Rousseau, Di Cosmo, Zacchiroli\newblock
   Software Provenance Tracking at the Scale of Public Source Code\newblock
   In Empirical Software Engineering, 2020
   \end{thebibliography}
   #+END_EXPORT
 * Software forks
+  :PROPERTIES:
+  :CUSTOM_ID: forks
+  :END:
 ** Software forks
 *** Idea
 - Forks can be detected via either platform metadata (e.g., GitHub keeping
   track of who clicked "fork" on what repo; the most common approach), or via
   shared version control system history.
 - Thanks to deduplication and platform agnosticity, Software Heritage provide a
   privileged observation point on the global fork ecosystem in public code.
 *** Research questions
 - What is the right definition of "being a fork"? (methodology)
 - How many forks could we miss by looking only at platform metadata?
 - How many "cross-platform" forks (e.g., GitHub → GitLab) exist in the wild?
 ** Software forks (cont.)
 *** Findings
 - Forks classification: based on platform metadata (“type 1” forks), sharing at
   least one commit (“type 2”), sharing a common root directory at some point in
   VCS history (“type 3”).
 - Up to 16% forks could be overlooked by considering only GitHub type 1 forks
   (a potentially significant threat to validity!).
 - Relevant independent development activity can happen on GitLab.com for
   projects initially just mirrored from GitHub.
 ***
   #+BEGIN_EXPORT latex
   \vspace{-3mm} \footnotesize
   \begin{thebibliography}{Foo Bar, 1969}
   \bibitem{Pietri2020} Pietri, Rousseau, Zacchiroli.\newblock
   Forking Without Clicking: on How to Identify Software Repository Forks.\newblock
   MSR 2020
   \bibitem{Bhattacharjee2020} Bhattacharjee et al.\newblock
   An exploratory study to find motives behind cross-platform forks from Software Heritage dataset.\newblock
   MSR 2020
   \end{thebibliography}
   #+END_EXPORT
 * Diversity, equity, and inclusion
+  :PROPERTIES:
+  :CUSTOM_ID: diversity
+  :END:
 ** Diversity, equity, and inclusion
 *** Idea
   Archived commit metadata contains public information that can be mined to
   study long-term trends of diversity, equity, and inclusion (DEI) traits of
   the global population of public code contributors.
 
 *** Key findings on the gender gap
   - Male authors contributed 92% of public code commits up to 2019.
   - The ratio of female authors (and their contributions) has grown stably
     for 15 years reaching for the first time 10% of yearly contributions
     in 2019.
   - The COVID-19 pandemic has reversed the trend.
 
 ** Diversity, equity, and inclusion (cont.)
 
 *** Key findings on the geographic gap
   - The early decades of public code were dominated by contributions from
     North America, followed by a period of alternating dominance between
     North America and Europe.
   - Since then geographic diversity has increased constantly, with raising
     importance of contributions from Central and South America.
   - The trend of increased female contributions is almost worlwide, with the
     notable exception of specific regions of Asia were it is either slower or
     flat.
 
 *** References
   #+BEAMER: \footnotesize
   - Zacchiroli. /Gender differences in public code contributions: a 50-year
     perspective/. IEEE Software, 2021
   - Rossi and Zacchiroli. /Worldwide gender differences in public code
     contributions/. ICSE SEIS, 2022
   - Rossi and Zacchiroli. /Geographic diversity in public code
     contributions/. MSR 2022
diff --git a/talks-public/2022-09-28-ese-research/2022-09-28-ese-research.org b/talks-public/2022-09-28-ese-research/2022-09-28-ese-research.org
index 4c39892..f0669d5 100644
--- a/talks-public/2022-09-28-ese-research/2022-09-28-ese-research.org
+++ b/talks-public/2022-09-28-ese-research/2022-09-28-ese-research.org
@@ -1,115 +1,33 @@
 #+COLUMNS: %40ITEM %10BEAMER_env(Env) %9BEAMER_envargs(Env Args) %10BEAMER_act(Act) %4BEAMER_col(Col) %10BEAMER_extra(Extra) %8BEAMER_opt(Opt)
 #+TITLE: Empirical Software Engineering Research with Software Heritage
 #+BEAMER_HEADER: \date[2022-09-28]{28 September 2022}
 #+BEAMER_HEADER: \title[Empirical Software Eng. Research with Software Heritage]{Empirical Software Engineering Research with Software Heritage}
 #+AUTHOR: Stefano Zacchiroli
 #+DATE: 28 September 2022
 #+EMAIL: stefano.zacchiroli@telecom-paris.fr
 
 #+INCLUDE: "../../common/modules/prelude-toc.org" :minlevel 1
 #+INCLUDE: "../../common/modules/169.org"
 #+BEAMER_HEADER: \institute[Télécom Paris]{Télécom Paris, Polytechnic Institute of Paris\\ {\tt stefano.zacchiroli@telecom-paris.fr}}
 #+BEAMER_HEADER: \author{Stefano Zacchiroli}
 
 * Datasets
 ** Graph dataset
 #+INCLUDE: "../../common/modules/dataset.org::#graphdataset" :only-contents t
 ** Graph dataset --- example
 #+INCLUDE: "../../common/modules/dataset.org::#graphquery1" :only-contents t
 ** License dataset
 #+INCLUDE: "../../common/modules/dataset.org::#licensedataset" :only-contents t
 * Accessing source code artifacts
 ** The Software Heritage Filesystem (SwhFS)
 #+INCLUDE: "../../common/modules/swh-fuse.org::#oneslide" :only-contents t
 ** The Software Heritage Filesystem (SwhFS) --- example
 #+INCLUDE: "../../common/modules/swh-fuse.org::#examplemini" :only-contents t
 ** Graph compression
 #+INCLUDE: "../../common/modules/graph-compression.org::#oneslide" :only-contents t
 * Software provenance and evolution
-** Software provenance and evolution
-#+BEAMER: \begin{center} \includegraphics[width=0.7\textwidth]{commit-time-distro} \end{center} \vspace{-2mm}
-*** Key findings
-  - The amount of original commits in public code doubles every ~30 months
-    and has been doing so for 20+ years; original source code files double
-    every ~22 months
-  - It is possible to trace the provenance of source code artifacts at this
-    scale in a compact relational model via the notion of isochrone graphs.
-
-  #+BEAMER: \vspace{-2mm}
-***
-  #+BEGIN_EXPORT latex
-  \vspace{-2mm}
-  \footnotesize
-  \begin{thebibliography}{Foo Bar, 1969}
-  \bibitem{Rousseau2020} Rousseau, Di Cosmo, Zacchiroli\newblock
-  Software Provenance Tracking at the Scale of Public Source Code\newblock
-  In Empirical Software Engineering, 2020
-  \end{thebibliography}
-  #+END_EXPORT
+#+INCLUDE: "../../common/modules/ese-research.org::#provenance" :only-contents t
 * Software forks
-** Software forks
-*** Idea
-- Forks can be detected via either platform metadata (e.g., GitHub keeping
-  track of who clicked "fork" on what repo; the most common approach), or via
-  shared version control system history.
-- Thanks to deduplication and platform agnosticity, Software Heritage provide a
-  privileged observation point on the global fork ecosystem in public code.
-*** Research questions
-- What is the right definition of "being a fork"? (methodology)
-- How many forks could we miss by looking only at platform metadata?
-- How many "cross-platform" forks (e.g., GitHub → GitLab) exist in the wild?
-** Software forks (cont.)
-*** Findings
-- Forks classification: based on platform metadata (“type 1” forks), sharing at
-  least one commit (“type 2”), sharing a common root directory at some point in
-  VCS history (“type 3”).
-- Up to 16% forks could be overlooked by considering only GitHub type 1 forks
-  (a potentially significant threat to validity!).
-- Relevant independent development activity can happen on GitLab.com for
-  projects initially just mirrored from GitHub.
-***
-  #+BEGIN_EXPORT latex
-  \vspace{-3mm} \footnotesize
-  \begin{thebibliography}{Foo Bar, 1969}
-  \bibitem{Pietri2020} Pietri, Rousseau, Zacchiroli.\newblock
-  Forking Without Clicking: on How to Identify Software Repository Forks.\newblock
-  MSR 2020
-  \bibitem{Bhattacharjee2020} Bhattacharjee et al.\newblock
-  An exploratory study to find motives behind cross-platform forks from Software Heritage dataset.\newblock
-  MSR 2020
-  \end{thebibliography}
-  #+END_EXPORT
+#+INCLUDE: "../../common/modules/ese-research.org::#forks" :only-contents t
 * Diversity, equity, and inclusion
-** Diversity, equity, and inclusion
-*** Idea
-  Archived commit metadata contains public information that can be mined to
-  study long-term trends of diversity, equity, and inclusion (DEI) traits of
-  the global population of public code contributors.
-
-*** Key findings on the gender gap
-  - Male authors contributed 92% of public code commits up to 2019.
-  - The ratio of female authors (and their contributions) has grown stably
-    for 15 years reaching for the first time 10% of yearly contributions
-    in 2019.
-  - The COVID-19 pandemic has reversed the trend.
-
-** Diversity, equity, and inclusion (cont.)
-
-*** Key findings on the geographic gap
-  - The early decades of public code were dominated by contributions from
-    North America, followed by a period of alternating dominance between
-    North America and Europe.
-  - Since then geographic diversity has increased constantly, with raising
-    importance of contributions from Central and South America.
-  - The trend of increased female contributions is almost worlwide, with the
-    notable exception of specific regions of Asia were it is either slower or
-    flat.
-
-*** References
-  #+BEAMER: \footnotesize
-  - Zacchiroli. /Gender differences in public code contributions: a 50-year
-    perspective/. IEEE Software, 2021
-  - Rossi and Zacchiroli. /Worldwide gender differences in public code
-    contributions/. ICSE SEIS, 2022
-  - Rossi and Zacchiroli. /Geographic diversity in public code
-    contributions/. MSR 2022
+#+INCLUDE: "../../common/modules/ese-research.org::#diversity" :only-contents t