diff --git a/common/images/commit-time-distro.png b/common/images/commit-time-distro.png new file mode 100644 index 0000000..3f7895a Binary files /dev/null and b/common/images/commit-time-distro.png differ diff --git a/common/modules/graph-compression.org b/common/modules/graph-compression.org index e42d9ae..4a3a687 100644 --- a/common/modules/graph-compression.org +++ b/common/modules/graph-compression.org @@ -1,345 +1,352 @@ #+COLUMNS: %40ITEM %10BEAMER_env(Env) %9BEAMER_envargs(Env Args) %10BEAMER_act(Act) %4BEAMER_col(Col) %10BEAMER_extra(Extra) %8BEAMER_opt(Opt) #+INCLUDE: "prelude.org" :minlevel 1 # Depends: \usepackage{pdfpages} * Graph Compression :PROPERTIES: :CUSTOM_ID: main :END: ** Graph compression on the Software Heritage archive :noexport: :PROPERTIES: :CUSTOM_ID: oneslide :END: *** #+BEGIN_EXPORT latex \vspace{-3mm} \footnotesize \begin{thebibliography}{Foo Bar, 1969} \bibitem{Boldi2020} Paolo Boldi, Antoine Pietri, Sebastiano Vigna, Stefano Zacchiroli \newblock Ultra-Large-Scale Repository Analysis via Graph Compression \newblock SANER 2020, 27th Intl. Conf. on Software Analysis, Evolution and Reengineering. IEEE \end{thebibliography} #+END_EXPORT #+BEAMER: \vspace{-1mm} *** Research question Is it possible to efficiently perform software development history analyses at the scale of Software Heritage archive on a single, relatively cheap machine? #+BEAMER: \vspace{-1mm} *** Idea Apply state-of-the-art graph compression techniques from the field of Web graph / social network analysis. #+BEAMER: \vspace{-1mm} *** Results The entire archive graph (25 B nodes, 350 B edges) can be loaded in 200 GiB - and then traversed at the cost of tens of nanoseconds per edge (= a few - hours for a complete single-thread traversal of the archive). + and then traversed at the cost of tens of ns per edge (= a few hours for a + full single-thread visit). +*** + :PROPERTIES: + :BEAMER_env: ignoreheading + :END: + Java and gRPC APIs available: + #+BEAMER: \small + [[https://docs.softwareheritage.org/devel/swh-graph/grpc-api.html][docs.softwareheritage.org/devel/swh-graph/grpc-api.html]] ** Graph compression on the Software Heritage archive :PROPERTIES: :CUSTOM_ID: intro :END: *** #+BEGIN_EXPORT latex \vspace{-3mm} \begin{thebibliography}{Foo Bar, 1969} \bibitem{Boldi2020} Paolo Boldi, Antoine Pietri, Sebastiano Vigna, Stefano Zacchiroli \newblock Ultra-Large-Scale Repository Analysis via Graph Compression \newblock SANER 2020, 27th Intl. Conf. on Software Analysis, Evolution and Reengineering. IEEE \end{thebibliography} #+END_EXPORT *** Research question Is it possible to efficiently perform software development history analyses at ultra large scale (= the scale of Software Heritage archive or more), on a single, relatively cheap machine? *** Idea Apply state-of-the-art graph compression techniques from the field of Web graph / social network analysis. ** Background --- (Web) graph compression :PROPERTIES: :CUSTOM_ID: background1 :END: *** The graph of the Web :PROPERTIES: :BEAMER_env: definition :END: Directed graph that has Web pages as nodes and hyperlinks between them as edges. *** Properties (1) - **Locality:** pages link to pages whose URLs are lexicographically similar. URLs share long common prefixes. → use *D-gap compression* *** Adjacency lists :PROPERTIES: :BEAMER_env: block :BEAMER_col: 0.51 :END: #+BEAMER: \scriptsize | *Node* | *Outdegree* | *Successors* | |--------+-------------+--------------------------------------| | ... | ... | ... | | 15 | 11 | 13,15,16,17,18,19,23,24,203,315,1034 | | 16 | 10 | 15,16,17,22,23,24,315,316,317,3041 | | 17 | 0 | | | 18 | 5 | 13,15,16,17,50 | | ... | ... | ... | *** D-gapped adjacency lists :PROPERTIES: :BEAMER_env: block :BEAMER_col: 0.48 :END: #+BEAMER: \scriptsize | *Node* | *Outdegree* | *Successors* | |--------+-------------+-----------------------------| | ... | ... | ... | | 15 | 11 | 3,1,0,0,0,0,3,0,178,111,718 | | 16 | 10 | 1,0,0,4,0,0,290,0,0,2723 | | 17 | 0 | | | 18 | 5 | 9,1,0,0,32 | | ... | ... | ... | ** Background --- (Web) graph compression (cont.) :PROPERTIES: :CUSTOM_ID: background2 :END: *** The graph of the Web :PROPERTIES: :BEAMER_env: definition :END: Directed graph that has Web pages as nodes and hyperlinks between them as edges. *** Properties (2) - **Similarity:** pages that are close together in lexicographic order tend to have many common successors. → use *reference compression* *** Adjacency lists :PROPERTIES: :BEAMER_env: block :BEAMER_col: 0.47 :END: #+BEAMER: \scriptsize | *Node* | *Outd.* | *Successors* | |--------+---------+--------------------------------------| | ... | ... | ... | | 15 | 11 | 13,15,16,17,18,19,23,24,203,315,1034 | | 16 | 10 | 15,16,17,22,23,24,315,316,317,3041 | | 17 | 0 | | | 18 | 5 | 13,15,16,17,50 | | ... | ... | ... | *** Copy lists :PROPERTIES: :BEAMER_env: block :BEAMER_col: 0.60 :END: #+BEAMER: \scriptsize | *Node* | *Ref.* | *Copy list* | *Extra nodes* | |--------+--------+-------------+--------------------------------------| | ... | ... | ... | ... | | 15 | 0 | | 13,15,16,17,18,19,23,24,203,315,1034 | | 16 | 1 | 01110011010 | 22,316,317,3041 | | 17 | | | | | 18 | 3 | 11110000000 | 50 | | ... | ... | ... | | ** Background --- Web graph compression (OLD) :noexport: Borrowing (great!) slides from: #+BEGIN_EXPORT latex \begin{thebibliography}{} \bibitem{Pibiri2018} Giulio Ermanno Pibiri \newblock Effective Web Graph Representations, 2018 \newblock \url{http://pages.di.unipi.it/pibiri/slides/webgraphs\_compression.pdf} \end{thebibliography} #+END_EXPORT ** Background -- Web graph compression (imported slides) (OLD) :noexport: :PROPERTIES: :BEAMER_env: ignoreheading :END: #+BEGIN_EXPORT latex { \setbeamercolor{background canvas}{bg=} \setbeamertemplate{background}{} \includepdf[pages={4,11,12,13}]{webgraphs_compression.pdf} \addtocounter{framenumber}{4} } #+END_EXPORT ** Corpus :PROPERTIES: :CUSTOM_ID: corpus :END: *** Nodes :PROPERTIES: :BEAMER_env: block :BEAMER_col: 0.5 :END: #+BEAMER: \small | *Node type* | *N. of nodes* | |-------------+---------------| | origins | 88 M | | snapshots | 57 M | | releases | 9.9 M | | revisions | 1.1 B | | directories | 4.9 B | | contents | 5.5 B | |-------------+---------------| | Total nodes | 12 B | *** Edges :PROPERTIES: :BEAMER_env: block :BEAMER_col: 0.5 :END: #+BEAMER: \footnotesize | *Edge type* | *N. of edges* | |-----------------------+---------------| | origin → snapshot | 195 M | | snapshot → revision | 616 M | | snapshot → release | 215 M | | release → revision | 9.9 M | | revision → revision | 1.2 B | | revision → directory | 1.1 B | | directory → directory | 48 B | | directory → revisiony | 482 M | | directory → content | 112 B | |-----------------------+---------------| | Total edges | 165 B | *** :PROPERTIES: :BEAMER_env: ignoreheading :END: #+BEAMER: \vspace{1mm} Stats for archive snapshot 2018-09-25, from the Software Heritage graph dataset. Growth rate: exponential, doubling every 22-30 months, cf.: #+BEGIN_EXPORT latex \begin{thebibliography}{Foo Bar, 1969} \footnotesize \bibitem{Rousseau2020} Roberto Di Cosmo, Guillaume Rousseau, Stefano Zacchiroli \newblock Software Provenance Tracking at the Scale of Public Source Code \newblock Empirical Software Engineering 25(4): 2930-2959 (2020) \end{thebibliography} #+END_EXPORT ** Graph compression pipeline :PROPERTIES: :CUSTOM_ID: pipeline :END: #+BEAMER: \hspace*{-0.1\linewidth} \includegraphics[width=1.2\linewidth]{compression/compression-steps} #+BEAMER: \vspace{-1cm} *** - *MPH*: minimal perfect hash, mapping Merkle IDs to 0..N-1 integers - *BV* compress: Boldi-Vigna compression (based on MPH order) - *BFS*: breadth-first visit to renumber - *Permute*: update BV compression according to BFS order *** (Re)establishing locality - key for good compression is a node ordering that ensures locality and similarity - which is very much /not/ the case with Merkle IDs, ...but is the case /again/ after BFS reordering ** Compression experiment :PROPERTIES: :CUSTOM_ID: compexp :END: | *Step* | *Wall time* (hours) | |-------------+---------------------| | MPH | 2 | | BV Compress | 84 | | BFS | 19 | | Permute | 18 | | Transpose | 15 | |-------------+---------------------| | Total | 138 (6 days) | - server equipped with 24 CPUs and 750 GB of RAM - RAM mostly used as I/O cache for the BFS step - /minimum/ memory requirements are close to the RAM needed to load the final compressed graph in memory ** Compression efficiency (space) :PROPERTIES: :CUSTOM_ID: spaceefficiency :END: *** :PROPERTIES: :BEAMER_env: block :BEAMER_col: 0.4 :END: | *Forward graph* | | | total size | 91 GiB | | bits per edge | 4.91 | | compression ratio | 15.8% | *** :PROPERTIES: :BEAMER_env: block :BEAMER_col: 0.4 :END: | *Backward graph* | | | total size | 83 GiB | | bits per edge | 4.49 | | compression ratio | 14.4% | *** Operating cost The structure of a full bidirectional archive graph fits in less than 200 GiB of RAM, for a hardware cost of ~300 USD. ** Compression efficiency (time) :PROPERTIES: :CUSTOM_ID: timeefficiency :END: *** Benchmark --- Full BFS visit (single thread) #+BEAMER: \begin{columns}\begin{column}{0.45\textwidth} | *Forward graph* | | |------------------+----------------| | wall time | 1h48m | | throughput | 1.81 M nodes/s | | | (553 ns/node) | #+BEAMER: \end{column}\begin{column}{0.45\textwidth} | *Backward graph* | | |------------------+----------------| | wall time | 3h17m | | throughput | 988 M nodes/s | | | (1.01 µs/node) | #+BEAMER: \end{column}\end{columns} *** Benchmark --- Edge lookup random sample: 1 B nodes (8.3% of entire graph); then enumeration of all successors #+BEAMER: \begin{columns}\begin{column}{0.45\textwidth} | *Forward graph* | | |------------------+----------------| | visited edges | 13.6 B | | throughput | 12.0 M edges/s | | | (83 ns/edge) | #+BEAMER: \end{column}\begin{column}{0.45\textwidth} | *Backward graph* | | |------------------+----------------| | visited edges | 13.6 B | | throughput | 9.45 M edges/s | | | (106 ns/edge) | #+BEAMER: \end{column}\end{columns} *** :PROPERTIES: :BEAMER_env: ignoreheading :END: Note how edge lookup time is close to DRAM random access time (50-60 ns). ** Discussion :PROPERTIES: :CUSTOM_ID: discussion :END: *** Incrementality compression is *not incremental*, due to the use of contiguous integer ranges - but the graph is append-only, so... - ...based on expected graph growth rate it should be possible to pre-allocate enough free space in the integer ranges to support *amortized incrementality* (future work) #+BEAMER: \pause *** In-memory v. on-disk the compressed in-memory graph structure has *no attributes* - usual design is to exploit the 0..N-1 integer ranges to *memory map node attributes* to disk for efficient access - works well for queries that does graph traversal first and "join" node attributes last; ping-pong between the two is expensive - edge attributes are more problematic (work in progress) diff --git a/common/modules/swh-fuse.org b/common/modules/swh-fuse.org index 74f6bbf..3597a0b 100644 --- a/common/modules/swh-fuse.org +++ b/common/modules/swh-fuse.org @@ -1,139 +1,139 @@ #+COLUMNS: %40ITEM %10BEAMER_env(Env) %9BEAMER_envargs(Env Args) %10BEAMER_act(Act) %4BEAMER_col(Col) %10BEAMER_extra(Extra) %8BEAMER_opt(Opt) #+INCLUDE: "prelude.org" :minlevel 1 ** Software Heritage Filesystem (SwhFS) :PROPERTIES: :CUSTOM_ID: oneslide :END: *** The *Software Heritage Filesystem (SwhFS)* is a user-space POSIX filesystem that enables browsing parts of the Software Heritage archive as if it were locally available. *** - - code: https://forge.softwareheritage.org/source/swh-fuse/ - - documentation: https://docs.softwareheritage.org/devel/swh-fuse/ + - code: [[https://forge.softwareheritage.org/source/swh-fuse/][forge.softwareheritage.org/source/swh-fuse]] + - documentation: [[https://docs.softwareheritage.org/devel/swh-fuse/][docs.softwareheritage.org/devel/swh-fuse]] *** #+BEGIN_EXPORT latex \begin{thebibliography}{Foo Bar, 1969} \bibitem{Allancon2021} Thibault Allançon, Antoine Pietri, Stefano Zacchiroli \newblock The Software Heritage Filesystem (SwhFS): Integrating Source Code Archival with Development \newblock ICSE 2021: The 43rd International Conference on Software Engineering \newblock \url{https://arxiv.org/abs/2102.06390} \end{thebibliography} #+END_EXPORT ** Software Heritage Filesystem (SwhFS) --- Example :noexport: :PROPERTIES: :CUSTOM_ID: examplemini :END: *** #+BEAMER: \footnotesize #+BEGIN_SRC $ mkdir swhfs $ swh fs mount swhfs/ # mount the archive $ cd swhfs/ $ cat archive/swh:1:cnt:c839dea9e8e6f0528b468214348fee8669b305b2 #include int main(void) { printf("Hello, World!\n"); } $ cd archive/swh:1:dir:1fee702c7e6d14395bbf5ac3598e73bcbf97b030 $ ls | wc -l 127 $ grep -i antenna THE_LUNAR_LANDING.s | cut -f 5 # IS THE LR ANTENNA IN POSITION 1 YET # BRANCH IF ANTENNA ALREADY IN POSITION 1 #+END_SRC * Software Heritage Filesystem (SwhFS) --- Tutorial :PROPERTIES: :CUSTOM_ID: tutorial :END: ** Software Heritage Filesystem (SwhFS) --- Tutorial *** #+BEGIN_SRC $ pip install swh.fuse # install SwhFS $ mkdir swhfs $ swh fs mount swhfs/ # mount the archive $ ls -1F swhfs/ # list entry points archive/ # <- start browsing from here cache/ origin/ README #+END_SRC ** Software Heritage Filesystem (SwhFS) --- Tutorial (cont.) *** #+BEAMER: \footnotesize #+BEGIN_SRC $ cd swhfs/ $ cat archive/swh:1:cnt:c839dea9e8e6f0528b468214348fee8669b305b2 #include int main(void) { printf("Hello, World!\n"); } #+END_SRC ** Software Heritage Filesystem (SwhFS) --- Tutorial (cont.) *** #+BEAMER: \footnotesize #+BEGIN_SRC $ cd archive/swh:1:dir:1fee702c7e6d14395bbf5ac3598e73bcbf97b030 $ ls | wc -l 127 $ grep -i antenna THE_LUNAR_LANDING.s | cut -f 5 # IS THE LR ANTENNA IN POSITION 1 YET # BRANCH IF ANTENNA ALREADY IN POSITION 1 #+END_SRC ** Software Heritage Filesystem (SwhFS) --- Tutorial (cont.) *** #+BEAMER: \footnotesize #+BEGIN_SRC $ cd archive/swh:1:rev:9d76c0b163675505d1a901e5fe5249a2c55609bc $ ls -F history/ meta.json@ parent@ parents/ root@ $ jq '.author.name, .date, .message' meta.json "Michal Golebiowski-Owczarek" "2020-03-02T23:02:42+01:00" "Data:Event:Manipulation: Prevent collisions with Object.prototype ..." $ find root/src/ -type f -name '*.js' | xargs cat | wc -l 10136 #+END_SRC ** Software Heritage Filesystem (SwhFS) --- Tutorial (cont.) *** #+BEAMER: \footnotesize #+BEGIN_SRC $ swh web search git-annex --limit 1 ... git://git.joeyh.name/git-annex.git \ https://archive.softwareheritage.org/api/1/origin/git://git.joeyh.name/git-annex.git/visits/ ... $ swh web search git-annex --url-encode | cut -f 1 git%3A%2F%2Fgit.joeyh.name%2Fgit-annex.git $ cd origin/git%3A%2F%2Fgit.joeyh.name%2Fgit-annex.git $ ls -F 2020-12-19/ $ ls 2020-12-19/snapshot/refs/heads/master/root/ Annex/ COPYRIGHT NEWS Annex.hs Creds.hs P2P/ Assistant/ Crypto.hs README Assistant.hs Database/ Remote/ Backend/ debian/ RemoteDaemon/ #+END_SRC diff --git a/talks-public/2022-09-28-ese-research/2022-09-28-ese-research.org b/talks-public/2022-09-28-ese-research/2022-09-28-ese-research.org index c5ab101..4c39892 100644 --- a/talks-public/2022-09-28-ese-research/2022-09-28-ese-research.org +++ b/talks-public/2022-09-28-ese-research/2022-09-28-ese-research.org @@ -1,67 +1,115 @@ #+COLUMNS: %40ITEM %10BEAMER_env(Env) %9BEAMER_envargs(Env Args) %10BEAMER_act(Act) %4BEAMER_col(Col) %10BEAMER_extra(Extra) %8BEAMER_opt(Opt) #+TITLE: Empirical Software Engineering Research with Software Heritage #+BEAMER_HEADER: \date[2022-09-28]{28 September 2022} #+BEAMER_HEADER: \title[Empirical Software Eng. Research with Software Heritage]{Empirical Software Engineering Research with Software Heritage} #+AUTHOR: Stefano Zacchiroli #+DATE: 28 September 2022 #+EMAIL: stefano.zacchiroli@telecom-paris.fr #+INCLUDE: "../../common/modules/prelude-toc.org" :minlevel 1 #+INCLUDE: "../../common/modules/169.org" #+BEAMER_HEADER: \institute[Télécom Paris]{Télécom Paris, Polytechnic Institute of Paris\\ {\tt stefano.zacchiroli@telecom-paris.fr}} #+BEAMER_HEADER: \author{Stefano Zacchiroli} * Datasets ** Graph dataset #+INCLUDE: "../../common/modules/dataset.org::#graphdataset" :only-contents t ** Graph dataset --- example #+INCLUDE: "../../common/modules/dataset.org::#graphquery1" :only-contents t ** License dataset #+INCLUDE: "../../common/modules/dataset.org::#licensedataset" :only-contents t * Accessing source code artifacts ** The Software Heritage Filesystem (SwhFS) #+INCLUDE: "../../common/modules/swh-fuse.org::#oneslide" :only-contents t ** The Software Heritage Filesystem (SwhFS) --- example #+INCLUDE: "../../common/modules/swh-fuse.org::#examplemini" :only-contents t ** Graph compression #+INCLUDE: "../../common/modules/graph-compression.org::#oneslide" :only-contents t * Software provenance and evolution ** Software provenance and evolution - TODO +#+BEAMER: \begin{center} \includegraphics[width=0.7\textwidth]{commit-time-distro} \end{center} \vspace{-2mm} +*** Key findings + - The amount of original commits in public code doubles every ~30 months + and has been doing so for 20+ years; original source code files double + every ~22 months + - It is possible to trace the provenance of source code artifacts at this + scale in a compact relational model via the notion of isochrone graphs. + + #+BEAMER: \vspace{-2mm} +*** + #+BEGIN_EXPORT latex + \vspace{-2mm} + \footnotesize + \begin{thebibliography}{Foo Bar, 1969} + \bibitem{Rousseau2020} Rousseau, Di Cosmo, Zacchiroli\newblock + Software Provenance Tracking at the Scale of Public Source Code\newblock + In Empirical Software Engineering, 2020 + \end{thebibliography} + #+END_EXPORT * Software forks ** Software forks - TODO +*** Idea +- Forks can be detected via either platform metadata (e.g., GitHub keeping + track of who clicked "fork" on what repo; the most common approach), or via + shared version control system history. +- Thanks to deduplication and platform agnosticity, Software Heritage provide a + privileged observation point on the global fork ecosystem in public code. +*** Research questions +- What is the right definition of "being a fork"? (methodology) +- How many forks could we miss by looking only at platform metadata? +- How many "cross-platform" forks (e.g., GitHub → GitLab) exist in the wild? +** Software forks (cont.) +*** Findings +- Forks classification: based on platform metadata (“type 1” forks), sharing at + least one commit (“type 2”), sharing a common root directory at some point in + VCS history (“type 3”). +- Up to 16% forks could be overlooked by considering only GitHub type 1 forks + (a potentially significant threat to validity!). +- Relevant independent development activity can happen on GitLab.com for + projects initially just mirrored from GitHub. +*** + #+BEGIN_EXPORT latex + \vspace{-3mm} \footnotesize + \begin{thebibliography}{Foo Bar, 1969} + \bibitem{Pietri2020} Pietri, Rousseau, Zacchiroli.\newblock + Forking Without Clicking: on How to Identify Software Repository Forks.\newblock + MSR 2020 + \bibitem{Bhattacharjee2020} Bhattacharjee et al.\newblock + An exploratory study to find motives behind cross-platform forks from Software Heritage dataset.\newblock + MSR 2020 + \end{thebibliography} + #+END_EXPORT * Diversity, equity, and inclusion ** Diversity, equity, and inclusion *** Idea Archived commit metadata contains public information that can be mined to study long-term trends of diversity, equity, and inclusion (DEI) traits of the global population of public code contributors. *** Key findings on the gender gap - Male authors contributed 92% of public code commits up to 2019. - The ratio of female authors (and their contributions) has grown stably for 15 years reaching for the first time 10% of yearly contributions in 2019. - The COVID-19 pandemic has reversed the trend. ** Diversity, equity, and inclusion (cont.) *** Key findings on the geographic gap - The early decades of public code were dominated by contributions from North America, followed by a period of alternating dominance between North America and Europe. - Since then geographic diversity has increased constantly, with raising importance of contributions from Central and South America. - The trend of increased female contributions is almost worlwide, with the notable exception of specific regions of Asia were it is either slower or flat. *** References #+BEAMER: \footnotesize - Zacchiroli. /Gender differences in public code contributions: a 50-year perspective/. IEEE Software, 2021 - Rossi and Zacchiroli. /Worldwide gender differences in public code contributions/. ICSE SEIS, 2022 - Rossi and Zacchiroli. /Geographic diversity in public code contributions/. MSR 2022