add recently published scientific papers to the website
Closed, Migrated

Description

The following scientific papers about Software Heritage have been recently published and should be added to the publications page:

  • Stefano Zacchiroli. Gender Differences in Public Code Contributions: a 50-year Perspective. To appear in IEEE Software. ISSN 0740-7459, IEEE Computer Society. 2021.

    Abstract: Gender imbalance in information technology in general, and Free/Open Source Software specifically, is a well-known problem in the field. Still, little is known yet about the large-scale extent and long-term trends that underpin the phenomenon. We contribute to fill this gap by conducting a longitudinal study of the population of contributors to publicly available software source code. We analyze 1.6 billion commits corresponding to the development history of 120 million projects, contributed by 33 million distinct authors over a period of 50 years. We classify author names by gender and study their evolution over time. We show that, while the amount of commits by female authors remains low overall, there is evidence of a stable long-term increase in their proportion over all contributions, providing hope of a more gender-balanced future for collaborative software development.

    preprint, bibtex
  • Thibault Allançon, Antoine Pietri, Stefano Zacchiroli. The Software Heritage Filesystem (SwhFS): Integrating Source Code Archival with Development. To appear in proceedings of ICSE 2021: The 43rd International Conference on Software Engineering, May 2021, Madrid, Spain. IEEE 2021.

    Abstract: We introduce the Software Heritage filesystem (SwhFS), a user-space filesystem that integrates large-scale open source software archival with development workflows. SwhFS provides a POSIX filesystem view of Software Heritage, the largest public archive of software source code and version control system (VCS) development history. Using SwhFS, developers can quickly “checkout” any of the 2 billion commits archived by Software Heritage, even after they disappear from their previous known location and without incurring the performance cost of repository cloning. SwhFS works across unrelated repositories and different VCS technologies. Other source code artifacts archived by Software Heritage—individual source code files and trees, releases, and branches—can also be accessed using common programming tools and custom scripts, as if they were locally available. A screencast of SwhFS is available online at dx.doi.org/10.5281/zenodo.4531411.

    preprint, bibtex
  • Antoine Pietri, Guillaume Rousseau, Stefano Zacchiroli. Determining the Intrinsic Structure of Public Software Development History. In proceedings of MSR 2020: The 17th International Conference on Mining Software Repositories, May 2020, Seoul, South Korea. Pages 602-605, IEEE 2020.

    Abstract: Background: Collaborative software development has produced a wealth of version control system (VCS) data that can now be analyzed in full. Little is known about the intrinsic structure of the entire corpus of publicly available VCS as an interconnected graph. Understanding its structure is needed to determine the best approach to analyze it in full and to avoid methodological pitfalls when doing so. Objective: We intend to determine the most salient network topology properties of public software development history as captured by VCS. We will explore: degree distributions, determining whether they are scale-free or not; distribution of connected component sizes; distribution of shortest path lengths. Method: We will use Software Heritage, which is the largest corpus of public VCS data, compress it using webgraph compression techniques, and analyze it in-memory using classic graph algorithms. Analyses will be performed both on the full graph and on relevant subgraphs. Limitations: The study is exploratory in nature; as such no hypotheses on the findings are stated at this time. Chosen graph algorithms are expected to scale to the corpus size, but this will need to be confirmed experimentally. External validity will depend on how representative Software Heritage is of the software commons.

    preprint, bibtex
  • Antoine Pietri, Guillaume Rousseau, Stefano Zacchiroli. Forking Without Clicking: on How to Identify Software Repository Forks. In proceedings of MSR 2020: The 17th International Conference on Mining Software Repositories, May 2020, Seoul, South Korea. Pages 277-287, IEEE 2020.

    Abstract: The notion of software "fork" has been shifting over time from the (negative) phenomenon of community disagreements that result in the creation of separate development lines and ultimately software products, to the (positive) practice of using distributed version control system (VCS) repositories to collaboratively improve a single product without stepping on each other's toes. In both cases the VCS repositories participating in a fork share parts of a common development history. Studies of software forks generally rely on hosting platform metadata, such as GitHub, as the source of truth for what constitutes a fork. These "forge forks" however can only identify as forks repositories that have been created on the platform, e.g., by clicking a "fork" button on the platform user interface. The increased diversity in code hosting platforms (e.g., GitLab) and the habits of significant development communities (e.g., the Linux kernel, which is not primarily hosted on any single platform) call into question the reliability of trusting code hosting platforms to identify forks. Doing so might introduce selection and methodological biases in empirical studies. In this article we explore various definitions of "software forks", trying to capture forking workflows that exist in the real world. We quantify the differences in how many repositories would be identified as forks on GitHub according to the various definitions, confirming that a significant number could be overlooked by only considering forge forks. We study the structure and size of fork networks, observing how they are affected by the proposed definitions and discuss the potential impact on empirical research.

    preprint, bibtex
  • Antoine Pietri, Diomidis Spinellis, Stefano Zacchiroli. The Software Heritage Graph Dataset: Large-scale Analysis of Public Software Development History. In proceedings of MSR 2020: The 17th International Conference on Mining Software Repositories, May 2020, Seoul, South Korea. Pages 1-5. IEEE 2020.

    Abstract: Software Heritage is the largest existing public archive of software source code and accompanying development history. It spans more than five billion unique source code files and one billion unique commits, coming from more than 80 million software projects. These software artifacts were retrieved from major collaborative development platforms (e.g., GitHub, GitLab) and package repositories (e.g., PyPI, Debian, NPM), and stored in a uniform representation linking together source code files, directories, commits, and full snapshots of version control systems (VCS) repositories as observed by Software Heritage during periodic crawls. This dataset is unique in terms of accessibility and scale, and makes it possible to explore a number of research questions on the long tail of public software development, instead of solely focusing on "most starred" repositories as often happens.

    preprint, bibtex

Event Timeline

zack triaged this task as Normal priority. Mar 10 2021, 11:16 AM
zack created this task.

Done! Fortunately, teachpress (the WP plugin that displays the publications list) has a BibTeX import that works like a charm.

Good to know!

While we are at it, can you update the entry for the SANER 2020 paper? It says "forthcoming", but that is no longer the case. The updated BibTeX for that paper is here: https://upsilon.cc/~zack/research/publications/saner-2020-swh-graph.bib

And you can add "forthcoming" to the ICSE 2021 paper, which is indeed forthcoming, and for which we don't yet have full bibliographic info (e.g., page numbers).

Publications updated as requested.