#+COLUMNS: %40ITEM %10BEAMER_env(Env) %9BEAMER_envargs(Env Args) %10BEAMER_act(Act) %4BEAMER_col(Col) %10BEAMER_extra(Extra) %8BEAMER_opt(Opt)
#+TITLE: Software Heritage
#+SUBTITLE: Analyzing the Global Graph of Public Software Development
#+BEAMER_HEADER: \date[2021-05-19, ACES]{19 May 2021\\Team ACES --- Télécom Paris\\ (online)\\[-2ex]}
#+AUTHOR: Stefano Zacchiroli
#+DATE: 19 May 2021
#+EMAIL: zack@upsilon.cc
#+INCLUDE: "../../common/modules/prelude-toc.org" :minlevel 1
#+INCLUDE: "../../common/modules/169.org"
#+BEAMER_HEADER: \institute[UParis \& Inria]{Université de Paris \& Inria --- {\tt zack@upsilon.cc, @zacchiro}}
#+BEAMER_HEADER: \author{Stefano Zacchiroli}
# Syntax highlighting setup
#+LATEX_HEADER_EXTRA: \usepackage{minted}
#+LaTeX_HEADER_EXTRA: \usemintedstyle{tango}
#+LaTeX_HEADER_EXTRA: \newminted{sql}{fontsize=\scriptsize}
#+name: setup-minted
#+begin_src emacs-lisp :exports results :results silent
;; Typeset source blocks with minted (needs pdflatex -shell-escape)
(setq org-latex-listings 'minted)
(setq org-latex-minted-options
      '(("fontsize" "\\scriptsize")
        ("linenos" "")))
;; Run pdflatex three times so that cross-references settle
(setq org-latex-pdf-process
      '("pdflatex -shell-escape -interaction nonstopmode -output-directory %o %f"
        "pdflatex -shell-escape -interaction nonstopmode -output-directory %o %f"
        "pdflatex -shell-escape -interaction nonstopmode -output-directory %o %f"))
#+end_src
# End syntax highlighting setup
* About me :B_ignoreheading:
:PROPERTIES:
:BEAMER_env: ignoreheading
:END:
#+INCLUDE: "this/zack.org" :minlevel 2
* Software Heritage
#+INCLUDE: "../../common/modules/swh-goals-oneslide-vertical.org::#goals" :minlevel 2
** An international, non-profit initiative
:PROPERTIES:
:CUSTOM_ID: support
:END:
*** Sharing the vision :B_block:
:PROPERTIES:
:CUSTOM_ID: endorsement
:BEAMER_COL: .5
:BEAMER_env: block
:END:
#+LATEX: \begin{center}{\includegraphics[width=\extblockscale{.4\linewidth}]{unesco_logo_en_285}}\end{center}
#+LATEX: \vspace{-0.8cm}
#+LATEX: \begin{center}\vskip 1em \includegraphics[width=\extblockscale{1.4\linewidth}]{support.pdf}\end{center}
#+latex:\mbox{}~~~~~~~\tiny\url{www.softwareheritage.org/support/testimonials}
*** Donors, members, sponsors :B_block:
:PROPERTIES:
:CUSTOM_ID: sponsors
:BEAMER_COL: .5
:BEAMER_env: block
:END:
#+LATEX: \begin{center}\includegraphics[width=\extblockscale{.4\linewidth}]{inria-logo-new}\end{center}
#+LATEX: \begin{center}
#+LATEX: \colorbox{white}{\includegraphics[width=\extblockscale{1.4\linewidth}]{sponsors.pdf}}
#+latex:\mbox{}~~~~~~~\tiny\url{www.softwareheritage.org/support/sponsors}
#+LATEX: \end{center}
** Status :B_ignoreheading:
:PROPERTIES:
:BEAMER_env: ignoreheading
:END:
#+INCLUDE: "../../common/modules/status-extended.org::#archivinggoals" :minlevel 2
#+INCLUDE: "../../common/modules/status-extended.org::#architecture" :minlevel 2 :only-contents t
#+INCLUDE: "../../common/modules/status-extended.org::#merkletree" :minlevel 2
#+INCLUDE: "../../common/modules/data-model.org::#merklestruct" :minlevel 2
#+INCLUDE: "../../common/modules/status-extended.org::#dagdetailsmall" :minlevel 2 :only-contents t
#+INCLUDE: "../../common/modules/status-extended.org::#archive" :minlevel 2
* Querying the archive
** Use cases --- product needs
e.g., for https://archive.softwareheritage.org
*** Browsing
- =ls=
- =git log= (Linux kernel: 800K+ commits)
*** Wayback machine
- tarball
- =git bundle= (Linux kernel: 7M+ nodes)
*** Provenance tracking
- commit provenance (one/all contexts) \hfill note: requires backtracking
- origin provenance (one/all contexts)
*** Note :B_ignoreheading:
:PROPERTIES:
:BEAMER_env: ignoreheading
:END:
Note: we therefore need both the direct Merkle DAG and its
*transpose*
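
A minimal sketch of why both directions are needed, assuming a toy DAG stored as an adjacency dict. The node names and the =transpose= and =reachable= helpers are hypothetical illustrations, not part of any Software Heritage API: browsing follows the direct edges, while provenance queries walk the reversed ones.

#+begin_src python
from collections import defaultdict, deque


def transpose(graph):
    """Return a copy of the graph with every edge reversed."""
    reversed_graph = defaultdict(set)
    for src, dsts in graph.items():
        reversed_graph[src]  # keep nodes that have no incoming edge
        for dst in dsts:
            reversed_graph[dst].add(src)
    return dict(reversed_graph)


def reachable(graph, start):
    """Breadth-first set of nodes reachable from start (start excluded)."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


# Toy Merkle DAG: revisions point to directories, directories to contents.
dag = {
    "rev1": {"dir1"},
    "rev2": {"dir2"},
    "dir1": {"file_a", "file_b"},
    "dir2": {"file_b"},
}
backward = transpose(dag)

print(reachable(dag, "rev1"))         # browsing: what does rev1 contain?
print(reachable(backward, "file_b"))  # provenance: where does file_b occur?
#+end_src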
** Use cases --- research questions
*** For the sake of it
- local graph topology
- connected component size
- enabling question: identify the best approach (e.g., scale-up
  vs. scale-out) for conducting large-scale analyses
- any other emerging property
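
As an illustration of the "connected component size" question, here is a small union-find sketch on a toy edge list. The =component_sizes= helper and the integer node numbering are assumptions made for the example; at archive scale the same idea would need a compressed or out-of-core graph representation.

#+begin_src python
from collections import Counter


def component_sizes(num_nodes, edges):
    """Sizes of weakly connected components, largest first (union-find)."""
    parent = list(range(num_nodes))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    counts = Counter(find(n) for n in range(num_nodes))
    return sorted(counts.values(), reverse=True)


# Six nodes: one chain of three, one pair, one isolated node.
edges = [(0, 1), (1, 2), (3, 4)]
sizes = component_sizes(6, edges)
print(sizes)
#+end_src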
*** Software Engineering topics
- software provenance analysis at this scale is still largely unexplored
- industry frontier: increase granularity down to the individual line of
  code
- replicate at this scale (famous) studies that have generally been
  conducted on (much) smaller version control system samples, to
  confirm or refute their findings
- ...
** Exploitation
#+BEAMER: \LARGE \centering
How do you query the Software Heritage archive?
#+BEAMER: \Large \\
(on a budget)
** The Software Heritage Graph Dataset :B_ignoreheading:
:PROPERTIES:
:BEAMER_env: ignoreheading
:END:
#+INCLUDE: "../../common/modules/dataset.org::#main" :minlevel 2 :only-contents t
#+INCLUDE: "../../common/modules/dataset.org::#morequery" :minlevel 2 :only-contents t
** Sample study --- 50 years of gender differences in code contributions
- start from the Software Heritage graph dataset
- detect gender of author names using standard tooling (=gender-guesser=)
# - caveat: how to identify /first/ name?
- analyze both authors and commits over time, bucketing by commit timestamp
#+BEAMER: \begin{center} \includegraphics[height=0.45\textheight]{this/commits-pie.pdf} \includegraphics[height=0.45\textheight]{this/ratio-female-authors.pdf} \\ \scriptsize total commits by author gender (left), ratio of active female committers over time (right)\end{center}
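
The bucketing step can be sketched as follows. This is a simplified illustration, not the study's actual pipeline: =GENDER_BY_FIRST_NAME= is a hypothetical lookup table standing in for a name-based detector, the first whitespace-separated token is assumed to be the first name, and the ratio is assumed to be taken over authors whose gender was detected.

#+begin_src python
from collections import defaultdict
from datetime import datetime, timezone

# Hypothetical lookup standing in for a name-based gender detector.
GENDER_BY_FIRST_NAME = {"alice": "female", "bob": "male", "carol": "female"}


def guess_gender(author_name):
    """Guess gender from the first token of the name (an assumption)."""
    parts = author_name.split()
    first = parts[0].lower() if parts else ""
    return GENDER_BY_FIRST_NAME.get(first, "unknown")


def female_author_ratio_by_year(commits):
    """commits: iterable of (author_name, unix_timestamp) pairs."""
    authors = defaultdict(lambda: defaultdict(set))  # year -> gender -> names
    for name, timestamp in commits:
        year = datetime.fromtimestamp(timestamp, tz=timezone.utc).year
        authors[year][guess_gender(name)].add(name)
    ratios = {}
    for year, by_gender in sorted(authors.items()):
        known = len(by_gender["female"]) + len(by_gender["male"])
        if known:  # skip years where no gender could be detected
            ratios[year] = len(by_gender["female"]) / known
    return ratios


commits = [
    ("Alice Doe", 0),              # 1970
    ("Bob Roe", 0),
    ("Bob Roe", 1_000_000_000),    # 2001
    ("Carol Poe", 1_600_000_000),  # 2020
    ("Alice Doe", 1_600_000_000),
    ("Bob Roe", 1_600_000_000),
    ("Zed", 1_600_000_000),        # not in the lookup: counted as unknown
]
ratios = female_author_ratio_by_year(commits)
print(ratios)
#+end_src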
***
#+BEGIN_EXPORT latex
\vspace{-1mm}
\begin{thebibliography}{} \footnotesize
\bibitem{Zacchiroli2021} Stefano Zacchiroli
\newblock Gender Differences in Public Code Contributions: a 50-year Perspective
\newblock IEEE Softw. 38(2): 45-50 (2021)
\end{thebibliography}
#+END_EXPORT
** Discussion
- one /can/ query such a corpus SQL-style
- but the relational representation shows its limits at this scale
  - ...at least as deployed on commercial SQL offerings such as Athena
  - note: (naive) sharding is ineffective, due to the pseudo-random
    distribution of node identifiers
- experiments with Google BigQuery are ongoing
  - (we broke it at the first import attempt..., due to very large arrays in
    directory entry tables)
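
The sharding remark can be illustrated with a toy experiment, under stated assumptions: nodes are named =node-i=, identifiers are SHA-1 digests of those names, and a shard is picked as the digest modulo the number of shards. With such pseudo-random identifiers, neighbors land on unrelated shards, so roughly $1 - 1/k$ of the edges cross shard boundaries for $k$ shards.

#+begin_src python
import hashlib


def shard_of(node_name, num_shards):
    """Pick a shard from a SHA-1 digest, mimicking hash-based identifiers."""
    digest = hashlib.sha1(node_name.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def cross_shard_fraction(edges, num_shards):
    """Fraction of edges whose two endpoints live on different shards."""
    crossing = sum(
        1 for a, b in edges
        if shard_of(a, num_shards) != shard_of(b, num_shards)
    )
    return crossing / len(edges)


# Toy graph: a long chain, each node linked to its successor.
edges = [(f"node-{i}", f"node-{i + 1}") for i in range(10_000)]
fraction = cross_shard_fraction(edges, num_shards=8)
print(f"{fraction:.3f} of the edges cross shards")
#+end_src

Any traversal on such a layout pays a cross-shard hop on most edges, which is why naive sharding does not help graph queries here.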
* Graph compression
#+INCLUDE: "../../common/modules/graph-compression.org::#main" :minlevel 2 :only-contents t
* Security synergies and outlook
#+INCLUDE: "this/security.org" :minlevel 2
#+INCLUDE: "this/roadmap.org" :minlevel 2
** Wrapping up
#+latex: \vspace{-1mm}
***
- Software Heritage archives all public source code as a huge Merkle DAG
- Querying and analyzing it at scale (20 B nodes, 200 B edges) is an open
  problem
- Gold mine of research leads in software engineering, big code,
  reproducibility, security
#+latex: \vspace{-2mm}
*** References (selected)
#+latex: \vspace{-1mm}
#+BEGIN_EXPORT latex
\begin{thebibliography}{}
\scriptsize
\bibitem{Abramatic2018} Jean-François Abramatic, Roberto Di Cosmo, Stefano Zacchiroli
\newblock Building the Universal Archive of Source Code
\newblock Communications of the ACM, October 2018
\bibitem{Pietri2019} Antoine Pietri, Diomidis Spinellis, Stefano Zacchiroli
\newblock The Software Heritage graph dataset: public software development under one roof
\newblock MSR 2019: 16th Intl. Conf. on Mining Software Repositories. IEEE
\bibitem{Boldi2020} Paolo Boldi, Antoine Pietri, Sebastiano Vigna, Stefano Zacchiroli
\newblock Ultra-Large-Scale Repository Analysis via Graph Compression
\newblock SANER 2020, 27th Intl. Conf. on Software Analysis, Evolution and Reengineering. IEEE
\end{thebibliography}
#+END_EXPORT
*** Contacts
Stefano Zacchiroli / [[https://upsilon.cc/~zack/][upsilon.cc]] / [[mailto:zack@upsilon.cc][zack@upsilon.cc]] / [[https://twitter.com/zacchiro][@zacchiro]]
* Appendix :B_appendix:
:PROPERTIES:
:BEAMER_env: appendix
:END:
** Meet the Software Heritage Identifiers (SWHIDs) \hfill [[https://docs.softwareheritage.org/devel/swh-model/persistent-identifiers.html][(full spec)]]
#+INCLUDE: "../../common/modules/swhid.org::#oneslide" :only-contents t
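
A minimal sketch of computing the SWHID of a file content, assuming the version 1 scheme in which =cnt= identifiers reuse the Git blob hash (SHA-1 over a =blob <length>\0= header followed by the bytes). The =content_swhid= helper is a hypothetical name; the full spec linked above covers the other object types and qualifiers.

#+begin_src python
import hashlib


def content_swhid(data: bytes) -> str:
    """SWHID of a file content: 'swh:1:cnt:' + Git-compatible blob SHA-1."""
    header = b"blob %d\x00" % len(data)
    return "swh:1:cnt:" + hashlib.sha1(header + data).hexdigest()


print(content_swhid(b""))
#+end_src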