Note: we therefore need both the direct Merkle DAG graph and its *transpose*
** Use cases --- research questions
*** For its own sake
- local graph topology
- connected component size
- an enabling question: identifying the best approach (e.g., scale-up
  vs. scale-out) to conduct large-scale analyses
- any other emergent property
*** Software Engineering topics
- software provenance analysis at this scale is still largely unexplored
- industry frontier: increase granularity down to the individual line of
code
- replicate at this scale (famous) studies that have generally been
  conducted on (much) smaller version control system samples, to
  confirm or refute their findings
- ...
** Exploitation
#+BEAMER: \LARGE \centering
How do you query the Software Heritage archive?
#+BEAMER: \Large \\
(on a budget)
** The Software Heritage Graph Dataset :B_ignoreheading:
:PROPERTIES:
:BEAMER_env: ignoreheading
:END:
#+INCLUDE: "../../common/modules/dataset.org::#main" :minlevel 2 :only-contents t
#+INCLUDE: "../../common/modules/dataset.org::#morequery" :minlevel 2 :only-contents t
** Sample study --- 50 years of gender differences in code contributions
- start from the Software Heritage graph dataset
- detect the gender of author names using standard tooling (=gender-guesser=); see the sketch below
# - caveat: how to identify /first/ name?
- analyze both authors and commits over time, bucketing by commit timestamp
#+BEAMER: \begin{center} \includegraphics[height=0.45\textheight]{this/commits-pie.pdf} \includegraphics[height=0.45\textheight]{this/ratio-female-authors.pdf} \\ \scriptsize total commits by author gender (left), ratio of active female committers over time (right)\end{center}
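A minimal sketch of the gender detection step using the =gender-guesser= Python package; treating the first whitespace-separated token of the author name as the given name is a simplifying assumption made here for illustration, not necessarily the study's exact heuristic.
#+BEGIN_SRC python
# Sketch only: classify author names with gender-guesser, assuming the
# first whitespace-separated token of the author name is the given name.
import gender_guesser.detector as gender

detector = gender.Detector(case_sensitive=False)

def author_gender(author_name: str) -> str:
    """Classify an author name as 'male', 'female', 'mostly_male',
    'mostly_female', 'andy' (ambiguous), or 'unknown'."""
    tokens = author_name.strip().split()
    if not tokens:
        return "unknown"
    return detector.get_gender(tokens[0])

print(author_gender("Ada Lovelace"))  # e.g. 'female'
print(author_gender("torvalds"))      # e.g. 'unknown' (not a given name)
#+END_SRC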
***
#+BEGIN_EXPORT latex
\vspace{-1mm}
\begin{thebibliography}{} \footnotesize
\bibitem{Zacchiroli2021} Stefano Zacchiroli
\newblock Gender Differences in Public Code Contributions: a 50-year Perspective
\newblock IEEE Softw. 38(2): 45-50 (2021)
\end{thebibliography}
#+END_EXPORT
** Discussion
- one /can/ query such a corpus SQL-style (see the sketch below)
- but the relational representation shows its limits at this scale
- ...at least as deployed on commercial SQL offerings such as Amazon Athena
- note: (naive) sharding is ineffective, due to the pseudo-random
distribution of node identifiers
- experiments with Google BigQuery are ongoing
- (we broke it on the first import attempt, due to very large arrays in
  the directory entry tables)
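To illustrate the SQL-style access path, a minimal sketch of submitting one aggregate query to the dataset as exposed on Amazon Athena via =boto3=; the database name (=swh_graph=), table and column names, and the S3 output location are assumptions for illustration, not the dataset's documented schema.
#+BEGIN_SRC python
# Sketch only: run one aggregate SQL query against an Athena deployment
# of the dataset.  Database, table, and column names, as well as the S3
# output location, are illustrative assumptions.
import time
import boto3

athena = boto3.client("athena")

QUERY = """
SELECT date_trunc('year', date) AS year, count(*) AS commits
FROM revision
GROUP BY 1
ORDER BY 1
"""

execution = athena.start_query_execution(
    QueryString=QUERY,
    QueryExecutionContext={"Database": "swh_graph"},  # assumed database name
    ResultConfiguration={"OutputLocation": "s3://my-results-bucket/athena/"},
)
query_id = execution["QueryExecutionId"]

# Poll until the query completes, then print the (small) aggregate result.
while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(2)

if state == "SUCCEEDED":
    results = athena.get_query_results(QueryExecutionId=query_id)
    for row in results["ResultSet"]["Rows"]:
        print([col.get("VarCharValue") for col in row["Data"]])
#+END_SRC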
* Graph compression
#+INCLUDE: "../../common/modules/graph-compression.org::#main" :minlevel 2 :only-contents t
* Security synergies and outlook
#+INCLUDE: "this/security.org" :minlevel 2
#+INCLUDE: "this/roadmap.org" :minlevel 2
** Wrapping up
#+latex: \vspace{-1mm}
***
- Software Heritage archives all public source code as a huge Merkle DAG
- Querying and analyzing it at scale (20 B nodes / 200 B edges) is an open
  problem
- Gold mine of research leads in sw. eng., big code, reproducibility,
security
#+latex: \vspace{-2mm}
*** References (selected)
#+latex: \vspace{-1mm}
#+BEGIN_EXPORT latex
\begin{thebibliography}{}
\scriptsize
\bibitem{Abramatic2018} Jean-François Abramatic, Roberto Di Cosmo, Stefano Zacchiroli
\newblock Building the Universal Archive of Source Code
\newblock Communications of the ACM, October 2018
\bibitem{Pietri2019} Antoine Pietri, Diomidis Spinellis, Stefano Zacchiroli
\newblock The Software Heritage graph dataset: public software development under one roof