Note: we therefore need both the direct Merkle DAG graph and its *transpose*
** Use cases --- research questions
*** For its own sake
- local graph topology
- connected component size
- an enabling question: identifying the best approach (e.g., scale-up
  vs. scale-out) to conduct large-scale analyses
- any other emergent property
*** Software Engineering topics
- software provenance analysis at this scale is still largely unexplored
- industry frontier: increase granularity down to the individual line of
code
- replicate at this scale (famous) studies that have generally been
  conducted on (much) smaller version control system samples, to
  confirm or refute their findings
- ...
** Exploitation
#+BEAMER: \LARGE \centering
How do you query the Software Heritage archive?
#+BEAMER: \Large \\
(on a budget)
** The Software Heritage Graph Dataset :B_ignoreheading:
:PROPERTIES:
:BEAMER_env: ignoreheading
:END:
#+INCLUDE: "../../common/modules/dataset.org::#main" :minlevel 2 :only-contents t
#+INCLUDE: "../../common/modules/dataset.org::#morequery" :minlevel 2 :only-contents t
** Sample study --- 50 years of gender differences in code contributions
- start from the Software Heritage graph dataset
- detect the gender of author names using standard tooling (=gender-guesser=); see the sketch below
# - caveat: how to identify /first/ name?
- analyze both authors and commits over time, bucketing by commit timestamp
#+BEAMER: \begin{center} \includegraphics[height=0.45\textheight]{this/commits-pie.pdf} \includegraphics[height=0.45\textheight]{this/ratio-female-authors.pdf} \\ \scriptsize total commits by author gender (left), ratio of active female committers over time (right)\end{center}
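A minimal sketch of the gender detection step using the =gender-guesser= Python package; treating the first whitespace-separated token of the author name as the given name is a simplifying assumption made here for illustration, not necessarily the study's exact heuristic.
#+BEGIN_SRC python
# Sketch only: classify author names with gender-guesser, assuming the
# first whitespace-separated token of the author name is the given name.
import gender_guesser.detector as gender

detector = gender.Detector(case_sensitive=False)

def author_gender(author_name: str) -> str:
    """Classify an author name as 'male', 'female', 'mostly_male',
    'mostly_female', 'andy' (ambiguous), or 'unknown'."""
    tokens = author_name.strip().split()
    if not tokens:
        return "unknown"
    return detector.get_gender(tokens[0])

print(author_gender("Ada Lovelace"))  # e.g. 'female'
print(author_gender("torvalds"))      # e.g. 'unknown' (not a given name)
#+END_SRC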
***
#+BEGIN_EXPORT latex
\vspace{-1mm}
\begin{thebibliography}{} \footnotesize
\bibitem{Zacchiroli2021} Stefano Zacchiroli
\newblock Gender Differences in Public Code Contributions: a 50-year Perspective
\newblock IEEE Softw. 38(2): 45-50 (2021)
\end{thebibliography}
#+END_EXPORT
** Discussion
- one /can/ query such a corpus SQL-style (see the sketch below)
- but the relational representation shows its limits at this scale
- ...at least as deployed on commercial SQL offerings such as Amazon Athena
- note: (naive) sharding is ineffective, due to the pseudo-random
distribution of node identifiers
- experiments with Google BigQuery are ongoing
- (we broke it on the first import attempt, due to very large arrays in
  the directory entry tables)
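To illustrate the SQL-style access path, a minimal sketch of submitting one aggregate query to the dataset as exposed on Amazon Athena via =boto3=; the database name (=swh_graph=), table and column names, and the S3 output location are assumptions for illustration, not the dataset's documented schema.
#+BEGIN_SRC python
# Sketch only: run one aggregate SQL query against an Athena deployment
# of the dataset.  Database, table, and column names, as well as the S3
# output location, are illustrative assumptions.
import time
import boto3

athena = boto3.client("athena")

QUERY = """
SELECT date_trunc('year', date) AS year, count(*) AS commits
FROM revision
GROUP BY 1
ORDER BY 1
"""

execution = athena.start_query_execution(
    QueryString=QUERY,
    QueryExecutionContext={"Database": "swh_graph"},  # assumed database name
    ResultConfiguration={"OutputLocation": "s3://my-results-bucket/athena/"},
)
query_id = execution["QueryExecutionId"]

# Poll until the query completes, then print the (small) aggregate result.
while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(2)

if state == "SUCCEEDED":
    results = athena.get_query_results(QueryExecutionId=query_id)
    for row in results["ResultSet"]["Rows"]:
        print([col.get("VarCharValue") for col in row["Data"]])
#+END_SRC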
* Graph compression
#+INCLUDE: "../../common/modules/graph-compression.org::#main" :minlevel 2 :only-contents t
* Security synergies and outlook
#+INCLUDE: "this/security.org" :minlevel 2
#+INCLUDE: "this/roadmap.org" :minlevel 2
** Wrapping up
#+latex: \vspace{-1mm}
***
- Software Heritage archives all public source code as a huge Merkle DAG
- Querying and analyzing it at scale (20 B nodes / 200 B edges) is an open
  problem
- Gold mine of research leads in sw. eng., big code, reproducibility,
security
#+latex: \vspace{-2mm}
*** References (selected)
#+latex: \vspace{-1mm}
#+BEGIN_EXPORT latex
\begin{thebibliography}{}
\scriptsize
\bibitem{Abramatic2018} Jean-François Abramatic, Roberto Di Cosmo, Stefano Zacchiroli
\newblock Building the Universal Archive of Source Code
\newblock Communications of the ACM, October 2018
\bibitem{Pietri2019} Antoine Pietri, Diomidis Spinellis, Stefano Zacchiroli
\newblock The Software Heritage graph dataset: public software development under one roof