#+COLUMNS: %40ITEM %10BEAMER_env(Env) %9BEAMER_envargs(Env Args) %10BEAMER_act(Act) %4BEAMER_col(Col) %10BEAMER_extra(Extra) %8BEAMER_opt(Opt)
#+TITLE: Software Heritage
#+SUBTITLE: Analyzing the Global Graph of Public Software Development
#+BEAMER_HEADER: \date[24 Mar 2020, ENSEA]{24 March 2020\\ENSEA --- Cergy, France\\ (via conf call)\\[-2ex]}
#+AUTHOR: Stefano Zacchiroli
#+DATE: 24 March 2020
#+EMAIL: zack@upsilon.cc
#+INCLUDE: "../../common/modules/prelude-toc.org" :minlevel 1
#+INCLUDE: "../../common/modules/169.org"
#+BEAMER_HEADER: \institute[UParis \& Inria]{Université de Paris \& Inria --- {\tt zack@upsilon.cc, @zacchiro}}
#+BEAMER_HEADER: \author{Stefano Zacchiroli}
# Required by graph-compression.org module
#+LATEX_HEADER_EXTRA: \usepackage{pdfpages}
# Syntax highlighting setup
#+LATEX_HEADER_EXTRA: \usepackage{minted}
#+LaTeX_HEADER_EXTRA: \usemintedstyle{tango}
#+LaTeX_HEADER_EXTRA: \newminted{sql}{fontsize=\scriptsize}
#+name: setup-minted
#+begin_src emacs-lisp :exports results :results silent
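;; Use minted (via Pygments) for source listings in the LaTeX export
(setq org-latex-listings 'minted)
;; Default minted options: small font, numbered lines
(setq org-latex-minted-options
      '(("fontsize" "\\scriptsize")
        ("linenos" "")))
;; minted needs -shell-escape; run pdflatex three times to settle references
(setq org-latex-pdf-process
      '("pdflatex -shell-escape -interaction nonstopmode -output-directory %o %f"
        "pdflatex -shell-escape -interaction nonstopmode -output-directory %o %f"
        "pdflatex -shell-escape -interaction nonstopmode -output-directory %o %f"))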
#+end_src
# End syntax highlighting setup
* About me :B_ignoreheading:
:PROPERTIES:
:BEAMER_env: ignoreheading
:END:
#+INCLUDE: "this/zack.org" :minlevel 2
* Software Heritage
** Software Heritage in a nutshell \hfill www.softwareheritage.org
#+BEAMER: \transdissolve
#+INCLUDE: "../../common/modules/swh-goals-oneslide-vertical.org::#goals" :only-contents t :minlevel 3
** An international, non-profit initiative\hfill built for the long term
:PROPERTIES:
:CUSTOM_ID: support
:END:
*** Sharing the vision :B_block:
:PROPERTIES:
:CUSTOM_ID: endorsement
:BEAMER_COL: .5
:BEAMER_env: block
:END:
#+LATEX: \begin{center}{\includegraphics[width=\extblockscale{.4\linewidth}]{unesco_logo_en_285}}\end{center}
#+LATEX: \vspace{-0.8cm}
#+LATEX: \begin{center}\vskip 1em \includegraphics[width=\extblockscale{1.4\linewidth}]{support.pdf}\end{center}
#+latex:\mbox{}~~~~~~~\tiny\url{www.softwareheritage.org/support/testimonials}
*** Donors, members, sponsors :B_block:
:PROPERTIES:
:CUSTOM_ID: sponsors
:BEAMER_COL: .5
:BEAMER_env: block
:END:
#+LATEX: \begin{center}\includegraphics[width=\extblockscale{.4\linewidth}]{inria-logo-new}\end{center}
#+LATEX: \begin{center}
#+LATEX: \colorbox{white}{\includegraphics[width=\extblockscale{1.4\linewidth}]{sponsors.pdf}}
#+latex:\mbox{}~~~~~~~\tiny\url{www.softwareheritage.org/support/sponsors}
#+LATEX: \end{center}
** Status :B_ignoreheading:
:PROPERTIES:
:BEAMER_env: ignoreheading
:END:
#+INCLUDE: "../../common/modules/status-extended.org::#archivinggoals" :minlevel 2
#+INCLUDE: "../../common/modules/status-extended.org::#architecture" :minlevel 2 :only-contents t
#+INCLUDE: "../../common/modules/status-extended.org::#merkletree" :minlevel 2
#+INCLUDE: "../../common/modules/status-extended.org::#datamodel" :only-contents t
#+INCLUDE: "../../common/modules/status-extended.org::#dagdetailsmall" :only-contents t
#+INCLUDE: "../../common/modules/status-extended.org::#archive" :minlevel 2
* Querying the archive
** Exploitation
#+BEAMER: \LARGE \centering
How do you query the Software Heritage archive?
#+BEAMER: \Large \\
(on a budget)
** Use cases --- product needs
e.g., for https://archive.softwareheritage.org
*** Browsing
- =ls=
- =git log= (Linux kernel: 800K+ commits)
*** Wayback machine
- tarball
- =git bundle= (Linux kernel: 7M+ nodes)
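Producing a tarball or a bundle boils down to collecting every node reachable from a root. A minimal sketch, assuming the graph fits in a plain Python dict; the node names and the =reachable= helper are illustrative, not part of the archive API:

#+begin_src python
from collections import deque

# Toy Merkle DAG: each node maps to the nodes it points to.
# Node names are hypothetical stand-ins for real identifiers.
GRAPH = {
    "rev2": ["rev1", "dir2"],
    "rev1": ["dir1"],
    "dir2": ["file_a", "file_b"],
    "dir1": ["file_a"],
    "file_a": [],
    "file_b": [],
}


def reachable(graph, root):
    """Return every node reachable from root (root included), via BFS."""
    seen = {root}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for succ in graph.get(node, []):
            if succ not in seen:  # shared subtrees are visited only once
                seen.add(succ)
                queue.append(succ)
    return seen


# Everything a bundle rooted at the latest revision would have to contain
bundle = reachable(GRAPH, "rev2")
#+end_src

The =seen= set matters: in a Merkle DAG identical files and directories are shared, so a naive tree walk would revisit them many times.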
*** Provenance tracking
- commit provenance (one/all contexts)
- requires backtracking
- origin provenance (one/all contexts)
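Backtracking can be sketched as a traversal over backward edges. The example below is only an illustration under simplifying assumptions: the graph is a small in-memory dict, node kinds are encoded as name prefixes, and =ancestors_of_kind= is a hypothetical helper.

#+begin_src python
# Backward edges of a toy archive graph: node -> nodes pointing to it.
# Names and prefixes ("rev", "origin_") are hypothetical.
PREDECESSORS = {
    "file_a": ["dir1", "dir2"],
    "dir1": ["rev1"],
    "dir2": ["rev2"],
    "rev1": ["origin_x"],
    "rev2": ["origin_x", "origin_y"],
    "origin_x": [],
    "origin_y": [],
}


def ancestors_of_kind(predecessors, node, prefix):
    """Backtrack from node; return the sorted ancestors whose name starts with prefix."""
    seen = {node}
    stack = [node]
    found = set()
    while stack:
        current = stack.pop()
        if current != node and current.startswith(prefix):
            found.add(current)
        for pred in predecessors.get(current, []):
            if pred not in seen:
                seen.add(pred)
                stack.append(pred)
    return sorted(found)


# "All contexts": every commit and every origin that contains file_a
commits = ancestors_of_kind(PREDECESSORS, "file_a", "rev")
origins = ancestors_of_kind(PREDECESSORS, "file_a", "origin_")
# "One context": any single answer is enough
one_origin = origins[0] if origins else None
#+end_src

For the "one context" variant a real implementation could stop at the first match instead of exhausting the traversal.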
*** Note
We therefore need both the Merkle DAG and its *transpose*
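Reversing every edge of a graph is straightforward on a toy adjacency list. This is a sketch only, with a hypothetical =transpose= helper and made-up node names; it says nothing about how to do it at archive scale.

#+begin_src python
def transpose(graph):
    """Return the graph with every edge u -> v turned into v -> u."""
    # Start with all known nodes so that nodes without incoming edges survive
    transposed = {node: [] for node in graph}
    for node, successors in graph.items():
        for succ in successors:
            transposed.setdefault(succ, []).append(node)
    return transposed


# Forward edges: revision -> directory -> files (hypothetical names)
FORWARD = {
    "rev1": ["dir1"],
    "dir1": ["file_a", "file_b"],
    "file_a": [],
    "file_b": [],
}
BACKWARD = transpose(FORWARD)
#+end_src

Browsing follows =FORWARD=, provenance queries follow =BACKWARD=.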
** Use cases --- research questions
*** For the sake of it
- local graph topology
- connected component size
- enabling questions for choosing the best approach (e.g., scale-up
  vs. scale-out) to large-scale analyses
- any other emerging property
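As an illustration of the connected component question, here is a minimal sketch that measures component sizes on a toy edge list, ignoring edge direction. The =component_sizes= helper and the edge list are assumptions made for the example.

#+begin_src python
from collections import deque


def component_sizes(edges):
    """Sizes of the weakly connected components, largest first."""
    # Build an undirected view of the graph
    neighbours = {}
    for src, dst in edges:
        neighbours.setdefault(src, set()).add(dst)
        neighbours.setdefault(dst, set()).add(src)

    seen = set()
    sizes = []
    for start in neighbours:
        if start in seen:
            continue
        # BFS over one component, counting its nodes
        seen.add(start)
        queue = deque([start])
        size = 0
        while queue:
            node = queue.popleft()
            size += 1
            for other in neighbours[node]:
                if other not in seen:
                    seen.add(other)
                    queue.append(other)
        sizes.append(size)
    return sorted(sizes, reverse=True)


# Hypothetical edges: two revisions share dir1, rev3 is isolated from them
EDGES = [("rev1", "dir1"), ("dir1", "file_a"), ("rev2", "dir1"), ("rev3", "dir3")]
SIZES = component_sizes(EDGES)
#+end_src

A single giant component would argue for scale-up processing, many small ones would make scale-out easier; the sketch only shows how the measurement itself works.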
*** Software Engineering topics
- software provenance analysis at this scale is still largely unexplored
- industry frontier: increase granularity down to the individual line of
  code
- replicate at this scale (famous) studies that have generally been
  conducted on (much) smaller version control system samples, to
  confirm or refute their findings
- ...
#+INCLUDE: "../../common/modules/dataset.org::#main" :minlevel 2 :only-contents t
#+INCLUDE: "../../common/modules/dataset.org::#morequery" :minlevel 2 :only-contents t
** Discussion
- one /can/ query such a corpus SQL-style
- but the relational representation shows its limits at this scale
  - ...at least as deployed on commercial SQL offerings such as Athena
  - note: (naive) sharding is ineffective, due to the pseudo-random
    distribution of node identifiers
- experiments with Google BigQuery are ongoing
  - (we broke it at the first import attempt, due to very large arrays in
    directory entry tables)
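The sharding remark can be illustrated with a small experiment. This is a sketch under stated assumptions: SHA-1 hashes of made-up labels stand in for real node identifiers, and "identifier modulo shard count" is just one possible naive scheme.

#+begin_src python
import hashlib


def node_id(label):
    """Hash-based identifier, standing in for an intrinsic (Merkle) id."""
    return hashlib.sha1(label.encode()).hexdigest()


def shard_of(identifier, n_shards):
    """Naive sharding: the identifier, read as an integer, modulo the shard count."""
    return int(identifier, 16) % n_shards


def cross_shard_fraction(edges, n_shards):
    """Fraction of edges whose two endpoints live on different shards."""
    crossing = sum(
        1
        for src, dst in edges
        if shard_of(node_id(src), n_shards) != shard_of(node_id(dst), n_shards)
    )
    return crossing / len(edges)


# A linear history of 1000 hypothetical commits, each pointing to its parent
EDGES = [(f"commit-{i + 1}", f"commit-{i}") for i in range(1000)]

# With uniformly distributed identifiers, roughly 1 - 1/16 of the edges
# are expected to cross shard boundaries
FRACTION = cross_shard_fraction(EDGES, n_shards=16)
#+end_src

Since hash-based identifiers carry no locality, even walking a single linear history hops between shards at almost every step.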
* Graph compression
#+INCLUDE: "../../common/modules/graph-compression.org::#main" :minlevel 2 :only-contents t
* Conclusion
#+INCLUDE: "this/roadmap.org" :minlevel 2
** Wrapping up
#+latex: \vspace{-2mm}
***
- Software Heritage archives all public source code as a huge Merkle DAG
- Querying and analyzing it at scale (20 B nodes, 300 B edges) is an open
  problem
- Gold mine of research leads in software engineering, complex networks,
  big code, reproducibility
#+latex: \vspace{-2mm}
*** References
#+latex: \vspace{-1mm}
#+BEGIN_EXPORT latex
\begin{thebibliography}{}
\scriptsize
\bibitem{Abramatic2018} Jean-François Abramatic, Roberto Di Cosmo, Stefano Zacchiroli
\newblock Building the Universal Archive of Source Code
\newblock Communications of the ACM, October 2018
\bibitem{Pietri2019} Antoine Pietri, Diomidis Spinellis, Stefano Zacchiroli
\newblock The Software Heritage graph dataset: public software development under one roof
\newblock MSR 2019: 16th Intl. Conf. on Mining Software Repositories. IEEE
\bibitem{Boldi2020} Paolo Boldi, Antoine Pietri, Sebastiano Vigna, Stefano Zacchiroli
\newblock Ultra-Large-Scale Repository Analysis via Graph Compression
\newblock SANER 2020, 27th Intl. Conf. on Software Analysis, Evolution and Reengineering. IEEE
\end{thebibliography}
#+END_EXPORT
*** Contacts
Stefano Zacchiroli / [[https://upsilon.cc/~zack/][upsilon.cc]] / [[mailto:zack@upsilon.cc][zack@upsilon.cc]] / [[https://twitter.com/zacchiro][@zacchiro]]
* Appendix :B_appendix:
:PROPERTIES:
:BEAMER_env: appendix
:END: