diff --git a/talks-public/2018-12-10-BENEVOL/2018-12-10-BENEVOL.org b/talks-public/2018-12-10-BENEVOL/2018-12-10-BENEVOL.org
index a4fffd8..bb1661f 100644
--- a/talks-public/2018-12-10-BENEVOL/2018-12-10-BENEVOL.org
+++ b/talks-public/2018-12-10-BENEVOL/2018-12-10-BENEVOL.org
@@ -1,138 +1,135 @@
 #+COLUMNS: %40ITEM %10BEAMER_env(Env) %9BEAMER_envargs(Env Args) %10BEAMER_act(Act) %4BEAMER_col(Col) %10BEAMER_extra(Extra) %8BEAMER_opt(Opt)
 #+TITLE: Towards Universal Software Evolution Analysis
 # #+SUBTITLE: Analyzing All the Code Source with Software Heritage
 #+BEAMER_HEADER: \date[10/12/2018, BENEVOL2018]{10 December 2018\\Belgium-Netherlands Software Evolution Workshop\\Delft, Netherlands}
 #+DATE: 10 December 2018
 #+INCLUDE: "../../common/modules/prelude.org" :minlevel 1
 #+INCLUDE: "../../common/modules/169.org"
 #+BEAMER_HEADER: \institute[Inria]{\\[-5mm]Inria --- Software Heritage\\{\tt antoine.pietri@softwareheritage.org\\zack@upsilon.cc}}
 #+BEAMER_HEADER: \author{Antoine Pietri \and Stefano Zacchiroli}
 #+LATEX_HEADER_EXTRA: \usepackage{tikz}
 #+LATEX_HEADER_EXTRA: \usetikzlibrary{arrows,shapes}
 #+LATEX_HEADER_EXTRA: \definecolor{swh-orange}{RGB}{254,205,27}
 #+LATEX_HEADER_EXTRA: \definecolor{swh-red}{RGB}{226,0,38}
 #+LATEX_HEADER_EXTRA: \definecolor{swh-green}{RGB}{77,181,174}
 * Software Heritage
 #+INCLUDE: "../../common/modules/swh-overview-sourcecode.org::#mission" :minlevel 2
 #+INCLUDE: "../../common/modules/status-extended.org::#dataflow" :minlevel 2
 #+INCLUDE: "../../common/modules/status-extended.org::#archive" :minlevel 2
 ** A giant Merkle DAG
 #+BEAMER: \centering
 #+LATEX: \only<1>{\colorbox{white}{\includegraphics[width=.7\linewidth]{git-merkle/merkle_1.pdf}}}%
 #+LATEX: \only<2>{\colorbox{white}{\includegraphics[width=.7\linewidth]{git-merkle/merkle_2_contents.pdf}}}%
 #+LATEX: \only<3>{\colorbox{white}{\includegraphics[width=.7\linewidth]{git-merkle/merkle_3_directories.pdf}}}%
 #+LATEX: \only<4>{\colorbox{white}{\includegraphics[width=.7\linewidth]{git-merkle/merkle_4_revisions.pdf}}}%
 #+LATEX: \only<5>{\colorbox{white}{\includegraphics[width=.7\linewidth]{git-merkle/merkle_5_releases.pdf}}}%
 # #+LATEX: {\colorbox{white}{\includegraphics[width=.7\linewidth]{git-merkle/merkle_1.pdf}}}%
 * A Platform for Software Analysis
 ** A Platform for Software Analysis
 *** Building a research platform for Software Analysis.
 - Analyze all software source code artifacts (code + development history)
 - At the largest possible scale
 #+BEAMER: \pause
 *** Examples of research questions the platform should support
 - What is the average size of a README?
 - What is the average directory depth of a Java repository?
 - What files are changed often in commits named "fix: ..."?
 - What are good predictors of software becoming popular/dying?
 - What are good predictors of a software getting forked?
 - ...
 ** Research requirements
 *** Categories of requested data
 - Content (/blobs/)
 - Metadata (/file names/, /directories/)
 - History graph (/revisions/)
 - Content search (/full-text search index/)
 - Provenance (/backwards index/)
 * Challenges
 ** Data volume challenges
 *** Analysis on a local mirror
 Handling data at that scale is a problem too hard for most researchers:
 - Data hardly fits on a single machine
 - Unusual size distribution of contents (a lot of very small files: median ~3 kB) \\
   → hard to use classical distributed storage solutions
 - Graph doesn't fit in RAM \\
   → hard to do intensive processing
 - Even with enough capacity, downloading that volume of data is hard
 *** Remote computations
-- Compute on a remote server, /reduce/ the result and send it back
-- How to describe queries expressively?
+Compute on a remote server, /reduce/ the result and send it back
 ** Representation mismatch
 *** Storing everything deduplicated is storage-efficient for *archival* but *analysis tools* generally expect specific directory structures/formats.
 *** Potential solutions
 - Provide a way to "flatten" deduplicated structures
 - Keep deduplication information accessible
-- No standard format for development history \\
-  → export as VCS bundles?
-
 ** Other open questions
 *** Software provenance
 - "What are the contents in this origin" is just half the story.
 - *"What origins contain this content"*? → Walk the tree backwards
 - Tradeoff: reduce nb. of indirections while avoiding combinatorial explosions
 *** Project metadata
 - Concept of a "project" is lost in a fully-deduplicated dataset
 - How to bridge project metadata with the objects?
 *** Expressivity
 The query language has to be expressive to allow combining types of computations while minimizing roundtrips.
 ** Roadmap
 ***
 - The entire dataset is accessible in Amazon Athena (graph) and S3 (contents)
 - Will soon be made public for everyone to run queries on it
 - *Collect all the use cases* to understand usage patterns
 - Elicit a query language.
 Please, give us ideas of what requests you would like to be able to run on the archive!
 ** Come and talk to us!
 Antoine Pietri / antoine.pietri@softwareheritage.org / @seirl_
 #+BEAMER: \vspace{1cm}
 Links:
 - https://www.softwareheritage.org
 - https://archive.softwareheritage.org
 *** Footer :B_ignoreheading:
 :PROPERTIES:
 :BEAMER_env: ignoreheading
 :END:
 #+BEAMER: \scriptsize \vfill \hfill
 Slides licensed under [[https://creativecommons.org/licenses/by-sa/4.0/][Creative Commons Attribution-ShareAlike 4.0 International License]] (CC BY-SA 4.0).
 ** References
 :PROPERTIES:
 :BEAMER_OPT: fragile,allowframebreaks,label=
 :END:
+
 #+BEAMER: \scriptsize
 #+BEAMER: \nocite{*}
 #+BEAMER: \bibliographystyle{amsalpha}
 #+BEAMER: \bibliography{swh.bib}