diff --git a/talks-public/2018-12-10-BENEVOL/2018-12-10-BENEVOL.org b/talks-public/2018-12-10-BENEVOL/2018-12-10-BENEVOL.org
new file mode 100644
index 0000000..4813e62
--- /dev/null
+++ b/talks-public/2018-12-10-BENEVOL/2018-12-10-BENEVOL.org
@@ -0,0 +1,125 @@
+#+COLUMNS: %40ITEM %10BEAMER_env(Env) %9BEAMER_envargs(Env Args) %10BEAMER_act(Act) %4BEAMER_col(Col) %10BEAMER_extra(Extra) %8BEAMER_opt(Opt)
+#+TITLE: Towards Universal Software Evolution Analysis
+# #+SUBTITLE: Analyzing All the Source Code with Software Heritage
+#+BEAMER_HEADER: \date[10/12/2018, BENEVOL2018]{10 December 2018\\Belgium-Netherlands Software Evolution Workshop\\Delft, Netherlands}
+#+DATE: 10 December 2018
+
+#+INCLUDE: "../../common/modules/prelude.org" :minlevel 1
+#+INCLUDE: "../../common/modules/169.org"
+#+BEAMER_HEADER: \institute[Inria]{\\[-5mm]Inria --- Software Heritage\\{\tt antoine.pietri@softwareheritage.org}}
+#+BEAMER_HEADER: \author{Antoine Pietri}
+
+#+LATEX_HEADER_EXTRA: \usepackage{tikz}
+#+LATEX_HEADER_EXTRA: \usetikzlibrary{arrows,shapes}
+#+LATEX_HEADER_EXTRA: \definecolor{swh-orange}{RGB}{254,205,27}
+#+LATEX_HEADER_EXTRA: \definecolor{swh-red}{RGB}{226,0,38}
+#+LATEX_HEADER_EXTRA: \definecolor{swh-green}{RGB}{77,181,174}
+
+* Software Heritage
+ #+INCLUDE: "../../common/modules/swh-overview-sourcecode.org::#mission" :minlevel 2
+ #+INCLUDE: "../../common/modules/status-extended.org::#dataflow" :minlevel 2
+ #+INCLUDE: "../../common/modules/status-extended.org::#archive" :minlevel 2
+
+** A giant Merkle DAG
+ # #+BEAMER: \centering
+ #+LATEX: \only<1>{\colorbox{white}{\includegraphics[width=.7\linewidth]{git-merkle/merkle_1.pdf}}}
+ #+LATEX: \only<2>{\colorbox{white}{\includegraphics[width=.7\linewidth]{git-merkle/merkle_2_contents.pdf}}}
+ #+LATEX: \only<3>{\colorbox{white}{\includegraphics[width=.7\linewidth]{git-merkle/merkle_3_directories.pdf}}}
+ #+LATEX: \only<4>{\colorbox{white}{\includegraphics[width=.7\linewidth]{git-merkle/merkle_4_revisions.pdf}}}
+ #+LATEX: \only<5>{\colorbox{white}{\includegraphics[width=.7\linewidth]{git-merkle/merkle_5_releases.pdf}}}
+# #+LATEX: {\colorbox{white}{\includegraphics[width=.7\linewidth]{git-merkle/merkle_1.pdf}}}
+
+* A Platform for Software Analysis
+** A Platform for Software Analysis
+
+***
+ *Goal*: build a research platform for Software Analysis.
+
+#+BEAMER: \pause
+*** Questions we want to be able to answer:
+ - What is the average size of a README?
+ - What is the average directory depth of a Java repository?
+ - What files are changed often in commits named "fix: ..."?
+ - What are good predictors of software becoming popular/dying?
+ - What are good predictors of software getting forked?
+ - ...
+
+** Research requirements
+
+*** Categories of requested data
+
+- Content (/blobs/)
+- Metadata (/file names/, /directories/)
+- History graph (/revisions/)
+- Content search (/full-text search index/)
+- Provenance (/backwards index/)
+
+* Challenges
+** Data volume challenges
+
+*** Analysis on a local mirror
+Handling data at that scale is too hard a problem for most researchers:
+
+- Data barely fits on a single machine
+- Unusual size distribution of blobs (~3 kB compressed) \\
+ → hard to use classical distributed storage solutions
+- Graph doesn't fit in RAM \\
+ → hard to do intensive processing
+- Even with enough capacity, how can we send you so much data?
+
+*** Remote computations
+- Compute queries externally, /reduce/ the result and send it back
+- How to describe those queries expressively?
+
+** Representation mismatch
+
+***
+Storing everything deduplicated is great for *archival* but *analysis
+tools* generally expect specific directory structures/formats.
+
+*** Potential solutions
+- Provide a way to "flatten" deduplicated structures
+- Keep deduplication information accessible
+- No real standard for the revision graph?
+
+
+** Other open questions
+
+*** Provenance mappings
+- "What is the content of this revision" is just half the story.
+- *"What revisions contain this content"*? → Walk the tree backwards
+- Tradeoff: reduce the number of indirections while avoiding combinatorial explosions
+
+*** Project metadata
+- Concept of a "project" is lost in a fully-deduplicated dataset
+- How to bridge project metadata with our objects?
+
+*** Expressivity
+Our query language has to be expressive enough to combine different types of
+computations while minimizing roundtrips.
+
+** Use case collection
+
+***
+We want to *collect all the use cases* to understand usage patterns, and elicit
+a query language.
+
+/Please/, give us ideas of what requests you would like to be able to run on
+the archive!
+
+** Come and talk to us!
+ - Antoine Pietri / antoine.pietri@softwareheritage.org / @seirl_
+ - Stefano Zacchiroli / zack@upsilon.cc / @zacchiro
+
+ Links:
+ - https://www.softwareheritage.org
+ - https://archive.softwareheritage.org
+ - https://www.softwareheritage.org/support/sponsors/
+*** Footer :B_ignoreheading:
+ :PROPERTIES:
+ :BEAMER_env: ignoreheading
+ :END:
+ #+BEAMER: \scriptsize \vfill \hfill
+ Slides licensed under
+ [[https://creativecommons.org/licenses/by-sa/4.0/][Creative Commons
+ Attribution-ShareAlike 4.0 International License]] (CC BY-SA 4.0).
diff --git a/talks-public/2018-12-10-BENEVOL/Makefile b/talks-public/2018-12-10-BENEVOL/Makefile
new file mode 100644
index 0000000..68fbee7
--- /dev/null
+++ b/talks-public/2018-12-10-BENEVOL/Makefile
@@ -0,0 +1 @@
+include ../Makefile.slides