diff --git a/talks-public/2020-01-29-Pidapalooza/2020-01-29-Pidapalooza.org b/talks-public/2020-01-29-Pidapalooza/2020-01-29-Pidapalooza.org
index e1d31be..d3d4d61 100644
--- a/talks-public/2020-01-29-Pidapalooza/2020-01-29-Pidapalooza.org
+++ b/talks-public/2020-01-29-Pidapalooza/2020-01-29-Pidapalooza.org
@@ -1,188 +1,241 @@
 #+COLUMNS: %40ITEM %10BEAMER_env(Env) %9BEAMER_envargs(Env Args) %10BEAMER_act(Act) %4BEAMER_col(Col) %10BEAMER_extra(Extra) %8BEAMER_opt(Opt)
 #+TITLE: The swh-id: a digital fingerprint identifying software source code
 #+SUBTITLE:
 #+AUTHOR: Roberto Di Cosmo
 #+EMAIL: roberto@dicosmo.org @rdicosmo @swheritage
 #+BEAMER_HEADER: \date{January 29th, 2020}
 #+BEAMER_HEADER: \title[The swh-id]{The swh-id: a digital fingerprint identifying software source code}
 #+BEAMER_HEADER: \author[{\bf Roberto Di Cosmo}, Morane Gruenpeter]{{\bf Roberto Di Cosmo}, Morane Guenpeter\\[1em]%
 #+BEAMER_HEADER: Director, Software Heritage\\Computer Science full professor, Inria and IRIF\\[-1em]}
 # #+BEAMER_HEADER: \setbeameroption{show notes on second screen}
 #+BEAMER_HEADER: \setbeameroption{hide notes}
 #+KEYWORDS: software heritage legacy preservation knowledge mankind technology
 #+LATEX_HEADER: \usepackage{tcolorbox}
 #+LATEX_HEADER: \definecolor{links}{HTML}{2A1B81}
 #+LATEX_HEADER: \hypersetup{colorlinks,linkcolor=,urlcolor=links}
 #
 # prelude.org contains all the information needed to export the main beamer latex source
 # use prelude-toc.org to get the table of contents
 #
 #+INCLUDE: "../../common/modules/prelude-toc.org" :minlevel 1
 #+INCLUDE: "../../common/modules/169.org"
 # +LaTeX_CLASS_OPTIONS: [aspectratio=169,handout,xcolor=table]
 #+LATEX_HEADER: \usepackage{bbding}
 #+LATEX_HEADER: \DeclareUnicodeCharacter{66D}{\FiveStar}
 #
 # If you want to change the title logo it's here
 #
 # +BEAMER_HEADER: \titlegraphic{\includegraphics[width=0.7\textwidth]{SWH-logo}}
 # aspect ratio can be changed, but the slides need to be adapted
 # - compute a "resizing factor" for the images (macro for picblocks?)
 #
 # set the background image
 #
 # https://pacoup.com/2011/06/12/list-of-true-169-resolutions/
 #
 #+BEAMER_HEADER: \pgfdeclareimage[height=90mm,width=160mm]{bgd}{swh-world-169.png}
 #+BEAMER_HEADER: \setbeamertemplate{background}{\pgfuseimage{bgd}}
 #+LATEX: \addtocounter{framenumber}{-1}
-
-* The Software Heritage initiative
+* Software as heritage
+** Source Code: /executable/ and /human readable/ knowledge
+#+INCLUDE: "../../common/modules/source-code-different-short.org::#thesourcecode" :only-contents t :minlevel 3
+***
+ Len Shustek, CHM\hfill /“Source code provides a view into the mind of the designer.”/
+** The Paris call: Software Source Code is part of our Heritage
+ #+INCLUDE: "../../common/modules/paris-call-2019.org::#pariscall2019" :only-contents t :minlevel 3
+* Preserving all software source code
 #+INCLUDE: "../../common/modules/swh-goals-oneslide-vertical.org::#goals" :minlevel 2
-** A principled infrastructure \hfill \url{http://bit.ly/swhpaper}
+** A principled infrastructure \hfill \url{http://bit.ly/swhpaper} :noexport:
 #+latex: \begin{center}
 #+ATTR_LATEX: :width 0.5\linewidth
 file:SWH-as-foundation-slim.png
 #+latex: \end{center}
 #+BEAMER: \pause
 #+latex: \centering
 #+ATTR_LATEX: :width \extblockscale{.7\linewidth}
 file:growth.png
 #+BEAMER: \pause
 *** Technology
 :PROPERTIES:
 :BEAMER_col: 0.34
 :BEAMER_env: block
 :END:
 - transparency and FOSS
 - replicas all the way down
 *** Content (billions!)
 :PROPERTIES:
 :BEAMER_col: 0.32
 :BEAMER_env: block
 :END:
 - *intrinsic identifiers*
 - facts and provenance
 *** Organization
 :PROPERTIES:
 :BEAMER_col: 0.33
 :BEAMER_env: block
 :END:
 - non-profit
 - multi-stakeholder
-* The Knowledge is in the Source Code
-** The knowledge is in the source code!
-#+INCLUDE: "../../common/modules/source-code-different-short.org::#thesourcecode" :only-contents t :minlevel 3
 ** Source code is /special/
-*** /Executable/ and /human readable/ knowledge \hfill copyright law
- /“Programs must be written for people to read, and only incidentally for machines to execute.”/\\
- \hfill Harold Abelson
-#+BEAMER: \pause
 *** Software /evolves/ over time
 - projects may last decades
 - the /development history/ is key to its /understanding/
 #+BEAMER: \pause
 *** Complexity :B_picblock:
 :PROPERTIES:
 :BEAMER_env: picblock
- :BEAMER_OPT: pic=python3-matplotlib.pdf, width=.6\linewidth
+ :BEAMER_OPT: pic=python3-matplotlib.pdf, width=.45\linewidth
 :END:
 - /millions/ of lines of code
 - large /web of dependencies/
- + easy to break, difficult to maintain
 - sophisticated /developer communities/
-
-# ** How we built our scientific knowledge
-# reproducibility and scientific knowledge pillars (one slide)
-#+INCLUDE: "../../common/modules/swh-scientific-reproducibility.org::#main" :only-contents t :minlevel 2
-#
-
-
-* Challenges
-** Much more complex than it seems
-*** Software is complex
- - Structure :: monolithic/composite; self-contained/external dependencies
- - Lifetime :: one-shot/long term
- - Community :: one man/one team/distributed community
- - Authorship :: complex set of roles
- - Authority :: institutions/organizations/communities/single person
 #+BEAMER: \pause
-*** Various granularities
- - Exact status of the source code :: for reproducibility, e.g.
-#+latex: \emph{``you can find at \href{https://archive.softwareheritage.org/swh:1:cnt:cdf19c4487c43c76f3612557d4dc61f9131790a4;lines=146-187/}{swh:1:cnt:cdf19c4487c43c76f3612557d4dc61f9131790a4;lines=146-187} the core algorithm used in this article''}
-
- - (Major) release :: \emph{``This functionality is available in OCaml version 4''}
-
- - Project :: \emph{``Inria has created OCaml and Scikit-Learn''}.
-** We are not alone
-*** Research Software does not exist in isolation :B_picblock:
+*** Bottom line
+ - we must archive /all/ the source code
+ - we must preserve /all/ the history of its development
+ - we must /identify/ all the archived software artifacts (more than 20 billion today!)
+ \hfill how can we do this?
+** Evolution of software development
+*** Version control system (VCS)
+ - records changes made to a (set of) source code file(s)
+ - allows operating on versions: diff/merge/fork/recover etc.
+ - *essential* tool for software development
+*** Three decades of evolution
+ - Local VCS :: \mbox{}\\
+ RCS (1982)
+ - Centralised VCS :: \mbox{}\\
+ CVS (1990), Subversion (2000)
+ - Distributed VCS :: \mbox{}\\
+ Git (2005), Mercurial (2005), Bazaar (2005)
+** In a picture \hfill (from https://github.com/progit/progit2) :noexport:
+ #+BEGIN_EXPORT latex
+ \centering\forcebeamerstart
+ \only<1>{\colorbox{white}{\includegraphics[width=\extblockscale{.5\linewidth}]{localvcs}}\mbox{}\\[2em]
+ \texttt{co -r1.2 file.c}
+ }
+ \only<2>{\colorbox{white}{\includegraphics[width=\extblockscale{.5\linewidth}]{centralisedvcs}}\mbox{}\\[2em]
+ \texttt{cvs co -r Rel-1A ProgABC}
+ }
+ \only<3>{\colorbox{white}{\includegraphics[width=\extblockscale{.5\linewidth}]{distvcs}}\mbox{}\\[2em]
+ \texttt{git checkout df3b1b08f756569eff0919e37d8af1f403515b31}
+ }
+ \forcebeamerend
+ #+END_EXPORT
+** Foundations of modern DVCS
+**** Requirements for the D in DVCS
+ - *intrinsic* unique identifiers... \hfill (here: /cryptographic signature/, aka "hash")
+ - ... that work for *tree structures* (software directories)
+ #+BEAMER: \pause
+ # R. C. Merkle, A digital signature based on a conventional encryption
+ # function, Crypto '87
+**** Merkle tree to the rescue (R. C. Merkle, Crypto 1979) :B_picblock:
+ :PROPERTIES:
+ :BEAMER_opt: pic=merkle, leftpic=true, width=.7\linewidth
+ :BEAMER_env: picblock
+ :BEAMER_act:
+ :END:
+ Combination of
+ - tree
+ - hash function
+** A massive adoption
+*** GitHub today
+ - *100.000.000* repositories
+ - *40.000.000* developers worldwide
+ See https://octoverse.github.com/2017/
+***
+ \hfill Let's use it!
+* The SWH-ID: the source code fingerprint
+** The SWH-ID schema
+ # TODO: drawing with swh:1:cnt:xxxxxxx "exploded" and explained
+** A worked example
+ #+LATEX: \centering\forcebeamerstart
+ #+LATEX: \only<1>{\colorbox{white}{\includegraphics[width=\extblockscale{\linewidth}]{git-merkle/merkle_1.pdf}}}
+ #+LATEX: \only<2>{\colorbox{white}{\includegraphics[width=\extblockscale{\linewidth}]{git-merkle/contents.pdf}}}
+ #+LATEX: \only<3>{\colorbox{white}{\includegraphics[width=\extblockscale{\linewidth}]{git-merkle/merkle_2_contents.pdf}}}
+ #+LATEX: \only<4>{\colorbox{white}{\includegraphics[width=\extblockscale{\linewidth}]{git-merkle/directories.pdf}}}
+ #+LATEX: \only<5>{\colorbox{white}{\includegraphics[width=\extblockscale{\linewidth}]{git-merkle/merkle_3_directories.pdf}}}
+ #+LATEX: \only<6>{\colorbox{white}{\includegraphics[width=\extblockscale{\linewidth}]{git-merkle/revisions.pdf}}}
+ #+LATEX: \only<7>{\colorbox{white}{\includegraphics[width=\extblockscale{\linewidth}]{git-merkle/merkle_4_revisions.pdf}}}
+ #+LATEX: \only<8>{\colorbox{white}{\includegraphics[width=\extblockscale{\linewidth}]{git-merkle/releases.pdf}}}
+ #+LATEX: \only<9>{\colorbox{white}{\includegraphics[width=\extblockscale{\linewidth}]{git-merkle/merkle_5_releases.pdf}}}
+ #+LATEX: \only<10>{\colorbox{white}{\includegraphics[width=\extblockscale{\linewidth}]{git-merkle/snapshots.pdf}}}
+ #+LATEX: \forcebeamerend
+** Demo time
+***
+ Let's look at some famous excerpts of source code
+#+BEAMER: \pause
+*** Apollo 11 source code ([[https://archive.softwareheritage.org/swh:1:cnt:64582b78792cd6c2d67d35da5a11bb80886a6409;origin=https://github.com/virtualagc/virtualagc;lines=245-261/][excerpt]]) :B_block:BMCOL:
 :PROPERTIES:
- :BEAMER_env: picblock
- :BEAMER_OPT: pic=python3-matplotlib.pdf, width=.6\linewidth, leftpic=true
+ :BEAMER_col: 0.48
+ :BEAMER_env: block
 :END:
- large /web of dependencies/ on non-research software
+ #+LATEX: \includegraphics[width=\linewidth]{apollo-11-cranksilly.png}
+ # excerpt of routine that asks astronaut to turn around the LEM
 #+BEAMER: \pause
-*** Industry and developers have been here :B_block:
+*** Quake III source code ([[https://archive.softwareheritage.org/swh:1:cnt:bb0faf6919fc60636b2696f32ec9b3c2adb247fe;origin=https://github.com/id-Software/Quake-III-Arena;lines=549-572/][excerpt]]) :B_block:BMCOL:
 :PROPERTIES:
+ :BEAMER_col: 0.45
 :BEAMER_env: block
- :BEAMER_COL: .5
 :END:
- - NSRL (NIST)
- - SPDX (Linux Foundation)
- - SWH-ID (Software Heritage)
- - SWID (ISO Standard)
- - Wikidata Software Properties
+ #+LATEX: \includegraphics[width=\linewidth]{quake-carmack-sqrt-1.png}
+ # smart efficient implementation of 1/sqrt(x) on a CPU without special support
 #+BEAMER: \pause
-*** We must :B_block:
+*** :B_ignoreheading:
 :PROPERTIES:
- :BEAMER_env: block
- :BEAMER_COL: .5
+ :BEAMER_env: ignoreheading
 :END:
- - accept the complexity
- - avoid reinventing the wheel
- - connect with existing communities of practice
-
-* Extrinsic vs Intrinsic identifiers
+*** It works!
+ we have /intrinsic/ identifiers for all 20+ billion objects in the archive
+* Conclusion
+** Food for thought
+*** Intrinsic identifiers...
+ - can be extracted from the object itself, hence:
+ - no need for a central authority, nor maintenance
+ - any modification to the object changes the identifier
+ - identifies the object, not the metadata!
+*** ... /for source code/
+ - Distributed Version Control Systems made them popular
+ - massively used every day by millions of software developers
+ - Software Heritage provides SWH-IDs for billions of software artifacts
+#+BEAMER: \pause
+*** Intrinsic identifiers existed before!
+ - TODO: add images
+* Extrinsic vs Intrinsic identifiers :noexport:
 ** An important distinction: DIOs vs. IDOs
 :PROPERTIES:
 :CUSTOM_ID: diovsido
 :END:
 #+BEGIN_EXPORT latex
 \begin{quote}
 The term “Digital Object Identifier” is construed as “digital identifier of an object," rather than “identifier of a digital object” \hfill Norman Paskin. 2010
 \end{quote}
 #+END_EXPORT
 #+BEAMER: \pause
 *** DIO (Digital Identifier of an Object)
 digital identifiers for (potentially) *non digital objects*
 - epistemic complexity (manifestations, versions, locations, etc.)
 - need an authority to ensure persistence and uniqueness
 #+BEAMER: \pause
 *** IDO (Identifier of a Digital Object)
 digital identifiers (only) for *digital objects*
 - can provide both *integrity* and *no middle man*
 - broadly used in modern software development (git, etc.)
 ** An important distinction: DIOs vs. IDOs
 #+latex: \begin{center}
 #+ATTR_LATEX: :width 0.859\linewidth
 file:DIOvsIDO.png
 #+latex: \end{center}
 #+BEAMER: \pause
 \hfill for the core Software Heritage archive, *IDOs are enough*
 ** Intrinsic: what does it really mean?
 Examples of intrinsic identifiers (DNA, music notes, etc.)
-* The SWH-ID: the source code fingerprint
-** the origins
-** an overview of the archive data model
-** parmap showcase
-** swh-identify: how to find a digital object's intrinsic identifier
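
Note on the "Intrinsic identifiers" slides: the claim that such an identifier "can be extracted from the object itself" can be illustrated for a single file. The sketch below is a minimal one, assuming that a `swh:1:cnt:` identifier is the SHA-1 of the file content prefixed with a Git-style `blob <length>\0` header (the same value Git computes for a blob). The function name `swhid_for_content` is illustrative only and is not taken from any Software Heritage tool.

```python
import hashlib


def swhid_for_content(data: bytes) -> str:
    """Return a swh:1:cnt: identifier for raw file content.

    Assumption: the identifier is the Git blob hash, i.e. the SHA-1 of
    b"blob <length>\\0" followed by the content bytes.
    """
    header = b"blob " + str(len(data)).encode("ascii") + b"\0"
    digest = hashlib.sha1(header + data).hexdigest()
    return "swh:1:cnt:" + digest


if __name__ == "__main__":
    # No registry or central authority is consulted: the bytes alone
    # determine the identifier.
    print(swhid_for_content(b"hello world\n"))
    # Changing a single character yields a different identifier.
    print(swhid_for_content(b"hello world!\n"))
```

Because the identifier depends only on the bytes, anyone holding a copy of the file can recompute it and compare it with the one cited, which is the integrity property the slides attribute to intrinsic identifiers.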