diff --git a/talks-public/2021-04-21-RDA-Data-Granularity/2021-04-21-RDA-Data-Granularity.org b/talks-public/2021-04-21-RDA-Data-Granularity/2021-04-21-RDA-Data-Granularity.org new file mode 100644 index 0000000..01741c5 --- /dev/null +++ b/talks-public/2021-04-21-RDA-Data-Granularity/2021-04-21-RDA-Data-Granularity.org @@ -0,0 +1,342 @@ +#+COLUMNS: %40ITEM %10BEAMER_env(Env) %9BEAMER_envargs(Env Args) %10BEAMER_act(Act) %4BEAMER_col(Col) %10BEAMER_extra(Extra) %8BEAMER_opt(Opt) +#+TITLE: Software Granularity +#+SUBTITLE:Intrinsic vs. extrinsic identifiers for different granularity levels +#+AUTHOR: Roberto Di Cosmo +#+EMAIL: roberto@dicosmo.org @rdicosmo @swheritage +#+BEAMER_HEADER: \date[April 21st, 2021]{April 21st, 2021\\[-1em]} +#+BEAMER_HEADER: \title[Software Granularity]{Software Granularity} +#+BEAMER_HEADER: \author[Roberto Di Cosmo]{Roberto Di Cosmo\\[1em]} +#+KEYWORDS: software heritage legacy preservation knowledge mankind technology SWHID granularity +#+LATEX_HEADER: \usepackage{tcolorbox} +#+LATEX_HEADER: \definecolor{links}{HTML}{2A1B81} +#+LATEX_HEADER: \hypersetup{colorlinks,linkcolor=,urlcolor=links} +# +# prelude.org contains all the information needed to export the main beamer latex source +# use prelude-toc.org to get the table of contents +# + +#+INCLUDE: "../../common/modules/prelude.org" :minlevel 1 + + +#+INCLUDE: "../../common/modules/169.org" + +# +LaTeX_CLASS_OPTIONS: [aspectratio=169,handout,xcolor=table] +#+LATEX_HEADER: \usepackage{bbding} +#+LATEX_HEADER: \usepackage{tcolorbox} +#+LATEX_HEADER: \DeclareUnicodeCharacter{66D}{\FiveStar} + + +# +# If you want to change the title logo it's here +# +# +BEAMER_HEADER: \titlegraphic{\includegraphics[width=0.7\textwidth]{SWH-logo}} + +# aspect ratio can be changed, but the slides need to be adapted +# - compute a "resizing factor" for the images (macro for picblocks?) 
+# +# set the background image +# +# https://pacoup.com/2011/06/12/list-of-true-169-resolutions/ +# +#+BEAMER_HEADER: \pgfdeclareimage[height=90mm,width=160mm]{bgd}{swh-world-169.png} +#+BEAMER_HEADER: \setbeamertemplate{background}{\pgfuseimage{bgd}} +#+LATEX: \addtocounter{framenumber}{-1} + + + +* Software Source code: A pillar of Open Science +** Software Source code: pillar of Open Science +*** Three pillars of Open Science :B_block: + :PROPERTIES: + :BEAMER_env: block + :BEAMER_COL: .4 + :END: +#+latex: \begin{center} +#+ATTR_LATEX: :width \extblockscale{1.4\linewidth} +file:preservation_triangle_color.png +#+latex: \end{center} +#+BEAMER: \pause + +*** A plurality of needs :B_block: + :PROPERTIES: + :BEAMER_env: block + :BEAMER_COL: .6 + :END: + - Researcher :: + - *archive* and *reference* software used in articles + - *find* useful software + - get *credit* for developed software + - verify/reproduce/improve results + #+BEAMER: \pause + - Laboratory/team :: track software contributions + - produce reports / web page + #+BEAMER: \pause + - Research Organization :: know its *software assets* + - technology *transfer* + - impact *metrics* +** A principled infrastructure \hfill \url{http://bit.ly/swhpaper} :noexport: + #+latex: \begin{center} + #+ATTR_LATEX: :width 0.5\linewidth + file:SWH-as-foundation-slim.png + #+latex: \end{center} + #+BEAMER: \pause + #+latex: \centering + #+ATTR_LATEX: :width \extblockscale{.7\linewidth} + file:growth.png + #+BEAMER: \pause +*** Technology + :PROPERTIES: + :BEAMER_col: 0.34 + :BEAMER_env: block + :END: + - transparency and FOSS + - replicas all the way down +*** Content (billions!) 
+ :PROPERTIES: + :BEAMER_col: 0.32 + :BEAMER_env: block + :END: + - *intrinsic identifiers* + - facts and provenance +*** Organization + :PROPERTIES: + :BEAMER_col: 0.33 + :BEAMER_env: block + :END: + - non-profit + - multi-stakeholder + +** Software is not /just/ data +*** Software has multiple facets in research + - a *tool* + - a *research outcome* or result + - the object of *study* + #+BEAMER: \pause +*** Source code is /special/ + :PROPERTIES: + :BEAMER_env: picblock + :BEAMER_OPT: pic=python3-matplotlib.pdf, width=.51\linewidth + :END: + Software *evolves* over time + - projects may last decades + - the /development history/ is key to its /understanding/ + #+BEAMER: \pause + Layers of *complexity* + - /millions/ of lines of code + - large /web of dependencies/ + - sophisticated /developer communities/ + +** What is software? What do we want to identify? +#+latex: \begin{center} \huge{} \end{center} +#+BEAMER: \pause +*** Software as a concept :B_block:BMCOL: + :PROPERTIES: + :BEAMER_col: 0.5 + :BEAMER_env: block + :END: + - software project / entity +#+BEAMER: \pause + - the creators and the community around it +#+BEAMER: \pause + +*** Software artifact :B_block:BMCOL: + :PROPERTIES: + :BEAMER_col: 0.5 + :BEAMER_env: block + :END: + - the binaries for different environments +#+BEAMER: \pause + - the *software source code* for each version + +** Evolution of software development +*** Version control system (VCS) + - records changes made to a (set of) /source code file(s)/ + - allows operating on versions: diff/merge/fork/recover etc. 
+ - *essential* tool for software development + #+BEAMER: \pause +*** Three decades of evolution +#+LATEX: \centering +#+LATEX: \includegraphics[width=.8\linewidth]{VCS_history_timeline.png} + +** In a picture \hfill (from https://github.com/progit/progit2) :noexport: + #+BEGIN_EXPORT latex + \centering\forcebeamerstart + \only<1>{\colorbox{white}{\includegraphics[width=\extblockscale{.5\linewidth}]{localvcs}}\mbox{}\\[2em] + \texttt{co -r1.2 file.c} + } + \only<2>{\colorbox{white}{\includegraphics[width=\extblockscale{.5\linewidth}]{centralisedvcs}}\mbox{}\\[2em] + \texttt{cvs co -r Rel-1A ProgABC} + } + \only<3>{\colorbox{white}{\includegraphics[width=\extblockscale{.5\linewidth}]{distvcs}}\mbox{}\\[2em] + \texttt{git checkout df3b1b08f756569eff0919e37d8af1f403515b31} + } + \forcebeamerend + #+END_EXPORT +** Foundations of modern DVCS +**** Requirements for the D in DVCS + - *intrinsic* unique identifiers... \hfill (here: /cryptographic hash/, aka "fingerprint") + - ... that work for *tree structures* (software directories) + #+BEAMER: \pause + # R. C. Merkle, A digital signature based on a conventional encryption + # function, Crypto '87 +**** Merkle tree to the rescue (R. C. 
Merkle, Crypto 1979) :B_picblock: + :PROPERTIES: + :BEAMER_opt: pic=merkle, leftpic=true, width=.7\linewidth + :BEAMER_env: picblock + :BEAMER_act: + :END: + Combination of + - tree + - hash function +** A massive adoption +*** Stack Overflow \hfill \href{https://insights.stackoverflow.com/survey/2018}{[Survey 2018]} + :PROPERTIES: + :BEAMER_col: 0.47 + :BEAMER_env: block + :END: + #+latex: \centering + #+ATTR_LATEX: :width \extblockscale{1.4\linewidth} + file:stackoverflow-survey-VCS.png + + #+BEAMER: \pause + +*** In numbers + :PROPERTIES: + :BEAMER_col: 0.45 + :BEAMER_env: block + :END: + GitHub \hfill \href{https://octoverse.github.com/2017/}{[Octoverse 2017]} \href{https://github.blog/2018-11-08-100m-repos/}{[Blog 2018]} + - *100.000.000+* repositories + - *40.000.000+* developers worldwide + + Bitbucket \hfill \href{https://bitbucket.org/blog/celebrating-10-million-bitbucket-cloud-registered-users}{[Blog 2019]} + - *28.000.000+* repositories + - *10.000.000+* developers worldwide + + GitLab \hfill \href{https://about.gitlab.com/blog/2019/06/06/1-mil-merge-requests/}{[Blog 2019]} + - *1.000.000* MRs March '19 + #+BEAMER: \pause + +*** + \hfill Let's use it! 
+* The SWH-ID: the source code fingerprint +** The SWH-ID schema + # TODO: drawing with swh:1:cnt:xxxxxxx "exploded" and explained + #+LATEX: \centering\forcebeamerstart + #+LATEX: \only<1>{\includegraphics[width=\linewidth]{SWH-ID-1.png}} + #+LATEX: \only<2>{\includegraphics[width=\linewidth]{SWH-ID-2.png}} + #+LATEX: \only<3>{\includegraphics[width=\linewidth]{SWH-ID-3.png}} + #+LATEX: \forcebeamerend +** A worked example + #+LATEX: \centering\forcebeamerstart + #+LATEX: \only<1>{\colorbox{white}{\includegraphics[width=\extblockscale{\linewidth}]{git-merkle/merkle_1.pdf}}} + #+LATEX: \only<2>{\colorbox{white}{\includegraphics[width=\extblockscale{\linewidth}]{git-merkle/contents.pdf}}} + #+LATEX: \only<3>{\colorbox{white}{\includegraphics[width=\extblockscale{\linewidth}]{git-merkle/merkle_2_contents.pdf}}} + #+LATEX: \only<4>{\colorbox{white}{\includegraphics[width=\extblockscale{\linewidth}]{git-merkle/directories.pdf}}} + #+LATEX: \only<5>{\colorbox{white}{\includegraphics[width=\extblockscale{\linewidth}]{git-merkle/merkle_3_directories.pdf}}} + #+LATEX: \only<6>{\colorbox{white}{\includegraphics[width=\extblockscale{\linewidth}]{git-merkle/revisions.pdf}}} + #+LATEX: \only<7>{\colorbox{white}{\includegraphics[width=\extblockscale{\linewidth}]{git-merkle/merkle_4_revisions.pdf}}} + #+LATEX: \only<8>{\colorbox{white}{\includegraphics[width=\extblockscale{\linewidth}]{git-merkle/releases.pdf}}} + #+LATEX: \only<9>{\colorbox{white}{\includegraphics[width=\extblockscale{\linewidth}]{git-merkle/merkle_5_releases.pdf}}} + #+LATEX: \only<10>{\colorbox{white}{\includegraphics[width=\extblockscale{\linewidth}]{git-merkle/snapshots.pdf}}} + #+LATEX: \forcebeamerend +** Demo time +*** + Let's look at some famous excerpts of source code +#+BEAMER: \pause +*** Apollo 11 source code ([[https://archive.softwareheritage.org/swh:1:cnt:64582b78792cd6c2d67d35da5a11bb80886a6409;origin=https://github.com/virtualagc/virtualagc;lines=245-261/][excerpt]]) :B_block:BMCOL: + 
:PROPERTIES: + :BEAMER_col: 0.48 + :BEAMER_env: block + :END: + #+LATEX: \includegraphics[width=\linewidth]{apollo-11-cranksilly.png} + # excerpt of routine that asks astronaut to turn around the LEM +#+BEAMER: \pause +*** Quake III source code ([[https://archive.softwareheritage.org/swh:1:cnt:bb0faf6919fc60636b2696f32ec9b3c2adb247fe;origin=https://github.com/id-Software/Quake-III-Arena;lines=549-572/][excerpt]]) :B_block:BMCOL: + :PROPERTIES: + :BEAMER_col: 0.45 + :BEAMER_env: block + :END: + #+LATEX: \includegraphics[width=\linewidth]{quake-carmack-sqrt-1.png} + # smart efficient implementation of 1/sqrt(x) on a CPU without special support +#+BEAMER: \pause +*** :B_ignoreheading: + :PROPERTIES: + :BEAMER_env: ignoreheading + :END: +*** It works! + we have /intrinsic/ identifiers for all 20+ billion objects in the archive +* Software is our heritage +#+INCLUDE: "../../common/modules/swh-goals-oneslide-vertical.org::#goals" :minlevel 2 +* Conclusion +** Food for thought +*** Intrinsic identifiers... + - can be extracted from the *object itself*, hence: + - no need for a /central authority/, nor maintenance + - any modification to the object changes the identifier + - identifies the /object/, not the /metadata/ ! +#+BEAMER: \pause +*** ... /for source code/ + - Distributed Version Control Systems made them popular + - massively used every day by millions of software developers + - Software Heritage provides *SWH-IDs* for billions of software artifacts + +** Come in, we're open! 
+*** + \url{www.softwareheritage.org} --- learn more \\ + \url{save.softwareheritage.org} --- save code now \\ + \url{www.softwareheritage.org/swhap} --- legacy software acquisition process \\ + \url{forge.softwareheritage.org} --- our own code + #+BEAMER: \vspace{-1mm} \flushright {\Huge Questions?} \vfill + +*** References :B_block: + :PROPERTIES: + :BEAMER_env: block + :END: + #+BEGIN_EXPORT latex + \begin{thebibliography}{Foo Bar, 1969} + \footnotesize + \bibitem{Abramatic2018} Jean-François Abramatic, Roberto Di Cosmo, Stefano Zacchiroli\newblock + \emph{Building the Universal Archive of Source Code},\\ + Communications of the ACM, October 2018 + \href{https://doi.org/10.1145/3183558}{(10.1145/3183558)} + \bibitem{DiCosmo2019} Roberto Di Cosmo, Morane Gruenpeter, Stefano Zacchiroli\newblock + \emph{Referencing Source Code Artifacts: a Separate Concern in Software Citation},\\ + Computing in Science and Engineering, IEEE, pp.1-9. \href{https://dx.doi.org/10.1109/MCSE.2019.2963148}{(10.1109/MCSE.2019.2963148)} + \href{https://hal.archives-ouvertes.fr/hal-02446202}{(hal-02446202)} + \end{thebibliography} + #+END_EXPORT + + + + +* Extrinsic vs Intrinsic identifiers :noexport: +** An important distinction: DIOs vs. IDOs + :PROPERTIES: + :CUSTOM_ID: diovsido + :END: +#+BEGIN_EXPORT latex + \begin{quote} + The term “Digital Object Identifier” is construed as “digital identifier of an object," rather than “identifier of a digital object” \hfill Norman Paskin. 2010 + \end{quote} +#+END_EXPORT +#+BEAMER: \pause +*** DIO (Digital Identifier of an Object) + digital identifiers for (potentially) *non digital objects* + - epistemic complexity (manifestations, versions, locations, etc.) + - need an authority to ensure persistence and uniqueness +#+BEAMER: \pause +*** IDO (Identifier of a Digital Object) + digital identifiers (only) for *digital objects* + - can provide both *integrity* and *no middle man* + - broadly used in modern software development (git, etc.) 
+** An important distinction: DIOs vs. IDOs + #+latex: \begin{center} + #+ATTR_LATEX: :width 0.859\linewidth + file:DIOvsIDO.png + #+latex: \end{center} +#+BEAMER: \pause + \hfill for the core Software Heritage archive, *IDOs are enough* + +** Intrinsic: what does it really mean? +Examples of intrinsic identifiers (DNA, music notes, etc.) diff --git a/talks-public/2021-04-21-RDA-Data-Granularity/METADATA b/talks-public/2021-04-21-RDA-Data-Granularity/METADATA new file mode 100644 index 0000000..fbdcfe1 --- /dev/null +++ b/talks-public/2021-04-21-RDA-Data-Granularity/METADATA @@ -0,0 +1,22 @@ +Title: Software Granularity - Intrinsic identifiers vs. extrinsic identifiers for different granularity levels + + + Abstract: + + During the RDA VP17, the Data Granularity WG gives the opportunity + to stakeholders to present existing approaches for segmenting datasets + and collections in some domains, and evaluate their merits and variance. + + The Software Heritage universal archive of software source code relies on + well-established techniques used in software development communities to + identify the over 20 billion code artefacts it preserves: + cryptographic hashes in a Merkle DAG data structure. + + In this mini talk we will first explain the motivations of this choice, + recalling Paskin's essential distinction between digital identifiers of + an object (DIOs) and identifiers of digital objects (IDOs). + + Then we will focus on the properties of the Software Heritage Identifiers + (SWH-IDs) that matter most in a reproducibility and long-term archival framework: + intrinsic integrity and independent verifiability. + diff --git a/talks-public/2021-04-21-RDA-Data-Granularity/Makefile b/talks-public/2021-04-21-RDA-Data-Granularity/Makefile new file mode 100644 index 0000000..68fbee7 --- /dev/null +++ b/talks-public/2021-04-21-RDA-Data-Granularity/Makefile @@ -0,0 +1 @@ +include ../Makefile.slides