diff --git a/talks-public/2019-10-10-jcad/2019-10-10-jcad.org b/talks-public/2019-10-10-jcad/2019-10-10-jcad.org new file mode 100644 index 0000000..ac04a0e --- /dev/null +++ b/talks-public/2019-10-10-jcad/2019-10-10-jcad.org @@ -0,0 +1,311 @@ +#+COLUMNS: %40ITEM %10BEAMER_env(Env) %9BEAMER_envargs(Env Args) %10BEAMER_act(Act) %4BEAMER_col(Col) %10BEAMER_extra(Extra) %8BEAMER_opt(Opt) +#+TITLE: Software Heritage +#+SUBTITLE: Source Code Archival and Analysis at the Scale of the World +#+BEAMER_HEADER: \date[10 Oct 2019, JCAD]{10 October 2019\\Journées Calcul et Données --- Toulouse, France} +#+AUTHOR: Stefano Zacchiroli +#+DATE: 10 October 2019 +#+EMAIL: zack@upsilon.cc + +#+INCLUDE: "../../common/modules/prelude.org" :minlevel 1 +#+INCLUDE: "../../common/modules/169.org" +#+BEAMER_HEADER: \institute[Software Heritage]{Software Heritage --- {\tt zack@upsilon.cc, @zacchiro}} +#+BEAMER_HEADER: \author{Stefano Zacchiroli} + +* Software Source Code: a Forgotten Pillar of Open Science +** Software Source code: pillar of Open Science +*** Software is everywhere in modern research :B_picblock: + :PROPERTIES: + :BEAMER_opt: pic=papermountain, leftpic=true, width=.3\linewidth + :BEAMER_env: picblock + :BEAMER_COL: .6 + :END: +#+BEGIN_QUOTE +[...] software [...] essential in their fields. 
+ +\mbox{}\hfill Top 100 papers (Nature, 2014) +#+END_QUOTE +#+BEGIN_QUOTE +Sometimes, if you don't have the software, you don't have the data + +\mbox{}\hfill Christine Borgman, Paris, 2018 +#+END_QUOTE +# http://www.nature.com/news/the-top-100-papers-1.16224 +#+BEAMER: \pause +*** Open Science: three pillars :B_block: + :PROPERTIES: + :BEAMER_COL: .45 + :BEAMER_env: block + :END: +#+latex: \begin{center} +#+ATTR_LATEX: :width \extblockscale{\linewidth} +file:PreservationTriangle.png +#+latex: \end{center} +#+BEAMER: \pause +*** :B_ignoreheading: + :PROPERTIES: + :BEAMER_env: ignoreheading + :END: +*** Nota bene + \hfill The links in the picture are *essential* +** Source code is /special/ +#+INCLUDE: "../../common/modules/source-code-different-short.org::#softwareisdifferent" :only-contents t :minlevel 3 +** ~ 50 years, a lightning fast growth :noexport: +#+INCLUDE: "../../common/modules/50years-source-code.org::#apollolinux" :only-contents t :minlevel 3 +** Pressure to make the source code available is rising :noexport: +*** Why + Necessary to + - /reproduce/ and verify, + - /modify/ and /evolve/, *building new experiments* from old ones +#+BEAMER: \pause +*** When and where + - debate started at the end of the 2000s (biology, statistics, medicine, etc.) + - growing in Computer Science since the [[https://www.artifact-eval.org/about.html][ESEC/FSE 2011 Artifact Evaluation context]] (winner: Vouillon and Di Cosmo) +** ACM take on Reproducibility, Replicability and Source code :noexport: + ACM policies: [[https://www.acm.org/publications/policies/artifact-review-badging][Artifact Review and Badging]] +*** Terminology (no consensus yet!) 
+ :PROPERTIES: + :BEAMER_col: 0.5 + :BEAMER_env: block + :END: + - *Repeatability* \\ same team, same experimental setup + - *Replicability* \\ different team, same experimental setup + - *Reproducibility* \\ different team, different experimental setup +#+BEAMER: \pause +*** Badging software artefacts + :PROPERTIES: + :BEAMER_col: 0.4 + :BEAMER_env: block + :END: +#+latex: \begin{center} + #+ATTR_LATEX: :width 0.6\linewidth +# file:file:metadata_landscape_final.png +file:acm_badges.png +#+latex: \end{center} +#+BEAMER: \pause + +** The state of the art is not ideal +#+INCLUDE: "../../common/modules/reprod-bad-sota.org::#collbergmethod" :only-contents t :minlevel 3 +** ... cont'd +#+INCLUDE: "../../common/modules/reprod-bad-sota.org::#collbergfindings" :only-contents t :minlevel 3 +#+BEAMER: \pause +*** The main reasons + \hfill source code (/or the right version of it/) cannot be found + +** Where we stand :noexport: +*** Lack of recognition + :PROPERTIES: + :BEAMER_env: block + :BEAMER_COL: .5 + :END: + not (yet) a first class citizen + - in the EOSC plan +# - in the EU copyright reform + - in the scholarly works +#+BEAMER: \pause +*** Lack of proper guidance on how to + :PROPERTIES: + :BEAMER_env: block + :BEAMER_COL: .5 + :END: + - /archive/ software + - choose a license + - /cite/ a software project +# #+BEAMER: \pause +# *** :B_ignoreheading: +# :PROPERTIES: +# :BEAMER_env: ignoreheading +# :END: +# *** Lack of basic prerequisites to reproducibility +# See a discussion in \url{annex.softwareheritage.org/talks/2018/2018-09-17-STScI_public.pdf} + +*** :B_ignoreheading: + :PROPERTIES: + :BEAMER_env: ignoreheading + :END: +#+BEAMER: \pause +*** ... but a wealth of initiatives! + - Policies: ACM [[https://www.acm.org/publications/policies/artifact-review-badging][Artifact Review and Badging]], ... 
+ - Working groups: [[https://www.force11.org/software-citation-principles][FORCE11]], [[https://www.rd-alliance.org/groups/software-source-code-ig][RDA]], [[https://www.ouvrirlascience.fr/logiciels-libres-et-open-source/][SPSO]], ... + - Metrics: [[https://www.ouvrirlascience.fr/about-the-proposal-for-software-indicators-in-open-science-monitor-3/][Open Science Monitor]] (Elsevier!), ... + - Journals: [[https://www.ipol.im/][IPOL]], ReScience, InsightJournal, eLife, ACM DL, ... + - Repositories: FigShare, Zenodo, ... +** What is at stake :noexport: +*** Metadata + Research software artifacts must be properly *described*\\ + \hfill make it easy to /discover/ them (/visibility/) +#+BEAMER: \pause +*** Archival + Research software artifacts must be properly *archived*\\ + \hfill make sure we can /retrieve/ them (/reproducibility/) +#+BEAMER: \pause +*** Identification + Research software artifacts must be properly *referenced*\\ + \hfill make sure we can /identify/ them (/reproducibility/) +#+BEAMER: \pause +*** Citation + Research software artifacts must be properly *cited* /(not the same as referenced)/\\ + \hfill to give /credit/ to authors (/evaluation/!) 
+ +* Software Heritage + #+INCLUDE: "../../common/modules/swh-motivations-foss.org::#fragile" :minlevel 2 + # #+INCLUDE: "../../common/modules/swh-motivations-foss.org::#spread" :minlevel 2 + #+INCLUDE: "../../common/modules/swh-motivations-foss.org::#research" :minlevel 2 + + #+INCLUDE: "../../common/modules/swh-overview-sourcecode.org::#mission" :minlevel 2 + # #+INCLUDE: "../../common/modules/principles-short.org::#principles" :minlevel 2 + #+INCLUDE: "../../common/modules/status-extended.org::#archivinggoals" :minlevel 2 + + #+INCLUDE: "../../common/modules/status-extended.org::#architecture" :minlevel 2 :only-contents t + # #+INCLUDE: "../../common/modules/status-extended.org::#merkletree" :minlevel 2 + #+INCLUDE: "../../common/modules/status-extended.org::#datamodel" :only-contents t + #+INCLUDE: "../../common/modules/status-extended.org::#dagdetailsmall" :only-contents t + # #+INCLUDE: "../../common/modules/status-extended.org::#archive" :minlevel 2 + +* Software Heritage for Open Science +** A revolutionary infrastructure for research and innovation +*** Reference archive for research software :B_picblock: + :PROPERTIES: + :BEAMER_env: picblock + :BEAMER_OPT: pic=PreservationTriangle.png,leftpic=true, width=.4\linewidth + :END: + - *curated deposit* of research software + + /prototype/ with *HAL*, *CCSD* and *Inria IES* + - *intrinsic* identifiers for *reproducibility* + #+BEAMER: \pause +*** Reference platform for /Big Code/ :B_picblock: + :PROPERTIES: + :BEAMER_opt: pic=universal, leftpic=true, width=.2\linewidth + :BEAMER_env: picblock + :BEAMER_act: + :END: + - unique *observatory* of all software development + - *big data, machine learning* paradise: classification, trends, coding patterns, code completion... 
+** Highlights \hfill bit.ly/swhpaper +*** The largest software source code archive /ever/ + #+latex: \centering + #+latex: \mbox{}\hfill\includegraphics[width=\extblockscale{.35\linewidth}]{swh-dataflow-merkle.pdf}\hfill\pause + #+latex: \includegraphics[width=\extblockscale{.75\linewidth}]{2019-01-archive-growth.png}\hfill\mbox{} +#+BEAMER: \pause +*** /10 billion intrinsic/ identifiers for reproducibility :B_block: + :PROPERTIES: + :BEAMER_env: block + :BEAMER_COL: .6 + :END: + See DIO vs IDO in \hfill \url{bit.ly/swhpidpaper} + #+BEAMER: \pause +*** Research software deposit :B_block:noexport: + :PROPERTIES: + :BEAMER_env: block + :BEAMER_COL: .4 + :END: + - [[https://www.softwareheritage.org/2018/09/28/depositing-scientific-software-into-software-heritage/][moderated via *HAL*]]\\ + \hfill /open since 9/2018/ +*** Reference archive :B_block: + :PROPERTIES: + :BEAMER_env: block + :BEAMER_COL: .4 + :END: + See the work done at \hfill /swmath.org/ + #+BEAMER: \pause +*** SWH IDs now a standard for Wikidata + \mbox{}\hfill See https://www.wikidata.org/wiki/Property:P6138 + #+BEAMER: \pause +*** Collaboration HUB :B_block:noexport: + :PROPERTIES: + :BEAMER_env: block + :BEAMER_COL: .33 + :END: + - industry, research + - digital preservation +*** :B_ignoreheading: + :PROPERTIES: + :BEAMER_env: ignoreheading + :END: +*** Policy + \hfill Now part of the /French National Plan for Open Science/ \hfill\mbox{} +** Leveraging Software Heritage +*** Deposit research software \hfill /open since 9/2018/ :B_picblock: + :PROPERTIES: + :BEAMER_env: picblock + :BEAMER_OPT: pic=deposit-communication.png,width=.61\linewidth,leftpic=true + :END: +#+LATEX: \pause + *Generic mechanism (SWORD based):*\\ + - *review process*, versioning +# - /industry chimes in/ (details on demand) +#+BEAMER: \pause + - *(today)*: deposit .zip or .tar.gz file ([[http://bit.ly/swhdeposithalen][/guide/]]) + - *(tomorrow)*: provide /SWH id/ and (extract) metadata + \hfill 
[[https://www.softwareheritage.org/2018/09/28/depositing-scientific-software-into-software-heritage/][*click here to learn more...*]] +#+BEAMER: \pause +*** Reference archive: origins :B_block: + :PROPERTIES: + :BEAMER_env: block + :BEAMER_COL: .5 + :END: + *swMATH.org* links into Software Heritage + - e.g. [[http://swmath.org/software/7116][/the SemiPar entry in swMATH.org/]] +*** Reference archive: releases :B_block: + :PROPERTIES: + :BEAMER_env: block + :BEAMER_COL: .45 + :END: + *Wikidata* [[https://www.wikidata.org/wiki/Property:P6138][/SWH Release ID Property/]] + - e.g. [[https://www.wikidata.org/wiki/Q5533567][/the release 3.1.0 of Gensim/]] +*** :B_ignoreheading: + :PROPERTIES: + :BEAMER_env: ignoreheading + :END: + +* Research challenges +** Realizing the "large telescope of source code" in practice +*** Requirements + - *Availability*: Software Heritage mirror, relatively up-to-date + - *Efficiency*: massive computing resources with fast access to the mirror + - *Sustainability*: pay-per-use or bring-your-own-computing +*** Challenges + - mirroring + - compression + - efficient processing + - experiments description language + - big code analysis (i.e., ML on source code) +*** :B_ignoreheading: + :PROPERTIES: + :BEAMER_env: ignoreheading + :END: + ... at the scale of the world! + +* Conclusion +** Wrapping up + #+latex: \vspace{-2mm} +*** + - Software Heritage archives all software source code with its development + history. + - It is a major endeavor that benefits society, science, and industry. + - For computer scientists, it is a gold mine of research opportunities. + Wanna join? 
+ #+latex: \vspace{-2mm} +*** References + #+latex: \vspace{-1mm} + #+BEGIN_EXPORT latex + \begin{thebibliography}{Foo Bar, 1969} + \scriptsize + + \bibitem{DiCosmo2019b} Antoine Pietri, Diomidis Spinellis, Stefano Zacchiroli\newblock + The Software Heritage graph dataset: public software development under one roof\newblock + MSR 2019: Mining Software Repositories, IEEE + + \bibitem{Abramatic2018} Jean-François Abramatic, Roberto Di Cosmo, Stefano Zacchiroli\newblock + Building the Universal Archive of Source Code\newblock + Communications of the ACM, October 2018 + + \bibitem{DiCosmo2018} Roberto Di Cosmo, Morane Gruenpeter, Stefano Zacchiroli\newblock + Identifiers for Digital Objects: the Case of Software Source Code Preservation\newblock + iPRES 2018: Intl. Conf. on Digital Preservation + + \bibitem{DiCosmo2017} Roberto Di Cosmo, Stefano Zacchiroli\newblock + Software Heritage: Why and How to Preserve Software Source Code\newblock + iPRES 2017: Intl. Conf. on Digital Preservation + + \end{thebibliography} + #+END_EXPORT diff --git a/talks-public/2019-10-10-jcad/Makefile b/talks-public/2019-10-10-jcad/Makefile new file mode 100644 index 0000000..68fbee7 --- /dev/null +++ b/talks-public/2019-10-10-jcad/Makefile @@ -0,0 +1 @@ +include ../Makefile.slides