diff --git a/talks-public/2019-10-10-jcad/2019-10-10-jcad.org b/talks-public/2019-10-10-jcad/2019-10-10-jcad.org
index ac04a0e..938c4e3 100644
--- a/talks-public/2019-10-10-jcad/2019-10-10-jcad.org
+++ b/talks-public/2019-10-10-jcad/2019-10-10-jcad.org
@@ -1,311 +1,354 @@
 #+COLUMNS: %40ITEM %10BEAMER_env(Env) %9BEAMER_envargs(Env Args) %10BEAMER_act(Act) %4BEAMER_col(Col) %10BEAMER_extra(Extra) %8BEAMER_opt(Opt)
 #+TITLE: Software Heritage
 #+SUBTITLE: Source Code Archival and Analysis at the Scale of the World
 #+BEAMER_HEADER: \date[10 Oct 2019, JCAD]{10 October 2019\\Journées Calcul et Données --- Toulouse, France}
 #+AUTHOR: Stefano Zacchiroli
 #+DATE: 10 October 2019
 #+EMAIL: zack@upsilon.cc
 #+INCLUDE: "../../common/modules/prelude.org" :minlevel 1
 #+INCLUDE: "../../common/modules/169.org"
 #+BEAMER_HEADER: \institute[Software Heritage]{Software Heritage --- {\tt zack@upsilon.cc, @zacchiro}}
 #+BEAMER_HEADER: \author{Stefano Zacchiroli}
 * Software Source Code: a Forgotten Pillar of Open Science
 ** Software Source code: pillar of Open Science
 *** Software is everywhere in modern research :B_picblock:
  :PROPERTIES:
  :BEAMER_opt: pic=papermountain, leftpic=true, width=.3\linewidth
  :BEAMER_env: picblock
  :BEAMER_COL: .6
  :END:
  #+BEGIN_QUOTE
  [...] software [...] essential in their fields.
  \mbox{}\hfill Top 100 papers (Nature, 2014)
  #+END_QUOTE
  #+BEGIN_QUOTE
  Sometimes, if you don't have the software, you don't have the data
  \mbox{}\hfill Christine Borgman, Paris, 2018
  #+END_QUOTE
  # http://www.nature.com/news/the-top-100-papers-1.16224
  #+BEAMER: \pause
 *** Open Science: three pillars :B_block:
  :PROPERTIES:
  :BEAMER_COL: .45
  :BEAMER_env: block
  :END:
  #+latex: \begin{center}
  #+ATTR_LATEX: :width \extblockscale{\linewidth}
  file:PreservationTriangle.png
  #+latex: \end{center}
  #+BEAMER: \pause
 *** :B_ignoreheading:
  :PROPERTIES:
  :BEAMER_env: ignoreheading
  :END:
 *** Nota bene
  \hfill The links in the picture are *essential*
 ** Source code is /special/
  #+INCLUDE: "../../common/modules/source-code-different-short.org::#softwareisdifferent" :only-contents t :minlevel 3
 ** ~ 50 years, a lightning fast growth :noexport:
  #+INCLUDE: "../../common/modules/50years-source-code.org::#apollolinux" :only-contents t :minlevel 3
 ** Pressure to make the source code available is rising :noexport:
 *** Why
  Necessary to
  - /reproduce/ and verify,
  - /modify/ and /evolve/, *building new experiments* from old ones
  #+BEAMER: \pause
 *** When and where
  - debate started at the end of the 2000s (biology, statistics, medicine, etc.)
  - growing in Computer Science since the [[https://www.artifact-eval.org/about.html][ESEC/FSE 2011 Artifact Evaluation context]] (winner: Vouillon and Di Cosmo)
 ** ACM take on Reproducibility, Replicability and Source code :noexport:
  ACM policies: [[https://www.acm.org/publications/policies/artifact-review-badging][Artifact Review and Badging]]
 *** Terminology (not consensual yet!)
  :PROPERTIES:
  :BEAMER_col: 0.5
  :BEAMER_env: block
  :END:
  - *Repeatability* \\
    same team, same experimental setup
  - *Replicability* \\
    different team, same experimental setup
  - *Reproducibility* \\
    different team, different experimental setup
  #+BEAMER: \pause
 *** Badging software artefacts
  :PROPERTIES:
  :BEAMER_col: 0.4
  :BEAMER_env: block
  :END:
  #+latex: \begin{center}
  #+ATTR_LATEX: :width 0.6\linewidth
  # file:file:metadata_landscape_final.png
  file:acm_badges.png
  #+latex: \end{center}
  #+BEAMER: \pause
 ** The state of the art is not ideal
  #+INCLUDE: "../../common/modules/reprod-bad-sota.org::#collbergmethod" :only-contents t :minlevel 3
 ** ... cont'd
  #+INCLUDE: "../../common/modules/reprod-bad-sota.org::#collbergfindings" :only-contents t :minlevel 3
  #+BEAMER: \pause
 *** The main reasons
  \hfill source code (/or the right version of it/) cannot be found
 ** Where we stand :noexport:
 *** Lack of recognition
  :PROPERTIES:
  :BEAMER_env: block
  :BEAMER_COL: .5
  :END:
  not (yet) a first class citizen
  - in the EOSC plan
  # - in the EU copyright reform
  - in the scholarly works
  #+BEAMER: \pause
 *** Lack of proper guidance on how to
  :PROPERTIES:
  :BEAMER_env: block
  :BEAMER_COL: .5
  :END:
  - /archive/ software
  - choose a license
  - /cite/ a software project
  # #+BEAMER: \pause
  # *** :B_ignoreheading:
  # :PROPERTIES:
  # :BEAMER_env: ignoreheading
  # :END:
  # *** Lack of basic prerequisites to reproducibility
  # See a discussion in \url{annex.softwareheritage.org/talks/2018/2018-09-17-STScI_public.pdf}
 *** :B_ignoreheading:
  :PROPERTIES:
  :BEAMER_env: ignoreheading
  :END:
  #+BEAMER: \pause
 *** ... but a wealth of initiatives!
  - Policies: ACM [[https://www.acm.org/publications/policies/artifact-review-badging][Artifact Review and Badging]], ...
  - Working groups: [[https://www.force11.org/software-citation-principles][FORCE11]], [[https://www.rd-alliance.org/groups/software-source-code-ig][RDA]], [[https://www.ouvrirlascience.fr/logiciels-libres-et-open-source/][SPSO]], ...
  - Metrics: [[https://www.ouvrirlascience.fr/about-the-proposal-for-software-indicators-in-open-science-monitor-3/][Open Science Monitor]] (Elsevier!), ...
  - Journals: [[https://www.ipol.im/][IPOL]], ReScience, InsightJournal, eLife, ACM DL, ...
  - Repositories: FigShare, Zenodo, ...
 ** What is at stake :noexport:
 *** Metadata
  Research software artifacts must be properly *described*\\
  \hfill make it easy to /discover/ them (/visibility/)
  #+BEAMER: \pause
 *** Archival
  Research software artifacts must be properly *archived*\\
  \hfill make sure we can /retrieve/ them (/reproducibility/)
  #+BEAMER: \pause
 *** Identification
  Research software artifacts must be properly *referenced*\\
  \hfill make sure we can /identify/ them (/reproducibility/)
  #+BEAMER: \pause
 *** Citation
  Research software artifacts must be properly *cited* /(not the same as referenced)/\\
  \hfill to give /credit/ to authors (/evaluation/!)
 * Software Heritage
  #+INCLUDE: "../../common/modules/swh-motivations-foss.org::#fragile" :minlevel 2
  # #+INCLUDE: "../../common/modules/swh-motivations-foss.org::#spread" :minlevel 2
  #+INCLUDE: "../../common/modules/swh-motivations-foss.org::#research" :minlevel 2
  #+INCLUDE: "../../common/modules/swh-overview-sourcecode.org::#mission" :minlevel 2
  # #+INCLUDE: "../../common/modules/principles-short.org::#principles" :minlevel 2
  #+INCLUDE: "../../common/modules/status-extended.org::#archivinggoals" :minlevel 2
  #+INCLUDE: "../../common/modules/status-extended.org::#architecture" :minlevel 2 :only-contents t
  # #+INCLUDE: "../../common/modules/status-extended.org::#merkletree" :minlevel 2
  #+INCLUDE: "../../common/modules/status-extended.org::#datamodel" :only-contents t
  #+INCLUDE: "../../common/modules/status-extended.org::#dagdetailsmall" :only-contents t
- # #+INCLUDE: "../../common/modules/status-extended.org::#archive" :minlevel 2
+ #+INCLUDE: "../../common/modules/status-extended.org::#archive" :minlevel 2
 * Software Heritage for Open Science
-** A revolutionary infrastructure for research and innovation
+** A revolutionary infrastructure for research and innovation :noexport:
 *** Reference archive for research software :B_picblock:
  :PROPERTIES:
  :BEAMER_env: picblock
  :BEAMER_OPT: pic=PreservationTriangle.png,leftpic=true, width=.4\linewidth
  :END:
  - *curated deposit* of research software
    + /prototype/ with *HAL*, *CCSD* and *Inria IES*
  - *intrinsic* identifiers for *reproducibility*
  #+BEAMER: \pause
 *** Reference platform for /Big Code/ :B_picblock:
  :PROPERTIES:
  :BEAMER_opt: pic=universal, leftpic=true, width=.2\linewidth
  :BEAMER_env: picblock
  :BEAMER_act:
  :END:
  - unique *observatory* of all software development
  - *big data, machine learning* paradise: classification, trends, coding patterns, code completion...
-** Highlights \hfill bit.ly/swhpaper
+** Highlights \hfill bit.ly/swhpaper :noexport:
 *** The largest software source code archive /ever/
  #+latex: \centering
  #+latex: \mbox{}\hfill\includegraphics[width=\extblockscale{.35\linewidth}]{swh-dataflow-merkle.pdf}\hfill\pause
  #+latex: \includegraphics[width=\extblockscale{.75\linewidth}]{2019-01-archive-growth.png}\hfill\mbox{}
  #+BEAMER: \pause
 *** /10 billion intrinsic/ identifiers for reproducibility :B_block:
  :PROPERTIES:
  :BEAMER_env: block
  :BEAMER_COL: .6
  :END:
  See DIO vs IDO in \hfill \url{bit.ly/swhpidpaper}
  #+BEAMER: \pause
 *** Research software deposit :B_block:noexport:
  :PROPERTIES:
  :BEAMER_env: block
  :BEAMER_COL: .4
  :END:
  - [[https://www.softwareheritage.org/2018/09/28/depositing-scientific-software-into-software-heritage/][moderated via *HAL*]]\\
    \hfill /open since 9/2018/
 *** Reference archive :B_block:
  :PROPERTIES:
  :BEAMER_env: block
  :BEAMER_COL: .4
  :END:
  See the work done at \hfill /swmath.org/
  #+BEAMER: \pause
 *** SWH IDs now a standard for Wikidata
  \mbox{}\hfill See https://www.wikidata.org/wiki/Property:P6138
  #+BEAMER: \pause
 *** Collaboration HUB :B_block:noexport:
  :PROPERTIES:
  :BEAMER_env: block
  :BEAMER_COL: .33
  :END:
  - industry, research
  - digital preservation
 *** :B_ignoreheading:
  :PROPERTIES:
  :BEAMER_env: ignoreheading
  :END:
 *** Policy
  \hfill Now part of the /French National Plan for Open Science/ \hfill\mbox{}
-** Leveraging Software Heritage
+** Leveraging Software Heritage for Open Science
 *** Deposit research software \hfill /open since 9/2018/ :B_picblock:
  :PROPERTIES:
  :BEAMER_env: picblock
  :BEAMER_OPT: pic=deposit-communication.png,width=.61\linewidth,leftpic=true
  :END:
  #+LATEX: \pause
  *Generic mechanism (SWORD based):*\\
  - *review process*, versioning
  # - /industry chimes in/ (details on demand)
  #+BEAMER: \pause
  - *(today)*: deposit .zip or .tar.gz file ([[http://bit.ly/swhdeposithalen][/guide/]])
  - *(tomorrow)*: provide /SWH id/ and (extract) metadata
  \hfill [[https://www.softwareheritage.org/2018/09/28/depositing-scientific-software-into-software-heritage/][*click here to learn more...*]]
  #+BEAMER: \pause
 *** Reference archive: origins :B_block:
  :PROPERTIES:
  :BEAMER_env: block
  :BEAMER_COL: .5
  :END:
  *swMATH.org* links into Software Heritage
  - e.g. [[http://swmath.org/software/7116][/the SemiPar entry in swMATH.org/]]
 *** Reference archive: releases :B_block:
  :PROPERTIES:
  :BEAMER_env: block
  :BEAMER_COL: .45
  :END:
  *Wikidata* [[https://www.wikidata.org/wiki/Property:P6138][/SWH Release ID Property/]]
  - e.g. [[https://www.wikidata.org/wiki/Q5533567][/the release 3.1.0 of Gensim/]]
 *** :B_ignoreheading:
  :PROPERTIES:
  :BEAMER_env: ignoreheading
  :END:
+** Leveraging Software Heritage for Open Science (cont.)
+*** GNU Guix :B_block:BMCOL:
+ :PROPERTIES:
+ :BEAMER_col: 0.4
+ :BEAMER_env: block
+ :END:
+ #+BEGIN_EXPORT latex
+ \includegraphics[width=\textwidth]{this/swh-guix}
+ #+END_EXPORT
+*** code.etalab.gouv.fr :B_block:BMCOL:
+ :PROPERTIES:
+ :BEAMER_col: 0.6
+ :BEAMER_env: block
+ :END:
+ #+BEGIN_EXPORT latex
+ \includegraphics[width=\textwidth]{this/code-gouv}
+ #+END_EXPORT
 * Research challenges
 ** Realizing the "large telescope of source code" in practice
 *** Requirements
  - *Availability*: Software Heritage mirror, relatively up-to-date
  - *Efficiency*: massive computing resources with fast access to the mirror
  - *Sustainability*: pay-per-use or bring-your-own-computing
 *** Challenges
  - mirroring
  - compression
  - efficient processing
  - experiments description language
  - big code analysis (i.e., ML on source code)
 *** :B_ignoreheading:
  :PROPERTIES:
  :BEAMER_env: ignoreheading
  :END:
  ... at the scale of the world!
+** Software Heritage Graph dataset
+ :PROPERTIES:
+ :CUSTOM_ID: graphdataset
+ :END:
+ #+BEAMER: \vspace{-1mm}
+
+ **Use case:** large scale analyses of the most comprehensive corpus on the
+ development history of free/open source software.
+
+*** Dataset
+ - Relational representation of the full graph as a set of tables
+ - Available as open data: https://doi.org/10.5281/zenodo.2583978
+
+ #+BEAMER: \vspace{-1mm}
+*** Formats
+ - Local use: PostgreSQL dumps, or Apache Parquet files (~1 TiB each)
+ - Live usage: Amazon Athena (SQL-queriable)
+
+*** MSR 2020 mining challenge :B_ignoreheading:
+ :PROPERTIES:
+ :BEAMER_env: ignoreheading
+ :END:
+ #+BEGIN_EXPORT latex
+ \includegraphics[width=\textwidth]{this/msr-2020}
+ #+END_EXPORT
+
 * Conclusion
 ** Wrapping up
  #+latex: \vspace{-2mm}
 ***
  - Software Heritage archives all software source code with its development history.
  - It is a major endeavor that benefits society, science, and industry.
  - For computer scientists, it is a gold mine of research opportunities. Wanna join?
  #+latex: \vspace{-2mm}
 *** References
  #+latex: \vspace{-1mm}
  #+BEGIN_EXPORT latex
  \begin{thebibliography}{Foo Bar, 1969}
  \scriptsize
  \bibitem{DiCosmo2019b} Antoine Pietri, Diomidis Spinellis, Stefano Zacchiroli\newblock
  The Software Heritage graph dataset: public software development under one roof\newblock
  MSR 2019: Mining Software Repositories, IEEE
  \bibitem{Abramatic2018} Jean-François Abramatic, Roberto Di Cosmo, Stefano Zacchiroli\newblock
  Building the Universal Archive of Source Code\newblock
  Communications of the ACM, October 2018
  \bibitem{DiCosmo2018} Roberto Di Cosmo, Morane Gruenpeter, Stefano Zacchiroli\newblock
  Identifiers for Digital Objects: the Case of Software Source Code Preservation\newblock
  iPRES 2018: Intl. Conf. on Digital Preservation
  \bibitem{DiCosmo2017} Roberto Di Cosmo, Stefano Zacchiroli\newblock
  Software Heritage: Why and How to Preserve Software Source Code\newblock
  iPRES 2017: Intl. Conf. on Digital Preservation
  \end{thebibliography}
  #+END_EXPORT
diff --git a/talks-public/2019-10-10-jcad/this/code-gouv.png b/talks-public/2019-10-10-jcad/this/code-gouv.png
new file mode 100644
index 0000000..565c8eb
Binary files /dev/null and b/talks-public/2019-10-10-jcad/this/code-gouv.png differ
diff --git a/talks-public/2019-10-10-jcad/this/msr-2020.png b/talks-public/2019-10-10-jcad/this/msr-2020.png
new file mode 100644
index 0000000..0c87433
Binary files /dev/null and b/talks-public/2019-10-10-jcad/this/msr-2020.png differ
diff --git a/talks-public/2019-10-10-jcad/this/swh-guix.png b/talks-public/2019-10-10-jcad/this/swh-guix.png
new file mode 100644
index 0000000..b81ec18
Binary files /dev/null and b/talks-public/2019-10-10-jcad/this/swh-guix.png differ