diff --git a/common/images/indexer_metadata_translate_example.png b/common/images/indexer_metadata_translate_example.png new file mode 100644 index 0000000..e2a6cb3 Binary files /dev/null and b/common/images/indexer_metadata_translate_example.png differ diff --git a/talks-public/2020-12-01-Master-STL/2020-12-01-Master-STL.org b/talks-public/2020-12-01-Master-STL/2020-12-01-Master-STL.org index 7e81ddf..17b33c7 100644 --- a/talks-public/2020-12-01-Master-STL/2020-12-01-Master-STL.org +++ b/talks-public/2020-12-01-Master-STL/2020-12-01-Master-STL.org @@ -1,406 +1,573 @@ #+COLUMNS: %40ITEM %10BEAMER_env(Env) %9BEAMER_envargs(Env Args) %10BEAMER_act(Act) %4BEAMER_col(Col) %10BEAMER_extra(Extra) %8BEAMER_opt(Opt) #+TITLE: Software Heritage #+SUBTITLE: The universal source code archive #+BEAMER_HEADER: \title{Software Heritage} #+AUTHOR: Morane Gruenpeter #+EMAIL: morane@softwareheritage.org #+BEAMER_HEADER: \date[December 1st, 2020]{December 1st, 2020\\[-1em]} #+BEAMER_HEADER: \title[www.softwareheritage.org]{Software Heritage} #+BEAMER_HEADER: \institute[]{\\\href{mailto:morane@softwareheritage.org}{\tt morane@softwareheritage.org}} #+BEAMER_HEADER: \author[Morane Gruenpeter]{ Morane Gruenpeter\\[1em]% #+BEAMER_HEADER: Software engineer and metadata specialist\\Inria, Software Heritage\\[-1em]} # #+BEAMER_HEADER: \setbeameroption{show notes on second screen} #+BEAMER_HEADER: \setbeameroption{hide notes} #+KEYWORDS: software heritage legacy preservation knowledge mankind technology deposit # # prelude.org contains all the information needed to export the main beamer latex source # use prelude-toc.org to get the table of contents # #+INCLUDE: "../../common/modules/prelude-toc.org" :minlevel 1 #+INCLUDE: "../../common/modules/169.org" # +LaTeX_CLASS_OPTIONS: [aspectratio=169,handout,xcolor=table] #+LATEX_HEADER: \usepackage{bbding} #+LATEX_HEADER: \usepackage{tcolorbox} #+LATEX_HEADER: \DeclareUnicodeCharacter{66D}{\FiveStar} # # If you want to change the title logo it's 
here # # +BEAMER_HEADER: \titlegraphic{\includegraphics[width=0.7\textwidth]{SWH-logo}} # aspect ratio can be changed, but the slides need to be adapted # - compute a "resizing factor" for the images (macro for picblocks?) # # set the background image # # https://pacoup.com/2011/06/12/list-of-true-169-resolutions/ # #+BEAMER_HEADER: \pgfdeclareimage[height=90mm,width=160mm]{bgd}{swh-world-169.png} #+BEAMER_HEADER: \setbeamertemplate{background}{\pgfuseimage{bgd}} #+LATEX_HEADER: \usepackage{supertabular} #+LATEX_HEADER: \newcommand{\sponsor}[2]{{\bf #1}, #2} #+LATEX_HEADER: \newcommand{\teamster}[2]{{\textcolor{red}{#1}}, #2} * Introduction # BIO ** Short Bio: Morane Gruenpeter #+INCLUDE: "../../common/modules/mg-bio.org::#bio" :only-contents t :minlevel 3 # # One slide motivation + goals #+INCLUDE: "../../common/modules/swh-goals-oneslide-vertical.org::#goals" :minlevel 2 # # Where we are today: endorsement # ** Our principles \hfill iPres 2017 - \url{http://bit.ly/swhpaper} # #+INCLUDE: "../../common/modules/principles-compact.org::#principlesstatus" :only-contents t :minlevel 3 ** Our principles \hfill iPres 2017 - \url{http://bit.ly/swhpaper} :PROPERTIES: :CUSTOM_ID: principlesstatus :END: #+latex: \begin{center} #+ATTR_LATEX: :width .8\linewidth file:SWH-as-foundation-slim.png #+latex: \end{center} #+latex: \footnotesize\vspace{-3mm} # # #+BEAMER: \pause #+BEAMER: \pause #+latex: \centering #+ATTR_LATEX: :width \extblockscale{.8\linewidth} file:2020-09-08-growth.png ** Growing Support #+INCLUDE: "../../common/modules/support-compact.org::#support" :only-contents t :minlevel 3 * The knowledge is in the source code ! 
** Software is all around us -# TODO +*** Apollo 11 Guidance Computer (~60,000 lines), 1969 + #+latex: \begin{minipage}{.25\linewidth} + #+latex: \begin{flushleft} + #+ATTR_LATEX: :width \extblockscale{.8\linewidth} + file:Margaret_Hamilton.jpg + #+latex: \end{flushleft} + #+latex: \end{minipage} + #+latex: \begin{minipage}{.7\linewidth} + #+latex: \begin{flushright} + #+latex: "When I first got into it, nobody knew what it was that we were doing. It was like the Wild West." + #+latex: \hfill Margaret Hamilton + #+latex: \end{flushright} + #+latex: \end{minipage} + + +*** The World Wide Web, 1989, at CERN on a NeXT machine + #+latex: \begin{minipage}{.65\linewidth} + #+latex: \begin{flushleft} + #+latex:“When somebody has learned how to program a computer … + #+latex: You're joining a group of people who can do incredible things. + #+latex: They can make the computer do anything they can imagine.” + #+latex: \end{flushleft} + #+latex: \end{minipage} + #+latex: \begin{minipage}{.3\linewidth} + #+latex: \begin{flushright} + #+ATTR_LATEX: :width \extblockscale{.95\linewidth} + file:tim_berners_lee.jpg + #+latex: \end{flushright} + #+latex: \end{minipage} + + \hfill From An Insight, An Idea with Tim Berners-Lee (2013) ** The knowledge is in the source code! 
#+INCLUDE: "../../common/modules/source-code-different-short.org::#thesourcecode" :only-contents t :minlevel 3 ** Source code is /special/ *** /Executable/ and /human readable/ knowledge \hfill copyright law /“Programs must be written for people to read, and only incidentally for machines to execute.”/\\ \hfill Harold Abelson #+BEAMER: \pause *** Software /evolves/ over time - projects may last decades - the /development history/ is key to its /understanding/ #+BEAMER: \pause *** Complexity :B_picblock: :PROPERTIES: :BEAMER_env: picblock :BEAMER_OPT: pic=python3-matplotlib.pdf, width=.6\linewidth :END: - /millions/ of lines of code - large /web of dependencies/ + easy to break, difficult to maintain - sophisticated /developer communities/ -** modules/vcs-history.org::#timeline +** Software Source Code human readable and executable knowledge +file:NOLI_SE_TANGERE.png +** Version Control Systems timeline #+INCLUDE: "../../common/modules/vcs-history.org::#timeline" :only-contents t :minlevel 3 -** modules/vcs-history.org::#dvcs-to-merkle +** DVCS to Merkle #+INCLUDE: "../../common/modules/vcs-history.org::#dvcs-to-merkle" :only-contents t :minlevel 3 -** modules/vcs-history.org::#vcs-explained +** Version Control Systems explained #+INCLUDE: "../../common/modules/vcs-history.org::#vcs-explained" :only-contents t :minlevel 3 -** modules/vcs-history.org::#adoption +** DVCS adoption #+INCLUDE: "../../common/modules/vcs-history.org::#adoption" :only-contents t :minlevel 3 * Data model and SWHID: the source code fingerprint # under the hood: automation and storage, the archive in pictures #+INCLUDE: "../../common/modules/under-the-hood-pictures.org::#main" :only-contents t :minlevel 2 ** Under the hood: identifying billions of objects \hfill \url{https://bit.ly/2wOOmyV} #+latex: \begin{center} #+ATTR_LATEX: :width .85\linewidth file:swh-merkle-dag-wide.pdf #+latex: \end{center} #+latex: \footnotesize\vspace{-3mm} ** Our challenges in the PID landscape :PROPERTIES: 
:CUSTOM_ID: challenges :END: *** Typical properties of systems of identifiers \hfill uniqueness, non-ambiguity, persistence, abstraction (opacity) #+BEAMER: \pause *** Key needed properties from our use cases - gratis :: identifiers are free (billions of objects) - integrity :: the associated object cannot be changed (sw dev, /reproducibility/) - no middle man :: no central authority is needed (sw dev, /reproducibility/) #+BEAMER: \pause *** \hfill we could not find systems with both *integrity* and *no middle man* ! ** The SWH-ID schema # TODO: drawing with swh:1:cnt:xxxxxxx "exploded" and explained #+LATEX: \centering\forcebeamerstart #+LATEX: \only<1>{\includegraphics[width=\linewidth]{SWH-ID-1.png}} #+LATEX: \only<2>{\includegraphics[width=\linewidth]{SWH-ID-2.png}} #+LATEX: \only<3>{\includegraphics[width=\linewidth]{SWH-ID-3.png}} #+LATEX: \forcebeamerend ** Demo time *** Let's look at some famous excerpts of source code #+BEAMER: \pause *** Apollo 11 source code ([[https://archive.softwareheritage.org/swh:1:cnt:64582b78792cd6c2d67d35da5a11bb80886a6409;origin=https://github.com/virtualagc/virtualagc;lines=245-261/][excerpt]]) :B_block:BMCOL: :PROPERTIES: :BEAMER_col: 0.48 :BEAMER_env: block :END: #+LATEX: \includegraphics[width=\linewidth]{apollo-11-cranksilly.png} # excerpt of routine that asks astronaut to turn around the LEM #+BEAMER: \pause *** Quake III source code ([[https://archive.softwareheritage.org/swh:1:cnt:bb0faf6919fc60636b2696f32ec9b3c2adb247fe;origin=https://github.com/id-Software/Quake-III-Arena;lines=549-572/][excerpt]]) :B_block:BMCOL: :PROPERTIES: :BEAMER_col: 0.45 :BEAMER_env: block :END: #+LATEX: \includegraphics[width=\linewidth]{quake-carmack-sqrt-1.png} # smart efficient implementation of 1/sqrt(x) on a CPU without special support #+BEAMER: \pause *** :B_ignoreheading: :PROPERTIES: :BEAMER_env: ignoreheading :END: *** It works! 
we have /intrinsic/ identifiers for all 20+ billion objects in the archive -# metadata challenge- questions about a software entity and where to find metadata (one slide) -#+INCLUDE: "../../common/modules/identifiers-arena.org::#main" :only-contents t :minlevel 2 +* The software deposit - a first class research output + +** Software is a /forgotten/ pillar of Open Science +*** Lack of recognition + :PROPERTIES: + :BEAMER_env: block + :BEAMER_col: 0.48 + :END: + not (yet) a first class citizen + - in the EOSC plan + - in the scholarly world + + + #+BEGIN_QUOTE + Sometimes, if you don't have the software, you don't have the data + + \mbox{}\hfill Christine Borgman, Paris, 2018 + #+END_QUOTE + + +*** + :PROPERTIES: + :BEAMER_COL: .5 + :END: + #+latex: \begin{center} + #+ATTR_LATEX: :width 0.9\linewidth + file:preservation_triangle_color.png + #+latex: \end{center} +#+BEAMER: \pause +*** Reproducibility is the key :B_picblock: + :PROPERTIES: + :BEAMER_opt: pic=Karl_Popper, leftpic=true, width=.16\linewidth + :BEAMER_env: picblock + :END: +#+latex: \begin{quote} + non-reproducible single occurrences are of no significance to science\\ + \\ + \mbox{} \hfill \scriptsize Karl Popper, \emph{The Logic of Scientific Discovery}, 1934 +#+latex: \end{quote} + +** What is at stake \hfill in increasing order of difficulty +\vspace{-7pt} +*** Archival + Research software artifacts must be properly *archived*\\ + \hfill make sure we can /retrieve/ them (/reproducibility/) +#+BEAMER: \pause +*** Identification + Research software artifacts must be properly *referenced*\\ + \hfill make sure we can /identify/ them (/reproducibility/) +#+BEAMER: \pause +*** Metadata + Research software artifacts must be properly *described*\\ + \hfill make it easy to /discover/ them (/visibility/) +#+BEAMER: \pause +*** Citation + Research software artifacts must be properly *cited* /(not the same as referenced!)/\\ + \hfill to give /credit/ to authors (/evaluation/!) 
+ + + +** The research software (deposit) use case + :PROPERTIES: + :CUSTOM_ID: hal + :END: +*** the deposit workflow + :PROPERTIES: + :BEAMER_COL: .5 + :END: + #+latex: \begin{center} + #+ATTR_LATEX: :width \linewidth + file:deposit-communication-with-PID.png + #+latex: \end{center} +#+LATEX: \pause + +*** Deposit software in HAL \hfill [[http://hal.inria.fr/hal-01738741][poster]] :B_picblock: + :PROPERTIES: + :BEAMER_COL: .5 + :BEAMER_env: block + :END: + *\hspace{1em}Generic mechanism:* + - SWORD based + - review process + - versioning + +#+BEAMER: \pause + *\hspace{1em} How to do it:* \hfill ([[http://bit.ly/swhdeposithalen][/guide/]]) + - deposit .zip or .tar.gz file with metadata + +#+BEAMER: \pause + *\hspace{1em} Timeline:* + - /March 2018/: test phase on *HAL-Inria* + - /September 2018/: open to all *HAL* + - /December 2019/: + - 80 complete source code deposits + - 98 software records + +** Submit your source code \hfill ([[http://bit.ly/swhdeposithalen][/guide/]]) +#+latex: \begin{center} +#+ATTR_LATEX: :width \linewidth +file:HAL-form-IDCC.png +#+latex: \end{center} + +** The deposit view +#+latex: \begin{center} +#+ATTR_LATEX: :width \linewidth +file:HAL_deposit.png +#+latex: \end{center} + +** Reference vs. citation +*** Credit & Attribution + :PROPERTIES: + :BEAMER_col: 0.33 + :BEAMER_env: block + :END: + - a metadata record + - all authors & contributors +#+BEAMER: \pause + +*** Reuse & Reproducibility + :PROPERTIES: + :BEAMER_col: 0.33 + :BEAMER_env: block + :END: + - a specific artifact + - with complementary information (docs) +#+BEAMER: \pause + +*** Archive & Index + :PROPERTIES: + :BEAMER_col: 0.33 + :BEAMER_env: block + :END: + - metadata record (HAL) + - artifact itself (SWH) + \hfill connect the dots... 
+ +#+BEAMER: \pause +*** :B_ignoreheading: + :PROPERTIES: + :BEAMER_env: ignoreheading + :END: +#+latex: \begin{center} +#+ATTR_LATEX: :width 0.7\linewidth +file:citation-format-IDCC.png +#+latex: \end{center} + + + + +#+LATEX: \pause + +# scientific software (save code now) use-case (three slides) +#+INCLUDE: "../../common/modules/swh-scientific-preservation.org::#main" :only-contents t :minlevel 2 + + * The missing piece- the Metadata # metadata challenge- questions about a software entity and where to find metadata (one slide) #+INCLUDE: "../../common/modules/metadata-challenge.org::#main" :only-contents t :minlevel 2 ** The Software Ontology /Touchstone/ *** Software Citation Principles \tiny (FORCE11's 2015 conference and WG) :B_block: :PROPERTIES: :BEAMER_env: block :BEAMER_opt: :END: - *Importance* : first class citizen in the scholarly ecosystem - *Credit and attribution* : authors, maintainer - *Unique identification*: points to a unique, specific software version (DOI, Git SHA1 hash, etc.) 
- *Persistence* : identification beyond the lifespan of the software (swh-id) - *Accessibility*: url, publisher - *Specificity* : version, environment # metadata landscape (one decomposed slide) #+INCLUDE: "../../common/modules/metadata-landscape.org::#main" :only-contents t :minlevel 2 ** Software Metadata Terms *** identify :B_block:BMCOL: :PROPERTIES: :BEAMER_col: 0.2 :BEAMER_env: block :END: - identifier - title - authors - version - type - origin source #+BEAMER: \pause *** execute :B_block:BMCOL: :PROPERTIES: :BEAMER_opt: :BEAMER_env: block :BEAMER_col: 0.2 :END: - link to a compiled version - repository - compiler - environment - examples #+BEAMER: \pause *** classify :B_block:BMCOL: :PROPERTIES: :BEAMER_col: 0.2 :BEAMER_env: block :END: - description - keywords - in/out data - references - algorithms - docs url #+BEAMER: \pause *** administrate :B_block:BMCOL: :PROPERTIES: :BEAMER_col: 0.2 :BEAMER_env: block :END: - contact - authorship - funders - license - editor (publisher) - dates - status ** Much more complex than it seems *** Software is complex - Structure :: monolithic/composite; self-contained/external dependencies - Lifetime :: one-shot/long term - Community :: one man/one team/distributed community - Authorship :: complex set of roles - Authority :: institutions/organizations/communities/single person #+BEAMER: \pause *** Various granularities - Exact status of the source code :: for reproducibility, e.g. #+latex: \emph{``you can find at \href{https://archive.softwareheritage.org/swh:1:cnt:cdf19c4487c43c76f3612557d4dc61f9131790a4;lines=146-187/}{swh:1:cnt:cdf19c4487c43c76f3612557d4dc61f9131790a4;lines=146-187} the core algorithm used in this article''} - (Major) release :: \emph{``This functionality is available in OCaml version 4''} - Project :: \emph{``Inria has created OCaml and Scikit-Learn''}. 
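The finest granularity above relies on an intrinsic identifier plus a =lines= qualifier. As a minimal sketch (the function names =content_swhid= and =parse_swhid= are illustrative assumptions, not the Software Heritage API), the following Python computes a =swh:1:cnt:= identifier the same way Git hashes a blob and splits a qualified identifier into its parts:

```python
import hashlib


def content_swhid(data: bytes) -> str:
    """Return the swh:1:cnt identifier for raw file content (hypothetical helper)."""
    # Same hash as a Git blob: SHA-1 over "blob <length>\0" followed by the content.
    header = b"blob %d\x00" % len(data)
    return "swh:1:cnt:" + hashlib.sha1(header + data).hexdigest()


def parse_swhid(swhid: str) -> dict:
    """Split a SWHID into its core fields and its qualifiers (hypothetical helper)."""
    # Qualifiers follow the core identifier, separated by ";".
    core, *qualifiers = swhid.split(";")
    scheme, version, object_type, object_id = core.split(":")
    if scheme != "swh" or version != "1" or len(object_id) != 40:
        raise ValueError(f"not a version-1 SWHID: {swhid!r}")
    return {
        "type": object_type,
        "id": object_id,
        # Split on the first "=" only, so URL values such as origin=https://... survive.
        "qualifiers": dict(q.split("=", 1) for q in qualifiers),
    }


if __name__ == "__main__":
    print(content_swhid(b""))
    print(parse_swhid(
        "swh:1:cnt:cdf19c4487c43c76f3612557d4dc61f9131790a4;lines=146-187"
    ))
```

Because the identifier is derived from the content alone, anyone can recompute it and check it without asking a central authority, which is what makes it usable for the "exact status of the source code" case.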
+** Mining software metadata +*** supported intrinsic metadata files + - CodeMeta’s codemeta.json, + - Maven’s pom.xml, + - NPM’s package.json, + - Python’s PKG-INFO, + - Ruby’s .gemspec +*** Check the code +- /[[https://www.softwareheritage.org/2019/05/28/mining-software-metadata-for-80-m-projects-and-even-more/][blog post]]/ +- /[[https://docs.softwareheritage.org/devel/swh-indexer/metadata-workflow.html#adding-support-for-additional-ecosystem-specific-metadata][tutorial in docs]]/ -* Software Source code as a research output -** Software is a /forgotten/ pillar of Open Science -*** Lack of recognition - :PROPERTIES: - :BEAMER_env: block - :BEAMER_col: 0.48 - :END: - not (yet) a first class citizen - - in the EOSC plan - - in the scholarly world - - - #+BEGIN_QUOTE - Sometimes, if you don't have the software, you don't have the data - - \mbox{}\hfill Christine Borgman, Paris, 2018 - #+END_QUOTE +** Mining software metadata (example) \hfill /[[https://archive.softwareheritage.org/swh:1:cnt:d04c0e34d8cdcf49bb8f7acc03c608884806af23;origin=https://forge.softwareheritage.org/source/swh-indexer.git;visit=swh:1:snp:8f4fad5fb55bf68fb8ae76a80ee7e2d41497b598;anchor=swh:1:rev:10f8af474c0df7511c4b40ea480646ae73596303;path=/swh/indexer/metadata_dictionary/codemeta.py/][see example]]/ +file:indexer_metadata_translate_example.png -*** - :PROPERTIES: - :BEAMER_COL: .5 - :END: - #+latex: \begin{center} - #+ATTR_LATEX: :width 0.9\linewidth - file:preservation_triangle_color.png - #+latex: \end{center} -#+BEAMER: \pause -*** Reproducibility is the key :B_picblock: - :PROPERTIES: - :BEAMER_opt: pic=Karl_Popper, leftpic=true, width=.16\linewidth - :BEAMER_env: picblock - :END: -#+latex: \begin{quote} - non-reproducible single occurrences are of no significance to science\\ - \\ - \mbox{} \hfill \scriptsize Karl Popper, \emph{The Logic of Scientific Discovery}, 1934 -#+latex: \end{quote} +* Development workflow +** Overall architecture +*** Using a bit of code +#+BEAMER: \vspace{1mm} 
+#+BEAMER: \centering \includegraphics[width=\extblockscale{1.4\linewidth}]{swh-modules-deps-internal} -** What is at stake \hfill in increasing order of difficulty -\vspace{-7pt} -*** Archival - Research software artifacts must be properly *archived*\\ - \hfill make it sure we can /retrieve/ them (/reproducibility/) -#+BEAMER: \pause -*** Identification - Research software artifacts must be properly *referenced*\\ - \hfill make it sure we can /identify/ them (/reproducibility/) -#+BEAMER: \pause -*** Metadata - Research software artifacts must be properly *described*\\ - \hfill make it easy to /discover/ them (/visibility/) -#+BEAMER: \pause -*** Citation - Research software artifacts must be properly *cited* /(not the same as referenced!)/\\ - \hfill to give /credit/ to authors (/evaluation/!) +Actually it's not so big: +- ~20ksloc of python3 +- ~80 python dependencies +- a bunch of js +- ... keep it as simple as possible, but no simpler... (almost) +** The big picture +#+BEAMER: \vspace{1mm} +#+BEAMER: \centering \includegraphics[height=.8\textheight]{general-architecture} -* The software deposit - a first class research output +/[[https://docs.softwareheritage.org/devel/architecture.html][More details in our docs]]/ -# reproducibility and scientific knowledge pillars (one slide) -#+INCLUDE: "../../common/modules/swh-scientific-reproducibility.org::#main" :only-contents t :minlevel 2 +** Starting points +*** Development documentation + https://docs.softwareheritage.org/devel/ + - in particular, Developer setup: + https://docs.softwareheritage.org/devel/developer-setup.html + - i.e.: virtualenv + pip + tox -** The research software (deposit) use case - :PROPERTIES: - :CUSTOM_ID: hal - :END: -*** the deposit workflow - :PROPERTIES: - :BEAMER_COL: .5 - :END: - #+latex: \begin{center} - #+ATTR_LATEX: :width \linewidth - file:deposit-communication-with-PID.png - #+latex: \end{center} -#+LATEX: \pause +*** "Software Development" pages on the public wiki + 
https://wiki.softwareheritage.org/wiki/Category:Software_development -# scientific software (save code now) use-case (three slides) -#+INCLUDE: "../../common/modules/swh-scientific-preservation.org::#main" :only-contents t :minlevel 2 +*** Internship page on the public wiki + https://wiki.softwareheritage.org/wiki/Internships + + +** Development forge + #+BEAMER: \vspace{-2mm} +*** Phabricator + https://forge.softwareheritage.org/ + - all development activities happen here + + #+BEAMER: \vspace{-2mm} +*** The classics + - VCS: Git, with repo browsing using Diffusion + https://forge.softwareheritage.org/diffusion/ + - Tasks and Bugs: Maniphest https://forge.softwareheritage.org/maniphest/ + - one project tag for each software product, e.g., Git Loader: + https://forge.softwareheritage.org/project/view/17/ + - we use task priorities, assignees, and tags + - visibility: all dev tasks are public * Conclusion ** Research Software Engineer tips *** Use a forge for your academic and personal projects \hfill GitHub, GitLab or Bitbucket are the best way to build your *source code CV* #+BEAMER: \pause -*** Put in your projects metadata files +*** Add metadata files to your projects and document your code \hfill *README*, *LICENSE*, *AUTHORS* and *codemeta.json* to describe your project #+BEAMER: \pause *** Archive your projects on SWH \hfill Use the *Save Code Now* feature #+BEAMER: \pause *** Contribute to other projects \hfill When you contribute you learn how to *read code* #+BEAMER: \pause *** Ask \hfill Don't be afraid to ask in an *issue*, on a *mailing list* or an *IRC channel* (or ask your teachers) ** Come in, we're open! #+BEGIN_EXPORT latex \begin{center} \includegraphics[width=.5\linewidth]{SWH-logo.pdf} \end{center} \begin{center} \vfill {\Large Thank you! 
Any questions?} \end{center} #+END_EXPORT *** Join us on https://forge.softwareheritage.org/ :B_block: :PROPERTIES: :BEAMER_env: block :END: #+BEGIN_EXPORT latex \begin{thebibliography}{Foo Bar, 1969} \footnotesize \bibitem{Abramatic2018} Jean-François Abramatic, Roberto Di Cosmo, Stefano Zacchiroli\newblock \emph{Building the Universal Archive of Source Code}, Communications of the ACM, October 2018 \bibitem{DiCosmo2018} Roberto Di Cosmo, Morane Gruenpeter, Stefano Zacchiroli\newblock \emph{Identifiers for Digital Objects: the Case of Software Source Code Preservation}, iPRES 2018: Intl. Conf. on Digital Preservation \end{thebibliography} #+END_EXPORT *** contact: morane@softwareheritage.org