diff --git a/talks-public/2022-09-16-Guix/2022-06-24-Guix.org b/talks-public/2022-09-16-Guix/2022-06-24-Guix.org index 9638ca0..a53a7d5 100644 --- a/talks-public/2022-09-16-Guix/2022-06-24-Guix.org +++ b/talks-public/2022-09-16-Guix/2022-06-24-Guix.org @@ -1,323 +1,323 @@ #+COLUMNS: %40ITEM %10BEAMER_env(Env) %9BEAMER_envargs(Env Args) %10BEAMER_act(Act) %4BEAMER_col(Col) %10BEAMER_extra(Extra) %8BEAMER_opt(Opt) #+KEYWORDS: software heritage reproducibility guix #+TITLE: 10 years of Guix - Software Heritage #+SUBTITLE: SWH to the rescue of reproducible Science #+AUTHOR: vlorentz, ardumont #+EMAIL: vlorentz@softwareheritage.org, ardumont@softwareheritage.org #+DATE: 16 Sep 2022 #+BEAMER_HEADER: \date[16/09/2022]{16/09/2022\\Event 10 years of Guix, Paris 2022} # #+BEAMER_HEADER: \title[Archive and reference software~~~~ www.softwareheritage.org]{SWH to the rescue of reproducible Science} #+BEAMER_HEADER: \author{Valentin Lorentz (@vlorentz) / Antoine R. Dumont (@ardumont)} #+BEAMER_HEADER: \institute[Software Heritage]{Software Engineers, Software Heritage\\Inria} # #+BEAMER_HEADER: \setbeameroption{show notes on second screen} #+BEAMER_HEADER: \setbeameroption{hide notes} #+LATEX_HEADER: \usepackage{tcolorbox} #+LATEX_HEADER: \definecolor{links}{HTML}{2A1B81} #+LATEX_HEADER: \hypersetup{colorlinks,linkcolor=,urlcolor=links} # Syntax highlighting setup #+LATEX_HEADER_EXTRA: \usepackage{minted} #+LaTeX_HEADER_EXTRA: \usemintedstyle{emacs} #+name: setup-minted #+begin_src emacs-lisp :exports results :results silent (setq org-latex-listings 'minted) (setq org-latex-to-pdf-process '("pdflatex -shell-escape -interaction nonstopmode -output-directory %o %f" "pdflatex -shell-escape -interaction nonstopmode -output-directory %o %f" "pdflatex -shell-escape -interaction nonstopmode -output-directory %o %f")) (add-to-list 'org-latex-minted-langs '("emacs-lisp" "common-lisp")) #+end_src # End syntax highlighting setup # # prelude.org contains all the information needed to export 
the main beamer latex source # use prelude-toc.org to get the table of contents # #+INCLUDE: "../../common/modules/prelude-toc.org" :minlevel 1 #+INCLUDE: "../../common/modules/169.org" # +LaTeX_CLASS_OPTIONS: [aspectratio=169,handout,xcolor=table] #+LATEX_HEADER: \usepackage{bbding} #+LATEX_HEADER: \DeclareUnicodeCharacter{66D}{\FiveStar} # # If you want to change the title logo it's here # # +BEAMER_HEADER: \titlegraphic{\includegraphics[width=0.5\textwidth]{SWH-logo}} # aspect ratio can be changed, but the slides need to be adapted # - compute a "resizing factor" for the images (macro for picblocks?) # # set the background image # # https://pacoup.com/2011/06/12/list-of-true-169-resolutions/ # #+BEAMER_HEADER: \pgfdeclareimage[height=90mm,width=160mm]{bgd}{swh-world-169.png} #+BEAMER_HEADER: \setbeamertemplate{background}{\pgfuseimage{bgd}} #+LATEX: \addtocounter{framenumber}{-1} * Introduction: the Software Heritage project ** What is SoftwareHeritage? :PROPERTIES: :CUSTOM_ID: spread :END: The universal source code Archive ** Why an archive? Software is spread all around :PROPERTIES: :CUSTOM_ID: spread :END: #+latex: \begin{flushleft} #+ATTR_LATEX: :width \extblockscale{.5\linewidth} file:myriadsources.png #+latex: \end{flushleft} *** Fashion victims - disparate development platforms (popular forges: Guix, PyPI, npm, ...) - various places where distribution happens (standalone forges: gitlab, heptapod, cgit, gitea...) - projects tend to migrate from one place to another over time *** One place... :B_block: :PROPERTIES: :BEAMER_env: block :END: \hfill ... where can we find, track and search /all/ source code, rebuild tarballs? ** Why an archive? 
Software is fragile :PROPERTIES: :CUSTOM_ID: fragile :END: #+latex: \begin{flushleft} #+ATTR_LATEX: :width \extblockscale{.5\linewidth} file:fragilecloud.png #+latex: \end{flushleft} *** Like all digital information, FOSS is fragile # - inconsiderate and/or malicious code loss (e.g., Code Spaces) - link rot: projects are created, moved around, removed - business-driven code loss (e.g., Gitorious, Google Code, Bitbucket, ...) - data rot: physical media with legacy software decay *** If a website disappears you go to the Internet Archive... :B_block: :PROPERTIES: :BEAMER_env: block :END: \hfill where do you go if (a repository on) GitHub or GitLab goes away? ** Software Heritage in a Nutshell #+latex: \begin{center} #+ATTR_LATEX: :width \extblockscale{.6\linewidth} file:SWH-logo+motto.pdf #+latex: \end{center} *** Main Objectives - *Collect*, *Preserve* and *Share* *** Reference catalog :PROPERTIES: :BEAMER_env: block :BEAMER_COL: .3 :END: #+BEGIN_EXPORT latex \begin{center} \includegraphics[width=.4\linewidth]{myriadsources} \end{center} #+END_EXPORT *find* and *reference* all software source code *** Universal archive :PROPERTIES: :BEAMER_env: block :BEAMER_COL: .3 :END: #+BEGIN_EXPORT latex \begin{center} \includegraphics[width=.4\linewidth]{fragilecloud} \end{center} #+END_EXPORT *preserve* all the archived software source code *forever* *** Research infrastructure :B_block: :PROPERTIES: :BEAMER_COL: .3 :BEAMER_env: block :END: #+BEGIN_EXPORT latex \begin{center} \includegraphics[width=.4\linewidth]{atacama-telescope} \end{center} #+END_EXPORT *enable analysis* of all software source code, make every piece *identifiable* and freely *available* ** Our principles :PROPERTIES: :CUSTOM_ID: principlesstatus :END: #+latex: \begin{center} #+ATTR_LATEX: :width .6\linewidth file:SWH-as-foundation-slim.png #+latex: \end{center} #+latex: \footnotesize\vspace{-3mm} #+latex: \centering #+ATTR_LATEX: :width \extblockscale{.8\linewidth} file:2022-05-06-archive-growth.png ** 
Under the hood: Automation, and storage :PROPERTIES: :CUSTOM_ID: automation :END: #+BEAMER: \begin{center} #+BEAMER: \only<1>{\includegraphics[width=\extblockscale{\textwidth}]{swh-dataflow-merkle.pdf}} #+BEAMER: \end{center} /Global development history/ *permanently archived* in a *uniform data model* - over *12 billion* unique source files from over *180 million* software projects - *~900 TB* (uncompressed) blobs, *~25 B* nodes, *~300 B* edges * Reference archived code with SWHIDs ** R(eference): granularity and identifiers \hfill [[http://doi.org/10.15497/RDA00053][10.15497/RDA00053]] #+LATEX: \centering\forcebeamerstart #+LATEX: \only<1>{\includegraphics[width=0.8\linewidth]{Granularity-Level-animated-0.png}} #+LATEX: \only<2>{\includegraphics[width=0.8\linewidth]{Granularity-Level-animated-1.png}} #+LATEX: \only<3>{\includegraphics[width=0.8\linewidth]{Granularity-Level-animated-2.png}} #+LATEX: \only<4>{\includegraphics[width=0.8\linewidth]{Granularity-Level-animated-3.png}} #+LATEX: \forcebeamerend #+LATEX: \only<1>{\begin{block}{}\centering Top concept layers vs. 
bottom artifact layers\end{block}} #+LATEX: \only<2>{\begin{block}{}\centering Extrinsic identifiers are key for the concept layers\end{block}} #+LATEX: \only<3>{\begin{block}{}\centering Intrinsic identifiers are key for the artifact layers\end{block}} #+LATEX: \only<4>{\begin{block}{}\centering In some cases, extrinsic identifiers can be added too\end{block}} ** Meet the SWHID intrinsic identifiers :PROPERTIES: :CUSTOM_ID: oneslide :END: #+LATEX: \centering #+LATEX: \only<1>{\includegraphics[width=\linewidth]{SWHID-v1.4_3.png}} #+LATEX: \forcebeamerend \vspace{-6mm} ** Meet the SWHID intrinsic identifiers \centering [[https://archive.softwareheritage.org/browse/origin/directory/?origin_url=https://src.koda.cnrs.fr/mmdc/sensorsio][SWHID DEMO !]] \vspace{1em} \centering [[https://www.softwareheritage.org/2020/07/09/intrinsic-vs-extrinsic-identifiers/][Reference : Extrinsic vs intrinsic identifiers]] * Guix ** How does this relate to Guix? - Nothing is eternal, source code (in all forms) disappears - Hopefully, SWH keeps a copy of everything - Guix ensures source code is archived in SWH when building - After source code actually disappears, falls back to SWH when rebuilding ** Reproducibility is of the essence! *** Report - Tarballs will disappear (give it enough time) -- Persistent (intrinsic) identifier (SWHID) is not (yet?) package manager standard +- Persistent intrinsic identifiers (SWHID) are not (yet?) package manager standard - Guix (and other) package managers reference tarball hashes #+begin_src emacs-lisp (define-public ... (package ... (source (origin (method url-fetch) (uri (string-append "https://..." version ".tar.gz")) (sha256 (base32 "03mwi1l3354x52nar...")))) ... #+end_src *** Conclusion - make (non-specific swh) SWHID standard or rebuild original bit-by-bit tarball * Enters... Disarchive ** How it started *** Discussions - "gforge.inria.fr to be taken off-line in Dec. 
2020" https://issues.guix.gnu.org/42162 - "lookup ingested tarballs by container checksum" https://forge.softwareheritage.org/T2430 *** New software - Disarchive by Timothy Sample https://git.ngyro.com/disarchive/ ** How it works: - - Manifest of tarball fields (entry order, PAX headers, ...) + - Manifest of tarball fields: entry order, PAX headers, ... - References to individual file hashes - WIP: guessing compression parameters/implementations (using zgz) - -> rebuild original `.tar`, then original `.tar.{gz,xz}` ** Example manifest (1/2) #+begin_src emacs-lisp (disarchive (version 0) (tarball (name "test-archive.tar") (digest (sha256 "0da9fa3e7b360533678338871d9dd36f3...")) (default-header (chksum (trailer " ")) (magic "ustar ") (version " \x00") (devmajor 0 (source "" (trailer ""))) (devminor 0 (source "" (trailer ""))) (data-padding "")) ... #+end_src ** Example manifest (2/2) #+begin_src emacs-lisp (disarchive ... (headers ("test-archive/" (mode 493) (chksum 4291) (typeflag 53)) ("test-archive/file-a" (size 15) (chksum 4849)) ("test-archive/file-b" (size 15) (chksum 4850))) (padding 6656) (input (directory-ref (version 0) (name "test-archive") (addresses (swhid "swh:1:dir:902b1e94f0f5efdde6...")) (digest (sha256 "277decb2666f4832ef64a..."))))) #+end_src ** Planned integration of SWH with Disarchive *** Currently - SWH does not store Disarchive manifests yet *** Plan - Run Disarchive every time SWH loads a tarball - Store it as `(tarball-hash, directory-hash, manifest)` tuples - when someone requests `tarball-hash`, rebuild from the manifest * Current Work in Progress ** Current Work in Progress *** NixGuix Coverage in SWH - - It's missing sources due to technical limitations (bare files and directories, patches) + - It's missing sources due to technical limitations: bare files and directories, patches - Redesign in progress to deal with such limitations *** Disarchive - code dump at https://git.ngyro.com/swh/ - needs to be reviewed and merged
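The integration plan above (store `(tarball-hash, directory-hash, manifest)` tuples, then rebuild on request) can be sketched as follows. This is a minimal illustration under stated assumptions, not Disarchive's or Software Heritage's actual code: `BLOBS`, `INDEX`, `ingest`, `rebuild` and `write_tar` are hypothetical names, the manifest is a plain list of dicts rather than the real Disarchive format, and the directory hash is a placeholder rather than a real SWHID. It handles only regular files in uncompressed archives written by Python's own ustar writer; real tarballs vary in magic, padding and header quirks, which is why the example manifest records those fields.

```python
import hashlib
import io
import tarfile

# Stand-in for the archive's content store: blob hash -> file bytes.
BLOBS: dict[str, bytes] = {}
# Hypothetical index: tarball hash -> (directory hash, manifest).
INDEX: dict[str, tuple[str, list[dict]]] = {}

# Header fields this sketch records per entry (Disarchive records many more).
FIELDS = ("name", "mode", "mtime", "uid", "gid", "uname", "gname")


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def write_tar(manifest: list[dict]) -> bytes:
    """Serialize manifest entries, in recorded order, as an uncompressed ustar archive."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as tar:
        for entry in manifest:
            data = BLOBS[entry["blob"]]
            info = tarfile.TarInfo()
            for field in FIELDS:
                setattr(info, field, entry[field])
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def ingest(tarball: bytes) -> str:
    """Record a manifest when a tarball is loaded; return the tarball hash."""
    manifest = []
    with tarfile.open(fileobj=io.BytesIO(tarball), mode="r:") as tar:
        for member in tar:
            if not member.isfile():
                raise ValueError("this sketch handles regular files only")
            data = tar.extractfile(member).read()
            blob = sha256_hex(data)
            BLOBS[blob] = data
            entry = {field: getattr(member, field) for field in FIELDS}
            entry["blob"] = blob
            manifest.append(entry)
    # Placeholder directory hash (NOT a real SWHID computation).
    listing = "\n".join(sorted(f"{e['name']} {e['blob']}" for e in manifest))
    directory_hash = sha256_hex(listing.encode())
    tarball_hash = sha256_hex(tarball)
    INDEX[tarball_hash] = (directory_hash, manifest)
    return tarball_hash


def rebuild(tarball_hash: str) -> bytes:
    """Rebuild a tarball from its manifest and check it bit-for-bit via its hash."""
    _directory_hash, manifest = INDEX[tarball_hash]
    rebuilt = write_tar(manifest)
    if sha256_hex(rebuilt) != tarball_hash:
        raise ValueError("rebuilt tarball does not match the requested hash")
    return rebuilt
```

Calling `ingest(tarball_bytes)` at load time and `rebuild(tarball_hash)` later returns the original bytes as long as the manifest captures every field the writer depends on. The final hash comparison is what makes a failed reconstruction detectable instead of silently wrong.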