diff --git a/talks-public/2022-09-16-Guix/2022-09-16.org b/talks-public/2022-09-16-Guix/2022-09-16.org
index aca3398..5f0c260 100644
--- a/talks-public/2022-09-16-Guix/2022-09-16.org
+++ b/talks-public/2022-09-16-Guix/2022-09-16.org
@@ -1,380 +1,384 @@
 #+COLUMNS: %40ITEM %10BEAMER_env(Env) %9BEAMER_envargs(Env Args) %10BEAMER_act(Act) %4BEAMER_col(Col) %10BEAMER_extra(Extra) %8BEAMER_opt(Opt)
 #+KEYWORDS: software heritage reproducibility guix
 #+TITLE: Software Heritage and Guix
 #+SUBTITLE: Software Heritage to the rescue of reproducible Science
 #+AUTHOR: vlorentz, ardumont
 #+EMAIL: vlorentz@softwareheritage.org, ardumont@softwareheritage.org
 #+DATE: 16 Sep 2022
 #+BEAMER_HEADER: \date[16/09/2022]{16/09/2022\\Event 10 years of Guix, Paris 2022}
 #+BEAMER_HEADER: \author{Valentin Lorentz (@vlorentz) / Antoine R. Dumont (@ardumont)}
 #+BEAMER_HEADER: \institute[Software Heritage]{Software Engineers, Software Heritage\\Inria}
 #+BEAMER_HEADER: \setbeameroption{hide notes}
 #+LATEX_HEADER: \usepackage{tcolorbox}
 #+LATEX_HEADER: \definecolor{links}{HTML}{2A1B81}
 #+LATEX_HEADER: \hypersetup{colorlinks,linkcolor=,urlcolor=links}
 # Syntax highlighting setup
 #+LATEX_HEADER_EXTRA: \usepackage{minted}
 #+LaTeX_HEADER_EXTRA: \usemintedstyle{emacs}
 #+name: setup-minted
 #+begin_src emacs-lisp :exports results :results silent
 (setq org-latex-listings 'minted)
 (setq org-latex-to-pdf-process
       '("pdflatex -shell-escape -interaction nonstopmode -output-directory %o %f"
         "pdflatex -shell-escape -interaction nonstopmode -output-directory %o %f"
         "pdflatex -shell-escape -interaction nonstopmode -output-directory %o %f"))
 (add-to-list 'org-latex-minted-langs '("emacs-lisp" "common-lisp"))
 #+end_src
 # End syntax highlighting setup
 #
 # prelude.org contains all the information needed to export the main beamer latex source
 # use prelude-toc.org to get the table of contents
 #
 #+INCLUDE: "../../common/modules/prelude-toc.org" :minlevel 1
 #+INCLUDE: "../../common/modules/169.org"
 # +LaTeX_CLASS_OPTIONS: [aspectratio=169,handout,xcolor=table]
 #+LATEX_HEADER: \usepackage{bbding}
 #+LATEX_HEADER: \DeclareUnicodeCharacter{66D}{\FiveStar}
 #
 # If you want to change the title logo it's here
 #
 # +BEAMER_HEADER: \titlegraphic{\includegraphics[width=0.5\textwidth]{SWH-logo}}
 # aspect ratio can be changed, but the slides need to be adapted
 # - compute a "resizing factor" for the images (macro for picblocks?)
 #
 # set the background image
 #
 # https://pacoup.com/2011/06/12/list-of-true-169-resolutions/
 #
 #+BEAMER_HEADER: \pgfdeclareimage[height=90mm,width=160mm]{bgd}{swh-world-169.png}
 #+BEAMER_HEADER: \setbeamertemplate{background}{\pgfuseimage{bgd}}
 #+LATEX: \addtocounter{framenumber}{-1}
 * Introduction: the Software Heritage project
 ** What is SoftwareHeritage?
 :PROPERTIES:
 :CUSTOM_ID: spread
 :END:
 #+latex: \begin{center}
 #+ATTR_LATEX: :width \extblockscale{.6\linewidth}
 file:SWH-logo+motto.pdf
 #+latex: \end{center}
 The Universal Source Code Archive
 ** Why an archive? Software is spread all around
 :PROPERTIES:
 :CUSTOM_ID: spread
 :END:
 #+latex: \begin{flushleft}
 #+ATTR_LATEX: :width \extblockscale{.5\linewidth}
 file:myriadsources.png
 #+latex: \end{flushleft}
 *** Fashion victims
 - many development platforms (popular forges: Guix, PyPI, npm, ...)
 - various distribution places (standalone forges: gitlab, heptapod, cgit, gitea...)
 - projects tend to migrate from one place to another over time
+#+BEAMER: \pause
+
 *** One place... :B_block:
 :PROPERTIES:
 :BEAMER_env: block
 :END:
 \hfill ... where can we find, track and search /all/ source code, rebuild tarballs?
 ** Why an archive? Software is fragile
 :PROPERTIES:
 :CUSTOM_ID: fragile
 :END:
 #+latex: \begin{flushleft}
 #+ATTR_LATEX: :width \extblockscale{.5\linewidth}
 file:fragilecloud.png
 #+latex: \end{flushleft}
 *** Like all digital information, FOSS is fragile
 # - inconsiderate and/or malicious code loss (e.g., Code Spaces)
 - link rot: projects are created, moved around, removed
 - data rot: physical media with legacy software decay
 - business-driven code loss (e.g. Gitorious, Google Code, Bitbucket, ...)
+#+BEAMER: \pause
+
 *** If a website disappears you go to the Internet Archive... :B_block:
 :PROPERTIES:
 :BEAMER_env: block
 :END:
 \hfill where do you go if (a repository on) GitHub or GitLab goes away?
 ** Software Heritage in a Nutshell
 #+latex: \begin{center}
 #+ATTR_LATEX: :width \extblockscale{.6\linewidth}
 file:SWH-logo+motto.pdf
 #+latex: \end{center}
 *** Main Objectives
 - *Collect*, *Preserve* and *Share*
 ** Collect / Preserve
 *** Reference catalog
 :PROPERTIES:
 :BEAMER_env: block
 :BEAMER_COL: .3
 :END:
 #+BEGIN_EXPORT latex
 \begin{center}
 \includegraphics[width=.4\linewidth]{myriadsources}
 \end{center}
 #+END_EXPORT
 *find* and *reference* all software source code
 #+BEAMER: \pause
 *** Universal archive
 :PROPERTIES:
 :BEAMER_env: block
 :BEAMER_COL: .3
 :END:
 #+BEGIN_EXPORT latex
 \begin{center}
 \includegraphics[width=.4\linewidth]{fragilecloud}
 \end{center}
 #+END_EXPORT
 *preserve* *forever* archived software source code
 ** Share
 *** Research infrastructure :B_block:
 :PROPERTIES:
 :BEAMER_COL: .3
 :BEAMER_env: block
 :END:
 #+BEGIN_EXPORT latex
 \begin{center}
 \includegraphics[width=.4\linewidth]{atacama-telescope}
 \end{center}
 #+END_EXPORT
 - *enable analysis* of software source code
 - make every piece *identifiable*
 - and freely *available*...
 #+BEAMER: \pause
 *** Reproducibility :B_block:
 :PROPERTIES:
 :BEAMER_COL: .3
 :BEAMER_env: block
 :END:
 #+BEGIN_EXPORT latex
 \begin{center}
 \includegraphics[width=.4\linewidth]{atacama-telescope}
 \end{center}
 #+END_EXPORT
 ... *exactly* as it was when archived (as much as possible)
 ** Our principles
 :PROPERTIES:
 :CUSTOM_ID: principlesstatus
 :END:
 #+latex: \begin{center}
 #+ATTR_LATEX: :width .6\linewidth
 file:SWH-as-foundation-slim.png
 #+latex: \end{center}
 #+latex: \footnotesize\vspace{-3mm}
 #+latex: \centering
 #+ATTR_LATEX: :width \extblockscale{.8\linewidth}
 [[file:2022-09-14-archive-growth.png]]
 ** Under the hood: Automation, and storage
 :PROPERTIES:
 :CUSTOM_ID: automation
 :END:
 #+BEAMER: \begin{center}
 #+BEAMER: \only<1>{\includegraphics[width=\extblockscale{\textwidth}]{swh-dataflow-merkle.pdf}}
 #+BEAMER: \end{center}
 /Global development history/ *permanently archived* in a *uniform data model*
 - over *12 billion* unique source files from over *180 million* software projects
 - *~900 TB* (uncompressed) blobs, *~25 B* nodes, *~300 B* edges
 * Reference archived code with SWHIDs
 ** Meet the SWHID intrinsic identifiers
 :PROPERTIES:
 :CUSTOM_ID: oneslide
 :END:
 #+LATEX: \centering
 #+LATEX: \only<1>{\includegraphics[width=\linewidth]{SWHID-v1.4_3.png}}
 #+LATEX: \forcebeamerend \vspace{-6mm}
 ** SWHID: A worked example
 #+LATEX: \centering\forcebeamerstart
 #+LATEX: \only<1>{\colorbox{white}{\includegraphics[width=\extblockscale{\linewidth}]{git-merkle/merkle_1.pdf}}}
 #+LATEX: \only<2>{\colorbox{white}{\includegraphics[width=\extblockscale{\linewidth}]{git-merkle/contents.pdf}}}
 #+LATEX: \only<3>{\colorbox{white}{\includegraphics[width=\extblockscale{\linewidth}]{git-merkle/merkle_2_contents.pdf}}}
 #+LATEX: \only<4>{\colorbox{white}{\includegraphics[width=\extblockscale{\linewidth}]{git-merkle/directories.pdf}}}
 #+LATEX: \only<5>{\colorbox{white}{\includegraphics[width=\extblockscale{\linewidth}]{git-merkle/merkle_3_directories.pdf}}}
 #+LATEX: \only<6>{\colorbox{white}{\includegraphics[width=\extblockscale{\linewidth}]{git-merkle/revisions.pdf}}}
 #+LATEX: \only<7>{\colorbox{white}{\includegraphics[width=\extblockscale{\linewidth}]{git-merkle/merkle_4_revisions.pdf}}}
 #+LATEX: \only<8>{\colorbox{white}{\includegraphics[width=\extblockscale{\linewidth}]{git-merkle/releases.pdf}}}
 #+LATEX: \only<9>{\colorbox{white}{\includegraphics[width=\extblockscale{\linewidth}]{git-merkle/merkle_5_releases.pdf}}}
 #+LATEX: \only<10>{\colorbox{white}{\includegraphics[width=\extblockscale{\linewidth}]{git-merkle/snapshots.pdf}}}
 #+LATEX: \forcebeamerend
 * Collaboration Guix / SWH
 ** How does this relate to Guix?
 - Nothing is eternal, source code (in all forms) disappears
 - Hopefully, SWH keeps a copy of everything
 - Since November 2018, [[https://www.softwareheritage.org/2019/04/18/software-heritage-and-gnu-guix-join-forces-to-enable-long-term-reproducibility/][Guix ensures source code is archived in SWH when building]]
 - After source code actually disappears, falls back to SWH when rebuilding
 ** History of Guix / SWH integration
 - 2018: Guix / SWH to ensure source code artifacts are pushed to swh
 - 2019: TWEAG / Guix / SWH: Work on a new loader to regularly ingest new artifacts
 - 2022: ongoing work to refactor current loader into a standard lister/loader
 ** Reproducibility is of the essence!
 *** Report
 - Tarballs will disappear (give it enough time)
 - Persistent intrinsic identifiers (SWHID) are not (yet?) package manager standard
 - Guix (and other) package managers reference tarball hashes
 #+begin_src emacs-lisp
 (define-public ...
   (package
     ...
     (source (origin
               (method url-fetch)
               (uri (string-append "https://..." version ".tar.gz"))
               (sha256
                (base32
                 "03mwi1l3354x52nar..."))))
     ...
 #+end_src
 *** Conclusion
 - make (non-specific swh) SWHID standard or rebuild original bit-by-bit tarball
 ** How to rebuild original tarballs?
 *** pristine-tar
 - https://manpages.debian.org/bullseye/pristine-tar/pristine-tar.1.en.html
 - xdelta: binary diffs of tar headers' content and order
 - zgz: guessing compression parameters
 - problem: brittle, large diffs
 * Enters... Disarchive
 ** How it started
 *** Discussions
 - "gforge.inria.fr to be taken off-line in Dec. 2020" https://issues.guix.gnu.org/42162
 - "lookup ingested tarballs by container checksum" https://forge.softwareheritage.org/T2430
 *** New software
 - Disarchive by Timothy Sample https://git.ngyro.com/disarchive/
 ** How it works:
 *** Principles
 - Manifest of tarball fields: entry order, PAX headers, ...
 - References to individual file hashes
 - WIP: guessing compression parameters/implementations (using zgz)
 - -> rebuild original ~.tar~, then original ~.tar.{gz,xz}~
 ** Example manifest (1/2)
 *** Example manifest (1/2)
 #+begin_src emacs-lisp
 (disarchive
   (version 0)
   (tarball
     (name "test-archive.tar")
     (digest
       (sha256
         "0da9fa3e7b360533678338871d9dd36f3..."))
     (default-header
       (chksum (trailer " "))
       (magic "ustar ")
       (version " \x00")
       (devmajor 0 (source "" (trailer "")))
       (devminor 0 (source "" (trailer "")))
       (data-padding ""))
     ...
 #+end_src
 ** Example manifest (2/2)
 *** Example manifest (2/2)
 #+begin_src emacs-lisp
 (disarchive
   ...
     (headers
       ("test-archive/"
         (mode 493)
         (chksum 4291)
         (typeflag 53))
       ("test-archive/file-a" (size 15) (chksum 4849))
       ("test-archive/file-b" (size 15) (chksum 4850)))
     (padding 6656)
     (input
       (directory-ref
         (version 0)
         (name "test-archive")
         (addresses
           (swhid "swh:1:dir:902b1e94f0f5efdde6..."))
         (digest
           (sha256 "277decb2666f4832ef64a...")))))
 #+end_src
 ** Planned integration of SWH with Disarchive
 *** Currently
 - SWH does not store Disarchive manifests yet
 *** Plan
 - Run Disarchive every time SWH loads a tarball
 - Store it as ~(tarball-hash, directory-hash, manifest)~ tuples
 - when someone requests ~tarball-hash~, rebuild from the manifest
 * Current Work in Progress
 ** NixGuix manifests coverage in SWH
 *** goal: 100% coverage
 - currently missing sources due to technical limitations: bare files, directories, patches
 - Redesign in progress to deal with such limitations
 #+latex: \centering
 #+ATTR_LATEX: :width \extblockscale{.8\linewidth}
 file:ngyro-com-pog-reports-guix-coverage-2022-09-14.png
 ** Disarchive
 *** Integration
 - code dump at https://git.ngyro.com/swh/
 - needs to be reviewed and merged
 * The End
 ** Questions? And thanks for your time!
 ** Copyright
 Copyright of images included in this document is held by their respective owners.
 The source of this document is available at https://forge.softwareheritage.org/source/slides/