diff --git a/talks-public/2018-10-06-lille-pycon/2018-10-06-lille-pycon.org b/talks-public/2018-10-06-lille-pycon/2018-10-06-lille-pycon.org index bdaf900..365a053 100644 --- a/talks-public/2018-10-06-lille-pycon/2018-10-06-lille-pycon.org +++ b/talks-public/2018-10-06-lille-pycon/2018-10-06-lille-pycon.org @@ -1,363 +1,374 @@ #+COLUMNS: %40ITEM %10BEAMER_env(Env) %9BEAMER_envargs(Env Args) %10BEAMER_act(Act) %4BEAMER_col(Col) %10BEAMER_extra(Extra) %8BEAMER_opt(Opt) #+TITLE: Software Heritage #+SUBTITLE: The Great Library of (Python) Source Code #+BEAMER_HEADER: \date[06/10/2018, PyConFr]{6 Oct 2018\\PyConFr - Lille, France} #+DATE: 6 October 2018 #+INCLUDE: "../../common/modules/prelude-toc.org" :minlevel 1 #+INCLUDE: "../../common/modules/169.org" #+BEAMER_HEADER: \institute[Software Heritage]{Software Heritage --- {\tt \{olasd,zack\}@softwareheritage.org}} #+BEAMER_HEADER: \author{Nicolas Dandrimont, Stefano Zacchiroli} #+LATEX_HEADER_EXTRA: \usepackage{bbding} #+LATEX_HEADER_EXTRA: \DeclareUnicodeCharacter{66D}{\FiveStar} #+LATEX_HEADER_EXTRA: \usepackage{tikz} #+LATEX_HEADER_EXTRA: \usetikzlibrary{arrows,shapes} #+LATEX_HEADER_EXTRA: \definecolor{swh-orange}{RGB}{254,205,27} #+LATEX_HEADER_EXTRA: \definecolor{swh-red}{RGB}{226,0,38} #+LATEX_HEADER_EXTRA: \definecolor{swh-green}{RGB}{77,181,174} # Syntax highlighting setup #+LATEX_HEADER_EXTRA: \usepackage{minted} #+LaTeX_HEADER_EXTRA: \usemintedstyle{tango} #+LaTeX_HEADER_EXTRA: \newminted{python}{fontsize=\scriptsize} #+LaTeX_HEADER_EXTRA: \newminted{html}{fontsize=\scriptsize} #+name: setup-minted #+begin_src emacs-lisp :exports results :results silent (setq org-latex-listings 'minted) (setq org-latex-minted-options '(("fontsize" "\\scriptsize") ("linenos" ""))) (setq org-latex-to-pdf-process '("pdflatex -shell-escape -interaction nonstopmode -output-directory %o %f" "pdflatex -shell-escape -interaction nonstopmode -output-directory %o %f" "pdflatex -shell-escape -interaction nonstopmode -output-directory 
%o %f")) #+end_src # End syntax highlighting setup * The Software Commons ** (Free) Software is everywhere #+latex: \begin{center} #+ATTR_LATEX: :width .75\linewidth file:software-center.pdf #+latex: \end{center} #+INCLUDE: "../../common/modules/source-code-different-short.org::#softwareisdifferent" :minlevel 2 ** Our Software Commons #+INCLUDE: "../../common/modules/foss-commons.org::#commonsdef" :only-contents t #+BEAMER: \pause *** Source code is /a precious part/ of our commons \hfill are we taking care of it? #+INCLUDE: "../../common/modules/swh-motivations-foss.org::#main" :only-contents t :minlevel 2 * Software Heritage #+INCLUDE: "../../common/modules/swh-overview-sourcecode.org::#mission" :minlevel 2 ** Core principles #+latex: \begin{center} #+ATTR_LATEX: :width .9\linewidth file:SWH-as-foundation-slim.png #+latex: \end{center} #+BEAMER: \pause *** Open approach :B_block:BMCOL: :PROPERTIES: :BEAMER_col: 0.4 :BEAMER_env: block :END: - 100% Free Software - transparency *** In for the long haul :B_block:BMCOL: :PROPERTIES: :BEAMER_col: 0.4 :BEAMER_env: block :END: - replication - non profit #+INCLUDE: "../../common/modules/status-extended.org::#archivinggoals" :minlevel 2 #+INCLUDE: "../../common/modules/status-extended.org::#architecture" :minlevel 2 :only-contents t #+INCLUDE: "../../common/modules/status-extended.org::#dagdetail" :minlevel 2 :only-contents t #+INCLUDE: "../../common/modules/status-extended.org::#archive" :minlevel 2 # * Accessing the archive # #+INCLUDE: "../../common/modules/status-extended.org::#api" :only-contents t #+INCLUDE: "../../common/modules/status-extended.org::#apiintro" :minlevel 2 #+INCLUDE: "../../common/modules/vault.org::#overview" :minlevel 2 #+INCLUDE: "../../common/modules/webui.org::#intro" :minlevel 2 * The Great Library of Python source code ** Data flow redux #+BEAMER: \begin{center}\includegraphics[width=\extblockscale{1.2\textwidth}]{swh-dataflow.pdf}\end{center} ** Our focus #+BEAMER: 
\begin{center}\includegraphics[width=\extblockscale{1.2\textwidth}]{swh-dataflow-pypi.pdf}\end{center} ** Listing all Python modules (1/3) *** #+BEAMER: \footnotesize \centering https://forge.softwareheritage.org/source/swh-lister/ *** What does a Software Heritage lister do? - crawls and parses upstream project-listing APIs - generates origins (records that the project has been detected) and loading tasks + +#+beamer: \pause *** Credits go to Avi Kelman for the lister scaffolding, and to Antoine Dumont for the PyPI implementation +#+beamer: \pause + + *** A visit to the Cheese Shop - A little bit more efficiently than [[https://www.youtube.com/watch?v=B3KBuQHHKx0][John Cleese]] - Uses https://pypi.org/simple/ (according to the warehouse docs, the only "package listing" API that's not on the way to deprecation) ** Listing all Python modules (2/3) *** GET https://pypi.org/simple/ #+begin_src html Simple index 0 0-._.-._.-._.-._.-._.-._.-0 [...] Django [...] #+end_src ** Listing all Python modules (3/3) *** #+begin_src python # Origin specification origin = { 'type': 'pypi', 'url': 'https://pypi.org/packages/Django/', # Canonical project URL } #+end_src #+beamer: \pause #+begin_src python # Scheduler task specification update_task = { 'type': 'origin-update-pypi', 'policy': 'recurring', 'next_run': datetime.now(tz=timezone.utc), 'arguments': { 'args': [ 'Django', # Project name 'https://pypi.org/packages/Django/', # Origin URL 'https://pypi.org/pypi/Django/json', # Metadata URL ], 'kwargs': {}, }, 'priority': None, } #+end_src ** Task scheduling (1/2) *** #+BEAMER: \footnotesize \centering https://forge.softwareheritage.org/source/swh-scheduler/ *** What does the Software Heritage scheduler do? - Records **recurrent** and **one-shot** jobs in a database - Schedules runs of these jobs, records their results - Manages retries for transient job failures (remote service unavailable, ...)
- Manages adaptive intervals for recurrent jobs ** Task scheduling (2/2) *** Builds upon trusted Python tools - Celery is used as a task queuing middleware, and for its worker management framework - Workers send task results through the Celery events mechanism +#+beamer: \pause + *** And makes them more useful to us - The database is the single source of truth - ~swh.scheduler.celery_backend.runner~ pulls tasks from the database into Celery, limiting the RabbitMQ queue depth (allows task prioritization) - ~swh.scheduler.celery_backend.listener~ fetches task results from Celery events and updates the database - Archival of elapsed tasks/runs/logs in Elasticsearch to keep the database snappy ** Loading Python packages (1/4) *** What's a Python package anyway? - Source distributions (~sdists~, currently tarballs or zips) - Binary distributions (~bdists~, which are mostly wheels these days) As we're interested in source code, Software Heritage looks at ~sdists~ exclusively +#+beamer: \pause + - The current sdist format is unspecified: you probably get a tarball, which maybe contains a ~setup.py~ somewhere - When building a sdist, distutils generates a machine-readable ~PKG-INFO~ file and puts it in the tarball +#+beamer: \pause + *** The long wait for PEP 517 ("A build-system independent format for source trees") - One uniform transport format: a gzipped tarball with one toplevel directory - Machine-parsable data about the project by default (~pyproject.toml~) Hopefully soon in your nearest Cheese Shop (go help the folks in PyPA!)
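*** Sketch: reading ~PKG-INFO~ out of an sdist
The ~PKG-INFO~ file mentioned above is a block of RFC 822-style headers, so the standard library can parse it. What follows is a minimal sketch, not the ~swh.loader.pypi~ implementation: the helper names ~read_pkg_info~ and ~make_demo_sdist~ are hypothetical, and the code assumes a gzipped tarball with ~PKG-INFO~ sitting directly under a single toplevel directory, which the current sdist format does not guarantee.
#+begin_src python
import io
import tarfile
from email.parser import Parser


def read_pkg_info(sdist_bytes):
    """Return the parsed PKG-INFO headers of a gzipped sdist, or None."""
    with tarfile.open(fileobj=io.BytesIO(sdist_bytes), mode='r:gz') as tar:
        for member in tar.getmembers():
            # Assumption: PKG-INFO lives right under the toplevel directory
            parts = member.name.split('/')
            if len(parts) == 2 and parts[1] == 'PKG-INFO':
                data = tar.extractfile(member).read().decode('utf-8')
                return Parser().parsestr(data)
    return None


def make_demo_sdist():
    """Build a tiny in-memory sdist so the sketch runs without network access."""
    pkg_info = b'Metadata-Version: 1.1\nName: demo\nVersion: 0.1\n'
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode='w:gz') as tar:
        info = tarfile.TarInfo('demo-0.1/PKG-INFO')
        info.size = len(pkg_info)
        tar.addfile(info, io.BytesIO(pkg_info))
    return buf.getvalue()


metadata = read_pkg_info(make_demo_sdist())
print(metadata['Name'], metadata['Version'])
#+end_src
A tarball without a ~PKG-INFO~ at that location makes ~read_pkg_info~ return ~None~, so callers need to handle the case where no metadata is found.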
** Loading Python packages (2/4) *** #+BEAMER: \footnotesize \centering https://forge.softwareheritage.org/source/swh-loader-pypi/ *** Common loading process :PROPERTIES: :BEAMER_col: 0.5 :BEAMER_env: block :BEAMER_act: +- :END: Implemented in ~swh.loader.core~ - Fetch metadata about current versions - Compare to latest loaded versions - Download and process versions we had never seen - Load new data *** PyPI specifics :PROPERTIES: :BEAMER_col: 0.5 :BEAMER_env: block :BEAMER_act: +- :END: Implemented in ~swh.loader.pypi~ - Comparison done using the ~sdist~ digests - PKG-INFO metadata parsed and saved - versions with multiple sdists imported separately ** Loading Python packages (3/4) *** PyPI snapshots #+begin_src python pifpaf_snapshot = { 'id': b'\xc6_\xfe#\x94\xba\x81\xc3\x94\x9b\xeb[\x06\xf5JC\x0f\x19n\xa6', 'branches': { b'releases/0.0.1': { b'releases/0.0.2': { ... b'releases/2.1.2': { 'target': b'\x8a\xcd\xf3l\xee\xe50\xe2\x81]\x08:5\xd9_\xd6\xeff\xc9\xa3', 'target_type': 'revision', }, b'releases/2.1.2.dev7': { 'target': b'hGh\x15h|\xf3\xd2v\xf8\xec-\xa7\xfeuB\xda3\x83x', 'target_type': 'revision', }, b'HEAD': { 'target': b'releases/2.1.2', 'target_type': 'alias', }, }, } #+end_src ** Loading Python packages (4/4) *** PyPI revisions #+begin_src python -i pifpaf_revision = { 'id': b'\x8a\xcd\xf3l\xee\xe50\xe2\x81]\x08:5\xd9_\xd6\xeff\xc9\xa3', 'author': { 'name': b'Julien Danjou', ... }, 'date': { 'timestamp': {'seconds': 1538577319, 'microseconds': 0}, }, ... 'type': 'tar', 'directory': b'\xa4\xf2\xad\xb1\xef\r\xcf\x894::@=\xf9R\x86=\x19"\\', 'message': b'2.1.2', #+end_src #+beamer: \pause #+begin_src python -i 'metadata': { 'project': { # Metadata parsed from PKG-INFO 'name': 'pifpaf', 'author': 'Julien Danjou', 'license': None, 'summary': 'Suite of tools and fixtures to manage daemons for testing', 'version': '2.1.2', ... #+end_src ** *** #+begin_src python -i 'classifiers': [ 'Intended Audience :: Information Technology', ... ], ... 
}, #+end_src #+beamer: \pause #+begin_src python -i 'original_artifact': { # The original tarball we downloaded 'url': 'https://files.pythonhosted.org/packages/cc/ce/2599[...]', 'date': '2018-10-03T14:35:19', 'sha1': '00c4efc47580b5c4ad1dcdb5118159f9b057b0fd', 'size': 192940, 'sha256': 'a6eef2ae56ac90d02df5f45885973e108c960a2ea113cc76[...]', 'filename': 'pifpaf-2.1.2.tar.gz', 'sha1_git': '8ce7e3ddda336dd9edff26ae8efaf4b81439c42c', 'blake2s256': 'c4f7fcd4324715f4bfb54f8eefb10fde803efb7a02e2[...]', 'archive_type': 'tar', }, }, 'synthetic': True, 'parents': [], } #+end_src * Getting involved #+INCLUDE: "../../common/modules/status-extended.org::#features" :minlevel 2 ** You can help! #+BEAMER: \vspace{-2mm} *** Coding | ٭٭ | Web UI improvements | | ٭٭٭ | loaders for unsupported VCS/package formats | | ٭٭٭ | listers for unsupported forges/package managers | #+BEAMER: \vspace{-2mm} \footnotesize \centering \url{https://forge.softwareheritage.org/} \\ \url{https://docs.softwareheritage.org/devel/} #+BEAMER: \pause *** Community | ٭٭٭ | spread the word, help us with sustainability | | ٭٭ | document endangered source code | #+BEAMER: \vspace{-2mm} \footnotesize \centering \url{wiki.softwareheritage.org/Suggestion_box} #+BEAMER: \pause *** Join us #+BEAMER: \footnotesize \centering - \url{www.softwareheritage.org/jobs} --- *job openings* - \url{wiki.softwareheritage.org/Internship} --- *internships* ** Conclusion *** Software Heritage is - a reference archive of *all Free Software* ever written - an international, open, nonprofit, *mutualized infrastructure* - *now accessible* to developers, users, vendors - at the service of our community, *at the service of society* *** Come in, we're open! \url{www.softwareheritage.org} --- general information \\ \url{wiki.softwareheritage.org} --- internships, leads \\ \url{forge.softwareheritage.org} --- our own code