diff --git a/talks-public/2018-10-06-lille-pycon/2018-10-06-lille-pycon.org b/talks-public/2018-10-06-lille-pycon/2018-10-06-lille-pycon.org
index 74ed842..353f8b7 100644
--- a/talks-public/2018-10-06-lille-pycon/2018-10-06-lille-pycon.org
+++ b/talks-public/2018-10-06-lille-pycon/2018-10-06-lille-pycon.org
@@ -1,219 +1,363 @@
 #+COLUMNS: %40ITEM %10BEAMER_env(Env) %9BEAMER_envargs(Env Args) %10BEAMER_act(Act) %4BEAMER_col(Col) %10BEAMER_extra(Extra) %8BEAMER_opt(Opt)
 #+TITLE: Software Heritage
 #+SUBTITLE: The Great Library of (Python) Source Code
 #+BEAMER_HEADER: \date[06/10/2018, PyConFr]{6 Oct 2018\\PyConFr\\Lille, France}
 #+DATE: 6 October 2018
 #+INCLUDE: "../../common/modules/prelude-toc.org" :minlevel 1
 #+INCLUDE: "../../common/modules/169.org"
 #+BEAMER_HEADER: \institute[Software Heritage]{Software Heritage --- {\tt \{olasd,zack\}@softwareheritage.org}}
 #+BEAMER_HEADER: \author{Nicolas Dandrimont, Stefano Zacchiroli}
 #+LATEX_HEADER_EXTRA: \usepackage{bbding}
 #+LATEX_HEADER_EXTRA: \DeclareUnicodeCharacter{66D}{\FiveStar}
 #+LATEX_HEADER_EXTRA: \usepackage{tikz}
 #+LATEX_HEADER_EXTRA: \usetikzlibrary{arrows,shapes}
 #+LATEX_HEADER_EXTRA: \definecolor{swh-orange}{RGB}{254,205,27}
 #+LATEX_HEADER_EXTRA: \definecolor{swh-red}{RGB}{226,0,38}
 #+LATEX_HEADER_EXTRA: \definecolor{swh-green}{RGB}{77,181,174}
 # Syntax highlighting setup
 #+LATEX_HEADER_EXTRA: \usepackage{minted}
 #+LaTeX_HEADER_EXTRA: \usemintedstyle{tango}
 #+LaTeX_HEADER_EXTRA: \newminted{python}{fontsize=\scriptsize}
 #+LaTeX_HEADER_EXTRA: \newminted{html}{fontsize=\scriptsize}
 #+name: setup-minted
 #+begin_src emacs-lisp :exports results :results silent
   (setq org-latex-listings 'minted)
   (setq org-latex-minted-options
-    '(("frame" "lines")
-      ("fontsize" "\\scriptsize")
+    '(("fontsize" "\\scriptsize")
       ("linenos" "")))
   (setq org-latex-to-pdf-process
     '("pdflatex -shell-escape -interaction nonstopmode -output-directory %o %f"
       "pdflatex -shell-escape -interaction nonstopmode -output-directory %o %f"
       "pdflatex -shell-escape -interaction nonstopmode -output-directory %o %f"))
 #+end_src
 # End syntax highlighting setup
 * The Software Commons
 ** (Free) Software is everywhere
 #+latex: \begin{center}
 #+ATTR_LATEX: :width .75\linewidth
 file:software-center.pdf
 #+latex: \end{center}
 #+INCLUDE: "../../common/modules/source-code-different-short.org::#softwareisdifferent" :minlevel 2
 ** Our Software Commons
 #+INCLUDE: "../../common/modules/foss-commons.org::#commonsdef" :only-contents t
 #+BEAMER: \pause
 *** Source code is /a precious part/ of our commons \hfill are we taking care of it?
 #+INCLUDE: "../../common/modules/swh-motivations-foss.org::#main" :only-contents t :minlevel 2
 * Software Heritage
 #+INCLUDE: "../../common/modules/swh-overview-sourcecode.org::#mission" :minlevel 2
 ** Core principles
 #+latex: \begin{center}
 #+ATTR_LATEX: :width .9\linewidth
 file:SWH-as-foundation-slim.png
 #+latex: \end{center}
 #+BEAMER: \pause
 *** Open approach :B_block:BMCOL:
     :PROPERTIES:
     :BEAMER_col: 0.4
     :BEAMER_env: block
     :END:
 - 100% Free Software
 - transparency
 *** In for the long haul :B_block:BMCOL:
     :PROPERTIES:
     :BEAMER_col: 0.4
     :BEAMER_env: block
     :END:
 - replication
 - non profit
 #+INCLUDE: "../../common/modules/status-extended.org::#archivinggoals" :minlevel 2
 #+INCLUDE: "../../common/modules/status-extended.org::#architecture" :minlevel 2 :only-contents t
 #+INCLUDE: "../../common/modules/status-extended.org::#dagdetail" :minlevel 2 :only-contents t
 #+INCLUDE: "../../common/modules/status-extended.org::#archive" :minlevel 2
 # * Accessing the archive
 # #+INCLUDE: "../../common/modules/status-extended.org::#api" :only-contents t
 #+INCLUDE: "../../common/modules/status-extended.org::#apiintro" :minlevel 2
 #+INCLUDE: "../../common/modules/vault.org::#overview" :minlevel 2
 #+INCLUDE: "../../common/modules/webui.org::#intro" :minlevel 2
 * The Great Library of Python source code
 ** Data flow redux
 #+BEAMER: \begin{center}\includegraphics[width=\extblockscale{1.2\textwidth}]{swh-dataflow.pdf}\end{center}
 ** Our focus
 #+BEAMER: \begin{center}\includegraphics[width=\extblockscale{1.2\textwidth}]{swh-dataflow-pypi.pdf}\end{center}
 ** Listing all Python modules (1/3)
 ***
+#+BEAMER: \footnotesize \centering
 https://forge.softwareheritage.org/source/swh-lister/
 *** What does a Software Heritage lister do?
 - crawls and parses upstream project-listing APIs
 - generates origins (records that the project has been detected) and loading tasks
 *** Credits go to Avi Kelman for the lister scaffolding, and to Antoine Dumont for the PyPI implementation
 *** A visit of the Cheese Shop
 - A little bit more efficiently than [[https://www.youtube.com/watch?v=B3KBuQHHKx0][John Cleese]]
 - Uses https://pypi.org/simple/ (according to the warehouse docs, the only "package listing" API that's not on the way to deprecation)
 ** Listing all Python modules (2/3)
 *** GET https://pypi.org/simple/
 #+begin_src html
 Simple index
 0
 0-._.-._.-._.-._.-._.-._.-0
 [...]
 Django
 [...]
 #+end_src
 ** Listing all Python modules (3/3)
 ***
 #+begin_src python
 # Origin specification
 origin = {
     'type': 'pypi',
     'url': 'https://pypi.org/packages/Django/',  # Canonical project URL
 }
+#+end_src
+
+#+beamer: \pause
+#+begin_src python
 # Scheduler task specification
 update_task = {
     'type': 'origin-update-pypi',
     'policy': 'recurring',
     'next_run': datetime.now(tz=timezone.utc),
     'arguments': {
         'args': [
             'Django',  # Project name
             'https://pypi.org/packages/Django/',  # Origin URL
             'https://pypi.org/pypi/Django/json',  # Metadata URL
         ],
         'kwargs': {},
     },
     'priority': None,
 }
 #+end_src
 ** Task scheduling (1/2)
 ***
+#+BEAMER: \footnotesize \centering
 https://forge.softwareheritage.org/source/swh-scheduler/
 *** What does the Software Heritage scheduler do?
 - Records **recurrent** and **one-shot** jobs in a database
 - Schedules runs of these jobs, records their results
 - Manages retries for transient job failures (remote service unavailable, ...)
 - Manages adaptive intervals for recurrent jobs
 ** Task scheduling (2/2)
 *** Builds upon trusted Python tools
 - Celery is used as a task queuing middleware, and for its worker management framework
 - Workers send task results through the Celery events mechanism
-*** And makes it more reliable
+*** And makes them more useful to us
 - The database is the single source of truth
 - ~swh.scheduler.celery_backend.runner~ pulls tasks from the database into Celery, limiting the RabbitMQ queue depth (allows task prioritization)
 - ~swh.scheduler.celery_backend.listener~ fetches task results from Celery events and updates the database
+- Archival of elapsed tasks/runs/logs in Elasticsearch to keep the database
+  snappy
+
+** Loading Python packages (1/4)
+
+*** What's a Python package anyway?
+
+- Source distributions (~sdists~, currently tarballs or zips)
+- Binary distributions (~bdists~, which are mostly wheels these days)
+
+As we're interested in source code, Software Heritage looks at ~sdists~ exclusively
+
+- The current sdist format is unspecified: you probably get a tarball, which
+  maybe contains a ~setup.py~ somewhere
+- When building a sdist, distutils generates a machine-readable ~PKG-INFO~ file
+  and puts it in the tarball
+
+*** The long wait for PEP 517 ("A build-system independent format for source trees")
+
+- One uniform transport format: a gzipped tarball with one toplevel directory
+- Machine parsable data about the project by default (~pyproject.toml~)
+Hopefully soon in your nearest Cheese Shop (go help the folks in PyPA!)
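The ~PKG-INFO~ file mentioned above stores its fields as RFC 822-style headers, so the standard library can read it. Below is a minimal sketch of pulling that metadata out of an sdist tarball. It is not the code used by the Software Heritage loader. The helper names ~read_pkg_info~ and ~make_demo_sdist~ are hypothetical, the demo package is made up, and the sketch assumes a gzipped tarball with ~PKG-INFO~ directly inside a single toplevel directory.

```python
import io
import tarfile
from email.parser import Parser


def read_pkg_info(sdist_bytes):
    """Return PKG-INFO headers from a gzipped sdist as a dict of lists, or None."""
    with tarfile.open(fileobj=io.BytesIO(sdist_bytes), mode='r:gz') as tar:
        for member in tar.getmembers():
            parts = member.name.split('/')
            # Assumption: PKG-INFO sits directly inside the toplevel directory.
            if member.isfile() and len(parts) == 2 and parts[1] == 'PKG-INFO':
                raw = tar.extractfile(member).read().decode('utf-8')
                message = Parser().parsestr(raw, headersonly=True)
                metadata = {}
                # Fields such as Classifier may repeat, so collect values in lists.
                for key, value in message.items():
                    metadata.setdefault(key, []).append(value)
                return metadata
    return None


def make_demo_sdist():
    """Build a tiny in-memory sdist (made-up package) so the sketch runs offline."""
    pkg_info = (
        'Metadata-Version: 1.1\n'
        'Name: demo\n'
        'Version: 0.1\n'
        'Classifier: Intended Audience :: Information Technology\n'
        'Classifier: Programming Language :: Python\n'
    ).encode('utf-8')
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode='w:gz') as tar:
        info = tarfile.TarInfo('demo-0.1/PKG-INFO')
        info.size = len(pkg_info)
        tar.addfile(info, io.BytesIO(pkg_info))
    return buffer.getvalue()


metadata = read_pkg_info(make_demo_sdist())
print(metadata['Name'], metadata['Version'], len(metadata['Classifier']))
```

Returning ~None~ when no ~PKG-INFO~ is found reflects the point made on the slide: with the current unspecified sdist format, a consumer cannot count on the file being present.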
+
+** Loading Python packages (2/4)
+
+***
+#+BEAMER: \footnotesize \centering
+ https://forge.softwareheritage.org/source/swh-loader-pypi/
+
+*** Common loading process
+    :PROPERTIES:
+    :BEAMER_col: 0.5
+    :BEAMER_env: block
+    :BEAMER_act: +-
+    :END:
+
+Implemented in ~swh.loader.core~
+
+- Fetch metadata about current versions
+- Compare to latest loaded versions
+- Download and process versions we had never seen
+- Load new data
+
+*** PyPI specifics
+    :PROPERTIES:
+    :BEAMER_col: 0.5
+    :BEAMER_env: block
+    :BEAMER_act: +-
+    :END:
+
+Implemented in ~swh.loader.pypi~
+
+- Comparison done using the ~sdist~ digests
+- PKG-INFO metadata parsed and saved
+- versions with multiple sdists imported separately
+
+** Loading Python packages (3/4)
+
+*** PyPI snapshots
+
+#+begin_src python
+  pifpaf_snapshot = {
+      'id': b'\xc6_\xfe#\x94\xba\x81\xc3\x94\x9b\xeb[\x06\xf5JC\x0f\x19n\xa6',
+      'branches': {
+          b'releases/0.0.1': {
+          b'releases/0.0.2': {
+          ...
+          b'releases/2.1.2': {
+              'target': b'\x8a\xcd\xf3l\xee\xe50\xe2\x81]\x08:5\xd9_\xd6\xeff\xc9\xa3',
+              'target_type': 'revision',
+          },
+          b'releases/2.1.2.dev7': {
+              'target': b'hGh\x15h|\xf3\xd2v\xf8\xec-\xa7\xfeuB\xda3\x83x',
+              'target_type': 'revision',
+          },
+          b'HEAD': {
+              'target': b'releases/2.1.2',
+              'target_type': 'alias',
+          },
+      },
+  }
+#+end_src
+
+** Loading Python packages (4/4)
+*** PyPI revisions
+
+#+begin_src python -i
+pifpaf_revision = {
+    'id': b'\x8a\xcd\xf3l\xee\xe50\xe2\x81]\x08:5\xd9_\xd6\xeff\xc9\xa3',
+    'author': {
+        'name': b'Julien Danjou',
+        ...
+    },
+    'date': {
+        'timestamp': {'seconds': 1538577319, 'microseconds': 0},
+    },
+    ...
+    'type': 'tar',
+    'directory': b'\xa4\xf2\xad\xb1\xef\r\xcf\x894::@=\xf9R\x86=\x19"\\',
+    'message': b'2.1.2',
+#+end_src
+#+beamer: \pause
+#+begin_src python -i
+    'metadata': {
+        'project': {  # Metadata parsed from PKG-INFO
+            'name': 'pifpaf',
+            'author': 'Julien Danjou',
+            'license': None,
+            'summary': 'Suite of tools and fixtures to manage daemons for testing',
+            'version': '2.1.2',
+            ...
+#+end_src
+**
+***
+#+begin_src python -i
+            'classifiers': [
+                'Intended Audience :: Information Technology',
+                ...
+            ],
+            ...
+        },
+#+end_src
+#+beamer: \pause
+#+begin_src python -i
+        'original_artifact': {  # The original tarball we downloaded
+            'url': 'https://files.pythonhosted.org/packages/cc/ce/2599[...]',
+            'date': '2018-10-03T14:35:19',
+            'sha1': '00c4efc47580b5c4ad1dcdb5118159f9b057b0fd',
+            'size': 192940,
+            'sha256': 'a6eef2ae56ac90d02df5f45885973e108c960a2ea113cc76[...]',
+            'filename': 'pifpaf-2.1.2.tar.gz',
+            'sha1_git': '8ce7e3ddda336dd9edff26ae8efaf4b81439c42c',
+            'blake2s256': 'c4f7fcd4324715f4bfb54f8eefb10fde803efb7a02e2[...]',
+            'archive_type': 'tar',
+        },
+    },
+    'synthetic': True,
+    'parents': [],
+}
+#+end_src
-** Loading Python packages
 * Getting involved
 #+INCLUDE: "../../common/modules/status-extended.org::#features" :minlevel 2
 ** You can help!
 #+BEAMER: \vspace{-2mm}
 *** Coding
 | ٭٭ | Web UI improvements |
 | ٭٭٭ | loaders for unsupported VCS/package formats |
 | ٭٭٭ | listers for unsupported forges/package managers |
 #+BEAMER: \vspace{-2mm} \footnotesize \centering \url{https://forge.softwareheritage.org/} \\ \url{https://docs.softwareheritage.org/devel/}
 #+BEAMER: \pause
 *** Community
 | ٭٭٭ | spread the word, help us with sustainability |
 | ٭٭ | document endangered source code |
 #+BEAMER: \vspace{-2mm} \footnotesize \centering \url{wiki.softwareheritage.org/Suggestion_box}
 #+BEAMER: \pause
 *** Join us
 #+BEAMER: \footnotesize \centering
 - \url{www.softwareheritage.org/jobs} --- *job openings*
 - \url{wiki.softwareheritage.org/Internship} --- *internships*
 ** Conclusion
 *** Software Heritage is
 - a reference archive of *all Free Software* ever written
 - an international, open, nonprofit, *mutualized infrastructure*
 - *now accessible* to developers, users, vendors
 - at the service of our community, *at the service of society*
 *** Come in, we're open!
 \url{www.softwareheritage.org} --- general information \\
 \url{wiki.softwareheritage.org} --- internships, leads \\
 \url{forge.softwareheritage.org} --- our own code