diff --git a/talks-public/2020-12-16-scheduler-refactoring/2020-12-16-scheduler-refactoring.org b/talks-public/2020-12-16-scheduler-refactoring/2020-12-16-scheduler-refactoring.org new file mode 100644 index 0000000..634f146 --- /dev/null +++ b/talks-public/2020-12-16-scheduler-refactoring/2020-12-16-scheduler-refactoring.org @@ -0,0 +1,205 @@ +#+COLUMNS: %40ITEM %10BEAMER_env(Env) %9BEAMER_envargs(Env Args) %10BEAMER_act(Act) %4BEAMER_col(Col) %10BEAMER_extra(Extra) %8BEAMER_opt(Opt) +#+TITLE: Refactoring the SWH scheduler and listers +#+SUBTITLE: for better handling of recurrent origin visit tasks +#+BEAMER_HEADER: \date[16 Dec 2020]{16 December 2020\\tech talk - \sout{Inria Paris}the Internet} +#+AUTHOR: Nicolas Dandrimont +#+DATE: 16 December 2020 +#+EMAIL: olasd@softwareheritage.org + +#+INCLUDE: "../../common/modules/prelude.org" :minlevel 1 +#+OPTIONS: ^:nil +#+INCLUDE: "../../common/modules/169.org" +#+BEAMER_HEADER: \pgfdeclareimage[height=90mm,width=160mm]{bgd}{world-169.png} +#+BEAMER_HEADER: \titlegraphic{} +#+BEAMER_HEADER: \institute[Fondation Inria]{Fondation Inria} +#+BEAMER_HEADER: \author[Nicolas Dandrimont]{Nicolas Dandrimont\\ {\small\tt olasd@softwareheritage.org\\ @olasd}} +#+BEAMER_HEADER: \hypersetup{colorlinks,urlcolor=blue,linkcolor=magenta,citecolor=red,linktocpage=true} + +* Introduction +** Some history + +*** In the beginning, there was... + - First listing of GitHub: [[https://forge.softwareheritage.org/rDLSe44226544aeee9b794afbd8834f07624a799ff02][ad-hoc scripts by zack]] + - First git clones: basic Celery worker setup ([[https://forge.softwareheritage.org/source/swh-cloner-git/][swh-cloner-git]] repository) + - First imports in the archive: basic Celery worker setup setup too (see early [[https://forge.softwareheritage.org/source/swh-loader-git/][swh-loader-git]] history) + +*** How to future-proof this infra? + - What to do for recurrent imports? + - Lots of Celery limitations quickly apparent: + - no real support of (adaptive) recurrent tasks in celery + - rabbitmq is generally FIFO: no task priorities + - recurrent data-loss on single-node rabbitmq setup with lots of messages in flight + +* The Software Heritage scheduler + +#+INCLUDE: "../../common/modules/status-extended.org::#dataflow" :minlevel 2 + +** swh-scheduler original goals / design +*** Goals + - "persistence layer" for recurrent task definitions around Celery + - adaptive recurrence interval according to task results + - single source of truth for scheduling + - secondary: implement priorities and automatic retry of tasks +*** Design + - core: storage of task definitions (type, args, queue position, recurrence interval) + - scheduler runner: sends next tasks to run to the celery queues + - scheduler listener: updates tasks in db from celery events + - celery/rabbitmq: + - only a buffer for tasks between database queries + - used for its worker management framework + +** a peek at the database +*** +#+begin_src +> select * from task where id=1; +-[ RECORD 1 ]----+--------------------------------------- +id | 1 +type | load-git +arguments | {"args": [], "kwargs": {"url": "https: + //github.com/hylang/hy"}} +next_run | 2020-03-06 17:11:05.482501+00 +current_interval | 12:00:00 +status | next_run_not_scheduled +policy | recurring +retries_left | 0 +priority | + +Temps : 103,580 ms +#+end_src + +** five year retrospective of swh-scheduler +*** Some good + + - OK permanence layer for celery, which has some useful features for worker mgmt + - when used sparingly, task priorities work fine (e.g. save code now) + +*** lots of drawbacks, however + + - poor introspectability: + - difficult to index on free-form task arguments + - inscrutable queue positions and adaptive recurrence: + - currently processing =load-git= tasks scheduled to run 11 months ago + - impossible to know at what point recurrent =load-git= tasks inserted now will run + - no input for external information about task scheduling + - *tons* of useless task runs for repos not updated in years + - the celery events feedback loop is a hack + - the events queue isn't persistent by default, and we struggle to work around this + +* Software Heritage lister infrastructure +** Current swh-lister design (1/2) +*** Basic operation + - iterate all pages of the upstream API + - insert records for origins found, in an ad-hoc database/table for each lister + - generate recurrent tasks for origin visits in swh.scheduler +*** Two main modes of operation + - incremental: if possible, only get "new" pages of results from the upstream API + - full: list the upstream API completely again, updating the stored information + +** Current swh-lister design (2/2) + +*** Design iterations: + - Originally based on the one-off GitHub listing scripts by zack + - Generalized from GitHub + BitBucket by fiendish in 2016 + - extracted common patterns useful to write listers (http, rate limiting, etc.) + - extracted a common, overridable database schema from the GitHub and BitBucket commonalities + - Grew lots of tentacles to implement a bunch of listers (13 different kinds of upstreams supported) + +** swh-lister design shortcomings (1/2) +*** + :PROPERTIES: + :BEAMER_env: alertblock + :END: + +#+BEAMER: \begin{center} + (strong) opinions ahead +#+BEAMER: \end{center} + +*** /deep/ and /wide/ inheritance hierarchy + - lots of subtly different mixins to implement common functionality + - that end up being overridden to handle peculiarities of every upstream + - lots of copy/paste to get a working lister + - debugging is quite painful + +*** Way too much magic in tests + - based on UnitTest with a fairly opaque base class + - Provide two (good/bad) api responses and you're done... + - ...but it's not clear what's covered or not when reading the tests for a given lister + +** swh-lister design shortcomings (2/2) + +*** Unhelpful generic database schema + - generic but needs very specific overrides + - lots of GH-specific/useless fields + - hard to do cross-cutting analysis of listed origins + +*** Supposed to be an "easy" entry point for new contributors + - all in all, pretty hard to actually implement anything + +* Scheduler/lister refactoring +** Scheduler for recurrent origin visits +*** Tracking task + [[https://forge.softwareheritage.org/T2345][T2345]] + +*** Scope + Only handle recurrent origin visit tasks. "One-shot" tasks are out of scope. + +*** Design elements + - A single, unified storage for lister state and listed origins, within the scheduler database + - Implemented, to be used by refactored listers + - *TODO*: A cache for quick, bulk access to information about the status of a given origin in the archive + - *TODO*: A scheduling policy component merging information from the two previous tables to send tasks for processing in workers + +** Scheduler for recurrent origin visits data model + +*** Lister + :PROPERTIES: + :BEAMER_env: block + :BEAMER_col: 0.5 + :END: + - id :: uuid + - name :: str (f.e. "github", "phabricator") + - instance_name :: str (f.e. "softwareheritage") + - current_state :: dict + - created :: timestamp of creation + - updated :: timestamp of last update + +*** ListedOrigin + :PROPERTIES: + :BEAMER_env: block + :BEAMER_col: 0.5 + :END: + - lister_id :: uuid + - url :: str + - visit_type :: str + - extra_loader_arguments :: Dict[str, str] + - last_update :: timestamp of last update (if provided upstream) + - enabled :: bool + - first_seen :: timestamp of earliest listing + - last_seen :: timestamp of latest listing + +** Lister refactoring +*** Forge references +[[https://forge.softwareheritage.org/T2453][https://forge.softwareheritage.org/T2453]] and open diffs [[https://forge.softwareheritage.org/D3425][D3425]], [[https://forge.softwareheritage.org/D3526][D3526]], [[https://forge.softwareheritage.org/D3527][D3527]], [[https://forge.softwareheritage.org/D4700][D4700]], [[https://forge.softwareheritage.org/D4705][D4705]], [[https://forge.softwareheritage.org/D4706][D4706]] + +*** Scope + - Replace direct recurrent task scheduling with the new swh-scheduler based lister storage + - Drop the ad-hoc database schemas for listers + - Improve/clarify test coverage + +*** Current status + - Base patterns implemented (with state storage: [[https://forge.softwareheritage.org/D3425][D3425]], stateless: [[https://forge.softwareheritage.org/D4705][D4705]]) + - GitHub lister reimplemented, with full test coverage ([[https://forge.softwareheritage.org/D3527][D3527]]) + - Phabricator lister reimplemented, with no test coverage ([[https://forge.softwareheritage.org/D4706][D4706]]) + +* Conclusion +** A call for help! + +*** All alone in the rabbit hole + - Progress on this has been (very, very) slow + - Even if the updated listers land, they'll write their data to a dead-end table +*** Multiple tasks can be distributed + - Review of the current code + - Implementation of the rest of the scheduler components + - And of course the implementation of more listers, as well as hopefully some refactoring of common behaviors... +*** + Maybe a sprint topic to get us started on 2021? diff --git a/talks-public/2020-12-16-scheduler-refactoring/Makefile b/talks-public/2020-12-16-scheduler-refactoring/Makefile new file mode 100644 index 0000000..68fbee7 --- /dev/null +++ b/talks-public/2020-12-16-scheduler-refactoring/Makefile @@ -0,0 +1 @@ +include ../Makefile.slides