diff --git a/docs/index.rst b/docs/index.rst
index 04afc88..5935f97 100644
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -1,153 +1,168 @@
.. _swh-scheduler:

Software Heritage - Job scheduler
=================================

Task manager for asynchronous/delayed tasks, used for both recurrent
activities (e.g., listing a forge, loading new content from a Git repository)
and one-off activities (e.g., loading a specific version of a source package).


Description
-----------

This module provides a scheduler service for the Software Heritage platform.
It lets you define tasks with a number of properties. In this documentation,
we will call these swh-tasks to prevent confusion. These swh-tasks are stored
in a database, and an HTTP-based RPC service is provided to create new
swh-task declarations or find existing ones.

The execution model for these swh-tasks uses Celery: each swh-task type
defined in the database must have one (or a series of) Celery worker(s)
capable of executing such swh-tasks.

A number of services are also provided to manage the scheduling of these
swh-tasks as Celery tasks.

The `scheduler-runner` service is a daemon that regularly looks for swh-tasks
in the database that should be scheduled. For each selected swh-task, a
Celery task is instantiated.

The `scheduler-listener` service is a daemon that listens to the Celery event
bus and maintains the workflow status of scheduled swh-tasks.


SWH Task Model
~~~~~~~~~~~~~~

Each swh-task-type is the declaration of a type of swh-task. Each
swh-task-type has the following fields:

- `type`: Name of the swh-task type; can be anything but must be unique,
- `description`: Human-readable task description,
- `backend_name`: Name of the task in the job-running backend,
- `default_interval`: Default interval for newly scheduled tasks,
- `min_interval`: Minimum interval between two runs of a task,
- `max_interval`: Maximum interval between two runs of a task,
- `backoff_factor`: Adjustment factor for the backoff between two task runs,
- `max_queue_length`: Maximum length of the queue for this type of task,
- `num_retries`: Default number of retries on transient failures,
- `retry_delay`: Retry delay for the task.

Existing swh-task-types can be listed using the `swh scheduler` command line
tool::

    $ swh scheduler task-type list
    Known task types:
    check-deposit:
      Pre-checking deposit step before loading into swh archive
    index-fossology-license:
      Fossology license indexer task
    load-git:
      Update an origin of type git
    load-hg:
      Update an origin of type mercurial

You can see the details of a swh-task-type::

    $ swh scheduler task-type list -v -t load-git
    Known task types:
    load-git: swh.loader.git.tasks.UpdateGitRepository
      Update an origin of type git
      interval: 64 days, 0:00:00 [12:00:00, 64 days, 0:00:00]
      backoff_factor: 2.0
      max_queue_length: 5000
      num_retries: None
      retry_delay: None

A swh-task is an 'instance' of such a swh-task-type and consists of:

- `arguments`: Arguments passed to the underlying job scheduler,
- `next_run`: Time on or after which the next run of this task should be
  scheduled,
- `current_interval`: Interval between two runs of this task, taking into
  account the backoff factor,
- `policy`: Whether the task is "one-shot" or "recurring",
- `retries_left`: Number of "short delay" retries of the task in case of
  transient failure,
- `priority`: Priority of the task,
- `id`: Internal task identifier,
- `type`: References the task_type table,
- `status`: Task status (among "next_run_not_scheduled",
  "next_run_scheduled", "completed", "disabled").

So a swh-task basically consists of:

- a set of parameters defining how the scheduling of the swh-task is handled,
- a set of parameters to specify the retry policy in case of transient
  failure upon execution,
- a set of parameters that defines the job to be done (`backend_name` +
  `arguments`).
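+For illustration, here is a minimal sketch of declaring the `load-git`
+swh-task-type above through the scheduler API. It mirrors the
+`create_task_type` calls made in this patch's test fixtures, filled in with
+the `load-git` values shown earlier; the database connection string is an
+illustrative assumption::
+
+    from datetime import timedelta
+
+    from swh.scheduler import get_scheduler
+
+    scheduler = get_scheduler('local', {'db': 'dbname=swh-scheduler'})
+    scheduler.create_task_type({
+        'type': 'load-git',
+        'description': 'Update an origin of type git',
+        'backend_name': 'swh.loader.git.tasks.UpdateGitRepository',
+        'default_interval': timedelta(days=64),
+        'min_interval': timedelta(hours=12),
+        'max_interval': timedelta(days=64),
+        'backoff_factor': 2.0,
+        'max_queue_length': 5000,
+    })
+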
You can list pending swh-tasks (tasks that are to be scheduled ASAP)::

    $ swh scheduler task list-pending load-git --limit 2
    Found 1 load-git tasks

    Task 9052257
      Next run: 15 days ago (2019-06-25 10:35:10+00:00)
      Interval: 2 days, 0:00:00
      Type: load-git
      Policy: recurring
      Args:
        'https://github.com/turtl/mobile'
      Keyword args:

Looking for existing swh-tasks can be done via the command line tool::

    $ swh scheduler task list -t load-hg --limit 2
    Found 2 tasks

    Task 168802702
      Next run: in 4 hours (2019-07-10 17:56:48+00:00)
      Interval: 1 day, 0:00:00
      Type: load-hg
      Policy: recurring
      Status: next_run_not_scheduled
      Priority:
      Args:
        'https://bitbucket.org/kepung/pypy'
      Keyword args:

    Task 169800445
      Next run: in a month (2019-08-10 17:54:24+00:00)
      Interval: 32 days, 0:00:00
      Type: load-hg
      Policy: recurring
      Status: next_run_not_scheduled
      Priority:
      Args:
        'https://bitbucket.org/lunixbochs/pypy-1'
      Keyword args:

+
+Writing a new worker for a new swh-task-type
+--------------------------------------------
+
+When you want to add a new swh-task-type, you need a Celery worker backend
+capable of executing instances of this new task type.
+
+Celery workers for swh-scheduler based tasks should be started using the
+Celery app in `swh.scheduler.celery_backend.config`. The latter, among other
+things, provides a loading mechanism for task types based on plugins declared
+under the `[swh.workers]` pkg_resources entry point.
+
+TODO: add a fully working example of a dumb task.
+
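+In the meantime, here is a minimal, hypothetical sketch of such a plugin; the
+package name `swh.foo`, its `register` function and its `ping` task are all
+illustrative assumptions, not an existing worker. The plugin declares its
+entry point in its `setup.py`::
+
+    from setuptools import setup
+
+    setup(
+        name='swh.foo',
+        # ...
+        entry_points='''
+            [swh.workers]
+            foo=swh.foo:register
+        ''',
+    )
+
+The registration function returns a dict whose `task_modules` key lists the
+modules defining the Celery tasks::
+
+    # swh/foo/__init__.py
+    def register():
+        return {'task_modules': ['swh.foo.tasks']}
+
+    # swh/foo/tasks.py
+    from celery import shared_task
+
+    @shared_task(name=__name__ + '.Ping')
+    def ping():
+        return 'OK'
+
+A worker consuming such tasks can then be started with, e.g.,
+`celery worker -A swh.scheduler.celery_backend.config.app`.
+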
""" if loglevel is None: loglevel = logging.DEBUG if isinstance(loglevel, str): loglevel = logging._nameToLevel[loglevel] formatter = logging.Formatter(format) root_logger = logging.getLogger('') root_logger.setLevel(logging.INFO) log_target = os.environ.get('SWH_LOG_TARGET', 'console') if log_target == 'console': log_console = True elif log_target == 'journal': log_journal = True # this looks for log levels *higher* than DEBUG if loglevel <= logging.DEBUG and log_console is None: log_console = True if log_console: color_formatter = ColorFormatter(format) if colorize else formatter console = logging.StreamHandler() console.setLevel(logging.DEBUG) console.setFormatter(color_formatter) root_logger.addHandler(console) if log_journal: systemd_journal = JournalHandler() systemd_journal.setLevel(logging.DEBUG) systemd_journal.setFormatter(formatter) root_logger.addHandler(systemd_journal) logging.getLogger('celery').setLevel(logging.INFO) # Silence amqp heartbeat_tick messages logger = logging.getLogger('amqp') logger.addFilter(lambda record: not record.msg.startswith( 'heartbeat_tick')) logger.setLevel(loglevel) # Silence useless "Starting new HTTP connection" messages logging.getLogger('urllib3').setLevel(logging.WARNING) logging.getLogger('swh').setLevel(loglevel) # get_task_logger makes the swh tasks loggers children of celery.task logging.getLogger('celery.task').setLevel(loglevel) return loglevel @celeryd_after_setup.connect def setup_queues_and_tasks(sender, instance, **kwargs): """Signal called on worker start. This automatically registers swh.scheduler.task.Task subclasses as available celery tasks. This also subscribes the worker to the "implicit" per-task queues defined for these task classes. """ logger.info('Setup Queues & Tasks for %s', sender) instance.app.conf['worker_name'] = sender @Panel.register def monotonic(state): """Get the current value for the monotonic clock""" return {'monotonic': _monotonic()} def route_for_task(name, args, kwargs, options, task=None, **kw): """Route tasks according to the task_queue attribute in the task class""" if name is not None and name.startswith('swh.'): return {'queue': name} def get_queue_stats(app, queue_name): """Get the statistics regarding a queue on the broker. Arguments: queue_name: name of the queue to check Returns a dictionary raw from the RabbitMQ management API; or `None` if the current configuration does not use RabbitMQ. Interesting keys: - Consumers (number of consumers for the queue) - messages (number of messages in queue) - messages_unacknowledged (number of messages currently being processed) Documentation: https://www.rabbitmq.com/management.html#http-api """ conn_info = app.connection().info() if conn_info['transport'] == 'memory': # We're running in a test environment, without RabbitMQ. 


def register_task_class(app, name, cls):
    """Register a class-based task under the given name"""
    if name in app.tasks:
        return

    task_instance = cls()
    task_instance.name = name
    app.register_task(task_instance)


INSTANCE_NAME = os.environ.get(CONFIG_NAME_ENVVAR)
CONFIG_NAME = os.environ.get('SWH_CONFIG_FILENAME')
CONFIG = {}

if CONFIG_NAME:
    # load the celery config from the main config file given as
    # SWH_CONFIG_FILENAME environment variable.
    # This is expected to have a [celery] section in which we have the
    # celery specific configuration.
    SWH_CONFIG.clear()
    SWH_CONFIG.update(load_named_config(CONFIG_NAME))
    CONFIG = SWH_CONFIG.get('celery')

if not CONFIG:
    # otherwise, back to compat config loading mechanism
    if INSTANCE_NAME:
        CONFIG_NAME = CONFIG_NAME_TEMPLATE % INSTANCE_NAME
    else:
        CONFIG_NAME = DEFAULT_CONFIG_NAME

    # Load the Celery config
    CONFIG = load_named_config(CONFIG_NAME, DEFAULT_CONFIG)

+CONFIG.setdefault('task_modules', [])
+# load task modules declared as plugin entry points
+for entrypoint in pkg_resources.iter_entry_points('swh.workers'):
+    worker_register_fn = entrypoint.load()
+    # The registration function is expected to return a dict whose
+    # 'task_modules' key is a string (or a list of strings) naming the
+    # Python module(s) in which Celery tasks are defined.
+    task_modules = worker_register_fn().get('task_modules', [])
+    CONFIG['task_modules'].extend(task_modules)
+
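+# Illustrative configuration file (hypothetical values) consumed by the
+# loading code above via the SWH_CONFIG_FILENAME environment variable:
+#
+#   celery:
+#     task_broker: amqp://guest@localhost//
+#     task_modules:
+#       - swh.loader.git.tasks
+#     task_queues:
+#       - swh.loader.git.tasks.UpdateGitRepository
+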
# Celery Queues
CELERY_QUEUES = [Queue('celery', Exchange('celery'), routing_key='celery')]

CELERY_DEFAULT_CONFIG = dict(
    # Timezone configuration: all in UTC
    enable_utc=True,
    timezone='UTC',
    # Imported modules
    imports=CONFIG.get('task_modules', []),
    # Time (in seconds, or a timedelta object) for when after stored task
    # tombstones will be deleted. None means to never expire results.
    result_expires=None,
    # A string identifying the default serialization method to use. Can
    # be json (default), pickle, yaml, msgpack, or any custom
    # serialization methods that have been registered with
    # kombu.serialization.registry
    task_serializer='msgpack',
    # Result serialization format
    result_serializer='msgpack',
    # Late ack means the task messages will be acknowledged after the task
    # has been executed, not just before, which is the default behavior.
    task_acks_late=True,
    # A string identifying the default serialization method to use.
    # Can be pickle (default), json, yaml, msgpack or any custom serialization
    # methods that have been registered with kombu.serialization.registry
    accept_content=['msgpack', 'json'],
    # If True the task will report its status as “started”
    # when the task is executed by a worker.
    task_track_started=True,
    # Default compression used for task messages. Can be gzip, bzip2
    # (if available), or any custom compression schemes registered
    # in the Kombu compression registry.
    # result_compression='bzip2',
    # task_compression='bzip2',
    # Disable all rate limits, even if tasks have explicit rate limits set.
    # (Disabling rate limits altogether is recommended if you don’t have any
    # tasks using them.)
    worker_disable_rate_limits=True,
    # Task routing
    task_routes=route_for_task,
    # Allow pool restarts from remote
    worker_pool_restarts=True,
    # Do not prefetch tasks
    worker_prefetch_multiplier=1,
    # Send events
    worker_send_task_events=True,
    # Do not send useless task_sent events
    task_send_sent_event=False,
)


def build_app(config=None):
    config = merge_configs(
        {k: v for (k, (_, v)) in DEFAULT_CONFIG.items()},
        config or {})

    config['task_queues'] = CELERY_QUEUES + [
        Queue(queue, Exchange(queue), routing_key=queue)
        for queue in config.get('task_queues', ())]
    logger.debug('Creating a Celery app with %s', config)

    # Instantiate the Celery app
    app = Celery(broker=config['task_broker'],
                 task_cls='swh.scheduler.task:SWHTask')
    app.add_defaults(CELERY_DEFAULT_CONFIG)
    app.add_defaults(config)
    return app


app = build_app(CONFIG)

# XXX for BW compat
Celery.get_queue_length = get_queue_length
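+
+# Usage sketch (hypothetical task module and queue name): a worker-specific
+# Celery app can also be built programmatically:
+#
+#   from swh.scheduler.celery_backend.config import build_app
+#   worker_app = build_app({
+#       'task_broker': 'amqp://guest@localhost//',
+#       'task_modules': ['swh.loader.git.tasks'],
+#       'task_queues': ['swh.loader.git.tasks.UpdateGitRepository'],
+#   })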
diff --git a/swh/scheduler/tests/conftest.py b/swh/scheduler/tests/conftest.py
index db13bb9..e4e9e05 100644
--- a/swh/scheduler/tests/conftest.py
+++ b/swh/scheduler/tests/conftest.py
@@ -1,96 +1,100 @@
import os
import pytest
import glob
from datetime import timedelta
+import pkg_resources

from swh.core.utils import numfile_sortkey as sortkey
from swh.scheduler import get_scheduler
from swh.scheduler.tests import SQL_DIR

# make sure we are not fooled by CELERY_ config environment vars
for var in [x for x in os.environ.keys() if x.startswith('CELERY')]:
    os.environ.pop(var)

import swh.scheduler.celery_backend.config  # noqa
# this import is needed here to enforce creation of the celery current app
# BEFORE the swh_app fixture is called, otherwise the Celery app instance from
# celery_backend.config becomes the celery.current_app

# test_cli tests depend on an en/C locale, so ensure it
os.environ['LC_ALL'] = 'C.UTF-8'

DUMP_FILES = os.path.join(SQL_DIR, '*.sql')

# celery tasks for testing purposes; the tasks themselves should be
# in swh/scheduler/tests/tasks.py
TASK_NAMES = ['ping', 'multiping', 'add', 'error']


@pytest.fixture(scope='session')
def celery_enable_logging():
    return True


@pytest.fixture(scope='session')
def celery_includes():
-    return [
+    task_modules = [
        'swh.scheduler.tests.tasks',
    ]
+    for entrypoint in pkg_resources.iter_entry_points('swh.workers'):
+        task_modules.extend(entrypoint.load()().get('task_modules', []))
+    return task_modules


@pytest.fixture(scope='session')
def celery_parameters():
    return {
        'task_cls': 'swh.scheduler.task:SWHTask',
    }


@pytest.fixture(scope='session')
def celery_config():
    return {
        'accept_content': ['application/x-msgpack', 'application/json'],
        'task_serializer': 'msgpack',
        'result_serializer': 'json',
    }


# override the celery_session_app fixture to monkeypatch the 'main'
# swh.scheduler.celery_backend.config.app Celery application
# with the test application
@pytest.fixture(scope='session')
def swh_app(celery_session_app):
    swh.scheduler.celery_backend.config.app = celery_session_app
    yield celery_session_app


@pytest.fixture
def swh_scheduler(request, postgresql_proc, postgresql):
    scheduler_config = {
        'db': 'postgresql://{user}@{host}:{port}/{dbname}'.format(
            host=postgresql_proc.host,
            port=postgresql_proc.port,
            user='postgres',
            dbname='tests')
    }
    all_dump_files = sorted(glob.glob(DUMP_FILES), key=sortkey)

    cursor = postgresql.cursor()
    for fname in all_dump_files:
        with open(fname) as fobj:
            cursor.execute(fobj.read())
    postgresql.commit()

    scheduler = get_scheduler('local', scheduler_config)
    for taskname in TASK_NAMES:
        scheduler.create_task_type({
            'type': 'swh-test-{}'.format(taskname),
            'description': 'The {} testing task'.format(taskname),
            'backend_name': 'swh.scheduler.tests.tasks.{}'.format(taskname),
            'default_interval': timedelta(days=1),
            'min_interval': timedelta(hours=6),
            'max_interval': timedelta(days=12),
        })

    return scheduler
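For reference, a test exercising these fixtures might look like the following
sketch (hypothetical: it assumes the `ping` task defined in
`swh.scheduler.tests.tasks` returns 'OK', and it relies on the
`celery_session_worker` fixture from Celery's pytest plugin)::

    def test_ping(swh_app, celery_session_worker):
        res = swh_app.send_task('swh.scheduler.tests.tasks.ping')
        assert res.get(timeout=10) == 'OK'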