diff --git a/docs/architecture/overview.rst b/docs/architecture/overview.rst
--- a/docs/architecture/overview.rst
+++ b/docs/architecture/overview.rst
@@ -40,24 +40,130 @@
 The latter supports a large variety of "cloud" object storage as backends, as
 well as a simple local filesystem.
 
-Task management
-^^^^^^^^^^^^^^^
-The :ref:`Scheduler ` manages the entire choreography of jobs/tasks
-in |swh|, from detecting and ingesting repositories, to extracting metadata from them,
-to repackaging repositories into small downloadable archives.
+Journal
+^^^^^^^
 
-It does this by managing its own database of tasks that need to run
-(either periodically or only once),
-and passing them to celery_ for execution on dedicated workers.
+The :term:`Journal ` is a persistent logger of every change in
+the archive, with publish-subscribe_ support, using Kafka.
 
-Listers
-^^^^^^^
+The Storage publishes a Kafka message in the journal each time a new object is
+added to the archive, and many components consume these messages to be
+notified of changes. For example, this allows the Scheduler to know when an
+origin has been visited and what the resulting status of that visit was,
+which helps decide when to visit these repositories again.
+
+It is also the foundation of the :ref:`mirror` infrastructure, as it allows
+mirrors to stay up to date.
 
-:term:`Listers ` are type of task, run by the Scheduler, aiming at scraping a
-web site, a forge, etc. to gather all the source code repositories it can
-find, also known as :term:`origins `.
-For each found source code repository, a :term:`loader` task is created.
+Source code scraping
+^^^^^^^^^^^^^^^^^^^^
+
+The infrastructure aiming at finding new source code origins (git, mercurial
+and other types of VCS, source packages, etc.) and regularly visiting them is
+built around a few components, based on a task scheduling scaffolding and a
+Celery-based asynchronous task execution framework.
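The publish-subscribe flow described in the Journal section above can be sketched with a toy in-memory model. The real Journal is Kafka-backed; the ``ToyJournal`` class is purely illustrative, though the ``origin-visit-status`` topic name matches the one mentioned in this document:

```python
# Toy in-memory model of the Journal's publish-subscribe flow.
# Real journal: Kafka topics; here, a dict of topic -> subscriber callbacks.
from collections import defaultdict
from typing import Callable, Dict, List


class ToyJournal:
    """Minimal pub-sub: the Storage publishes, components subscribe."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Callable]] = defaultdict(list)

    def subscribe(self, topic: str, callback: Callable) -> None:
        self._subscribers[topic].append(callback)

    def publish(self, topic: str, message: dict) -> None:
        # Deliver the message to every component subscribed to this topic.
        for callback in self._subscribers[topic]:
            callback(message)


# The Scheduler subscribes to visit statuses to update its visit statistics.
journal = ToyJournal()
visit_stats: list = []
journal.subscribe("origin-visit-status", visit_stats.append)

# The Storage publishes a message each time a visit status is recorded.
journal.publish("origin-visit-status",
                {"origin": "https://example.org/repo.git", "status": "full"})
```

In the real system the broker decouples producer and consumers in time as well: Kafka retains the log, so a mirror or scheduler journal client can catch up after being offline.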
+The scheduler itself consists of two parts: a generic asynchronous task
+management system, and a dedicated management database aimed at gathering and
+keeping up to date liveness information about listed origins, which is used
+to choose which of them should be visited first.
+
+To summarize, the parts involved in this carousel are:
+
+:term:`Listers `:
+  tasks that scrape a web site, a forge, etc. to gather all the source code
+  repositories they can find, also known as :term:`origins `. Lister
+  tasks are triggered by the scheduler, via Celery, and fill the listed
+  origins table of the listing and visit statistics database (see below).
+
+:term:`Loaders `:
+  tasks dedicated to importing source code from a source code repository (an
+  origin). They are the components that insert :term:`blob` objects in the
+  :term:`object storage`, and insert nodes and edges in the :ref:`graph
+  `.
+
+:ref:`Scheduler `'s generic task management:
+  manages the choreography of listing tasks in |swh|, as well as a few other
+  utility tasks (save code now, deposit, vault, indexers). Note that this
+  component no longer handles the scheduling of loading tasks. It consists
+  of a database and an API to define task types and to create tasks to be
+  scheduled (recurring or one-shot), together with a tool (the
+  ``scheduler-runner``) dedicated to spawning these tasks via the Celery
+  asynchronous execution framework, and another tool (the
+  ``scheduler-listener``) dedicated to keeping the scheduler database in
+  sync with executed tasks (task execution status, execution timestamps,
+  etc.).
+
+:ref:`Scheduler `'s listing and visit statistics:
+  a database and API storing information about the liveness of a listed
+  origin, as well as statistics about the loading of that origin. The visit
+  statistics are updated from the main :ref:`storage ` Kafka
+  journal.
+
+:ref:`Scheduler `'s origin visit scheduling:
+  a tool that uses the statistics about listed origins and previous visits
+  stored in the database to apply scheduling policies and select the next
+  pool of origins to visit. It does not use the generic task management
+  system, but instead directly spawns loading Celery tasks.
+
+
+.. thumbnail:: ../images/lister-loader-scheduling-architecture.svg
+
+
+The Scheduler
+~~~~~~~~~~~~~
+
+The :ref:`Scheduler ` manages the generic choreography of
+jobs/tasks in |swh|, namely listing origins of software source code, loading
+them, extracting metadata from loaded origins, and repackaging repositories
+into small downloadable archives for the :term:`Vault `.
+
+It consists of a database where all the scheduling information is stored, an
+API allowing unified access to this database, and a set of services and tools
+to orchestrate the actual scheduling of tasks, their execution being
+delegated to a set of Celery-based asynchronous workers.
+
+While the Scheduler was initially a single generic scheduling utility for all
+asynchronous task types, the scheduling of origin visits has now been
+extracted into a new, dedicated part of the Scheduler. These loading tasks
+used to be managed by the generic task scheduler as recurrent tasks, but the
+number of loading tasks became too large to handle efficiently, and some of
+their specificities could not be taken into account to make the scheduling of
+origin visits better and more efficient.
+
+There are now two parts in the scheduler: the original SWH Task management
+system, and the new Origin Visit scheduling utility.
+
+Both have a similar architecture at first sight: a database, an API, a
+Celery-based execution system.
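The runner/listener choreography of the original SWH Task management side can be sketched with a toy in-memory model. All class, status, and task-type names below are illustrative stand-ins, not the actual swh.scheduler API; the real runner enqueues Celery tasks and the real listener consumes Celery events:

```python
# Toy in-memory model of the generic task management described above:
# a "runner" grabs schedulable tasks (and would hand them to Celery),
# and a "listener" writes execution outcomes back to the scheduler database.
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Task:
    task_id: int
    task_type: str
    status: str = "next_run_not_scheduled"


@dataclass
class ToyScheduler:
    tasks: Dict[int, Task] = field(default_factory=dict)

    def create_task(self, task_id: int, task_type: str) -> None:
        self.tasks[task_id] = Task(task_id, task_type)

    def runner_grab(self) -> List[Task]:
        """scheduler-runner: select tasks to spawn and mark them scheduled."""
        grabbed = [t for t in self.tasks.values()
                   if t.status == "next_run_not_scheduled"]
        for t in grabbed:
            t.status = "next_run_scheduled"
        return grabbed

    def listener_update(self, task_id: int, result: str) -> None:
        """scheduler-listener: sync the execution outcome back to the DB."""
        self.tasks[task_id].status = result


scheduler = ToyScheduler()
scheduler.create_task(1, "list-gitlab-full")
spawned = scheduler.runner_grab()          # would enqueue Celery tasks here
scheduler.listener_update(1, "completed")  # reported back via Celery events
```

The split matters for the discussion that follows: this generic machinery works per-task, whereas the visit-centric side works per-origin.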
+The main difference is that the new visit-centric system is dedicated to
+origin visits, and thus can use specific information and metadata on origins
+to optimise the scheduling policy; statistics about known origins resulting
+from the listing of a forge can be used as an entry point for the scheduling
+of origin visits, according to scheduling policies that can take several
+metrics into consideration, like:
+
+- has the origin already been visited,
+
+- if not, how "old" is the origin (what is the timestamp of its first sign of
+  activity, e.g. creation date, timestamp of the first revision, etc.),
+
+- how long it has been since the origin was last visited,
+
+- how active is the origin (and thus how often it should be visited),
+
+- etc.
+
+For each new source code repository, a ``listed origin`` entry is added to
+the scheduler database, along with the timestamp of the last known activity
+for this origin as reported by the forge. For already known origins, only
+this last-activity timestamp is updated, if need be.
+
+It is then the responsibility of the ``schedule-recurrent`` scheduler service
+to check listed origins, as well as visit statistics (see below), in order to
+regularly select the next origins to visit. This service also uses live data
+from Celery to choose an appropriate number of visits to schedule (keeping
+the Celery queues filled at a constant and controlled level).
 
 The following sequence diagram shows the interactions between these
 components when a new forge needs to be archived. This example depicts the
 case of a
@@ -68,27 +174,10 @@
 As one might observe in this diagram, it does two things:
 
 - it asks the forge (a gitlab_ instance in this case) the list of known
-  repositories, and
-
-- it insert one :term:`loader` task for each source code repository that will
-  be in charge of importing the content of that repository.
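The scheduling metrics listed earlier could be combined into a priority score along these lines. This is a hedged sketch only: the function name, weights, and tie-breaking are illustrative assumptions, not the actual ``schedule-recurrent`` policy:

```python
# Illustrative visit-priority heuristic: never-visited origins first,
# then origins by how stale their last visit is; origins with no known
# activity since their last visit come last.
from datetime import datetime, timezone
from typing import Optional


def visit_priority(last_update: datetime,
                   last_visit: Optional[datetime],
                   now: datetime) -> float:
    """Higher score = visit sooner."""
    if last_visit is None:
        # Never visited: always ahead of already-visited origins.
        return float("inf")
    if last_update <= last_visit:
        # No known activity since the last visit: lowest priority.
        return 0.0
    # Otherwise, rank by how long the visit has been stale.
    return (now - last_visit).total_seconds()


now = datetime(2022, 1, 10, tzinfo=timezone.utc)
new_origin = visit_priority(datetime(2022, 1, 1, tzinfo=timezone.utc),
                            None, now)
stale = visit_priority(datetime(2022, 1, 9, tzinfo=timezone.utc),
                       datetime(2022, 1, 5, tzinfo=timezone.utc), now)
quiet = visit_priority(datetime(2021, 1, 1, tzinfo=timezone.utc),
                       datetime(2022, 1, 5, tzinfo=timezone.utc), now)
```

A real policy would additionally cap the batch size using the live Celery queue lengths mentioned above, so queues stay filled at a controlled level.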
-
-Note that most listers usually work in incremental mode, meaning they store in a
-dedicated database the current state of the listing of the forge. Then, on a subsequent
-execution of the lister, it will ask only for new repositories.
-
-Also note that if the lister inserts a new loading task for a repository for which a
-loading task already exists, the existing task will be updated (if needed) instead of
-creating a new task.
-
-Loaders
-^^^^^^^
-
-:term:`Loaders ` are also a type of task, but aim at importing or
-updating a source code repository. It is the one that inserts :term:`blob`
-objects in the :term:`object storage`, and inserts nodes and edges in the
-:ref:`graph `.
+  repositories as well as some metadata (especially last update timestamp), and
+- it inserts one ``listed origin`` for each new source code repository found,
+  or updates the ``last update`` timestamp for the origin.
 
 The sequence diagram below describe this second step of importing the content
 of a repository. Once again, we take the example of a git repository, but any
@@ -97,21 +186,6 @@
 .. thumbnail:: ../images/tasks-git-loader.svg
 
-Journal
-^^^^^^^
-
-The last core component is the :term:`Journal `, which is a persistent logger
-of every change in the archive, with publish-subscribe_ support, using Kafka.
-
-The Storage writes to it every time a new object is added to the archive;
-and many components read from it to be notified of these changes.
-For example, it allows the Scheduler to know how often software repositories are
-updated by their developers, to decide when next to visit these repositories.
-
-It is also the foundation of the :ref:`mirror` infrastructure, as it allows
-mirrors to stay up to date.
-
-
..
_architecture-tier-2:

Other major components

diff --git a/docs/images/lister-loader-scheduling-architecture.svg b/docs/images/lister-loader-scheduling-architecture.svg
new file mode 100644
--- /dev/null
+++ b/docs/images/lister-loader-scheduling-architecture.svg
@@ -0,0 +1,3340 @@
[SVG markup elided: new architecture diagram of lister/loader scheduling. It
shows the Scheduler (a generic Task API and a new origin-visit-centric Visits
API), the scheduler-runner and scheduler-listener services handling generic
SWH tasks, the schedule-recurrent service grabbing origins to visit and
creating loader Celery tasks, lister and loader Celery workers, the Storage
API, and the Kafka journal, whose origin-visit and origin-visit-status topics
are consumed by the scheduler journal client to update visit statistics. The
numbered steps run from (0) "listing SWH tasks are added by the user" through
(12) "update origin visit stats".]

diff --git a/docs/images/tasks-git-loader.uml b/docs/images/tasks-git-loader.uml
--- a/docs/images/tasks-git-loader.uml
+++ b/docs/images/tasks-git-loader.uml
@@ -1,20 +1,21 @@
 @startuml
-    participant SCH_DB as "scheduler DB" #B0C4DE
-    participant SCH_RUN as "scheduler runner"
-    participant SCH_LS as "scheduler listener"
-    participant RMQ as "Rabbit-MQ"
-    participant OBJSTORE as "object storage"
-    participant STORAGE_DB as "storage DB" #B0C4DE
+    participant SCH_DB as "scheduler visits API" #B0C4DE
+    participant SCH_RUN as "scheduler schedule-recurrent"
+    participant SCH_JC as "scheduler journal-client"
+    participant RMQ as "Celery queues"
+    participant JOURNAL as "Kafka journal"
+    participant STORAGE_DB as "storage DB"
     participant STORAGE_API as "storage API"
+    participant OBJSTORE as "object storage"
     participant WORK_GIT as "worker@loader-git"
     participant GIT as "git server"
 
-    Note over SCH_DB,SCH_RUN: Task T2 created beforehand \n by the lister-gitlab task
+    Note over SCH_DB,SCH_RUN: Listed-Origin O1 created beforehand \n by the lister-gitlab task
 
     loop Polling
-        SCH_RUN->>SCH_DB: GET TASK set state=scheduled
-        SCH_DB-->>SCH_RUN: TASK id=T2
         activate SCH_RUN
-        SCH_RUN->>RMQ: CREATE Celery Task CT2 loader-git
+        SCH_RUN->>SCH_DB: GET origins/grab_next visit-type=git
+        SCH_DB-->>SCH_RUN: ORIGIN url=O1
+        SCH_RUN->>RMQ: CREATE Celery Task CT2 loader-git url=O1
         deactivate SCH_RUN
         activate RMQ
     end
@@ -23,6 +24,18 @@
     deactivate RMQ
     activate WORK_GIT
 
+    WORK_GIT->>STORAGE_API: ADD origin-visit url=O1
+    activate STORAGE_API
+    STORAGE_API->>STORAGE_DB: INSERT Origin
+    STORAGE_API->>JOURNAL: publish message in 'origin' topic
+    STORAGE_API->>STORAGE_DB: INSERT OriginVisit url=O1
+    STORAGE_API->>JOURNAL: publish message in 'origin-visit' topic
+    STORAGE_DB-->>STORAGE_API: OriginVisit id=V1
+    STORAGE_API->>JOURNAL: publish message in 'origin-visit-status' topic
+    STORAGE_API->>STORAGE_DB: INSERT OriginVisitStatus url=O1 visit=V1 status=created date=now
+    STORAGE_API-->>WORK_GIT: 201
+    deactivate STORAGE_API
+
     WORK_GIT->>STORAGE_API: GET origin state
     activate STORAGE_API
     STORAGE_API-->>WORK_GIT: 200
@@ -46,31 +59,36 @@
     WORK_GIT->>STORAGE_API: LOAD NEW CONTENT
     activate STORAGE_API
     loop For each blob
+        STORAGE_API->>STORAGE_DB: ADD CONTENT
         STORAGE_API->>OBJSTORE: ADD BLOB
+        STORAGE_API->>JOURNAL: publish message in 'content' topic
     end
     STORAGE_API-->>WORK_GIT: 200 / blobs
     deactivate STORAGE_API
 
-    WORK_GIT->>STORAGE_API: NEW DIR
+    WORK_GIT->>STORAGE_API: NEW DIRECTORY
     activate STORAGE_API
-    loop For each DIR
-        STORAGE_API->>STORAGE_DB: INSERT DIR
+    loop For each DIRECTORY
+        STORAGE_API->>STORAGE_DB: INSERT DIRECTORY
+        STORAGE_API->>JOURNAL: publish message in 'directory' topic
     end
     STORAGE_API-->>WORK_GIT: 201
     deactivate STORAGE_API
 
-    WORK_GIT->>STORAGE_API: NEW REV
+    WORK_GIT->>STORAGE_API: NEW REVISION
     activate STORAGE_API
-    loop For each REV
-        STORAGE_API->>STORAGE_DB: INSERT REV
+    loop For each REVISION
+        STORAGE_API->>STORAGE_DB: INSERT REVISION
+        STORAGE_API->>JOURNAL: publish message in 'revision' topic
    end
     STORAGE_API-->>WORK_GIT: 201
     deactivate STORAGE_API
 
-    WORK_GIT->>STORAGE_API: NEW REL
+    WORK_GIT->>STORAGE_API: NEW RELEASE
     activate STORAGE_API
-    loop For each REL
-        STORAGE_API->>STORAGE_DB: INSERT REL
+    loop For each RELEASE
+        STORAGE_API->>STORAGE_DB: INSERT RELEASE
+        STORAGE_API->>JOURNAL: publish message in 'release' topic
     end
     STORAGE_API-->>WORK_GIT: 201
     deactivate STORAGE_API
@@ -79,16 +97,23 @@
     activate STORAGE_API
     loop For each SNAPSHOT
         STORAGE_API->>STORAGE_DB: INSERT SNAPSHOT
+        STORAGE_API->>JOURNAL: publish message in 'snapshot' topic
     end
     STORAGE_API-->>WORK_GIT: 201
     deactivate STORAGE_API
 
-    WORK_GIT-->>RMQ: SET CT2 status=eventful
+    WORK_GIT->>STORAGE_API: ADD origin-visit-status url=O1 visit=V1 snapshot=S1 status=full
+    activate STORAGE_API
+    STORAGE_API->>STORAGE_DB: INSERT OriginVisitStatus url=O1 visit=V1 status=full date=now
+    activate JOURNAL
+    STORAGE_API->>JOURNAL: publish message in 'origin-visit-status' topic
+    STORAGE_API-->>WORK_GIT: 201
+    deactivate STORAGE_API
     deactivate WORK_GIT
-    activate RMQ
-    RMQ->>SCH_LS: NOTIFY end of task CT2
-    deactivate RMQ
-    activate SCH_LS
-    SCH_LS->>SCH_DB: UPDATE T2 set state=end
-    deactivate SCH_LS
+
+    activate SCH_JC
+    JOURNAL->>SCH_JC: consume message from 'origin-visit-status' topic
+    deactivate JOURNAL
+    SCH_JC->>SCH_DB: UPDATE VISIT url=O1
+    deactivate SCH_JC
 @enduml
diff --git a/docs/images/tasks-lister.uml b/docs/images/tasks-lister.uml
--- a/docs/images/tasks-lister.uml
+++ b/docs/images/tasks-lister.uml
@@ -2,6 +2,7 @@
     participant WEB as "swh-web"
     participant SCH_API as "scheduler API" #ECECFF
     participant SCH_DB as "scheduler DB" #B0C4DE
+    participant SCH_VST_DB as "scheduler visits DB" #B0C4DE
     participant SCH_RUN as "scheduler runner"
     participant RMQ as "Rabbit-MQ"
     participant SCH_LS as "scheduler listener"
@@ -36,9 +37,10 @@
     deactivate GITLAB
 
     loop For Each Repo
-        WORK_GITLAB->>SCH_API: CREATE TASK loader-git
+        WORK_GITLAB->>SCH_API: CREATE LISTED_ORIGIN Repo.URL
         activate SCH_API
-        SCH_API->>SCH_DB: INSERT TASK
+        SCH_API->>SCH_VST_DB: [IF NEEDED] INSERT LISTED_ORIGIN Repo.URL
+        SCH_API->>SCH_VST_DB: UPDATE LISTED_ORIGIN(LAST_UPDATE)
         SCH_API-->>WORK_GITLAB: 201
         deactivate SCH_API
     end
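The upsert performed by the lister in the diagram above (``[IF NEEDED] INSERT LISTED_ORIGIN`` then ``UPDATE LISTED_ORIGIN(LAST_UPDATE)``) can be sketched as follows. This is a hedged toy: the function name is hypothetical and an in-memory dict stands in for the scheduler visits database:

```python
# Toy sketch of the lister-side upsert of listed origins: insert an origin
# if unknown, otherwise only refresh its last-known-activity timestamp.
from datetime import datetime, timezone
from typing import Dict

# Stand-in for the scheduler visits DB: origin URL -> last known activity.
listed_origins: Dict[str, datetime] = {}


def record_listed_origin(url: str, last_update: datetime) -> None:
    known = listed_origins.get(url)
    if known is None or last_update > known:
        listed_origins[url] = last_update


record_listed_origin("https://example.org/repo.git",
                     datetime(2022, 1, 1, tzinfo=timezone.utc))
# A later listing only refreshes the timestamp; no duplicate entry is created.
record_listed_origin("https://example.org/repo.git",
                     datetime(2022, 2, 1, tzinfo=timezone.utc))
```

This idempotent behavior is what lets listers run repeatedly (and incrementally) without flooding the scheduler with duplicate origins.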