diff --git a/docs/api/use-cases.rst b/docs/api/use-cases.rst --- a/docs/api/use-cases.rst +++ b/docs/api/use-cases.rst @@ -226,6 +226,7 @@ the ``X-Check-SWHID`` HTTP header to be set to the value of the SWHID of the deposit's content (returned after the deposit finished loading). +.. _use-case-metadata-only-deposit: Metadata-only deposit --------------------- diff --git a/docs/images/deposit-workflow-checking.uml b/docs/images/deposit-workflow-checking.uml new file mode 100644 --- /dev/null +++ b/docs/images/deposit-workflow-checking.uml @@ -0,0 +1,34 @@ +@startuml + participant DEPOSIT as "deposit API" + participant DEPOSIT_DATABASE as "deposit DB" + participant CHECKER_TASK as "checker task" + participant CELERY as "celery" + participant SCHEDULER as "swh-scheduler" + + activate DEPOSIT + activate DEPOSIT_DATABASE + activate CELERY + activate SCHEDULER + + SCHEDULER ->> CELERY: new "check-deposit"\ntask available + CELERY ->> CHECKER_TASK: start task + activate CHECKER_TASK + + CHECKER_TASK ->> DEPOSIT: GET /{collection}/{deposit_id}/check/ + + DEPOSIT ->> DEPOSIT_DATABASE: get deposit requests + DEPOSIT_DATABASE ->> DEPOSIT: deposit requests + + loop for each request + DEPOSIT ->> DEPOSIT_DATABASE: get archive + DEPOSIT_DATABASE ->> DEPOSIT: archive content + DEPOSIT ->> DEPOSIT: check archive in the request + end + + DEPOSIT ->> DEPOSIT_DATABASE: mark deposit as "verified" + DEPOSIT ->> SCHEDULER: schedule load + DEPOSIT ->> CHECKER_TASK: done + CHECKER_TASK ->> CELERY: done + deactivate CHECKER_TASK + CELERY ->> SCHEDULER: done +@enduml diff --git a/docs/images/deposit-workflow-loading.uml b/docs/images/deposit-workflow-loading.uml new file mode 100644 --- /dev/null +++ b/docs/images/deposit-workflow-loading.uml @@ -0,0 +1,44 @@ +@startuml + participant DEPOSIT as "deposit API" + participant DEPOSIT_DATABASE as "deposit DB" + participant LOADER_TASK as "loader task" + participant STORAGE as "swh-storage" + participant CELERY as "celery" + participant SCHEDULER 
as "swh-scheduler" + + activate DEPOSIT + activate DEPOSIT_DATABASE + activate STORAGE + activate CELERY + activate SCHEDULER + + SCHEDULER ->> CELERY: new "load-deposit"\ntask available + CELERY ->> LOADER_TASK: start task + activate LOADER_TASK + + LOADER_TASK ->> DEPOSIT: GET /{collection}/{deposit_id}/raw/ + + DEPOSIT ->> DEPOSIT_DATABASE: get deposit requests + DEPOSIT_DATABASE ->> DEPOSIT: deposit requests + + loop for each request + DEPOSIT ->> DEPOSIT_DATABASE: get archive + DEPOSIT_DATABASE ->> DEPOSIT: archive content + DEPOSIT ->> DEPOSIT: aggregate + end + + DEPOSIT ->> LOADER_TASK: tarball + + LOADER_TASK ->> LOADER_TASK: unpack on disk + + loop + LOADER_TASK ->> LOADER_TASK: load objects + LOADER_TASK ->> STORAGE: store objects + end + + LOADER_TASK -> DEPOSIT: PUT /{collection}/{deposit_id}/status + DEPOSIT ->> DEPOSIT_DATABASE: mark deposit as "done" + LOADER_TASK ->> CELERY: done + deactivate LOADER_TASK + CELERY ->> SCHEDULER: done +@enduml diff --git a/docs/images/deposit-workflow-reception.uml b/docs/images/deposit-workflow-reception.uml new file mode 100644 --- /dev/null +++ b/docs/images/deposit-workflow-reception.uml @@ -0,0 +1,37 @@ +@startuml + participant CLIENT as "SWORD client" + participant DEPOSIT as "deposit API" + participant DEPOSIT_DATABASE as "deposit DB" + participant STORAGE as "swh-storage" + participant SCHEDULER as "swh-scheduler" + + activate CLIENT + activate DEPOSIT + activate DEPOSIT_DATABASE + activate STORAGE + activate SCHEDULER + + CLIENT ->> DEPOSIT: Atom and/or archive + DEPOSIT ->> DEPOSIT_DATABASE: create new deposit + DEPOSIT_DATABASE -->> DEPOSIT: return deposit_id + DEPOSIT ->> DEPOSIT_DATABASE: record deposit request + + loop while the previous request has "In-Progress: true" + DEPOSIT ->> CLIENT: deposit receipt\n("partial") + CLIENT ->> DEPOSIT: Atom and/or archive + DEPOSIT ->> DEPOSIT_DATABASE: record deposit request + end + + + alt if metadata-only + DEPOSIT ->> STORAGE: target exists? 
+ STORAGE ->> DEPOSIT: true + DEPOSIT ->> STORAGE: insert metadata + DEPOSIT ->> DEPOSIT_DATABASE: mark deposit as "done" + else + DEPOSIT ->> SCHEDULER: schedule checks + DEPOSIT ->> DEPOSIT_DATABASE: mark deposit as "loading" + end + + DEPOSIT ->> CLIENT: deposit receipt\n("done" or "loading") +@enduml diff --git a/docs/internals/index.rst b/docs/internals/index.rst --- a/docs/internals/index.rst +++ b/docs/internals/index.rst @@ -12,3 +12,4 @@ dev-environment prod-environment authentication + loading-workflow diff --git a/docs/internals/loading-workflow.rst b/docs/internals/loading-workflow.rst new file mode 100644 --- /dev/null +++ b/docs/internals/loading-workflow.rst @@ -0,0 +1,91 @@ +Loading workflow +================ + +This section complements the :ref:`deposit-use-case` documentation, +by detailing how deposits are handled internally after clients submit them. + +Reception +--------- + +For every HTTP request sent by a client, the deposit API checks some simple properties, +then creates a :class:`swh.deposit.models.DepositRequest` +object containing the data uploaded by the client verbatim (archive and/or metadata), +and inserts it in the database. +A corresponding :class:`swh.deposit.models.Deposit` object is also created +and inserted, if this is the initial request creating a deposit. + +Upon receiving the last request, identified by the lack of the ``In-Progress: true`` +header, the deposit server either: + +* checks that the targeted object exists in :ref:`swh-storage `, + then sends a request to swh-storage with the Atom metadata and updates the + deposit status to ``done``, + if it is a :ref:`metadata-only deposit ` +* updates the deposit status and schedules a checking task by querying + :ref:`swh-scheduler `, otherwise + +Graphically: + +.. figure:: ../images/deposit-workflow-reception.svg + :alt: + +For metadata-only deposits, this is the end of the story. +The next section describes what happens next for "normal" deposits. 
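The reception logic described above can be sketched in a few lines of Python. This is only an illustration, not the code of the deposit server: the ``Deposit`` dataclass, ``receive_request`` and the ``target_exists`` callback are hypothetical names, the status strings are the ones shown in the reception diagram, and raising an error when the target is missing is an assumption, since the text does not say how that case is handled.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Deposit:
    """Hypothetical stand-in for a deposit record (not the real model)."""

    metadata_only: bool
    status: str = "partial"
    log: List[str] = field(default_factory=list)


def receive_request(
    deposit: Deposit, in_progress: bool, target_exists: Callable[[], bool]
) -> str:
    """Record one client request and return the status sent in the receipt."""
    deposit.log.append("record deposit request")
    if in_progress:
        # The client announced more requests ("In-Progress: true").
        deposit.status = "partial"
        return deposit.status
    if deposit.metadata_only:
        if not target_exists():
            # Assumption: how a missing target is reported is not specified.
            raise ValueError("target object not found in storage")
        deposit.log.append("insert metadata in storage")
        deposit.status = "done"
    else:
        deposit.log.append("schedule checks")
        deposit.status = "loading"
    return deposit.status
```

For example, a two-step archive deposit first yields ``"partial"`` and then ``"loading"``, while a single-request metadata-only deposit whose target exists goes straight to ``"done"``.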
+ +Checking +-------- + +As we saw above, the deposit API server's synchronous work ends after scheduling +a checking task. +This task is implemented by :class:`swh.deposit.loader.checker.DepositChecker`, +which simply makes another call to the deposit API, +implemented in :class:`swh.deposit.api.private.deposit_check.APIChecks`. + +This API performs longer checks, which require inspecting the deposited archive +(or archives, for clients depositing archives in multiple steps). +This is why they are run by an asynchronous task instead of immediately +when the client sends its request. + +When it is done, it sets the deposit's status to "verified" (so clients polling +for the status know this step succeeded) and schedules a loading task. + +Graphically: + +.. figure:: ../images/deposit-workflow-checking.svg + :alt: + +Note that the check task is actually just a thin wrapper around an API call. +While the checks could be done in the task itself, it would mean sending +all archives from the deposit API to the celery worker, which would be inefficient. +And the gains would not be great, as checking tasks only need to decompress archives, +which is not resource intensive. +Instead, this long-running call to the API proved to be a simpler +and more efficient solution at the current scale of the deposit. + +Loading +------- + +When the check task finishes, it schedules a load task, implemented by +:class:`swh.loader.package.deposit.loader.DepositLoader`. + +It is part of the ``swh.loader.package`` package instead of ``swh-deposit``, +because its design is close to other :ref:`package loaders `: + +1. fetch a tarball +2. extract it +3. use :mod:`swh.model.from_disk` to build SWH objects from it +4. load these objects into :ref:`swh-storage ` + +The only difference in this process is fetching the tarball from the deposit server, +instead of external repositories. 
+This tarball is returned by :class:`swh.deposit.api.private.deposit_read`, +which creates it by aggregating all archives sent by the client (usually +only one, but the SWORD protocol allows more). + +Finally, when it is done, the loader updates the deposit status via the deposit API. + +Graphically: + +.. figure:: ../images/deposit-workflow-loading.svg + :alt: +
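The four loading steps listed above can be illustrated with a small, self-contained Python sketch. It does not use the real loader or ``swh.model.from_disk``: ``make_tarball`` and ``load_tarball`` are hypothetical helpers, the tarball is built in memory instead of being fetched from the deposit API, a plain SHA-1 of each file stands in for real SWH object identifiers, and a dictionary stands in for swh-storage.

```python
import hashlib
import io
import tarfile
import tempfile
from pathlib import Path
from typing import Dict


def make_tarball(files: Dict[str, bytes]) -> bytes:
    """Build an in-memory tarball, standing in for the one the deposit API returns."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def load_tarball(tarball: bytes, storage: Dict[str, bytes]) -> str:
    """Unpack the tarball on disk, hash each file, store it, return the final status."""
    with tempfile.TemporaryDirectory() as tmp:
        # Step 2: extract the tarball into a temporary directory.
        with tarfile.open(fileobj=io.BytesIO(tarball), mode="r:*") as tar:
            tar.extractall(tmp)
        # Steps 3 and 4: build one "object" per file and store it.
        for path in sorted(Path(tmp).rglob("*")):
            if path.is_file():
                data = path.read_bytes()
                # Assumption: SHA-1 of the raw bytes is only a placeholder identifier.
                storage[hashlib.sha1(data).hexdigest()] = data
    # The real loader would now report this status to the deposit API.
    return "done"
```

Calling ``load_tarball(make_tarball({"a.txt": b"hello"}), storage)`` with an empty ``storage`` dictionary leaves one entry in it, keyed by the hash of the file content.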