diff --git a/docs/api/use-cases.rst b/docs/api/use-cases.rst index bfc57f25..170923be 100644 --- a/docs/api/use-cases.rst +++ b/docs/api/use-cases.rst @@ -1,246 +1,247 @@ .. _deposit-use-cases: Use cases ========= The general idea is that a deposit can be created either in a single request or by multiple requests to allow the user to add elements to the deposit piece by piece (be it the deposited data or the metadata describing it). An update request that does not have the `In-Progress: true` HTTP header will de facto declare the deposit as *completed* (aka in the `deposited` status; see below) and thus ready for ingestion. Once the deposit is declared *complete* by the user, the server performs a few validation checks. Then, if valid, it schedules the ingestion of the deposited data in the Software Heritage Archive (SWH). There is a `status` property attached to a deposit, which allows following the processing workflow of the deposit. For example, when this ingestion task completes successfully, the deposit is marked as `done`. Possible deposit statuses are: partial The deposit is partially received, since it can be done in multiple requests. expired Deposit was there too long and is now deemed ready to be garbage-collected. deposited Deposit is complete, ready to be checked. rejected Deposit failed the checks. verified Deposit passed the checks and is ready for loading. loading Injection is ongoing on SWH's side. done Loading is successful. failed Loading failed. .. figure:: ../images/status.svg :alt: This document describes the possible scenarios for creating or updating a deposit. Deposit creation ---------------- From client's deposit repository server to SWH's repository server: 1. The client requests the server's abilities and its associated :ref:`collections ` using the *SD/service document uri* (:http:get:`/1/servicedocument/`). 2. 
The server answers the client with the service document which lists the *collections* linked to the user account (most of the time, there will be one and only one collection linked to the user's account). Each of these collections can be used to push a deposit via its *COL/collection IRI*. 3. The client sends a deposit (a zip archive, some metadata or both) through the *COL/collection uri*. This can be done in: * one POST request (metadata + archive) without the `In-Progress: true` header: - :http:post:`/1/(str:collection-name)/` * one POST request (metadata or archive) **with** `In-Progress: true` header: - :http:post:`/1/(str:collection-name)/` plus one or more PUT or POST requests *to the update uris* (*edit-media iri* or *edit iri*): - :http:post:`/1/(str:collection-name)/(int:deposit-id)/media/` - :http:put:`/1/(str:collection-name)/(int:deposit-id)/media/` - :http:post:`/1/(str:collection-name)/(int:deposit-id)/metadata/` - :http:put:`/1/(str:collection-name)/(int:deposit-id)/metadata/` Then: a. Server validates the client's input or returns a detailed error if any. b. Server stores information received (metadata or software archive source code or both). 4. The server creates a loading task and submits it to the :ref:`Job Scheduler ` 5. The server notifies the client it acknowledged the client's request. An ``http 201 Created`` response with a deposit receipt in the response body is sent back. That deposit receipt will hold the necessary information to eventually complete the deposit later on if it was incomplete (also known as status ``partial``). Schema representation ^^^^^^^^^^^^^^^^^^^^^ Scenario: pushing a deposit via the SWORDv2_ protocol (nominal scenario): .. figure:: ../images/deposit-create-chart.svg :alt: Deposit update -------------- 6. Client updates existing deposit through the *update uris* (one or more POST or PUT requests to either the *edit-media iri* or *edit iri*). 1. Server validates the client's input or returns a detailed error if any 2. 
Server stores information received (metadata or software archive source code or both) This would be the case for example if the client initially posted a ``partial`` deposit (e.g. only metadata with no archive, or an archive without metadata, or a split archive because the initial one exceeded the size limit imposed by swh repository deposit). The content of a deposit can only be updated while it is in the ``partial`` state; this causes the content to be **replaced** (the old version is discarded). Its metadata, however, can also be updated while in the ``done`` state; see below. Schema representation ^^^^^^^^^^^^^^^^^^^^^ Scenario: updating a deposit via SWORDv2_ protocol: .. figure:: ../images/deposit-update-chart.svg :alt: Deposit deletion (or associated archive, or associated metadata) ---------------------------------------------------------------- 7. Deposit deletion is possible as long as the deposit is still in ``partial`` state. 1. Server validates the client's input or returns a detailed error if any 2. Server actually deletes information according to the request Schema representation ^^^^^^^^^^^^^^^^^^^^^ Scenario: deleting a deposit via SWORDv2_ protocol: .. figure:: ../images/deposit-delete-chart.svg :alt: Client asks for operation status -------------------------------- At any time during the next step, operation status can be read through a GET query to the *state iri*. Deposit loading --------------- In one of the previous steps, when a deposit was created or loaded without ``In-Progress: true``, the deposit server created a load task and submitted it to :ref:`swh-scheduler `. This triggers the following steps: Server: Triggering deposit checks ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Once the status ``deposited`` is reached for a deposit, checks for the associated archive(s) and metadata will be triggered. If those checks fail, the status is changed to ``rejected`` and nothing more happens there. Otherwise, the status is changed to ``verified``. 
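The status changes described so far can be summarised as a small transition table. This is only an illustrative sketch: ``ALLOWED``, ``can_transition`` and ``status_after_checks`` are hypothetical names, not part of swh-deposit, and the table covers only the transitions named in this document.

```python
# Hypothetical sketch, not swh-deposit code. Status names come from the
# list of deposit statuses above; the helper names are illustrative only.
ALLOWED = {
    "partial": {"deposited"},               # final request lacks In-Progress: true
    "deposited": {"verified", "rejected"},  # outcome of the checks
    "verified": {"loading"},                # checks passed, loading can start
    "loading": {"done", "failed"},          # outcome of the loading
}


def can_transition(current: str, new: str) -> bool:
    """Return True if the document describes a move from current to new."""
    return new in ALLOWED.get(current, set())


def status_after_checks(checks_passed: bool) -> str:
    """Map the result of the checks to the next deposit status."""
    return "verified" if checks_passed else "rejected"
```

With this table, a ``rejected`` deposit has no outgoing transition, which matches "nothing more happens there".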
Server: Triggering deposit load ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Once the status ``verified`` is reached for a deposit, loading the deposit with its associated metadata will be triggered. The loading will result in a status update, either ``done`` or ``failed`` (depending on the loading's status). This is described in the :ref:`loading specifications document `. Completing the deposit ---------------------- When this is all done, the loaders notify the deposit server, which sets the deposit status to ``done``. This can then be polled by deposit clients, using the *state iri*. Deposit metadata updates ------------------------ We saw earlier that a deposit can only be updated when in ``partial`` state. There is one exception to this rule: its metadata can be updated while in the ``done`` state, which adds a new version of the metadata in the SWH archive, **in addition to** the old one(s). In this state, ``In-Progress`` is not allowed, so the deposit cannot go back to the ``partial`` state, but only to ``deposited``. As a failsafe, to avoid accidentally updating the wrong deposit, this requires the ``X-Check-SWHID`` HTTP header to be set to the value of the SWHID of the deposit's content (returned after the deposit finished loading). +.. _use-case-metadata-only-deposit: Metadata-only deposit --------------------- Finally, as an extension to the SWORD protocol, swh-deposit allows a special type of deposit: metadata-only deposits. Unlike regular deposits (described above), they do not have a code archive. Instead, they describe an existing :term:`software artifact` present in the archive. This use case is triggered by a ```` tag in the Atom document, see the :ref:`protocol reference ` for details. In the current implementation, these deposits are loaded (or rejected) immediately after a request without ``In-Progress: true`` is made, i.e. they skip the ``loading`` state. This may change in a future version. .. 
_SWORDv2: http://swordapp.github.io/SWORDv2-Profile/SWORDProfile.html diff --git a/docs/images/deposit-workflow-checking.uml b/docs/images/deposit-workflow-checking.uml new file mode 100644 index 00000000..7acd846b --- /dev/null +++ b/docs/images/deposit-workflow-checking.uml @@ -0,0 +1,34 @@ +@startuml + participant DEPOSIT as "deposit API" + participant DEPOSIT_DATABASE as "deposit DB" + participant CHECKER_TASK as "checker task" + participant CELERY as "celery" + participant SCHEDULER as "swh-scheduler" + + activate DEPOSIT + activate DEPOSIT_DATABASE + activate CELERY + activate SCHEDULER + + SCHEDULER ->> CELERY: new "check-deposit"\ntask available + CELERY ->> CHECKER_TASK: start task + activate CHECKER_TASK + + CHECKER_TASK ->> DEPOSIT: GET /{collection}/{deposit_id}/check/ + + DEPOSIT ->> DEPOSIT_DATABASE: get deposit requests + DEPOSIT_DATABASE ->> DEPOSIT: deposit requests + + loop for each request + DEPOSIT ->> DEPOSIT_DATABASE: get archive + DEPOSIT_DATABASE ->> DEPOSIT: archive content + DEPOSIT ->> DEPOSIT: check archive in the request + end + + DEPOSIT ->> DEPOSIT_DATABASE: mark deposit as "verified" + DEPOSIT ->> SCHEDULER: schedule load + DEPOSIT ->> CHECKER_TASK: done + CHECKER_TASK ->> CELERY: done + deactivate CHECKER_TASK + CELERY ->> SCHEDULER: done +@enduml diff --git a/docs/images/deposit-workflow-loading.uml b/docs/images/deposit-workflow-loading.uml new file mode 100644 index 00000000..d5f869d9 --- /dev/null +++ b/docs/images/deposit-workflow-loading.uml @@ -0,0 +1,44 @@ +@startuml + participant DEPOSIT as "deposit API" + participant DEPOSIT_DATABASE as "deposit DB" + participant LOADER_TASK as "loader task" + participant STORAGE as "swh-storage" + participant CELERY as "celery" + participant SCHEDULER as "swh-scheduler" + + activate DEPOSIT + activate DEPOSIT_DATABASE + activate STORAGE + activate CELERY + activate SCHEDULER + + SCHEDULER ->> CELERY: new "load-deposit"\ntask available + CELERY ->> LOADER_TASK: start task + activate 
LOADER_TASK + + LOADER_TASK ->> DEPOSIT: GET /{collection}/{deposit_id}/raw/ + + DEPOSIT ->> DEPOSIT_DATABASE: get deposit requests + DEPOSIT_DATABASE ->> DEPOSIT: deposit requests + + loop for each request + DEPOSIT ->> DEPOSIT_DATABASE: get archive + DEPOSIT_DATABASE ->> DEPOSIT: archive content + DEPOSIT ->> DEPOSIT: aggregate + end + + DEPOSIT ->> LOADER_TASK: tarball + + LOADER_TASK ->> LOADER_TASK: unpack on disk + + loop + LOADER_TASK ->> LOADER_TASK: load objects + LOADER_TASK ->> STORAGE: store objects + end + + LOADER_TASK -> DEPOSIT: PUT /{collection}/{deposit_id}/status + DEPOSIT ->> DEPOSIT_DATABASE: mark deposit as "done" + LOADER_TASK ->> CELERY: done + deactivate LOADER_TASK + CELERY ->> SCHEDULER: done +@enduml diff --git a/docs/images/deposit-workflow-reception.uml b/docs/images/deposit-workflow-reception.uml new file mode 100644 index 00000000..c38a9a9e --- /dev/null +++ b/docs/images/deposit-workflow-reception.uml @@ -0,0 +1,37 @@ +@startuml + participant CLIENT as "SWORD client" + participant DEPOSIT as "deposit API" + participant DEPOSIT_DATABASE as "deposit DB" + participant STORAGE as "swh-storage" + participant SCHEDULER as "swh-scheduler" + + activate CLIENT + activate DEPOSIT + activate DEPOSIT_DATABASE + activate STORAGE + activate SCHEDULER + + CLIENT ->> DEPOSIT: Atom and/or archive + DEPOSIT ->> DEPOSIT_DATABASE: create new deposit + DEPOSIT_DATABASE -->> DEPOSIT: return deposit_id + DEPOSIT ->> DEPOSIT_DATABASE: record deposit request + + loop while the previous request has "In-Progress: true" + DEPOSIT ->> CLIENT: deposit receipt\n("partial") + CLIENT ->> DEPOSIT: Atom and/or archive + DEPOSIT ->> DEPOSIT_DATABASE: record deposit request + end + + + alt if metadata-only + DEPOSIT ->> STORAGE: target exists? 
+ STORAGE ->> DEPOSIT: true + DEPOSIT ->> STORAGE: insert metadata + DEPOSIT ->> DEPOSIT_DATABASE: mark deposit as "done" + else + DEPOSIT ->> SCHEDULER: schedule checks + DEPOSIT ->> DEPOSIT_DATABASE: mark deposit as "loading" + end + + DEPOSIT ->> CLIENT: deposit receipt\n("done" or "loading") +@enduml diff --git a/docs/internals/index.rst b/docs/internals/index.rst index 5b0affce..a3350fd9 100644 --- a/docs/internals/index.rst +++ b/docs/internals/index.rst @@ -1,14 +1,15 @@ .. _swh-deposit-internals: Deposit internals ================= This chapter describes how swh-deposit works internally, and how to run it (either in production or locally for development). .. toctree:: :maxdepth: 1 dev-environment prod-environment authentication + loading-workflow diff --git a/docs/internals/loading-workflow.rst b/docs/internals/loading-workflow.rst new file mode 100644 index 00000000..4909ce16 --- /dev/null +++ b/docs/internals/loading-workflow.rst @@ -0,0 +1,91 @@ +Loading workflow +================ + +This section complements the :ref:`deposit-use-cases` documentation, +by detailing how deposits are handled internally after clients deposited them. + +Reception +--------- + +For every HTTP request sent by a client, the deposit API checks some simple properties, +then creates a :class:`swh.deposit.models.DepositRequest` +object containing the data uploaded by the client verbatim (archive and/or metadata), +and inserts it in the database. +A corresponding :class:`swh.deposit.models.Deposit` object is also created +and inserted, if this is the initial request creating a deposit.
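The bookkeeping described in this paragraph can be pictured with an in-memory stand-in. This is a hedged sketch only: ``DepositSketch``, ``DepositRequestSketch`` and ``InMemoryStore`` are hypothetical names that mimic the roles of the real models, not their fields or their database layer.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class DepositRequestSketch:
    """Hypothetical stand-in for one recorded client request."""
    deposit_id: int
    payload: bytes      # archive and/or metadata, kept verbatim
    in_progress: bool   # True if the request carried In-Progress: true


@dataclass
class DepositSketch:
    """Hypothetical stand-in for a deposit and its recorded requests."""
    id: int
    status: str = "partial"
    requests: List[DepositRequestSketch] = field(default_factory=list)


class InMemoryStore:
    """Plays the role of the deposit database in this sketch."""

    def __init__(self) -> None:
        self.deposits: Dict[int, DepositSketch] = {}

    def receive(
        self, payload: bytes, in_progress: bool, deposit_id: Optional[int] = None
    ) -> DepositSketch:
        # Only the initial request creates the deposit itself.
        if deposit_id is None:
            deposit_id = len(self.deposits) + 1
            self.deposits[deposit_id] = DepositSketch(id=deposit_id)
        deposit = self.deposits[deposit_id]
        # Every request is recorded, whatever it contains.
        deposit.requests.append(
            DepositRequestSketch(deposit_id, payload, in_progress)
        )
        # The last request is the one without In-Progress: true.
        if not in_progress:
            deposit.status = "deposited"
        return deposit
```

A two-step deposit would call ``receive`` once with ``in_progress=True`` and once more with the returned ``id`` and ``in_progress=False``.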
+ +Upon receiving the last request, identified by the lack of the ``In-Progress: true`` +header, the deposit server either: + +* checks that the targeted object exists in :ref:`swh-storage `, + then sends a request to swh-storage with the Atom metadata and updates the + deposit status to ``done``, + if it is a :ref:`metadata-only deposit ` +* updates the deposit status and schedules a checking task by querying + :ref:`swh-scheduler `, otherwise + +Graphically: + +.. figure:: ../images/deposit-workflow-reception.svg + :alt: + +For metadata-only deposits, this is the end of the story. +The next section narrates what happens next for "normal" deposits. + +Checking +-------- + +As we saw above, the deposit API server's synchronous work ends after sending +a checking task. +This task is implemented by :class:`swh.deposit.loader.checker.DepositChecker`, +which is simply another call to the deposit API, +implemented in :class:`swh.deposit.api.private.deposit_check.APIChecks`. + +This API performs longer checks, which require inspecting the deposited archive +(or archives, for clients depositing archives in multiple steps). +This is why it is run by an asynchronous task instead of being checked immediately +when the client sent a query. + +When it is done, it sets the deposit's status to "verified" (so clients polling +for the status know this step succeeded) and schedules a loading task. + +Graphically: + +.. figure:: ../images/deposit-workflow-checking.svg + :alt: + +Note that the check task is actually just a thin wrapper around an API call. +While the checks could be done in the task itself, it would mean sending +all archives from the deposit API to the celery worker, which would be inefficient. +And the gains would not be great, as checking tasks only need to decompress archives, +which is not resource intensive. +Instead, this long-running call to the API proved to be a simpler +and more efficient solution at the current scale of the deposit.
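The thin-wrapper design discussed above can be sketched as follows. This is not the ``DepositChecker`` implementation: ``check_url`` and ``run_check_task`` are hypothetical helpers, the HTTP client is injected as a plain callable, and the ``"status"`` key in the response is an assumption made for illustration. Only the path shape comes from the sequence diagram.

```python
from typing import Callable, Mapping


def check_url(base_url: str, collection: str, deposit_id: int) -> str:
    """Build the check endpoint path shown in the sequence diagram."""
    return f"{base_url.rstrip('/')}/{collection}/{deposit_id}/check/"


def run_check_task(
    http_get: Callable[[str], Mapping[str, str]],
    base_url: str,
    collection: str,
    deposit_id: int,
) -> str:
    """Delegate all checking to the API and report the resulting status."""
    # The task never touches the archives itself: they stay on the API side.
    response = http_get(check_url(base_url, collection, deposit_id))
    # Assumption: the response exposes the new status under a "status" key.
    return response["status"]
```

Because the task only issues one request, the archives never travel to the worker, which is the trade-off the paragraph above describes.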
+ +Loading +------- + +When the check task finished, it scheduled a load task, implemented by +:class:`swh.loader.package.deposit.loader.DepositLoader`. + +It is part of the ``swh.loader.package`` package instead of ``swh-deposit``, +because its design is close to other :ref:`package loaders `: + +1. fetch a tarball +2. extract it +3. use :mod:`swh.model.from_disk` to build SWH objects from it +4. load these objects in :ref:`swh-storage ` + +The only difference in this process is fetching the tarball from the deposit server, +instead of external repositories. +This tarball is returned by :class:`swh.deposit.api.private.deposit_read`, +which creates it by aggregating all archives sent by the client (usually +only one, but the SWORD protocol allows more). + +Finally, when it is done, the loader updates the deposit status via the deposit API. + +Graphically: + +.. figure:: ../images/deposit-workflow-loading.svg + :alt: +
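The first steps of the loading process can be sketched with the standard library alone. This is a rough illustration under stated assumptions, not ``DepositLoader``: ``unpack_and_hash`` and ``make_tarball`` are hypothetical helpers, the tarball is produced in memory instead of being fetched from the deposit server, and a SHA-1 of each file stands in for the SWH objects that :mod:`swh.model.from_disk` would build.

```python
import hashlib
import io
import tarfile
import tempfile
from pathlib import Path
from typing import Callable, Dict


def unpack_and_hash(fetch_tarball: Callable[[], bytes]) -> Dict[str, str]:
    """Fetch a tarball, extract it on disk, and hash every regular file."""
    data = fetch_tarball()  # step 1: fetch (injected, no network here)
    with tempfile.TemporaryDirectory() as tmp:
        with tarfile.open(fileobj=io.BytesIO(data)) as tar:  # step 2: extract
            if hasattr(tarfile, "data_filter"):
                # Safer extraction where the interpreter supports filters.
                tar.extractall(tmp, filter="data")
            else:
                tar.extractall(tmp)
        root = Path(tmp)
        # Stand-in for step 3: one digest per file instead of SWH objects.
        return {
            path.relative_to(root).as_posix(): hashlib.sha1(
                path.read_bytes()
            ).hexdigest()
            for path in sorted(root.rglob("*"))
            if path.is_file()
        }


def make_tarball(files: Dict[str, bytes]) -> bytes:
    """Build a small gzip-compressed tarball in memory, for demonstration."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as tar:
        for name, content in files.items():
            info = tarfile.TarInfo(name)
            info.size = len(content)
            tar.addfile(info, io.BytesIO(content))
    return buffer.getvalue()
```

Step 4, storing the resulting objects, is left out because it depends on swh-storage.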