D5495.id19644.diff

diff --git a/docs/api/use-cases.rst b/docs/api/use-cases.rst
--- a/docs/api/use-cases.rst
+++ b/docs/api/use-cases.rst
@@ -226,6 +226,7 @@
the ``X-Check-SWHID`` HTTP header to be set to the value of the SWHID of the
deposit's content (returned after the deposit finished loading).
+.. _use-case-metadata-only-deposit:
Metadata-only deposit
---------------------
diff --git a/docs/images/deposit-workflow-checking.uml b/docs/images/deposit-workflow-checking.uml
new file mode 100644
--- /dev/null
+++ b/docs/images/deposit-workflow-checking.uml
@@ -0,0 +1,34 @@
+@startuml
+ participant DEPOSIT as "deposit API"
+ participant DEPOSIT_DATABASE as "deposit DB"
+ participant CHECKER_TASK as "checker task"
+ participant CELERY as "celery"
+ participant SCHEDULER as "swh-scheduler"
+
+ activate DEPOSIT
+ activate DEPOSIT_DATABASE
+ activate CELERY
+ activate SCHEDULER
+
+ SCHEDULER ->> CELERY: new "check-deposit"\ntask available
+ CELERY ->> CHECKER_TASK: start task
+ activate CHECKER_TASK
+
+ CHECKER_TASK ->> DEPOSIT: GET /{collection}/{deposit_id}/check/
+
+ DEPOSIT ->> DEPOSIT_DATABASE: get deposit requests
+ DEPOSIT_DATABASE ->> DEPOSIT: deposit requests
+
+ loop for each request
+ DEPOSIT ->> DEPOSIT_DATABASE: get archive
+ DEPOSIT_DATABASE ->> DEPOSIT: archive content
+ DEPOSIT ->> DEPOSIT: check archive in the request
+ end
+
+ DEPOSIT ->> DEPOSIT_DATABASE: mark deposit as "verified"
+ DEPOSIT ->> SCHEDULER: schedule load
+ DEPOSIT ->> CHECKER_TASK: done
+ CHECKER_TASK ->> CELERY: done
+ deactivate CHECKER_TASK
+ CELERY ->> SCHEDULER: done
+@enduml
diff --git a/docs/images/deposit-workflow-loading.uml b/docs/images/deposit-workflow-loading.uml
new file mode 100644
--- /dev/null
+++ b/docs/images/deposit-workflow-loading.uml
@@ -0,0 +1,44 @@
+@startuml
+ participant DEPOSIT as "deposit API"
+ participant DEPOSIT_DATABASE as "deposit DB"
+ participant LOADER_TASK as "loader task"
+ participant STORAGE as "swh-storage"
+ participant CELERY as "celery"
+ participant SCHEDULER as "swh-scheduler"
+
+ activate DEPOSIT
+ activate DEPOSIT_DATABASE
+ activate STORAGE
+ activate CELERY
+ activate SCHEDULER
+
+ SCHEDULER ->> CELERY: new "load-deposit"\ntask available
+ CELERY ->> LOADER_TASK: start task
+ activate LOADER_TASK
+
+ LOADER_TASK ->> DEPOSIT: GET /{collection}/{deposit_id}/raw/
+
+ DEPOSIT ->> DEPOSIT_DATABASE: get deposit requests
+ DEPOSIT_DATABASE ->> DEPOSIT: deposit requests
+
+ loop for each request
+ DEPOSIT ->> DEPOSIT_DATABASE: get archive
+ DEPOSIT_DATABASE ->> DEPOSIT: archive content
+ DEPOSIT ->> DEPOSIT: aggregate
+ end
+
+ DEPOSIT ->> LOADER_TASK: tarball
+
+ LOADER_TASK ->> LOADER_TASK: unpack on disk
+
+ loop
+ LOADER_TASK ->> LOADER_TASK: load objects
+ LOADER_TASK ->> STORAGE: store objects
+ end
+
+ LOADER_TASK -> DEPOSIT: PUT /{collection}/{deposit_id}/status
+ DEPOSIT ->> DEPOSIT_DATABASE: mark deposit as "done"
+ LOADER_TASK ->> CELERY: done
+ deactivate LOADER_TASK
+ CELERY ->> SCHEDULER: done
+@enduml
diff --git a/docs/images/deposit-workflow-reception.uml b/docs/images/deposit-workflow-reception.uml
new file mode 100644
--- /dev/null
+++ b/docs/images/deposit-workflow-reception.uml
@@ -0,0 +1,37 @@
+@startuml
+ participant CLIENT as "SWORD client"
+ participant DEPOSIT as "deposit API"
+ participant DEPOSIT_DATABASE as "deposit DB"
+ participant STORAGE as "swh-storage"
+ participant SCHEDULER as "swh-scheduler"
+
+ activate CLIENT
+ activate DEPOSIT
+ activate DEPOSIT_DATABASE
+ activate STORAGE
+ activate SCHEDULER
+
+ CLIENT ->> DEPOSIT: Atom and/or archive
+ DEPOSIT ->> DEPOSIT_DATABASE: create new deposit
+ DEPOSIT_DATABASE -->> DEPOSIT: return deposit_id
+ DEPOSIT ->> DEPOSIT_DATABASE: record deposit request
+
+ loop while the previous request has "In-Progress: true"
+ DEPOSIT ->> CLIENT: deposit receipt\n("partial")
+ CLIENT ->> DEPOSIT: Atom and/or archive
+ DEPOSIT ->> DEPOSIT_DATABASE: record deposit request
+ end
+
+
+ alt if metadata-only
+ DEPOSIT ->> STORAGE: target exists?
+ STORAGE ->> DEPOSIT: true
+ DEPOSIT ->> STORAGE: insert metadata
+ DEPOSIT ->> DEPOSIT_DATABASE: mark deposit as "done"
+ else
+ DEPOSIT ->> SCHEDULER: schedule checks
+ DEPOSIT ->> DEPOSIT_DATABASE: mark deposit as "loading"
+ end
+
+ DEPOSIT ->> CLIENT: deposit receipt\n("done" or "loading")
+@enduml
diff --git a/docs/internals/index.rst b/docs/internals/index.rst
--- a/docs/internals/index.rst
+++ b/docs/internals/index.rst
@@ -12,3 +12,4 @@
dev-environment
prod-environment
authentication
+ loading-workflow
diff --git a/docs/internals/loading-workflow.rst b/docs/internals/loading-workflow.rst
new file mode 100644
--- /dev/null
+++ b/docs/internals/loading-workflow.rst
@@ -0,0 +1,91 @@
+Loading workflow
+================
+
+This section complements the :ref:`deposit-use-case` documentation,
+by detailing how deposits are handled internally after clients have submitted them.
+
+Reception
+---------
+
+For every HTTP request sent by a client, the deposit API checks some simple properties,
+then creates a :class:`swh.deposit.models.DepositRequest`
+object containing the data uploaded by the client verbatim (archive and/or metadata),
+and inserts it in the database.
+A corresponding :class:`swh.deposit.models.Deposit` object is also created
+and inserted, if this is the initial request creating a deposit.
+
+Upon receiving the last request, identified by the lack of the ``In-Progress: true``
+header, the deposit server either:
+
+* checks that the target object exists in :ref:`swh-storage <swh-storage>`,
+ then sends a request to swh-storage with the Atom metadata and updates the
+ deposit status to ``done``,
+ if it is a :ref:`metadata-only deposit <use-case-metadata-only-deposit>`
+* updates the deposit status and schedules a checking task by querying
+ :ref:`swh-scheduler <swh-scheduler>`, otherwise
+
+Graphically:
+
+.. figure:: ../images/deposit-workflow-reception.svg
+ :alt:
+
+For metadata-only deposits, processing ends here.
+The next sections describe what happens afterwards for "normal" deposits.
+
+Checking
+--------
+
+As we saw above, the deposit API server's synchronous work ends after scheduling
+a checking task.
+This task is implemented by :class:`swh.deposit.loader.checker.DepositChecker`,
+which is simply another call to the deposit API,
+implemented in :class:`swh.deposit.api.private.deposit_check.APIChecks`.
+
+This API performs longer checks, which require inspecting the deposited archive
+(or archives, for clients depositing archives in multiple steps).
+This is why it is run by an asynchronous task instead of being performed immediately
+when the client sends a request.
+
+When it is done, it sets the deposit's status to "verified" (so clients polling
+for the status know this step succeeded) and schedules a loading task.
+
+Graphically:
+
+.. figure:: ../images/deposit-workflow-checking.svg
+ :alt:
+
+Note that the check task is actually just a thin wrapper around an API call.
+While the checks could be done in the task itself, it would mean sending
+all archives from the deposit API to the celery worker, which would be inefficient.
+And the gains would not be great, as checking tasks only need to decompress archives,
+which is not resource intensive.
+Instead, this long-running call to the API proved to be a simpler
+and more efficient solution at the current scale of the deposit.
+
+Loading
+-------
+
+When the check task finishes, it schedules a load task, implemented by
+:class:`swh.loader.package.deposit.loader.DepositLoader`.
+
+It is part of the ``swh.loader.package`` package instead of ``swh-deposit``,
+because its design is close to other :ref:`package loaders <swh-loader-core>`:
+
+1. fetch a tarball
+2. extract it
+3. use :mod:`swh.model.from_disk` to build SWH objects from it
+4. load these objects in :ref:`swh-storage <swh-storage>`
+
+The only difference in this process is fetching the tarball from the deposit server,
+instead of external repositories.
+This tarball is returned by :mod:`swh.deposit.api.private.deposit_read`,
+which creates it by aggregating all archives sent by the client (usually
+only one, but the SWORD protocol allows more).
+
+Finally, when it is done, the loader updates the deposit status via the deposit API.
+
+Graphically:
+
+.. figure:: ../images/deposit-workflow-loading.svg
+ :alt:
+
