D5495.id.diff
diff --git a/docs/api/use-cases.rst b/docs/api/use-cases.rst
--- a/docs/api/use-cases.rst
+++ b/docs/api/use-cases.rst
@@ -226,6 +226,7 @@
the ``X-Check-SWHID`` HTTP header to be set to the value of the SWHID of the
deposit's content (returned after the deposit finished loading).
+.. _use-case-metadata-only-deposit:
Metadata-only deposit
---------------------
diff --git a/docs/images/deposit-workflow-checking.uml b/docs/images/deposit-workflow-checking.uml
new file mode 100644
--- /dev/null
+++ b/docs/images/deposit-workflow-checking.uml
@@ -0,0 +1,34 @@
+@startuml
+ participant DEPOSIT as "deposit API"
+ participant DEPOSIT_DATABASE as "deposit DB"
+ participant CHECKER_TASK as "checker task"
+ participant CELERY as "celery"
+ participant SCHEDULER as "swh-scheduler"
+
+ activate DEPOSIT
+ activate DEPOSIT_DATABASE
+ activate CELERY
+ activate SCHEDULER
+
+ SCHEDULER ->> CELERY: new "check-deposit"\ntask available
+ CELERY ->> CHECKER_TASK: start task
+ activate CHECKER_TASK
+
+ CHECKER_TASK ->> DEPOSIT: GET /{collection}/{deposit_id}/check/
+
+ DEPOSIT ->> DEPOSIT_DATABASE: get deposit requests
+ DEPOSIT_DATABASE ->> DEPOSIT: deposit requests
+
+ loop for each request
+ DEPOSIT ->> DEPOSIT_DATABASE: get archive
+ DEPOSIT_DATABASE ->> DEPOSIT: archive content
+ DEPOSIT ->> DEPOSIT: check archive in the request
+ end
+
+ DEPOSIT ->> DEPOSIT_DATABASE: mark deposit as "verified"
+ DEPOSIT ->> SCHEDULER: schedule load
+ DEPOSIT ->> CHECKER_TASK: done
+ CHECKER_TASK ->> CELERY: done
+ deactivate CHECKER_TASK
+ CELERY ->> SCHEDULER: done
+@enduml
diff --git a/docs/images/deposit-workflow-loading.uml b/docs/images/deposit-workflow-loading.uml
new file mode 100644
--- /dev/null
+++ b/docs/images/deposit-workflow-loading.uml
@@ -0,0 +1,44 @@
+@startuml
+ participant DEPOSIT as "deposit API"
+ participant DEPOSIT_DATABASE as "deposit DB"
+ participant LOADER_TASK as "loader task"
+ participant STORAGE as "swh-storage"
+ participant CELERY as "celery"
+ participant SCHEDULER as "swh-scheduler"
+
+ activate DEPOSIT
+ activate DEPOSIT_DATABASE
+ activate STORAGE
+ activate CELERY
+ activate SCHEDULER
+
+ SCHEDULER ->> CELERY: new "load-deposit"\ntask available
+ CELERY ->> LOADER_TASK: start task
+ activate LOADER_TASK
+
+ LOADER_TASK ->> DEPOSIT: GET /{collection}/{deposit_id}/raw/
+
+ DEPOSIT ->> DEPOSIT_DATABASE: get deposit requests
+ DEPOSIT_DATABASE ->> DEPOSIT: deposit requests
+
+ loop for each request
+ DEPOSIT ->> DEPOSIT_DATABASE: get archive
+ DEPOSIT_DATABASE ->> DEPOSIT: archive content
+ DEPOSIT ->> DEPOSIT: aggregate
+ end
+
+ DEPOSIT ->> LOADER_TASK: tarball
+
+ LOADER_TASK ->> LOADER_TASK: unpack on disk
+
+ loop
+ LOADER_TASK ->> LOADER_TASK: load objects
+ LOADER_TASK ->> STORAGE: store objects
+ end
+
+ LOADER_TASK -> DEPOSIT: PUT /{collection}/{deposit_id}/status
+ DEPOSIT ->> DEPOSIT_DATABASE: mark deposit as "done"
+ LOADER_TASK ->> CELERY: done
+ deactivate LOADER_TASK
+ CELERY ->> SCHEDULER: done
+@enduml
diff --git a/docs/images/deposit-workflow-reception.uml b/docs/images/deposit-workflow-reception.uml
new file mode 100644
--- /dev/null
+++ b/docs/images/deposit-workflow-reception.uml
@@ -0,0 +1,37 @@
+@startuml
+ participant CLIENT as "SWORD client"
+ participant DEPOSIT as "deposit API"
+ participant DEPOSIT_DATABASE as "deposit DB"
+ participant STORAGE as "swh-storage"
+ participant SCHEDULER as "swh-scheduler"
+
+ activate CLIENT
+ activate DEPOSIT
+ activate DEPOSIT_DATABASE
+ activate STORAGE
+ activate SCHEDULER
+
+ CLIENT ->> DEPOSIT: Atom and/or archive
+ DEPOSIT ->> DEPOSIT_DATABASE: create new deposit
+ DEPOSIT_DATABASE -->> DEPOSIT: return deposit_id
+ DEPOSIT ->> DEPOSIT_DATABASE: record deposit request
+
+ loop while the previous request has "In-Progress: true"
+ DEPOSIT ->> CLIENT: deposit receipt\n("partial")
+ CLIENT ->> DEPOSIT: Atom and/or archive
+ DEPOSIT ->> DEPOSIT_DATABASE: record deposit request
+ end
+
+
+ alt if metadata-only
+ DEPOSIT ->> STORAGE: target exists?
+ STORAGE ->> DEPOSIT: true
+ DEPOSIT ->> STORAGE: insert metadata
+ DEPOSIT ->> DEPOSIT_DATABASE: mark deposit as "done"
+ else
+ DEPOSIT ->> SCHEDULER: schedule checks
+ DEPOSIT ->> DEPOSIT_DATABASE: mark deposit as "loading"
+ end
+
+ DEPOSIT ->> CLIENT: deposit receipt\n("done" or "loading")
+@enduml
diff --git a/docs/internals/index.rst b/docs/internals/index.rst
--- a/docs/internals/index.rst
+++ b/docs/internals/index.rst
@@ -12,3 +12,4 @@
dev-environment
prod-environment
authentication
+ loading-workflow
diff --git a/docs/internals/loading-workflow.rst b/docs/internals/loading-workflow.rst
new file mode 100644
--- /dev/null
+++ b/docs/internals/loading-workflow.rst
@@ -0,0 +1,91 @@
+Loading workflow
+================
+
+This section complements the :ref:`deposit-use-case` documentation
+by detailing how deposits are handled internally after clients have deposited them.
+
+Reception
+---------
+
+For every HTTP request sent by a client, the deposit API checks some simple properties,
+then creates a :class:`swh.deposit.models.DepositRequest`
+object containing the data uploaded by the client verbatim (archive and/or metadata),
+and inserts it in the database.
+A corresponding :class:`swh.deposit.models.Deposit` object is also created
+and inserted, if this is the initial request creating a deposit.
+
+Upon receiving the last request, identified by the lack of the ``In-Progress: true``
+header, the deposit server either:
+
+* checks that the target object exists in :ref:`swh-storage <swh-storage>`,
+ then sends a request to swh-storage with the Atom metadata and updates the
+ deposit status to ``done``,
+ if it is a :ref:`metadata-only deposit <use-case-metadata-only-deposit>`
+* updates the deposit status and schedules a checking task by querying
+ :ref:`swh-scheduler <swh-scheduler>`, otherwise
+
+Graphically:
+
+.. figure:: ../images/deposit-workflow-reception.svg
+ :alt:
+
+For metadata-only deposits, this is the end of the story.
+The following sections describe what happens next for "normal" deposits.
+
+Checking
+--------
+
+As we saw above, the deposit API server's synchronous work ends after scheduling
+a checking task.
+This task is implemented by :class:`swh.deposit.loader.checker.DepositChecker`,
+which simply makes another call to the deposit API,
+implemented in :class:`swh.deposit.api.private.deposit_check.APIChecks`.
+
+This API endpoint performs longer checks, which require inspecting the deposited archive
+(or archives, for clients depositing archives in multiple steps).
+This is why these checks run in an asynchronous task instead of being performed immediately
+when the client sends its request.
+
+When it is done, it sets the deposit's status to "verified" (so clients polling
+for the status know this step succeeded) and schedules a loading task.
+
+Graphically:
+
+.. figure:: ../images/deposit-workflow-checking.svg
+ :alt:
+
+Note that the check task is actually just a thin wrapper around an API call.
+While the checks could be done in the task itself, this would mean sending
+all archives from the deposit API to the celery worker, which would be inefficient.
+The gains would also be small, as checking tasks only need to decompress archives,
+which is not resource-intensive.
+Instead, this long-running call to the API proved to be a simpler
+and more efficient solution at the current scale of the deposit.
+
+Loading
+-------
+
+When the check task finishes, it schedules a load task, implemented by
+:class:`swh.loader.package.deposit.loader.DepositLoader`.
+
+It is part of the ``swh.loader.package`` package instead of ``swh-deposit``,
+because its design is close to other :ref:`package loaders <swh-loader-core>`:
+
+1. fetch a tarball
+2. extract it
+3. use :mod:`swh.model.from_disk` to build SWH objects from it
+4. load these objects in :ref:`swh-storage <swh-storage>`
+
+The only difference in this process is fetching the tarball from the deposit server,
+instead of external repositories.
+This tarball is returned by :mod:`swh.deposit.api.private.deposit_read`,
+which creates it by aggregating all archives sent by the client (usually
+only one, but the SWORD protocol allows more).
+
+Finally, when it is done, the loader updates the deposit status via the deposit API.
+
+Graphically:
+
+.. figure:: ../images/deposit-workflow-loading.svg
+ :alt:
+
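As a reading aid for the workflow documented in ``loading-workflow.rst`` above, the status changes it describes can be sketched as a small state machine. This is only an illustration, not code from swh-deposit: the ``Status`` enum, ``after_reception``, ``advance`` and ``NEXT_STATUS`` are hypothetical names, and only the status strings ("partial", "loading", "verified", "done") come from the diagrams in this diff. Failure cases (for example a rejected archive) are not described in the diff and are left out.

```python
from enum import Enum


class Status(Enum):
    """Deposit statuses named in the diagrams of this diff."""

    PARTIAL = "partial"
    LOADING = "loading"
    VERIFIED = "verified"
    DONE = "done"


def after_reception(in_progress: bool, metadata_only: bool) -> Status:
    """Status returned in the deposit receipt (hypothetical helper).

    Mirrors the reception diagram: requests sent with ``In-Progress: true``
    keep the deposit "partial"; the last request either completes a
    metadata-only deposit or hands a normal deposit over to the checks.
    """
    if in_progress:
        return Status.PARTIAL
    if metadata_only:
        return Status.DONE
    return Status.LOADING


# Hypothetical transition table for "normal" deposits:
# the check step marks the deposit "verified", the load step marks it "done".
NEXT_STATUS = {
    Status.LOADING: Status.VERIFIED,
    Status.VERIFIED: Status.DONE,
}


def advance(status: Status) -> Status:
    """Return the status reached after the next asynchronous step succeeds."""
    try:
        return NEXT_STATUS[status]
    except KeyError:
        raise ValueError(f"no asynchronous step follows {status.value!r}") from None


if __name__ == "__main__":
    status = after_reception(in_progress=False, metadata_only=False)
    while status is not Status.DONE:
        print(status.value)
        status = advance(status)
    print(status.value)
```

Keeping the transitions in a table makes it easy to compare the sketch against the three sequence diagrams: each entry corresponds to one ``mark deposit as ...`` arrow.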
Attached To
D5495: Document the loading workflow.