diff --git a/docs/README.rst b/docs/README.rst index 602e7f61..d934e7c0 100644 --- a/docs/README.rst +++ b/docs/README.rst @@ -1,71 +1,71 @@ Software Heritage - Deposit =========================== Simple Web-Service Offering Repository Deposit (S.W.O.R.D) is an interoperability standard for digital file deposit. This repository contains both a `SWORD v2`_ server implementation and a deposit command-line client. This implementation allows interaction between a client (a repository) and a server (SWH repository) to deposit software source code archives and associated metadata. Description ----------- Most of the software source code artifacts present in the SWH Archive are gathered by means of :term:`loader ` workers run by the SWH project from source code origins identified by :term:`lister ` workers. This is a pull mechanism: it's the responsibility of the SWH project to gather and collect source code artifacts that way. Alternatively, SWH allows its partners to push source code artifacts and metadata directly into the Archive with a push-based mechanism. By using this possibility, different actors holding software artifacts or metadata can preserve their assets without having to pass through an intermediate collaborative development platform that is already harvested by SWH (e.g. GitHub, GitLab, etc.). This mechanism is the `deposit`. The main idea is that the deposit is an authenticated access to an API allowing the user to provide source code artifacts -- with metadata -- to be ingested in the SWH Archive. The result of that is a :ref:`SWHID ` that can be used to uniquely and persistently identify that very piece of source code. This unique identifier can then be used to `reference the source code `_ (e.g. in a `scientific paper `_) and retrieve it using the :ref:`vault ` feature of the SWH Archive platform. 
The differences between a piece of code uploaded using the deposit and one archived by simply asking SWH to archive a repository using the `save code now `_ feature are: - a deposited artifact is provided by one of the SWH partners, which is regarded as a trusted authority, - a deposited artifact requires metadata properties describing the source code artifact, - a deposited artifact has a codemeta_ metadata entry attached to it, - a deposited artifact has the same visibility on the SWH Archive as a collected repository, - a deposited artifact can be searched with its provided url property on the SWH Archive, - the deposit API uses the `SWORD v2`_ API, thus requires some tooling to send deposits to SWH. These tools are provided with this repository. See the :ref:`deposit-user-manual` page for more details on how to use the deposit client command line tools to push a deposit in the SWH Archive. See the :ref:`deposit-api-specifications` reference pages of the SWORDv2 API implementation in `swh.deposit` if you want to upload deposits using HTTP requests. -Read the :ref:`metadata` chapter to get more details on what metadata are supported when -doing a deposit. +Read the :ref:`deposit-metadata` chapter to get more details on what metadata +are supported when doing a deposit. See :ref:`swh-deposit-dev-env` if you want to hack the code of the `swh.deposit` module. See :ref:`swh-deposit-prod-env` if you want to deploy your own copy of the `swh.deposit` stack. .. _codemeta: https://codemeta.github.io/ .. _`SWORD v2`: http://swordapp.org/sword-v2/ diff --git a/docs/api/api-documentation.rst b/docs/api/api-documentation.rst index 94049986..5c4f58da 100644 --- a/docs/api/api-documentation.rst +++ b/docs/api/api-documentation.rst @@ -1,115 +1,111 @@ .. _deposit-api-specifications: API Documentation ================= This is `Software Heritage `__'s `SWORD 2.0 `__ Server implementation. 
**S.W.O.R.D** (**S**\ imple **W**\ eb-Service **O**\ ffering **R**\ epository **D**\ eposit) is an interoperability standard for digital file deposit. This implementation will permit interaction between a client (a repository) and a server (SWH repository) to push deposits of software source code archives with associated metadata. *Note:* * In the following document, we will use the ``archive`` or ``software source code archive`` interchangeably. * The supported archive formats are: * zip: common zip archive (no multi-disk zip files). * tar: tar archive without compression or optionally any of the following compression algorithm gzip (.tar.gz, .tgz), bzip2 (.tar.bz2) , or lzma (.tar.lzma) .. _swh-deposit-collection: Collection ---------- SWORD defines a ``collection`` concept. In SWH's case, this collection refers to a group of deposits. A ``deposit`` is some form of software source code archive(s) associated with metadata. By default the client's collection will have the client's name. Limitations ----------- * upload limitation of 100Mib * no mediation API overview ------------ API access is over HTTPS. The API is protected through basic authentication. Endpoints --------- The API endpoints are rooted at https://deposit.softwareheritage.org/1/. Data is sent and received as XML (as specified in the SWORD 2.0 specification). -.. include:: ../endpoints/service-document.rst - -.. include:: ../endpoints/collection.rst - -.. include:: ../endpoints/update-media.rst - -.. include:: ../endpoints/update-metadata.rst - -.. include:: ../endpoints/status.rst - -.. include:: ../endpoints/content.rst +.. 
toctree:: + ../endpoints/service-document.rst + ../endpoints/collection.rst + ../endpoints/update-media.rst + ../endpoints/update-metadata.rst + ../endpoints/status.rst + ../endpoints/content.rst Possible errors: ---------------- * common errors: * :http:statuscode:`401`:if a client does not provide credential or provide wrong ones * :http:statuscode:`403` a client tries access to a collection it does not own * :http:statuscode:`404` if a client tries access to an unknown collection * :http:statuscode:`404` if a client tries access to an unknown deposit * :http:statuscode:`415` if a wrong media type is provided to the endpoint * archive/binary deposit: * :http:statuscode:`403` the length of the archive exceeds the max size configured * :http:statuscode:`412` the length or hash provided mismatch the reality of the archive. * :http:statuscode:`415` if a wrong media type is provided * multipart deposit: * :http:statuscode:`412` the md5 hash provided mismatch the reality of the archive * :http:statuscode:`415` if a wrong media type is provided * Atom entry deposit: * :http:statuscode:`400` if the request's body is empty (for creation only) Sources ------- * `SWORD v2 specification `__ * `arxiv documentation `__ * `Dataverse example `__ * `SWORD used on HAL `__ * `xml examples for CCSD `__ diff --git a/docs/api/use-cases.rst b/docs/api/use-cases.rst index 170923be..c71f6c64 100644 --- a/docs/api/use-cases.rst +++ b/docs/api/use-cases.rst @@ -1,247 +1,247 @@ .. _deposit-use-cases: Use cases ========= The general idea is that a deposit can be created either in a single request or by multiple requests to allow the user to add elements to the deposit piece by piece (be it the deposited data or the metadata describing it). An update request that does not have the `In-Progress: true` HTTP header will de facto declare the deposit as *completed* (aka in the `deposited` status; see below) and thus ready for ingestion. 
Once the deposit is declared *complete* by the user, the server performs a few validation checks. Then, if the deposit is valid, it schedules the ingestion of the deposited data in the Software Heritage Archive (SWH). There is a `status` property attached to a deposit, which makes it possible to follow the processing workflow of the deposit. For example, when this ingestion task completes successfully, the deposit is marked as `done`. Possible deposit statuses are: partial The deposit is partially received, since it can be done in multiple requests. expired Deposit was there too long and is now deemed ready to be garbage-collected. deposited Deposit is complete, ready to be checked. rejected Deposit failed the checks. verified Deposit passed the checks and is ready for loading. loading Injection is ongoing on SWH's side. done Loading is successful. failed Loading failed. .. figure:: ../images/status.svg :alt: This document describes the possible scenarios for creating or updating a deposit. Deposit creation ---------------- From client's deposit repository server to SWH's repository server: 1. The client requests the server's abilities and its associated :ref:`collections ` using the *SD/service document uri* (:http:get:`/1/servicedocument/`). 2. The server answers the client with the service document which lists the *collections* linked to the user account (most of the time, there will be one and only one collection linked to the user's account). Each of these collections can be used to push a deposit via its *COL/collection IRI*. 3. The client sends a deposit (a zip archive, some metadata or both) through the *COL/collection uri*. 
This can be done in: * one POST request (metadata + archive) without the `In-Progress: true` header: - :http:post:`/1/(str:collection-name)/` * one POST request (metadata or archive) **with** `In-Progress: true` header: - :http:post:`/1/(str:collection-name)/` plus one or more PUT or POST requests *to the update uris* (*edit-media iri* or *edit iri*): - :http:post:`/1/(str:collection-name)/(int:deposit-id)/media/` - :http:put:`/1/(str:collection-name)/(int:deposit-id)/media/` - :http:post:`/1/(str:collection-name)/(int:deposit-id)/metadata/` - :http:put:`/1/(str:collection-name)/(int:deposit-id)/metadata/` Then: a. Server validates the client's input or returns detailed error if any. b. Server stores information received (metadata or software archive source code or both). 4. The server creates a loading task and submits it to the :ref:`Job Scheduler ` 5. The server notifies the client it acknowledged the client's request. An ``http 201 Created`` response with a deposit receipt in the body response is sent back. That deposit receipt will hold the necessary information to eventually complete the deposit later on if it was incomplete (also known as status ``partial``). Schema representation ^^^^^^^^^^^^^^^^^^^^^ Scenario: pushing a deposit via the SWORDv2_ protocol (nominal scenario): .. figure:: ../images/deposit-create-chart.svg :alt: Deposit update -------------- 6. Client updates existing deposit through the *update uris* (one or more POST or PUT requests to either the *edit-media iri* or *edit iri*). 1. Server validates the client's input or returns detailed error if any 2. Server stores information received (metadata or software archive source code or both) This would be the case for example if the client initially posted a ``partial`` deposit (e.g. only metadata with no archive, or an archive without metadata, or a split archive because the initial one exceeded the limit size imposed by swh repository deposit). 
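As a rough illustration of the creation and update requests described above, the headers of such a request could be assembled as follows. This is only a sketch: ``collection_url`` and ``deposit_headers`` are hypothetical helper names, not part of ``swh.deposit``, and a real request may need further headers required by the SWORD endpoints that are not listed in this section. The API root, the basic authentication scheme and the ``In-Progress: true`` header are the ones named in this documentation.

```python
import base64

# Root of the API endpoints, as given in the API documentation.
API_ROOT = "https://deposit.softwareheritage.org/1/"


def collection_url(collection_name):
    """Return the collection IRI, i.e. POST /1/(str:collection-name)/."""
    return f"{API_ROOT}{collection_name}/"


def deposit_headers(username, password, content_type, in_progress):
    """Build the HTTP headers of a deposit request (hypothetical helper)."""
    # Basic authentication: base64 of "username:password".
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    headers = {
        "Authorization": f"Basic {token}",
        "Content-Type": content_type,
    }
    if in_progress:
        # With this header the deposit stays in the ``partial`` status;
        # without it the deposit is considered complete (``deposited``).
        headers["In-Progress"] = "true"
    return headers
```

Leaving the header out entirely, rather than sending a false value, mirrors the rule stated earlier: a request without ``In-Progress: true`` declares the deposit complete.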
The content of a deposit can only be updated while it is in the ``partial`` state; this causes the content to be **replaced** (the old version is discarded). Its metadata, however, can also be updated while in the ``done`` state; see below. Schema representation ^^^^^^^^^^^^^^^^^^^^^ Scenario: updating a deposit via SWORDv2_ protocol: .. figure:: ../images/deposit-update-chart.svg :alt: Deposit deletion (or associated archive, or associated metadata) ---------------------------------------------------------------- 7. Deposit deletion is possible as long as the deposit is still in ``partial`` state. 1. Server validates the client's input or returns detailed error if any 2. Server actually deletes the information according to the request Schema representation ^^^^^^^^^^^^^^^^^^^^^ Scenario: deleting a deposit via SWORDv2_ protocol: .. figure:: ../images/deposit-delete-chart.svg :alt: Client asks for operation status -------------------------------- At any time during the next steps, operation status can be read through a GET query to the *state iri*. Deposit loading --------------- In one of the previous steps, when a deposit was created or updated without ``In-Progress: true``, the deposit server created a load task and submitted it to :ref:`swh-scheduler `. This triggers the following steps: Server: Triggering deposit checks ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Once the status ``deposited`` is reached for a deposit, checks for the associated archive(s) and metadata will be triggered. If those checks fail, the status is changed to ``rejected`` and nothing more happens there. Otherwise, the status is changed to ``verified``. Server: Triggering deposit load ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Once the status ``verified`` is reached for a deposit, loading the deposit with its associated metadata will be triggered. The loading will result in a status update, either ``done`` or ``failed`` (depending on the loading's status). This is described in the :ref:`loading specifications document `. 
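The status workflow described so far can be summarised as a small transition table. The sketch below is one reading of the status descriptions in this document, not the server's actual implementation; the names ``TRANSITIONS``, ``can_transition`` and ``is_final`` are hypothetical, and the ``partial`` to ``expired`` edge is an assumption based on the description of the ``expired`` status.

```python
# Allowed status changes, as read from the status descriptions in this
# document (an illustration, not the deposit server's code).
TRANSITIONS = {
    # More partial requests keep the deposit partial; the final request
    # (without In-Progress: true) moves it to deposited. The expired edge
    # is an assumption.
    "partial": {"partial", "deposited", "expired"},
    "deposited": {"verified", "rejected"},  # outcome of the checks
    "verified": {"loading"},
    "loading": {"done", "failed"},  # outcome of the loading
    # Metadata update on a loaded deposit, see "Deposit metadata updates".
    "done": {"deposited"},
    "expired": set(),
    "rejected": set(),
    "failed": set(),
}


def can_transition(current, new):
    """Return True if the table allows going from `current` to `new`."""
    return new in TRANSITIONS.get(current, set())


def is_final(status):
    """Return True for statuses on which a polling client can stop."""
    return status in {"done", "failed", "rejected", "expired"}
```

A client polling the *state iri* could use a check like ``is_final`` to decide when to stop polling.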
Completing the deposit ---------------------- When this is all done, the loaders notify the deposit server, which sets the deposit status to ``done``. This can then be polled by deposit clients, using the *state iri*. Deposit metadata updates ------------------------ We saw earlier that a deposit can only be updated when in ``partial`` state. There is one exception to this rule: its metadata can be updated while in the ``done`` state, which adds a new version of the metadata in the SWH archive, **in addition to** the old one(s). In this state, ``In-Progress`` is not allowed, so the deposit cannot go back to the ``partial`` state, but only to ``deposited``. As a failsafe, to avoid accidentally updating the wrong deposit, this requires the ``X-Check-SWHID`` HTTP header to be set to the value of the SWHID of the deposit's content (returned after the deposit finished loading). .. _use-case-metadata-only-deposit: Metadata-only deposit --------------------- Finally, as an extension to the SWORD protocol, swh-deposit allows a special type of deposit: metadata-only deposits. Unlike regular deposits (described above), they do not have a code archive. Instead, they describe an existing :term:`software artifact` present in the archive. This use case is triggered by a ```` tag in the Atom document, -see the :ref:`protocol reference ` for details. +see the :ref:`protocol reference ` for details. In the current implementation, these deposits are loaded (or rejected) immediately after a request without ``In-Progress: true`` is made, i.e. they skip the ``loading`` state. This may change in a future version. .. _SWORDv2: http://swordapp.github.io/SWORDv2-Profile/SWORDProfile.html diff --git a/docs/api/user-manual.rst b/docs/api/user-manual.rst index ceafd48a..e2bacf04 100644 --- a/docs/api/user-manual.rst +++ b/docs/api/user-manual.rst @@ -1,486 +1,486 @@ .. 
_deposit-user-manual: User Manual =========== This is a guide on how to prepare and push a software deposit with the `swh deposit` commands. Requirements ------------ You need to have an account on the Software Heritage deposit application to be able to use the service. Please `contact the Software Heritage team `_ for more information on how to get access to this service. For testing purposes, a test instance `is available `_ [#f1]_ and will be used in the examples below. Once you have an account, you should get a set of access credentials as a `login` and a `password` (identified as ```` and ```` in the remainder of this document). A deposit account also comes with a "provider URL" which is used by SWH to build the :term:`Origin URL` of deposits created using this account. Installation ------------ To install the `swh.deposit` command line tools, you need a working Python 3.7+ environment. It is strongly recommended you use a `virtualenv `_ for this. .. code:: console $ python3 -m virtualenv deposit [...] $ source deposit/bin/activate (deposit)$ pip install swh.deposit [...] (deposit)$ swh deposit --help Usage: swh deposit [OPTIONS] COMMAND [ARGS]... Deposit main command Options: -h, --help Show this message and exit. Commands: admin Server administration tasks (manipulate user or... status Deposit's status upload Software Heritage Public Deposit Client Create/Update... (deposit)$ Note: in the examples below, we use the `jq`_ tool to make JSON outputs nicer. If you do not have it already, you may install it using your distribution's packaging system. For example, on a Debian system: .. _jq: https://stedolan.github.io/jq/ .. code:: console $ sudo apt install jq -.. _prepare_deposit +.. _prepare-deposit: Prepare a deposit ----------------- * compress the files in a supported archive format: - zip: common zip archive (no multi-disk zip files). 
- tar: tar archive without compression, or optionally with any of the following compression algorithms: gzip (`.tar.gz`, `.tgz`), bzip2 (`.tar.bz2`), or lzma (`.tar.lzma`) * (Optional) prepare a metadata file (more details :ref:`deposit-metadata`): Example: Assuming you want to deposit the source code of `belenios `_ version 1.12 .. code:: console (deposit)$ wget https://gitlab.inria.fr/belenios/belenios/-/archive/1.12/belenios-1.12.zip [...] 2020-10-28 11:40:37 (4,56 MB/s) - ‘belenios-1.12.zip’ saved [449880/449880] (deposit)$ Then you need to prepare a metadata file allowing you to give detailed information on your deposited source code. A rather minimal Atom file with CodeMeta could be: .. code:: console (deposit)$ cat metadata.xml Verifiable online voting system belenios-01243065 https://gitlab.inria.fr/belenios/belenios test Online voting Verifiable online voting system 1.12 opam stable ocaml GNU Affero General Public License Belenios belenios@example.com Belenios Test User (deposit)$ Please read the :ref:`deposit-metadata` page for a more detailed view on the metadata file formats and semantics. Push a deposit -------------- You can push a deposit with: * a single deposit (archive + metadata): The user posts in one query a software source code archive and associated metadata. The deposit is directly marked with status ``deposited``. * a multisteps deposit: 1. Create an incomplete deposit (marked with status ``partial``) 2. Add data to a deposit (in multiple requests if needed) 3. Finalize deposit (the status becomes ``deposited``) * a metadata-only deposit: The user posts in one query an associated metadata file on a :ref:`SWHID ` object. The deposit is directly marked with status ``done``. Overall, a deposit goes through a series of steps as follows: .. 
figure:: ../images/status.svg :alt: The important thing to notice for now is that its status can be: partial: the deposit is partially received expired: deposit has been there too long and is now deemed ready to be garbage collected deposited: deposit is complete and is ready to be checked to ensure data consistency verified: deposit is fully received, checked, and ready for loading loading: loading is ongoing on swh's side done: loading is successful failed: loading failed When you push a deposit, it is either in the `deposited` state or in the `partial` state if you asked for a partial upload. Single deposit ^^^^^^^^^^^^^^ Once the files are ready for deposit, we want to do the actual deposit in one shot, i.e. sending both the archive (zip) file and the metadata file. * 1 archive (content-type ``application/zip`` or ``application/x-tar``) * 1 metadata file in atom xml format (``content-type: application/atom+xml;type=entry``) For this, we need to provide the following: * arguments: ``--username 'name' --password 'pass'`` as credentials * archive's path (example: ``--archive path/to/archive-name.tgz``) * metadata file path (example: ``--metadata path/to/metadata.xml``) to the `swh deposit upload` command. Example: To push the Belenios 1.12 archive we prepared previously to the testing instance of the deposit: .. code:: console (deposit)$ ls belenios-1.12.zip metadata.xml deposit (deposit)$ swh deposit upload --username --password \ --url https://deposit.staging.swh.network/1 \ --slug belenios-01243065 \ --archive belenios-1.12.zip \ --metadata metadata.xml \ --format json | jq { 'deposit_status': 'deposited', 'deposit_id': '1', 'deposit_date': 'Oct. 28, 2020, 1:52 p.m.', 'deposit_status_detail': None } (deposit)$ You just posted a deposit to your main collection on Software Heritage (staging area)! 
The returned value is a JSON dict, in which you will notably find the deposit id (needed to check for its status later on) and the current status, which should be `deposited` if no error has occurred. Note: As the deposit is in ``deposited`` status, you can no longer update the deposit after this query. It will be answered with a 403 (Forbidden) answer. If something went wrong, an equivalent response will be given with the `error` and `detail` keys explaining the issue, e.g.: .. code:: console { 'error': 'Unknown collection name xyz', 'detail': None, 'deposit_status': None, 'deposit_status_detail': None, 'deposit_swh_id': None, 'status': 404 } Once the deposit has been done, you can check its status using the `swh deposit status` command: .. code:: console (deposit)$ swh deposit status --username --password \ --url https://deposit.staging.swh.network/1 \ --deposit-id 1 -f json | jq { "deposit_id": "1", "deposit_status": "done", "deposit_status_detail": "The deposit has been successfully loaded into the Software Heritage archive", "deposit_swh_id": "swh:1:dir:63a6fc0ed8f69bf66ccbf99fc0472e30ef0a895a", "deposit_swh_id_context": "swh:1:dir:63a6fc0ed8f69bf66ccbf99fc0472e30ef0a895a;origin=https://softwareheritage.org/belenios-01234065;visit=swh:1:snp:0ae536667689da7047bfb7aa9f37f5958e9f4647;anchor=swh:1:rev:17ad98c940104d45b6b6bd6fba9aa832eeb95638;path=/", "deposit_external_id": "belenios-01234065" } Metadata-only deposit ^^^^^^^^^^^^^^^^^^^^^ This allows to deposit only metadata information on a :ref:`SWHID reference `. Prepare a metadata file as described in the :ref:`prepare deposit section ` Ensure this metadata file also declares a :ref:`SWHID reference `: .. code:: xml - For this, we then need to provide the following information: * arguments: ``--username 'name' --password 'pass'`` as credentials * metadata file path (example: ``--metadata path/to/metadata.xml``) to the `swh deposit metadata-only` command. Example: .. 
code:: console (deposit)$ swh deposit metadata-only --username --password \ --url https://deposit.staging.swh.network/1 \ --metadata ../deposit-swh.metadata-only.xml \ --format json | jq . { "deposit_id": "29", "deposit_status": "done", "deposit_date": "Dec. 15, 2020, 11:37 a.m." } For details on the metadata-only deposit, see the :ref:`metadata-only deposit protocol reference ` Multisteps deposit ^^^^^^^^^^^^^^^^^^ In this case, the deposit is created by several requests, uploading objects piece by piece. The steps to create a multisteps deposit: 1. Create a partial deposit """""""""""""""""""""""""""" First use the ``--partial`` argument to declare there is more to come: .. code:: console $ swh deposit upload --username name --password secret \ --archive foo.tar.gz \ --partial 2. Add content or metadata to the deposit """"""""""""""""""""""""""""""""""""""""" Continue the deposit by passing the deposit id returned in response to the first step with the ``--deposit-id`` argument. You can continue adding content or metadata as long as you use the ``--partial`` argument. To only add one new archive to the deposit: .. code:: console $ swh deposit upload --username name --password secret \ --archive add-foo.tar.gz \ --deposit-id 42 \ --partial To only add metadata to the deposit: .. code:: console $ swh deposit upload --username name --password secret \ --metadata add-foo.tar.gz.metadata.xml \ --deposit-id 42 \ --partial 3. Finalize deposit """"""""""""""""""" On your last addition (same command as before), by not declaring it ``--partial``, the deposit will be considered completed. Its status will be changed to ``deposited``: .. code:: console $ swh deposit upload --username name --password secret \ --metadata add-foo.tar.gz.metadata.xml \ --deposit-id 42 Update deposit -------------- * Update deposit metadata: - only possible if the deposit status is ``done``, ``--deposit-id `` and ``--swhid `` are provided - by using the ``--metadata`` flag, a path to an xml file .. 
code:: console $ swh deposit upload \ --username name --password secret \ --deposit-id 11 \ --swhid swh:1:dir:2ddb1f0122c57c8479c28ba2fc973d18508e6420 \ --metadata ../deposit-swh.update-metadata.xml * Replace deposit: - only possible if the deposit status is ``partial`` and ``--deposit-id `` is provided - by using the ``--replace`` flag - ``--metadata-deposit`` replaces associated existing metadata - ``--archive-deposit`` replaces associated archive(s) - by default, with no flag or both, you'll replace associated metadata and archive(s): .. code:: console $ swh deposit upload --username name --password secret \ --deposit-id 11 \ --archive updated-je-suis-gpl.tgz \ --replace * Update a loaded deposit with a new version (this creates a new deposit): - by using the external-id with the ``--slug`` argument, you will link the new deposit with its parent deposit: .. code:: console $ swh deposit upload --username name --password secret \ --archive je-suis-gpl-v2.tgz \ --slug 'je-suis-gpl' Check the deposit's status -------------------------- You can check the status of the deposit by using the ``--deposit-id`` argument: .. code:: console $ swh deposit status --username name --password secret \ --deposit-id 11 .. code:: json { "deposit_id": 11, "deposit_status": "deposited", "deposit_swh_id": null, "deposit_status_detail": "Deposit is ready for additional checks \ (tarball ok, metadata, etc...)" } When the deposit has been loaded into the archive, the status will be marked ``done``. In the response, will also be available the , . For example: .. 
code:: json { "deposit_id": 11, "deposit_status": "done", "deposit_swh_id": "swh:1:dir:d83b7dda887dc790f7207608474650d4344b8df9", "deposit_swh_id_context": "swh:1:dir:d83b7dda887dc790f7207608474650d4344b8df9;\ origin=https://forge.softwareheritage.org/source/jesuisgpl/;\ visit=swh:1:snp:68c0d26104d47e278dd6be07ed61fafb561d0d20;\ anchor=swh:1:rev:e76ea49c9ffbb7f73611087ba6e999b19e5d71eb;path=/", "deposit_status_detail": "The deposit has been successfully \ loaded into the Software Heritage archive" } .. rubric:: Footnotes .. [#f1] the test instance of the deposit is not yet available to external users, but it should be available soon. diff --git a/docs/cli.rst b/docs/cli.rst index d004c79a..ad1e2dd4 100644 --- a/docs/cli.rst +++ b/docs/cli.rst @@ -1,35 +1,35 @@ .. _swh-deposit-cli: Command-line interface ====================== Shared command-line interface ----------------------------- .. click:: swh.deposit.cli:deposit :prog: swh deposit :nested: short Administration utilities ------------------------ .. click:: swh.deposit.cli.admin:admin :prog: swh deposit admin :nested: full .. _swh-deposit-cli-client: Deposit client tools -------------------- .. click:: swh.deposit.cli.client:upload :prog: swh deposit :nested: full .. click:: swh.deposit.cli.client:status :prog: swh deposit :nested: full -.. click:: swh.deposit.cli.client:metadata-only +.. click:: swh.deposit.cli.client:metadata_only :prog: swh deposit :nested: full diff --git a/docs/internals/authentication.rst b/docs/internals/authentication.rst index 6d423cf4..e17f6ac1 100644 --- a/docs/internals/authentication.rst +++ b/docs/internals/authentication.rst @@ -1,44 +1,44 @@ .. _authentication: Authentication ============== This is a description of the authentication mechanism used in the deposit server. Both `basic authentication `_ and `keycloak`_ schemes are supported through configuration. Basic ----- The first implementation uses `basic authentication `_. 
The deposit server checks the authentication credentials sent by the deposit client using its own database. If authorized, the deposit client is allowed to continue its deposit. Otherwise, a 401 response is returned to the client. -.. figure:: images/deposit-authentication-basic.svg +.. figure:: ../images/deposit-authentication-basic.svg :alt: Basic Authentication Keycloak -------- Recent changes introduced `keycloak`_, an Open Source Identity and Access Management tool which is already used in other parts of the swh stack. The authentication is delegated to the `swh keycloak instance `_ using the `Resource Owner Password Credentials `_ scheme. Deposit clients still use the deposit as before. Transparently for them, the deposit server forwards their credentials to keycloak for validation. If `keycloak`_ authorizes the deposit client, the deposit server further checks that the deposit client has the proper permission "swh.deposit.api". If they do, they can post their deposits. If any issue arises during one of the authentication checks, the client receives a 401 response (unauthorized). -.. figure:: images/deposit-authentication-keycloak.svg +.. figure:: ../images/deposit-authentication-keycloak.svg :alt: Keycloak Authentication .. _keycloak: https://www.keycloak.org/ diff --git a/docs/internals/loading-workflow.rst b/docs/internals/loading-workflow.rst index 4909ce16..b4fff2d0 100644 --- a/docs/internals/loading-workflow.rst +++ b/docs/internals/loading-workflow.rst @@ -1,91 +1,91 @@ Loading workflow ================ -This section complements the :ref:`deposit-use-case` documentation, +This section complements the :ref:`deposit-use-cases` documentation, by detailing how deposits are handled internally after clients deposited them. 
Reception --------- For every HTTP request sent by a client, the deposit API checks some simple properties, then creates a :class:`swh.deposit.models.DepositRequest` object containing the data uploaded by the client verbatim (archive and/or metadata), and inserts it in the database. A corresponding :class:`swh.deposit.models.Deposit` object is also created and inserted, if this is the initial request creating a deposit. Upon receiving the last request, identified by the lack of the ``In-Progress: true`` header, the deposit server either: * checks the targeted object exists in :ref:`swh-storage `, then sends a request to swh-storage with the Atom metadata and updates the deposit status to ``done``, if it is a :ref:`metadata-only deposit ` * updates the deposit status and schedules a checking task by querying :ref:`swh-scheduler `, otherwise Graphically: .. figure:: ../images/deposit-workflow-reception.svg :alt: For metadata-only deposits, this is the end of the story. The next section narrates what happens next for "normal" deposits. Checking -------- As we saw above, the deposit API server's synchronous work ends after sending a checking task. This task is implemented by :class:`swh.deposit.loader.checker.DepositChecker`, which is simply another call to the deposit API, implemented in :class:`swh.deposit.api.private.deposit_check.APIChecks`. This API performs longer checks, which require inspecting the deposited archive (or archives, for clients depositing archives in multiple steps). This is why it is run by an asynchronous task instead of being checked immediately when the client sent a query. When it is done, it sets the deposit's status to "verified" (so clients polling for the status know this step succeeded) and schedules a loading task. Graphically: .. figure:: ../images/deposit-workflow-checking.svg :alt: Note that the check task is actually just a thin wrapper around an API call. 
While the checks could be done in the task itself, it would mean sending all archives from the deposit API to the celery worker, which would be inefficient. And the gains would not be great, as checking tasks only need to decompress archives, which is not resource intensive. Instead, this long-running call to the API proved to be a simpler and more efficient solution at the current scale of the deposit. Loading ------- When the check task finished, it scheduled a load task, implemented by :class:`swh.loader.package.deposit.loader.DepositLoader`. It is part of the ``swh.loader.package`` package instead of ``swh-deposit``, because its design is close to other :ref:`package loaders `: 1. fetch a tarball 2. extract it 3. use :mod:`swh.model.from_disk` to build SWH objects from it 4. load these objects in :ref:`swh-storage ` The only difference in this process is fetching the tarball from the deposit server, instead of external repositories. This tarball is returned by :class:`swh.deposit.api.private.deposit_read`, which creates it by aggregating all archives sent by the client (usually only one, but the SWORD protocol allows more). Finally, when it is done, the loader updates the deposit status via the deposit API. Graphically: .. figure:: ../images/deposit-workflow-loading.svg :alt: diff --git a/docs/specs/protocol-reference.rst b/docs/specs/protocol-reference.rst index 9deecb9a..9f9255dd 100644 --- a/docs/specs/protocol-reference.rst +++ b/docs/specs/protocol-reference.rst @@ -1,287 +1,287 @@ .. _deposit-protocol: Protocol reference ================== The swh-deposit protocol is an extension of the SWORDv2_ protocol, and the swh-deposit client and server should work with any other SWORDv2-compliant implementation which provides some :ref:`mandatory attributes `. However, we define some extensions by means of extra tags in the Atom entries, which should be used when interacting with the server to use it optimally. 
This means the swh-deposit server should work with a generic SWORDv2 client,
but works much better with these extensions.

All these tags are in the
``https://www.softwareheritage.org/schema/2018/deposit``
XML namespace, denoted using the ``swhdeposit`` prefix in this section.

Origin creation with the ```` tag
-----------------------------------------------------------

Motivation
^^^^^^^^^^

This is the main extension we define. This tag is used after a deposit is
completed, to load it in the Software Heritage archive.

The SWH archive references source code repositories by a URI, called the
:term:`origin` URL. This URI is clearly defined when SWH pulls source code
from such a repository; but not for the push approach used by SWORD, as SWORD
clients do not intrinsically have a URL.

Usage
^^^^^

Instead, clients are expected to provide the origin URL themselves, by adding
a tag in the Atom entry they submit to the server, like this:

.. code:: xml

This will create an origin in the Software Heritage archive, that will point to
the source code artifacts of this deposit.

Semantics of origin URLs
^^^^^^^^^^^^^^^^^^^^^^^^

Origin URLs must be unique to an origin, i.e. to a software project.
The exact definition of a "software project" is left to the clients of the
deposit. Origin URLs should be designed so that future releases of the same
software will have the same origin URL.
As a guideline, consider that every GitHub/GitLab project is an origin, and
every package in Debian/NPM/PyPI is also an origin.

While origin URLs are not required to resolve to a source code artifact,
we recommend they point to a public resource describing the software project,
including a link to download its source code.
This is not a technical requirement, but it improves discoverability.

Clients may not submit arbitrary URLs; the server will check that the URLs
they submit belong to a "namespace" they own, known as the ``provider_url``
of the client.
For example, if a client has their ``provider_url`` set to
``https://example.org/foo/`` they will only be able to submit deposits to
origins whose URL starts with ``https://example.org/foo/``.

Fallbacks
^^^^^^^^^

If the ```` is not provided (either because the clients are generic SWORDv2
implementations or old implementations of an swh-deposit client), the server
falls back to creating one based on the ``provider_url`` and the ``Slug``
header (as defined in the AtomPub_ specification) by concatenating them.
If the ``Slug`` header is missing, the server generates one randomly.

This fallback is provided for compliance with SWORDv2_ clients, but we do not
recommend relying on it, as it usually creates origin URLs that are not
meaningful.

Adding releases to an origin, with the ```` tag
-------------------------------------------------------------------------

When depositing a source code artifact for an origin (i.e. software project)
that was already deposited before, clients should not use ````, as the origin
was already created by the original deposit; and ```` should be used instead.

It is used very similarly to ````:

.. code:: xml

This will create a new :term:`revision` object in the Software Heritage
archive, with the last deposit on this origin as its parent revision, and
reference it from the origin.

If the origin does not exist, it will error.

Metadata
--------

Format
^^^^^^

While the SWORDv2 specification recommends the use of DublinCore_, we prefer
the CodeMeta_ vocabulary, as we already use it in other components of Software
Heritage.

While CodeMeta is designed for use in JSON-LD, it is easy to reuse its
vocabulary and embed it in an XML document, in three steps:

1. use the JSON-LD compact representation of the CodeMeta document
2. replace ``@context`` declarations with XML namespaces
3. unfold JSON lists to sibling XML subtrees

For example, this CodeMeta document:

..
code:: json

   {
      "@context": "https://doi.org/10.5063/SCHEMA/CODEMETA-2.0",
      "name": "My Software",
      "author": [
         {
            "name": "Author 1",
            "email": "foo@example.org"
         },
         {
-            "name": Author 2"
+            "name": "Author 2"
         }
      ]
   }

becomes this XML document:

.. code:: xml

   My Software Author 1 foo@example.org Author 2

Or, equivalently:

.. code:: xml

   My Software Author 1 foo@example.org Author 2

.. _mandatory-attributes:

Mandatory attributes
^^^^^^^^^^^^^^^^^^^^

All deposits must include:

* an ```` tag with an ```` and ````, and
* either ```` or ````

We also highly recommend their CodeMeta equivalent, and any other relevant
metadata, but this is not enforced.

-.. _metatadata-only-deposit:
+.. _metadata-only-deposit:

Metadata-only deposit
---------------------

The swh-deposit server can also be used without a source code artifact, but
only to provide metadata that describes an arbitrary origin or object in
Software Heritage; known as extrinsic metadata.

Unlike regular deposits, there are no restrictions on URL prefixes, so any
client can provide metadata on any origin; and no restrictions on which objects
can be described.

This is done by simply omitting the binary file deposit request of a regular
SWORDv2 deposit, and including information on which object the metadata
describes, by adding a ```` tag in the Atom document.

To describe an origin:

.. code:: xml

And to describe an object:

.. code:: xml

For details on the semantics, see the
:ref:`metadata deposit specification `

Schema
------

Here is an XML schema to summarize the syntax described in this document:

.. literalinclude:: swh.xsd
   :language: xml

.. _SWORDv2: http://swordapp.github.io/SWORDv2-Profile/SWORDProfile.html
.. _AtomPub: https://tools.ietf.org/html/rfc5023
.. _DublinCore: https://www.dublincore.org/
..
_CodeMeta: https://codemeta.github.io/
diff --git a/swh/deposit/utils.py b/swh/deposit/utils.py
index 3482ff60..0bb94c86 100644
--- a/swh/deposit/utils.py
+++ b/swh/deposit/utils.py
@@ -1,234 +1,240 @@
# Copyright (C) 2018-2020 The Software Heritage developers
# See the AUTHORS file at the top-level directory of this distribution
# License: GNU General Public License version 3, or any later version
# See top-level LICENSE file for more information

import logging
from types import GeneratorType
from typing import Any, Dict, Optional, Union

import iso8601
import xmltodict

from swh.model.exceptions import ValidationError
from swh.model.identifiers import (
    ExtendedSWHID,
    ObjectType,
    QualifiedSWHID,
    normalize_timestamp,
)

logger = logging.getLogger(__name__)


def parse_xml(stream, encoding="utf-8"):
    namespaces = {
        "http://www.w3.org/2005/Atom": "atom",
        "http://www.w3.org/2007/app": "app",
        "http://purl.org/dc/terms/": "dc",
        "https://doi.org/10.5063/SCHEMA/CODEMETA-2.0": "codemeta",
        "http://purl.org/net/sword/terms/": "sword",
        "https://www.softwareheritage.org/schema/2018/deposit": "swh",
    }

    data = xmltodict.parse(
        stream,
        encoding=encoding,
        namespaces=namespaces,
        process_namespaces=True,
        dict_constructor=dict,
    )
    if "atom:entry" in data:
        data = data["atom:entry"]
    return data


def merge(*dicts):
    """Given an iterator of dicts, merge them losing no information.

    Args:
        *dicts: arguments are all supposed to be dict to merge into one

    Returns:
        dict merged without losing information

    """

    def _extend(existing_val, value):
        """Given an existing value and a value (as potential lists), merge
        them together without repetition.
""" if isinstance(value, (list, map, GeneratorType)): vals = value else: vals = [value] for v in vals: if v in existing_val: continue existing_val.append(v) return existing_val d = {} for data in dicts: if not isinstance(data, dict): raise ValueError("dicts is supposed to be a variable arguments of dict") for key, value in data.items(): existing_val = d.get(key) if not existing_val: d[key] = value continue if isinstance(existing_val, (list, map, GeneratorType)): new_val = _extend(existing_val, value) elif isinstance(existing_val, dict): if isinstance(value, dict): new_val = merge(existing_val, value) else: new_val = _extend([existing_val], value) else: new_val = _extend([existing_val], value) d[key] = new_val return d def normalize_date(date): """Normalize date fields as expected by swh workers. If date is a list, elect arbitrarily the first element of that list If date is (then) a string, parse it through dateutil.parser.parse to extract a datetime. Then normalize it through swh.model.identifiers.normalize_timestamp. Returns The swh date object """ if isinstance(date, list): date = date[0] if isinstance(date, str): date = iso8601.parse_date(date) return normalize_timestamp(date) def compute_metadata_context(swhid_reference: QualifiedSWHID) -> Dict[str, Any]: """Given a SWHID object, determine the context as a dict. 
""" metadata_context: Dict[str, Any] = {"origin": None} if swhid_reference.qualifiers(): metadata_context = { "origin": swhid_reference.origin, "path": swhid_reference.path, } snapshot = swhid_reference.visit if snapshot: metadata_context["snapshot"] = snapshot anchor = swhid_reference.anchor if anchor: metadata_context[anchor.object_type.name.lower()] = anchor return metadata_context ALLOWED_QUALIFIERS_NODE_TYPE = ( ObjectType.SNAPSHOT, ObjectType.REVISION, ObjectType.RELEASE, ObjectType.DIRECTORY, ) def parse_swh_reference(metadata: Dict,) -> Optional[Union[QualifiedSWHID, str]]: - """Parse swh reference within the metadata dict (or origin) reference if found, None - otherwise. + """Parse swh reference within the metadata dict (or origin) reference if found, + None otherwise. - - - - - + .. code-block:: xml + + + + + + or: - - - - + .. code-block:: xml + + + + + + + Args: + metadata: result of parsing an Atom document with :func:`parse_xml` Raises: ValidationError in case the swhid referenced (if any) is invalid Returns: Either swhid or origin reference if any. None otherwise. 
""" # noqa swh_deposit = metadata.get("swh:deposit") if not swh_deposit: return None swh_reference = swh_deposit.get("swh:reference") if not swh_reference: return None swh_origin = swh_reference.get("swh:origin") if swh_origin: url = swh_origin.get("@url") if url: return url swh_object = swh_reference.get("swh:object") if not swh_object: return None swhid = swh_object.get("@swhid") if not swhid: return None swhid_reference = QualifiedSWHID.from_string(swhid) if swhid_reference.qualifiers(): anchor = swhid_reference.anchor if anchor: if anchor.object_type not in ALLOWED_QUALIFIERS_NODE_TYPE: error_msg = ( "anchor qualifier should be a core SWHID with type one of " f"{', '.join(t.name.lower() for t in ALLOWED_QUALIFIERS_NODE_TYPE)}" ) raise ValidationError(error_msg) visit = swhid_reference.visit if visit: if visit.object_type != ObjectType.SNAPSHOT: raise ValidationError( f"visit qualifier should be a core SWHID with type snp, " f"not {visit.object_type.value}" ) if ( visit and anchor and visit.object_type == ObjectType.SNAPSHOT and anchor.object_type == ObjectType.SNAPSHOT ): logger.warn( "SWHID use of both anchor and visit targeting " f"a snapshot: {swhid_reference}" ) raise ValidationError( "'anchor=swh:1:snp:' is not supported when 'visit' is also provided." ) return swhid_reference def extended_swhid_from_qualified(swhid: QualifiedSWHID) -> ExtendedSWHID: """Used to get the target of a metadata object from a , as the latter uses a QualifiedSWHID.""" return ExtendedSWHID.from_string(str(swhid).split(";")[0])