diff --git a/docs/README.rst b/docs/README.rst index ba8cb27f..669519c5 100644 --- a/docs/README.rst +++ b/docs/README.rst @@ -1,71 +1,71 @@ Software Heritage - Deposit =========================== Simple Web-Service Offering Repository Deposit (S.W.O.R.D) is an interoperability standard for digital file deposit. This repository is both the `SWORD v2`_ Server and a deposit command-line client implementations. This implementation allows interaction between a client (a repository) and a server (SWH repository) to deposit software source code archives and associated metadata. Description ----------- Most of the software source code artifacts present in the SWH Archive are gathered by means of :term:`loader ` workers run by the SWH project from source code origins identified by :term:`lister ` workers. This is a pull mechanism: it's the responsibility of the SWH project to gather and collect source code artifacts that way. Alternatively, SWH allows its partners to push source code artifacts and metadata directly into the Archive with a push-based mechanism. By using this possibility different actors, holding software artifacts or metadata, can preserve their assets without having to pass through an intermediate collaborative development platform, which is already harvested by SWH (e.g. GitHub, GitLab, etc.). -This mechanism is the `deposit`. +This mechanism is the ``deposit``. The main idea is that the deposit is an authenticated access to an API allowing the user to provide source code artifacts -- with metadata -- to be ingested in the SWH Archive. The result of that is a :ref:`SWHID ` that can be used to uniquely and persistently identify that very piece of source code. This unique identifier can then be used to `reference the source code `_ (e.g. in a `scientific paper `_) and retrieve it using the :ref:`vault ` feature of the SWH Archive platform.
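As an illustration of what such an identifier looks like, here is a minimal sketch that splits a SWHID into its object type, its hash and its optional qualifiers. The helper name ``split_swhid`` is hypothetical and is not part of ``swh.deposit``; the sketch only assumes the textual layout ``swh:1:<type>:<40 hex digits>`` followed by ``;key=value`` qualifiers, as seen in the status outputs of the user manual.

```python
import re
from typing import Dict, Tuple

# Core identifier: "swh:1:<object type>:<40 lowercase hex digits>".
# Assumption: only the five object types snp, rel, rev, dir and cnt are used.
_CORE_SWHID = re.compile(r"^swh:1:(snp|rel|rev|dir|cnt):([0-9a-f]{40})$")


def split_swhid(swhid: str) -> Tuple[str, str, Dict[str, str]]:
    """Hypothetical helper: return (object type, hex hash, qualifiers)."""
    core, *raw_qualifiers = swhid.split(";")
    match = _CORE_SWHID.match(core)
    if match is None:
        raise ValueError(f"not a valid core SWHID: {core!r}")
    # Qualifiers such as "origin=..." or "path=/" follow the core identifier.
    qualifiers = dict(q.split("=", 1) for q in raw_qualifiers)
    return match.group(1), match.group(2), qualifiers
```

For a directory identifier with an ``origin`` qualifier, the helper returns ``"dir"``, the 40-character hash and a dictionary holding the origin URL.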
The differences between a piece of code uploaded using the deposit rather than simply asking SWH to archive a repository using the :swh_web:`save code now ` feature are: - a deposited artifact is provided from one of the SWH partners which is regarded as a trusted authority, - a deposited artifact requires metadata properties describing the source code artifact, - a deposited artifact has a codemeta_ metadata entry attached to it, - a deposited artifact has the same visibility on the SWH Archive than a collected repository, - a deposited artifact can be searched with its provided url property on the SWH Archive, - the deposit API uses the `SWORD v2`_ API, thus requires some tooling to send deposits to SWH. These tools are provided with this repository. See the :ref:`deposit-user-manual` page for more details on how to use the deposit client command line tools to push a deposit in the SWH Archive. See the :ref:`deposit-api-specifications` reference pages of the SWORDv2 API implementation in `swh.deposit` if you want to do upload deposits using HTTP requests. Read the :ref:`deposit-metadata` chapter to get more details on what metadata are supported when doing a deposit. -See :ref:`swh-deposit-dev-env` if you want to hack the code of the `swh.deposit` module. +See :ref:`swh-deposit-dev-env` if you want to hack the code of the ``swh.deposit`` module. See :ref:`swh-deposit-prod-env` if you want to deploy your own copy of the `swh.deposit` stack. .. _codemeta: https://codemeta.github.io/ -.. _`SWORD v2`: http://swordapp.org/sword-v2/ +.. _SWORD v2: http://swordapp.org/sword-v2/ diff --git a/docs/api/use-cases.rst b/docs/api/use-cases.rst index c71f6c64..c1c36d3e 100644 --- a/docs/api/use-cases.rst +++ b/docs/api/use-cases.rst @@ -1,247 +1,247 @@ .. 
_deposit-use-cases: Use cases ========= The general idea is that a deposit can be created either in a single request or by multiple requests to allow the user to add elements to the deposit piece by piece (be it the deposited data or the metadata describing it). -An update request that does not have the `In-Progress: true` HTTP header will -de facto declare the deposit as *completed* (aka in the `deposited` status; see +An update request that does not have the ``In-Progress: true`` HTTP header will +de facto declare the deposit as *completed* (aka in the ``deposited`` status; see below) and thus ready for ingestion. Once the deposit is declared *complete* by the user, the server performs a few validation checks. Then, if the deposit is valid, it schedules the ingestion of the deposited data in the Software Heritage Archive (SWH). -There is a `status` property attached to a deposit allowing to follow the +There is a ``status`` property attached to a deposit allowing to follow the processing workflow of the deposit. For example, when this ingestion task -completes successfully, the deposit is marked as `done`. +completes successfully, the deposit is marked as ``done``. Possible deposit statuses are: partial The deposit is partially received, since it can be done in multiple requests. expired Deposit was there too long and is now deemed ready to be garbage-collected. deposited Deposit is complete, ready to be checked. rejected Deposit failed the checks. verified Deposit passed the checks and is ready for loading. loading Injection is ongoing on SWH's side. done Loading is successful. failed Loading failed. .. figure:: ../images/status.svg :alt: This document describes the possible scenarios for creating or updating a deposit. Deposit creation ---------------- From client's deposit repository server to SWH's repository server: 1. The client requests the server's abilities and its associated :ref:`collections ` using the *SD/service document uri* (:http:get:`/1/servicedocument/`). 2. 
The server answers the client with the service document which lists the *collections* linked to the user account (most of the time, there will be one and only one collection linked to the user's account). Each of these collections can be used to push a deposit via its *COL/collection IRI*. 3. The client sends a deposit (a zip archive, some metadata or both) through the *COL/collection uri*. This can be done in: * one POST request (metadata + archive) without the `In-Progress: true` header: - :http:post:`/1/(str:collection-name)/` * one POST request (metadata or archive) **with** `In-Progress: true` header: - :http:post:`/1/(str:collection-name)/` plus one or more PUT or POST requests *to the update uris* (*edit-media iri* or *edit iri*): - :http:post:`/1/(str:collection-name)/(int:deposit-id)/media/` - :http:put:`/1/(str:collection-name)/(int:deposit-id)/media/` - :http:post:`/1/(str:collection-name)/(int:deposit-id)/metadata/` - :http:put:`/1/(str:collection-name)/(int:deposit-id)/metadata/` Then: a. Server validates the client's input or returns detailed error if any. b. Server stores information received (metadata or software archive source code or both). 4. The server creates a loading task and submits it to the :ref:`Job Scheduler ` 5. The server notifies the client it acknowledged the client's request. An ``http 201 Created`` response with a deposit receipt in the body response is sent back. That deposit receipt will hold the necessary information to eventually complete the deposit later on if it was incomplete (also known as status ``partial``). Schema representation ^^^^^^^^^^^^^^^^^^^^^ Scenario: pushing a deposit via the SWORDv2_ protocol (nominal scenario): .. figure:: ../images/deposit-create-chart.svg :alt: Deposit update -------------- 6. Client updates existing deposit through the *update uris* (one or more POST or PUT requests to either the *edit-media iri* or *edit iri*). 1. Server validates the client's input or returns detailed error if any 2. 
Server stores information received (metadata or software archive source code or both) This would be the case for example if the client initially posted a ``partial`` deposit (e.g. only metadata with no archive, or an archive without metadata, or a split archive because the initial one exceeded the limit size imposed by swh repository deposit). The content of a deposit can only be updated while it is in the ``partial`` state; this causes the content to be **replaced** (the old version is discarded). Its metadata, however, can also be updated while in the ``done`` state; see below. Schema representation ^^^^^^^^^^^^^^^^^^^^^ Scenario: updating a deposit via SWORDv2_ protocol: .. figure:: ../images/deposit-update-chart.svg :alt: Deposit deletion (or associated archive, or associated metadata) ---------------------------------------------------------------- 7. Deposit deletion is possible as long as the deposit is still in ``partial`` state. 1. Server validates the client's input or returns detailed error if any 2. Server actually delete information according to request Schema representation ^^^^^^^^^^^^^^^^^^^^^ Scenario: deleting a deposit via SWORDv2_ protocol: .. figure:: ../images/deposit-delete-chart.svg :alt: Client asks for operation status -------------------------------- At any time during the next step, operation status can be read through a GET query to the *state iri*. Deposit loading --------------- In one of the previous steps, when a deposit was created or loaded without ``In-Progress: true``, the deposit server created a load task and submitted it to :ref:`swh-scheduler `. This triggers the following steps: Server: Triggering deposit checks ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Once the status ``deposited`` is reached for a deposit, checks for the associated archive(s) and metadata will be triggered. If those checks fail, the status is changed to ``rejected`` and nothing more happens there. Otherwise, the status is changed to ``verified``. 
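The status changes described so far can be summarised as a small transition table. The following is a minimal sketch, not the server's implementation: the names ``ALLOWED_TRANSITIONS`` and ``can_transition`` are hypothetical, and it assumes that ``expired`` is only reached from ``partial``. It deliberately leaves out the metadata update of a ``done`` deposit and the administrative rescheduling of a deposit.

```python
from typing import Dict, FrozenSet

# Hypothetical transition table built from the status descriptions in this
# document; it is an illustration, not code taken from swh.deposit.
ALLOWED_TRANSITIONS: Dict[str, FrozenSet[str]] = {
    # A partial deposit is either finalized or (assumption) expires.
    "partial": frozenset({"deposited", "expired"}),
    # Checks either pass or fail.
    "deposited": frozenset({"verified", "rejected"}),
    # A verified deposit is picked up by the loader.
    "verified": frozenset({"loading"}),
    # Loading ends in success or failure.
    "loading": frozenset({"done", "failed"}),
    # Terminal statuses in this simplified view.
    "expired": frozenset(),
    "rejected": frozenset(),
    "done": frozenset(),
    "failed": frozenset(),
}


def can_transition(current: str, new: str) -> bool:
    """Return True if the simplified workflow allows current -> new."""
    return new in ALLOWED_TRANSITIONS.get(current, frozenset())
```

With this table, ``can_transition("deposited", "verified")`` holds while ``can_transition("deposited", "done")`` does not, which mirrors the rule that a deposit must pass the checks before it is loaded.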
Server: Triggering deposit load ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Once the status ``verified`` is reached for a deposit, loading the deposit with its associated metadata will be triggered. The loading will result on status update, either ``done`` or ``failed`` (depending on the loading's status). This is described in the :ref:`loading specifications document `. Completing the deposit ---------------------- When this is all done, the loaders notify the deposit server, which sets the deposit status to ``done``. This can then be polled by deposit clients, using the *state iri*. Deposit metadata updates ------------------------ We saw earlier that a deposit can only be updated when in ``partial`` state. This is one exception to this rule: its metadata can be updated while in the ``done`` state; which adds a new version of the metadata in the SWH archive, **in addition to** the old one(s). In this state, ``In-Progress`` is not allowed, so the deposit cannot go back in the ``partial`` state, but only to ``deposited``. As a failsafe, to avoid accidentally updating the wrong deposit, this requires the ``X-Check-SWHID`` HTTP header to be set to the value of the SWHID of the deposit's content (returned after the deposit finished loading). .. _use-case-metadata-only-deposit: Metadata-only deposit --------------------- Finally, as an extension to the SWORD protocol, swh-deposit allows a special type of deposit: metadata-only deposits. Unlike regular deposit (described above), they do not have a code archive. Instead, they describe an existing :term:`software artifact` present in the archive. This use case is triggered by a ```` tag in the Atom document, see the :ref:`protocol reference ` for details. In the current implementation, these deposits are loaded (or rejected) immediately after a request without ``In-Progress: true`` is made, ie. they skip the ``loading`` state. This may change in a future version. .. 
_SWORDv2: http://swordapp.github.io/SWORDv2-Profile/SWORDProfile.html diff --git a/docs/api/user-manual.rst b/docs/api/user-manual.rst index e2bacf04..24be535c 100644 --- a/docs/api/user-manual.rst +++ b/docs/api/user-manual.rst @@ -1,486 +1,486 @@ .. _deposit-user-manual: User Manual =========== This is a guide for how to prepare and push a software deposit with -the `swh deposit` commands. +the ``swh deposit`` commands. Requirements ------------ You need to have an account on the Software Heritage deposit application to be able to use the service. Please `contact the Software Heritage team `_ for more information on how to get access to this service. For testing purpose, a test instance `is available `_ [#f1]_ and will be used in the examples below. Once you have an account, you should get a set of access credentials as a -`login` and a `password` (identified as ```` and ```` in the +``login`` and a ``password`` (identified as ```` and ```` in the remaining of this document). A deposit account also comes with a "provider URL" which is used by SWH to build the :term:`Origin URL` of deposits created using this account. Installation ------------ -To install the `swh.deposit` command line tools, you need a working Python 3.7+ +To install the ``swh.deposit`` command line tools, you need a working Python 3.7+ environment. It is strongly recommended you use a `virtualenv `_ for this. .. code:: console $ python3 -m virtualenv deposit [...] $ source deposit/bin/activate (deposit)$ pip install swh.deposit [...] (deposit)$ swh deposit --help Usage: swh deposit [OPTIONS] COMMAND [ARGS]... Deposit main command Options: -h, --help Show this message and exit. Commands: admin Server administration tasks (manipulate user or... status Deposit's status upload Software Heritage Public Deposit Client Create/Update... (deposit)$ Note: in the examples below, we use the `jq`_ tool to make json outputs nicer. 
If you do not have it already, you may install it using your distribution's packaging system. For example, on a Debian system: .. _jq: https://stedolan.github.io/jq/ .. code:: console $ sudo apt install jq .. _prepare-deposit: Prepare a deposit ----------------- * compress the files in a supported archive format: - zip: common zip archive (no multi-disk zip files). - tar: tar archive without compression or optionally any of the - following compression algorithm gzip (`.tar.gz`, `.tgz`), bzip2 - (`.tar.bz2`) , or lzma (`.tar.lzma`) + following compression algorithm gzip (``.tar.gz``, ``.tgz``), bzip2 + (``.tar.bz2``) , or lzma (``.tar.lzma``) * (Optional) prepare a metadata file (more details :ref:`deposit-metadata`): Example: Assuming you want to deposit the source code of `belenios `_ version 1.12 .. code:: console (deposit)$ wget https://gitlab.inria.fr/belenios/belenios/-/archive/1.12/belenios-1.12.zip [...] 2020-10-28 11:40:37 (4,56 MB/s) - ‘belenios-1.12.zip’ saved [449880/449880] (deposit)$ Then you need to prepare a metadata file allowing you to give detailed information on your deposited source code. A rather minimal Atom with Codemeta file could be: .. code:: console (deposit)$ cat metadata.xml Verifiable online voting system belenios-01243065 https://gitlab.inria.fr/belenios/belenios test Online voting Verifiable online voting system 1.12 opam stable ocaml GNU Affero General Public License Belenios belenios@example.com Belenios Test User (deposit)$ Please read the :ref:`deposit-metadata` page for a more detailed view on the metadata file formats and semantics. Push a deposit -------------- You can push a deposit with: * a single deposit (archive + metadata): The user posts in one query a software source code archive and associated metadata. The deposit is directly marked with status ``deposited``. * a multisteps deposit: 1. Create an incomplete deposit (marked with status ``partial``) 2. Add data to a deposit (in multiple requests if needed) 3. 
Finalize deposit (the status becomes ``deposited``) * a metadata-only deposit: The user posts in one query an associated metadata file on a :ref:`SWHID ` object. The deposit is directly marked with status ``done``. Overall, a deposit goes through a series of steps as follows: .. figure:: ../images/status.svg :alt: The important thing to notice for now is that it can be: partial: the deposit is partially received expired: deposit has been there too long and is now deemed ready to be garbage collected deposited: deposit is complete and is ready to be checked to ensure data consistency verified: deposit is fully received, checked, and ready for loading loading: loading is ongoing on swh's side done: loading is successful failed: loading failed -When you push a deposit, it is either in the `deposited` state or in the -`partial` state if you asked for a partial upload. +When you push a deposit, it is either in the ``deposited`` state or in the +``partial`` state if you asked for a partial upload. Single deposit ^^^^^^^^^^^^^^ Once the files are ready for deposit, we want to do the actual deposit in one shot, i.e. sending both the archive (zip) file and the metadata file. * 1 archive (content-type ``application/zip`` or ``application/x-tar``) * 1 metadata file in atom xml format (``content-type: application/atom+xml;type=entry``) For this, we need to provide the: * arguments: ``--username 'name' --password 'pass'`` as credentials * archive's path (example: ``--archive path/to/archive-name.tgz``) * metadata file path (example: ``--metadata path/to/metadata.xml``) -to the `swh deposit upload` command. +to the ``swh deposit upload`` command. Example: To push the Belenios 1.12 we prepared previously on the testing instance of the deposit: .. 
code:: console (deposit)$ ls belenios-1.12.zip metadata.xml deposit (deposit)$ swh deposit upload --username --password \ --url https://deposit.staging.swh.network/1 \ --slug belenios-01243065 \ --archive belenios.zip \ --metadata metadata.xml \ --format json | jq { 'deposit_status': 'deposited', 'deposit_id': '1', 'deposit_date': 'Oct. 28, 2020, 1:52 p.m.', 'deposit_status_detail': None } (deposit)$ You just posted a deposit to your main collection on Software Heritage (staging area)! The returned value is a JSON dict, in which you will notably find the deposit id (needed to check for its status later on) and the current status, which -should be `deposited` if no error has occurred. +should be ``deposited`` if no error has occurred. Note: As the deposit is in ``deposited`` status, you can no longer update the deposit after this query. It will be answered with a 403 (Forbidden) answer. If something went wrong, an equivalent response will be given with the -`error` and `detail` keys explaining the issue, e.g.: +``error`` and ``detail`` keys explaining the issue, e.g.: .. code:: console { 'error': 'Unknown collection name xyz', 'detail': None, 'deposit_status': None, 'deposit_status_detail': None, 'deposit_swh_id': None, 'status': 404 } -Once the deposit has been done, you can check its status using the `swh deposit -status` command: +Once the deposit has been done, you can check its status using the ``swh deposit +status`` command: .. 
code:: console (deposit)$ swh deposit status --username --password \ --url https://deposit.staging.swh.network/1 \ --deposit-id 1 -f json | jq { "deposit_id": "1", "deposit_status": "done", "deposit_status_detail": "The deposit has been successfully loaded into the Software Heritage archive", "deposit_swh_id": "swh:1:dir:63a6fc0ed8f69bf66ccbf99fc0472e30ef0a895a", "deposit_swh_id_context": "swh:1:dir:63a6fc0ed8f69bf66ccbf99fc0472e30ef0a895a;origin=https://softwareheritage.org/belenios-01234065;visit=swh:1:snp:0ae536667689da7047bfb7aa9f37f5958e9f4647;anchor=swh:1:rev:17ad98c940104d45b6b6bd6fba9aa832eeb95638;path=/", "deposit_external_id": "belenios-01234065" } Metadata-only deposit ^^^^^^^^^^^^^^^^^^^^^ This allows depositing only metadata information on a :ref:`SWHID reference `. Prepare a metadata file as described in the :ref:`prepare deposit section ` Ensure this metadata file also declares a :ref:`SWHID reference `: .. code:: xml For this, we then need to provide the following information: * arguments: ``--username 'name' --password 'pass'`` as credentials * metadata file path (example: ``--metadata path/to/metadata.xml``) -to the `swh deposit metadata-only` command. +to the ``swh deposit metadata-only`` command. Example: .. code:: console (deposit)$ swh deposit metadata-only --username --password \ --url https://deposit.staging.swh.network/1 \ --metadata ../deposit-swh.metadata-only.xml \ --format json | jq . { "deposit_id": "29", "deposit_status": "done", "deposit_date": "Dec. 15, 2020, 11:37 a.m." } For details on the metadata-only deposit, see the :ref:`metadata-only deposit protocol reference ` Multisteps deposit ^^^^^^^^^^^^^^^^^^ In this case, the deposit is created by several requests, uploading objects piece by piece. The steps to create a multisteps deposit: 1. Create a partial deposit """""""""""""""""""""""""""" First use the ``--partial`` argument to declare there is more to come .. 
code:: console $ swh deposit upload --username name --password secret \ --archive foo.tar.gz \ --partial 2. Add content or metadata to the deposit """"""""""""""""""""""""""""""""""""""""" Continue the deposit by using the ``--deposit-id`` argument given as a response for the first step. You can continue adding content or metadata while you use the ``--partial`` argument. To only add one new archive to the deposit: .. code:: console $ swh deposit upload --username name --password secret \ --archive add-foo.tar.gz \ --deposit-id 42 \ --partial To only add metadata to the deposit: .. code:: console $ swh deposit upload --username name --password secret \ --metadata add-foo.tar.gz.metadata.xml \ --deposit-id 42 \ --partial 3. Finalize deposit """"""""""""""""""" On your last addition (same command as before), by not declaring it ``--partial``, the deposit will be considered completed. Its status will be changed to ``deposited``: .. code:: console $ swh deposit upload --username name --password secret \ --metadata add-foo.tar.gz.metadata.xml \ --deposit-id 42 Update deposit -------------- * Update deposit metadata: - only possible if the deposit status is ``done``, ``--deposit-id `` and ``--swhid `` are provided - by using the ``--metadata`` flag, a path to an xml file .. code:: console $ swh deposit upload \ --username name --password secret \ --deposit-id 11 \ --swhid swh:1:dir:2ddb1f0122c57c8479c28ba2fc973d18508e6420 \ --metadata ../deposit-swh.update-metadata.xml * Replace deposit: - only possible if the deposit status is ``partial`` and ``--deposit-id `` is provided - by using the ``--replace`` flag - ``--metadata-deposit`` replaces associated existing metadata - ``--archive-deposit`` replaces associated archive(s) - by default, with no flag or both, you'll replace associated metadata and archive(s): .. 
code:: console $ swh deposit upload --username name --password secret \ --deposit-id 11 \ --archive updated-je-suis-gpl.tgz \ --replace * Update a loaded deposit with a new version (this creates a new deposit): - by using the external-id with the ``--slug`` argument, you will link the new deposit with its parent deposit: .. code:: console $ swh deposit upload --username name --password secret \ --archive je-suis-gpl-v2.tgz \ --slug 'je-suis-gpl' Check the deposit's status -------------------------- You can check the status of the deposit by using the ``--deposit-id`` argument: .. code:: console $ swh deposit status --username name --password secret \ --deposit-id 11 .. code:: json { "deposit_id": 11, "deposit_status": "deposited", "deposit_swh_id": null, "deposit_status_detail": "Deposit is ready for additional checks \ (tarball ok, metadata, etc...)" } When the deposit has been loaded into the archive, the status will be marked ``done``. In the response, will also be available the , . For example: .. code:: json { "deposit_id": 11, "deposit_status": "done", "deposit_swh_id": "swh:1:dir:d83b7dda887dc790f7207608474650d4344b8df9", "deposit_swh_id_context": "swh:1:dir:d83b7dda887dc790f7207608474650d4344b8df9;\ origin=https://forge.softwareheritage.org/source/jesuisgpl/;\ visit=swh:1:snp:68c0d26104d47e278dd6be07ed61fafb561d0d20;\ anchor=swh:1:rev:e76ea49c9ffbb7f73611087ba6e999b19e5d71eb;path=/", "deposit_status_detail": "The deposit has been successfully \ loaded into the Software Heritage archive" } .. rubric:: Footnotes .. [#f1] the test instance of the deposit is not yet available to external users, but it should be available soon. diff --git a/docs/endpoints/collection.rst b/docs/endpoints/collection.rst index c7edc745..e9caf941 100644 --- a/docs/endpoints/collection.rst +++ b/docs/endpoints/collection.rst @@ -1,88 +1,88 @@ .. _API-create-deposit: Create deposit ^^^^^^^^^^^^^^^ .. 
http:post:: /1/(str:collection-name)/ - Create deposit in a collection which name is `collection-name`. + Create deposit in a collection which name is ``collection-name``. The client sends a deposit request to a specific collection with: * an archive holding the software source code (binary upload) * an envelop with metadata describing information regarding a deposit (atom entry deposit) Also known as: COL-IRI **Example query**: .. code:: shell curl -i -u hal: \ -F "file=@deposit.json;type=application/zip;filename=payload" \ -F "atom=@atom-entry.xml;type=application/atom+xml;charset=UTF-8" \ -H 'In-Progress: false' \ -XPOST https://deposit.softwareheritage.org/1/hal/ .. code:: http POST /1/hal/ HTTP/1.1 Host: deposit.softwareheritage.org Authorization: Basic xxxxxxxxxxxx= In-Progress: false Content-Length: 123456 Content-Type: multipart/form-data; boundary=----------------------123456798 **Example response**: .. code:: http HTTP/1.1 201 Created Date: Tue, 26 Sep 2017 10:32:35 GMT Server: WSGIServer/0.2 CPython/3.5.3 Vary: Accept, Cookie Allow: GET, POST, PUT, DELETE, HEAD, OPTIONS Location: /1/hal/10/metadata/ X-Frame-Options: SAMEORIGIN Content-Type: application/xml 10 Sept. 26, 2017, 10:32 a.m. None deposited http://purl.org/net/sword/package/SimpleZip Note: older versions of the deposit used the ``http://www.w3.org/2005/Atom`` namespace instead of ``https://www.softwareheritage.org/schema/2018/deposit``. Tags in the Atom namespace are still provided for backward compatibility, but are deprecated. :reqheader Authorization: Basic authentication token :reqheader Content-Type: accepted mimetype :reqheader Content-Length: tarball size :reqheader Content-MD5: md5 checksum hex encoded of the tarball :reqheader Content-Disposition: attachment; filename=[filename]; the filename parameter must be text (ascii); for the metadata file set name parameter to 'atom'. - :reqheader In-progress: `true` if not final; `false` when final request. 
+ :reqheader In-progress: ``true`` if not final; ``false`` when final request. :statuscode 201: success for deposit on POST :statuscode 401: Unauthorized :statuscode 404: access to an unknown collection :statuscode 415: unsupported media type diff --git a/docs/endpoints/update-media.rst b/docs/endpoints/update-media.rst index 3b275576..67811e16 100644 --- a/docs/endpoints/update-media.rst +++ b/docs/endpoints/update-media.rst @@ -1,27 +1,27 @@ Update content ^^^^^^^^^^^^^^^ .. http:post:: /1/(str:collection-name)/(int:deposit-id)/media/ Add archive(s) to a deposit. Only possible if the deposit's status is partial. .. http:put:: /1/(str:collection-name)/(int:deposit-id)/media/ Replace all content by submitting a new archive. Only possible if the deposit's status is partial. Also known as: *update iri* (EM-IRI) :reqheader Authorization: Basic authentication token :reqheader Content-Type: accepted mimetype :reqheader Content-Length: tarball size :reqheader Content-MD5: md5 checksum hex encoded of the tarball :reqheader Content-Disposition: attachment; filename=[filename] ; the filename parameter must be text (ascii) - :reqheader In-progress: `true` if not final; `false` when final request. + :reqheader In-progress: ``true`` if not final; ``false`` when final request. :statuscode 204: success without payload on PUT :statuscode 201: success for deposit on POST :statuscode 401: Unauthorized :statuscode 415: unsupported media type diff --git a/docs/internals/prod-environment.rst b/docs/internals/prod-environment.rst index 69a4d28c..8f4010d7 100644 --- a/docs/internals/prod-environment.rst +++ b/docs/internals/prod-environment.rst @@ -1,115 +1,115 @@ .. 
_swh-deposit-prod-env: Production deployment ===================== The deposit is architectured around 3 parts: - server: a django application exposing an xml api, discussing with a postgresql backend (and optionally a keycloak instance) - worker(s): 1 worker service dedicated to check the deposit archive and metadata are correct (the checker), another worker service dedicated to actually ingest the deposit into the swh archive. - - client: a python script `swh deposit` command line interface. + - client: a python script ``swh deposit`` command line interface. All those are packaged in 3 separated debian packages, created and uploaded to the swh debian repository. The deposit server and workers configuration are managed by puppet (cf. puppet-environment/swh-site, puppet-environment/swh-role, puppet-environment/swh-profile) In the following document, we will focus on the server actions that may be needed once the server is installed or upgraded. Prepare the database setup (existence, connection, etc...). ----------------------------------------------------------- This is defined through the packaged module ``swh.deposit.settings.production`` and the expected **/etc/softwareheritage/deposit/server.yml** configuration file. Environment (production/staging) -------------------------------- -`SWH_CONFIG_FILENAME` must be defined and target the deposit server configuration file. +``SWH_CONFIG_FILENAME`` must be defined and target the deposit server configuration file. So either 1. prefix the following commands or 2. export the environment variable in your shell session. For the remaining part of the documentation, we assume 2. has been configured. .. code:: shell export SWH_CONFIG_FILENAME=/etc/softwareheritage/deposit/server.yml Migrate the db schema --------------------- The debian package may integrate some new schema modifications. To run them: .. 
code:: shell sudo django-admin migrate --settings=swh.deposit.settings.production Add client and collection ------------------------- The deposit can be configured to use either the 1. django basic authentication framework or the 2. swh keycloak instance. If the server uses 2., the password is managed by -keycloak so the option `--password`` is ignored. +keycloak so the option ``--password`` is ignored. * basic .. code:: shell swh deposit admin \ --config-file $SWH_CONFIG_FILENAME \ --platform production \ user create \ --collection \ --username \ --password This adds a user ```` which can access the collection ````. The password will be used for checking the authentication access to the deposit api (if 1. is used). Note: - If the collection does not exist, it is created alongside - The password, if required, is passed as plain text but stored encrypted Reschedule a deposit --------------------- If for some reason, the loading failed, after fixing and deploying the new deposit loader, you can reschedule the impacted deposit through: .. code:: shell swh deposit admin \ --config-file $SWH_CONFIG_FILENAME \ --platform production \ deposit reschedule \ --deposit-id This will: - check the deposit's status to something reasonable (failed or done). That means that the checks have passed but something went wrong during the loading (failed: loading failed, done: loading ok, still for some reasons as in bugs, we need to reschedule it) - reset the deposit's status to 'verified' (prior to any loading but after the checks which are fine) and removes the different archives' identifiers (swh-id, ...) - trigger back the loading task through the scheduler Integration checks ------------------ There exists icinga checks running periodically on `staging`_ and `production`_ instances. If any problem arises, expect those to notify the #swh-sysadm irc channel. .. 
_staging: https://icinga.softwareheritage.org/search?q=deposit#!/monitoring/service/show?host=pergamon.softwareheritage.org&service=staging%20Check%20deposit%20end-to-end .. _production: https://icinga.softwareheritage.org/search?q=deposit#!/monitoring/service/show?host=pergamon.softwareheritage.org&service=production%20Check%20deposit%20end-to-end diff --git a/docs/specs/spec-loading.rst b/docs/specs/spec-loading.rst index f88d1cce..d0d5ec69 100644 --- a/docs/specs/spec-loading.rst +++ b/docs/specs/spec-loading.rst @@ -1,472 +1,472 @@ .. _swh-loading-specs: Loading specification ===================== An important part of the deposit specifications is the loading procedure where a deposit is ingested into the Software Heritage Archive (SWH) using the deposit loader and the complete process of software artifacts creation in the archive. Deposit Loading --------------- The ``swh.loader.package.deposit`` module is able to inject zipfile/tarball's content in SWH with its metadata. The loading of the deposit will use the deposit's associated data: * the metadata * the archive file(s) Artifacts creation ------------------ Deposit to artifacts mapping ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ This is a global view of the deposit ingestion +------------------------------------+-----------------------------------------+ | swh artifact | representation in deposit | +====================================+=========================================+ | origin | https://hal.inria.fr/hal-id | +------------------------------------+-----------------------------------------+ | raw_extrinsic_metadata | aggregated metadata | +------------------------------------+-----------------------------------------+ | snapshot | reception of all occurrences (branches) | +------------------------------------+-----------------------------------------+ | branches | master & tags for releases | | | (not yet implemented) | +------------------------------------+-----------------------------------------+ | release | 
(optional) synthetic release created | | | from metadata (not yet implemented) | +------------------------------------+-----------------------------------------+ | revision | synthetic revision pointing to | | | the directory (see below) | +------------------------------------+-----------------------------------------+ | directory | root directory of the expanded submitted| | | tarball | +------------------------------------+-----------------------------------------+

Origin artifact
~~~~~~~~~~~~~~~

If the ```` is missing, we create an origin URL by concatenating the client's ``provider_url`` and the value of the Slug header of the initial POST request of the deposit (or a randomly generated slug if that header is missing).

For example:

.. code-block:: bash

   $ http -pb https://archive.softwareheritage.org/api/1/origin/https://hal.archives-ouvertes.fr/hal-02560320/get/

would result in:

.. code-block:: json

   {
       "origin_visits_url": "https://archive.softwareheritage.org/api/1/origin/https://hal.archives-ouvertes.fr/hal-02560320/visits/",
       "url": "https://hal.archives-ouvertes.fr/hal-02560320"
   }

Visits
~~~~~~

Each deposit push to the same origin is identified by a visit. In the example below, two snapshots are identified by two different visits.

For example:

.. code-block:: bash

   $ http -pb https://archive.softwareheritage.org/api/1/origin/https://hal.archives-ouvertes.fr/hal-02560320/visits/

would result in:

.. 
code-block:: json [ { "date": "2020-05-14T11:59:55.942964+00:00", "metadata": {}, "origin": "https://hal.archives-ouvertes.fr/hal-02560320", "origin_visit_url": "https://archive.softwareheritage.org/api/1/origin/https://hal.archives-ouvertes.fr/hal-02560320/visit/2/", "snapshot": "e5e82d064a9c3df7464223042e0c55d72ccff7f0", "snapshot_url": "https://archive.softwareheritage.org/api/1/snapshot/e5e82d064a9c3df7464223042e0c55d72ccff7f0/", "status": "full", "type": "deposit", "visit": 2 }, { "date": "2020-05-14T11:59:41.094260+00:00", "metadata": {}, "origin": "https://hal.archives-ouvertes.fr/hal-02560320", "origin_visit_url": "https://archive.softwareheritage.org/api/1/origin/https://hal.archives-ouvertes.fr/hal-02560320/visit/1/", "snapshot": "3e95ef6e04c381a34cc2f314576bc5644f2c797f", "snapshot_url": "https://archive.softwareheritage.org/api/1/snapshot/3e95ef6e04c381a34cc2f314576bc5644f2c797f/", "status": "full", "type": "deposit", "visit": 1 } ] Snapshot artifact ~~~~~~~~~~~~~~~~~ The snapshot represents one deposit push. The ``HEAD`` branch points to a synthetic revision. For example: .. code-block:: bash $ http -pb https://archive.softwareheritage.org/api/1/snapshot/3e95ef6e04c381a34cc2f314576bc5644f2c797f/ would result in: .. code-block:: json { "branches": { "HEAD": { "target": "2122424b547a8eca9282ba3131ec61ff1d8df7d4", "target_type": "revision", "target_url": "https://archive.softwareheritage.org/api/1/revision/2122424b547a8eca9282ba3131ec61ff1d8df7d4/" } }, "id": "3e95ef6e04c381a34cc2f314576bc5644f2c797f", "next_branch": null } Note that previous versions of the deposit-loader named the branch ``master`` instead, and created release branches under certain conditions. Release artifact ~~~~~~~~~~~~~~~~ .. warning:: This part of the specification is not implemented yet, only revisions are currently being created. The content is deposited with a set of descriptive metadata in the CodeMeta vocabulary. 
The following CodeMeta terms imply that the artifact is a release:

-- `releaseNotes`
-- `softwareVersion`
+- ``releaseNotes``
+- ``softwareVersion``

If present, a release artifact will be created with the mapping below (``X`` means not applicable):

+-------------------+-----------------------------------+-----------------+----------------+
| SWH release field | Description                       | CodeMeta term   | Fallback value |
+===================+===================================+=================+================+
| target            | revision containing all metadata  | X               | X              |
+-------------------+-----------------------------------+-----------------+----------------+
| target_type       | revision                          | X               | X              |
+-------------------+-----------------------------------+-----------------+----------------+
| name              | release or tag name (mandatory)   | softwareVersion | X              |
+-------------------+-----------------------------------+-----------------+----------------+
| message           | message associated with release   | releaseNotes    | X              |
+-------------------+-----------------------------------+-----------------+----------------+
| date              | release date = publication date   | datePublished   | deposit_date   |
+-------------------+-----------------------------------+-----------------+----------------+
| author            | deposit client                    | author          | X              |
+-------------------+-----------------------------------+-----------------+----------------+

.. 
code-block:: json { "release": { "author": { "email": "hal@ccsd.cnrs.fr", "fullname": "HAL ", "name": "HAL" }, "author_url": "/api/1/person/x/", "date": "2019-05-27T16:28:33+02:00", "id": "a9f3396f372ed4a51d75e15ca16c1c2df1fc5c97", "message": "AffectationRO Version 1.1 - added new feature\n", "name": "1.1", "synthetic": true, "target": "396b1ff29f7c75a0a3cc36f30e24ff7bae70bb52", "target_type": "revision", "target_url": "/api/1/revision/396b1ff29f7c75a0a3cc36f30e24ff7bae70bb52/" } } Revision artifact ~~~~~~~~~~~~~~~~~ The metadata sent with the deposit is stored outside the revision, and does not affect the hash computation. It contains the same fields as any revision object; in particular: +-------------------+-----------------------------------------+ | SWH revision field| Description | +===================+=========================================+ | message | synthetic message, containing the name | | | of the deposit client and an internal | | | identifier of the deposit. For example: | | | ``hal: Deposit 817 in collection hal`` | +-------------------+-----------------------------------------+ | author | synthetic author (SWH itself, for now) | +-------------------+-----------------------------------------+ | committer | same as the author (for now) | +-------------------+-----------------------------------------+ | date | see below | +-------------------+-----------------------------------------+ | committer_date | see below | +-------------------+-----------------------------------------+ The date mapping ^^^^^^^^^^^^^^^^ A deposit may contain 4 different dates concerning the software artifacts. The deposit's revision will reflect the most accurate point in time available. 
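
As a rough illustration of the fallback rules given in the tables of this section, the sketch below derives the two revision dates for a deposit. It is not the deposit loader's actual code: the function name ``revision_dates`` and its parameters are hypothetical, and only the CodeMeta terms and the ``reception_date`` fallback are taken from this specification.

.. code-block:: python

   from datetime import datetime, timezone
   from typing import Optional, Tuple


   def revision_dates(
       reception_date: datetime,
       date_created: Optional[datetime] = None,
       date_published: Optional[datetime] = None,
   ) -> Tuple[datetime, datetime]:
       """Return (date, committer_date) for the synthetic revision.

       Hypothetical helper: each CodeMeta date is used when present,
       otherwise the reception date of the deposit is used instead.
       """
       date = date_created if date_created is not None else reception_date
       committer_date = (
           date_published if date_published is not None else reception_date
       )
       return date, committer_date


   # Example: only dateCreated was provided in the metadata.
   received = datetime(2019, 5, 27, 14, 28, 33, tzinfo=timezone.utc)
   created = datetime(2012, 1, 1, tzinfo=timezone.utc)
   date, committer_date = revision_dates(received, date_created=created)

In this example ``date`` comes from ``dateCreated``, while ``committer_date`` falls back to the reception date because no ``datePublished`` was given.
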
Here are all dates that can be available in a deposit: +----------------+---------------------------------+------------------------------------------------+ | dates | location | Description | +================+=================================+================================================+ | reception_date | On SWORD reception (automatic) | the deposit was received at this ts | +----------------+---------------------------------+------------------------------------------------+ | complete_date | On SWH ingestion (automatic) | the ingestion was completed by SWH at this ts | +----------------+---------------------------------+------------------------------------------------+ | dateCreated | metadata in codeMeta (optional) | the software artifact was created at this ts | +----------------+---------------------------------+------------------------------------------------+ | datePublished | metadata in codeMeta (optional) | the software was published (contributed in HAL)| +----------------+---------------------------------+------------------------------------------------+ A visit targeting a snapshot contains one date: +-------------------+----------------------------------------------+----------------+ | SWH visit field | Description | value | +===================+==============================================+================+ | date | the origin pushed the deposit at this date | reception_date | +-------------------+----------------------------------------------+----------------+ A revision contains two dates: +-------------------+-----------------------------------------+----------------+----------------+ | SWH revision field| Description | CodeMeta term | Fallback value | +===================+=========================================+================+================+ | date | date of software artifact modification | dateCreated | reception_date | +-------------------+-----------------------------------------+----------------+----------------+ | committer_date | 
date of the commit in VCS | datePublished | reception_date | +-------------------+-----------------------------------------+----------------+----------------+ A release contains one date: +-------------------+----------------------------------+----------------+-----------------+ | SWH release field |Description | CodeMeta term | Fallback value | +===================+==================================+================+=================+ | date |release date = publication date | datePublished | reception_date | +-------------------+----------------------------------+----------------+-----------------+ .. code-block:: json { "revision": { "author": { "email": "robot@softwareheritage.org", "fullname": "Software Heritage", "id": 18233048, "name": "Software Heritage" }, "author_url": "/api/1/person/18233048/", "committer": { "email": "robot@softwareheritage.org", "fullname": "Software Heritage", "id": 18233048, "name": "Software Heritage" }, "committer_date": "2019-05-27T16:28:33+02:00", "committer_url": "/api/1/person/18233048/", "date": "2012-01-01T00:00:00+00:00", "directory": "fb13b51abbcfd13de85d9ba8d070a23679576cd7", "directory_url": "/api/1/directory/fb13b51abbcfd13de85d9ba8d070a23679576cd7/", "history_url": "/api/1/revision/396b1ff29f7c75a0a3cc36f30e24ff7bae70bb52/log/", "id": "396b1ff29f7c75a0a3cc36f30e24ff7bae70bb52", "merge": false, "message": "hal: Deposit 282 in collection hal", "metadata": { "@xmlns": "http://www.w3.org/2005/Atom", "@xmlns:codemeta": "https://doi.org/10.5063/SCHEMA/CODEMETA-2.0", "author": { "email": "hal@ccsd.cnrs.fr", "name": "HAL" }, "codemeta:applicationCategory": "info", "codemeta:author": { "codemeta:name": "Morane Gruenpeter" }, "codemeta:codeRepository": "www.code-repository.com", "codemeta:contributor": "Morane Gruenpeter", "codemeta:dateCreated": "2012", "codemeta:datePublished": "2019-05-27T16:28:33+02:00", "codemeta:description": "description\\_en test v2", "codemeta:developmentStatus": "Inactif", "codemeta:keywords": 
"mot_cle_en,mot_cle_2_en,mot_cle_fr", "codemeta:license": [ { "codemeta:name": "MIT License" }, { "codemeta:name": "CeCILL Free Software License Agreement v1.1" } ], "codemeta:name": "Test\\_20190527\\_01", "codemeta:operatingSystem": "OS", "codemeta:programmingLanguage": "Java", "codemeta:referencePublication": null, "codemeta:relatedLink": null, "codemeta:releaseNotes": "releaseNote", "codemeta:runtimePlatform": "outil", "codemeta:softwareVersion": "1.0.1", "codemeta:url": "https://hal.archives-ouvertes.fr/hal-02140606", "codemeta:version": "2", "id": "hal-02140606", "original_artifact": [ { "archive_type": "zip", "blake2s256": "96be3ddedfcee9669ad9c42b0bb3a706daf23824d04311c63505a4d8db02df00", "length": 193072, "name": "archive.zip", "sha1": "5b6ecc9d5bb113ff69fc275dcc9b0d993a8194f1", "sha1_git": "bd10e4d3ede17162692d7e211e08e87e67994488", "sha256": "3e2ce93384251ce6d6da7b8f2a061a8ebdaf8a28b8d8513223ca79ded8a10948" } ] }, "parents": [ { "id": "a9fdc3937d2b704b915852a64de2ab1b4b481003", "url": "/api/1/revision/a9fdc3937d2b704b915852a64de2ab1b4b481003/" } ], "synthetic": true, "type": "tar", "url": "/api/1/revision/396b1ff29f7c75a0a3cc36f30e24ff7bae70bb52/" } } Directory artifact ~~~~~~~~~~~~~~~~~~ The directory artifact is the archive(s)' raw content deposited. .. code-block:: json { "directory": [ { "dir_id": "fb13b51abbcfd13de85d9ba8d070a23679576cd7", "length": null, "name": "AffectationRO", "perms": 16384, "target": "fbc418f9ac2c39e8566b04da5dc24b14e65b23b1", "target_url": "/api/1/directory/fbc418f9ac2c39e8566b04da5dc24b14e65b23b1/", "type": "dir" } ] } Questions raised concerning loading ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - A deposit has one origin, yet an origin can have multiple deposits? No, an origin can have multiple requests for the same deposit. Which should end up in one single deposit (when the client pushes its final request saying deposit 'done' through the header In-Progress). Only update of existing 'partial' deposit is permitted. 
No other 'update' operation is allowed on a deposit. To create a new version of an already deposited software, the client must create a new deposit.

Illustration:

First deposit loading:

HAL's deposit 01535619 = SWH's deposit **01535619-1**

::

   + 1 origin with url:https://hal.inria.fr/medihal-01535619
   + 1 synthetic revision
   + 1 directory

HAL's update on deposit 01535619 = SWH's deposit **01535619-2**

(\*with HAL, updates can only be on the metadata; a new version is required if the content changes)

::

   + 1 origin with url:https://hal.inria.fr/medihal-01535619
   + new synthetic revision (with new metadata)
   + same directory

HAL's deposit 01535619-v2 = SWH's deposit **01535619-v2-1**

::

   + same origin
   + new revision
   + new directory

Scheduling loading
~~~~~~~~~~~~~~~~~~

All ``archive`` and ``metadata`` deposit requests should be aggregated before loading.

The loading should be scheduled via the scheduler's API.

Only deposits in the ``deposited`` status are concerned by the loading.

When the loading is done and successful, the deposit entry is updated:

- ``status`` is updated to ``done``
- ``swh-id`` is populated with the resulting :ref:`SWHID `
- ``complete_date`` is updated to the time the loading finished

When the loading has failed, the deposit entry is updated:

- ``status`` is updated to ``failed``
- ``swh-id`` and ``complete_date`` remain as is

*Note:* As a further improvement, we may prefer having a retry policy with graceful delays for further scheduling.

Metadata loading
~~~~~~~~~~~~~~~~

- the metadata received with the deposit are kept in a dedicated table ``raw_extrinsic_metadata``, distinct from the ``revision`` and ``origin`` tables.
- ``authority`` is computed from the deposit client information, and ``fetcher`` is the deposit loader.

diff --git a/docs/specs/spec-meta-deposit.rst b/docs/specs/spec-meta-deposit.rst
index b72a4301..82fd767b 100644
--- a/docs/specs/spec-meta-deposit.rst
+++ b/docs/specs/spec-meta-deposit.rst
@@ -1,132 +1,132 @@
.. 
_spec-metadata-deposit: The metadata-only deposit ========================= Goal ---- A client may wish to deposit only metadata about an origin or object already present in the Software Heritage archive. The metadata-only deposit is a special deposit where no content is provided and the data transferred to Software Heritage is only the metadata about an object in the archive. Requirements ------------ 1. Create a metadata-only deposit through a :ref:`POST request` 2. It is composed of ONLY one Atom XML document 3. It MUST comply with :ref:`the metadata requirements` 4. It MUST reference an **object** or an **origin** in a deposit tag 5. The reference SHOULD exist in the SWH archive 6. The **object** reference MUST be a SWHID on one of the following artifact types: - origin - snapshot - release - revision - directory - content 7. The SWHID MAY be a :ref:`core identifier ` with or without :ref:`qualifiers ` -8. The SWHID MUST NOT reference a fragment of code with the classifier `lines` +8. The SWHID MUST NOT reference a fragment of code with the classifier ``lines`` A complete metadata example --------------------------- The reference element is included in the metadata xml atomEntry under the swh namespace: .. code:: xml HAL hal@ccsd.cnrs.fr The assignment problem https://hal.archives-ouvertes.fr/hal-01243573 other identifier, DOI, ARK Domain description Author1 Inria UPMC Author2 Inria UPMC References ---------- The metadata reference can be either on: - an origin - a graph object (core SWHID with or without qualifiers) Origins ^^^^^^^ The metadata may be on an origin, identified by the origin's URL: .. code:: xml Graph objects ^^^^^^^^^^^^^ It may also reference an object in the `SWH graph `: contents, directories, revisions, releases, and snapshots: .. code:: xml .. 
code:: xml

The value of the ``swhid`` attribute must be a `SWHID `, optionally with any of the following context qualifiers:

* ``origin``
* ``visit``
* ``anchor``
* ``path``

These qualifiers should be provided whenever relevant, especially ``origin``.

Other qualifiers are not allowed (for example, ``lines`` is rejected because SWH cannot store metadata at a finer level than entire contents).

Loading procedure
-----------------

In this case, the metadata-deposit will be injected as a metadata entry of the relevant object, with the information about the contributor of the deposit.
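
Coming back to the qualifier rules of the *Graph objects* section, the following is a minimal sketch of how a client could check a reference before sending a metadata-only deposit. It is an assumption-laden illustration, not the validation performed by ``swh.deposit``: the helper names ``split_swhid`` and ``rejected_qualifiers`` are hypothetical, the hash is a placeholder, and only the list of allowed qualifiers comes from this specification.

.. code:: python

   from typing import Dict, List, Tuple

   # Qualifiers accepted by this specification for a metadata-only deposit.
   ALLOWED_QUALIFIERS = {"origin", "visit", "anchor", "path"}


   def split_swhid(swhid: str) -> Tuple[str, Dict[str, str]]:
       """Split a SWHID into its core identifier and its qualifiers."""
       core, *raw_qualifiers = swhid.split(";")
       qualifiers = {}
       for raw in raw_qualifiers:
           # Split on the first "=" only, qualifier values may contain "=".
           key, _, value = raw.partition("=")
           qualifiers[key] = value
       return core, qualifiers


   def rejected_qualifiers(swhid: str) -> List[str]:
       """Return the qualifiers that this specification does not allow."""
       _, qualifiers = split_swhid(swhid)
       return sorted(set(qualifiers) - ALLOWED_QUALIFIERS)


   # Illustrative identifiers only: the hash below is a placeholder.
   core = "swh:1:cnt:" + "0" * 40
   ok = core + ";origin=https://hal.archives-ouvertes.fr/hal-01243573"
   bad = core + ";lines=9-15"

With these definitions, ``rejected_qualifiers(ok)`` is empty, while ``rejected_qualifiers(bad)`` reports the ``lines`` qualifier.
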