diff --git a/docs/dev-info.md b/docs/dev-info.md deleted file mode 100644 index bce40fe7..00000000 --- a/docs/dev-info.md +++ /dev/null @@ -1,162 +0,0 @@ -# Develop on swh-deposit - -There are multiple modes to run and test the server locally: -- development-like (automatic reloading when code changes) -- production-like (no reloading) -- integration tests (no side effects) - -Except for the tests which are mostly side effects free (except for -the database access), the other modes will need some configuration -files (up to 2) to run properly. - -## Database - -swh-deposit uses a database to store the state of a deposit. -The default db is expected to be called swh-deposit-dev. - -To simplify the use, the following makefile targets can be used: - -### schema - -``` Shell -make db-create db-prepare db-migrate -``` - -### data - -Once the db is created, you need some data to be injected (request -types, client, collection, etc...): - -``` Shell -make db-load-data db-load-private-data -``` - -The private data are about having a user (`hal`) with a password -(`hal`) who can access a collection (`hal`). - -Add the following to `../private-data.yaml`: - -``` YAML -- model: deposit.depositclient - fields: - user_ptr_id: 1 - collections: - - 1 -- model: auth.User - pk: 1 - fields: - first_name: hal - last_name: hal - username: hal - password: "pbkdf2_sha256$30000$8lxjoGc9PiBm$DO22vPUJCTM17zYogBgBg5zr/97lH4pw10Mqwh85yUM=" -- model: deposit.depositclient - fields: - user_ptr_id: 1 - collections: - - 1 - url: https://hal.inria.fr - -``` - -### drop - -For information, you can drop the db: - -``` Shell -make db-drop -``` - -## Development-like environment - -Development-like environment needs one configuration file to work -properly. 
- -### Configuration - -**`{/etc/softwareheritage | ~/.config/swh | ~/.swh}`/deposit/server.yml**: - -``` YAML -# dev option for running the server locally -host: 127.0.0.1 -port: 5006 - -# production -authentication: - activated: true - white-list: - GET: - - / - -# 20 Mib max size -max_upload_size: 20971520 - -``` - -### Run - -Run the local server, using the default configuration file: - -``` Shell -make run-dev -``` - -## Production-like environment - -Production-like environment needs two configuration files to work -properly. - -This is more close to what's actually running in production. - -### Configuration - -This expects the same file describes in the previous chapter. Plus, -an additional private **settings.yml** file containing secret -information that is not in the source code repository. - -**`{/etc/softwareheritage | ~/.config/swh | ~/.swh}`/deposit/private.yml**: - -``` YAML -secret_key: production-local -db: - name: swh-deposit-dev -``` - -A production configuration file would look like: - -``` YAML -secret_key: production-secret-key -db: - name: swh-deposit-dev - host: db - port: 5467 - user: user - password: user-password -``` - -### Run - -``` Shell -make run -``` - -Note: This expects gunicorn3 package installed on the system - -## Tests - -To run the tests: -``` Shell -make test -``` - -As explained, those tests are mostly side-effect free. The db part is -dealt with by django. The remaining part which patches those -side-effect behavior is dealt with in the -`swh/deposit/tests/__init__.py` module. 
- -## Sum up - -Prepare everything for your user to run: - -``` Shell -make db-drop db-create db-prepare db-migrate db-load-private-data run-dev -``` diff --git a/docs/dev-info.rst b/docs/dev-info.rst new file mode 100644 index 00000000..459ecf49 --- /dev/null +++ b/docs/dev-info.rst @@ -0,0 +1,174 @@ +Develop on swh-deposit +====================== + +There are multiple modes to run and test the server locally: + +* development-like (automatic reloading when code changes) +* production-like (no reloading) +* integration tests (no side effects) + +Except for the tests which are mostly side effects free (except for the +database access), the other modes will need some configuration files (up to 2) +to run properly. + +Database +-------- + +swh-deposit uses a database to store the state of a deposit. The default +db is expected to be called swh-deposit-dev. + +To simplify the use, the following makefile targets can be used: + +schema +~~~~~~ + +.. code:: shell + + make db-create db-prepare db-migrate + +data +~~~~ + +Once the db is created, you need some data to be injected (request +types, client, collection, etc...): + +.. code:: shell + + make db-load-data db-load-private-data + +The private data are about having a user (``hal``) with a password +(``hal``) who can access a collection (``hal``). + +Add the following to ``../private-data.yaml``: + +.. code:: yaml + + - model: deposit.depositclient + fields: + user_ptr_id: 1 + collections: + - 1 + - model: auth.User + pk: 1 + fields: + first_name: hal + last_name: hal + username: hal + password: "pbkdf2_sha256$30000$8lxjoGc9PiBm$DO22vPUJCTM17zYogBgBg5zr/97lH4pw10Mqwh85yUM=" + - model: deposit.depositclient + fields: + user_ptr_id: 1 + collections: + - 1 + url: https://hal.inria.fr + +drop +~~~~ + +For information, you can drop the db: + +.. code:: shell + + make db-drop + +Development-like environment +---------------------------- + +Development-like environment needs one configuration file to work +properly. 
+ +Configuration +~~~~~~~~~~~~~ + +**``{/etc/softwareheritage | ~/.config/swh | ~/.swh}``/deposit/server.yml**: + +.. code:: yaml + + # dev option for running the server locally + host: 127.0.0.1 + port: 5006 + + # production + authentication: + activated: true + white-list: + GET: + - / + + # 20 MiB max size + max_upload_size: 20971520 + +Run +~~~ + +Run the local server, using the default configuration file: + +.. code:: shell + + make run-dev + +Production-like environment +--------------------------- + +Production-like environment needs two configuration files to work +properly. + +This is closer to what's actually running in production. + +Configuration +~~~~~~~~~~~~~ + +This expects the same file described in the previous section, plus an +additional private **settings.yml** file containing secret information +that is not in the source code repository. + +**``{/etc/softwareheritage | ~/.config/swh | ~/.swh}``/deposit/private.yml**: + +.. code:: yaml + + secret_key: production-local + db: + name: swh-deposit-dev + +A production configuration file would look like: + +.. code:: yaml + + secret_key: production-secret-key + db: + name: swh-deposit-dev + host: db + port: 5467 + user: user + password: user-password + +Run +~~~ + +.. code:: shell + + make run + +Note: This expects the gunicorn3 package to be installed on the system. + +Tests +----- + +To run the tests: + +.. code:: shell + + make test + +As explained, those tests are mostly side-effect free. The db part is +dealt with by django. The remaining part, which patches the side-effect +behavior, is dealt with in the ``swh/deposit/tests/__init__.py`` module. + +Sum up +------ + +Prepare everything for your user to run: + +.. 
code:: shell + + make db-drop db-create db-prepare db-migrate db-load-private-data run-dev diff --git a/docs/getting-started.md b/docs/getting-started.md deleted file mode 100644 index 83a1435a..00000000 --- a/docs/getting-started.md +++ /dev/null @@ -1,333 +0,0 @@ -# Getting Started - -This is a getting started to demonstrate the deposit api use case with -a shell client. - -The api is rooted at https://deposit.softwareheritage.org. - -For more details, see the [main documentation](./index.html). - -## Requirements - -You need to be referenced on SWH's client list to have: -- a credential (needed for the basic authentication step). -- an associated collection - -[Contact us for more information.](https://www.softwareheritage.org/contact/) - -## Demonstration - -For the rest of the document, we will: -- reference `` as the client and `` as its -associated authentication password. -- use curl as example on how to request the api. -- present the main deposit use cases. - -The use cases are: - -- one single deposit step: The user posts in one query (one deposit) a - software source code archive and associated metadata (deposit is - finalized with status `deposited`). - - This will demonstrate the multipart query. - -- another 3-steps deposit (which can be extended as more than 2 - steps): - 1. Create an incomplete deposit (status `partial`) - 2. Update a deposit (and finalize it, so the status becomes - `deposited`) - 3. Check the deposit's state - - This will demonstrate the stateful nature of the sword protocol. - -Those use cases share a common part, they must start by requesting the -`service document iri` (internationalized resource identifier) for -information about the collection's location. - -### Common part - Start with the service document - -First, to determine the *collection iri* onto which deposit data, the -client needs to ask the server where is its *collection* located. That -is the role of the *service document iri*. 
- -For example: - -``` Shell -curl -i --user : https://deposit.softwareheritage.org/1/servicedocument/ -``` - -If everything went well, you should have received a response similar -to this: - -``` Shell -HTTP/1.0 200 OK -Server: WSGIServer/0.2 CPython/3.5.3 -Content-Type: application/xml - - - - - 2.0 - 209715200 - - - The Software Heritage (SWH) Archive - - Software Collection - application/zip - application/x-tar - Collection Policy - Software Heritage Archive - Collect, Preserve, Share - false - http://purl.org/net/sword/package/SimpleZip - https://deposit.softwareheritage.org/1// - - - -``` - -Explaining the response: -- `HTTP/1.0 200 OK`: the query is successful and returns a body response -- `Content-Type: application/xml`: The body response is in xml format -- `body response`: it is a service document describing that the client - `` has a collection named ``. That - collection is available at the *collection iri* - `/1//` (through POST query). - -At this level, if something went wrong, this should be authentication related. -So the response would have been a 401 Unauthorized access. -Something like: - -``` Shell -curl -i https://deposit.softwareheritage.org/1// -HTTP/1.0 401 Unauthorized -Server: WSGIServer/0.2 CPython/3.5.3 -Content-Type: application/xml -WWW-Authenticate: Basic realm="" -X-Frame-Options: SAMEORIGIN - - - - Access to this api needs authentication - processing failed - - -``` - -### Single deposit - -A single deposit translates to a multipart deposit request. - -This means, in swh's deposit's terms, sending exactly one POST query -with: -- 1 archive (content-type `application/zip` or `application/x-tar`) -- 1 atom xml content (`content-type: application/atom+xml;type=entry`) - -The supported archive, for now are limited to zip files. Those -archives are expected to contain some form of software source -code. The atom entry content is some xml defining metadata about that -software. 
- -Example of minimal atom entry file: - -``` XML - - - Title - urn:uuid:1225c695-cfb8-4ebb-aaaa-80da344efa6a - 2005-10-07T17:17:08Z - Contributor - The abstract - - - The abstract - Access Rights - Alternative Title - Date Available - Bibliographic Citation - Contributor - Description - Has Part - Has Version - Identifier - Is Part Of - Publisher - References - Rights Holder - Source - Title - Type - -``` - -Once the files are ready for deposit, we want to do the actual deposit -in one shot. - -For this, we need to provide: -- the contents and their associated correct content-types -- either the header `In-Progress` to false (meaning, it's finished -after this query) or nothing (the server will assume it's not in -progress if not present). -- Optionally, the `Slug` header, which is a reference to a unique -identifier the client knows about and wants to provide us. - -You can do this with the following command: - -``` Shell -curl -i --user : \ - -F "file=@deposit.zip;type=application/zip;filename=payload" \ - -F "atom=@atom-entry.xml;type=application/atom+xml;charset=UTF-8" \ - -H 'In-Progress: false' \ - -H 'Slug: some-external-id' \ - -XPOST https://deposit.softwareheritage.org/1// -``` - -You just posted a deposit to the collection -https://deposit.softwareheritage.org/1//. - -If everything went well, you should have received a response similar -to this: - -``` Shell -HTTP/1.0 201 Created -Server: WSGIServer/0.2 CPython/3.5.3 -Location: /1//10/metadata/ -Content-Type: application/xml - - - 9 - Sept. 26, 2017, 10:11 a.m. - payload - deposited - - - - - - - - - - - http://purl.org/net/sword/package/SimpleZip - -``` - -Explaining this response: -- `HTTP/1.0 201 Created`: the deposit is successful -- `Location: /1//10/metadata/`: the EDIT-SE-IRI through which we can - update a deposit -- body response: it is a deposit receipt detailing all endpoints - available to manipulate the deposit (update, replace, delete, - etc...) 
It also explains the deposit identifier to be 9 (which is - useful for the remaining example). - -Note: As the deposit is in `deposited` status, you cannot actually -update anything after this query. Well, the client can try, but it -will be answered with a 403 forbidden answer. - -### Multi-steps deposit - -#### Create a deposit - -We will use the collection IRI again as the starting point. - -We need to explicitely give to the server information about: -- the deposit's completeness (through header `In-Progress` to true, as - we want to do in multiple steps now). -- archive's md5 hash (through header `Content-MD5`) -- upload's type (through the headers `Content-Disposition` and - `Content-Type`) - -The following command: - -``` Shell -curl -i --user : \ - --data-binary @swh/deposit.tar.gz \ - -H 'In-Progress: true' \ - -H 'Content-MD5: 0faa1ecbf9224b9bf48a7c691b8c2b6f' \ - -H 'Content-Disposition: attachment; filename=[deposit.tar.gz]' \ - -H 'Slug: some-external-id' \ - -H 'Packaging: http://purl.org/net/sword/package/SimpleZIP' \ - -H 'Content-type: application/zip' \ - -XPOST https://deposit.softwareheritage.org/1// -``` - -The expected answer is the same as the previous sample. - -#### Update deposit's metadata - -To update a deposit, we can either add some more archives, some more -metadata or replace existing ones. - -As we don't have defined metadata yet (except for the `slug` header), -we can add some to the `EDIT-SE-IRI` endpoint (/1//10/metadata/). -That information is extracted from the deposit receipt sample. - -Using here the same atom-entry.xml file presented in previous chapter. 
- -For example, here is the command to update deposit metadata: - -``` Shell -curl -i --user : --data-binary @atom-entry.xml \ --H 'In-Progress: true' \ --H 'Slug: some-external-id' \ --H 'Content-Type: application/atom+xml;type=entry' \ --XPOST https://deposit.softwareheritage.org/1//10/metadata/ -HTTP/1.0 201 Created -Server: WSGIServer/0.2 CPython/3.5.3 -Location: /1//10/metadata/ -Content-Type: application/xml - - - 10 - Sept. 26, 2017, 10:32 a.m. - None - partial - - - - - - - - - - - http://purl.org/net/sword/package/SimpleZip - -``` - -#### Check the deposit's state - -You need to check the STATE-IRI endpoint (/1//10/status/). - -``` Shell -curl -i --user : https://deposit.softwareheritage.org/1//10/status/ -HTTP/1.0 200 OK -Date: Wed, 27 Sep 2017 08:25:53 GMT -Content-Type: application/xml -``` - -Response: - -``` XML - - 9 - deposited - deposit is fully received and ready for loading - - -``` diff --git a/docs/getting-started.rst b/docs/getting-started.rst new file mode 100644 index 00000000..8a1e6658 --- /dev/null +++ b/docs/getting-started.rst @@ -0,0 +1,342 @@ +Getting Started +=============== + +This is a getting started to demonstrate the deposit api use case with a +shell client. + +The api is rooted at https://deposit.softwareheritage.org. + +For more details, see the `main documentation <./index.html>`__. + +Requirements +------------ + +You need to be referenced on SWH's client list to have: + +* a credential (needed for the basic authentication step) +* an associated collection + +`Contact us for more +information. `__ + +Demonstration +------------- + +For the rest of the document, we will: + +* reference ```` as the client and ```` as its associated + authentication password. +* use curl as example on how to request the api. +* present the main deposit use cases. 
+ +The use cases are: + +* one single deposit step: The user posts in one query (one deposit) a software + source code archive and associated metadata (deposit is finalized with status + ``deposited``). + +This will demonstrate the multipart query. + +* another, 3-step deposit (which can be extended to more steps): + + 1. Create an incomplete deposit (status ``partial``) + 2. Update a deposit (and finalize it, so the status becomes ``deposited``) + 3. Check the deposit's state + +This will demonstrate the stateful nature of the sword protocol. + +Those use cases share a common part: they must start by requesting the +``service document iri`` (internationalized resource identifier) for +information about the collection's location. + +Common part - Start with the service document +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +First, to determine the *collection iri* onto which to deposit data, the +client needs to ask the server where its *collection* is located. That +is the role of the *service document iri*. + +For example: + +.. code:: shell + + curl -i --user : https://deposit.softwareheritage.org/1/servicedocument/ + +If everything went well, you should have received a response similar to +this: + +.. code:: shell + + HTTP/1.0 200 OK + Server: WSGIServer/0.2 CPython/3.5.3 + Content-Type: application/xml + + + + + 2.0 + 209715200 + + + The Software Heritage (SWH) Archive + + Software Collection + application/zip + application/x-tar + Collection Policy + Software Heritage Archive + Collect, Preserve, Share + false + http://purl.org/net/sword/package/SimpleZip + https://deposit.softwareheritage.org/1// + + + + +* ``HTTP/1.0 200 OK``: the query is successful and returns a body response +* ``Content-Type: application/xml``: The body response is in xml format +* body: it is a service document describing that the client ```` + has a collection named ````. That collection is available at + the *collection iri* ``/1//`` (through POST query). 
+ +At this stage, if something went wrong, it is most likely authentication +related, and the response would be a 401 Unauthorized. +Something like: + +.. code:: shell + + curl -i https://deposit.softwareheritage.org/1// + HTTP/1.0 401 Unauthorized + Server: WSGIServer/0.2 CPython/3.5.3 + Content-Type: application/xml + WWW-Authenticate: Basic realm="" + X-Frame-Options: SAMEORIGIN + + + + Access to this api needs authentication + processing failed + + + +Single deposit +~~~~~~~~~~~~~~ + +A single deposit translates to a multipart deposit request. + +This means, in swh's deposit's terms, sending exactly one POST query +with: + +* 1 archive (content-type ``application/zip`` or ``application/x-tar``) +* 1 atom xml content (``content-type: application/atom+xml;type=entry``) + +The supported archives are, for now, limited to zip files. Those archives +are expected to contain some form of software source code. The atom +entry content is some xml defining metadata about that software. + +Example of minimal atom entry file: + +.. code:: xml + + + + Title + urn:uuid:1225c695-cfb8-4ebb-aaaa-80da344efa6a + 2005-10-07T17:17:08Z + Contributor + The abstract + + + The abstract + Access Rights + Alternative Title + Date Available + Bibliographic Citation + Contributor + Description + Has Part + Has Version + Identifier + Is Part Of + Publisher + References + Rights Holder + Source + Title + Type + + +Once the files are ready for deposit, we want to do the actual deposit +in one shot. + +For this, we need to provide: + +* the contents and their associated correct content-types +* either the header ``In-Progress`` set to false (meaning, it's finished after + this query) or nothing (the server will assume it's not in progress if not + present). +* Optionally, the ``Slug`` header, which is a reference to a unique identifier + the client knows about and wants to provide us. + +You can do this with the following command: + +.. 
code:: shell + + curl -i --user : \ + -F "file=@deposit.zip;type=application/zip;filename=payload" \ + -F "atom=@atom-entry.xml;type=application/atom+xml;charset=UTF-8" \ + -H 'In-Progress: false' \ + -H 'Slug: some-external-id' \ + -XPOST https://deposit.softwareheritage.org/1// + +You just posted a deposit to the collection +https://deposit.softwareheritage.org/1//. + +If everything went well, you should have received a response similar to +this: + +.. code:: shell + + HTTP/1.0 201 Created + Server: WSGIServer/0.2 CPython/3.5.3 + Location: /1//10/metadata/ + Content-Type: application/xml + + + 9 + Sept. 26, 2017, 10:11 a.m. + payload + deposited + + + + + + + + + + + http://purl.org/net/sword/package/SimpleZip + + +* ``HTTP/1.0 201 Created``: the deposit is successful +* ``Location: /1//10/metadata/``: the EDIT-SE-IRI through + which we can update a deposit +* body: it is a deposit receipt detailing all endpoints available to manipulate + the deposit (update, replace, delete, etc.). It also gives the deposit + identifier, 9 (which is useful for the remaining example). + +Note: As the deposit is in ``deposited`` status, you cannot actually +update anything after this query. The client can try, but it will +receive a 403 Forbidden response. + +Multi-steps deposit +~~~~~~~~~~~~~~~~~~~ + +Create a deposit +^^^^^^^^^^^^^^^^ + +We will use the collection IRI again as the starting point. + +We need to explicitly give the server information about: + +* the deposit's completeness (through header ``In-Progress`` set to true, as we + want to do it in multiple steps now). +* archive's md5 hash (through header ``Content-MD5``) +* upload's type (through the headers ``Content-Disposition`` and + ``Content-Type``) + +The following command: + +.. 
code:: shell + + curl -i --user : \ + --data-binary @swh/deposit.tar.gz \ + -H 'In-Progress: true' \ + -H 'Content-MD5: 0faa1ecbf9224b9bf48a7c691b8c2b6f' \ + -H 'Content-Disposition: attachment; filename=[deposit.tar.gz]' \ + -H 'Slug: some-external-id' \ + -H 'Packaging: http://purl.org/net/sword/package/SimpleZIP' \ + -H 'Content-type: application/zip' \ + -XPOST https://deposit.softwareheritage.org/1// + +The expected answer is the same as the previous sample. + +Update deposit's metadata +^^^^^^^^^^^^^^^^^^^^^^^^^ + +To update a deposit, we can either add some more archives, some more +metadata or replace existing ones. + +As we don't have defined metadata yet (except for the ``slug`` header), +we can add some to the ``EDIT-SE-IRI`` endpoint (/1//10/metadata/). That +information is extracted from the deposit receipt sample. + +Using here the same atom-entry.xml file presented in previous chapter. + +For example, here is the command to update deposit metadata: + +.. code:: shell + + curl -i --user : --data-binary @atom-entry.xml \ + -H 'In-Progress: true' \ + -H 'Slug: some-external-id' \ + -H 'Content-Type: application/atom+xml;type=entry' \ + -XPOST https://deposit.softwareheritage.org/1//10/metadata/ + HTTP/1.0 201 Created + Server: WSGIServer/0.2 CPython/3.5.3 + Location: /1//10/metadata/ + Content-Type: application/xml + + + 10 + Sept. 26, 2017, 10:32 a.m. + None + partial + + + + + + + + + + + http://purl.org/net/sword/package/SimpleZip + + +Check the deposit's state +^^^^^^^^^^^^^^^^^^^^^^^^^ + +You need to check the STATE-IRI endpoint (/1//10/status/). + +.. code:: shell + + curl -i --user : https://deposit.softwareheritage.org/1//10/status/ + HTTP/1.0 200 OK + Date: Wed, 27 Sep 2017 08:25:53 GMT + Content-Type: application/xml + +Response: + +.. 
code:: xml + + + 9 + deposited + deposit is fully received and ready for loading + + diff --git a/docs/index.rst b/docs/index.rst index 9ec3e948..bfb35d51 100644 --- a/docs/index.rst +++ b/docs/index.rst @@ -1,22 +1,22 @@ .. _swh-deposit: Software Heritage Deposit ========================= .. toctree:: :maxdepth: 3 :caption: Contents: - getting-started.md - spec-api.md - metadata.md - spec-loading.md - dev-info.md - sys-info.md + getting-started.rst + spec-api.rst + metadata.rst + spec-loading.rst + dev-info.rst + sys-info.rst Indices and tables ================== * :ref:`genindex` * :ref:`modindex` * :ref:`search` diff --git a/docs/metadata.md b/docs/metadata.md deleted file mode 100644 index c3cb6073..00000000 --- a/docs/metadata.md +++ /dev/null @@ -1,166 +0,0 @@ -# Deposit metadata - -When making a software deposit into the SWH archive, one can add information -describing the software artifact and the software project. -and the metadata will be translated to the [CodeMeta v.2](https://doi.org/10.5063/SCHEMA/CODEMETA-2.0) vocabulary -if possible. - -## Metadata requirements - -MUST -- **the schema/vocabulary** used *MUST* be specified with a persistent url -(DublinCore, DOAP, CodeMeta, etc.) -```XML - -or - -or - -``` -- **the url** representing the location of the source *MUST* be provided -under the url tag. The url will be used for creating an origin object in the -archive. 
-```XML -www.url-example.com -or -www.url-example.com -or -www.url-example.com -``` -- **the external_identifier** *MUST* be provided as an identifier -- **the name** of the software deposit *MUST* be provided -[atom:title, codemeta:name, dcterms:title] -- **the authors** of the software deposit *MUST* be provided - - -SHOULD -- **the external_identifier** *SHOULD* match the Slug external-identifier in -the header -- **the description** of the software deposit *SHOULD* be provided -[codemeta:description] - short or long description of the software -- **the license/s** of the software deposit *SHOULD* be provided -[codemeta:license] - - -MAY -- other metadata *MAY* be added with terms defined by the schema in use. - -## Examples -### Using only Atom -```XML - - - Awesome Compiler - urn:uuid:1225c695-cfb8-4ebb-aaaa-80da344efa6a - 1785io25c695 - 2017-10-07T15:17:08Z - some awesome author - -``` -### Using Atom with CodeMeta -```XML - - - Awesome Compiler - urn:uuid:1225c695-cfb8-4ebb-aaaa-80da344efa6a - 1785io25c695 - 1785io25c695 - origin url - other identifier, DOI, ARK - Domain - - description - key-word 1 - key-word 2 - creation date - publication date - comment - - article name - article id - - - Collaboration/Projet - project name - id - - see also - Sponsor A - Sponsor B - Platform/OS - dependencies - Version - active - - license - url spdx - - .Net Framework 3.0 - Python2.3 - - author1 - Inria - UPMC - - - author2 - Inria - UPMC - - http://code.com - language 1 - language 2 - http://issuetracker.com - -``` -### Using Atom with DublinCore and CodeMeta (multi-schema entry) -``` XML - - - Awesome Compiler - hal - urn:uuid:1225c695-cfb8-4ebb-aaaa-80da344efa6a - %s - hal-01587361 - doi:10.5281/zenodo.438684 - The assignment problem - AffectationRO - author - [INFO] Computer Science [cs] - [INFO.INFO-RO] Computer Science [cs]/Operations Research [cs.RO] - SOFTWARE - Project in OR: The assignment problemA java implementation for the assignment problem first release - 
description fr - 2015-06-01 - 2017-10-19 - en - - - origin url - - 1.0.0 - key word - Comment - Rfrence interne - - link - Sponsor - - Platform/OS - dependencies - Ended - - license - url spdx - - - http://code.com - language 1 - language 2 - -``` diff --git a/docs/metadata.rst b/docs/metadata.rst new file mode 100644 index 00000000..6d777a12 --- /dev/null +++ b/docs/metadata.rst @@ -0,0 +1,182 @@ +Deposit metadata +================ + +When making a software deposit into the SWH archive, one can add +information describing the software artifact and the software project. +and the metadata will be translated to the `CodeMeta +v.2 `__ vocabulary if +possible. + +Metadata requirements +--------------------- + +- **the schema/vocabulary** used *MUST* be specified with a persistent url + (DublinCore, DOAP, CodeMeta, etc.) + + .. code:: xml + + + or + + or + + +- **the url** representing the location of the source *MUST* be provided under + the url tag. The url will be used for creating an origin object in the + archive. + + .. code:: xml + + www.url-example.com + or + www.url-example.com + or + www.url-example.com + +- **the external\_identifier** *MUST* be provided as an identifier + +- **the name** of the software deposit *MUST* be provided [atom:title, + codemeta:name, dcterms:title] + +- **the authors** of the software deposit *MUST* be provided + +- **the external\_identifier** *SHOULD* match the Slug external-identifier in + the header + +- **the description** of the software deposit *SHOULD* be provided + [codemeta:description] + +- short or long description of the software - **the license/s** of the software + deposit *SHOULD* be provided [codemeta:license] + +- other metadata *MAY* be added with terms defined by the schema in use. + +Examples +-------- + +Using only Atom +~~~~~~~~~~~~~~~ + +.. 
code:: xml + + + + Awesome Compiler + urn:uuid:1225c695-cfb8-4ebb-aaaa-80da344efa6a + 1785io25c695 + 2017-10-07T15:17:08Z + some awesome author + + +Using Atom with CodeMeta +~~~~~~~~~~~~~~~~~~~~~~~~ + +.. code:: xml + + + + Awesome Compiler + urn:uuid:1225c695-cfb8-4ebb-aaaa-80da344efa6a + 1785io25c695 + 1785io25c695 + origin url + other identifier, DOI, ARK + Domain + + description + key-word 1 + key-word 2 + creation date + publication date + comment + + article name + article id + + + Collaboration/Projet + project name + id + + see also + Sponsor A + Sponsor B + Platform/OS + dependencies + Version + active + + license + url spdx + + .Net Framework 3.0 + Python2.3 + + author1 + Inria + UPMC + + + author2 + Inria + UPMC + + http://code.com + language 1 + language 2 + http://issuetracker.com + + +Using Atom with DublinCore and CodeMeta (multi-schema entry) +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. code:: xml + + + + Awesome Compiler + hal + urn:uuid:1225c695-cfb8-4ebb-aaaa-80da344efa6a + %s + hal-01587361 + doi:10.5281/zenodo.438684 + The assignment problem + AffectationRO + author + [INFO] Computer Science [cs] + [INFO.INFO-RO] Computer Science [cs]/Operations Research [cs.RO] + SOFTWARE + Project in OR: The assignment problemA java implementation for the assignment problem first release + description fr + 2015-06-01 + 2017-10-19 + en + + + origin url + + 1.0.0 + key word + Comment + Rfrence interne + + link + Sponsor + + Platform/OS + dependencies + Ended + + license + url spdx + + + http://code.com + language 1 + language 2 + diff --git a/docs/spec-api.md b/docs/spec-api.md deleted file mode 100644 index b57785d3..00000000 --- a/docs/spec-api.md +++ /dev/null @@ -1,810 +0,0 @@ -# API Specification - -This is [Software Heritage](https://www.softwareheritage.org)'s -[SWORD 2.0](http://swordapp.github.io/SWORDv2-Profile/SWORDProfile.html) Server -implementation. 
- -**S.W.O.R.D** (**S**imple **W**eb-Service **O**ffering **R**epository -**D**eposit) is an interoperability standard for digital file deposit. - -This implementation will permit interaction between a client (a -repository) and a server (SWH repository) to permit deposits of -software source code archives and associated metadata. - -*Note:* - - In the following document, we will use the `archive` or `software - source code archive` interchangeably. - - The supported archive formats are: - - zip: common zip archive (no multi-disk zip files). - - tar: tar archive without compression or optionally any of the - following compression algorithm gzip (.tar.gz, .tgz), bzip2 - (.tar.bz2) , or lzma (.tar.lzma) - - -## Collection - -SWORD defines a `collection` concept. In SWH's case, this collection -refers to a group of deposits. A `deposit` is some form of software -source code archive(s) associated with metadata. - -*Note:* It may be multiple archives if one archive is too big and must be -splitted into multiple smaller ones. - -### Example - -As part of the -[HAL](https://hal.archives-ouvertes.fr/)-[SWH](https://www.softwareheritage.org) -collaboration, we define a `HAL collection` to which the `hal` client -will have access to. - -## Limitations - -We will not have a fully compliant SWORD 2.0 protocol at first, so -voluntary implementation shortcomings can exist, for example, only zip -tarballs will be accepted. 
- -Other more permanent limitations exists: -- upload limitation of 100Mib -- no mediation - -## Endpoints - -Here are the defined endpoints this document will refer to from this -point on: - -- `/1/servicedocument/` *service document iri* (a.k.a [SD-IRI](#sd-iri-the-service-document-iri)) - - *Goal:* For a client to discover its collection's location - -- `/1//` *collection iri* (a.k.a [COL-IRI](#col-iri-the-collection-iri)) - - *Goal:*: create deposit to a collection - -- `/1///media/` *update iri* (a.k.a [EM-IRI](#em-iri-the-atom-edit-media-iri)) - - *Goal:*: Add or replace archive(s) to a deposit - -- `/1///metadata/` *update iri* (a.k.a [EDIT-IRI](#edit-iri-the-atom-entry-edit-iri) merged with [SE-IRI](#se-iri-the-sword-edit-iri)) - - *Goal:*: Add or replace metadata (and optionally archive(s) to a - deposit - - -- `/1///status/` *state iri* (a.k.a [STATE-IRI](#state-iri-the-sword-statement-iri)) - - *Goal:*: Display deposit's status in regards to loading - -- `/1///content/` *content iri* (a.k.a [CONT-FILE-IRI](#cont-iri-the-content-iri)) - - *Goal:*: Display information on the content's representation in the - sword server - -## Use cases - -### Deposit creation - -From client's deposit repository server to SWH's repository server: - -[1.] The client requests for the server's abilities and its associated -collection (GET query to the *SD/service document uri*) - -[2.] The server answers the client with the service document which gives - the *collection uri* (also known as *COL/collection IRI*). - -[3.] The client sends a deposit (optionally a zip archive, some metadata -or both) through the *collection uri*. - -This can be done in: -- one POST request (metadata + archive). -- one POST request (metadata or archive) + other PUT or POST request - to the *update uris* (*edit-media iri* or *edit iri*) - - [3.1.] Server validates the client's input or returns detailed error if any - - [3.2.] 
Server stores information received (metadata or software - archive source code or both) - -[4.] The server notifies the client it acknowledged the client's -request. An `http 201 Created` response with a deposit receipt in the -body response is sent back. That deposit receipt will hold the -necessary information to eventually complete the deposit later on if -it was incomplete (also known as status `partial`). - -#### Schema representation - - - -![](/images/deposit-create-chart.png) - -### Updating an existing deposit - -[5.] Client updates existing deposit through the *update uris* (one or -more POST or PUT requests to either the *edit-media iri* or *edit -iri*). - - [5.1.] Server validates the client's input or returns detailed error - if any - - [5.2.] Server stores information received (metadata or software - archive source code or both) - -This would be the case for example if the client initially posted a -`partial` deposit (e.g. only metadata with no archive, or an archive -without metadata, or a splitted archive because the initial one -exceeded the limit size imposed by swh repository deposit) - -#### Schema representation - - - -![](/images/deposit-update-chart.png) - -### Deleting deposit (or associated archive, or associated metadata) - -[6.] Deposit deletion is possible as long as the deposit is still in - `partial` state. - - [6.1.] Server validates the client's input or returns detailed error - if any - - [6.2.] Server actually delete information according to request - -#### Schema representation - - - -![](/images/deposit-delete-chart.png) - -### Client asks for operation status - -[7.] Operation status can be read through a GET query to the *state - iri*. - -### Server: Triggering deposit checks - -Once the status `deposited` is reached for a deposit, checks for the -associated archive(s) and metadata will be triggered. If those checks -fail, the status is changed to `rejected` and nothing more happens -there. 
Otherwise, the status is changed to `verified`. - -### Server: Triggering deposit load - -Once the status `verified` is reached for a deposit, loading the -deposit with its associated metadata will be triggered. - -The loading will result on status update, either `done` or `failed` -(depending on the loading's status). - -This is described in the [loading document](./spec-loading.html). - -## API overview - -API access is over HTTPS. - -The API is protected through basic authentication. - -The API endpoints are rooted at -[https://deposit.softwareheritage.org/1/](https://deposit.softwareheritage.org/1/). - -Data is sent and received as XML (as specified in the SWORD 2.0 specification). - -In the following chapters, we will described the different endpoints -[through the use cases described previously.](#use-cases) - -### [2] Service document - -Endpoint: GET /1/servicedocument/ - -This is the starting endpoint for the client to discover its initial -collection. The answer to this query will describes: -- the server's abilities -- connected client's collection information - -Also known as: [SD-IRI - The Service Document IRI](#sd-iri-the-service-document-iri). - -#### Sample request - -``` Shell -GET https://deposit.softwareheritage.org/1/servicedocument/ HTTP/1.1 -Host: deposit.softwareheritage.org -``` - -The server returns its abilities with the service document in xml format: -- protocol sword version v2 -- accepted mime types: application/zip (zip), application/x-tar (tar - archive with any of the following optional compression algorithm - gzip, bzip2, or lzma) -- upload max size accepted. Beyond that point, it's expected the - client splits its tarball into multiple ones -- the collection the client can act upon (swh supports only one - software collection per client) -- mediation is not supported -- etc... 
- -The current answer for example for the -[hal archive](https://hal.archives-ouvertes.fr/) is: - -``` XML - - - - 2.0 - 20971520 - - - The Software Heritage (SWH) archive - - SWH Software Archive - application/zip - application/x-tar - Collection Policy - Software Heritage Archive - false - false - Collect, Preserve, Share - http://purl.org/net/sword/package/SimpleZip - https://deposit.softwareheritage.org/1/hal/ - - - -``` - -### [3|5] Deposit creation/update - -The client can send deposit creation/update through a series of -deposit requests to the following endpoints: -- *collection iri* (COL-IRI) to initialize a deposit -- *update iris* (EM-IRI, EDIT-SE-IRI) to complete/finalize a deposit - -The deposit creation/update can also happens in one request. - -The deposit request can contain: -- an archive holding the software source code (binary upload) -- an envelop with metadata describing information regarding a deposit - (atom entry deposit) -- or both (multipart deposit, exactly one archive and one envelop). - -#### Request Types - -##### Binary deposit - -The client can deposit a binary archive, supplying the following headers: -- Content-Type (text): accepted mimetype -- Content-Length (int): tarball size -- Content-MD5 (text): md5 checksum hex encoded of the tarball -- Content-Disposition (text): attachment; filename=[filename] ; the filename - parameter must be text (ascii) -- Packaging (IRI): http://purl.org/net/sword/package/SimpleZip -- In-Progress (bool): true to specify it's not the last request, false - to specify it's a final request and the server can go on with - processing the request's information (if not provided, this is - considered false, so final). - -This is a single zip archive deposit. Almost no metadata is associated -with the archive except for the unique external identifier. - -*Note:* This kind of deposit should be `partial` (In-Progress: True) as -almost no metadata can be associated with the uploaded archive. 
- -##### API endpoints concerned - -POST /1// Create a first deposit with one - archive -PUT /1///media/ Replace existing archives -POST /1///media/ Add new archive - -##### Sample request - -``` Shell -curl -i -u hal: \ - --data-binary @swh/deposit.zip \ - -H 'In-Progress: false' -H 'Content-MD5: 0faa1ecbf9224b9bf48a7c691b8c2b6f' \ - -H 'Content-Disposition: attachment; filename=[deposit.zip]' \ - -H 'Slug: some-external-id' \ - -H 'Packaging: http://purl.org/net/sword/package/SimpleZIP' \ - -H 'Content-type: application/zip' \ - -XPOST https://deposit.softwareheritage.org/1/hal/ -``` - -#### Atom entry deposit - -The client can deposit an xml body holding metadata information on the -deposit. - -*Note:* This kind of deposit is mostly expected to be `partial` -(In-Progress: True) since no archive will be associated to those -metadata. - -##### API endpoints concerned - -POST /1// Create a first atom deposit entry -PUT /1///metadata/ Replace existing metadata -POST /1///metadata/ Add new metadata to deposit - -##### Sample request - -Sample query: - -``` Shell -curl -i -u hal: --data-binary @atom-entry.xml \ --H 'In-Progress: false' \ --H 'Slug: some-external-id' \ --H 'Content-Type: application/atom+xml;type=entry' \ --XPOST https://deposit.softwareheritage.org/1/hal/ - -HTTP/1.0 201 Created -Date: Tue, 26 Sep 2017 10:32:35 GMT -Server: WSGIServer/0.2 CPython/3.5.3 -Vary: Accept, Cookie -Allow: GET, POST, PUT, DELETE, HEAD, OPTIONS -Location: /1/hal/10/metadata/ -X-Frame-Options: SAMEORIGIN -Content-Type: application/xml - - - 10 - Sept. 26, 2017, 10:32 a.m. 
- None - deposited - - - - - - - - - - - http://purl.org/net/sword/package/SimpleZip - -``` - -Sample body: - -``` XML - - Title - urn:uuid:1225c695-cfb8-4ebb-aaaa-80da344efa6a - 2005-10-07T17:17:08Z - Contributor - The abstract - - - The abstract - Access Rights - Alternative Title - Date Available - Bibliographic Citation # noqa - Contributor - Description - Has Part - Has Version - Identifier - Is Part Of - Publisher - References - Rights Holder - Source - Title - Type - - -``` - -#### One request deposit / Multipart deposit - -The one request deposit is a single request containing both the -metadata (as atom entry attachment) and the archive (as payload -attachment). Thus, it is a multipart deposit. - -Client provides: -- Content-Disposition (text): header of type 'attachment' on the Entry - Part with a name parameter set to 'atom' -- Content-Disposition (text): header of type 'attachment' on the Media - Part with a name parameter set to payload and a filename parameter - (the filename will be expressed in ASCII). -- Content-MD5 (text): md5 checksum hex encoded of the tarball -- Packaging (text): http://purl.org/net/sword/package/SimpleZip - (packaging format used on the Media Part) -- In-Progress (bool): true|false; true means `partial` upload and we can expect - other requests in the future, false means the deposit is done. 
-- add metadata formats or foreign markup to the atom:entry element - -##### API endpoints concerned - -POST /1// Create a full deposit (metadata + archive) -PUT /1///metadata/ Replace existing metadata and archive -POST /1///metadata/ Add new metadata and archive to deposit - -##### Sample request - -Sample query: - -``` Shell -curl -i -u hal: \ - -F "file=@../deposit.json;type=application/zip;filename=payload" \ - -F "atom=@../atom-entry.xml;type=application/atom+xml;charset=UTF-8" \ - -H 'In-Progress: false' \ - -H 'Slug: some-external-id' \ - -XPOST https://deposit.softwareheritage.org/1/hal/ - -HTTP/1.0 201 Created -Date: Tue, 26 Sep 2017 10:11:55 GMT -Server: WSGIServer/0.2 CPython/3.5.3 -Vary: Accept, Cookie -Allow: GET, POST, PUT, DELETE, HEAD, OPTIONS -Location: /1/hal/9/metadata/ -X-Frame-Options: SAMEORIGIN -Content-Type: application/xml - - - 9 - Sept. 26, 2017, 10:11 a.m. - payload - deposited - - - - - - - - - - - http://purl.org/net/sword/package/SimpleZip - -``` - -Sample content: - -``` XML -POST deposit HTTP/1.1 -Host: deposit.softwareheritage.org -Content-Length: [content length] -Content-Type: multipart/related; - boundary="===============1605871705=="; - type="application/atom+xml" -In-Progress: false -MIME-Version: 1.0 - -Media Post ---===============1605871705== -Content-Type: application/atom+xml; charset="utf-8" -Content-Disposition: attachment; name="atom" -MIME-Version: 1.0 - - - - Title - hal-or-other-archive-id - 2005-10-07T17:17:08Z - Contributor - - - The abstract - Access Rights - Alternative Title - Date Available - Bibliographic Citation # noqa - Contributor - Description - Has Part - Has Version - Identifier - Is Part Of - Publisher - References - Rights Holder - Source - Title - Type - ---===============1605871705== -Content-Type: application/zip -Content-Disposition: attachment; name=payload; filename=[filename] -Packaging: http://purl.org/net/sword/package/SimpleZip -Content-MD5: [md5-digest] -MIME-Version: 1.0 - -[...binary 
package data...] ---===============1605871705==-- -``` - -## Deposit Creation - server point of view - -The server receives the request(s) and does minimal checking on the -input prior to any saving operations. - -### [3|5|6.1] Validation of the header and body request - -Any kind of errors can happen, here is the list depending on the -situation: - -- common errors: - - 401 (unauthenticated) if a client does not provide credential or - provide wrong ones - - 403 (forbidden) if a client tries access to a collection it does - not own - - 404 (not found) if a client tries access to an unknown collection - - 404 (not found) if a client tries access to an unknown deposit - - 415 (unsupported media type) if a wrong media type is - provided to the endpoint - -- archive/binary deposit: - - 403 (forbidden) if the length of the archive exceeds the - max size configured - - 412 (precondition failed) if the length or hash provided - mismatch the reality of the archive. - - 415 (unsupported media type) if a wrong media type is - provided - -- multipart deposit: - - 412 (precondition failed) if the md5 hash provided mismatch the - reality of the archive - - 415 (unsupported media type) if a wrong media type is - provided - -- Atom entry deposit: - - 400 (bad request) if the request's body is empty (for creation only) - -### [3|5|6.2] Server uploads the content in a temporary location - -Using an objstorage, the server stores the archive in a temporary -location. It's deemed temporary the time the deposit is completed -(status becomes `deposited`) and the loading finishes. - -The server also persists requests' information in a database. - -### [4] Servers answers the client - -If everything went well, the server answers either with a 200, 201 or -204 response (depending on the actual endpoint) - -A `http 200` response is returned for GET endpoints. - -A `http 201 Created` response is returned for POST endpoints. The -body holds the deposit receipt. 
The headers holds the EDIT-IRI in the -Location header of the response. - -A `http 204 No Content` response is returned for PUT, DELETE -endpoints. - -If something went wrong, the server answers with one of the -[error status code and associated message mentioned](#possible errors)). - - -### [5] Deposit Update - -The client previously deposited a `partial` document (through an -archive, metadata, or both). The client wants to update information -for that previous deposit (possibly in multiple steps as well). - -The important thing to note here is that, as long as the deposit is in -status `partial`, the loading did not start. Thus, the client can -update information (replace or add new archive, new metadata, even -delete) for that same `partial` deposit. - -When the deposit status changes to `deposited`, the client can -no longer change the deposit's information (a 403 will be returned in -that case). - -Then aggregation of all those deposit's information will later be used -for the actual loading. - -Providing the collection name, and the identifier of the previous -deposit id received from the deposit receipt, the client executes a -POST or PUT request on the *update iris*. - -After validation of the body request, the server: -- uploads such content in a temporary location - -- answers the client an `http 204 (No content)`. In the Location - header of the response lies an iri to permit further update. - -- Asynchronously, the server will inject the archive uploaded and the - associated metadata. An operation status endpoint *state iri* - permits the client to query the loading operation status. 
- -#### Possible update endpoints - -PUT /1///media/ Replace existing archives for the deposit -POST /1///media/ Add new archives to the deposit -PUT /1///metadata/ Replace existing metadata (and possible archives) -POST /1///metadata/ Add new metadata - -### [6] Deposit Removal - -As long as the deposit's status remains `partial`, it's possible to -remove the deposit entirely or remove only the deposit's archive(s). - -If the deposit has been removed, further querying that deposit will -return a *404* response. - -If the deposit's archive(s) has been removed, we can still ensue other -query to update that deposit. - -### Operation Status - -Providing a collection name and a deposit id, the client asks the -operation status of a prior deposit. - -URL: GET /1///status/ - -This returns: -- *201* response with the actual status -- *404* if the deposit does not exist (or no longer does) - -## Possible errors - -### sword:ErrorContent - -IRI: `http://purl.org/net/sword/error/ErrorContent` - -The supplied format is not the same as that identified in the -Packaging header and/or that supported by the server Associated HTTP - -Associated HTTP status: *415 (Unsupported Media Type)* - -### sword:ErrorChecksumMismatch - -IRI: `http://purl.org/net/sword/error/ErrorChecksumMismatch` - -Checksum sent does not match the calculated checksum. - -Associated HTTP status: *412 Precondition Failed* - -### sword:ErrorBadRequest - -IRI: `http://purl.org/net/sword/error/ErrorBadRequest` - -Some parameters sent with the POST/PUT were not understood. - -Associated HTTP status: *400 Bad Request* - -### sword:MediationNotAllowed - -IRI: `http://purl.org/net/sword/error/MediationNotAllowed` - -Used where a client has attempted a mediated deposit, but this is not -supported by the server. 
- -Associated HTTP status: *412 Precondition Failed* - -### sword:MethodNotAllowed - -IRI: `http://purl.org/net/sword/error/MethodNotAllowed` - -Used when the client has attempted one of the HTTP update verbs (POST, -PUT, DELETE) but the server has decided not to respond to such -requests on the specified resource at that time. - -Associated HTTP Status: *405 Method Not Allowed* - -### sword:MaxUploadSizeExceeded - -IRI: `http://purl.org/net/sword/error/MaxUploadSizeExceeded` - -Used when the client has attempted to supply to the server a file -which exceeds the server's maximum upload size limit - -Associated HTTP Status: *413 (Request Entity Too Large)* - -### sword:Unauthorized - -IRI: `http://purl.org/net/sword/error/ErrorUnauthorized` - -The access to the api is through authentication. - -Associated HTTP status: *401* - -### sword:Forbidden - -IRI: `http://purl.org/net/sword/error/ErrorForbidden` - -The action is forbidden (access to another collection for example). - -Associated HTTP status: *403* - -## Nomenclature - -SWORD uses IRI notion, Internationalized Resource Identifier. In this -chapter, we will describe SWH's IRIs. - -### SD-IRI - The Service Document IRI - -The Service Document IRI. This is the IRI from which the client can -discover its collection IRI. - -HTTP verbs supported: *GET* - -### Col-IRI - The Collection IRI - -The software collection associated to one user. - -The SWORD Collection IRI is the IRI to which the initial deposit will -take place, and which is listed in the Service Document. - -Following our previous example, this is: -https://deposit.softwareheritage.org/1/hal/. - -HTTP verbs supported: *POST* - -### Cont-IRI - The Content IRI - -This is the endpoint which permits the client to retrieve -representations of the object as it resides in the SWORD server. - -This will display information about the content and its associated -metadata. - -HTTP verbs supported: *GET* - -*Note:* We also refer to it as *Cont-File-IRI*. 
- -### EM-IRI - The Atom Edit Media IRI - -This is the endpoint to upload other related archives for the same -deposit. - -It is used to change a `partial` deposit in regards of archives, in -particular: -- replace existing archives with new ones -- add new archives -- delete archives from a deposit - -Example use case: -A first archive to put exceeds the deposit's limit size. -The client can thus split the archives in multiple ones. -Post a first `partial` archive to the Col-IRI (with In-Progress: - -True). Then, in order to complete the deposit, POST the other -remaining archives to the EM-IRI (the last one with the In-Progress -header to False). - -HTTP verbs supported: *POST*, *PUT*, *DELETE* - -### Edit-IRI - The Atom Entry Edit IRI - -This is the endpoint to change a `partial` deposit in regards of -metadata. In particular: -- replace existing metadata (and archives) with new ones -- add new metadata (and archives) -- delete deposit - -HTTP verbs supported: *POST*, *PUT*, *DELETE* - -*Note:* We also refer to it as *Edit-SE-IRI*. - -### SE-IRI - The SWORD Edit IRI - -The sword specification permits to merge this with EDIT-IRI, so we -did. - -*Note:* We also refer to it as *Edit-SE-IRI*. - -### State-IRI - The SWORD Statement IRI - -This is the IRI which can be used to retrieve a description of the -object from the sword server, including the structure of the object -and its state. This will be used as the operation status endpoint. 
- -HTTP verbs supported: *GET* - -## Sources - -- [SWORD v2 specification](http://swordapp.github.io/SWORDv2-Profile/SWORDProfile.html) -- [arxiv documentation](https://arxiv.org/help/submit_sword) -- [Dataverse example](http://guides.dataverse.org/en/4.3/api/sword.html) -- [SWORD used on HAL](https://api.archives-ouvertes.fr/docs/sword) -- [xml examples for CCSD](https://github.com/CCSDForge/HAL/tree/master/Sword) diff --git a/docs/spec-api.rst b/docs/spec-api.rst new file mode 100644 index 00000000..7a668e6b --- /dev/null +++ b/docs/spec-api.rst @@ -0,0 +1,885 @@ +API Specification +================= + +This is `Software Heritage `__'s +`SWORD +2.0 `__ +Server implementation. + +**S.W.O.R.D** (**S**\ imple **W**\ eb-Service **O**\ ffering +**R**\ epository **D**\ eposit) is an interoperability standard for +digital file deposit. + +This implementation will permit interaction between a client (a repository) and +a server (SWH repository) to permit deposits of software source code archives +and associated metadata. + +*Note:* + +* In the following document, we will use the ``archive`` or ``software source + code archive`` interchangeably. +* The supported archive formats are: + + * zip: common zip archive (no multi-disk zip files). + * tar: tar archive without compression or optionally any of the following + compression algorithm gzip (.tar.gz, .tgz), bzip2 (.tar.bz2) , or lzma + (.tar.lzma) + +Collection +---------- + +SWORD defines a ``collection`` concept. In SWH's case, this collection +refers to a group of deposits. A ``deposit`` is some form of software +source code archive(s) associated with metadata. + +*Note:* It may be multiple archives if one archive is too big and must +be splitted into multiple smaller ones. + +Example +~~~~~~~ + +As part of the +`HAL `__-`SWH `__ +collaboration, we define a ``HAL collection`` to which the ``hal`` +client will have access to. 
+ +Limitations +----------- + +We will not have a fully compliant SWORD 2.0 protocol at first, so +voluntary implementation shortcomings can exist, for example, only zip +tarballs will be accepted. + +Other, more permanent limitations exist: + +* upload limitation of 100 MiB +* no mediation + +Endpoints +--------- + +Here are the defined endpoints this document will refer to from this +point on: + +* ``/1/servicedocument/`` *service document iri* (a.k.a `SD-IRI + <#sd-iri-the-service-document-iri>`__) + + *Goal:* For a client to discover its collection's location + +* ``/1//`` *collection iri* (a.k.a `COL-IRI + <#col-iri-the-collection-iri>`__) + + *Goal:* Create a deposit in a collection + +* ``/1///media/`` *update iri* (a.k.a + `EM-IRI <#em-iri-the-atom-edit-media-iri>`__) + + *Goal:* Add or replace archive(s) of a deposit + +* ``/1///metadata/`` *update iri* (a.k.a `EDIT-IRI + <#edit-iri-the-atom-entry-edit-iri>`__ merged with `SE-IRI + <#se-iri-the-sword-edit-iri>`__) + + *Goal:* Add or replace metadata (and optionally archive(s)) of a deposit + +* ``/1///status/`` *state iri* (a.k.a `STATE-IRI + <#state-iri-the-sword-statement-iri>`__) + + *Goal:* Display the deposit's status with regard to loading + +* ``/1///content/`` *content iri* (a.k.a + `CONT-FILE-IRI <#cont-iri-the-content-iri>`__) + + *Goal:* Display information on the content's representation in the sword + server + +Use cases +--------- + +Deposit creation +~~~~~~~~~~~~~~~~ + +From client's deposit repository server to SWH's repository server: + +1. The client requests the server's abilities and its associated collection + (GET query to the *SD/service document uri*) + +2. The server answers the client with the service document which gives the + *collection uri* (also known as *COL/collection IRI*). + +3. The client sends a deposit (optionally a zip archive, some metadata or both) + through the *collection uri*. + + This can be done in: + + * one POST request (metadata + archive).
+ * one POST request (metadata or archive) + other PUT or POST request to the + *update uris* (*edit-media iri* or *edit iri*) + + 1. Server validates the client's input or returns a detailed error, if any + + 2. Server stores the information received (metadata or software archive source + code or both) + +4. The server notifies the client it acknowledged the client's request. An + ``http 201 Created`` response with a deposit receipt in the response body is + sent back. That deposit receipt will hold the necessary information to + eventually complete the deposit later on if it was incomplete (also known as + status ``partial``). + +Schema representation +^^^^^^^^^^^^^^^^^^^^^ + +.. raw:: html + + + +.. figure:: /images/deposit-create-chart.png + :alt: + +Updating an existing deposit +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +5. Client updates an existing deposit through the *update uris* (one or more POST + or PUT requests to either the *edit-media iri* or *edit iri*). + + 1. Server validates the client's input or returns a detailed error, if any + + 2. Server stores the information received (metadata or software archive source + code or both) + + This would be the case for example if the client initially posted a + ``partial`` deposit (e.g. only metadata with no archive, or an archive + without metadata, or a split archive because the initial one exceeded + the size limit imposed by the swh deposit repository) + +Schema representation +^^^^^^^^^^^^^^^^^^^^^ + +.. raw:: html + + + +.. figure:: /images/deposit-update-chart.png + :alt: + +Deleting deposit (or associated archive, or associated metadata) +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +6. Deposit deletion is possible as long as the deposit is still in ``partial`` + state. + + 1. Server validates the client's input or returns a detailed error, if any + 2. Server deletes the information according to the request + +Schema representation +^^^^^^^^^^^^^^^^^^^^^ + +.. raw:: html + + + +..
figure:: /images/deposit-delete-chart.png + :alt: + +Client asks for operation status +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +7. Operation status can be read through a GET query to the *state iri*. + +Server: Triggering deposit checks +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Once the status ``deposited`` is reached for a deposit, checks for the +associated archive(s) and metadata will be triggered. If those checks +fail, the status is changed to ``rejected`` and nothing more happens +there. Otherwise, the status is changed to ``verified``. + +Server: Triggering deposit load +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Once the status ``verified`` is reached for a deposit, loading the +deposit with its associated metadata will be triggered. + +The loading will result in a status update, either ``done`` or ``failed`` +(depending on the loading's status). + +This is described in the `loading document <./spec-loading.html>`__. + +API overview +------------ + +API access is over HTTPS. + +The API is protected through basic authentication. + +The API endpoints are rooted at https://deposit.softwareheritage.org/1/. + +Data is sent and received as XML (as specified in the SWORD 2.0 +specification). + +In the following chapters, we will describe the different endpoints +`through the use cases described previously. <#use-cases>`__ + +[2] Service document +~~~~~~~~~~~~~~~~~~~~ + +Endpoint: GET /1/servicedocument/ + +This is the starting endpoint for the client to discover its initial +collection. The answer to this query describes: + +* the server's abilities +* connected client's collection information + + Also known as: `SD-IRI - The Service Document IRI + <#sd-iri-the-service-document-iri>`__. + +Sample request +^^^^^^^^^^^^^^ + +..
code:: shell + + GET https://deposit.softwareheritage.org/1/servicedocument/ HTTP/1.1 + Host: deposit.softwareheritage.org + +The server returns its abilities with the service document in xml format: + +* protocol sword version v2 +* accepted mime types: application/zip (zip), application/x-tar (tar archive + with any of the following optional compression algorithms: gzip, bzip2, or + lzma) +* upload max size accepted. Beyond that point, it's expected the client splits + its tarball into multiple ones +* the collection the client can act upon (swh supports only one software + collection per client) +* mediation is not supported + +For example, the current answer for the `HAL archive +`__ is: + +.. code:: xml + + + + + 2.0 + 20971520 + + + The Software Heritage (SWH) archive + + SWH Software Archive + application/zip + application/x-tar + Collection Policy + Software Heritage Archive + false + false + Collect, Preserve, Share + http://purl.org/net/sword/package/SimpleZip + https://deposit.softwareheritage.org/1/hal/ + + + + +[3\|5] Deposit creation/update +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The client can send deposit creation/update through a series of deposit +requests to the following endpoints: + +* *collection iri* (COL-IRI) to initialize a deposit +* *update iris* (EM-IRI, EDIT-SE-IRI) to complete/finalize a deposit + +The deposit creation/update can also happen in one request. + +The deposit request can contain: + +* an archive holding the software source code (binary upload) +* an envelope with metadata describing information regarding a deposit (atom + entry deposit) +* or both (multipart deposit, exactly one archive and one envelope).
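For the binary upload case, the request could be assembled along the following lines. This is only a minimal sketch, not the project's client: the header names and the collection URL are taken from the curl sample in this document, while the helper name ``build_binary_deposit`` and the stand-in archive bytes are assumptions made for illustration. The request is built but never sent, and basic authentication is left out.

```python
import hashlib
import urllib.request

# Packaging IRI as listed in the service document above.
SIMPLE_ZIP = "http://purl.org/net/sword/package/SimpleZip"


def build_binary_deposit(url, archive_bytes, filename, external_id, in_progress):
    """Build (but do not send) a POST request for a binary zip deposit.

    Hypothetical helper: it only mirrors the headers of the curl sample.
    """
    headers = {
        "Content-Type": "application/zip",
        # Hex-encoded md5 checksum of the archive bytes.
        "Content-MD5": hashlib.md5(archive_bytes).hexdigest(),
        "Content-Disposition": f"attachment; filename={filename}",
        "Packaging": SIMPLE_ZIP,
        "Slug": external_id,
        # "true" announces further requests, "false" marks the final one.
        "In-Progress": "true" if in_progress else "false",
    }
    return urllib.request.Request(
        url, data=archive_bytes, headers=headers, method="POST"
    )


request = build_binary_deposit(
    "https://deposit.softwareheritage.org/1/hal/",
    b"PK\x05\x06" + b"\x00" * 18,  # bytes of an empty zip archive, as a stand-in
    "deposit.zip",
    "some-external-id",
    in_progress=True,
)
print(request.get_method(), request.full_url)
for name, value in request.header_items():
    print(f"{name}: {value}")
```

Note that ``urllib.request.Request`` normalizes header names (``Content-MD5`` is stored as ``Content-md5``); HTTP header names are case-insensitive, so this does not change what is sent.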
+ +Request Types +^^^^^^^^^^^^^ + +Binary deposit +'''''''''''''' + +The client can deposit a binary archive, supplying the following +headers: + +* Content-Type (text): accepted mimetype +* Content-Length (int): tarball size +* Content-MD5 (text): hex-encoded md5 checksum of the tarball +* Content-Disposition (text): attachment; filename=[filename] ; the filename + parameter must be text (ascii) +* Packaging (IRI): http://purl.org/net/sword/package/SimpleZip +* In-Progress (bool): true to specify it's not the last request, false to + specify it's a final request and the server can go on with processing the + request's information (if not provided, this is considered false, so final). + +This is a single zip archive deposit. Almost no metadata is associated +with the archive except for the unique external identifier. + +*Note:* This kind of deposit should be ``partial`` (In-Progress: True) +as almost no metadata can be associated with the uploaded archive. + +API endpoints concerned +''''''''''''''''''''''' + +* POST /1// Create a first deposit with one archive +* PUT /1///media/ Replace existing archives +* POST /1///media/ Add new archive + +Sample request +'''''''''''''' + +.. code:: shell + + curl -i -u hal: \ + --data-binary @swh/deposit.zip \ + -H 'In-Progress: false' -H 'Content-MD5: 0faa1ecbf9224b9bf48a7c691b8c2b6f' \ + -H 'Content-Disposition: attachment; filename=[deposit.zip]' \ + -H 'Slug: some-external-id' \ + -H 'Packaging: http://purl.org/net/sword/package/SimpleZip' \ + -H 'Content-type: application/zip' \ + -XPOST https://deposit.softwareheritage.org/1/hal/ + +Atom entry deposit +^^^^^^^^^^^^^^^^^^ + +The client can deposit an xml body holding metadata information on the +deposit. + +*Note:* This kind of deposit is mostly expected to be ``partial`` +(In-Progress: True) since no archive will be associated with that +metadata.
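An Atom entry deposit could be prepared as sketched below. This is a hedged illustration rather than the reference client: the headers follow the curl samples in this document, but the set of Atom elements used (``title``, ``id``, ``updated``, ``author/name``) is an assumption based on the standard Atom vocabulary, and the helper names ``build_atom_entry`` and ``build_atom_deposit`` are hypothetical. Nothing is sent over the network.

```python
import urllib.request
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"


def build_atom_entry(title, entry_id, updated, author_name):
    """Serialize a minimal Atom entry (element choice is an assumption)."""
    ET.register_namespace("", ATOM_NS)  # emit Atom as the default namespace
    entry = ET.Element(f"{{{ATOM_NS}}}entry")
    ET.SubElement(entry, f"{{{ATOM_NS}}}title").text = title
    ET.SubElement(entry, f"{{{ATOM_NS}}}id").text = entry_id
    ET.SubElement(entry, f"{{{ATOM_NS}}}updated").text = updated
    author = ET.SubElement(entry, f"{{{ATOM_NS}}}author")
    ET.SubElement(author, f"{{{ATOM_NS}}}name").text = author_name
    return ET.tostring(entry, encoding="utf-8")


def build_atom_deposit(url, body, external_id, in_progress):
    """Build (but do not send) a POST request carrying only metadata."""
    headers = {
        "Content-Type": "application/atom+xml;type=entry",
        "Slug": external_id,
        # Metadata-only deposits are mostly expected to stay partial.
        "In-Progress": "true" if in_progress else "false",
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


body = build_atom_entry(
    "Awesome Compiler",
    "urn:uuid:1225c695-cfb8-4ebb-aaaa-80da344efa6a",
    "2017-10-07T15:17:08Z",
    "some awesome author",
)
atom_request = build_atom_deposit(
    "https://deposit.softwareheritage.org/1/hal/",
    body,
    "some-external-id",
    in_progress=True,
)
print(body.decode("utf-8"))
```

Keeping ``In-Progress`` set to true leaves the deposit in the ``partial`` state, so an archive can still be added later through the update endpoints.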
+ +API endpoints concerned +''''''''''''''''''''''' + +POST /1// Create a first atom deposit entry PUT /1///metadata/ Replace +existing metadata POST /1///metadata/ Add new metadata to deposit + +Sample request +'''''''''''''' + +Sample query: + +.. code:: shell + + curl -i -u hal: --data-binary @atom-entry.xml \ + -H 'In-Progress: false' \ + -H 'Slug: some-external-id' \ + -H 'Content-Type: application/atom+xml;type=entry' \ + -XPOST https://deposit.softwareheritage.org/1/hal/ + + HTTP/1.0 201 Created + Date: Tue, 26 Sep 2017 10:32:35 GMT + Server: WSGIServer/0.2 CPython/3.5.3 + Vary: Accept, Cookie + Allow: GET, POST, PUT, DELETE, HEAD, OPTIONS + Location: /1/hal/10/metadata/ + X-Frame-Options: SAMEORIGIN + Content-Type: application/xml + + + 10 + Sept. 26, 2017, 10:32 a.m. + None + deposited + + + + + + + + + + + http://purl.org/net/sword/package/SimpleZip + + +Sample body: + +.. code:: xml + + + Title + urn:uuid:1225c695-cfb8-4ebb-aaaa-80da344efa6a + 2005-10-07T17:17:08Z + Contributor + The abstract + + + The abstract + Access Rights + Alternative Title + Date Available + Bibliographic Citation # noqa + Contributor + Description + Has Part + Has Version + Identifier + Is Part Of + Publisher + References + Rights Holder + Source + Title + Type + + + +One request deposit / Multipart deposit +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +The one request deposit is a single request containing both the metadata +(as atom entry attachment) and the archive (as payload attachment). +Thus, it is a multipart deposit. + +Client provides: + +* Content-Disposition (text): header of type 'attachment' on the Entry Part + with a name parameter set to 'atom' +* Content-Disposition (text): header of type 'attachment' on the Media Part + with a name parameter set to payload and a filename parameter (the filename + will be expressed in ASCII). 
+* Content-MD5 (text): md5 checksum hex encoded of the tarball +* Packaging (text): http://purl.org/net/sword/package/SimpleZip (packaging + format used on the Media Part) +* In-Progress (bool): true\|false; true means ``partial`` upload and we can + expect other requests in the future, false means the deposit is done. +* add metadata formats or foreign markup to the atom:entry element + +API endpoints concerned +''''''''''''''''''''''' + +POST /1// Create a full deposit (metadata + archive) PUT /1///metadata/ +Replace existing metadata and archive POST /1///metadata/ Add new +metadata and archive to deposit + +Sample request +'''''''''''''' + +Sample query: + +.. code:: shell + + curl -i -u hal: \ + -F "file=@../deposit.json;type=application/zip;filename=payload" \ + -F "atom=@../atom-entry.xml;type=application/atom+xml;charset=UTF-8" \ + -H 'In-Progress: false' \ + -H 'Slug: some-external-id' \ + -XPOST https://deposit.softwareheritage.org/1/hal/ + + HTTP/1.0 201 Created + Date: Tue, 26 Sep 2017 10:11:55 GMT + Server: WSGIServer/0.2 CPython/3.5.3 + Vary: Accept, Cookie + Allow: GET, POST, PUT, DELETE, HEAD, OPTIONS + Location: /1/hal/9/metadata/ + X-Frame-Options: SAMEORIGIN + Content-Type: application/xml + + + 9 + Sept. 26, 2017, 10:11 a.m. + payload + deposited + + + + + + + + + + + http://purl.org/net/sword/package/SimpleZip + + +Sample content: + +.. 
code:: xml + + POST deposit HTTP/1.1 + Host: deposit.softwareheritage.org + Content-Length: [content length] + Content-Type: multipart/related; + boundary="===============1605871705=="; + type="application/atom+xml" + In-Progress: false + MIME-Version: 1.0 + + Media Post + --===============1605871705== + Content-Type: application/atom+xml; charset="utf-8" + Content-Disposition: attachment; name="atom" + MIME-Version: 1.0 + + + + Title + hal-or-other-archive-id + 2005-10-07T17:17:08Z + Contributor + + + The abstract + Access Rights + Alternative Title + Date Available + Bibliographic Citation # noqa + Contributor + Description + Has Part + Has Version + Identifier + Is Part Of + Publisher + References + Rights Holder + Source + Title + Type + + --===============1605871705== + Content-Type: application/zip + Content-Disposition: attachment; name=payload; filename=[filename] + Packaging: http://purl.org/net/sword/package/SimpleZip + Content-MD5: [md5-digest] + MIME-Version: 1.0 + + [...binary package data...] + --===============1605871705==-- + +Deposit Creation - server point of view +--------------------------------------- + +The server receives the request(s) and does minimal checking on the +input prior to any saving operations. 
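As an illustration of this minimal checking, the sketch below validates the media type and checksum of a binary upload. It is not the actual swh-deposit code: the function name ``check_binary_upload``, the ``ACCEPTED_TYPES`` set and the returned tuples are hypothetical. Only the accepted mime types and the 412 and 415 status codes come from this specification.

```python
import hashlib

# Accepted mime types, as advertised in the service document.
ACCEPTED_TYPES = {"application/zip", "application/x-tar"}


def check_binary_upload(body, content_type, content_md5):
    """Return (status, reason); status is None when the input passes.

    Hypothetical helper, not part of swh-deposit.
    """
    if content_type not in ACCEPTED_TYPES:
        return 415, "unsupported media type"
    # Content-MD5 is expected to be the hex-encoded md5 of the archive.
    if hashlib.md5(body).hexdigest() != content_md5.lower():
        return 412, "checksum mismatch"
    return None, "ok"
```

A real server would run further checks (authentication, collection ownership, upload size) before saving anything.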
+ +[3\|5\|6.1] Validation of the header and body request +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Several kinds of errors can happen. Here is the list, depending on the +situation: + +* common errors: + + * 401 (unauthenticated) if a client does not provide credentials or provides + wrong ones + * 403 (forbidden) if a client tries to access a collection it does not own + * 404 (not found) if a client tries to access an unknown collection + * 404 (not found) if a client tries to access an unknown deposit + * 415 (unsupported media type) if a wrong media type is provided to the + endpoint + +* archive/binary deposit: + + * 403 (forbidden) if the length of the archive exceeds the max size + configured + * 412 (precondition failed) if the length or hash provided does not match + the actual archive + * 415 (unsupported media type) if a wrong media type is provided + +* multipart deposit: + + * 412 (precondition failed) if the md5 hash provided does not match the + actual archive + * 415 (unsupported media type) if a wrong media type is provided + +* Atom entry deposit: + + * 400 (bad request) if the request's body is empty (for creation only) + +[3\|5\|6.2] Server uploads the content in a temporary location +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Using an objstorage, the server stores the archive in a temporary +location. The archive stays there until the deposit is completed +(status becomes ``deposited``) and the loading finishes. + +The server also persists the requests' information in a database. + +[4] Server answers the client +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +If everything went well, the server answers with either a 200, 201 or +204 response (depending on the actual endpoint). + +A ``http 200`` response is returned for GET endpoints. + +A ``http 201 Created`` response is returned for POST endpoints. The body +holds the deposit receipt. The Location header of the response holds the +EDIT-IRI. 
+ +A ``http 204 No Content`` response is returned for PUT, DELETE +endpoints. + +If something went wrong, the server answers with one of the `error +status codes and associated messages <#possible%20errors>`__. + +[5] Deposit Update +~~~~~~~~~~~~~~~~~~ + +The client previously deposited a ``partial`` document (through an +archive, metadata, or both). The client wants to update information for +that previous deposit (possibly in multiple steps as well). + +The important thing to note here is that, as long as the deposit is in +status ``partial``, the loading has not started. Thus, the client can +update information (replace or add new archive, new metadata, even +delete) for that same ``partial`` deposit. + +When the deposit status changes to ``deposited``, the client can no +longer change the deposit's information (a 403 will be returned in that +case). + +The aggregation of all the deposit's information will later be used +for the actual loading. + +Providing the collection name and the identifier of the previous +deposit, received from the deposit receipt, the client executes a POST +or PUT request on the *update iris*. + +After validation of the body request, the server: + +- uploads such content in a temporary location + +- answers the client with an ``http 204 (No content)``. The Location header of + the response holds an iri that permits further updates. + +- asynchronously injects the uploaded archive and the associated metadata. + An operation status endpoint *state iri* permits the + client to query the loading operation status. 
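The update and status endpoints are derived from the collection name and the deposit id returned in the deposit receipt. A minimal sketch, assuming the path layout visible in the ``Location`` headers of the sample responses (for example ``/1/hal/10/metadata/``); the function name ``deposit_iris`` and the dictionary keys are hypothetical:

```python
BASE_URL = "https://deposit.softwareheritage.org"


def deposit_iris(collection, deposit_id, base_url=BASE_URL):
    """Build the IRIs used to update or query one deposit (illustrative only)."""
    root = f"{base_url}/1/{collection}/{deposit_id}"
    return {
        "em_iri": f"{root}/media/",          # archives: POST adds, PUT replaces
        "edit_se_iri": f"{root}/metadata/",  # metadata: POST adds, PUT replaces
        "state_iri": f"{root}/status/",      # GET the operation status
    }


if __name__ == "__main__":
    for name, iri in deposit_iris("hal", 10).items():
        print(name, iri)
```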
+ +Possible update endpoints +^^^^^^^^^^^^^^^^^^^^^^^^^ + +* PUT /1///media/: replace existing archives for the deposit +* POST /1///media/: add new archives to the deposit +* PUT /1///metadata/: replace existing metadata (and possible archives) +* POST /1///metadata/: add new metadata + +[6] Deposit Removal +~~~~~~~~~~~~~~~~~~~ + +As long as the deposit's status remains ``partial``, it's possible to +remove the deposit entirely or remove only the deposit's archive(s). + +If the deposit has been removed, further querying that deposit will +return a *404* response. + +If the deposit's archive(s) have been removed, the client can still issue +other requests to update that deposit. + +Operation Status +~~~~~~~~~~~~~~~~ + +Providing a collection name and a deposit id, the client asks for the +operation status of a prior deposit. + +URL: GET /1///status/ + +This returns: + +* *201* response with the actual status +* *404* if the deposit does not exist (or no longer does) + +Possible errors +---------------- + +sword:ErrorContent +~~~~~~~~~~~~~~~~~~ + +IRI: ``http://purl.org/net/sword/error/ErrorContent`` + +The supplied format is not the same as that identified in the Packaging +header and/or that supported by the server. + +Associated HTTP status: *415 (Unsupported Media Type)* + +sword:ErrorChecksumMismatch +~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +IRI: ``http://purl.org/net/sword/error/ErrorChecksumMismatch`` + +Checksum sent does not match the calculated checksum. + +Associated HTTP status: *412 Precondition Failed* + +sword:ErrorBadRequest +~~~~~~~~~~~~~~~~~~~~~ + +IRI: ``http://purl.org/net/sword/error/ErrorBadRequest`` + +Some parameters sent with the POST/PUT were not understood. + +Associated HTTP status: *400 Bad Request* + +sword:MediationNotAllowed +~~~~~~~~~~~~~~~~~~~~~~~~~ + +IRI: ``http://purl.org/net/sword/error/MediationNotAllowed`` + +Used where a client has attempted a mediated deposit, but this is not +supported by the server. 
+ +Associated HTTP status: *412 Precondition Failed* + +sword:MethodNotAllowed +~~~~~~~~~~~~~~~~~~~~~~ + +IRI: ``http://purl.org/net/sword/error/MethodNotAllowed`` + +Used when the client has attempted one of the HTTP update verbs (POST, +PUT, DELETE) but the server has decided not to respond to such requests +on the specified resource at that time. + +Associated HTTP Status: *405 Method Not Allowed* + +sword:MaxUploadSizeExceeded +~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +IRI: ``http://purl.org/net/sword/error/MaxUploadSizeExceeded`` + +Used when the client has attempted to supply to the server a file which +exceeds the server's maximum upload size limit + +Associated HTTP Status: *413 (Request Entity Too Large)* + +sword:Unauthorized +~~~~~~~~~~~~~~~~~~ + +IRI: ``http://purl.org/net/sword/error/ErrorUnauthorized`` + +The access to the api is through authentication. + +Associated HTTP status: *401* + +sword:Forbidden +~~~~~~~~~~~~~~~ + +IRI: ``http://purl.org/net/sword/error/ErrorForbidden`` + +The action is forbidden (access to another collection for example). + +Associated HTTP status: *403* + +Nomenclature +------------ + +SWORD uses IRI notion, Internationalized Resource Identifier. In this +chapter, we will describe SWH's IRIs. + +SD-IRI - The Service Document IRI +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The Service Document IRI. This is the IRI from which the client can +discover its collection IRI. + +HTTP verbs supported: *GET* + +Col-IRI - The Collection IRI +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The software collection associated to one user. + +The SWORD Collection IRI is the IRI to which the initial deposit will +take place, and which is listed in the Service Document. + +Following our previous example, this is: +https://deposit.softwareheritage.org/1/hal/. + +HTTP verbs supported: *POST* + +Cont-IRI - The Content IRI +~~~~~~~~~~~~~~~~~~~~~~~~~~ + +This is the endpoint which permits the client to retrieve +representations of the object as it resides in the SWORD server. 
+ +This will display information about the content and its associated +metadata. + +HTTP verbs supported: *GET* + +*Note:* We also refer to it as *Cont-File-IRI*. + +EM-IRI - The Atom Edit Media IRI +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +This is the endpoint to upload other related archives for the same +deposit. + +It is used to change a ``partial`` deposit with regard to archives, in +particular: + +* replace existing archives with new ones +* add new archives +* delete archives from a deposit + +Example use case: the first archive to upload exceeds the deposit's size +limit. The client can thus split the archive into multiple ones: POST a +first ``partial`` archive to the Col-IRI (with In-Progress: True), then, +in order to complete the deposit, POST the remaining +archives to the EM-IRI (the last one with the In-Progress header set to +False). + +HTTP verbs supported: *POST*, *PUT*, *DELETE* + +Edit-IRI - The Atom Entry Edit IRI +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +This is the endpoint to change a ``partial`` deposit with regard to +metadata. In particular: + +* replace existing metadata (and archives) with new ones +* add new metadata (and archives) +* delete deposit + +HTTP verbs supported: *POST*, *PUT*, *DELETE* + +*Note:* We also refer to it as *Edit-SE-IRI*. + +SE-IRI - The SWORD Edit IRI +~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The SWORD specification permits merging this with the EDIT-IRI, so we did. + +*Note:* We also refer to it as *Edit-SE-IRI*. + +State-IRI - The SWORD Statement IRI +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +This is the IRI which can be used to retrieve a description of the +object from the sword server, including the structure of the object and +its state. This will be used as the operation status endpoint. 
+ +HTTP verbs supported: *GET* + +Sources +------- + +* `SWORD v2 specification + `__ +* `arxiv documentation `__ +* `Dataverse example `__ +* `SWORD used on HAL `__ +* `xml examples for CCSD `__ diff --git a/docs/spec-loading.md b/docs/spec-loading.md deleted file mode 100644 index ef8780eb..00000000 --- a/docs/spec-loading.md +++ /dev/null @@ -1,221 +0,0 @@ -# Loading specification (draft) - -This part discusses the deposit loading part on the server side. - -## Tarball Loading - -The `swh-loader-tar` module is already able to inject tarballs in swh -with very limited metadata (mainly the origin). - -The loading of the deposit will use the deposit's associated data: -- the metadata -- the archive(s) - -We will use the `synthetic` revision notion. - -To that revision will be associated the metadata. Those will be -included in the hash computation, thus resulting in a unique -identifier. - -### Loading mapping - -Some of those metadata will also be included in the `origin_metadata` -table. - -``` -origin | https://hal.inria.fr/hal-id | -------------------------------------|----------------------------------------| -origin_visit | 1 :reception_date | -origin_metadata | aggregated metadata | -occurrence & occurrence_history | branch: client's version n° (e.g hal) | -revision | synthetic_revision (tarball) | -directory | upper level of the uncompressed archive| -``` - -### Questions raised concerning loading - -- A deposit has one origin, yet an origin can have multiple deposits? - -No, an origin can have multiple requests for the same deposit. -Which should end up in one single deposit (when the client pushes its final -request saying deposit 'done' through the header In-Progress). - -Only update of existing 'partial' deposit is permitted. -Other than that, the deposit 'update' operation. - -To create a new version of a software (already deposited), the client -must prior to this create a new deposit. 
- - -Illustration First deposit loading: - -HAL's deposit 01535619 = SWH's deposit **01535619-1** - - + 1 origin with url:https://hal.inria.fr/medihal-01535619 - - + 1 synthetic revision - - + 1 directory - -HAL's update on deposit 01535619 = SWH's deposit **01535619-2** - -(*with HAL updates can only be on the metadata and a new version is required -if the content changes) - - + 1 origin with url:https://hal.inria.fr/medihal-01535619 - - + new synthetic revision (with new metadata) - - + same directory - -HAL's deposit 01535619-v2 = SWH's deposit **01535619-v2-1** - - + same origin - - + new revision - - + new directory - - - -## Technical details - -### Requirements - -- one dedicated database to store the deposit's state - swh-deposit - -- one dedicated temporary objstorage to store archives before - loading - -- one client to test the communication with SWORD protocol - -### Deposit reception schema - -- SWORD imposes the use of basic authentication, so we need a way to -authenticate client. Also, a client can access collections: - -**deposit_client** table: - - id (bigint): Client's identifier - - username (str): Client's username - - password (pass): Client's crypted password - - collections ([id]): List of collections the client can access - -- Collections group deposits together: - -**deposit_collection** table: - - id (bigint): Collection's identifier - - name (str): Collection's human readable name - -- A deposit is the main object the repository is all about: - -**deposit** table: - - id (bigint): deposit's identifier - - reception_date (date): First deposit's reception date - - complete_data (date): Date when the deposit is deemed complete and ready for loading - - collection (id): The collection the deposit belongs to - - external id (text): client's internal identifier (e.g hal's id, etc...). 
- - client_id (id) : Client which did the deposit - - swh_id (str) : swh identifier result once the loading is complete - - status (enum): The deposit's current status - -- As mentioned, a deposit can have a status, whose possible values - are: - -``` text - 'partial', -- the deposit is new or partially received since it - -- can be done in multiple requests - 'expired', -- deposit has been there too long and is now deemed - -- ready to be garbage collected - 'deposited' -- deposit complete, it is ready to be checked to ensure data consistency - 'verified', -- deposit is fully received, checked, and ready for loading - 'loading', -- loading is ongoing on swh's side - 'done', -- loading is successful - 'failed' -- loading is a failure -``` - -A deposit is stateful and can be made in multiple requests: - -**deposit_request** table: - - id (bigint): identifier - - type (id): deposit request's type (possible values: 'archive', 'metadata') - - deposit_id (id): deposit whose request belongs to - - metadata: metadata associated to the request - - date (date): date of the requests - -Information sent along a request are stored in a `deposit_request` -row. - -They can be either of type `metadata` (atom entry, multipart's atom -entry part) or of type `archive` (binary upload, multipart's binary -upload part). - -When the deposit is complete (status `deposited`), those `metadata` -and `archive` deposit requests will be read and aggregated. They will -then be sent as parameters to the loading routine. - -During loading, some of those metadata are kept in the -`origin_metadata` table and some other are stored in the `revision` -table (see [metadata loading](#metadata-loading)). 
- -The only update actions occurring on the deposit table are in regards -of: -- status changing: - - `partial` -> {`expired`/`deposited`}, - - `deposited` -> {`rejected`/`verified`}, - - `verified` -> `loading` - - `loading` -> {`done`/`failed`} -- `complete_date` when the deposit is finalized (when the status is - changed to `deposited`) -- `swh-id` is populated once we have the loading result - -#### SWH Identifier returned - - The synthetic revision id - - e.g: 47dc6b4636c7f6cba0df83e3d5490bf4334d987e - -### Scheduling loading - -All `archive` and `metadata` deposit requests should be aggregated -before loading. - -The loading should be scheduled via the scheduler's api. - -Only `deposited` deposit are concerned by the loading. - -When the loading is done and successful, the deposit entry is -updated: -- `status` is updated to `done` -- `swh-id` is populated with the resulting hash - (cf. [swh identifier](#swh-identifier-returned)) -- `complete_date` is updated to the loading's finished time - -When the loading is failed, the deposit entry is updated: -- `status` is updated to `failed` -- `swh-id` and `complete_data` remains as is - -*Note:* As a further improvement, we may prefer having a retry policy -with graceful delays for further scheduling. - -### Metadata loading - -- the metadata received with the deposit should be kept in the -`origin_metadata` table before translation as part of the loading -process and an indexation process should be scheduled. 
- -- provider_id and tool_id are resolved by the prepare_metadata method in the -loader-core - -- the origin_metadata entry is sent to storage by the send_origin_metadata in -the loader-core - - -origin_metadata table: -``` -id bigint PK -origin bigint -discovery_date date -provider_id bigint FK // (from provider table) -tool_id bigint FK // indexer_configuration_id tool used for extraction -metadata jsonb // before translation -``` diff --git a/docs/spec-loading.rst b/docs/spec-loading.rst new file mode 100644 index 00000000..21c7a0f1 --- /dev/null +++ b/docs/spec-loading.rst @@ -0,0 +1,222 @@ +Loading specification (draft) +============================= + +This part discusses the deposit loading part on the server side. + +Tarball Loading +--------------- + +The ``swh-loader-tar`` module is already able to inject tarballs in swh +with very limited metadata (mainly the origin). + +The loading of the deposit will use the deposit's associated data: + +* the metadata +* the archive(s) + +We will use the ``synthetic`` revision notion. + +To that revision will be associated the metadata. Those will be included +in the hash computation, thus resulting in a unique identifier. + +Loading mapping +~~~~~~~~~~~~~~~ + +Some of those metadata will also be included in the ``origin_metadata`` +table. + +:: + + origin | https://hal.inria.fr/hal-id | + ------------------------------------|----------------------------------------| + origin_visit | 1 :reception_date | + origin_metadata | aggregated metadata | + occurrence & occurrence_history | branch: client's version n° (e.g hal) | + revision | synthetic_revision (tarball) | + directory | upper level of the uncompressed archive| + +Questions raised concerning loading +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +- A deposit has one origin, yet an origin can have multiple deposits? + +No, an origin can have multiple requests for the same deposit. 
Which +should end up in one single deposit (when the client pushes its final +request saying deposit 'done' through the header In-Progress). + +Only update of existing 'partial' deposit is permitted. Other than that, +the deposit 'update' operation. + +To create a new version of a software (already deposited), the client +must prior to this create a new deposit. + +Illustration First deposit loading: + +HAL's deposit 01535619 = SWH's deposit **01535619-1** + +:: + + + 1 origin with url:https://hal.inria.fr/medihal-01535619 + + + 1 synthetic revision + + + 1 directory + +HAL's update on deposit 01535619 = SWH's deposit **01535619-2** + +(\*with HAL updates can only be on the metadata and a new version is +required if the content changes) + +:: + + + 1 origin with url:https://hal.inria.fr/medihal-01535619 + + + new synthetic revision (with new metadata) + + + same directory + +HAL's deposit 01535619-v2 = SWH's deposit **01535619-v2-1** + +:: + + + same origin + + + new revision + + + new directory + +Technical details +----------------- + +Requirements +~~~~~~~~~~~~ + +* one dedicated database to store the deposit's state - swh-deposit +* one dedicated temporary objstorage to store archives before loading +* one client to test the communication with SWORD protocol + +Deposit reception schema +~~~~~~~~~~~~~~~~~~~~~~~~ + +* SWORD imposes the use of basic authentication, so we need a way to + authenticate client. 
Also, a client can access collections: + + **deposit\_client** table: - id (bigint): Client's identifier - username + (str): Client's username - password (pass): Client's crypted password - + collections ([id]): List of collections the client can access + +* Collections group deposits together: + + **deposit\_collection** table: - id (bigint): Collection's identifier - name + (str): Collection's human readable name + +* A deposit is the main object the repository is all about: + + **deposit** table: + + * id (bigint): deposit's identifier + * reception\_date (date): First deposit's reception date + * complete\_data (date): Date when the deposit is deemed complete and ready + for loading + * collection (id): The collection the deposit belongs to + * external id (text): client's internal identifier (e.g hal's id, etc...). + * client\_id (id) : Client which did the deposit + * swh\_id (str) : swh identifier result once the loading is complete + * status (enum): The deposit's current status + +- As mentioned, a deposit can have a status, whose possible values are: + + .. 
code:: text + + 'partial', -- the deposit is new or partially received since it + -- can be done in multiple requests + 'expired', -- deposit has been there too long and is now deemed + -- ready to be garbage collected + 'deposited' -- deposit complete, it is ready to be checked to ensure data consistency + 'verified', -- deposit is fully received, checked, and ready for loading + 'loading', -- loading is ongoing on swh's side + 'done', -- loading is successful + 'failed' -- loading is a failure + +* A deposit is stateful and can be made in multiple requests: + + **deposit\_request** table: + * id (bigint): identifier + * type (id): deposit request's type (possible values: 'archive', 'metadata') + * deposit\_id (id): deposit whose request belongs to + * metadata: metadata associated to the request + * date (date): date of the requests + + Information sent along a request are stored in a ``deposit_request`` row. + + They can be either of type ``metadata`` (atom entry, multipart's atom entry + part) or of type ``archive`` (binary upload, multipart's binary upload part). + + When the deposit is complete (status ``deposited``), those ``metadata`` and + ``archive`` deposit requests will be read and aggregated. They will then be + sent as parameters to the loading routine. + + During loading, some of those metadata are kept in the ``origin_metadata`` + table and some other are stored in the ``revision`` table (see `metadata + loading <#metadata-loading>`__). 
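The deposit statuses and the transitions this specification allows can be summed up as a small lookup table. This is only a sketch: ``TRANSITIONS`` and ``can_transition`` are hypothetical names, and ``rejected``, which the transition list mentions but the status enumeration does not, is treated here as a terminal state by assumption.

```python
# Allowed status changes, as listed in this specification.
TRANSITIONS = {
    "partial": {"expired", "deposited"},
    "deposited": {"rejected", "verified"},
    "verified": {"loading"},
    "loading": {"done", "failed"},
}


def can_transition(current, new):
    """Return True when moving a deposit from `current` to `new` is allowed."""
    # Statuses without an entry (done, failed, expired, rejected) are terminal.
    return new in TRANSITIONS.get(current, set())
```

Keeping the transitions in one table makes it easy to reject any status update that is not listed, for example going from ``partial`` straight to ``loading``.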
+ + The only update actions occurring on the deposit table concern: + + * status changes: + + * ``partial`` -> {``expired``/``deposited``} + * ``deposited`` -> {``rejected``/``verified``} + * ``verified`` -> ``loading`` + * ``loading`` -> {``done``/``failed``} + + * ``complete_date``, set when the deposit is finalized (when the status is + changed to ``deposited``) + * ``swh-id``, populated once we have the loading result + +SWH Identifier returned +^^^^^^^^^^^^^^^^^^^^^^^ + +:: + + The synthetic revision id + + e.g.: swh:1:rev:47dc6b4636c7f6cba0df83e3d5490bf4334d987e + +Scheduling loading +~~~~~~~~~~~~~~~~~~ + +All ``archive`` and ``metadata`` deposit requests should be aggregated before +loading. + +The loading should be scheduled via the scheduler's api. + +Only ``deposited`` deposits are concerned by the loading. + +When the loading is done and successful, the deposit entry is updated: + +* ``status`` is updated to ``done`` +* ``swh-id`` is populated with the resulting + hash (cf. `swh identifier <#swh-identifier-returned>`__) +* ``complete_date`` is updated to the loading's finished time + +When the loading fails, the deposit entry is updated: + +* ``status`` is updated to ``failed`` +* ``swh-id`` and ``complete_date`` remain as is + +*Note:* As a further improvement, we may prefer having a retry policy with +graceful delays for further scheduling. + +Metadata loading +~~~~~~~~~~~~~~~~ + +- the metadata received with the deposit should be kept in the + ``origin_metadata`` table before translation as part of the loading process + and an indexation process should be scheduled. 
+ +- provider\_id and tool\_id are resolved by the prepare\_metadata method in the + loader-core + +- the origin\_metadata entry is sent to storage by the send\_origin\_metadata + in the loader-core + +origin\_metadata table: + +:: + + id bigint PK + origin bigint + discovery_date date + provider_id bigint FK // (from provider table) + tool_id bigint FK // indexer_configuration_id tool used for extraction + metadata jsonb // before translation diff --git a/docs/sys-info.md b/docs/sys-info.md deleted file mode 100644 index 25ab8dca..00000000 --- a/docs/sys-info.md +++ /dev/null @@ -1,47 +0,0 @@ -# Bootstrap swh-deposit on production - -As usual, the debian packaged is created and uploaded to the swh -debian repository. Once the package is installed, we need to do a few -things in regards to the database. - -## Prepare the database setup (existence, connection, etc...). - -This is defined through the packaged `swh.deposit.settings.production` -module and the expected **/etc/softwareheritage/deposit/private.yml**. - -As usual, the expected configuration files are deployed through our -puppet manifest (cf. puppet-environment/swh-site, -puppet-environment/swh-role, puppet-environment/swh-profile) - -## Migrate/bootstrap the db schema - -``` Shell -sudo django-admin migrate --settings=swh.deposit.settings.production -``` - -## Load minimum defaults data - -``` Shell -sudo django-admin loaddata --settings=swh.deposit.settings.production deposit_data -``` - -This adds the minimal: -- deposit request type 'archive' and 'metadata' -- 'hal' collection - -Note: swh.deposit.fixtures.deposit_data is packaged - -## Add client and collection - -``` Shell -python3 -m swh.deposit.create_user --platform production \ - --collection \ - --username \ - --password -``` - -This adds a user `` which can access the collection -``. The password will be used for the authentication -access to the deposit api. - -Note: This creation procedure needs to be improved. 
diff --git a/docs/sys-info.rst b/docs/sys-info.rst new file mode 100644 index 00000000..2e1e0ff6 --- /dev/null +++ b/docs/sys-info.rst @@ -0,0 +1,51 @@ +Bootstrap swh-deposit on production +=================================== + +As usual, the Debian package is created and uploaded to the swh Debian +repository. Once the package is installed, we need to do a few things +with regard to the database. + +Prepare the database setup (existence, connection, etc...). +----------------------------------------------------------- + +This is defined through the packaged ``swh.deposit.settings.production`` +module and the expected **/etc/softwareheritage/deposit/private.yml**. + +As usual, the expected configuration files are deployed through our +puppet manifest (cf. puppet-environment/swh-site, +puppet-environment/swh-role, puppet-environment/swh-profile) + +Migrate/bootstrap the db schema +------------------------------- + +.. code:: shell + + sudo django-admin migrate --settings=swh.deposit.settings.production + +Load minimum defaults data +-------------------------- + +.. code:: shell + + sudo django-admin loaddata --settings=swh.deposit.settings.production deposit_data + +This adds the minimal data: the deposit request types 'archive' and +'metadata', and the 'hal' collection. + +Note: swh.deposit.fixtures.deposit\_data is packaged + +Add client and collection +------------------------- + +.. code:: shell + + python3 -m swh.deposit.create_user --platform production \ + --collection \ + --username \ + --password + +This adds a user ```` which can access the collection +````. The password will be used to authenticate +access to the deposit api. + +Note: This creation procedure needs to be improved.