diff --git a/README b/README --- a/README +++ b/README @@ -1,5 +1,615 @@ -swh-deposit -=========== +swh-deposit (draft) +=================== -SWH's SWORD Deposit Server +This is SWH's SWORD Server implementation. +SWORD (Simple Web-Service Offering Repository Deposit) is an +interoperability standard for digital file deposit. + +This protocol will be used to interact between a client (a repository) +and a server (swh repository) to permit the deposit of software +tarballs. + +In this document, we will refer to a client (e.g. HAL server) and a +server (SWH's). + +Table of contents +--------------------- +1. [use cases](#uc) +2. [api overview](#api) +3. [limitations](#limitations) +4. [scenarios](#scenarios) +5. [errors](#errors) +6. [tarball injection](#tarball) +7. [technical](#technical) +8. [sources](#sources) + +# Use cases + +## First deposit + +From client's deposit repository server to SWH's repository server +(aka deposit). + +-[\[1\]](#1) The client requests for the server's abilities. + (GET query to the *service document uri*) + +-[\[2\]](#2)The server answers the client with the service document + +-[\[3\]](#3) The client sends the deposit (an archive -> .zip, .tar.gz) +through the deposit *creation uri*. + (one or more POST requests since the archive and metadata can be sent + in multiple requests) + + +-[\[4\]](#4) The server notifies the client it acknowledged the +client's request. ('http 201 Created' with a deposit receipt id in +the Location header of the response) + + +## Updating an existing archive + +-[\[5\]](#5) Client updates existing archive through the deposit *update uri* + (one or more PUT requests, in effect chunking the artifact to deposit) + +## Deleting an existing archive + +-[\[6\]](#6) Document deletion will not be implemented, +cf. limitation paragraph for detail + +## Client asks for operation status and repository id + +-[\[7\]](#7) TODO: Detail this when clear + +# API overview + +API access is over HTTPS. + +service document accessible at: +https://archive.softwareheritage.org/api/1/servicedocument/ + +API endpoints: + + - without a specific collection, are rooted at + https://archive.softwareheritage.org/api/1/deposit/. + + - with a specific and unique collection dubbed 'software', are rooted at + https://archive.softwareheritage.org/api/1/software/. + + +TODO: Determine which one of those solutions according to sword possibilities +(cf. 'unclear points' chapter below) + +# Limitations + +Applying the SWORD protocol procedure will result with voluntary implementation +shortcomings during the first iteration: + +- upload limitation of 200Mib +- only tarballs (.zip, .tar.gz) will be accepted +- no removal (implementation-wise, this will possibly be a means + to hide the origin). +- no mediation (we do not know the other system's users) +- basic http authentication enforced at the application layer + on a per client basis (authentication: + http://swordapp.github.io/SWORDv2-Profile/SWORDProfile.html#authenticationmediateddeposit) + +## unclear points + +- SWORD defines a 'collection' concpet. should we apply the 'collection' concept + even thought SWH is software archive having one 'software' collection? + - option A: + The collection refers to a group of documents to which the document sent + (aka deposit) is part of. In this process with HAL, HAL is the collection, + maybe tomorrow we will do the same with MIT and MIT could be + the collection (the logic of the answer above is a result of this + link: https://hal.inria.fr/USPC for the USPC collection) + + **result**: 1 client being equivalent as 1 collection in this case. + The is client pushes us software in 'their' one collection. + The collection name could show up in the uri endpoint. + + - option B: + Define none? (is it possible? i don't think it is due to the service + document part listing the collection to act upon...) + + **result**: the deposited software has no other entry point via + collection name + + +## Scenarios +### [1] Client request for Service Document + +This is the endpoint permitting the client to ask the server's abilities. + + +#### API endpoint + +GET api/1/servicedocument/ + +Answer: +- 200, Content-Type: application/atomserv+xml: OK, with the body + described below + +#### Sample request: + +``` shell +GET https://archive.softwareheritage.org/api/1/servicedocument HTTP/1.1 +Host: archive.softwareheritage.org +``` + +### [2] Sever respond for Service Document + +The server returns its abilities with the service document in xml format: +- protocol sword version v2 +- accepted mime types: application/zip, application/gzip +- upload max size accepted, beyond that, it's expected the client + chunk the tarball into multiple ones +- the collections the client can act upon (swh supports only one software collection) +- mediation not supported + +#### Sample answer: +``` xml + + + + 2.0 + ${max_upload_size} + + + The SWH archive + + + SWH Collection + application/gzip + application/gzip + Software Heritage Archive Deposit + false + http://purl.org/net/sword/package/SimpleZip + + + +``` + + +## Deposit Creation: client point of view + +Process of deposit creation: + +-> [3] client request + + - [3.1] server validation + - [3.2] server temporary upload + - [3.3] server injects deposit into archive* + +<- [4] server returns deposit receipt id + + +*[3.3] Asynchronously, the server will inject the archive uploaded and the + associated metadata. The operation status mentioned + earlier is a reference to that injection operation. + +The image bellow represent only the communication and creation of +a deposit: +{F2403754} + +### [3] client request + +The client can send a deposit through one request deposit or multiple requests deposit. + +The deposit can contain: +- an archive holding the software source code, +- an envelop with metadata describing information regarding a deposit, +- or both (Multipart deposit). + +the client can deposit a binary file, supplying the following headers: +- Content-Type (text): accepted mimetype +- Content-Length (int): tarball size +- Content-MD5 (text): md5 checksum hex encoded of the tarball +- Content-Disposition (text): attachment; filename=[filename] ; the filename + parameter must be text (ascii) +- Packaging (IRI): http://purl.org/net/sword/package/SimpleZip +- In-Progress (bool): true to specify it's not the last request, false + to specify it's a final request and the server can go on with + processing the request's information. + +if In-Progress is not present the server MUST assume that it is false + +#### API endpoint + +POST /api/1/deposit/ + +#### One request deposit + +The one request deposit is a single request containing both the metadata (body) +and the archive (attachment). + +A Multipart deposit is a request of an archive along with metadata about +that archive (can be applied in a one request deposit or multiple requests). + +Client provides: +- Content-Disposition (text): header of type 'attachment' on the Entry + Part with a name parameter set to 'atom' +- Content-Disposition (text): header of type 'attachment' on the Media + Part with a name parameter set to payload and a filename parameter + (the filename will be expressed in ASCII). +- Content-MD5 (text): md5 checksum hex encoded of the tarball +- Packaging (text): http://purl.org/net/sword/package/SimpleZip + (packaging format used on the Media Part) +- In-Progress (bool): true|false; true means partial upload and we can expect + other requests in the future, false means the deposit is done. +- add metadata formats or foreign markup to the atom:entry element + + +#### sample request for multipart deposit: + +``` xml +POST deposit HTTP/1.1 +Host: archive.softwareheritage.org +Content-Length: [content length] +Content-Type: multipart/related; + boundary="===============1605871705=="; + type="application/atom+xml" +In-Progress: false +MIME-Version: 1.0 + +Media Post +--===============1605871705== +Content-Type: application/atom+xml; charset="utf-8" +Content-Disposition: attachment; name="atom" +MIME-Version: 1.0 + + + + Title + hal-or-other-archive-id + 2005-10-07T17:17:08Z + Contributor + + + + +--===============1605871705== +Content-Type: application/zip +Content-Disposition: attachment; name=payload; filename=[filename] +Packaging: http://purl.org/net/sword/package/SimpleZip +Content-MD5: [md5-digest] +MIME-Version: 1.0 + +[...binary package data...] +--===============1605871705==-- +``` + +## Deposit Creation - server point of view + +The server receives the request and: + +### [3.1] Validation of the header and body request + + +### [3.2] Server uploads the content in a temporary location +(deposit table in a separated DB). +- saves the archives in a temporary location +- executes a md5 checksum on that archive and check it against the + same header information +- adds a deposit entry and retrieves the associated id + + +### [4] Servers answers the client +an 'http 201 Created' with a deposit receipt id in the Location header of +the response. + +##### The server possible answers are: +- OK: '201 created' + one header 'Location' holding the deposit receipt + id +- KO: with the error status code and associated message + (cf. [possible errors paragraph](#possible errors)). + + +### [5] Deposit Update + +The client previously uploaded an archive and wants to add either new +metadata information or a new version for that previous deposit +(possibly in multiple steps as well). The important thing to note +here is that for swh, this will result in a new version of the +previous deposit in any case. + +Providing the identifier of the previous version deposit received from +the status URI, the client executes a PUT request on the same URI as +the deposit one. + +After validation of the body request, the server: +- uploads such content in a temporary location (to be defined). + +- answers the client an 'http 204 (No content)'. In the Location + header of the response lies a deposit receipt id permitting the + client to check back the operation status later on. + +- Asynchronously, the server will inject the archive uploaded and the + associated metadata. The operation status mentioned earlier is a + reference to that injection operation. The fact that the version is + a new one is dealt with at the injection level. + +##### URL: PUT /1/deposit/ + +## [6] Deposit Removal + +[#limitation](As explained in the limitation paragraph), removal won't +be implemented. Nothing is removed from the SWH archive. + +The server answers a '405 Method not allowed' error. + +### [7] Operation Status + +Providing a deposit receipt id, the client asks the operation status +of a prior upload. + +URL: GET /1/collection/{deposit_receipt} + +or GET /1/deposit/{deposit_receipt} + +note: depends of the decision taken about collections + +## Possible errors + +### sword:ErrorContent + +IRI: http://purl.org/net/sword/error/ErrorContent + +The supplied format is not the same as that identified in the +Packaging header and/or that supported by the server Associated HTTP + +Status: 415 (Unsupported Media Type) or 406 (Not Acceptable) + +### sword:ErrorChecksumMismatch + +IRI: http://purl.org/net/sword/error/ErrorChecksumMismatch + +Checksum sent does not match the calculated checksum. The server MUST +also return a status code of 412 Precondition Failed + +### sword:ErrorBadRequest + +IRI: http://purl.org/net/sword/error/ErrorBadRequest + +Some parameters sent with the POST/PUT were not understood. The server +MUST also return a status code of 400 Bad Request. + +### sword:MediationNotAllowed + +IRI: http://purl.org/net/sword/error/MediationNotAllowed + +Used where a client has attempted a mediated deposit, but this is not +supported by the server. The server MUST also return a status code of +412 Precondition Failed. + +### sword:MethodNotAllowed + +IRI: http://purl.org/net/sword/error/MethodNotAllowed + +Used when the client has attempted one of the HTTP update verbs (POST, +PUT, DELETE) but the server has decided not to respond to such +requests on the specified resource at that time. The server MUST also +return a status code of 405 Method Not Allowed + +### sword:MaxUploadSizeExceeded + +IRI: http://purl.org/net/sword/error/MaxUploadSizeExceeded + +Used when the client has attempted to supply to the server a file +which exceeds the server's maximum upload size limit + +Associated HTTP Status: 413 (Request Entity Too Large) + +--------------- + +# Tarball Injection + +Providing we use indeed synthetic revision to represent a version of a +tarball injected through the sword use case, this needs to be improved +so that the synthetic revision is created with a parent revision (the +previous known one for the same 'origin'). + + +### Injection mapping +| origin | https://hal.inria.fr/hal-id | +|-------------------------------------|---------------------------------------| +| origin_visit | 1 :reception_date | +| occurrence & occurrence_history | branch: client's version n° (e.g hal) | +| revision | synthetic_revision (tarball) | +| directory | upper level of the uncompressed archive| + + +##### Questions raised concerning injection: +- A deposit has one origin, yet an origin can have multiple deposits ? + +No, an origin can have multiple requests for the same deposit, +which should end up in one single deposit (when the client pushes its final +request saying deposit 'done' through the header In-Progress). + +When an update of a deposit is requested, +the new version is identified with the external_id. + +Illustration First deposit injection: + +HAL's deposit 01535619 = SWH's deposit **01535619-1** + + + 1 origin with url:https://hal.inria.fr/medihal-01535619 + + + 1 synthetic revision + + + 1 directory + +HAL's update on deposit 01535619 = SWH's deposit **01535619-2** + +(*with HAL updates can only be on the metadata and a new version is required +if the content changes) + + 1 origin with url:https://hal.inria.fr/medihal-01535619 + + + new synthetic revision (with new metadata) + + + same directory + +HAL's deposit 01535619-v2 = SWH's deposit **01535619-v2-1** + + + same origin + + + new revision + + + new directory + + + +## Technical detail +We will need: +- one dedicated db to store state - swh-deposit + +- one dedicated temporary storage to store archives before injection + +- one client to test the communication with SWORD protocol + +### Deposit reception schema + +- **deposit** table: + - id (bigint): deposit receipt id + + - external id (text): client's internal identifier (e.g hal's id, etc...). + + - origin id : null before injection + - swh_id : swh identifier result once the injection is complete + + - reception_date: first deposit date + + - complete_date: reception date of the last deposit which makes the deposit + complete + + - status (enum): +``` + 'partial', -- the deposit is new or partially received since it + -- can be done in multiple requests + 'expired', -- deposit has been there too long and is now deemed + -- ready to be garbage collected + 'ready', -- deposit is fully received and ready for injection + 'scheduled', -- injection is scheduled on swh's side + 'success', -- injection successful + 'failure' -- injection failure +``` +- **deposit_request** table: + - id (bigint): identifier + - deposit_id: deposit concerned by the request + - metadata: metadata associated to the request + +- **client** table: + - id (bigint): identifier + - name (text): client's name (e.g HAL) + - credentials + + +All metadata (declared metadata) are stored in deposit_request (with the +request they were sent with). +When the deposit is complete metadata fields are aggregated and sent +to injection. During injection the metadata is kept in the +origin_metadata table (see [metadata injection](#metadata-injection)). + +The only update actions occurring on the deposit table are in regards of: + - status changing + - partial -> {expired/ready}, + - ready -> scheduled, + - scheduled -> {success/failure} + - complete_date when the deposit is finalized + (when the status is changed to ready) + - swh-id being populated once we have the result of the injection + +#### SWH Identifier returned? + + swh-- + + e.g: swh-hal-47dc6b4636c7f6cba0df83e3d5490bf4334d987e + + We could have a specific dedicated 'client' table to reference client + identifier. + +### Scheduling injection +All data and metadata separated with multiple requests should be aggregated +before injection. + +TODO: injection modeling + +### Metadata injection +- the metadata received with the deposit should be kept in the origin_metadata +table before translation as part of the injection process and a indexation +process should be scheduled. + +origin_metadata table: +``` +origin bigint PK FK +date date PK FK +provenance_type text + // (enum: 'publisher', 'lister' needs to be completed) +raw_metadata jsonb + // before translation +indexer_configuration_id bigint FK + // tool used for translation +translated_metadata jsonb + // with codemeta schema and terms +``` + +# Nomenclature + +SWORD uses IRI. This means Internationalized Resource Identifier. In +this chapter, we will describe SWH's IRI. + +## SD-IRI - The Service Document IRI + +This is the IRI from which the root service document can be +located. + +## Col-IRI - The Collection IRI + +Only one collection of software is used in this repository. + +Note: +This is the IRI to which the initial deposit will take place, and +which are listed in the Service Document. +Discuss to check if we want to implement this or not. + +## Cont-IRI - The Content IRI + +This is the IRI from which the client will be able to retrieve +representations of the object as it resides in the SWORD server. + +## EM-IRI - The Atom Edit Media IRI + +To simplify, this is the same as the Cont-IRI. + +## Edit-IRI - The Atom Entry Edit IRI + +This is the IRI of the Atom Entry of the object, and therefore also of +the container within the SWORD server. + +## SE-IRI - The SWORD Edit IRI + +This is the IRI to which clients may POST additional content to an +Atom Entry Resource. This MAY be the same as the Edit-IRI, but is +defined separately as it supports HTTP POST explicitly while the +Edit-IRI is defined by [AtomPub] as limited to GET, PUT and DELETE +operations. + +## State-IRI - The SWORD Statement IRI + +This is the one of the IRIs which can be used to retrieve a +description of the object from the sword server, including the +structure of the object and its state. This will be used as the +operation status endpoint. + +# sources + +- [SWORD v2 specification](http://swordapp.github.io/SWORDv2-Profile/SWORDProfile.html) +- [arxiv documentation](https://arxiv.org/help/submit_sword) +- [Dataverse example](http://guides.dataverse.org/en/4.3/api/sword.html) +- [SWORD used on HAL](https://api.archives-ouvertes.fr/docs/sword) +- [xml examples for CCSD](https://github.com/CCSDForge/HAL/tree/master/Sword) diff --git a/doc/specs.md b/doc/specs.md deleted file mode 100644 --- a/doc/specs.md +++ /dev/null @@ -1,615 +0,0 @@ -swh-deposit (draft) -=================== - -This is SWH's SWORD Server implementation. - -SWORD (Simple Web-Service Offering Repository Deposit) is an -interoperability standard for digital file deposit. - -This protocol will be used to interact between a client (a repository) -and a server (swh repository) to permit the deposit of software -tarballs. - -In this document, we will refer to a client (e.g. HAL server) and a -server (SWH's). - -Table of contents ---------------------- -1. [use cases](#uc) -2. [api overview](#api) -3. [limitations](#limitations) -4. [scenarios](#scenarios) -5. [errors](#errors) -6. [tarball injection](#tarball) -7. [technical](#technical) -8. [sources](#sources) - -# Use cases - -## First deposit - -From client's deposit repository server to SWH's repository server (aka deposit). - --[\[1\]](#1) The client requests for the server's abilities. - (GET query to the *service document uri*) - -- [\[2\]](#2)The server answers the client with the service document - -- [\[3\]](#3) The client sends the deposit (an archive -> .zip, .tar.gz) through the deposit - *creation uri*. - (one or more POST requests since the archive and metadata can be sent in multiple times) - - -- [\[4\]](#4) The server notifies the client it acknowledged the client's request. - ('http 201 Created' with a deposit receipt id in the Location header of the response) - - -## Updating an existing archive - --[\[5\]](#5) Client updates existing archive through the deposit *update uri* - (one or more PUT requests, in effect chunking the artifact to deposit) - -## Deleting an existing archive - -- [\[6\]](#6) Document deletion will not be implemented, cf. limitation paragraph for - detail - -## Client asks for operation status and repository id - -I'm not sure yet as to how this goes in the sword protocol. -I speak of operation status but i've yet to find a reference to this in the sword spec. - -- [\[7\]](#7)TODO: Detail this when clear - -# API overview - -API access is over HTTPS. - -service document accessible at: https://archive.softwareheritage.org/api/1/servicedocument/ - -API endpoints: - - - without a specific collection, are rooted at https://archive.softwareheritage.org/api/1/deposit/. - - - with a specific and unique collection dubbed 'software', are rooted at https://archive.softwareheritage.org/api/1/software/. - - -TODO: Determine which one of those solutions according to sword possibilities (cf. 'unclear points' chapter below) - -# Limitations - -With this SWORD protocol procedure there will be some voluntary implementation shortcomings: - -- no removal -- no mediation (we do not know the other system's users) -- upload limitation of 200Mib -- only tarballs (.zip, .tar.gz) will be accepted -- no authentication enforced at the application layer -- basic authentication at the server layer - -## unclear points - -- SWORD defines a 'collection' notion. But, as SWH is a software archive, we have only one 'software' collection. - -I think the collection refers to a group of documents to which the document sent (aka deposit) is part of -in this process with HAL, HAL is the collection, maybe tomorrow we will do the same with MIT and MIT could be the collection -(the logic of the anwser above is a result of this link: https://hal.inria.fr/USPC the USPC collection) - -that makes sense. -Still, i don't think we want to do this. -Or, objectively, i don't see how to implement this correctly. - -Specifically, I think, the client can push directly the documents to us. -If for some reasons, we want to list the 'documents', we could distinguish then -(as this could help in reducing the length of documents per client, 1 client being equivalent as 1 collection in this case). - -What should we do with this? - - Define one? - - Define none? (is it possible? i don't think it is due to the service document part listing the collection to act upon...) - - -# Scenarios -## [1] Client request for Service Document - -This is the endpoint permitting the client to ask the server's abilities. - - -### API endpoint - -GET api/1/servicedocument/ - -Answer: -- 200, Content-Type: application/atomserv+xml: OK, with the body - described below - -### Sample request: - -``` shell -GET https://archive.softwareheritage.org/api/1/servicedocument HTTP/1.1 -Host: archive.softwareheritage.org -``` - -## [2] Sever respond for Service Document - -The server returns its abilities with the service document in xml format: -- protocol sword version v2 -- accepted mime types: application/zip, application/gzip -- upload max size accepted, beyond that, it's expected the client - chunk the tarball into multiple ones -- the collections the client can act upon (swh supports only one software collection) -- mediation not supported - -### Sample answer: -``` xml - - - - 2.0 - ${max_upload_size} - - - The SWH archive - - - SWH Collection - application/gzip - application/gzip - Software Heritage Archive Deposit - false - http://purl.org/net/sword/package/SimpleZip - - - -``` - - -## Deposit Creation: client point of view - -Process of deposit creation: - -> [3] client request -> - - ( [3.1] server validation -> [3.2] server temporary upload ) -> [3.3] server injects deposit into archive - - <- [4] server returns deposit receipt id - - -- [3.3] Asynchronously, the server will inject the archive uploaded and the - associated metadata. The operation status mentioned - earlier is a reference to that injection operation. - -## [[3] client request - -The client can send a deposit through one request deposit or multiple requests deposit. - -The deposit can contain: -- an archive holding the software source code, -- an envelop with metadata describing information regarding a deposit, -- or both (Multipart deposit). - -the client can deposit a binary file, supplying the following headers: -- Content-Type (text): accepted mimetype -- Content-Length (int): -- Content-MD5 (text): md5 checksum hex encoded of the tarball (we may need to check for the possibility to support a more secure hash) -- Content-Disposition (text): attachment; filename=[filename] ; the filename - parameter must be text (ascii) -- Packaging (IRI): http://purl.org/net/sword/package/SimpleZip -- In-Progress (bool): true to specify it's not the last request, false - to specify it's a final request and the server can go on with - processing the request's information - -TODO: required fields (MUST, SHOULD) - -I think the optional one is In-Progress, which if not there should be considered done (I'll check the spec for this). - -### API endpoint - -POST /api/1/deposit/ - -### One request deposit - -The one request deposit is a single request containing both the metadata (body) and the archive (attachment). - -A Multipart deposit is a request of an archive along with metadata about -that archive (can be applied in a one request deposit or multiple requests). - -Client provides: -- Content-Disposition (text): header of type 'attachment' on the Entry - Part with a name parameter set to 'atom' -- Content-Disposition (text): header of type 'attachment' on the Media - Part with a name parameter set to payload and a filename parameter - (the filename will be expressed in ASCII). -- Content-MD5 (text): md5 checksum hex encoded of the tarball -- Packaging (text): http://purl.org/net/sword/package/SimpleZip - (packaging format used on the Media Part) -- In-Progress (bool): true|false; true means partial upload and we can expect - other requests in the future, false means the deposit is done. -- add metadata formats or foreign markup to the atom:entry element - - -### sample request for multipart deposit: - -``` xml -POST deposit HTTP/1.1 -Host: archive.softwareheritage.org -Content-Length: [content length] -Content-Type: multipart/related; - boundary="===============1605871705=="; - type="application/atom+xml" -In-Progress: false -MIME-Version: 1.0 - -Media Post ---===============1605871705== -Content-Type: application/atom+xml; charset="utf-8" -Content-Disposition: attachment; name="atom" -MIME-Version: 1.0 - - - - Title - hal-or-other-archive-id - 2005-10-07T17:17:08Z - Contributor - - - - ---===============1605871705== -Content-Type: application/zip -Content-Disposition: attachment; name=payload; filename=[filename] -Packaging: http://purl.org/net/sword/package/SimpleZip -Content-MD5: [md5-digest] -MIME-Version: 1.0 - -[...binary package data...] ---===============1605871705==-- -``` - -## Deposit Creation - server point of view - -The server receives the request and: - -### [3.1] Validation of the header and body request - - -### [3.2] Server uploads such content in a temporary location (deposit table in a separated DB). -- saves the archives in a temporary location -- executes a md5 checksum on that archive and check it against the - same header information -- adds a deposit entry and retrieves the associated id - - -## [[4] Servers answers the client an 'http 201 Created' with a deposit receipt id in the Location header of - the response. - -The server possible answers are: -- OK: '201 created' + one header 'Location' holding the deposit receipt - id -- KO: with the error status code and associated message - (cf. [possible errors paragraph](#possible errors)). - - -## [5] Deposit Update - -The client previously uploaded an archive and wants to add either new -metadata information or a new version for that previous deposit -(possibly in multiple steps as well). The important thing to note -here is that for swh, this will result in a new version of the -previous deposit in any case. - -Providing the identifier of the previous version deposit received from -the status URI, the client executes a PUT request on the same URI as -the deposit one. - -After validation of the body request, the server: -- uploads such content in a temporary location (to be defined). - -- answers the client an 'http 204 (No content)'. In the Location - header of the response lies a deposit receipt id permitting the - client to check back the operation status later on. - -- Asynchronously, the server will inject the archive uploaded and the - associated metadata. The operation status mentioned earlier is a - reference to that injection operation. The fact that the version is - a new one is dealt with at the injection level. - -URL: PUT /1/deposit/ - -## [6] Deposit Removal - -[#limitation](As explained in the limitation paragraph), removal won't -be implemented. Nothing is removed from the SWH archive. - -The server answers a '405 Method not allowed' error. - - -## [7] Operation Status - -Providing a deposit receipt id, the client asks the operation status -of a prior upload. - -URL: GET /1/software/{deposit_receipt} - -# Possible errors - -## sword:ErrorContent - -IRI: http://purl.org/net/sword/error/ErrorContent - -The supplied format is not the same as that identified in the -Packaging header and/or that supported by the server Associated HTTP - -Status: 415 (Unsupported Media Type) or 406 (Not Acceptable) - -## sword:ErrorChecksumMismatch - -IRI: http://purl.org/net/sword/error/ErrorChecksumMismatch - -Checksum sent does not match the calculated checksum. The server MUST -also return a status code of 412 Precondition Failed - -## sword:ErrorBadRequest - -IRI: http://purl.org/net/sword/error/ErrorBadRequest - -Some parameters sent with the POST/PUT were not understood. The server -MUST also return a status code of 400 Bad Request. - -## sword:MediationNotAllowed - -IRI: http://purl.org/net/sword/error/MediationNotAllowed - -Used where a client has attempted a mediated deposit, but this is not -supported by the server. The server MUST also return a status code of -412 Precondition Failed. - -## sword:MethodNotAllowed - -IRI: http://purl.org/net/sword/error/MethodNotAllowed - -Used when the client has attempted one of the HTTP update verbs (POST, -PUT, DELETE) but the server has decided not to respond to such -requests on the specified resource at that time. The server MUST also -return a status code of 405 Method Not Allowed - -## sword:MaxUploadSizeExceeded - -IRI: http://purl.org/net/sword/error/MaxUploadSizeExceeded - -Used when the client has attempted to supply to the server a file -which exceeds the server's maximum upload size limit - -Associated HTTP Status: 413 (Request Entity Too Large) - -# Tarball Injection - -Providing we use indeed synthetic revision to represent a version of a -tarball injected through the sword use case, this needs to be improved -so that the synthetic revision is created with a parent revision (the -previous known one for the same 'origin'). - - -Note: -- origin may no longer be the right term (we may need a new 'at the - same level' notion, maybe 'deposit'?) - - * deposit is used for the information - - we agreed that for now origin seems fine enough - - -- As there are no authentication, everyone can push a new version for - the same origin so we might need to use the synthetic revision's - author (or committer?) date to discriminate which is the last known - version for the same 'origin'. - Note: - We'll do something simple, the last version is the last one injected. - The order should be enforced by the scheduling part of the injection, respecting the reception date. - We may need another date, the one when the deposit is considered complete and use that date. - - -## Injection path - - origin --> origin_visit --> occurrence & occurrence_history --> revision --> directory (upper level of the uncompressed archive) - ok for me - https://hal.inria.fr/hal-01327170 --> 1 :reception_date --> branch: client's version n° (e.g hal) --> synthetic_revision (tarball) - - -Questions: - - can an update be on a version without having a new version? - No, if something is pushed for the same origin via PUT (update), it will result in a new version (well when the deposit will be complete, injection triggered and done that is) - - For example, depositing only new metadata for the same hal deposit version without providing a new archive can result in a new version targetting the same previous archive. - And in that case, we won't need the archive again since the targetted directory tree won't have changed, we can simply reuse it. - That is, we'll create a new synthetic revision targetting the same tree whose parent revision is the last know revision for that origin. - Is it clear? :D - so we keep raw metadata in the synthetic revision, yes (we need those to have different hash on revision, the revision metadata column is used to compute its hash). - - That makes me think that for the creation (POST). - Once the client has said, deposit done for an origin. - Any further request for that origin should be refused (since they should pass by the PUT endpoint as update). - - Shortcoming: - what about concurrent deposit for the same origin? - How do we distinguish them? - - A: The client should identify each package sent if it belongs to a chuncked deposit or a new request for same deposit - - On SWH, we should treat each request separately as a new deposit ??? i think yes (I'm answering myself) because the date of reception should be new - - and the depposit receipt id should be new as well - - -Actions possible on HAL after deposit is public: - - modify metadata - - add file - - deposit new version - - link ressource - - share property - - use as model - - - - A deposit has one origin, yet an origin can have multiple deposits ? - No, not multiple deposits, multiple requests for the same origin, but in the end, this should end up in one single deposit - (when the client pushes its final request saying deposit 'done' through the header In-Progress). - When I say multiple deposits, I mean multiple versions/ updates on a deposit identified with external_id ok - you are talking about multiple requests in the sense of chuncked deposits yes - - - HAL's deposit 01535619 = SWH's deposit 01535619-1 - - + 1 origin with url:https://hal.inria.fr/medihal-01535619 - - + 1 revision - - + 1 directory - - deposit 01535619-v2 = SWH's deposit 01535619-2 - - + same origin - - + new revision - - + new directory - - - -## Technical - -We will need: -- one dedicated db to store state - swh-deposit - -- one dedicated temporary storage to store archives before injection - -- 'deposit' table: - - id (bigint): deposit receipt id - - external id (text): client's internal identifier (e.g hal's id, etc...). - - origin id : null before injection - - revision id : null before full injection I don't think we should store this as this will move at each new version... - - reception_date: first deposit date - - complete_date: reception date of the last deposit which makes the deposit complete - - metadata: jsonb (raw format before translation) - - status (enum): - -'partially-received', -- when only a part of the deposit was received (through multiple requests) - - -'received', -- deposit is fully received (last request arrived) - - -'injecting', -- injection is ongoing on swh's side - - -'injected', -- injection is successfully done - - - 'failed' -- injection failed due to some error - -- the metadata received with the deposit should be kept in the origin_metadata table - - after translation as part of the injection process - - - what's the origin_metadata table? - - This is the new table we talked with Zack about - - yes, but i wanted some more details - - it's in swh db? - - nothing about metadata is implemented yet - - but it should be in the main db - - right - - still, the nice thing about what we are doing can be untangled yes it's nice - - That is we could run in production the simple deposit stuff (which does not do anything about the deposit injection yet) - - we accept query and store deposits (since we need the scheduling one-shot task as well... which can be worrisome about the delay) - - - - i remember zack and you spoke about it during the 'tech meeting' but i did not follow everything at that time. - - origin bigint PK FK - - visit bigint PK FK // ? - - date date - - provenance_type text // (enum: 'publisher', 'external_catalog' needs to be completed) - - location url // only needed if there are use cases where this differs from origin for external_catalogs - - raw_metadata jsonb // before translation - - indexer_configuration_id bigint FK // tool used for translation - - translated_metadata jsonb // with codemeta schema and terms - - -# SWH Identifier returned? - - swh-- - - e.g: swh-hal-47dc6b4636c7f6cba0df83e3d5490bf4334d987e - - We could have a specific dedicated client table. - -# Nomenclature - -SWORD uses IRI. This means Internationalized Resource Identifier. In -this chapter, we will describe SWH's IRI. - -## SD-IRI - The Service Document IRI - -This is the IRI from which the root service document can be -located. - -## Col-IRI - The Collection IRI - -Only one collection of software is used in this repository. - -Note: -This is the IRI to which the initial deposit will take place, and -which are listed in the Service Document. -Discuss to check if we want to implement this or not. - -## Cont-IRI - The Content IRI - -This is the IRI from which the client will be able to retrieve -representations of the object as it resides in the SWORD server. - -## EM-IRI - The Atom Edit Media IRI - -To simplify, this is the same as the Cont-IRI. - -## Edit-IRI - The Atom Entry Edit IRI - -This is the IRI of the Atom Entry of the object, and therefore also of -the container within the SWORD server. - -## SE-IRI - The SWORD Edit IRI - -This is the IRI to which clients may POST additional content to an -Atom Entry Resource. This MAY be the same as the Edit-IRI, but is -defined separately as it supports HTTP POST explicitly while the -Edit-IRI is defined by [AtomPub] as limited to GET, PUT and DELETE -operations. - -## State-IRI - The SWORD Statement IRI - -This is the one of the IRIs which can be used to retrieve a -description of the object from the sword server, including the -structure of the object and its state. This will be used as the -operation status endpoint. - -# sources - -- [SWORD v2 specification](http://swordapp.github.io/SWORDv2-Profile/SWORDProfile.html) -- [arxiv documentation](https://arxiv.org/help/submit_sword) -- [Dataverse example](http://guides.dataverse.org/en/4.3/api/sword.html) -- [SWORD used on HAL]https://api.archives-ouvertes.fr/docs/sword -- [xml examples for CCSD] https://github.com/CCSDForge/HAL/tree/master/Sword