diff --git a/debian/control b/debian/control index f12acb00..18203639 100644 --- a/debian/control +++ b/debian/control @@ -1,41 +1,42 @@ Source: swh-deposit Maintainer: Software Heritage developers Section: python Priority: optional Build-Depends: debhelper (>= 9), dh-python (>= 2), python3-setuptools, python3-all, python3-nose, python3-django-nose, python3-vcversioner, python3-swh.core (>= 0.0.14~), python3-swh.loader.core (>= 0.0.25~), python3-swh.loader.tar (>= 0.0.29~), python3-swh.scheduler (>= 0.0.19~), python3-django, python3-click, python3-vcversioner, python3-djangorestframework, python3-djangorestframework-xml, python3-requests Standards-Version: 3.9.6 Homepage: https://forge.softwareheritage.org/source/swh-deposit/ Package: python3-swh.deposit Architecture: all Depends: python3-swh.core (>= 0.0.14~), python3-swh.loader.tar (>= 0.0.29~), python3-swh.scheduler (>= 0.0.19~), ${misc:Depends}, ${python3:Depends} Description: Software Heritage Deposit Server -Package: python3-swh.deposit.injection +Package: python3-swh.deposit.loader +Conflicts: python3-swh.deposit.injection Architecture: all Depends: python3-swh.core (>= 0.0.14~), python3-swh.loader.core (>= 0.0.25~), python3-swh.loader.tar (>= 0.0.29~), python3-swh.scheduler (>= 0.0.19~), python3-requests, ${misc:Depends}, ${python3:Depends} -Description: Software Heritage Deposit Injection +Description: Software Heritage Deposit Loader diff --git a/debian/rules b/debian/rules index d491e874..a4600d2a 100755 --- a/debian/rules +++ b/debian/rules @@ -1,19 +1,19 @@ #!/usr/bin/make -f export PYBUILD_NAME=swh.deposit export PYBUILD_TEST_ARGS=--with-doctest -sv -a !db,!fs %: dh $@ --with python3 --buildsystem=pybuild override_dh_install: dh_install rm -v $(CURDIR)/debian/python3-*/usr/lib/python*/dist-packages/swh/__init__.py for pyvers in $(shell py3versions -vr); do \ - mkdir -p $(CURDIR)/debian/python3-swh.deposit.injection/usr/lib/python$$pyvers/dist-packages/swh/deposit/injection ; \ - mv 
$(CURDIR)/debian/python3-swh.deposit/usr/lib/python$$pyvers/dist-packages/swh/deposit/injection/* \ - $(CURDIR)/debian/python3-swh.deposit.injection/usr/lib/python$$pyvers/dist-packages/swh/deposit/injection/ ; \ + mkdir -p $(CURDIR)/debian/python3-swh.deposit.loader/usr/lib/python$$pyvers/dist-packages/swh/deposit/loader ; \ + mv $(CURDIR)/debian/python3-swh.deposit/usr/lib/python$$pyvers/dist-packages/swh/deposit/loader/* \ + $(CURDIR)/debian/python3-swh.deposit.loader/usr/lib/python$$pyvers/dist-packages/swh/deposit/loader/ ; \ done override_dh_auto_test: diff --git a/docs/getting-started.md b/docs/getting-started.md index 8d5e8e73..b010ba3d 100644 --- a/docs/getting-started.md +++ b/docs/getting-started.md @@ -1,332 +1,332 @@ # Getting Started This is a getting started to demonstrate the deposit api use case with a shell client. The api is rooted at https://deposit.softwareheritage.org. For more details, see the [main README](./README.md). ## Requirements You need to be referenced on SWH's client list to have: - a credential (needed for the basic authentication step). - an associated collection [Contact us for more information.](https://www.softwareheritage.org/contact/) ## Demonstration For the rest of the document, we will: - reference `` as the client and `` as its associated authentication password. - use curl as example on how to request the api. - present the main deposit use cases. The use cases are: - one single deposit step: The user posts in one query (one deposit) a software source code archive and associated metadata (deposit is finalized with status `ready`). This will demonstrate the multipart query. - another 3-steps deposit (which can be extended as more than 2 steps): 1. Create an incomplete deposit (status `partial`) 2. Update a deposit (and finalize it, so the status becomes `ready`) 3. Check the deposit's state This will demonstrate the stateful nature of the sword protocol. 
Those use cases share a common part, they must start by requesting the `service document iri` (internationalized resource identifier) for information about the collection's location. ### Common part - Start with the service document First, to determine the *collection iri* onto which deposit data, the client needs to ask the server where is its *collection* located. That is the role of the *service document iri*. For example: ``` Shell curl -i --user : https://deposit.softwareheritage.org/1/servicedocument/ ``` If everything went well, you should have received a response similar to this: ``` Shell HTTP/1.0 200 OK Server: WSGIServer/0.2 CPython/3.5.3 Content-Type: application/xml 2.0 209715200 The Software Heritage (SWH) Archive Software Collection application/zip Collection Policy Software Heritage Archive Collect, Preserve, Share false http://purl.org/net/sword/package/SimpleZip https://deposit.softwareheritage.org/1// ``` Explaining the response: - `HTTP/1.0 200 OK`: the query is successful and returns a body response - `Content-Type: application/xml`: The body response is in xml format - `body response`: it is a service document describing that the client `` has a collection named ``. That collection is available at the *collection iri* `/1//` (through POST query). At this level, if something went wrong, this should be authentication related. So the response would have been a 401 Unauthorized access. Something like: ``` Shell curl -i https://deposit.softwareheritage.org/1// HTTP/1.0 401 Unauthorized Server: WSGIServer/0.2 CPython/3.5.3 Content-Type: application/xml WWW-Authenticate: Basic realm="" X-Frame-Options: SAMEORIGIN Access to this api needs authentication processing failed ``` ### Single deposit A single deposit translates to a multipart deposit request. 
This means, in swh's deposit's terms, sending exactly one POST query with: - 1 archive (`content-type application/zip`) - 1 atom xml content (`content-type: application/atom+xml;type=entry`) The supported archive, for now are limited to zip files. Those archives are expected to contain some form of software source code. The atom entry content is some xml defining metadata about that software. Example of minimal atom entry file: ``` XML Title urn:uuid:1225c695-cfb8-4ebb-aaaa-80da344efa6a 2005-10-07T17:17:08Z Contributor The abstract The abstract Access Rights Alternative Title Date Available Bibliographic Citation Contributor Description Has Part Has Version Identifier Is Part Of Publisher References Rights Holder Source Title Type ``` Once the files are ready for deposit, we want to do the actual deposit in one shot. For this, we need to provide: - the contents and their associated correct content-types - either the header `In-Progress` to false (meaning, it's finished after this query) or nothing (the server will assume it's not in progress if not present). - Optionally, the `Slug` header, which is a reference to a unique identifier the client knows about and wants to provide us. You can do this with the following command: ``` Shell curl -i --user : \ -F "file=@deposit.zip;type=application/zip;filename=payload" \ -F "atom=@atom-entry.xml;type=application/atom+xml;charset=UTF-8" \ -H 'In-Progress: false' \ -H 'Slug: some-external-id' \ -XPOST https://deposit.softwareheritage.org/1// ``` You just posted a deposit to the collection https://deposit.softwareheritage.org/1//. If everything went well, you should have received a response similar to this: ``` Shell HTTP/1.0 201 Created Server: WSGIServer/0.2 CPython/3.5.3 Location: /1//10/metadata/ Content-Type: application/xml 9 Sept. 26, 2017, 10:11 a.m. 
payload ready http://purl.org/net/sword/package/SimpleZip ``` Explaining this response: - `HTTP/1.0 201 Created`: the deposit is successful - `Location: /1//10/metadata/`: the EDIT-SE-IRI through which we can update a deposit - body response: it is a deposit receipt detailing all endpoints available to manipulate the deposit (update, replace, delete, etc...) It also explains the deposit identifier to be 9 (which is useful for the remaining example). Note: As the deposit is in `ready` status (meaning ready to be injected), you cannot actually update anything after this query. Well, the client can try, but it will be answered with a 403 forbidden answer. ### Multi-steps deposit 1. Create a deposit We will use the collection IRI again as the starting point. We need to explicitely give to the server information about: - the deposit's completeness (through header `In-Progress` to true, as we want to do in multiple steps now). - archive's md5 hash (through header `Content-MD5`) - upload's type (through the headers `Content-Disposition` and `Content-Type`) The following command: ``` Shell curl -i --user : \ --data-binary @swh/deposit.zip \ -H 'In-Progress: true' \ -H 'Content-MD5: 0faa1ecbf9224b9bf48a7c691b8c2b6f' \ -H 'Content-Disposition: attachment; filename=[deposit.zip]' \ -H 'Slug: some-external-id' \ -H 'Packaging: http://purl.org/net/sword/package/SimpleZIP' \ -H 'Content-type: application/zip' \ -XPOST https://deposit.softwareheritage.org/1// ``` The expected answer is the same as the previous sample. 2. Update deposit's metadata To update a deposit, we can either add some more archives, some more metadata or replace existing ones. As we don't have defined metadata yet (except for the `slug` header), we can add some to the `EDIT-SE-IRI` endpoint (/1//10/metadata/). That information is extracted from the deposit receipt sample. Using here the same atom-entry.xml file presented in previous chapter. 
For example, here is the command to update deposit metadata: ``` Shell curl -i --user : --data-binary @atom-entry.xml \ -H 'In-Progress: true' \ -H 'Slug: some-external-id' \ -H 'Content-Type: application/atom+xml;type=entry' \ -XPOST https://deposit.softwareheritage.org/1//10/metadata/ HTTP/1.0 201 Created Server: WSGIServer/0.2 CPython/3.5.3 Location: /1//10/metadata/ Content-Type: application/xml 10 Sept. 26, 2017, 10:32 a.m. None partial http://purl.org/net/sword/package/SimpleZip ``` 3. Check the deposit's state You need to check the STATE-IRI endpoint (/1//10/status/). ``` Shell curl -i --user : https://deposit.softwareheritage.org/1//10/status/ HTTP/1.0 200 OK Date: Wed, 27 Sep 2017 08:25:53 GMT Content-Type: application/xml ``` Response: ``` XML 9 ready - deposit is fully received and ready for injection + deposit is fully received and ready for loading ``` diff --git a/docs/index.rst b/docs/index.rst index 5e7338f2..9ec3e948 100644 --- a/docs/index.rst +++ b/docs/index.rst @@ -1,22 +1,22 @@ .. _swh-deposit: Software Heritage Deposit ========================= .. toctree:: :maxdepth: 3 :caption: Contents: getting-started.md spec-api.md metadata.md - spec-injection.md + spec-loading.md dev-info.md sys-info.md Indices and tables ================== * :ref:`genindex` * :ref:`modindex` * :ref:`search` diff --git a/docs/spec-api.md b/docs/spec-api.md index 6899f683..5f9f85b2 100644 --- a/docs/spec-api.md +++ b/docs/spec-api.md @@ -1,790 +1,790 @@ # API Specification This is [Software Heritage](https://www.softwareheritage.org)'s [SWORD 2.0](http://swordapp.github.io/SWORDv2-Profile/SWORDProfile.html) Server implementation. **S.W.O.R.D** (**S**imple **W**eb-Service **O**ffering **R**epository **D**eposit) is an interoperability standard for digital file deposit. This implementation will permit interaction between a client (a repository) and a server (SWH repository) to permit deposits of software source code archives and associated metadata. 
*Note:* In the following document, we will use the `archive` or `software source code archive` interchangeably. ## Collection SWORD defines a `collection` concept. In SWH's case, this collection refers to a group of deposits. A `deposit` is some form of software source code archive(s) associated with metadata. *Note:* It may be multiple archives if one archive is too big and must be splitted into multiple smaller ones. ### Example As part of the [HAL](https://hal.archives-ouvertes.fr/)-[SWH](https://www.softwareheritage.org) collaboration, we define a `HAL collection` to which the `hal` client will have access to. ## Limitations We will not have a fully compliant SWORD 2.0 protocol at first, so voluntary implementation shortcomings can exist, for example, only zip tarballs will be accepted. Other more permanent limitations exists: - upload limitation of 100Mib - no mediation ## Endpoints Here are the defined endpoints this document will refer to from this point on: - `/1/servicedocument/` *service document iri* (a.k.a [SD-IRI](#sd-iri-the-service-document-iri)) *Goal:* For a client to discover its collection's location - `/1//` *collection iri* (a.k.a [COL-IRI](#col-iri-the-collection-iri)) *Goal:*: create deposit to a collection - `/1///media/` *update iri* (a.k.a [EM-IRI](#em-iri-the-atom-edit-media-iri)) *Goal:*: Add or replace archive(s) to a deposit - `/1///metadata/` *update iri* (a.k.a [EDIT-IRI](#edit-iri-the-atom-entry-edit-iri) merged with [SE-IRI](#se-iri-the-sword-edit-iri)) *Goal:*: Add or replace metadata (and optionally archive(s) to a deposit - `/1///status/` *state iri* (a.k.a [STATE-IRI](#state-iri-the-sword-statement-iri)) - *Goal:*: Display deposit's status in regards to injection + *Goal:*: Display deposit's status in regards to loading - `/1///content/` *content iri* (a.k.a [CONT-FILE-IRI](#cont-iri-the-content-iri)) *Goal:*: Display information on the content's representation in the sword server ## Use cases ### Deposit creation From client's 
deposit repository server to SWH's repository server: [1.] The client requests for the server's abilities and its associated collection (GET query to the *SD/service document uri*) [2.] The server answers the client with the service document which gives the *collection uri* (also known as *COL/collection IRI*). [3.] The client sends a deposit (optionally a zip archive, some metadata or both) through the *collection uri*. This can be done in: - one POST request (metadata + archive). - one POST request (metadata or archive) + other PUT or POST request to the *update uris* (*edit-media iri* or *edit iri*) [3.1.] Server validates the client's input or returns detailed error if any [3.2.] Server stores information received (metadata or software archive source code or both) [4.] The server notifies the client it acknowledged the client's request. An `http 201 Created` response with a deposit receipt in the body response is sent back. That deposit receipt will hold the necessary information to eventually complete the deposit later on if it was incomplete (also known as status `partial`). #### Schema representation ![](/images/deposit-create-chart.png) ### Updating an existing deposit [5.] Client updates existing deposit through the *update uris* (one or more POST or PUT requests to either the *edit-media iri* or *edit iri*). [5.1.] Server validates the client's input or returns detailed error if any [5.2.] Server stores information received (metadata or software archive source code or both) This would be the case for example if the client initially posted a `partial` deposit (e.g. only metadata with no archive, or an archive without metadata, or a splitted archive because the initial one exceeded the limit size imposed by swh repository deposit) #### Schema representation ![](/images/deposit-update-chart.png) ### Deleting deposit (or associated archive, or associated metadata) [6.] Deposit deletion is possible as long as the deposit is still in `partial` state. [6.1.] 
Server validates the client's input or returns detailed error if any [6.2.] Server actually delete information according to request #### Schema representation ![](/images/deposit-delete-chart.png) ### Client asks for operation status [7.] Operation status can be read through a GET query to the *state iri*. -### Server: Triggering injection +### Server: Triggering loading Once the status `ready` is reached for a deposit, the server will inject the archive(s) sent and the associated metadata. -This is described in the [injection document](./spec-injection.html). +This is described in the [loading document](./spec-loading.html). ## API overview API access is over HTTPS. The API is protected through basic authentication. The API endpoints are rooted at [https://deposit.softwareheritage.org/1/](https://deposit.softwareheritage.org/1/). Data is sent and received as XML (as specified in the SWORD 2.0 specification). In the following chapters, we will described the different endpoints [through the use cases described previously.](#use-cases) ### [2] Service document Endpoint: GET /1/servicedocument/ This is the starting endpoint for the client to discover its initial collection. The answer to this query will describes: - the server's abilities - connected client's collection information Also known as: [SD-IRI - The Service Document IRI](#sd-iri-the-service-document-iri). #### Sample request ``` Shell GET https://deposit.softwareheritage.org/1/servicedocument/ HTTP/1.1 Host: deposit.softwareheritage.org ``` The server returns its abilities with the service document in xml format: - protocol sword version v2 - accepted mime types: application/zip - upload max size accepted. Beyond that point, it's expected the client splits its tarball into multiple ones - the collection the client can act upon (swh supports only one software collection per client) - mediation is not supported - etc... 
The current answer for example for the [hal archive](https://hal.archives-ouvertes.fr/) is: ``` XML 2.0 20971520 The Software Heritage (SWH) archive SWH Software Archive application/zip Collection Policy Software Heritage Archive false false Collect, Preserve, Share http://purl.org/net/sword/package/SimpleZip https://deposit.softwareheritage.org/1/hal/ ``` ### [3|5] Deposit creation/update The client can send deposit creation/update through a series of deposit requests to the following endpoints: - *collection iri* (COL-IRI) to initialize a deposit - *update iris* (EM-IRI, EDIT-SE-IRI) to complete/finalize a deposit The deposit creation/update can also happens in one request. The deposit request can contain: - an archive holding the software source code (binary upload) - an envelop with metadata describing information regarding a deposit (atom entry deposit) - or both (multipart deposit, exactly one archive and one envelop). #### Request Types ##### Binary deposit The client can deposit a binary archive, supplying the following headers: - Content-Type (text): accepted mimetype - Content-Length (int): tarball size - Content-MD5 (text): md5 checksum hex encoded of the tarball - Content-Disposition (text): attachment; filename=[filename] ; the filename parameter must be text (ascii) - Packaging (IRI): http://purl.org/net/sword/package/SimpleZip - In-Progress (bool): true to specify it's not the last request, false to specify it's a final request and the server can go on with processing the request's information (if not provided, this is considered false, so final). This is a single zip archive deposit. Almost no metadata is associated with the archive except for the unique external identifier. *Note:* This kind of deposit should be `partial` (In-Progress: True) as almost no metadata can be associated with the uploaded archive. 
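The headers listed for a binary deposit can be assembled programmatically. The following is a minimal sketch, not the project's client code: the function name `binary_deposit_headers` and its parameters are hypothetical, and the bracketed `filename=[...]` form simply mirrors the curl sample in this document rather than a confirmed server requirement.

```python
import hashlib


def binary_deposit_headers(archive: bytes, filename: str,
                           slug: str, in_progress: bool) -> dict:
    """Build the headers described above for a binary (zip) deposit.

    Hypothetical helper: names and signature are illustrative only.
    """
    return {
        "Content-Type": "application/zip",
        # Size of the archive in bytes
        "Content-Length": str(len(archive)),
        # Hex-encoded md5 checksum of the archive
        "Content-MD5": hashlib.md5(archive).hexdigest(),
        # Bracket form copied from the curl sample (assumption)
        "Content-Disposition": f"attachment; filename=[{filename}]",
        "Packaging": "http://purl.org/net/sword/package/SimpleZip",
        # "true" keeps the deposit partial; "false" finalizes it
        "In-Progress": "true" if in_progress else "false",
        # Unique external identifier chosen by the client
        "Slug": slug,
    }
```

The resulting dictionary could then be passed to any HTTP client together with the raw archive bytes as the request body.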
##### API endpoints concerned POST /1// Create a first deposit with one archive PUT /1///media/ Replace existing archives POST /1///media/ Add new archive ##### Sample request ``` Shell curl -i -u hal: \ --data-binary @swh/deposit.zip \ -H 'In-Progress: false' -H 'Content-MD5: 0faa1ecbf9224b9bf48a7c691b8c2b6f' \ -H 'Content-Disposition: attachment; filename=[deposit.zip]' \ -H 'Slug: some-external-id' \ -H 'Packaging: http://purl.org/net/sword/package/SimpleZIP' \ -H 'Content-type: application/zip' \ -XPOST https://deposit.softwareheritage.org/1/hal/ ``` #### Atom entry deposit The client can deposit an xml body holding metadata information on the deposit. *Note:* This kind of deposit is mostly expected to be `partial` (In-Progress: True) since no archive will be associated to those metadata. ##### API endpoints concerned POST /1// Create a first atom deposit entry PUT /1///metadata/ Replace existing metadata POST /1///metadata/ Add new metadata to deposit ##### Sample request Sample query: ``` Shell curl -i -u hal: --data-binary @atom-entry.xml \ -H 'In-Progress: false' \ -H 'Slug: some-external-id' \ -H 'Content-Type: application/atom+xml;type=entry' \ -XPOST https://deposit.softwareheritage.org/1/hal/ HTTP/1.0 201 Created Date: Tue, 26 Sep 2017 10:32:35 GMT Server: WSGIServer/0.2 CPython/3.5.3 Vary: Accept, Cookie Allow: GET, POST, PUT, DELETE, HEAD, OPTIONS Location: /1/hal/10/metadata/ X-Frame-Options: SAMEORIGIN Content-Type: application/xml 10 Sept. 26, 2017, 10:32 a.m. 
None ready http://purl.org/net/sword/package/SimpleZip ``` Sample body: ``` XML Title urn:uuid:1225c695-cfb8-4ebb-aaaa-80da344efa6a 2005-10-07T17:17:08Z Contributor The abstract The abstract Access Rights Alternative Title Date Available Bibliographic Citation # noqa Contributor Description Has Part Has Version Identifier Is Part Of Publisher References Rights Holder Source Title Type ``` #### One request deposit / Multipart deposit The one request deposit is a single request containing both the metadata (as atom entry attachment) and the archive (as payload attachment). Thus, it is a multipart deposit. Client provides: - Content-Disposition (text): header of type 'attachment' on the Entry Part with a name parameter set to 'atom' - Content-Disposition (text): header of type 'attachment' on the Media Part with a name parameter set to payload and a filename parameter (the filename will be expressed in ASCII). - Content-MD5 (text): md5 checksum hex encoded of the tarball - Packaging (text): http://purl.org/net/sword/package/SimpleZip (packaging format used on the Media Part) - In-Progress (bool): true|false; true means `partial` upload and we can expect other requests in the future, false means the deposit is done. 
- add metadata formats or foreign markup to the atom:entry element ##### API endpoints concerned POST /1// Create a full deposit (metadata + archive) PUT /1///metadata/ Replace existing metadata and archive POST /1///metadata/ Add new metadata and archive to deposit ##### Sample request Sample query: ``` Shell curl -i -u hal: \ -F "file=@../deposit.json;type=application/zip;filename=payload" \ -F "atom=@../atom-entry.xml;type=application/atom+xml;charset=UTF-8" \ -H 'In-Progress: false' \ -H 'Slug: some-external-id' \ -XPOST https://deposit.softwareheritage.org/1/hal/ HTTP/1.0 201 Created Date: Tue, 26 Sep 2017 10:11:55 GMT Server: WSGIServer/0.2 CPython/3.5.3 Vary: Accept, Cookie Allow: GET, POST, PUT, DELETE, HEAD, OPTIONS Location: /1/hal/9/metadata/ X-Frame-Options: SAMEORIGIN Content-Type: application/xml 9 Sept. 26, 2017, 10:11 a.m. payload ready http://purl.org/net/sword/package/SimpleZip ``` Sample content: ``` XML POST deposit HTTP/1.1 Host: deposit.softwareheritage.org Content-Length: [content length] Content-Type: multipart/related; boundary="===============1605871705=="; type="application/atom+xml" In-Progress: false MIME-Version: 1.0 Media Post --===============1605871705== Content-Type: application/atom+xml; charset="utf-8" Content-Disposition: attachment; name="atom" MIME-Version: 1.0 Title hal-or-other-archive-id 2005-10-07T17:17:08Z Contributor The abstract Access Rights Alternative Title Date Available Bibliographic Citation # noqa Contributor Description Has Part Has Version Identifier Is Part Of Publisher References Rights Holder Source Title Type --===============1605871705== Content-Type: application/zip Content-Disposition: attachment; name=payload; filename=[filename] Packaging: http://purl.org/net/sword/package/SimpleZip Content-MD5: [md5-digest] MIME-Version: 1.0 [...binary package data...] 
--===============1605871705==-- ``` ## Deposit Creation - server point of view The server receives the request(s) and does minimal checking on the input prior to any saving operations. ### [3|5|6.1] Validation of the header and body request Any kind of errors can happen, here is the list depending on the situation: - common errors: - 401 (unauthenticated) if a client does not provide credential or provide wrong ones - 403 (forbidden) if a client tries access to a collection it does not own - 404 (not found) if a client tries access to an unknown collection - 404 (not found) if a client tries access to an unknown deposit - 415 (unsupported media type) if a wrong media type is provided to the endpoint - archive/binary deposit: - 403 (forbidden) if the length of the archive exceeds the max size configured - 412 (precondition failed) if the length or hash provided mismatch the reality of the archive. - 415 (unsupported media type) if a wrong media type is provided - multipart deposit: - 412 (precondition failed) if the md5 hash provided mismatch the reality of the archive - 415 (unsupported media type) if a wrong media type is provided - Atom entry deposit: - 400 (bad request) if the request's body is empty (for creation only) ### [3|5|6.2] Server uploads the content in a temporary location Using an objstorage, the server stores the archive in a temporary location. It's deemed temporary the time the deposit is completed -(status becomes `ready`) and the injection finishes. +(status becomes `ready`) and the loading finishes. The server also persists requests' information in a database. ### [4] Servers answers the client If everything went well, the server answers either with a 200, 201 or 204 response (depending on the actual endpoint) A `http 200` response is returned for GET endpoints. A `http 201 Created` response is returned for POST endpoints. The body holds the deposit receipt. The headers holds the EDIT-IRI in the Location header of the response. 
A `http 204 No Content` response is returned for PUT, DELETE endpoints. If something went wrong, the server answers with one of the [error status code and associated message mentioned](#possible errors)). ### [5] Deposit Update The client previously deposited a `partial` document (through an archive, metadata, or both). The client wants to update information for that previous deposit (possibly in multiple steps as well). The important thing to note here is that, as long as the deposit is in -status `partial`, the injection did not start. Thus, the client can +status `partial`, the loading did not start. Thus, the client can update information (replace or add new archive, new metadata, even delete) for that same `partial` deposit. When the deposit status changes to `ready`, the client can no longer change the deposit's information (a 403 will be returned in that case). Then aggregation of all those deposit's information will later be used -for the actual injection. +for the actual loading. Providing the collection name, and the identifier of the previous deposit id received from the deposit receipt, the client executes a POST or PUT request on the *update iris*. After validation of the body request, the server: - uploads such content in a temporary location - answers the client an `http 204 (No content)`. In the Location header of the response lies an iri to permit further update. - Asynchronously, the server will inject the archive uploaded and the associated metadata. An operation status endpoint *state iri* - permits the client to query the injection operation status. + permits the client to query the loading operation status. 
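The `partial`/`ready` rule described above can be summarized in a few lines. This is only an illustrative sketch of the documented behavior, not the server's implementation: both function names are hypothetical, and treating every non-`partial` status as non-updatable is an assumption (the text only mentions `partial` and `ready`).

```python
from typing import Optional


def status_after_request(in_progress: Optional[str]) -> str:
    """Deposit status implied by the In-Progress header of a request.

    A missing header is treated as "false", i.e. a final request.
    """
    if in_progress is not None and in_progress.strip().lower() == "true":
        return "partial"
    return "ready"


def update_error_status(deposit_status: str) -> Optional[int]:
    """HTTP error returned for an update attempt, or None if allowed.

    Assumption: any status other than `partial` rejects updates with 403.
    """
    return None if deposit_status == "partial" else 403
```

For instance, a deposit created with `In-Progress: true` stays `partial` and accepts updates, while a later request without the header moves it to `ready`, after which updates are answered with a 403.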
#### Possible update endpoints PUT /1///media/ Replace existing archives for the deposit POST /1///media/ Add new archives to the deposit PUT /1///metadata/ Replace existing metadata (and possible archives) POST /1///metadata/ Add new metadata ### [6] Deposit Removal As long as the deposit's status remains `partial`, it's possible to remove the deposit entirely or remove only the deposit's archive(s). If the deposit has been removed, further querying that deposit will return a *404* response. If only the deposit's archive(s) have been removed, the deposit can still be updated through further queries. ### Operation Status Providing a collection name and a deposit id, the client asks for the operation status of a prior deposit. URL: GET /1///status/ This returns: - *200* response with the actual status - *404* if the deposit does not exist (or no longer does) ## Possible errors ### sword:ErrorContent IRI: `http://purl.org/net/sword/error/ErrorContent` The supplied format is not the same as that identified in the Packaging header and/or that supported by the server. Associated HTTP status: *415 (Unsupported Media Type)* ### sword:ErrorChecksumMismatch IRI: `http://purl.org/net/sword/error/ErrorChecksumMismatch` Checksum sent does not match the calculated checksum. Associated HTTP status: *412 Precondition Failed* ### sword:ErrorBadRequest IRI: `http://purl.org/net/sword/error/ErrorBadRequest` Some parameters sent with the POST/PUT were not understood. Associated HTTP status: *400 Bad Request* ### sword:MediationNotAllowed IRI: `http://purl.org/net/sword/error/MediationNotAllowed` Used where a client has attempted a mediated deposit, but this is not supported by the server.
Associated HTTP status: *412 Precondition Failed* ### sword:MethodNotAllowed IRI: `http://purl.org/net/sword/error/MethodNotAllowed` Used when the client has attempted one of the HTTP update verbs (POST, PUT, DELETE) but the server has decided not to respond to such requests on the specified resource at that time. Associated HTTP Status: *405 Method Not Allowed* ### sword:MaxUploadSizeExceeded IRI: `http://purl.org/net/sword/error/MaxUploadSizeExceeded` Used when the client has attempted to supply to the server a file which exceeds the server's maximum upload size limit Associated HTTP Status: *413 (Request Entity Too Large)* ### sword:Unauthorized IRI: `http://purl.org/net/sword/error/ErrorUnauthorized` The access to the api is through authentication. Associated HTTP status: *401* ### sword:Forbidden IRI: `http://purl.org/net/sword/error/ErrorForbidden` The action is forbidden (access to another collection for example). Associated HTTP status: *403* ## Nomenclature SWORD uses IRI notion, Internationalized Resource Identifier. In this chapter, we will describe SWH's IRIs. ### SD-IRI - The Service Document IRI The Service Document IRI. This is the IRI from which the client can discover its collection IRI. HTTP verbs supported: *GET* ### Col-IRI - The Collection IRI The software collection associated to one user. The SWORD Collection IRI is the IRI to which the initial deposit will take place, and which is listed in the Service Document. Following our previous example, this is: https://deposit.softwareheritage.org/1/hal/. HTTP verbs supported: *POST* ### Cont-IRI - The Content IRI This is the endpoint which permits the client to retrieve representations of the object as it resides in the SWORD server. This will display information about the content and its associated metadata. HTTP verbs supported: *GET* *Note:* We also refer to it as *Cont-File-IRI*. ### EM-IRI - The Atom Edit Media IRI This is the endpoint to upload other related archives for the same deposit. 
It is used to change a `partial` deposit in regards of archives, in particular: - replace existing archives with new ones - add new archives - delete archives from a deposit Example use case: A first archive to put exceeds the deposit's limit size. The client can thus split the archives in multiple ones. Post a first `partial` archive to the Col-IRI (with In-Progress: True). Then, in order to complete the deposit, POST the other remaining archives to the EM-IRI (the last one with the In-Progress header to False). HTTP verbs supported: *POST*, *PUT*, *DELETE* ### Edit-IRI - The Atom Entry Edit IRI This is the endpoint to change a `partial` deposit in regards of metadata. In particular: - replace existing metadata (and archives) with new ones - add new metadata (and archives) - delete deposit HTTP verbs supported: *POST*, *PUT*, *DELETE* *Note:* We also refer to it as *Edit-SE-IRI*. ### SE-IRI - The SWORD Edit IRI The sword specification permits to merge this with EDIT-IRI, so we did. *Note:* We also refer to it as *Edit-SE-IRI*. ### State-IRI - The SWORD Statement IRI This is the IRI which can be used to retrieve a description of the object from the sword server, including the structure of the object and its state. This will be used as the operation status endpoint. 
HTTP verbs supported: *GET* ## Sources - [SWORD v2 specification](http://swordapp.github.io/SWORDv2-Profile/SWORDProfile.html) - [arxiv documentation](https://arxiv.org/help/submit_sword) - [Dataverse example](http://guides.dataverse.org/en/4.3/api/sword.html) - [SWORD used on HAL](https://api.archives-ouvertes.fr/docs/sword) - [xml examples for CCSD](https://github.com/CCSDForge/HAL/tree/master/Sword) diff --git a/docs/spec-injection.md b/docs/spec-loading.md similarity index 82% rename from docs/spec-injection.md rename to docs/spec-loading.md index fe74dfef..385e2234 100644 --- a/docs/spec-injection.md +++ b/docs/spec-loading.md @@ -1,220 +1,220 @@ -# Injection specification (draft) +# Loading specification (draft) -This part discusses the deposit injection part on the server side. +This part discusses the deposit loading part on the server side. -## Tarball Injection +## Tarball Loading The `swh-loader-tar` module is already able to inject tarballs in swh with very limited metadata (mainly the origin). -The injection of the deposit will use the deposit's associated data: +The loading of the deposit will use the deposit's associated data: - the metadata - the archive(s) We will use the `synthetic` revision notion. To that revision will be associated the metadata. Those will be included in the hash computation, thus resulting in a unique identifier. -### Injection mapping +### Loading mapping Some of those metadata will also be included in the `origin_metadata` table. 
``` origin | https://hal.inria.fr/hal-id | ------------------------------------|----------------------------------------| origin_visit | 1 :reception_date | origin_metadata | aggregated metadata | occurrence & occurrence_history | branch: client's version n° (e.g hal) | revision | synthetic_revision (tarball) | directory | upper level of the uncompressed archive| ``` -### Questions raised concerning injection +### Questions raised concerning loading - A deposit has one origin, yet an origin can have multiple deposits? No, an origin can have multiple requests for the same deposit. Which should end up in one single deposit (when the client pushes its final request saying deposit 'done' through the header In-Progress). Only update of existing 'partial' deposit is permitted. Other than that, the deposit 'update' operation. To create a new version of a software (already deposited), the client must prior to this create a new deposit. -Illustration First deposit injection: +Illustration First deposit loading: HAL's deposit 01535619 = SWH's deposit **01535619-1** + 1 origin with url:https://hal.inria.fr/medihal-01535619 + 1 synthetic revision + 1 directory HAL's update on deposit 01535619 = SWH's deposit **01535619-2** (*with HAL updates can only be on the metadata and a new version is required if the content changes) + 1 origin with url:https://hal.inria.fr/medihal-01535619 + new synthetic revision (with new metadata) + same directory HAL's deposit 01535619-v2 = SWH's deposit **01535619-v2-1** + same origin + new revision + new directory ## Technical details ### Requirements - one dedicated database to store the deposit's state - swh-deposit - one dedicated temporary objstorage to store archives before - injection + loading - one client to test the communication with SWORD protocol ### Deposit reception schema - SWORD imposes the use of basic authentication, so we need a way to authenticate client. 
Also, a client can access collections: **deposit_client** table: - id (bigint): Client's identifier - username (str): Client's username - password (pass): Client's crypted password - collections ([id]): List of collections the client can access - Collections group deposits together: **deposit_collection** table: - id (bigint): Collection's identifier - name (str): Collection's human readable name - A deposit is the main object the repository is all about: **deposit** table: - id (bigint): deposit's identifier - reception_date (date): First deposit's reception date - - complete_data (date): Date when the deposit is deemed complete and ready for injection + - complete_date (date): Date when the deposit is deemed complete and ready for loading - collection (id): The collection the deposit belongs to - external id (text): client's internal identifier (e.g hal's id, etc...). - client_id (id) : Client which did the deposit - - swh_id (str) : swh identifier result once the injection is complete + - swh_id (str) : swh identifier result once the loading is complete - status (enum): The deposit's current status - As mentioned, a deposit can have a status, whose possible values are: ``` text 'partial', -- the deposit is new or partially received since it -- can be done in multiple requests 'expired', -- deposit has been there too long and is now deemed -- ready to be garbage collected 'ready-for-checks' -- ready for checks to ensure data coherency - 'ready', -- deposit is fully received, checked, and ready for injection - 'injecting, -- injection is ongoing on swh's side - 'success', -- injection is successful - 'failure' -- injection is a failure + 'ready-for-load', -- deposit is fully received, checked, and ready for loading + 'loading', -- loading is ongoing on swh's side + 'success', -- loading is successful + 'failure' -- loading is a failure ``` A deposit is stateful and can be made in multiple requests: **deposit_request** table: - id (bigint): identifier - type (id):
deposit request's type (possible values: 'archive', 'metadata') - deposit_id (id): deposit whose request belongs to - metadata: metadata associated to the request - date (date): date of the requests Information sent along a request are stored in a `deposit_request` row. They can be either of type `metadata` (atom entry, multipart's atom entry part) or of type `archive` (binary upload, multipart's binary upload part). When the deposit is complete (status `ready`), those `metadata` and `archive` deposit requests will be read and aggregated. They will then -be sent as parameters to the injection routine. +be sent as parameters to the loading routine. -During injection, some of those metadata are kept in the +During loading, some of those metadata are kept in the `origin_metadata` table and some other are stored in the `revision` -table (see [metadata injection](#metadata-injection)). +table (see [metadata loading](#metadata-loading)). The only update actions occurring on the deposit table are in regards of: - status changing: - `partial` -> {`expired`/`ready`}, - `ready` -> `injecting`, - `injecting` -> {`success`/`failure`} - `complete_date` when the deposit is finalized (when the status is changed to ready) -- `swh-id` is populated once we have the injection result +- `swh-id` is populated once we have the loading result #### SWH Identifier returned The synthetic revision id e.g: 47dc6b4636c7f6cba0df83e3d5490bf4334d987e -### Scheduling injection +### Scheduling loading All `archive` and `metadata` deposit requests should be aggregated -before injection. +before loading. -The injection should be scheduled via the scheduler's api. +The loading should be scheduled via the scheduler's api. -Only `ready` deposit are concerned by the injection. +Only `ready` deposit are concerned by the loading. 
-When the injection is done and successful, the deposit entry is +When the loading is done and successful, the deposit entry is updated: - `status` is updated to `success` - `swh-id` is populated with the resulting hash (cf. [swh identifier](#swh-identifier-returned)) -- `complete_date` is updated to the injection's finished time +- `complete_date` is updated to the loading's finished time -When the injection is failed, the deposit entry is updated: +When the loading fails, the deposit entry is updated: - `status` is updated to `failure` - `swh-id` and `complete_data` remains as is *Note:* As a further improvement, we may prefer having a retry policy with graceful delays for further scheduling. -### Metadata injection +### Metadata loading - the metadata received with the deposit should be kept in the -`origin_metadata` table before translation as part of the injection +`origin_metadata` table before translation as part of the loading process and an indexation process should be scheduled.
- provider_id and tool_id are resolved by the prepare_metadata method in the loader-core - the origin_metadata entry is sent to storage by the send_origin_metadata in the loader-core origin_metadata table: ``` id bigint PK origin bigint discovery_date date provider_id bigint FK // (from provider table) tool_id bigint FK // indexer_configuration_id tool used for extraction metadata jsonb // before translation ``` diff --git a/setup.py b/setup.py index 53a0b677..3e62a5e1 100644 --- a/setup.py +++ b/setup.py @@ -1,33 +1,33 @@ from setuptools import setup, find_packages def parse_requirements(): requirements = [] for reqf in ('requirements.txt', 'requirements-swh.txt'): with open(reqf) as f: for line in f.readlines(): line = line.strip() if not line or line.startswith('#'): continue requirements.append(line) return requirements setup( name='swh.deposit', description='Software Heritage Deposit Server', author='Software Heritage developers', author_email='swh-devel@inria.fr', url='https://forge.softwareheritage.org/source/swh-deposit/', packages=find_packages(), scripts=[], # scripts to package install_requires=parse_requirements(), extras_require={ - 'injection': ['swh.loader.core >= 0.0.19', - 'swh.scheduler >= 0.0.17', - 'requests'], + 'loader': ['swh.loader.core >= 0.0.19', + 'swh.scheduler >= 0.0.17', + 'requests'], }, setup_requires=['vcversioner'], vcversioner={}, include_package_data=True, ) diff --git a/swh/deposit/config.py b/swh/deposit/config.py index 67095d37..039a94ea 100644 --- a/swh/deposit/config.py +++ b/swh/deposit/config.py @@ -1,84 +1,84 @@ # Copyright (C) 2017 The Software Heritage developers # See the AUTHORS file at the top-level directory of this distribution # License: GNU General Public License version 3, or any later version # See top-level LICENSE file for more information import os import logging from swh.core.config import SWHConfig # IRIs (Internationalized Resource identifier) sword 2.0 specified EDIT_SE_IRI = 'edit_se_iri' EM_IRI = 
'em_iri' CONT_FILE_IRI = 'cont_file_iri' SD_IRI = 'servicedocument' COL_IRI = 'upload' STATE_IRI = 'state_iri' PRIVATE_GET_RAW_CONTENT = 'private-download' PRIVATE_CHECK_DEPOSIT = 'check-deposit' PRIVATE_PUT_DEPOSIT = 'private-update' PRIVATE_GET_DEPOSIT_METADATA = 'private-read' ARCHIVE_KEY = 'archive' METADATA_KEY = 'metadata' ARCHIVE_TYPE = 'archive' METADATA_TYPE = 'metadata' AUTHORIZED_PLATFORMS = ['development', 'production', 'testing'] DEPOSIT_STATUS_REJECTED = 'rejected' DEPOSIT_STATUS_PARTIAL = 'partial' -DEPOSIT_STATUS_READY = 'ready' +DEPOSIT_STATUS_READY = 'ready-for-load' DEPOSIT_STATUS_READY_FOR_CHECKS = 'ready-for-checks' def setup_django_for(platform): """Setup function for command line tools (swh.deposit.create_user, swh.deposit.scheduler.cli) to initialize the needed db access. Note: Do not import any django related module prior to this function call. Otherwise, this will raise an django.core.exceptions.ImproperlyConfigured error message. Args: platform (str): the platform the scheduling is running Raises: ValueError in case of wrong platform inputs. """ if platform not in AUTHORIZED_PLATFORMS: raise ValueError('Platform should be one of %s' % AUTHORIZED_PLATFORMS) os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'swh.deposit.settings.%s' % platform) import django django.setup() class SWHDefaultConfig(SWHConfig): """Mixin intended to enrich views with SWH configuration. 
""" CONFIG_BASE_FILENAME = 'deposit/server' DEFAULT_CONFIG = { 'max_upload_size': ('int', 209715200), 'checks': ('bool', True), } ADDITIONAL_CONFIG = {} def __init__(self, **config): super().__init__() self.config = self.parse_config_file( additional_configs=[self.ADDITIONAL_CONFIG]) self.config.update(config) self.log = logging.getLogger('swh.deposit') if self.config['checks']: from swh.scheduler.backend import SchedulerBackend self.scheduler = SchedulerBackend() diff --git a/swh/deposit/injection/__init__.py b/swh/deposit/loader/__init__.py similarity index 100% rename from swh/deposit/injection/__init__.py rename to swh/deposit/loader/__init__.py diff --git a/swh/deposit/injection/checker.py b/swh/deposit/loader/checker.py similarity index 100% rename from swh/deposit/injection/checker.py rename to swh/deposit/loader/checker.py diff --git a/swh/deposit/injection/client.py b/swh/deposit/loader/client.py similarity index 100% rename from swh/deposit/injection/client.py rename to swh/deposit/loader/client.py diff --git a/swh/deposit/injection/loader.py b/swh/deposit/loader/loader.py similarity index 95% rename from swh/deposit/injection/loader.py rename to swh/deposit/loader/loader.py index 9ef04497..345246c9 100644 --- a/swh/deposit/injection/loader.py +++ b/swh/deposit/loader/loader.py @@ -1,129 +1,129 @@ # Copyright (C) 2017 The Software Heritage developers # See the AUTHORS file at the top-level directory of this distribution # License: GNU General Public License version 3, or any later version # See top-level LICENSE file for more information import datetime import os import tempfile from swh.model import hashutil from swh.loader.tar import loader from swh.loader.core.loader import SWHLoader from .client import DepositClient class DepositLoader(loader.TarLoader): """Deposit loader implementation. This is a subclass of the :class:TarLoader as the main goal of this class is to first retrieve the deposit's tarball contents as one and its associated metadata. 
Then provide said tarball to be loaded by the TarLoader. This will: - retrieves the deposit's archive locally - provide the archive to be loaded by the tar loader - clean up the temporary location used to retrieve the archive locally - update the deposit's status accordingly """ CONFIG_BASE_FILENAME = 'loader/deposit' ADDITIONAL_CONFIG = { - 'extraction_dir': ('str', '/tmp/swh.deposit.injection/'), + 'extraction_dir': ('str', '/tmp/swh.deposit.loader/'), } def __init__(self, client=None): super().__init__( - logging_class='swh.deposit.injection.loader.DepositLoader') + logging_class='swh.deposit.loader.loader.DepositLoader') self.client = client if client else DepositClient() def load(self, *, archive_url, deposit_meta_url, deposit_update_url): SWHLoader.load( self, archive_url=archive_url, deposit_meta_url=deposit_meta_url, deposit_update_url=deposit_update_url) def prepare(self, *, archive_url, deposit_meta_url, deposit_update_url): - """Prepare the injection by first retrieving the deposit's raw archive + """Prepare the loading by first retrieving the deposit's raw archive content. """ self.deposit_update_url = deposit_update_url temporary_directory = tempfile.TemporaryDirectory() self.temporary_directory = temporary_directory archive_path = os.path.join(temporary_directory.name, 'archive.zip') archive = self.client.archive_get( archive_url, archive_path, log=self.log) metadata = self.client.metadata_get( deposit_meta_url, log=self.log) origin = metadata['origin'] visit_date = datetime.datetime.now(tz=datetime.timezone.utc) revision = metadata['revision'] occurrence = metadata['occurrence'] self.origin_metadata = metadata['origin_metadata'] self.prepare_metadata() - self.client.status_update(deposit_update_url, 'injecting') + self.client.status_update(deposit_update_url, 'loading') super().prepare(tar_path=archive, origin=origin, visit_date=visit_date, revision=revision, occurrences=[occurrence]) def store_metadata(self): """Storing the origin_metadata during the load processus.
Provider_id and tool_id are resolved during the prepare() method. """ origin_id = self.origin_id visit_date = self.visit_date provider_id = self.origin_metadata['provider']['provider_id'] tool_id = self.origin_metadata['tool']['tool_id'] metadata = self.origin_metadata['metadata'] try: self.send_origin_metadata(origin_id, visit_date, provider_id, tool_id, metadata) except: self.log.exception('Problem when storing origin_metadata') raise def post_load(self, success=True): """Updating the deposit's status according to its loading status. If not successful, we update its status to failure. Otherwise, we update its status to 'success' and pass along its associated revision. """ try: if not success: self.client.status_update(self.deposit_update_url, status='failure') return # first retrieve the new revision [rev_id] = self.objects['revision'].keys() if rev_id: rev_id_hex = hashutil.hash_to_hex(rev_id) # then update the deposit's status to success with its # revision-id self.client.status_update(self.deposit_update_url, status='success', revision_id=rev_id_hex) except: self.log.exception( 'Problem when trying to update the deposit\'s status') def cleanup(self): """Clean up temporary directory where we retrieved the tarball. """ super().cleanup() self.temporary_directory.cleanup() diff --git a/swh/deposit/injection/scheduler.py b/swh/deposit/loader/scheduler.py similarity index 91% rename from swh/deposit/injection/scheduler.py rename to swh/deposit/loader/scheduler.py index 41281020..fad48f5b 100644 --- a/swh/deposit/injection/scheduler.py +++ b/swh/deposit/loader/scheduler.py @@ -1,212 +1,212 @@ # Copyright (C) 2017 The Software Heritage developers # See the AUTHORS file at the top-level directory of this distribution # License: GNU General Public License version 3, or any later version # See top-level LICENSE file for more information """Module in charge of sending deposit loading/checking as either celery task or scheduled one-shot tasks. 
""" import click import logging from abc import ABCMeta, abstractmethod from celery import group from swh.core import utils from swh.core.config import SWHConfig from swh.deposit.config import setup_django_for, DEPOSIT_STATUS_READY from swh.deposit.config import DEPOSIT_STATUS_READY_FOR_CHECKS from swh.scheduler.utils import get_task, create_oneshot_task_dict class SWHScheduling(SWHConfig, metaclass=ABCMeta): """Base swh scheduling class to aggregate the schedule deposit - injection. + loading. """ CONFIG_BASE_FILENAME = 'deposit/server' DEFAULT_CONFIG = { 'dry_run': ('bool', False), } ADDITIONAL_CONFIG = {} def __init__(self): super().__init__() self.config = self.parse_config_file( additional_configs=[self.ADDITIONAL_CONFIG]) self.log = logging.getLogger('swh.deposit.scheduling') @abstractmethod def schedule(self, deposits): - """Schedule the new deposit injection. + """Schedule the new deposit loading. Args: data (dict): Deposit aggregated data Returns: None """ pass class SWHCeleryScheduling(SWHScheduling): - """Deposit injection as Celery task scheduling. + """Deposit loading as Celery task scheduling. """ def __init__(self, config=None): super().__init__() if config: self.config.update(**config) self.dry_run = self.config['dry_run'] self.check = self.config['check'] if self.check: - task_name = 'swh.deposit.injection.tasks.ChecksDepositTsk' + task_name = 'swh.deposit.loader.tasks.ChecksDepositTsk' else: - task_name = 'swh.deposit.injection.tasks.LoadDepositArchiveTsk' + task_name = 'swh.deposit.loader.tasks.LoadDepositArchiveTsk' self.task = get_task(task_name) def _convert(self, deposits): """Convert tuple to celery task signature. 
""" task = self.task for archive_url, meta_url, update_url, check_url in deposits: if self.check: yield task.s(deposit_check_url=check_url) else: yield task.s(archive_url=archive_url, deposit_meta_url=meta_url, deposit_update_url=update_url) def schedule(self, deposits): - """Schedule the new deposit injection directly through celery. + """Schedule the new deposit loading directly through celery. Args: depositdata (dict): Deposit aggregated information. Returns: None """ if self.dry_run: return return group(self._convert(deposits)).delay() class SWHSchedulerScheduling(SWHScheduling): - """Deposit injection through SWH's task scheduling interface. + """Deposit loading through SWH's task scheduling interface. """ ADDITIONAL_CONFIG = {} def __init__(self, config=None): super().__init__() from swh.scheduler.backend import SchedulerBackend if config: self.config.update(**config) self.dry_run = self.config['dry_run'] self.scheduler = SchedulerBackend(**self.config) self.check = self.config['check'] def _convert(self, deposits): """Convert tuple to one-shot scheduling tasks. """ for archive_url, meta_url, update_url, check_url in deposits: if self.check: task = create_oneshot_task_dict( 'swh-deposit-archive-checks', deposit_check_url=check_url) else: task = create_oneshot_task_dict( - 'swh-deposit-archive-injection', + 'swh-deposit-archive-loading', archive_url=archive_url, deposit_meta_url=meta_url, deposit_update_url=update_url) yield task def schedule(self, deposits): - """Schedule the new deposit injection through swh.scheduler's api. + """Schedule the new deposit loading through swh.scheduler's api. Args: deposits (dict): Deposit aggregated information. """ if self.dry_run: return self.scheduler.create_tasks(self._convert(deposits)) def get_deposit_by(status): """Filter deposit given a specific status. 
""" from swh.deposit.models import Deposit yield from Deposit.objects.filter(status=status) def prepare_task_arguments(check): """Convert deposit to argument for task to be executed. """ from swh.deposit.config import PRIVATE_GET_RAW_CONTENT from swh.deposit.config import PRIVATE_GET_DEPOSIT_METADATA from swh.deposit.config import PRIVATE_PUT_DEPOSIT from swh.deposit.config import PRIVATE_CHECK_DEPOSIT from django.core.urlresolvers import reverse if check: status = DEPOSIT_STATUS_READY_FOR_CHECKS else: status = DEPOSIT_STATUS_READY for deposit in get_deposit_by(status): args = [deposit.collection.name, deposit.id] archive_url = reverse(PRIVATE_GET_RAW_CONTENT, args=args) meta_url = reverse(PRIVATE_GET_DEPOSIT_METADATA, args=args) update_url = reverse(PRIVATE_PUT_DEPOSIT, args=args) check_url = reverse(PRIVATE_CHECK_DEPOSIT, args=args) yield archive_url, meta_url, update_url, check_url @click.command( - help='Schedule one-shot deposit injections') + help='Schedule one-shot deposit loadings') @click.option('--platform', default='development', help='development or production platform') @click.option('--scheduling-method', default='celery', help='Scheduling method') @click.option('--batch-size', default=1000, type=click.INT, help='Task batch size') @click.option('--dry-run/--no-dry-run', is_flag=True, default=False, help='Dry run') @click.option('--check', is_flag=True, default=False) def main(platform, scheduling_method, batch_size, dry_run, check): setup_django_for(platform) override_config = {} if dry_run: override_config['dry_run'] = dry_run override_config['check'] = check if scheduling_method == 'celery': scheduling = SWHCeleryScheduling(override_config) elif scheduling_method == 'swh-scheduler': scheduling = SWHSchedulerScheduling(override_config) else: raise ValueError( 'Only `celery` or `swh-scheduler` values are accepted') for deposits in utils.grouper(prepare_task_arguments(check), batch_size): scheduling.schedule(deposits) if __name__ == '__main__': main() 
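The batching loop at the end of `main` above can be illustrated in isolation. This is a minimal sketch, not the patch's code: `chunked` and `schedule_in_batches` are hypothetical helpers written for illustration, and `chunked` is only assumed to behave like `swh.core.utils.grouper` (successive batches of at most `batch_size` items), whose implementation this diff does not show.

```python
from itertools import islice


def chunked(iterable, batch_size):
    """Yield successive lists of at most batch_size items.

    Hypothetical stand-in for swh.core.utils.grouper (assumed behavior).
    """
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            # Source exhausted: stop without yielding an empty batch.
            return
        yield batch


def schedule_in_batches(task_arguments, batch_size, schedule):
    """Call schedule once per batch and return the number of batches sent.

    `schedule` plays the role of `scheduling.schedule` in the patch.
    """
    count = 0
    for batch in chunked(task_arguments, batch_size):
        schedule(batch)
        count += 1
    return count
```

Batching keeps each scheduler call bounded in size, so a large backlog of deposits in the ready status is sent as several smaller requests rather than one very large one.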
diff --git a/swh/deposit/injection/tasks.py b/swh/deposit/loader/tasks.py similarity index 66% rename from swh/deposit/injection/tasks.py rename to swh/deposit/loader/tasks.py index b5d81d16..57a49e43 100644 --- a/swh/deposit/injection/tasks.py +++ b/swh/deposit/loader/tasks.py @@ -1,51 +1,50 @@ # Copyright (C) 2015-2017 The Software Heritage developers # See the AUTHORS file at the top-level directory of this distribution # License: GNU General Public License version 3, or any later version # See top-level LICENSE file for more information from swh.scheduler.task import Task -from swh.deposit.injection.loader import DepositLoader -from swh.deposit.injection.checker import DepositChecker +from swh.deposit.loader import loader, checker class LoadDepositArchiveTsk(Task): - """Deposit archive injection task described by the following steps: + """Deposit archive loading task described by the following steps: 1. Retrieve tarball from deposit's private api and store locally in a temporary directory - 2. Trigger the injection + 2. Trigger the loading 3. clean up the temporary directory 4. Update the deposit's status according to result using the deposit's private update status api """ task_queue = 'swh_loader_deposit' def run_task(self, *, archive_url, deposit_meta_url, deposit_update_url): """Import a deposit tarball into swh. Args: see :func:`DepositLoader.load`. """ - loader = DepositLoader() - loader.log = self.log - loader.load(archive_url=archive_url, - deposit_meta_url=deposit_meta_url, - deposit_update_url=deposit_update_url) + _loader = loader.DepositLoader() + _loader.log = self.log + _loader.load(archive_url=archive_url, + deposit_meta_url=deposit_meta_url, + deposit_update_url=deposit_update_url) class ChecksDepositTsk(Task): """Deposit checks task. """ task_queue = 'swh_checker_deposit' def run_task(self, deposit_check_url): """Check a deposit's status Args: see :func:`DepositChecker.check`. 
""" - checker = DepositChecker() - checker.log = self.log - checker.check(deposit_check_url) + _checker = checker.DepositChecker() + _checker.log = self.log + _checker.check(deposit_check_url) diff --git a/swh/deposit/migrations/0008_auto_20171130_1513.py b/swh/deposit/migrations/0008_auto_20171130_1513.py new file mode 100644 index 00000000..20e5afba --- /dev/null +++ b/swh/deposit/migrations/0008_auto_20171130_1513.py @@ -0,0 +1,20 @@ +# -*- coding: utf-8 -*- +# Generated by Django 1.10.7 on 2017-11-30 15:13 +from __future__ import unicode_literals + +from django.db import migrations, models + + +class Migration(migrations.Migration): + + dependencies = [ + ('deposit', '0007_auto_20171129_1609'), + ] + + operations = [ + migrations.AlterField( + model_name='deposit', + name='status', + field=models.TextField(choices=[('partial', 'partial'), ('expired', 'expired'), ('ready-for-checks', 'ready-for-checks'), ('ready-for-load', 'ready-for-load'), ('rejected', 'rejected'), ('loading', 'loading'), ('success', 'success'), ('failure', 'failure')], default='partial'), + ), + ] diff --git a/swh/deposit/models.py b/swh/deposit/models.py index 91afdb7e..67d9f5d9 100644 --- a/swh/deposit/models.py +++ b/swh/deposit/models.py @@ -1,202 +1,202 @@ # Copyright (C) 2017 The Software Heritage developers # See the AUTHORS file at the top-level directory of this distribution # License: GNU General Public License version 3, or any later version # See top-level LICENSE file for more information # Generated from: # cd swh_deposit && \ # python3 -m manage inspectdb from django.contrib.postgres.fields import JSONField, ArrayField from django.contrib.auth.models import User, UserManager from django.db import models from django.utils.timezone import now from .config import DEPOSIT_STATUS_READY, DEPOSIT_STATUS_READY_FOR_CHECKS from .config import DEPOSIT_STATUS_PARTIAL class Dbversion(models.Model): """Db version """ version = models.IntegerField(primary_key=True) release = 
models.DateTimeField(default=now, null=True) description = models.TextField(blank=True, null=True) class Meta: db_table = 'dbversion' def __str__(self): return str({ 'version': self.version, 'release': self.release, 'description': self.description }) """Possible status""" DEPOSIT_STATUS = [ (DEPOSIT_STATUS_PARTIAL, DEPOSIT_STATUS_PARTIAL), ('expired', 'expired'), (DEPOSIT_STATUS_READY_FOR_CHECKS, DEPOSIT_STATUS_READY_FOR_CHECKS), (DEPOSIT_STATUS_READY, DEPOSIT_STATUS_READY), ('rejected', 'rejected'), - ('injecting', 'injecting'), + ('loading', 'loading'), ('success', 'success'), ('failure', 'failure'), ] """Possible status and the detailed meaning.""" DEPOSIT_STATUS_DETAIL = { DEPOSIT_STATUS_PARTIAL: 'Deposit is new or partially received since it can' ' be done in multiple requests', 'expired': 'Deposit has been there too long and is now ' 'deemed ready to be garbage collected', DEPOSIT_STATUS_READY_FOR_CHECKS: 'Deposit is ready for additional checks ' '(tarball ok, etc...)', DEPOSIT_STATUS_READY: 'Deposit is fully received, checked, and ' - 'ready for injection', + 'ready for loading', 'rejected': 'Deposit failed the checks', - 'injecting': "Injection is ongoing on swh's side", - 'success': 'Injection is successful', - 'failure': 'Injection is a failure', + 'loading': "Loading is ongoing on swh's side", + 'success': 'Loading is successful', + 'failure': 'Loading is a failure', } class DepositClient(User): """Deposit client """ collections = ArrayField(models.IntegerField(), null=True) objects = UserManager() url = models.TextField(null=False) class Meta: db_table = 'deposit_client' def __str__(self): return str({ 'id': self.id, 'collections': self.collections, 'username': super().username, }) class Deposit(models.Model): """Deposit reception table """ id = models.BigAutoField(primary_key=True) # First deposit reception date reception_date = models.DateTimeField(auto_now_add=True) - # Date when the deposit is deemed complete and ready for injection + # Date when 
the deposit is deemed complete and ready for loading complete_date = models.DateTimeField(null=True) # collection concerned by the deposit collection = models.ForeignKey( 'DepositCollection', models.DO_NOTHING) # Deposit's external identifier external_id = models.TextField() # Deposit client client = models.ForeignKey('DepositClient', models.DO_NOTHING) - # SWH's injection result identifier + # SWH's loading result identifier swh_id = models.TextField(blank=True, null=True) - # Deposit's status regarding injection + # Deposit's status regarding loading status = models.TextField( choices=DEPOSIT_STATUS, default=DEPOSIT_STATUS_PARTIAL) class Meta: db_table = 'deposit' def __str__(self): return str({ 'id': self.id, 'reception_date': self.reception_date, 'collection': self.collection.name, 'external_id': self.external_id, 'client': self.client.username, 'status': self.status }) class DepositRequestType(models.Model): """Deposit request type made by clients (either archive or metadata) """ id = models.BigAutoField(primary_key=True) name = models.TextField() class Meta: db_table = 'deposit_request_type' def __str__(self): return str({'id': self.id, 'name': self.name}) def client_directory_path(instance, filename): """Callable to upload archive in MEDIA_ROOT/user_/ Args: instance (DepositRequest): DepositRequest concerned by the upload filename (str): Filename of the uploaded file Returns: A path to be prefixed by the MEDIA_ROOT to access physically to the file uploaded. """ return 'client_{0}/{1}'.format(instance.deposit.client.id, filename) class DepositRequest(models.Model): """Deposit request associated to one deposit. 
""" id = models.BigAutoField(primary_key=True) # Deposit concerned by the request deposit = models.ForeignKey(Deposit, models.DO_NOTHING) date = models.DateTimeField(auto_now_add=True) # Deposit request information on the data to inject # this can be null when type is 'archive' metadata = JSONField(null=True) # this can be null when type is 'metadata' archive = models.FileField(null=True, upload_to=client_directory_path) type = models.ForeignKey( 'DepositRequestType', models.DO_NOTHING) class Meta: db_table = 'deposit_request' def __str__(self): meta = None if self.metadata: from json import dumps meta = dumps(self.metadata) archive_name = None if self.archive: archive_name = self.archive.name return str({ 'id': self.id, 'deposit': self.deposit, 'metadata': meta, 'archive': archive_name }) class DepositCollection(models.Model): id = models.BigAutoField(primary_key=True) # Human readable name for the collection type e.g HAL, arXiv, etc... name = models.TextField() class Meta: db_table = 'deposit_collection' def __str__(self): return str({'id': self.id, 'name': self.name}) diff --git a/swh/deposit/signals.py b/swh/deposit/signals.py index 6b9d2512..a1b85c6f 100644 --- a/swh/deposit/signals.py +++ b/swh/deposit/signals.py @@ -1,83 +1,83 @@ # Copyright (C) 2017 The Software Heritage developers # See the AUTHORS file at the top-level directory of this distribution # License: GNU General Public License version 3, or any later version # See top-level LICENSE file for more information """Module in charge of defining some uncoupled actions on deposit. Typically, checking that the archives deposited are ok are not directly testing in the request/answer to avoid too long computations. So this is done in the deposit_on_status_ready_for_check callback. 
""" from django.db.models.signals import post_save from django.dispatch import receiver from .models import Deposit from .config import SWHDefaultConfig, DEPOSIT_STATUS_READY from .config import DEPOSIT_STATUS_READY_FOR_CHECKS @receiver(post_save, sender=Deposit) def post_deposit_save(sender, instance, created, raw, using, update_fields, **kwargs): """When a deposit is saved, check for the deposit's status change and schedule actions accordingly. When the status passes to ready-for-checks, schedule checks. When the status passes to ready, schedule loading. Otherwise, do nothing. Args: sender (Deposit): The model class instance (Deposit): The actual instance being saved created (bool): True if a new record was created raw (bool): True if the model is saved exactly as presented (i.e. when loading a fixture). One should not query/modify other records in the database as the database might not be in a consistent state yet using: The database alias being used update_fields: The set of fields to update as passed to Model.save(), or None if update_fields wasn’t passed to save() """ default_config = SWHDefaultConfig() if not default_config.config['checks']: return if instance.status not in {DEPOSIT_STATUS_READY_FOR_CHECKS, DEPOSIT_STATUS_READY}: return from django.core.urlresolvers import reverse from swh.scheduler.utils import create_oneshot_task_dict args = [instance.collection.name, instance.id] if instance.status == DEPOSIT_STATUS_READY_FOR_CHECKS: # schedule archive check from swh.deposit.config import PRIVATE_CHECK_DEPOSIT check_url = reverse(PRIVATE_CHECK_DEPOSIT, args=args) task = create_oneshot_task_dict( 'swh-deposit-archive-checks', deposit_check_url=check_url) else: # instance.status == DEPOSIT_STATUS_READY: # schedule loading from swh.deposit.config import PRIVATE_GET_RAW_CONTENT from swh.deposit.config import PRIVATE_GET_DEPOSIT_METADATA from swh.deposit.config import PRIVATE_PUT_DEPOSIT archive_url = reverse(PRIVATE_GET_RAW_CONTENT, args=args) meta_url =
reverse(PRIVATE_GET_DEPOSIT_METADATA, args=args) update_url = reverse(PRIVATE_PUT_DEPOSIT, args=args) task = create_oneshot_task_dict( - 'swh-deposit-archive-injection', + 'swh-deposit-archive-loading', archive_url=archive_url, deposit_meta_url=meta_url, deposit_update_url=update_url) default_config.scheduler.create_tasks([task]) diff --git a/swh/deposit/tests/api/test_deposit_update_status.py b/swh/deposit/tests/api/test_deposit_update_status.py index 6532a966..675f5ecb 100644 --- a/swh/deposit/tests/api/test_deposit_update_status.py +++ b/swh/deposit/tests/api/test_deposit_update_status.py @@ -1,119 +1,119 @@ # Copyright (C) 2017 The Software Heritage developers # See the AUTHORS file at the top-level directory of this distribution # License: GNU General Public License version 3, or any later version # See top-level LICENSE file for more information import json from django.core.urlresolvers import reverse from nose.tools import istest from rest_framework import status from rest_framework.test import APITestCase from swh.deposit.models import Deposit, DEPOSIT_STATUS_DETAIL from swh.deposit.config import PRIVATE_PUT_DEPOSIT, DEPOSIT_STATUS_READY from ..common import BasicTestCase class UpdateDepositStatusTest(APITestCase, BasicTestCase): """Update the deposit's status scenario """ def setUp(self): super().setUp() deposit = Deposit(status=DEPOSIT_STATUS_READY, collection=self.collection, client=self.user) deposit.save() self.deposit = Deposit.objects.get(pk=deposit.id) assert self.deposit.status == DEPOSIT_STATUS_READY @istest def update_deposit_status(self): """Existing status for update should return a 204 response """ url = reverse(PRIVATE_PUT_DEPOSIT, args=[self.collection.name, self.deposit.id]) possible_status = set(DEPOSIT_STATUS_DETAIL.keys()) - set(['success']) for _status in possible_status: response = self.client.put( url, content_type='application/json', data=json.dumps({'status': _status})) self.assertEqual(response.status_code, 
status.HTTP_204_NO_CONTENT) deposit = Deposit.objects.get(pk=self.deposit.id) self.assertEquals(deposit.status, _status) @istest - def update_deposit_with_success_injection_and_swh_id(self): + def update_deposit_with_success_loading_and_swh_id(self): """Existing status for update should return a 204 response """ url = reverse(PRIVATE_PUT_DEPOSIT, args=[self.collection.name, self.deposit.id]) expected_status = 'success' expected_id = revision_id = '47dc6b4636c7f6cba0df83e3d5490bf4334d987e' response = self.client.put( url, content_type='application/json', data=json.dumps({ 'status': expected_status, 'revision_id': revision_id, })) self.assertEqual(response.status_code, status.HTTP_204_NO_CONTENT) deposit = Deposit.objects.get(pk=self.deposit.id) self.assertEquals(deposit.status, expected_status) self.assertEquals(deposit.swh_id, expected_id) @istest def update_deposit_status_will_fail_with_unknown_status(self): """Unknown status for update should return a 400 response """ url = reverse(PRIVATE_PUT_DEPOSIT, args=[self.collection.name, self.deposit.id]) response = self.client.put( url, content_type='application/json', data=json.dumps({'status': 'unknown'})) self.assertEqual(response.status_code, status.HTTP_400_BAD_REQUEST) @istest def update_deposit_status_will_fail_with_no_status_key(self): """No status provided for update should return a 400 response """ url = reverse(PRIVATE_PUT_DEPOSIT, args=[self.collection.name, self.deposit.id]) response = self.client.put( url, content_type='application/json', data=json.dumps({'something': 'something'})) self.assertEqual(response.status_code, status.HTTP_400_BAD_REQUEST) @istest def update_deposit_status_success_without_swh_id_fail(self): """Providing 'success' status without swh_id should return a 400 """ url = reverse(PRIVATE_PUT_DEPOSIT, args=[self.collection.name, self.deposit.id]) response = self.client.put( url, content_type='application/json', data=json.dumps({'status': 'success'})) self.assertEqual(response.status_code, 
status.HTTP_400_BAD_REQUEST) diff --git a/swh/deposit/tests/injection/__init__.py b/swh/deposit/tests/loader/__init__.py similarity index 100% rename from swh/deposit/tests/injection/__init__.py rename to swh/deposit/tests/loader/__init__.py diff --git a/swh/deposit/tests/injection/common.py b/swh/deposit/tests/loader/common.py similarity index 96% rename from swh/deposit/tests/injection/common.py rename to swh/deposit/tests/loader/common.py index 8d453589..a1103943 100644 --- a/swh/deposit/tests/injection/common.py +++ b/swh/deposit/tests/loader/common.py @@ -1,49 +1,49 @@ # Copyright (C) 2017 The Software Heritage developers # See the AUTHORS file at the top-level directory of this distribution # License: GNU General Public License version 3, or any later version # See top-level LICENSE file for more information import json -from swh.deposit.injection.client import DepositClient +from swh.deposit.loader.client import DepositClient CLIENT_TEST_CONFIG = { 'url': 'http://nowhere:9000/', 'auth': {}, # no authentication in test scenario } class SWHDepositTestClient(DepositClient): """Deposit test client to permit overriding the default request client. 
""" def __init__(self, client, config): super().__init__(config=config) self.client = client def archive_get(self, archive_update_url, archive_path, log=None): r = self.client.get(archive_update_url) with open(archive_path, 'wb') as f: for chunk in r.streaming_content: f.write(chunk) return archive_path def metadata_get(self, metadata_url, log=None): r = self.client.get(metadata_url) return json.loads(r.content.decode('utf-8')) def status_update(self, update_status_url, status, revision_id=None): payload = {'status': status} if revision_id: payload['revision_id'] = revision_id self.client.put(update_status_url, content_type='application/json', data=json.dumps(payload)) def check(self, check_url): r = self.client.get(check_url) data = json.loads(r.content.decode('utf-8')) return data['status'] diff --git a/swh/deposit/tests/injection/test_checker.py b/swh/deposit/tests/loader/test_checker.py similarity index 97% rename from swh/deposit/tests/injection/test_checker.py rename to swh/deposit/tests/loader/test_checker.py index 23899e19..740089b8 100644 --- a/swh/deposit/tests/injection/test_checker.py +++ b/swh/deposit/tests/loader/test_checker.py @@ -1,70 +1,70 @@ # Copyright (C) 2017 The Software Heritage developers # See the AUTHORS file at the top-level directory of this distribution # License: GNU General Public License version 3, or any later version # See top-level LICENSE file for more information from nose.tools import istest from rest_framework.test import APITestCase from swh.deposit.models import Deposit from swh.deposit.config import PRIVATE_CHECK_DEPOSIT, DEPOSIT_STATUS_READY from swh.deposit.config import DEPOSIT_STATUS_REJECTED -from swh.deposit.injection.checker import DepositChecker +from swh.deposit.loader.checker import DepositChecker from django.core.urlresolvers import reverse from .common import SWHDepositTestClient, CLIENT_TEST_CONFIG from ..common import BasicTestCase, WithAuthTestCase, CommonCreationRoutine from ..common import 
FileSystemCreationRoutine class DepositCheckerScenarioTest(APITestCase, WithAuthTestCase, BasicTestCase, CommonCreationRoutine, FileSystemCreationRoutine): def setUp(self): super().setUp() # 2. Sets a basic client which accesses the test data checker_client = SWHDepositTestClient(client=self.client, config=CLIENT_TEST_CONFIG) # 3. setup loader with no persistence and that client self.checker = DepositChecker(client=checker_client) @istest def check_deposit_ready(self): """Check a valid deposit ready-for-checks should result in ready state """ # 1. create a deposit with archive and metadata deposit_id = self.create_simple_binary_deposit() args = [self.collection.name, deposit_id] deposit_check_url = reverse(PRIVATE_CHECK_DEPOSIT, args=args) # when actual_status = self.checker.check(deposit_check_url=deposit_check_url) # then deposit = Deposit.objects.get(pk=deposit_id) self.assertEquals(deposit.status, DEPOSIT_STATUS_READY) self.assertEquals(actual_status, DEPOSIT_STATUS_READY) @istest def check_deposit_rejected(self): """Check an invalid deposit ready-for-checks should result in rejected """ # 1. 
create a deposit with archive and metadata deposit_id = self.create_invalid_deposit() args = [self.collection.name, deposit_id] deposit_check_url = reverse(PRIVATE_CHECK_DEPOSIT, args=args) # when actual_status = self.checker.check(deposit_check_url=deposit_check_url) # then deposit = Deposit.objects.get(pk=deposit_id) self.assertEquals(deposit.status, DEPOSIT_STATUS_REJECTED) self.assertEquals(actual_status, DEPOSIT_STATUS_REJECTED) diff --git a/swh/deposit/tests/injection/test_client.py b/swh/deposit/tests/loader/test_client.py similarity index 99% rename from swh/deposit/tests/injection/test_client.py rename to swh/deposit/tests/loader/test_client.py index b30366c4..d24ead3a 100644 --- a/swh/deposit/tests/injection/test_client.py +++ b/swh/deposit/tests/loader/test_client.py @@ -1,265 +1,265 @@ # Copyright (C) 2017 The Software Heritage developers # See the AUTHORS file at the top-level directory of this distribution # License: GNU General Public License version 3, or any later version # See top-level LICENSE file for more information import os import shutil import tempfile import unittest from nose.plugins.attrib import attr from nose.tools import istest -from swh.deposit.injection.client import DepositClient +from swh.deposit.loader.client import DepositClient from .common import CLIENT_TEST_CONFIG class StreamedResponse: """Streamed response facsimile """ def __init__(self, ok, stream): self.ok = ok self.stream = stream def iter_content(self): yield from self.stream class FakeRequestClientGet: """Fake request client dedicated to get method calls. 
""" def __init__(self, response): self.response = response def get(self, *args, **kwargs): self.args = args self.kwargs = kwargs return self.response @attr('fs') class DepositClientReadArchiveTest(unittest.TestCase): def setUp(self): super().setUp() self.temporary_directory = tempfile.mkdtemp(dir='/tmp') def tearDown(self): super().tearDown() shutil.rmtree(self.temporary_directory) @istest def archive_get(self): """Reading archive should write data in temporary directory """ stream_content = [b"some", b"streamed", b"response"] response = StreamedResponse( ok=True, stream=(s for s in stream_content)) _client = FakeRequestClientGet(response) deposit_client = DepositClient(config=CLIENT_TEST_CONFIG, _client=_client) archive_path = os.path.join(self.temporary_directory, 'test.archive') archive_path = deposit_client.archive_get('/some/url', archive_path) self.assertTrue(os.path.exists(archive_path)) with open(archive_path, 'rb') as f: actual_content = f.read() self.assertEquals(actual_content, b''.join(stream_content)) self.assertEquals(_client.args, ('http://nowhere:9000/some/url', )) self.assertEquals(_client.kwargs, { 'stream': True }) @istest def archive_get_with_authentication(self): """Reading archive with authentication should write data in temporary directory """ stream_content = [b"some", b"streamed", b"response", b"for", b"auth"] response = StreamedResponse( ok=True, stream=(s for s in stream_content)) _client = FakeRequestClientGet(response) _config = CLIENT_TEST_CONFIG.copy() _config['auth'] = { # add authentication setup 'username': 'user', 'password': 'pass' } deposit_client = DepositClient(_config, _client=_client) archive_path = os.path.join(self.temporary_directory, 'test.archive') archive_path = deposit_client.archive_get('/some/url', archive_path) self.assertTrue(os.path.exists(archive_path)) with open(archive_path, 'rb') as f: actual_content = f.read() self.assertEquals(actual_content, b''.join(stream_content)) self.assertEquals(_client.args,
('http://nowhere:9000/some/url', )) self.assertEquals(_client.kwargs, { 'stream': True, 'auth': ('user', 'pass') }) @istest def archive_get_can_fail(self): """Reading archive can fail for some reason """ response = StreamedResponse(ok=False, stream=None) _client = FakeRequestClientGet(response) deposit_client = DepositClient(config=CLIENT_TEST_CONFIG, _client=_client) with self.assertRaisesRegex( ValueError, 'Problem when retrieving deposit archive'): deposit_client.archive_get('/some/url', 'some/path') class JsonResponse: """Json response facsimile """ def __init__(self, ok, response): self.ok = ok self.response = response def json(self): return self.response class DepositClientReadMetadataTest(unittest.TestCase): @istest def metadata_get(self): """Reading metadata should return the parsed json response """ expected_response = {"some": "dict"} response = JsonResponse( ok=True, response=expected_response) _client = FakeRequestClientGet(response) deposit_client = DepositClient(config=CLIENT_TEST_CONFIG, _client=_client) actual_metadata = deposit_client.metadata_get('/metadata') self.assertEquals(actual_metadata, expected_response) @istest def metadata_get_can_fail(self): """Reading metadata can fail for some reason """ _client = FakeRequestClientGet(JsonResponse(ok=False, response=None)) deposit_client = DepositClient(config=CLIENT_TEST_CONFIG, _client=_client) with self.assertRaisesRegex( ValueError, 'Problem when retrieving metadata at'): deposit_client.metadata_get('/some/metadata/url') class FakeRequestClientPut: """Fake Request client dedicated to put request method calls.
""" args = None kwargs = None def put(self, *args, **kwargs): self.args = args self.kwargs = kwargs class DepositClientStatusUpdateTest(unittest.TestCase): @istest def status_update(self): """Update status """ _client = FakeRequestClientPut() deposit_client = DepositClient(config=CLIENT_TEST_CONFIG, _client=_client) deposit_client.status_update('/update/status', 'success', revision_id='some-revision-id') self.assertEquals(_client.args, ('http://nowhere:9000/update/status', )) self.assertEquals(_client.kwargs, { 'json': { 'status': 'success', 'revision_id': 'some-revision-id', } }) @istest def status_update_with_no_revision_id(self): """Update status without a revision id should only send the status """ _client = FakeRequestClientPut() deposit_client = DepositClient(config=CLIENT_TEST_CONFIG, _client=_client) deposit_client.status_update('/update/status/fail', 'failure') self.assertEquals(_client.args, ('http://nowhere:9000/update/status/fail', )) self.assertEquals(_client.kwargs, { 'json': { 'status': 'failure', } }) class DepositClientCheckTest(unittest.TestCase): @istest def check(self): """When check ok, this should return the deposit's status """ _client = FakeRequestClientGet( JsonResponse(ok=True, response={'status': 'something'})) deposit_client = DepositClient(config=CLIENT_TEST_CONFIG, _client=_client) r = deposit_client.check('/check') self.assertEquals(_client.args, ('http://nowhere:9000/check', )) self.assertEquals(_client.kwargs, {}) self.assertEquals(r, 'something') @istest def check_fails(self): """Checking deposit can fail for some reason """ _client = FakeRequestClientGet( JsonResponse(ok=False, response=None)) deposit_client = DepositClient(config=CLIENT_TEST_CONFIG, _client=_client) with self.assertRaisesRegex( ValueError, 'Problem when checking deposit'): deposit_client.check('/check/fails') self.assertEquals(_client.args, ('http://nowhere:9000/check/fails', )) self.assertEquals(_client.kwargs, {}) diff --git a/swh/deposit/tests/injection/test_loader.py
b/swh/deposit/tests/loader/test_loader.py similarity index 98% rename from swh/deposit/tests/injection/test_loader.py rename to swh/deposit/tests/loader/test_loader.py index 300e745f..dac48c5d 100644 --- a/swh/deposit/tests/injection/test_loader.py +++ b/swh/deposit/tests/loader/test_loader.py @@ -1,285 +1,286 @@ # Copyright (C) 2017 The Software Heritage developers # See the AUTHORS file at the top-level directory of this distribution # License: GNU General Public License version 3, or any later version # See top-level LICENSE file for more information import os import unittest import shutil from nose.tools import istest from nose.plugins.attrib import attr from rest_framework.test import APITestCase from swh.model import hashutil -from swh.deposit.injection.loader import DepositLoader +from swh.deposit.loader import loader from swh.deposit.config import PRIVATE_GET_RAW_CONTENT from swh.deposit.config import PRIVATE_GET_DEPOSIT_METADATA from swh.deposit.config import PRIVATE_PUT_DEPOSIT from django.core.urlresolvers import reverse from .common import SWHDepositTestClient, CLIENT_TEST_CONFIG from .. import TEST_LOADER_CONFIG from ..common import BasicTestCase, WithAuthTestCase, CommonCreationRoutine from ..common import FileSystemCreationRoutine TOOL_ID = 99 PROVIDER_ID = 12 class DepositLoaderInhibitsStorage: """Mixin class to inhibit the persistence and keep in memory the data sent for storage. cf. SWHDepositLoaderNoStorage """ def __init__(self, client=None): # client is not used here, transit it nonetheless to other mixins super().__init__(client=client) # typed data self.state = { 'origin': [], 'origin_visit': [], 'origin_metadata': [], 'content': [], 'directory': [], 'revision': [], 'release': [], 'occurrence': [], 'tool': [], 'provider': [] } def _add(self, type, l): """Add without duplicates and keeping the insertion order. 
Args: type (str): Type of objects concerned by the action l ([object]): List of 'type' object """ col = self.state[type] for o in l: if o in col: continue col.extend([o]) def send_origin(self, origin): origin.update({'id': 1}) self._add('origin', [origin]) return origin['id'] def send_origin_visit(self, origin_id, visit_date): origin_visit = { 'origin': origin_id, 'visit_date': visit_date, 'visit': 1, } self._add('origin_visit', [origin_visit]) return origin_visit def send_origin_metadata(self, origin_id, visit_date, provider_id, tool_id, metadata): origin_metadata = { 'origin_id': origin_id, 'visit_date': visit_date, 'provider_id': provider_id, 'tool_id': tool_id, 'metadata': metadata } self._add('origin_metadata', [origin_metadata]) return origin_metadata def send_tool(self, tool): tool = { 'tool_name': tool['tool_name'], 'tool_version': tool['tool_version'], 'tool_configuration': tool['tool_configuration'] } self._add('tool', [tool]) tool_id = TOOL_ID return tool_id def send_provider(self, provider): provider = { 'provider_name': provider['provider_name'], 'provider_type': provider['provider_type'], 'provider_url': provider['provider_url'], 'metadata': provider['metadata'] } self._add('provider', [provider]) provider_id = PROVIDER_ID return provider_id def maybe_load_contents(self, contents): self._add('content', contents) def maybe_load_directories(self, directories): self._add('directory', directories) def maybe_load_revisions(self, revisions): self._add('revision', revisions) def maybe_load_releases(self, releases): self._add('release', releases) def maybe_load_occurrences(self, occurrences): self._add('occurrence', occurrences) def open_fetch_history(self): pass def close_fetch_history_failure(self, fetch_history_id): pass def close_fetch_history_success(self, fetch_history_id): pass def update_origin_visit(self, origin_id, visit, status): self.status = status # Override to do nothing at the end def close_failure(self): pass def close_success(self): pass 
class TestLoaderUtils(unittest.TestCase): def assertRevisionsOk(self, expected_revisions): """Check the loader's revisions match the expected revisions. Expects self.loader to be instantiated and ready to be inspected (meaning the loading took place). Args: expected_revisions (dict): Dict with key revision id, value the targeted directory id. """ # The last revision being the one used later to start back from for rev in self.loader.state['revision']: rev_id = hashutil.hash_to_hex(rev['id']) directory_id = hashutil.hash_to_hex(rev['directory']) self.assertEquals(expected_revisions[rev_id], directory_id) -class SWHDepositLoaderNoStorage(DepositLoaderInhibitsStorage, DepositLoader): +class SWHDepositLoaderNoStorage(DepositLoaderInhibitsStorage, + loader.DepositLoader): """Loader to test. It inherits from the actual deposit loader to actually test its correct behavior. It also inherits from DepositLoaderInhibitsStorage so that no persistence takes place. """ pass @attr('fs') class DepositLoaderScenarioTest(APITestCase, WithAuthTestCase, BasicTestCase, CommonCreationRoutine, FileSystemCreationRoutine, TestLoaderUtils): def setUp(self): super().setUp() # create the extraction dir used by the loader os.makedirs(TEST_LOADER_CONFIG['extraction_dir'], exist_ok=True) # 1. create a deposit with archive and metadata self.deposit_id = self.create_simple_binary_deposit() # 2. Sets a basic client which accesses the test data loader_client = SWHDepositTestClient(self.client, config=CLIENT_TEST_CONFIG) # 3. 
setup loader with no persistence and that client self.loader = SWHDepositLoaderNoStorage(client=loader_client) def tearDown(self): super().tearDown() shutil.rmtree(TEST_LOADER_CONFIG['extraction_dir']) @istest def inject_deposit_ready(self): """Load a deposit which is ready """ args = [self.collection.name, self.deposit_id] archive_url = reverse(PRIVATE_GET_RAW_CONTENT, args=args) deposit_meta_url = reverse(PRIVATE_GET_DEPOSIT_METADATA, args=args) deposit_update_url = reverse(PRIVATE_PUT_DEPOSIT, args=args) # when self.loader.load(archive_url=archive_url, deposit_meta_url=deposit_meta_url, deposit_update_url=deposit_update_url) # then self.assertEquals(len(self.loader.state['content']), 1) self.assertEquals(len(self.loader.state['directory']), 1) self.assertEquals(len(self.loader.state['revision']), 1) self.assertEquals(len(self.loader.state['release']), 0) self.assertEquals(len(self.loader.state['occurrence']), 1) @istest def inject_deposit_verify_metadata(self): """Load a deposit with metadata, test metadata integrity """ self.deposit_metadata_id = self.add_metadata_to_deposit( self.deposit_id) args = [self.collection.name, self.deposit_metadata_id] archive_url = reverse(PRIVATE_GET_RAW_CONTENT, args=args) deposit_meta_url = reverse(PRIVATE_GET_DEPOSIT_METADATA, args=args) deposit_update_url = reverse(PRIVATE_PUT_DEPOSIT, args=args) # when self.loader.load(archive_url=archive_url, deposit_meta_url=deposit_meta_url, deposit_update_url=deposit_update_url) # then self.assertEquals(len(self.loader.state['content']), 1) self.assertEquals(len(self.loader.state['directory']), 1) self.assertEquals(len(self.loader.state['revision']), 1) self.assertEquals(len(self.loader.state['release']), 0) self.assertEquals(len(self.loader.state['occurrence']), 1) self.assertEquals(len(self.loader.state['origin_metadata']), 1) self.assertEquals(len(self.loader.state['tool']), 1) self.assertEquals(len(self.loader.state['provider']), 1) atom = '{http://www.w3.org/2005/Atom}' codemeta = 
'{https://doi.org/10.5063/SCHEMA/CODEMETA-2.0}' expected_origin_metadata = { atom + 'author': { atom + 'email': 'hal@ccsd.cnrs.fr', atom + 'name': 'HAL' }, codemeta + 'url': 'https://hal-test.archives-ouvertes.fr/hal-01243065', codemeta + 'runtimePlatform': 'phpstorm', codemeta + 'license': { codemeta + 'name': 'CeCILL Free Software License Agreement v1.1' }, codemeta + 'programmingLanguage': 'C', codemeta + 'applicationCategory': 'test', codemeta + 'dateCreated': '2017-05-03T16:08:47+02:00', codemeta + 'version': 1, atom + 'external_identifier': 'hal-01243065', atom + 'title': 'Composing a Web of Audio Applications', codemeta + 'description': 'this is the description', atom + 'id': 'hal-01243065', atom + 'client': 'hal', codemeta + 'keywords': 'DSP programming,Web', codemeta + 'developmentStatus': 'stable' } result = self.loader.state['origin_metadata'][0] self.assertEquals(result['metadata'], expected_origin_metadata) self.assertEquals(result['tool_id'], TOOL_ID) self.assertEquals(result['provider_id'], PROVIDER_ID)
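Reviewer note: the test_loader.py changes rely on a mixin placed before the real loader in the class bases, so that the persistence methods are intercepted and the data is kept in memory. The following is a minimal, standalone sketch of that pattern and not the project's code: `InMemoryStorageMixin`, `FakeLoader` and `LoaderNoStorage` are hypothetical stand-ins for `DepositLoaderInhibitsStorage`, `DepositLoader` and `SWHDepositLoaderNoStorage`, and the `load` signature is simplified for illustration.

```python
class InMemoryStorageMixin:
    """Hypothetical mixin: keep the data sent for storage in memory."""

    def __init__(self, client=None):
        # Pass the client along so the next class in the MRO receives it.
        super().__init__(client=client)
        self.state = {'content': []}

    def _add(self, type, objects):
        """Add without duplicates, keeping the insertion order."""
        col = self.state[type]
        for o in objects:
            if o not in col:
                col.append(o)

    def maybe_load_contents(self, contents):
        # Overrides the loader's persistence hook.
        self._add('content', contents)


class FakeLoader:
    """Hypothetical stand-in for the real loader (assumption, simplified)."""

    def __init__(self, client=None):
        self.client = client
        self.persisted = []

    def maybe_load_contents(self, contents):
        # In the real loader this would write to storage.
        self.persisted.extend(contents)

    def load(self, contents):
        self.maybe_load_contents(contents)


class LoaderNoStorage(InMemoryStorageMixin, FakeLoader):
    """The mixin comes first, so its overrides win in the MRO."""
    pass


loader = LoaderNoStorage(client='test-client')
loader.load([{'id': 1}, {'id': 1}, {'id': 2}])
```

Because the mixin is listed first in the bases, `load` (inherited from the loader) dispatches to the mixin's `maybe_load_contents`, which is why the tests can inspect `self.loader.state` without any storage being touched.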