diff --git a/PKG-INFO b/PKG-INFO
index 628a4159..46302db6 100644
--- a/PKG-INFO
+++ b/PKG-INFO
@@ -1,11 +1,11 @@
 Metadata-Version: 2.1
 Name: swh.deposit
-Version: 0.0.38
+Version: 0.0.39
 Summary: Software Heritage Deposit Server
 Home-page: https://forge.softwareheritage.org/source/swh-deposit/
 Author: Software Heritage developers
 Author-email: swh-devel@inria.fr
 License: UNKNOWN
 Description: UNKNOWN
 Platform: UNKNOWN
 Provides-Extra: loader
diff --git a/docs/getting-started.md b/docs/getting-started.md
index 319b8c1e..f7d3eab1 100644
--- a/docs/getting-started.md
+++ b/docs/getting-started.md
@@ -1,332 +1,332 @@
 # Getting Started

 This is a getting started to demonstrate the deposit api use case with a shell client.

 The api is rooted at https://deposit.softwareheritage.org.

-For more details, see the [main README](./README.md).
+For more details, see the [main documentation](./index.html).

 ## Requirements

 You need to be referenced on SWH's client list to have:
 - a credential (needed for the basic authentication step).
 - an associated collection

 [Contact us for more information.](https://www.softwareheritage.org/contact/)

 ## Demonstration

 For the rest of the document, we will:
 - reference `` as the client and `` as its associated authentication password.
 - use curl as example on how to request the api.
 - present the main deposit use cases.

 The use cases are:

 - one single deposit step: The user posts in one query (one deposit) a software source code archive and associated metadata (deposit is finalized with status `ready-for-checks`).

   This will demonstrate the multipart query.

 - another 3-steps deposit (which can be extended as more than 2 steps):
   1. Create an incomplete deposit (status `partial`)
   2. Update a deposit (and finalize it, so the status becomes `ready-for-checks`)
   3. Check the deposit's state

   This will demonstrate the stateful nature of the sword protocol.
 Those use cases share a common part, they must start by requesting the `service document iri` (internationalized resource identifier) for information about the collection's location.

 ### Common part - Start with the service document

 First, to determine the *collection iri* onto which deposit data, the client needs to ask the server where is its *collection* located. That is the role of the *service document iri*.

 For example:

 ``` Shell
 curl -i --user : https://deposit.softwareheritage.org/1/servicedocument/
 ```

 If everything went well, you should have received a response similar to this:

 ``` Shell
 HTTP/1.0 200 OK
 Server: WSGIServer/0.2 CPython/3.5.3
 Content-Type: application/xml

 2.0 209715200 The Software Heritage (SWH) Archive Software Collection application/zip Collection Policy Software Heritage Archive Collect, Preserve, Share false http://purl.org/net/sword/package/SimpleZip https://deposit.softwareheritage.org/1//
 ```

 Explaining the response:
 - `HTTP/1.0 200 OK`: the query is successful and returns a body response
 - `Content-Type: application/xml`: The body response is in xml format
 - `body response`: it is a service document describing that the client `` has a collection named ``. That collection is available at the *collection iri* `/1//` (through POST query).

 At this level, if something went wrong, this should be authentication related. So the response would have been a 401 Unauthorized access. Something like:

 ``` Shell
 curl -i https://deposit.softwareheritage.org/1//
 HTTP/1.0 401 Unauthorized
 Server: WSGIServer/0.2 CPython/3.5.3
 Content-Type: application/xml
 WWW-Authenticate: Basic realm=""
 X-Frame-Options: SAMEORIGIN

 Access to this api needs authentication processing failed
 ```

 ### Single deposit

 A single deposit translates to a multipart deposit request.
 This means, in swh's deposit's terms, sending exactly one POST query with:
 - 1 archive (`content-type application/zip`)
 - 1 atom xml content (`content-type: application/atom+xml;type=entry`)

 The supported archive, for now are limited to zip files. Those archives are expected to contain some form of software source code. The atom entry content is some xml defining metadata about that software.

 Example of minimal atom entry file:

 ``` XML
 Title urn:uuid:1225c695-cfb8-4ebb-aaaa-80da344efa6a 2005-10-07T17:17:08Z Contributor The abstract The abstract Access Rights Alternative Title Date Available Bibliographic Citation Contributor Description Has Part Has Version Identifier Is Part Of Publisher References Rights Holder Source Title Type
 ```

 Once the files are ready for deposit, we want to do the actual deposit in one shot.

 For this, we need to provide:
 - the contents and their associated correct content-types
 - either the header `In-Progress` to false (meaning, it's finished after this query) or nothing (the server will assume it's not in progress if not present).
 - Optionally, the `Slug` header, which is a reference to a unique identifier the client knows about and wants to provide us.

 You can do this with the following command:

 ``` Shell
 curl -i --user : \
     -F "file=@deposit.zip;type=application/zip;filename=payload" \
     -F "atom=@atom-entry.xml;type=application/atom+xml;charset=UTF-8" \
     -H 'In-Progress: false' \
     -H 'Slug: some-external-id' \
     -XPOST https://deposit.softwareheritage.org/1//
 ```

 You just posted a deposit to the collection https://deposit.softwareheritage.org/1//.

 If everything went well, you should have received a response similar to this:

 ``` Shell
 HTTP/1.0 201 Created
 Server: WSGIServer/0.2 CPython/3.5.3
 Location: /1//10/metadata/
 Content-Type: application/xml

 9 Sept. 26, 2017, 10:11 a.m.
 payload ready-for-checks http://purl.org/net/sword/package/SimpleZip
 ```

 Explaining this response:
 - `HTTP/1.0 201 Created`: the deposit is successful
 - `Location: /1//10/metadata/`: the EDIT-SE-IRI through which we can update a deposit
 - body response: it is a deposit receipt detailing all endpoints available to manipulate the deposit (update, replace, delete, etc...) It also explains the deposit identifier to be 9 (which is useful for the remaining example).

 Note: As the deposit is in `ready-for-checks` status, you cannot actually update anything after this query. Well, the client can try, but it will be answered with a 403 forbidden answer.

 ### Multi-steps deposit

-1. Create a deposit
+#### Create a deposit

 We will use the collection IRI again as the starting point.

 We need to explicitely give to the server information about:
 - the deposit's completeness (through header `In-Progress` to true, as we want to do in multiple steps now).
 - archive's md5 hash (through header `Content-MD5`)
 - upload's type (through the headers `Content-Disposition` and `Content-Type`)

 The following command:

 ``` Shell
 curl -i --user : \
     --data-binary @swh/deposit.zip \
     -H 'In-Progress: true' \
     -H 'Content-MD5: 0faa1ecbf9224b9bf48a7c691b8c2b6f' \
     -H 'Content-Disposition: attachment; filename=[deposit.zip]' \
     -H 'Slug: some-external-id' \
     -H 'Packaging: http://purl.org/net/sword/package/SimpleZIP' \
     -H 'Content-type: application/zip' \
     -XPOST https://deposit.softwareheritage.org/1//
 ```

 The expected answer is the same as the previous sample.

-2. Update deposit's metadata
+#### Update deposit's metadata

 To update a deposit, we can either add some more archives, some more metadata or replace existing ones.

 As we don't have defined metadata yet (except for the `slug` header), we can add some to the `EDIT-SE-IRI` endpoint (/1//10/metadata/). That information is extracted from the deposit receipt sample.

 Using here the same atom-entry.xml file presented in previous chapter.
 For example, here is the command to update deposit metadata:

 ``` Shell
 curl -i --user : --data-binary @atom-entry.xml \
     -H 'In-Progress: true' \
     -H 'Slug: some-external-id' \
     -H 'Content-Type: application/atom+xml;type=entry' \
     -XPOST https://deposit.softwareheritage.org/1//10/metadata/
 HTTP/1.0 201 Created
 Server: WSGIServer/0.2 CPython/3.5.3
 Location: /1//10/metadata/
 Content-Type: application/xml

 10 Sept. 26, 2017, 10:32 a.m. None partial http://purl.org/net/sword/package/SimpleZip
 ```

-3. Check the deposit's state
+#### Check the deposit's state

 You need to check the STATE-IRI endpoint (/1//10/status/).

 ``` Shell
 curl -i --user : https://deposit.softwareheritage.org/1//10/status/
 HTTP/1.0 200 OK
 Date: Wed, 27 Sep 2017 08:25:53 GMT
 Content-Type: application/xml
 ```

 Response:
 ``` XML
 9 ready-for-checks deposit is fully received and ready for loading
 ```
diff --git a/swh.deposit.egg-info/PKG-INFO b/swh.deposit.egg-info/PKG-INFO
index 628a4159..46302db6 100644
--- a/swh.deposit.egg-info/PKG-INFO
+++ b/swh.deposit.egg-info/PKG-INFO
@@ -1,11 +1,11 @@
 Metadata-Version: 2.1
 Name: swh.deposit
-Version: 0.0.38
+Version: 0.0.39
 Summary: Software Heritage Deposit Server
 Home-page: https://forge.softwareheritage.org/source/swh-deposit/
 Author: Software Heritage developers
 Author-email: swh-devel@inria.fr
 License: UNKNOWN
 Description: UNKNOWN
 Platform: UNKNOWN
 Provides-Extra: loader
diff --git a/swh/deposit/api/private/deposit_check.py b/swh/deposit/api/private/deposit_check.py
index b7885022..e3875c48 100644
--- a/swh/deposit/api/private/deposit_check.py
+++ b/swh/deposit/api/private/deposit_check.py
@@ -1,122 +1,117 @@
 # Copyright (C) 2017 The Software Heritage developers
 # See the AUTHORS file at the top-level directory of this distribution
 # License: GNU General Public License version 3, or any later version
 # See top-level LICENSE file for more information

 import json
 import zipfile

 from rest_framework import status

 from ..common import SWHGetDepositAPI,
SWHPrivateAPIView
 from ...config import DEPOSIT_STATUS_READY, DEPOSIT_STATUS_REJECTED
 from ...config import ARCHIVE_TYPE, METADATA_TYPE
 from ...models import Deposit, DepositRequest


 class SWHChecksDeposit(SWHGetDepositAPI, SWHPrivateAPIView):
     """Dedicated class to read a deposit's raw archives content.

     Only GET is supported.

     """
     def deposit_requests(self, deposit):
         """Given a deposit, yields its associated deposit_request

         Yields:
             deposit request

         """
         deposit_requests = DepositRequest.objects.filter(
             deposit=deposit).order_by('id')
         for deposit_request in deposit_requests:
             yield deposit_request

     def _check_archive(self, archive):
         """Check that a given archive is actually ok for reading.

         Args:
             archive (File): Archive to check

         Returns:
             True if archive is successfully read, False otherwise.

         """
         try:
             zf = zipfile.ZipFile(archive.path)
             zf.infolist()
         except Exception as e:
             return False
         else:
             return True

     def _check_metadata(self, metadata):
         """Check to execute on metadata.

         Args:
             metadata (): Metadata to actually check

         Returns:
             True if metadata is ok, False otherwise.

         """
-        must_meta = ['url', 'external_identifier', ['name', 'title'], 'author']
-        # checks only for must metadata on all metadata requests
-        for mm in must_meta:
-            found = False
-            for k in metadata:
-                if isinstance(mm, list):
-                    for p in mm:
-                        if p in k:
-                            found = True
-                            break
-                elif mm in k:
-                    found = True
-                    break
-            if not found:
-                return False
-        return True
+        required_fields = (('url',),
+                           ('external_identifier',),
+                           ('name', 'title'),
+                           ('author',))
+
+        result = all(any(name in field
+                         for field in metadata
+                         for name in possible_names)
+                     for possible_names in required_fields)
+        return result

     def process_get(self, req, collection_name, deposit_id):
         """Build a unique tarball from the multiple received and stream that
            content to the client.
         Args:
             req (Request):
             collection_name (str): Collection owning the deposit
             deposit_id (id): Deposit concerned by the reading

         Returns:
             Tuple status, stream of content, content-type

         """
         deposit = Deposit.objects.get(pk=deposit_id)
         all_metadata = {}
+        archives_status = False
         # will check each deposit request for the deposit
         for dr in self.deposit_requests(deposit):
             if dr.type.name == ARCHIVE_TYPE:
                 archives_status = self._check_archive(dr.archive)
             elif dr.type.name == METADATA_TYPE:
                 # aggregating all metadata requests for check on complete set
                 all_metadata.update(dr.metadata)
             if not archives_status:
                 break

         metadatas_status = self._check_metadata(all_metadata)
         deposit_status = archives_status and metadatas_status

         # if problem in any deposit requests, the deposit is rejected
         if not deposit_status:
             deposit.status = DEPOSIT_STATUS_REJECTED
         else:
             deposit.status = DEPOSIT_STATUS_READY

         deposit.save()

         return (status.HTTP_200_OK,
                 json.dumps({
                     'status': deposit.status
                 }),
                 'application/json')
diff --git a/swh/deposit/models.py b/swh/deposit/models.py
index fd0f4694..3270bef1 100644
--- a/swh/deposit/models.py
+++ b/swh/deposit/models.py
@@ -1,205 +1,207 @@
 # Copyright (C) 2017 The Software Heritage developers
 # See the AUTHORS file at the top-level directory of this distribution
 # License: GNU General Public License version 3, or any later version
 # See top-level LICENSE file for more information

 # Generated from:
 # cd swh_deposit && \
 #    python3 -m manage inspectdb

 from django.contrib.postgres.fields import JSONField, ArrayField
 from django.contrib.auth.models import User, UserManager
 from django.db import models
 from django.utils.timezone import now

 from .config import DEPOSIT_STATUS_READY, DEPOSIT_STATUS_READY_FOR_CHECKS
 from .config import DEPOSIT_STATUS_PARTIAL, DEPOSIT_STATUS_LOAD_SUCCESS
 from .config import DEPOSIT_STATUS_LOAD_FAILURE


 class Dbversion(models.Model):
     """Db version

     """
     version = models.IntegerField(primary_key=True)
     release = models.DateTimeField(default=now,
null=True)
     description = models.TextField(blank=True, null=True)

     class Meta:
         db_table = 'dbversion'

     def __str__(self):
         return str({
             'version': self.version,
             'release': self.release,
             'description': self.description
         })


 """Possible status"""
 DEPOSIT_STATUS = [
     (DEPOSIT_STATUS_PARTIAL, DEPOSIT_STATUS_PARTIAL),
     ('expired', 'expired'),
     (DEPOSIT_STATUS_READY_FOR_CHECKS, DEPOSIT_STATUS_READY_FOR_CHECKS),
     (DEPOSIT_STATUS_READY, DEPOSIT_STATUS_READY),
     ('rejected', 'rejected'),
     ('loading', 'loading'),
     (DEPOSIT_STATUS_LOAD_SUCCESS, DEPOSIT_STATUS_LOAD_SUCCESS),
     (DEPOSIT_STATUS_LOAD_FAILURE, DEPOSIT_STATUS_LOAD_FAILURE),
 ]


 """Possible status and the detailed meaning."""
 DEPOSIT_STATUS_DETAIL = {
-    DEPOSIT_STATUS_PARTIAL: 'Deposit is new or partially received since it can'
-                            ' be done in multiple requests',
+    DEPOSIT_STATUS_PARTIAL: 'Deposit is partially received. To finalize it, '
+                            'In-Progress header should be false',
     'expired': 'Deposit has been there too long and is now '
                'deemed ready to be garbage collected',
     DEPOSIT_STATUS_READY_FOR_CHECKS: 'Deposit is ready for additional checks '
-                                     '(tarball ok, etc...)',
+                                     '(tarball ok, metadata, etc...)',
     DEPOSIT_STATUS_READY: 'Deposit is fully received, checked, and '
                           'ready for loading',
     'rejected': 'Deposit failed the checks',
     'loading': "Loading is ongoing on swh's side",
-    DEPOSIT_STATUS_LOAD_SUCCESS: 'Loading is successful',
-    DEPOSIT_STATUS_LOAD_FAILURE: 'Loading is a failure',
+    DEPOSIT_STATUS_LOAD_SUCCESS: 'The deposit has been successfully '
+                                 'loaded into the Software Heritage archive',
+    DEPOSIT_STATUS_LOAD_FAILURE: 'The deposit loading into the '
+                                 'Software Heritage archive failed',
 }


 class DepositClient(User):
     """Deposit client

     """
     collections = ArrayField(models.IntegerField(), null=True)
     objects = UserManager()
     url = models.TextField(null=False)

     class Meta:
         db_table = 'deposit_client'

     def __str__(self):
         return str({
             'id': self.id,
             'collections': self.collections,
             'username': super().username,
         })


 class
Deposit(models.Model):
     """Deposit reception table

     """
     id = models.BigAutoField(primary_key=True)

     # First deposit reception date
     reception_date = models.DateTimeField(auto_now_add=True)
     # Date when the deposit is deemed complete and ready for loading
     complete_date = models.DateTimeField(null=True)
     # collection concerned by the deposit
     collection = models.ForeignKey(
         'DepositCollection', models.DO_NOTHING)
     # Deposit's external identifier
     external_id = models.TextField()
     # Deposit client
     client = models.ForeignKey('DepositClient', models.DO_NOTHING)
     # SWH's loading result identifier
     swh_id = models.TextField(blank=True, null=True)
     # Deposit's status regarding loading
     status = models.TextField(
         choices=DEPOSIT_STATUS,
         default=DEPOSIT_STATUS_PARTIAL)
     # deposit can have one parent
     parent = models.ForeignKey('self', null=True)

     class Meta:
         db_table = 'deposit'

     def __str__(self):
         return str({
             'id': self.id,
             'reception_date': self.reception_date,
             'collection': self.collection.name,
             'external_id': self.external_id,
             'client': self.client.username,
             'status': self.status
         })


 class DepositRequestType(models.Model):
     """Deposit request type made by clients (either archive or metadata)

     """
     id = models.BigAutoField(primary_key=True)
     name = models.TextField()

     class Meta:
         db_table = 'deposit_request_type'

     def __str__(self):
         return str({'id': self.id, 'name': self.name})


 def client_directory_path(instance, filename):
     """Callable to upload archive in MEDIA_ROOT/user_/

     Args:
         instance (DepositRequest): DepositRequest concerned by the upload
         filename (str): Filename of the uploaded file

     Returns:
         A path to be prefixed by the MEDIA_ROOT to access physically
         to the file uploaded.

     """
     return 'client_{0}/{1}'.format(instance.deposit.client.id, filename)


 class DepositRequest(models.Model):
     """Deposit request associated to one deposit.
""" id = models.BigAutoField(primary_key=True) # Deposit concerned by the request deposit = models.ForeignKey(Deposit, models.DO_NOTHING) date = models.DateTimeField(auto_now_add=True) # Deposit request information on the data to inject # this can be null when type is 'archive' metadata = JSONField(null=True) # this can be null when type is 'metadata' archive = models.FileField(null=True, upload_to=client_directory_path) type = models.ForeignKey( 'DepositRequestType', models.DO_NOTHING) class Meta: db_table = 'deposit_request' def __str__(self): meta = None if self.metadata: from json import dumps meta = dumps(self.metadata) archive_name = None if self.archive: archive_name = self.archive.name return str({ 'id': self.id, 'deposit': self.deposit, 'metadata': meta, 'archive': archive_name }) class DepositCollection(models.Model): id = models.BigAutoField(primary_key=True) # Human readable name for the collection type e.g HAL, arXiv, etc... name = models.TextField() class Meta: db_table = 'deposit_collection' def __str__(self): return str({'id': self.id, 'name': self.name}) diff --git a/version.txt b/version.txt index 2cbb505c..b67d3fa4 100644 --- a/version.txt +++ b/version.txt @@ -1 +1 @@ -v0.0.38-0-geff3391 \ No newline at end of file +v0.0.39-0-g5214a45 \ No newline at end of file