diff --git a/README-injection.md b/README-injection.md
index bbc39cf5..06fd9eb1 100644
--- a/README-injection.md
+++ b/README-injection.md
@@ -1,212 +1,211 @@
 # Injection specification (draft)

 This part discusses deposit injection on the server side.

 ## Tarball Injection

 The `swh-loader-tar` module is already able to inject tarballs in swh
 with very limited metadata (mainly the origin).

 The injection of the deposit will use the deposit's associated data:
 - the metadata
 - the archive(s)

 We will use the `synthetic` revision notion.

 The metadata will be associated with that revision and included in the
 hash computation, thus resulting in a unique identifier.

 ### Injection mapping

 Some of those metadata will also be included in the `origin_metadata`
 table.

 origin                              | https://hal.inria.fr/hal-id
 ------------------------------------|----------------------------------------
 origin_visit                        | 1 :reception_date
 occurrence & occurrence_history     | branch: client's version n° (e.g. hal)
 revision                            | synthetic_revision (tarball)
 directory                           | upper level of the uncompressed archive

 ### Questions raised concerning injection

 - A deposit has one origin, yet an origin can have multiple deposits?

 No, an origin can have multiple requests for the same deposit, which
 should end up in one single deposit (when the client sends its final
 request marking the deposit as 'done' through the In-Progress header).

 Only updating an existing 'partial' deposit is permitted; no other
 deposit 'update' operation is allowed.

 To create a new version of a software (already deposited), the client
 must first create a new deposit.

 Illustration

 First deposit injection:

 HAL's deposit 01535619 = SWH's deposit **01535619-1**

 + 1 origin with url:https://hal.inria.fr/medihal-01535619
 + 1 synthetic revision
 + 1 directory

 HAL's update on deposit 01535619 = SWH's deposit **01535619-2**

 (*with HAL, updates can only be on the metadata; a new version is
 required if the content changes)

 + 1 origin with url:https://hal.inria.fr/medihal-01535619
 + new synthetic revision (with new metadata)
 + same directory

 HAL's deposit 01535619-v2 = SWH's deposit **01535619-v2-1**

 + same origin
 + new revision
 + new directory

 ## Technical details

 ### Requirements

 - one dedicated database to store the deposit's state - swh-deposit
 - one dedicated temporary objstorage to store archives before injection
 - one client to test the communication with SWORD protocol

 ### Deposit reception schema

 - SWORD imposes the use of basic authentication, so we need a way to
   authenticate clients. Also, a client can access collections:

   **deposit_client** table:
   - id (bigint): Client's identifier
   - username (str): Client's username
   - password (pass): Client's encrypted password
   - collections ([id]): List of collections the client can access

 - Collections group deposits together:

   **deposit_collection** table:
   - id (bigint): Collection's identifier
   - name (str): Collection's human readable name

 - A deposit is the main object the repository is all about:

   **deposit** table:
   - id (bigint): deposit's identifier
   - reception_date (date): First deposit's reception date
   - complete_date (date): Date when the deposit is deemed complete and
     ready for injection
   - collection (id): The collection the deposit belongs to
   - external_id (text): client's internal identifier (e.g. hal's id, etc.)
   - client_id (id): Client which made the deposit
   - swh_id (str): swh identifier result once the injection is complete
   - status (enum): The deposit's current status

 - As mentioned, a deposit can have a status, whose possible values are:

 ``` text
       'partial',   -- the deposit is new or partially received since it
                    -- can be done in multiple requests
       'expired',   -- deposit has been there too long and is now deemed
                    -- ready to be garbage collected
       'ready',     -- deposit is fully received and ready for injection
       'injecting', -- injection is ongoing on swh's side
       'success',   -- injection is successful
       'failure'    -- injection is a failure
 ```

 A deposit is stateful and can be made in multiple requests:

 **deposit_request** table:
 - id (bigint): identifier
 - type (id): deposit request's type (possible values: 'archive', 'metadata')
 - deposit_id (id): deposit to which the request belongs
 - metadata: metadata associated with the request
 - date (date): date of the request

 Information sent along with a request is stored in a `deposit_request`
 row.

 A request can be either of type `metadata` (atom entry, multipart's
 atom entry part) or of type `archive` (binary upload, multipart's
 binary upload part).

 When the deposit is complete (status `ready`), those `metadata` and
 `archive` deposit requests will be read and aggregated. They will then
 be sent as parameters to the injection routine.

 During injection, some of those metadata are kept in the
 `origin_metadata` table and others are stored in the `revision` table
 (see [metadata injection](#metadata-injection)).

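The aggregation step described above can be sketched as follows. This is only an illustration under stated assumptions, not the project's implementation: `DepositRequestRow` and `aggregate_requests` are hypothetical names, the `archive` field is assumed to hold a path to the uploaded file, and the merge policy (later metadata requests override earlier keys) is an assumption the specification does not state.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class DepositRequestRow:
    """Hypothetical in-memory stand-in for one `deposit_request` row."""
    type: str                                  # 'archive' or 'metadata'
    metadata: Optional[Dict[str, Any]] = None
    archive: Optional[str] = None              # assumed: path to the upload


def aggregate_requests(requests: List[DepositRequestRow]) -> Dict[str, Any]:
    """Merge all requests of one deposit into injection parameters."""
    archives: List[str] = []
    metadata: Dict[str, Any] = {}
    for request in requests:
        if request.type == 'archive':
            if request.archive is not None:
                archives.append(request.archive)
        elif request.type == 'metadata':
            # Assumption: keys from later requests override earlier ones.
            metadata.update(request.metadata or {})
        else:
            raise ValueError(
                'unknown deposit request type: %r' % request.type)
    return {'archives': archives, 'metadata': metadata}
```

The returned dictionary stands in for the parameters handed to the injection routine; the real routine may expect a different shape.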
 The only update actions occurring on the deposit table concern:
 - status changes:
   - `partial` -> {`expired`/`ready`},
   - `ready` -> `injecting`,
   - `injecting` -> {`success`/`failure`}
 - `complete_date`, set when the deposit is finalized (when the status
   is changed to `ready`)
 - `swh_id`, populated once we have the injection result

 #### SWH Identifier returned

     swh--

     e.g: swh-hal-47dc6b4636c7f6cba0df83e3d5490bf4334d987e

 ### Scheduling injection

 All `archive` and `metadata` deposit requests should be aggregated
 before injection.

 The injection should be scheduled via the scheduler's api.

 Only `ready` deposits are concerned by the injection.

 When the injection is done and successful, the deposit entry is
 updated:
 - `status` is updated to `success`
 - `swh_id` is populated with the resulting hash
   (cf. [swh identifier](#swh-identifier-returned))
 - `complete_date` is updated to the injection's finished time

 When the injection fails, the deposit entry is updated:
 - `status` is updated to `failure`
 - `swh_id` and `complete_date` remain as is

 *Note:* As a further improvement, we may prefer having a retry policy
 with graceful delays for further scheduling.

 ### Metadata injection

 - the metadata received with the deposit should be kept in the
   `origin_metadata` table before translation as part of the injection
   process, and an indexation process should be scheduled.

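As a side note on the deposit status lifecycle and the result-recording rules from the "Scheduling injection" section, the sketch below encodes the allowed status changes as a lookup table and applies the success/failure updates to a plain dictionary. The names `ALLOWED_TRANSITIONS`, `change_status` and `record_injection_result` are hypothetical, and the deposit is modelled as a `dict` purely for illustration; the actual service stores deposits in its database.

```python
from datetime import datetime, timezone
from typing import Any, Dict, Optional, Set

# Status changes permitted by the specification.
ALLOWED_TRANSITIONS: Dict[str, Set[str]] = {
    'partial': {'expired', 'ready'},
    'ready': {'injecting'},
    'injecting': {'success', 'failure'},
}


def change_status(current: str, new: str) -> str:
    """Return `new` if `current -> new` is allowed, else raise ValueError."""
    if new not in ALLOWED_TRANSITIONS.get(current, set()):
        raise ValueError('invalid status change: %s -> %s' % (current, new))
    return new


def record_injection_result(deposit: Dict[str, Any],
                            swh_id: Optional[str]) -> Dict[str, Any]:
    """Return a copy of `deposit` updated with the injection outcome.

    Assumption: a missing `swh_id` (None) signals a failed injection.
    """
    updated = dict(deposit)
    if swh_id is not None:
        updated['status'] = change_status(deposit['status'], 'success')
        updated['swh_id'] = swh_id
        updated['complete_date'] = datetime.now(timezone.utc)
    else:
        # On failure, swh_id and complete_date are left untouched.
        updated['status'] = change_status(deposit['status'], 'failure')
    return updated
```

Keeping the transitions in one table makes an accidental jump such as `partial` -> `success` fail loudly instead of silently corrupting the deposit state.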
 origin_metadata table:
 ```
-origin                      bigint   PK FK
-discovery_date              date     PK FK
-translation_date            date     PK FK
-provenance_type             text           // (enum: 'publisher', 'lister' needs to be completed)
-raw_metadata                jsonb          // before translation
-indexer_configuration_id    bigint   FK    // tool used for translation
-translated_metadata         jsonb          // with codemeta schema and terms
+id                          bigint   PK
+origin                      bigint
+discovery_date              date
+provider_id                 bigint   FK    // (from provider table)
+metadata                    jsonb          // before translation
+indexer_configuration_id    bigint   FK    // tool used for extraction
 ```
diff --git a/README-metadata.md b/README-metadata.md
new file mode 100644
index 00000000..27ef85cf
--- /dev/null
+++ b/README-metadata.md
@@ -0,0 +1,153 @@
+# Deposit metadata
+
+When making a software deposit into the SWH archive, one can add information
+describing the software artifact and the software project. The metadata
+will be translated to the [CodeMeta v.2](https://doi.org/10.5063/SCHEMA/CODEMETA-2.0) vocabulary
+if possible.
+
+## Metadata requirements
+
+MUST
+- the schema/vocabulary used *MUST* be specified with a persistent url
+(DublinCore, DOAP, CodeMeta, etc.)
+- the origin url *MUST* be defined depending on the schema you use:
+```XML
+
+hal.archives-ouvertes.fr
+hal.archives-ouvertes.fr
+hal.archives-ouvertes.fr
+```
+
+
+SHOULD
+- the external_identifier *SHOULD* match the Slug external-identifier in
+the header
+- the following metadata *SHOULD* be included using the correct terminology
+(depending on the schema you are using; the CodeMeta crosswalk table can
+ help you identify the terms):
+  - codemeta:name - the software artifact title
+  - codemeta:description - short or long description of the software in the
+  deposit
+  - codemeta:license - the software license(s)
+  - codemeta:author - the software authors
+
+MAY
+ - other metadata *MAY* be added with terms defined by the schema in use.
+ +## Examples +### Using only Atom +```XML + + + Awesome Compiler + urn:uuid:1225c695-cfb8-4ebb-aaaa-80da344efa6a + 1785io25c695 + 2017-10-07T15:17:08Z + some awesome author + +``` +### Using Atom with CodeMeta +```XML + + + Awesome Compiler + urn:uuid:1225c695-cfb8-4ebb-aaaa-80da344efa6a + 1785io25c695 + 1785io25c695 + origin url + other identifier, DOI, ARK + Domain + + description + key-word 1 + key-word 2 + creation date + publication date + comment + + article name + article id + + + Collaboration/Projet + project name + id + + see also + Sponsor A + Sponsor B + Platform/OS + dependencies + Version + active + + license + url spdx + + .Net Framework 3.0 + Python2.3 + + author1 + Inria + UPMC + + + author2 + Inria + UPMC + + http://code.com + language 1 + language 2 + http://issuetracker.com + +``` +### Using Atom with DublinCore and CodeMeta (multi-schema entry) +``` XML + + + Awesome Compiler + hal + urn:uuid:1225c695-cfb8-4ebb-aaaa-80da344efa6a + %s + hal-01587361 + doi:10.5281/zenodo.438684 + The assignment problem + AffectationRO + author + [INFO] Computer Science [cs] + [INFO.INFO-RO] Computer Science [cs]/Operations Research [cs.RO] + SOFTWARE + Project in OR: The assignment problemA java implementation for the assignment problem first release + description fr + 2015-06-01 + 2017-10-19 + en + + + origin url + + 1.0.0 + key word + Comment + Rfrence interne + + link + Sponsor + + Platform/OS + dependencies + Ended + + license + url spdx + + + http://code.com + language 1 + language 2 + +``` diff --git a/swh/deposit/settings/testing.py b/swh/deposit/settings/testing.py index a8d3574c..2984a329 100644 --- a/swh/deposit/settings/testing.py +++ b/swh/deposit/settings/testing.py @@ -1,49 +1,50 @@ # Copyright (C) 2017 The Software Heritage developers # See the AUTHORS file at the top-level directory of this distribution # License: GNU General Public License version 3, or any later version # See top-level LICENSE file for more information from .common import * # 
noqa from .common import ALLOWED_HOSTS from .development import * # noqa from .development import INSTALLED_APPS # django-nose setup ALLOWED_HOSTS += ['testserver'] INSTALLED_APPS += ['django_nose'] TEST_RUNNER = 'django_nose.NoseTestSuiteRunner' +NOSE_ARGS = ['--verbosity=3', '-s'] # to see test pass # https://docs.djangoproject.com/en/1.10/ref/settings/#logging LOGGING = { 'version': 1, 'disable_existing_loggers': True, 'formatters': { 'standard': { 'format': "[%(asctime)s] %(levelname)s [%(name)s:%(lineno)s] %(message)s", # noqa 'datefmt': "%d/%b/%Y %H:%M:%S" }, }, 'handlers': { 'console': { 'level': 'ERROR', 'class': 'logging.StreamHandler', 'formatter': 'standard' }, }, 'loggers': { 'swh.deposit': { 'handlers': ['console'], 'level': 'ERROR', }, } } # https://docs.djangoproject.com/en/1.11/ref/settings/#std:setting-MEDIA_ROOT # SECURITY WARNING: Override this in the production.py module MEDIA_ROOT = '/tmp/swh-deposit/test/uploads/' FILE_UPLOAD_HANDLERS = [ "django.core.files.uploadhandler.MemoryFileUploadHandler", ] diff --git a/swh/deposit/tests/api/test_deposit_atom.py b/swh/deposit/tests/api/test_deposit_atom.py index ae96e151..f243b8b8 100644 --- a/swh/deposit/tests/api/test_deposit_atom.py +++ b/swh/deposit/tests/api/test_deposit_atom.py @@ -1,295 +1,454 @@ # Copyright (C) 2017 The Software Heritage developers # See the AUTHORS file at the top-level directory of this distribution # License: GNU General Public License version 3, or any later version # See top-level LICENSE file for more information from django.core.urlresolvers import reverse from io import BytesIO from nose.tools import istest from rest_framework import status from rest_framework.test import APITestCase from swh.deposit.config import COL_IRI, DEPOSIT_STATUS_READY from swh.deposit.models import Deposit, DepositRequest from swh.deposit.parsers import parse_xml from ..common import BasicTestCase, WithAuthTestCase class DepositAtomEntryTestCase(APITestCase, WithAuthTestCase, BasicTestCase): 
"""Try and post atom entry deposit. """ def setUp(self): super().setUp() self.atom_entry_data0 = b""" Awesome Compiler hal urn:uuid:1225c695-cfb8-4ebb-aaaa-80da344efa6a %s 2017-10-07T15:17:08Z some awesome author something awesome-compiler This is an awesome compiler destined to awesomely compile stuff and other stuff compiler,programming,language 2005-10-07T17:17:08Z 2005-10-07T17:17:08Z release note related link Awesome https://hoster.org/awesome-compiler GNU/Linux 0.0.1 running all """ self.atom_entry_data1 = b""" hal urn:uuid:2225c695-cfb8-4ebb-aaaa-80da344efa6a 2017-10-07T15:17:08Z some awesome author something awesome-compiler This is an awesome compiler destined to awesomely compile stuff and other stuff compiler,programming,language 2005-10-07T17:17:08Z 2005-10-07T17:17:08Z release note related link Awesome https://hoster.org/awesome-compiler GNU/Linux 0.0.1 running all """ self.atom_entry_data2 = b""" %s """ self.atom_entry_data_empty_body = b""" """ self.atom_entry_data3 = b""" something """ + self.atom_entry_data_atom_only = b""" + + Awesome Compiler + urn:uuid:1225c695-cfb8-4ebb-aaaa-80da344efa6a + 1785io25c695 + 2017-10-07T15:17:08Z + some awesome author + """ + + self.atom_entry_data_codemeta = b""" + + Awesome Compiler + urn:uuid:1225c695-cfb8-4ebb-aaaa-80da344efa6a + 1785io25c695 + 1785io25c695 + origin url + other identifier, DOI, ARK + Domain + + description + key-word 1 + key-word 2 + creation date + publication date + comment + + article name + article id + + + Collaboration/Projet + project name + id + + see also + Sponsor A + Sponsor B + Platform/OS + dependencies + Version + active + + license + url spdx + + .Net Framework 3.0 + Python2.3 + + author1 + Inria + UPMC + + + author2 + Inria + UPMC + + http://code.com + language 1 + language 2 + http://issuetracker.com + """ + + self.atom_entry_data_dc_codemeta = b""" + + + + %s + hal-01587361 + https://hal.inria.fr/hal-01587361 + https://hal.inria.fr/hal-01587361/document + 
https://hal.inria.fr/hal-01587361/file/AffectationRO-v1.0.0.zip + doi:10.5281/zenodo.438684 + The assignment problem + AffectationRO + Gruenpeter, Morane + [INFO] Computer Science [cs] + [INFO.INFO-RO] Computer Science [cs]/Operations Research [cs.RO] + SOFTWARE + Project in OR: The assignment problemA java implementation for the assignment problem first release + description fr + 2015-06-01 + 2017-10-19 + en + + + url stable + Version sur hal + Version entre par lutilisateur + Mots-cls + Commentaire + Rfrence interne + + Collaboration/Projet + nom du projet + id + + Voir aussi + Financement + Projet ANR + Projet Europen + Platform/OS + Dpendances + Etat du dveloppement + + license + url spdx + + Outils de dveloppement- outil no1 + Outils de dveloppement- outil no2 + http://code.com + language 1 + language 2 + """ + self.atom_entry_tei = b"""HAL TEI export of hal-01587083CCSDDistributed under a Creative Commons Attribution 4.0 International License

HAL API platform

questionnaire software metadataMoraneGruenpeter7de56c632362954fa84172cad80afe4einria.fr1556733MoraneGruenpeterf85a43a5fb4a2e0778a77e017f28c8fdgmail.com2017-09-29 11:21:322017-10-03 17:20:132017-10-03 17:20:132017-09-292017-09-29contributorMoraneGruenpeterf85a43a5fb4a2e0778a77e017f28c8fdgmail.comCCSDhal-01587083https://hal.inria.fr/hal-01587083gruenpeter:hal-0158708320172017questionnaire software metadataMoraneGruenpeter7de56c632362954fa84172cad80afe4einria.fr1556733EnglishComputer Science [cs]SoftwareIRILLInitiative pour la Recherche et l'Innovation sur le Logiciel Libre
https://www.irill.org/
Universite Pierre et Marie Curie - Paris 6UPMC
4 place Jussieu - 75005 Paris
http://www.upmc.fr/
Institut National de Recherche en Informatique et en AutomatiqueInria
Domaine de VoluceauRocquencourt - BP 10578153 Le Chesnay Cedex
http://www.inria.fr/en/
Universite Paris Diderot - Paris 7UPD7
5 rue Thomas-Mann - 75205 Paris cedex 13
http://www.univ-paris-diderot.fr
""" # noqa self.atom_entry_data_badly_formatted = b""" """ @istest def post_deposit_atom_empty_body_request(self): """Posting empty body request should return a 400 response """ response = self.client.post( reverse(COL_IRI, args=[self.collection.name]), content_type='application/atom+xml;type=entry', data=self.atom_entry_data_empty_body) self.assertEqual(response.status_code, status.HTTP_400_BAD_REQUEST) @istest def post_deposit_atom_badly_formatted_is_a_bad_request(self): """Posting a badly formatted atom should return a 400 response """ response = self.client.post( reverse(COL_IRI, args=[self.collection.name]), content_type='application/atom+xml;type=entry', data=self.atom_entry_data_badly_formatted) self.assertEqual(response.status_code, status.HTTP_400_BAD_REQUEST) @istest def post_deposit_atom_without_slug_header_is_bad_request(self): """Posting an atom entry without a slug header should return a 400 """ url = reverse(COL_IRI, args=[self.collection.name]) # when response = self.client.post( url, content_type='application/atom+xml;type=entry', data=self.atom_entry_data0, # + headers HTTP_IN_PROGRESS='false') self.assertIn(b'Missing SLUG header', response.content) self.assertEqual(response.status_code, status.HTTP_400_BAD_REQUEST) @istest def post_deposit_atom_unknown_collection(self): """Posting an atom entry to an unknown collection should return a 404 """ response = self.client.post( reverse(COL_IRI, args=['unknown-one']), content_type='application/atom+xml;type=entry', data=self.atom_entry_data3, HTTP_SLUG='something') self.assertEqual(response.status_code, status.HTTP_404_NOT_FOUND) @istest def post_deposit_atom_entry_initial(self): """Posting an initial atom entry should return 201 with deposit receipt """ # given external_id = 'urn:uuid:1225c695-cfb8-4ebb-aaaa-80da344efa6a' with self.assertRaises(Deposit.DoesNotExist): Deposit.objects.get(external_id=external_id) atom_entry_data = self.atom_entry_data0 % external_id.encode('utf-8') # when response = 
self.client.post(
             reverse(COL_IRI, args=[self.collection.name]),
             content_type='application/atom+xml;type=entry',
             data=atom_entry_data,
             HTTP_SLUG='external-id',
             HTTP_IN_PROGRESS='false')

         # then
         self.assertEqual(response.status_code, status.HTTP_201_CREATED)

         response_content = parse_xml(BytesIO(response.content))
         deposit_id = response_content[
             '{http://www.w3.org/2005/Atom}deposit_id']

         deposit = Deposit.objects.get(pk=deposit_id)
         self.assertEqual(deposit.collection, self.collection)
         self.assertEqual(deposit.external_id, external_id)
         self.assertEqual(deposit.status, DEPOSIT_STATUS_READY)
         self.assertEqual(deposit.client, self.user)

         # one associated request to a deposit
         deposit_request = DepositRequest.objects.get(deposit=deposit)
         self.assertIsNotNone(deposit_request.metadata)
         self.assertFalse(bool(deposit_request.archive))

+    @istest
+    def post_deposit_atom_entry_with_codemeta(self):
+        """Posting an atom entry with DublinCore and CodeMeta metadata
+        should return 201 with deposit receipt
+
+        """
+        # given
+        external_id = 'urn:uuid:1225c695-cfb8-4ebb-aaaa-80da344efa6a'
+
+        with self.assertRaises(Deposit.DoesNotExist):
+            Deposit.objects.get(external_id=external_id)
+
+        atom_entry_data = self.atom_entry_data_dc_codemeta % (
+            external_id.encode('utf-8'))
+
+        # when
+        response = self.client.post(
+            reverse(COL_IRI, args=[self.collection.name]),
+            content_type='application/atom+xml;type=entry',
+            data=atom_entry_data,
+            HTTP_SLUG='external-id',
+            HTTP_IN_PROGRESS='false')
+
+        # then
+        self.assertEqual(response.status_code, status.HTTP_201_CREATED)
+
+        response_content = parse_xml(BytesIO(response.content))
+
+        deposit_id = response_content[
+            '{http://www.w3.org/2005/Atom}deposit_id']
+
+        deposit = Deposit.objects.get(pk=deposit_id)
+        self.assertEqual(deposit.collection, self.collection)
+        self.assertEqual(deposit.external_id, external_id)
+        self.assertEqual(deposit.status, 'ready')
+        self.assertEqual(deposit.client, self.user)
+
+        # one associated request to a deposit
+        deposit_request = DepositRequest.objects.get(deposit=deposit)
+        self.assertIsNotNone(deposit_request.metadata)
+
+        self.assertFalse(bool(deposit_request.archive))
+
     @istest
     def test_post_deposit_atom_entry_tei(self):
         """Posting initial atom entry as TEI should return 201 with receipt

         """
         # given
         external_id = 'urn:uuid:1225c695-cfb8-4ebb-aaaa-80da344efa6a'
         with self.assertRaises(Deposit.DoesNotExist):
             Deposit.objects.get(external_id=external_id)

         atom_entry_data = self.atom_entry_tei

         # when
         response = self.client.post(
             reverse(COL_IRI, args=[self.collection.name]),
             content_type='application/atom+xml;type=entry',
             data=atom_entry_data,
             HTTP_SLUG=external_id,
             HTTP_IN_PROGRESS='false')

         # then
         self.assertEqual(response.status_code, status.HTTP_201_CREATED)

         response_content = parse_xml(BytesIO(response.content))
         deposit_id = response_content[
             '{http://www.w3.org/2005/Atom}deposit_id']

         deposit = Deposit.objects.get(pk=deposit_id)
         self.assertEqual(deposit.collection, self.collection)
         self.assertEqual(deposit.external_id, external_id)
         self.assertEqual(deposit.status, DEPOSIT_STATUS_READY)
         self.assertEqual(deposit.client, self.user)

         # one associated request to a deposit
         deposit_request = DepositRequest.objects.get(deposit=deposit)
         self.assertIsNotNone(deposit_request.metadata)
         self.assertFalse(bool(deposit_request.archive))

     @istest
     def post_deposit_atom_entry_multiple_steps(self):
         """After initial deposit, updating a deposit should return a 201

         """
         # given
         external_id = 'urn:uuid:2225c695-cfb8-4ebb-aaaa-80da344efa6a'

         with self.assertRaises(Deposit.DoesNotExist):
             deposit = Deposit.objects.get(external_id=external_id)

         # when
         response = self.client.post(
             reverse(COL_IRI, args=[self.collection.name]),
             content_type='application/atom+xml;type=entry',
             data=self.atom_entry_data1,
             HTTP_IN_PROGRESS='True',
             HTTP_SLUG=external_id)

         # then
         self.assertEqual(response.status_code, status.HTTP_201_CREATED)

         response_content = parse_xml(BytesIO(response.content))
         deposit_id = response_content[
             '{http://www.w3.org/2005/Atom}deposit_id']

         deposit = Deposit.objects.get(pk=deposit_id)
         self.assertEqual(deposit.collection, self.collection)
         self.assertEqual(deposit.external_id, external_id)
         self.assertEqual(deposit.status, 'partial')
         self.assertEqual(deposit.client, self.user)

         # one associated request to a deposit
         deposit_requests = DepositRequest.objects.filter(deposit=deposit)
         self.assertEqual(len(deposit_requests), 1)

         atom_entry_data = self.atom_entry_data2 % external_id.encode('utf-8')

         update_uri = response._headers['location'][1]

         # when updating the first deposit post
         response = self.client.post(
             update_uri,
             content_type='application/atom+xml;type=entry',
             data=atom_entry_data,
             HTTP_IN_PROGRESS='False')

         # then
         self.assertEqual(response.status_code, status.HTTP_201_CREATED)

         response_content = parse_xml(BytesIO(response.content))
         deposit_id = response_content[
             '{http://www.w3.org/2005/Atom}deposit_id']

         deposit = Deposit.objects.get(pk=deposit_id)
         self.assertEqual(deposit.collection, self.collection)
         self.assertEqual(deposit.external_id, external_id)
         self.assertEqual(deposit.status, DEPOSIT_STATUS_READY)
         self.assertEqual(deposit.client, self.user)

         self.assertEqual(len(Deposit.objects.all()), 1)

         # now 2 associated requests to a same deposit
         deposit_requests = DepositRequest.objects.filter(deposit=deposit)
         self.assertEqual(len(deposit_requests), 2)

         for deposit_request in deposit_requests:
             actual_metadata = deposit_request.metadata
             self.assertIsNotNone(actual_metadata)
             self.assertFalse(bool(deposit_request.archive))