diff --git a/docs/getting-started.rst b/docs/getting-started.rst --- a/docs/getting-started.rst +++ b/docs/getting-started.rst @@ -175,7 +175,7 @@ The steps to create a multisteps deposit: 1. Create an incomplete deposit -~~~~~~~~~~~~~~~~~~~ +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ First use the ``--partial`` argument to declare there is more to come .. code:: shell @@ -186,7 +186,7 @@ 2. Add content or metadata to the deposit -~~~~~~~~~~~~~~~~~~~ +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Continue the deposit by using the ``--deposit-id`` argument given as a response for the first step. You can continue adding content or metadata while you use the ``--partial`` argument. @@ -268,7 +268,7 @@ .. code:: shell -$ swh-deposit --username name --password secret --deposit-id '11' --status + $ swh-deposit --username name --password secret --deposit-id '11' --status .. code:: json diff --git a/docs/index.rst b/docs/index.rst --- a/docs/index.rst +++ b/docs/index.rst @@ -12,6 +12,7 @@ metadata.rst dev-info.rst sys-info.rst + specs/specs.rst Indices and tables ================== diff --git a/docs/blueprint.rst b/docs/specs/blueprint.rst rename from docs/blueprint.rst rename to docs/specs/blueprint.rst --- a/docs/blueprint.rst +++ b/docs/specs/blueprint.rst @@ -8,13 +8,13 @@ From client's deposit repository server to SWH's repository server: 1. The client requests for the server's abilities and its associated collection - (GET query to the *SD/service document uri*) + (GET query to the *SD/service document uri*) 2. The server answers the client with the service document which gives the - *collection uri* (also known as *COL/collection IRI*). + *collection uri* (also known as *COL/collection IRI*). 3. The client sends a deposit (optionally a zip archive, some metadata or both) - through the *collection uri*. + through the *collection uri*. 
This can be done in: @@ -22,16 +22,16 @@ * one POST request (metadata or archive) + other PUT or POST request to the *update uris* (*edit-media iri* or *edit iri*) - 1. Server validates the client's input or returns detailed error if any + a. Server validates the client's input or returns detailed error if any - 2. Server stores information received (metadata or software archive source + b. Server stores information received (metadata or software archive source code or both) 4. The server notifies the client it acknowledged the client's request. An - ``http 201 Created`` response with a deposit receipt in the body response is - sent back. That deposit receipt will hold the necessary information to - eventually complete the deposit later on if it was incomplete (also known as - status ``partial``). + ``http 201 Created`` response with a deposit receipt in the body response is + sent back. That deposit receipt will hold the necessary information to + eventually complete the deposit later on if it was incomplete (also known as + status ``partial``). 
Schema representation ^^^^^^^^^^^^^^^^^^^^^ diff --git a/docs/specs/metadata_example.xml b/docs/specs/metadata_example.xml new file mode 100644 --- /dev/null +++ b/docs/specs/metadata_example.xml @@ -0,0 +1,38 @@ +<?xml version="1.0"?> + <entry xmlns="http://www.w3.org/2005/Atom" + xmlns:codemeta="https://doi.org/10.5063/SCHEMA/CODEMETA-2.0" + xmlns:swh="swh.xsd"> + "{http://www.w3.org/2005/Atom}author": { + "{http://www.w3.org/2005/Atom}email": "hal@ccsd.cnrs.fr", + "{http://www.w3.org/2005/Atom}name": "HAL" + }, + <author> + <name>HAL</name> + <email>hal@ccsd.cnrs.fr</email> + </author> + <client>hal</client> + <external_identifier>hal-01243573</external_identifier> + <codemeta:name>The assignment problem</codemeta:name> + <codemeta:url>https://hal.archives-ouvertes.fr/hal-01243573</codemeta:url> + <codemeta:identifier>other identifier, DOI, ARK</codemeta:identifier> + <codemeta:applicationCategory>Domain</codemeta:applicationCategory> + <codemeta:description>description</codemeta:description> + <codemeta:author> + <codemeta:name> author1 </codemeta:name> + <codemeta:affiliation> Inria </codemeta:affiliation> + <codemeta:affiliation> UPMC </codemeta:affiliation> + </codemeta:author> + <codemeta:author> + <codemeta:name> author2 </codemeta:name> + <codemeta:affiliation> Inria </codemeta:affiliation> + <codemeta:affiliation> UPMC </codemeta:affiliation> + </codemeta:author> + <swh:deposit> + <swh:manifest> + <swh:object> + <swh:path>./path/to/file.txt</swh:path> + <swh:swhid>aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa</swh:swhid> + </swh:object> + </swh:manifest> + </swh:deposit> + </entry> diff --git a/docs/spec-loading.rst b/docs/specs/spec-loading.rst rename from docs/spec-loading.rst rename to docs/specs/spec-loading.rst diff --git a/docs/specs/spec-meta-deposit.rst b/docs/specs/spec-meta-deposit.rst new file mode 100644 --- /dev/null +++ b/docs/specs/spec-meta-deposit.rst @@ -0,0 +1,31 @@ +The meta-deposit +================ + +Goal +---- +A client wishes to deposit only 
metadata about an object in the Software +Heritage archive. + +The meta-deposit is a special deposit where no content is +deposited and the data transferred to Software Heritage is only +the metadata about an object or several objects in the archive. + +The scope of the meta-deposit is larger than that of the sparse-deposit, because +with a meta-deposit all types of objects in the archive can be described +with the deposited metadata: + +- origin +- snapshot +- revision +- release +- directory +- content + + +Loading procedure +------------------ + +In this case, the meta-deposit will be injected as a metadata entry at the +appropriate level (origin_metadata, revision_metadata, etc.) and won't result +in the creation of a new object like with the complete deposit and the +sparse-deposit. diff --git a/docs/specs/spec-sparse-deposit.rst b/docs/specs/spec-sparse-deposit.rst new file mode 100644 --- /dev/null +++ b/docs/specs/spec-sparse-deposit.rst @@ -0,0 +1,109 @@ +The sparse-deposit +================== + +Goal +---- +A client wishes to transfer a tarball for which part of the content is +already in the SWH archive. + +Requirements +------------ +To do so, the paths to the missing directories/content must be provided as +empty paths in the tarball and the list linking each path to the object in the +archive will be provided as part of the metadata. The list will be referred to +as the manifest list. + ++----------------------+-------------------------------------+ +| path | swh-id | ++======================+=====================================+ +| ./path/to/file.txt | swh:1:cnt:aaaaaaaaaaaaaaaaaaaaa... | ++----------------------+-------------------------------------+ +| ./path/to/dir/ | swh:1:dir:aaaaaaaaaaaaaaaaaaaaa... | ++----------------------+-------------------------------------+ + +Note: the *name* of the file or the directory is given by the path and is not +part of the identified object. 
+ +A concrete example +------------------ +The manifest list is included in the metadata xml atomEntry under the +swh namespace: + +.. code:: xml + + <?xml version="1.0"?> + <entry xmlns="http://www.w3.org/2005/Atom" + xmlns:codemeta="https://doi.org/10.5063/SCHEMA/CODEMETA-2.0" + xmlns:swh="swh.xsd"> + <author> + <name>HAL</name> + <email>hal@ccsd.cnrs.fr</email> + </author> + <client>hal</client> + <external_identifier>hal-01243573</external_identifier> + <codemeta:name>The assignment problem</codemeta:name> + <codemeta:url>https://hal.archives-ouvertes.fr/hal-01243573</codemeta:url> + <codemeta:identifier>other identifier, DOI, ARK</codemeta:identifier> + <codemeta:applicationCategory>Domain</codemeta:applicationCategory> + <codemeta:description>description</codemeta:description> + <codemeta:author> + <codemeta:name> author1 </codemeta:name> + <codemeta:affiliation> Inria </codemeta:affiliation> + <codemeta:affiliation> UPMC </codemeta:affiliation> + </codemeta:author> + <codemeta:author> + <codemeta:name> author2 </codemeta:name> + <codemeta:affiliation> Inria </codemeta:affiliation> + <codemeta:affiliation> UPMC </codemeta:affiliation> + </codemeta:author> + <swh:deposit> + <swh:manifest> + <swh:object> + <swh:path>./path/to/file.txt</swh:path> + <swh:swhid>swh:1:cnt:aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa</swh:swhid> + </swh:object> + <swh:object> + <swh:path>./path/to/second_file.txt</swh:path> + <swh:swhid>swh:1:cnt:bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb</swh:swhid> + </swh:object> + <swh:object> + <swh:path>./path/to/dir/</swh:path> + <swh:swhid>swh:1:dir:ddddddddddddddddddddddddddddddddd</swh:swhid> + </swh:object> + </swh:manifest> + </swh:deposit> + </entry> + +The tarball sent with the deposit will contain the following empty paths: +- path/to/file.txt +- path/to/second_file.txt +- path/to/dir/ + +Deposit verification +-------------------- + +After checking the integrity of the deposit content and +metadata, the following checks should be added: + +1. 
validate the manifest list structure with a swh-id for each path +2. verify that the paths in the manifest list are explicit and empty in the tarball +3. verify that the path name corresponds to the object type +4. locate the identifiers in the SWH archive + +Each of these verifications should return a distinct error for the deposit +and result in a 'rejected' deposit. + +Loading procedure +------------------ +The injection procedure should include: + +- load the tarball data +- create new objects using the path name and create links from the path to the + SWH object using the identifier +- calculate the identifier of the new objects at each level +- return the final swh-id of the new revision + +Invariant: the same content should yield the same swhid. That is why a complete +deposit with all the content and a sparse-deposit with the correct links will +result in the same root directory swh-id and, if the metadata are identical, +also in the same revision swh-id. diff --git a/docs/specs/specs.rst b/docs/specs/specs.rst new file mode 100644 --- /dev/null +++ b/docs/specs/specs.rst @@ -0,0 +1,13 @@ +.. _swh-deposit-specs: + +Software Heritage Deposit Specifications +======================================== + +.. toctree:: + :maxdepth: 1 + :caption: Contents: + + blueprint.rst + spec-loading.rst + spec-sparse-deposit.rst + spec-meta-deposit.rst
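
Review note on the "Deposit verification" list in spec-sparse-deposit.rst: checks 1 and 3 can be done on the manifest list alone, and could look roughly like the sketch below. This is not code from swh-deposit. The function name `check_manifest`, the `(path, swh-id)` tuple layout, and the rule that a trailing slash means a `dir` object while anything else means `cnt` are all assumptions made for illustration. Checks 2 and 4 need the tarball and the archive, so they are left out.

```python
import re
from typing import List, Tuple

# Core SWHID shape: swh:1:<object type>:<40 lowercase hex digits>.
SWHID_RE = re.compile(r"swh:1:(cnt|dir|rev|rel|snp):[0-9a-f]{40}")


def check_manifest(entries: List[Tuple[str, str]]) -> List[str]:
    """Hypothetical helper: return one error message per problem found.

    An empty list means the manifest list passed checks 1 and 3.
    """
    errors = []
    for path, swhid in entries:
        # Check 1: every entry needs a path and a well-formed swh-id.
        if not path:
            errors.append("empty path in manifest list")
            continue
        match = SWHID_RE.fullmatch(swhid)
        if match is None:
            errors.append(f"{path}: malformed swh-id {swhid!r}")
            continue
        # Check 3 (assumed rule): a trailing slash denotes a directory,
        # anything else denotes file content.
        expected = "dir" if path.endswith("/") else "cnt"
        if match.group(1) != expected:
            errors.append(
                f"{path}: expected a {expected} object, got {match.group(1)}"
            )
    return errors


if __name__ == "__main__":
    manifest = [
        ("./path/to/file.txt", "swh:1:cnt:" + "a" * 40),
        ("./path/to/dir/", "swh:1:cnt:" + "d" * 40),  # type mismatch
    ]
    for message in check_manifest(manifest):
        print(message)
```

Returning a list of messages rather than raising on the first failure keeps each verification separately reportable, which matches the requirement that every check produce its own error before the deposit is rejected.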