diff --git a/docs/specs/spec-loading.rst b/docs/specs/spec-loading.rst
--- a/docs/specs/spec-loading.rst
+++ b/docs/specs/spec-loading.rst
@@ -1,9 +1,9 @@
 Loading specification
 =====================
 
-AN important part of the deposit specifications is the loading procedure whereas
+An important part of the deposit specifications is the loading procedure where
 a deposit is ingested into the Software Heritage (archive), using
-the tarball loader and the complete schema of software artifacts creation
+the tarball loader and the complete process of software artifacts creation
 in the archive.
 
 Tarball Loading
@@ -19,7 +19,7 @@
 
 
 Artifacts creation
------------------------
+------------------
 
 Deposit to artifacts mapping
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -50,10 +50,10 @@
 
 
 Origin artifact
-~~~~~~~~~~~~~~~~
-We create an origin using the url in the deposited metadata.
-The current deposit and future deposits with the same url or external_id
-will be associated to this origin.
+~~~~~~~~~~~~~~~
+
+We create an origin URL by concatenating the client URI and the value of the
+Slug header of the initial POST request of the deposit.
 
 .. code-block:: json
 
@@ -67,7 +67,8 @@
     }
 
 Visits
-~~~~~~~
+~~~~~~
+
 We identify with a visit each deposit push of the same external_id.
 Here in the example below, two snapshots are identified by two different
 visits.
@@ -100,33 +101,37 @@
 
 Snapshot artifact
 ~~~~~~~~~~~~~~~~~
-The snapshot represents one deposit push. The master branch points to a
-synthetic revision. We will create a second branch pointing to a release
-artifact, if the indicate that the deposit is a release with a `releaseNotes`.
 
-.. code-block:: json
+The snapshot represents one deposit push. The ``HEAD`` branch points to a
+synthetic revision.
+
+ .. code-block:: json
 
     {
         "snapshot": {
             "branches": {
-                "master": {
+                "HEAD": {
                     "target": "396b1ff29f7c75a0a3cc36f30e24ff7bae70bb52",
                     "target_type": "revision",
                     "target_url": "/api/1/revision/396b1ff29f7c75a0a3cc36f30e24ff7bae70bb52/"
                 }
-                "refs/tags/v1.1": {
-                    "target": "a9f3396f372ed4a51d75e15ca16c1c2df1fc5c97",
-                    "target_type": "release",
-                    "target_url": "/api/1/release/a9f3396f372ed4a51d75e15ca16c1c2df1fc5c97/"
-                }
             },
             "id": "a3773941561cc557853898773a19c07cfe2efc5a",
             "next_branch": null
         }
     }
 
+Note that previous versions of the deposit-loader named the branch ``master``
+instead, and created release branches under certain conditions.
+
 Release artifact
 ~~~~~~~~~~~~~~~~
+
+.. warning::
+
+   This part of the specification is not implemented yet, only releases are
+   currently being created.
+
 The content is deposited with a set of descriptive metadata in the CodeMeta
 vocabulary. The following CodeMeta terms implies that the artifact is a
 release:
@@ -177,29 +182,47 @@
 
 Revision artifact
 ~~~~~~~~~~~~~~~~~
-The metadata sent with the deposit is included in the revision which affects
-the hash computation, thus resulting in a unique identifier.
-This way, by depositing the same content with different metadata, will result
-in two different revisions in the SWH archive.
+
+The metadata sent with the deposit is stored outside the revision,
+and does not affect the hash computation.
+It contains the same fields as any revision object; in particular:
+
++-------------------+-----------------------------------------+
+| SWH revision field| Description                             |
++===================+=========================================+
+| message           | synthetic message, containing the name  |
+|                   | of the deposit client and an internal   |
+|                   | identifier of the deposit. For example: |
+|                   | ``hal: Deposit 817 in collection hal``  |
++-------------------+-----------------------------------------+
+| author            | synthetic author (SWH itself, for now)  |
++-------------------+-----------------------------------------+
+| committer         | same as the author (for now)            |
++-------------------+-----------------------------------------+
+| date              | see below                               |
++-------------------+-----------------------------------------+
+| committer_date    | see below                               |
++-------------------+-----------------------------------------+
 
 The date mapping
 ^^^^^^^^^^^^^^^^
+
 A deposit may contain 4 different dates concerning the software artifacts.
 The deposit's revision will reflect the most accurate point in time available.
 
 Here are all dates that can be available in a deposit:
 
-+-------------------+-----------------------------------+-----------------------------------------------+
-| dates             | location                          | Description                                   |
-+===================+===================================+===============================================+
-| reception_date    | On SWORD reception (automatic)    |the deposit was received at this ts            |
-+-------------------+-----------------------------------+-----------------------------------------------+
-| complete_date     | On SWH ingestion (automatic)      |the ingestion was completed by SWH at this ts  |
-+-------------------+-----------------------------------+-----------------------------------------------+
-| dateCreated       | metadata in codeMeta (optional)   |the software artifact was created at this ts   |
-+-------------------+-----------------------------------+----------------------+------------------------+
-| datePublished     | metadata in codeMeta (optional)   |the software was published (contributed in HAL)|
-+-------------------+-----------------------------------+----------------------+------------------------+
++----------------+---------------------------------+------------------------------------------------+
+| dates          | location                        | Description                                    |
++================+=================================+================================================+
+| reception_date | On SWORD reception (automatic)  | the deposit was received at this ts            |
++----------------+---------------------------------+------------------------------------------------+
+| complete_date  | On SWH ingestion (automatic)    | the ingestion was completed by SWH at this ts  |
++----------------+---------------------------------+------------------------------------------------+
+| dateCreated    | metadata in codeMeta (optional) | the software artifact was created at this ts   |
++----------------+---------------------------------+------------------------------------------------+
+| datePublished  | metadata in codeMeta (optional) | the software was published (contributed in HAL)|
++----------------+---------------------------------+------------------------------------------------+
 
 A visit targeting a snapshot contains one date:
 
@@ -222,11 +245,11 @@
 
 A release contains one date:
 
-+-------------------+----------------------------------+---------------+----------------+
-| SWH release field |Description                       |CodeMeta term  | Fallback value |
-+===================+==================================+===============+================+
-| date              |release date = publication date   |datePublished  |reception_date  |
-+-------------------+----------------------------------+---------------+----------------+
++-------------------+----------------------------------+----------------+-----------------+
+| SWH release field |Description                       | CodeMeta term  | Fallback value  |
++===================+==================================+================+=================+
+| date              |release date = publication date   | datePublished  | reception_date  |
++-------------------+----------------------------------+----------------+-----------------+
 
 .. code-block:: json
 
@@ -320,6 +343,7 @@
 
 Directory artifact
 ~~~~~~~~~~~~~~~~~~
+
 The directory artifact is the archive(s)' raw content deposited.
 
 .. code-block:: json
@@ -417,23 +441,10 @@
 Metadata loading
 ~~~~~~~~~~~~~~~~
 
-- the metadata received with the deposit are kept in the `metadata` fields
-  of the revision and in the ```origin_metadata`` table to facilitate search
-  over origin metadata.
+- the metadata received with the deposit are kept in a dedicated table
+  ``raw_extrinsic_metadata``, distinct from the ``revision`` and ``origin``
+  tables.
 
-- provider\_id and tool\_id are resolved by the prepare\_metadata method in the
-  loader-core
-
-- the origin\_metadata entry is sent to storage by the send\_origin\_metadata
-  in the loader-core
-
-origin\_metadata table:
-
-::
+- ``authority`` is computed from the deposit client information, and ``fetcher``
+  is the deposit loader.
 
-    id              bigint PK
-    origin          bigint
-    discovery_date  date
-    provider_id     bigint FK  // (from provider table)
-    tool_id         bigint FK  // indexer_configuration_id tool used for extraction
-    metadata        jsonb      // before translation
diff --git a/docs/specs/swh.xsd b/docs/specs/swh.xsd
--- a/docs/specs/swh.xsd
+++ b/docs/specs/swh.xsd
@@ -9,7 +9,7 @@
-
+