docs/specs/spec-loading.rst
Loading specification | Loading specification | ||||
===================== | ===================== | ||||
AN important part of the deposit specifications is the loading procedure whereas | An important part of the deposit specifications is the loading procedure where | ||||
a deposit is ingested into the Software Heritage archive, using | a deposit is ingested into the Software Heritage archive, using | ||||
the tarball loader and the complete schema of software artifacts creation | the tarball loader and the complete process of software artifacts creation | ||||
in the archive. | in the archive. | ||||
Tarball Loading | Tarball Loading | ||||
--------------- | --------------- | ||||
The ``swh-loader-tar`` module is already able to inject tarballs in swh | The ``swh-loader-tar`` module is already able to inject tarballs in swh | ||||
with very limited metadata (mainly the origin). | with very limited metadata (mainly the origin). | ||||
The loading of the deposit will use the deposit's associated data: | The loading of the deposit will use the deposit's associated data: | ||||
* the metadata | * the metadata | ||||
* the archive(s) | * the archive(s) | ||||
Artifacts creation | Artifacts creation | ||||
---------------------- | ------------------ | ||||
Deposit to artifacts mapping | Deposit to artifacts mapping | ||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||||
This is a global view of the deposit ingestion: | This is a global view of the deposit ingestion: | ||||
+------------------------------------+-----------------------------------------+ | +------------------------------------+-----------------------------------------+ | ||||
| swh artifact | representation in deposit | | | swh artifact | representation in deposit | | ||||
[...]
| | the expanded submitted tarball | | | | the expanded submitted tarball | | ||||
+------------------------------------+-----------------------------------------+ | +------------------------------------+-----------------------------------------+ | ||||
| directory | root directory of the expanded submitted| | | directory | root directory of the expanded submitted| | ||||
| | tarball | | | | tarball | | ||||
+------------------------------------+-----------------------------------------+ | +------------------------------------+-----------------------------------------+ | ||||
Origin artifact | Origin artifact | ||||
~~~~~~~~~~~~~~~~ | ~~~~~~~~~~~~~~~ | ||||
We create an origin using the url in the deposited metadata. | |||||
The current deposit and future deposits with the same url or external_id | We create an origin URL by concatenating the client URI and the value of the | ||||
will be associated to this origin. | Slug header of the initial POST request of the deposit. | ||||
.. code-block:: json | .. code-block:: json | ||||
{ | { | ||||
"origin": { | "origin": { | ||||
"id": 89283768, | "id": 89283768, | ||||
"origin_visits_url": "/api/1/origin/89283768/visits/", | "origin_visits_url": "/api/1/origin/89283768/visits/", | ||||
"type": "deposit", | "type": "deposit", | ||||
"url": "https://hal.archives-ouvertes.fr/hal-02140606" | "url": "https://hal.archives-ouvertes.fr/hal-02140606" | ||||
} | } | ||||
} | } | ||||
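As a minimal sketch of the new rule above, the origin URL can be built by joining the client URI and the value of the Slug header. The function name ``origin_url`` and the plain string concatenation are assumptions made for illustration; the specification does not say how separators or escaping are handled.

```python
def origin_url(client_uri: str, slug: str) -> str:
    """Build an origin URL from the client URI and the Slug header value.

    Assumption: plain concatenation, with no separator added or removed.
    """
    return client_uri + slug


# Values taken from the JSON example above.
url = origin_url("https://hal.archives-ouvertes.fr/", "hal-02140606")
print(url)
```

With these inputs the result matches the ``url`` field of the origin shown in the JSON example.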
Visits | Visits | ||||
~~~~~~~ | ~~~~~~ | ||||
Each deposit push of the same external_id is identified by a visit. | Each deposit push of the same external_id is identified by a visit. | ||||
In the example below, two snapshots are identified by two different visits. | In the example below, two snapshots are identified by two different visits. | ||||
.. code-block:: json | .. code-block:: json | ||||
{ | { | ||||
"visits": [ | "visits": [ | ||||
{ | { | ||||
[...]
"type": "deposit", | "type": "deposit", | ||||
"visit": 1 | "visit": 1 | ||||
} | } | ||||
] | ] | ||||
} | } | ||||
Snapshot artifact | Snapshot artifact | ||||
~~~~~~~~~~~~~~~~~ | ~~~~~~~~~~~~~~~~~ | ||||
The snapshot represents one deposit push. The master branch points to a | |||||
synthetic revision. We will create a second branch pointing to a release | The snapshot represents one deposit push. The ``HEAD`` branch points to a | ||||
artifact, if the indicate that the deposit is a release with a `releaseNotes`. | synthetic revision. | ||||
.. code-block:: json | .. code-block:: json | ||||
{ | { | ||||
"snapshot": { | "snapshot": { | ||||
"branches": { | "branches": { | ||||
"master": { | "HEAD": { | ||||
"target": "396b1ff29f7c75a0a3cc36f30e24ff7bae70bb52", | "target": "396b1ff29f7c75a0a3cc36f30e24ff7bae70bb52", | ||||
"target_type": "revision", | "target_type": "revision", | ||||
"target_url": "/api/1/revision/396b1ff29f7c75a0a3cc36f30e24ff7bae70bb52/" | "target_url": "/api/1/revision/396b1ff29f7c75a0a3cc36f30e24ff7bae70bb52/" | ||||
} | } | ||||
"refs/tags/v1.1": { | |||||
"target": "a9f3396f372ed4a51d75e15ca16c1c2df1fc5c97", | |||||
"target_type": "release", | |||||
"target_url": "/api/1/release/a9f3396f372ed4a51d75e15ca16c1c2df1fc5c97/" | |||||
} | |||||
}, | }, | ||||
"id": "a3773941561cc557853898773a19c07cfe2efc5a", | "id": "a3773941561cc557853898773a19c07cfe2efc5a", | ||||
"next_branch": null | "next_branch": null | ||||
} | } | ||||
} | } | ||||
Note that previous versions of the deposit-loader named the branch ``master`` | |||||
instead, and created release branches under certain conditions. | |||||
Release artifact | Release artifact | ||||
~~~~~~~~~~~~~~~~ | ~~~~~~~~~~~~~~~~ | ||||
.. warning:: | |||||
This part of the specification is not implemented yet, only releases are | |||||
currently being created. | |||||
The content is deposited with a set of descriptive metadata in the CodeMeta | The content is deposited with a set of descriptive metadata in the CodeMeta | ||||
vocabulary. The following CodeMeta terms imply that the | vocabulary. The following CodeMeta terms imply that the | ||||
artifact is a release: | artifact is a release: | ||||
- `releaseNotes` | - `releaseNotes` | ||||
- `softwareVersion` | - `softwareVersion` | ||||
If present, a release artifact will be created with the mapping below: | If present, a release artifact will be created with the mapping below: | ||||
[...]
"target_type": "revision", | "target_type": "revision", | ||||
"target_url": "/api/1/revision/396b1ff29f7c75a0a3cc36f30e24ff7bae70bb52/" | "target_url": "/api/1/revision/396b1ff29f7c75a0a3cc36f30e24ff7bae70bb52/" | ||||
} | } | ||||
} | } | ||||
Revision artifact | Revision artifact | ||||
~~~~~~~~~~~~~~~~~ | ~~~~~~~~~~~~~~~~~ | ||||
The metadata sent with the deposit is included in the revision which affects | |||||
the hash computation, thus resulting in a unique identifier. | The metadata sent with the deposit is stored outside the revision, | ||||
This way, by depositing the same content with different metadata, will result | and does not affect the hash computation. | ||||
in two different revisions in the SWH archive. | The revision contains the same fields as any other revision object; in particular: | ||||
ardumont: `name` | |||||
+-------------------+-----------------------------------------+ | |||||
| SWH revision field| Description | | |||||
+===================+=========================================+ | |||||
| message | synthetic message, containing the name | | |||||
| | of the deposit client and an internal | | |||||
| | identifier of the deposit. For example: | | |||||
| | ``hal: Deposit 817 in collection hal`` | | |||||
+-------------------+-----------------------------------------+ | |||||
| author | synthetic author (SWH itself, for now) | | |||||
+-------------------+-----------------------------------------+ | |||||
| committer | same as the author (for now) | | |||||
+-------------------+-----------------------------------------+ | |||||
| date | see below | | |||||
+-------------------+-----------------------------------------+ | |||||
| committer_date | see below | | |||||
+-------------------+-----------------------------------------+ | |||||
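The synthetic message described in the table can be sketched as a simple format string. The template below is inferred from the single example given (``hal: Deposit 817 in collection hal``); the function and parameter names are hypothetical and the real loader may build the message differently.

```python
def synthetic_message(client: str, deposit_id: int, collection: str) -> str:
    """Format a synthetic revision message.

    Assumption: the template mirrors the one example in the specification.
    """
    return f"{client}: Deposit {deposit_id} in collection {collection}"


print(synthetic_message("hal", 817, "hal"))
```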
The date mapping | The date mapping | ||||
^^^^^^^^^^^^^^^^ | ^^^^^^^^^^^^^^^^ | ||||
A deposit may contain 4 different dates concerning the software artifacts. | A deposit may contain 4 different dates concerning the software artifacts. | ||||
The deposit's revision will reflect the most accurate point in time available. | The deposit's revision will reflect the most accurate point in time available. | ||||
Here are all dates that can be available in a deposit: | Here are all dates that can be available in a deposit: | ||||
+-------------------+-----------------------------------+-----------------------------------------------+ | +----------------+---------------------------------+------------------------------------------------+ | ||||
| dates | location | Description | | | dates | location | Description | | ||||
+===================+===================================+===============================================+ | +================+=================================+================================================+ | ||||
| reception_date | On SWORD reception (automatic) |the deposit was received at this ts | | | reception_date | On SWORD reception (automatic) | the deposit was received at this ts | | ||||
+-------------------+-----------------------------------+-----------------------------------------------+ | +----------------+---------------------------------+------------------------------------------------+ | ||||
| complete_date | On SWH ingestion (automatic) |the ingestion was completed by SWH at this ts | | | complete_date | On SWH ingestion (automatic) | the ingestion was completed by SWH at this ts | | ||||
+-------------------+-----------------------------------+-----------------------------------------------+ | +----------------+---------------------------------+------------------------------------------------+ | ||||
| dateCreated | metadata in codeMeta (optional) |the software artifact was created at this ts | | | dateCreated | metadata in codeMeta (optional) | the software artifact was created at this ts | | ||||
+-------------------+-----------------------------------+----------------------+------------------------+ | +----------------+---------------------------------+------------------------------------------------+ | ||||
| datePublished | metadata in codeMeta (optional) |the software was published (contributed in HAL)| | | datePublished | metadata in codeMeta (optional) | the software was published (contributed in HAL)| | ||||
+-------------------+-----------------------------------+----------------------+------------------------+ | +----------------+---------------------------------+------------------------------------------------+ | ||||
A visit targeting a snapshot contains one date: | A visit targeting a snapshot contains one date: | ||||
+-------------------+----------------------------------------------+----------------+ | +-------------------+----------------------------------------------+----------------+ | ||||
| SWH visit field | Description | value | | | SWH visit field | Description | value | | ||||
+===================+==============================================+================+ | +===================+==============================================+================+ | ||||
| date | the origin pushed the deposit at this date | reception_date | | | date | the origin pushed the deposit at this date | reception_date | | ||||
+-------------------+----------------------------------------------+----------------+ | +-------------------+----------------------------------------------+----------------+ | ||||
A revision contains two dates: | A revision contains two dates: | ||||
+-------------------+-----------------------------------------+----------------+----------------+ | +-------------------+-----------------------------------------+----------------+----------------+ | ||||
| SWH revision field| Description | CodeMeta term | Fallback value | | | SWH revision field| Description | CodeMeta term | Fallback value | | ||||
+===================+=========================================+================+================+ | +===================+=========================================+================+================+ | ||||
| date | date of software artifact modification | dateCreated | reception_date | | | date | date of software artifact modification | dateCreated | reception_date | | ||||
+-------------------+-----------------------------------------+----------------+----------------+ | +-------------------+-----------------------------------------+----------------+----------------+ | ||||
| committer_date | date of the commit in VCS | datePublished | reception_date | | | committer_date | date of the commit in VCS | datePublished | reception_date | | ||||
+-------------------+-----------------------------------------+----------------+----------------+ | +-------------------+-----------------------------------------+----------------+----------------+ | ||||
A release contains one date: | A release contains one date: | ||||
+-------------------+----------------------------------+---------------+----------------+ | +-------------------+----------------------------------+----------------+-----------------+ | ||||
| SWH release field |Description |CodeMeta term | Fallback value | | | SWH release field |Description | CodeMeta term | Fallback value | | ||||
+===================+==================================+===============+================+ | +===================+==================================+================+=================+ | ||||
| date |release date = publication date |datePublished |reception_date | | | date |release date = publication date | datePublished | reception_date | | ||||
+-------------------+----------------------------------+---------------+----------------+ | +-------------------+----------------------------------+----------------+-----------------+ | ||||
.. code-block:: json | .. code-block:: json | ||||
{ | { | ||||
"revision": { | "revision": { | ||||
"author": { | "author": { | ||||
"email": "robot@softwareheritage.org", | "email": "robot@softwareheritage.org", | ||||
[...]
"synthetic": true, | "synthetic": true, | ||||
"type": "tar", | "type": "tar", | ||||
"url": "/api/1/revision/396b1ff29f7c75a0a3cc36f30e24ff7bae70bb52/" | "url": "/api/1/revision/396b1ff29f7c75a0a3cc36f30e24ff7bae70bb52/" | ||||
} | } | ||||
} | } | ||||
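The date mapping tables above can be sketched as a small function that picks the CodeMeta term when present and falls back to ``reception_date`` otherwise. This is an illustration only: the function name, the flat ``dateCreated`` / ``datePublished`` keys, and the use of plain ISO strings are assumptions, not the loader's actual interface.

```python
from typing import Dict, Mapping


def map_dates(codemeta: Mapping[str, str], reception_date: str) -> Dict[str, Dict[str, str]]:
    """Map deposit dates to visit, revision and release dates.

    Assumption: a missing or empty CodeMeta term falls back to reception_date.
    """
    date_created = codemeta.get("dateCreated") or reception_date
    date_published = codemeta.get("datePublished") or reception_date
    return {
        # The visit always records when the deposit was received.
        "visit": {"date": reception_date},
        # Revision: dateCreated for date, datePublished for committer_date.
        "revision": {"date": date_created, "committer_date": date_published},
        # Release: the release date is the publication date.
        "release": {"date": date_published},
    }


dates = map_dates({"dateCreated": "2017-05-10T00:00:00+00:00"}, "2019-05-27T16:28:33+00:00")
print(dates["revision"])
```

In this usage, only ``dateCreated`` is supplied, so the revision ``date`` comes from the metadata while ``committer_date`` and the release ``date`` fall back to the reception date.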
Directory artifact | Directory artifact | ||||
~~~~~~~~~~~~~~~~~~ | ~~~~~~~~~~~~~~~~~~ | ||||
The directory artifact is the raw content of the deposited archive(s). | The directory artifact is the raw content of the deposited archive(s). | ||||
.. code-block:: json | .. code-block:: json | ||||
{ | { | ||||
"directory": [ | "directory": [ | ||||
{ | { | ||||
"dir_id": "fb13b51abbcfd13de85d9ba8d070a23679576cd7", | "dir_id": "fb13b51abbcfd13de85d9ba8d070a23679576cd7", | ||||
[...] | When the loading has failed, the deposit entry is updated: | ||||
- ``swh-id`` and ``complete_data`` remain as is | - ``swh-id`` and ``complete_data`` remain as is | ||||
*Note:* As a further improvement, we may prefer having a retry policy with | *Note:* As a further improvement, we may prefer having a retry policy with | ||||
graceful delays for further scheduling. | graceful delays for further scheduling. | ||||
Metadata loading | Metadata loading | ||||
~~~~~~~~~~~~~~~~ | ~~~~~~~~~~~~~~~~ | ||||
- the metadata received with the deposit are kept in the `metadata` fields | - the metadata received with the deposit are kept in a dedicated table | ||||
of the revision and in the ```origin_metadata`` table to facilitate search | ``raw_extrinsic_metadata``, distinct from the ``revision`` and ``origin`` | ||||
over origin metadata. | tables. | ||||
- provider\_id and tool\_id are resolved by the prepare\_metadata method in the | - ``authority`` is computed from the deposit client information, and ``fetcher`` | ||||
loader-core | is the deposit loader. | ||||
- the origin\_metadata entry is sent to storage by the send\_origin\_metadata | |||||
in the loader-core | |||||
origin\_metadata table: | |||||
:: | |||||
id bigint PK | |||||
origin bigint | |||||
discovery_date date | |||||
provider_id bigint FK // (from provider table) | |||||
tool_id bigint FK // indexer_configuration_id tool used for extraction | |||||
metadata jsonb // before translation |
name
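To illustrate the new behaviour, where deposit metadata is stored in ``raw_extrinsic_metadata`` with an ``authority`` derived from the deposit client and the deposit loader as ``fetcher``, here is a rough sketch of such an entry as a plain dictionary. Every field name and value below (``target``, ``discovery_date``, the ``deposit_client`` type, the ``deposit-loader`` fetcher name) is hypothetical; the actual schema is defined by the storage layer and may differ.

```python
from datetime import datetime, timezone
from typing import Any, Dict, Optional


def build_metadata_entry(
    target: str,
    client_url: str,
    raw_metadata: bytes,
    discovery_date: Optional[datetime] = None,
) -> Dict[str, Any]:
    """Assemble a hypothetical raw_extrinsic_metadata entry for a deposit."""
    return {
        "target": target,
        # Default to the current time when no discovery date is given.
        "discovery_date": discovery_date or datetime.now(timezone.utc),
        # Authority: computed from the deposit client information (assumed shape).
        "authority": {"type": "deposit_client", "url": client_url},
        # Fetcher: the deposit loader itself (assumed name).
        "fetcher": {"name": "deposit-loader"},
        # The metadata is kept as received and is not part of the revision.
        "metadata": raw_metadata,
    }


entry = build_metadata_entry(
    "https://hal.archives-ouvertes.fr/hal-02140606",
    "https://hal.archives-ouvertes.fr/",
    b"<entry/>",
    datetime(2019, 5, 27, tzinfo=timezone.utc),
)
print(entry["authority"], entry["fetcher"])
```

Because the entry lives outside the revision, depositing the same content with different metadata leaves the revision hash unchanged, as stated in the Revision artifact section.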