docs/specs/spec-loading.rst
Loading specification (draft) | Loading specification | ||||||
============================= | ===================== | ||||||
This part discusses the deposit loading part on the server side. | This part specifies the ingestion of the deposit in the SWH archive, using | ||||||
the tarball loader and the complete schema of software artifacts creation | |||||||
in the archive. | |||||||
zack: "this part" ... of what?
Alternative beginning suggestion: "This specification describes the… | |||||||
Done Inline ActionsI agree. This page is a part of a collection of specifications, this is why it starts that way, but I prefer starting fresh as you suggest also with the full definition of the acronym. moranegg: I agree. This page is a part of a collection of specifications, this is why it starts that way… | |||||||
Tarball Loading | Tarball Loading | ||||||
--------------- | --------------- | ||||||
The ``swh-loader-tar`` module is already able to inject tarballs in swh | The ``swh-loader-tar`` module is already able to inject tarballs in swh | ||||||
with very limited metadata (mainly the origin). | with very limited metadata (mainly the origin). | ||||||
The loading of the deposit will use the deposit's associated data: | The loading of the deposit will use the deposit's associated data: | ||||||
* the metadata | * the metadata | ||||||
* the archive(s) | * the archive(s) | ||||||
We will use the ``synthetic`` revision notion. | |||||||
To that revision will be associated the metadata. Those will be included | SWH artifacts creation | ||||||
in the hash computation, thus resulting in a unique identifier. | ---------------------- | ||||||
Loading mapping | Deposit to artifacts mapping | ||||||
~~~~~~~~~~~~~~~ | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||||||
Some of those metadata will also be included in the ``origin_metadata`` | This is a global view of the deposit ingestion | ||||||
table. | |||||||
+-----------------------------------+----------------------------------------+ | |||||||
|swh artifact | representation in deposit | | |||||||
+===================================+========================================+ | |||||||
|origin | https://hal.inria.fr/hal-id | | |||||||
ardumontUnsubmitted Done Inline ActionsIndentation of the url? ardumont: Indentation of the url? | |||||||
+-----------------------------------+----------------------------------------+ | |||||||
Done Inline ActionsI don't get what this means. "origin_visit" is an internal table name. You probably mean a specific field in that table? Either way, I suggest to use a descriptive name, with the SQL path as additional detail, e.g., "timestamp of the visit (origin_visit.field_name)". zack: I don't get what this means. "origin_visit" is an internal table name. You probably mean a… | |||||||
Done Inline ActionsI kept it from the original specs when there was no snapshots.
How does it sound? moranegg: I kept it from the original specs when there was no snapshots.
Maybe the best solution it to… | |||||||
|origin_visit | 1 :reception_date | | |||||||
+-----------------------------------+----------------------------------------+ | |||||||
|origin_metadata | aggregated metadata | | |||||||
+-----------------------------------+----------------------------------------+ | |||||||
zack (Done): minor grammar issue: "at visit" ← not sure what this means
moranegg (Done): I agree. See comment above.
|snapshot | at visit of all occurences | | |||||||
ardumontUnsubmitted Done Inline Actionsoccur`r`ence ardumont: occur`r`ence | |||||||
+-----------------------------------+----------------------------------------+ | |||||||
|occurrence & occurrence_history | master & | | |||||||
Not Done Inline ActionsAre the branches actually optional? Don't we always have at least one release here? Relatedly: this use case seems similar with package manager listers/loaders, we should compare with what they do and make sure we are consistent. In particular, I'm not sure we have a "master" branch there, most likely we have a "HEAD" branch, pointing to the most recent version at visit time + one branch for each release (the current one + all previous ones). zack: Are the branches actually optional? Don't we always have at least one release here?
Relatedly… | |||||||
Not Done Inline ActionsWe do not at the moment. Only master. With this specs we are introducing the concept of release to a deposit. @ardumont can you comment about package manager? moranegg: We do not at the moment. Only master.
With this specs we are introducing the concept of… | |||||||
Not Done Inline Actions
Good catch.
In the current state, we don't have release yet, only 1 revision.
We plan to refactor the deposit loader according to the package manager loader indeed. We can change it in the spec and in the implementation. ardumont: > Are the branches actually optional?
Good catch.
No, the snapshot branch is not optional. | |||||||
| | branch (optional): tag to release | | |||||||
ardumontUnsubmitted Done Inline ActionsThose no longer exists in swh model, i think that should go away. ardumont: Those no longer exists in swh model, i think that should go away.
We are using snapshot now… | |||||||
moraneggAuthorUnsubmitted Done Inline Actionsoccurrence isn't a branch in a snapshot? moranegg: `occurrence` isn't a branch in a snapshot?
and `occurrence history` is obsolete then? | |||||||
ardumontUnsubmitted Done Inline ActionsYes, obsolete. origin -> origin-visit -> snapshot -> {revision, release} ardumont: Yes, obsolete.
They were replaced by snapshot.
origin -> origin-visit -> snapshot -> {revision… | |||||||
+-----------------------------------+----------------------------------------+ | |||||||
|release | synthetic_release created from metadata| | |||||||
+-----------------------------------+----------------------------------------+ | |||||||
|revision | synthetic_revision (tarball) | | |||||||
ardumontUnsubmitted Done Inline Actionswhy do use _ here? ardumont: why do use `_` here? | |||||||
moraneggAuthorUnsubmitted Done Inline Actionsthis is the way it was, don't mind deleting the _ moranegg: this is the way it was, don't mind deleting the `_` | |||||||
+-----------------------------------+----------------------------------------+ | |||||||
|directory | upper level of the uncompressed archive| | |||||||
Done Inline ActionsI see what you mean here, and I agree with the arrangement. But the description of these two can probably be simpler, how about:
zack: I see what you mean here, and I agree with the arrangement. But the description of these two… | |||||||
moranegg (Done): ack.
+-----------------------------------+----------------------------------------+ | |||||||
Origin artifact | |||||||
~~~~~~~~~~~~~~~~ | |||||||
An origin using the url in the deposited metadata is created. | |||||||
The current deposit and future deposits with the same url or external_id | |||||||
Done Inline Actionsactive voice is generally preferable to passive voice (this is a general remark that applies to most of the paragraphs in the spec) zack: active voice is generally preferable to passive voice (this is a general remark that applies to… | |||||||
Done Inline ActionsCan you show me an example, I tried to avoid using We like "We create an origin from the url used in the metadata". moranegg: Can you show me an example, I tried to avoid using `We` like "We create an origin from the url… | |||||||
Not Done Inline ActionsI'm letting this one go for now. The style change will require quite some effort, and it's more important to land the spec than block on this. zack: I'm letting this one go for now. The style change will require quite some effort, and it's more… | |||||||
will be associated with this origin. | |||||||
.. code-block:: json | |||||||
{ | |||||||
"id": 89283768, | |||||||
"origin_visits_url": "/api/1/origin/89283768/visits/", | |||||||
"type": "deposit", | |||||||
"url": "https://hal.archives-ouvertes.fr/hal-02140606" | |||||||
} | |||||||
Visits | |||||||
~~~~~~~ | |||||||
Each push of the same origin or external_id will generate a visit of the origin. | |||||||
Here in the example below, two snapshots are identified by two different visits. | |||||||
.. code-block:: json | |||||||
[ | |||||||
{ | |||||||
"date": "2019-06-03T09:28:10.223007+00:00", | |||||||
"origin": 89283768, | |||||||
"origin_visit_url": "/api/1/origin/89283768/visit/2/", | |||||||
"snapshot": "a3773941561cc557853898773a19c07cfe2efc5a", | |||||||
"snapshot_url": "/api/1/snapshot/a3773941561cc557853898773a19c07cfe2efc5a/", | |||||||
"status": "full", | |||||||
"type": "deposit", | |||||||
"visit": 2 | |||||||
}, | |||||||
{ | |||||||
"date": "2019-05-27T12:23:31.037273+00:00", | |||||||
"origin": 89283768, | |||||||
"origin_visit_url": "/api/1/origin/89283768/visit/1/", | |||||||
"snapshot": "43fdb8291f1bf6962211c370e394f6abb1cbe01d", | |||||||
"snapshot_url": "/api/1/snapshot/43fdb8291f1bf6962211c370e394f6abb1cbe01d/", | |||||||
"status": "full", | |||||||
"type": "deposit", | |||||||
"visit": 1 | |||||||
} | |||||||
] | |||||||
Snapshot artifact
~~~~~~~~~~~~~~~~~
The snapshot represents one deposit push. | |||||||
.. code-block:: json | |||||||
{ | |||||||
"branches": { | |||||||
"master": { | |||||||
"target": "396b1ff29f7c75a0a3cc36f30e24ff7bae70bb52", | |||||||
"target_type": "revision", | |||||||
"target_url": "/api/1/revision/396b1ff29f7c75a0a3cc36f30e24ff7bae70bb52/" | |||||||
} | |||||||
}, | |||||||
"id": "a3773941561cc557853898773a19c07cfe2efc5a", | |||||||
"next_branch": null | |||||||
} | |||||||
ardumontUnsubmitted Done Inline ActionsIt's not exactly clear to me. Is this for information purposes? ardumont: It's not exactly clear to me.
What part generates those nice json output?
Is this for… | |||||||
moraneggAuthorUnsubmitted Done Inline ActionsThis is the api output, I imagine you know. or maybe when adding the artifact name it is clearer: snapshot: { "branches": { "master": { "target": "396b1ff29f7c75a0a3cc36f30e24ff7bae70bb52", "target_type": "revision", "target_url": "/api/1/revision/396b1ff29f7c75a0a3cc36f30e24ff7bae70bb52/" } }, "id": "a3773941561cc557853898773a19c07cfe2efc5a", "next_branch": null } I've use this schema on all api artifacts, can you say if this works? moranegg: This is the api output, I imagine you know.
Which is the json representation of the artifact.
I… | |||||||
ardumontUnsubmitted Done Inline Actions
Yes, i just realized that's the main api indeed.
The description works. ardumont: > This is the api output, I imagine you know.
Yes, i just realized that's the main api indeed. | |||||||
Release artifact | |||||||
~~~~~~~~~~~~~~~~ | |||||||
We will identify a deposit of a release with | |||||||
moraneggAuthorUnsubmitted Done Inline ActionsThe archive is deposited with a set of descriptive metadata, in the CodeMeta vocabulary. moranegg: The archive is deposited with a set of descriptive metadata, in the CodeMeta vocabulary.
The… | |||||||
the following metadata: | |||||||
- `releaseNotes` | |||||||
- `softwareVersion` | |||||||
If present, a release artifact will be created with the mapping below: | |||||||
+------------------+-----------------------------------+----------------+----------------+ | |||||||
|SWH release field | Description | Metadata term | Fallback value | | |||||||
+==================+===================================+================+================+ | |||||||
|target | revision containing all metadata |X |X | | |||||||
+------------------+-----------------------------------+----------------+----------------+ | |||||||
|target_type | revision |X |X | | |||||||
+------------------+-----------------------------------+----------------+----------------+ | |||||||
|name | release or tag name (mandatory) | softwareVersion| X | | |||||||
+------------------+-----------------------------------+----------------+----------------+ | |||||||
|message | message associated with release | releaseNotes | X | | |||||||
+------------------+-----------------------------------+----------------+----------------+ | |||||||
|date | release date = publication date | datePublished |deposit_date | | |||||||
+------------------+-----------------------------------+----------------+----------------+ | |||||||
|author | deposit client | author | client | | |||||||
ardumontUnsubmitted Done Inline ActionsMaybe modify the column metadata term's title to metadata term (source). ardumont: Maybe modify the column metadata term's title to `metadata term (source)`.
That explicits the… | |||||||
moraneggAuthorUnsubmitted Done Inline Actionshow about, specifying above from where we get the metadata and here change to CodeMeta term moranegg: how about, specifying above from where we get the metadata and here change to `CodeMeta term` | |||||||
ardumontUnsubmitted Done Inline Actionsworks for me ;) Note that all tables are a bit misaligned. ardumont: works for me ;)
Note that all tables are a bit misaligned.
To align it properly, i'm letting… | |||||||
moraneggAuthorUnsubmitted Done Inline Actionsit's very strange, I'm using Atom and not Emacs and I've openned the file in Emacs and it is misaligned. moranegg: it's very strange, I'm using `Atom` and not `Emacs` and I've openned the file in `Emacs` and it… | |||||||
+------------------+-----------------------------------+----------------+----------------+ | |||||||
.. code-block:: json | |||||||
{ | |||||||
"author": { | |||||||
"email": "hal@ccsd.cnrs.fr", | |||||||
"fullname": "HAL <phal@ccsd.cnrs.fr>", | |||||||
"id": x, | |||||||
"name": "HAL" | |||||||
}, | |||||||
"author_url": "/api/1/person/x/", | |||||||
"date": "2019-05-27T16:28:33+02:00", | |||||||
"id": "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa", | |||||||
Done Inline ActionsMaybe take a real id (that's a hash as well). ardumont: Maybe take a real id (that's a hash as well).
Because otherwise, it's not consistent with other… | |||||||
zack (Done): Seconded.
moranegg (Done): ack.
"message": "AffectationRO Version 1.1 - added new feature\n", | |||||||
"name": "1.1", | |||||||
"synthetic": true, | |||||||
"target": "396b1ff29f7c75a0a3cc36f30e24ff7bae70bb52", | |||||||
"target_type": "revision", | |||||||
"target_url": "/api/1/revision/396b1ff29f7c75a0a3cc36f30e24ff7bae70bb52/" | |||||||
} | |||||||
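The release mapping above can be sketched in Python as follows. This is only an illustration: the function name `build_release`, the flat `metadata` dict keyed by CodeMeta term, and the `deposit_date` and `client` parameters are assumptions, not the loader's actual API. The sketch also assumes that both `softwareVersion` and `releaseNotes` must be present, since the table gives no fallback for either.

```python
from typing import Optional


def build_release(metadata: dict, revision_id: str,
                  deposit_date: str, client: str) -> Optional[dict]:
    """Map deposit metadata to release fields (hypothetical helper)."""
    version = metadata.get("softwareVersion")
    notes = metadata.get("releaseNotes")
    # Assumption: no release is created unless both terms are present.
    if not version or not notes:
        return None
    return {
        "target": revision_id,
        "target_type": "revision",
        "name": version,
        "message": notes,
        # Fallback values from the table: deposit date and deposit client.
        "date": metadata.get("datePublished", deposit_date),
        "author": metadata.get("author", client),
        "synthetic": True,
    }
```

A deposit without `softwareVersion` returns `None` here, which corresponds to the case where only a revision is created.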
:: | |||||||
origin | https://hal.inria.fr/hal-id | | Revision artifact | ||||||
------------------------------------|----------------------------------------| | ~~~~~~~~~~~~~~~~~ | ||||||
origin_visit | 1 :reception_date | | A ``synthetic`` revision is created because the deposit is not a commit and | ||||||
ardumontUnsubmitted Done Inline ActionsThe sentence sounds a bit weird to my ears. I think there is no need to mention swh-loader-tar again. Also, that will change soon (thus this would need update everywhere it's mentioned, better keep it only once ;). ardumont: The sentence sounds a bit weird to my ears.
Maybe change this to: "the deposit `revision` is… | |||||||
moraneggAuthorUnsubmitted Done Inline Actionsgood thinking! and you are right... it's not very readable.
moranegg: good thinking! and you are right... it's not very readable.
Going for:
> the deposit revision… | |||||||
origin_metadata | aggregated metadata | | is created by the ``swh-loader-tar`` module. | ||||||
Done Inline ActionsI tend to try and shorten sentences where i can. The deposit revision is synthetically created. And then letting the readers understand the synthetic: True on their own. ardumont: I tend to try and shorten sentences where i can.
I'd go for a simpler:
```
The deposit… | |||||||
Done Inline ActionsJust remove this sentence. We have already said it's going to be a synthetic revision above, no need to further expand on the matter, IMO. zack: Just remove this sentence. We have already said it's going to be a synthetic revision above, no… | |||||||
moranegg (Done): sure.
occurrence & occurrence_history | branch: client's version n° (e.g hal) | | |||||||
revision | synthetic_revision (tarball) | | The metadata sent with the deposit will be included in the revision and will | ||||||
directory | upper level of the uncompressed archive| | affect the hash computation, thus resulting in a unique identifier. | ||||||
This way, depositing the same content with different metadata results in two
different revisions in the archive.
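The effect can be illustrated with a toy identifier. This is not the actual Software Heritage revision hashing: the sketch only assumes, for illustration, that the identifier is a SHA-1 over a canonical serialization of the directory id and the metadata. The name `toy_revision_id` is hypothetical.

```python
import hashlib
import json


def toy_revision_id(directory_id: str, metadata: dict) -> str:
    """Toy identifier: SHA-1 of the directory id plus sorted-key JSON metadata."""
    payload = json.dumps(
        {"directory": directory_id, "metadata": metadata}, sort_keys=True
    )
    return hashlib.sha1(payload.encode("utf-8")).hexdigest()


directory = "fb13b51abbcfd13de85d9ba8d070a23679576cd7"
first = toy_revision_id(directory, {"codemeta:softwareVersion": "1.0.1"})
second = toy_revision_id(directory, {"codemeta:softwareVersion": "1.0.2"})
# Same directory, different metadata: the two identifiers differ.
print(first != second)
```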
.. code-block:: json | |||||||
{ | |||||||
"author": { | |||||||
"email": "robot@softwareheritage.org", | |||||||
Done Inline ActionsThat's a judgment call, i'm not sure we want to add that here. Also i'd remove this and amend the chapter when legacy software will actually happen. ardumont: That's a judgment call, i'm not sure we want to add that here.
Also i'd remove this and amend… | |||||||
Done Inline ActionsAgreed. Just remove the above paragraph; the subsequent (factual) paragraph is all we need in a spec. zack: Agreed. Just remove the above paragraph; the subsequent (factual) paragraph is all we need in a… | |||||||
Done Inline ActionsOk. You should know it is really complicating things and I'm pulling my hair on the Legacy Software. moranegg: Ok. You should know it is really complicating things and I'm pulling my hair on the Legacy… | |||||||
"fullname": "Software Heritage", | |||||||
"id": 18233048, | |||||||
"name": "Software Heritage" | |||||||
}, | |||||||
"author_url": "/api/1/person/18233048/", | |||||||
"committer": { | |||||||
Done Inline ActionsI don't understand what "dates in a deposit" mean. Are these fields that are available via SWORD? if so, we should add a sentence specifying that this column should be interpreted in that context. zack: I don't understand what "dates in a deposit" mean. Are these fields that are available via… | |||||||
Done Inline Actionsdates in a deposit can be found via SWORD in the header or in the xml metadata. moranegg: dates in a deposit can be found via SWORD in the header or in the xml metadata.
for example the… | |||||||
"email": "robot@softwareheritage.org", | |||||||
"fullname": "Software Heritage", | |||||||
zack (Done): typo: "recieved"
moranegg (Done): ack.
"id": 18233048, | |||||||
"name": "Software Heritage" | |||||||
}, | |||||||
"committer_date": "2019-05-27T16:28:33+02:00", | |||||||
Done Inline ActionsThe distinction between the comitter_date and the date for a deposit should be specified and documented better. Here the date is committer_date populated from deposit with CodeMeta term publicationDate moranegg: The distinction between the comitter_date and the date for a deposit should be specified and… | |||||||
"committer_url": "/api/1/person/18233048/", | |||||||
"date": "2012-01-01T00:00:00+00:00", | |||||||
Done Inline ActionsHere the date is the date (aka. author_date) populated from a deposit with CodeMeta term creationDate. This is the date used by @grouss. moranegg: Here the date is the `date` (aka. `author_date`) populated from a deposit with CodeMeta term… | |||||||
"directory": "fb13b51abbcfd13de85d9ba8d070a23679576cd7", | |||||||
"directory_url": "/api/1/directory/fb13b51abbcfd13de85d9ba8d070a23679576cd7/", | |||||||
"history_url": "/api/1/revision/396b1ff29f7c75a0a3cc36f30e24ff7bae70bb52/log/", | |||||||
zack (Done): typo "targetting"
moranegg (Done): ack.
"id": "396b1ff29f7c75a0a3cc36f30e24ff7bae70bb52", | |||||||
"merge": false, | |||||||
"message": "hal: Deposit 282 in collection hal", | |||||||
"metadata": { | |||||||
"@xmlns": "http://www.w3.org/2005/Atom", | |||||||
"@xmlns:codemeta": "https://doi.org/10.5063/SCHEMA/CODEMETA-2.0", | |||||||
"author": { | |||||||
"email": "hal@ccsd.cnrs.fr", | |||||||
"name": "HAL" | |||||||
}, | |||||||
"client": "hal", | |||||||
"codemeta:applicationCategory": "info", | |||||||
"codemeta:author": { | |||||||
"codemeta:name": "Morane Gruenpeter" | |||||||
}, | |||||||
"codemeta:codeRepository": "www.code-repository.com", | |||||||
"codemeta:contributor": "Morane Gruenpeter", | |||||||
"codemeta:dateCreated": "2012", | |||||||
Done Inline Actionscommitter ardumont: `committer` | |||||||
"codemeta:datePublished": "2019-05-27T16:28:33+02:00", | |||||||
"codemeta:description": "description\\_en test v2", | |||||||
"codemeta:developmentStatus": "Inactif", | |||||||
"codemeta:keywords": "mot_cle_en,mot_cle_2_en,mot_cle_fr", | |||||||
"codemeta:license": [ | |||||||
{ | |||||||
"codemeta:name": "MIT License" | |||||||
}, | |||||||
{ | |||||||
"codemeta:name": "CeCILL Free Software License Agreement v1.1" | |||||||
} | |||||||
], | |||||||
"codemeta:name": "Test\\_20190527\\_01", | |||||||
"codemeta:operatingSystem": "OS", | |||||||
"codemeta:programmingLanguage": "Java", | |||||||
"codemeta:referencePublication": null, | |||||||
"codemeta:relatedLink": null, | |||||||
"codemeta:releaseNotes": "releaseNote", | |||||||
"codemeta:runtimePlatform": "outil", | |||||||
"codemeta:softwareVersion": "1.0.1", | |||||||
"codemeta:url": "https://hal.archives-ouvertes.fr/hal-02140606", | |||||||
"codemeta:version": "2", | |||||||
"external_identifier": "hal-02140606", | |||||||
"id": "hal-02140606", | |||||||
"original_artifact": [ | |||||||
{ | |||||||
"archive_type": "zip", | |||||||
"blake2s256": "96be3ddedfcee9669ad9c42b0bb3a706daf23824d04311c63505a4d8db02df00", | |||||||
"length": 193072, | |||||||
"name": "archive.zip", | |||||||
"sha1": "5b6ecc9d5bb113ff69fc275dcc9b0d993a8194f1", | |||||||
"sha1_git": "bd10e4d3ede17162692d7e211e08e87e67994488", | |||||||
"sha256": "3e2ce93384251ce6d6da7b8f2a061a8ebdaf8a28b8d8513223ca79ded8a10948" | |||||||
} | |||||||
] | |||||||
}, | |||||||
"parents": [ | |||||||
{ | |||||||
"id": "a9fdc3937d2b704b915852a64de2ab1b4b481003", | |||||||
"url": "/api/1/revision/a9fdc3937d2b704b915852a64de2ab1b4b481003/" | |||||||
} | |||||||
], | |||||||
"synthetic": true, | |||||||
"type": "tar", | |||||||
"url": "/api/1/revision/396b1ff29f7c75a0a3cc36f30e24ff7bae70bb52/" | |||||||
} | |||||||
Directory artifact
~~~~~~~~~~~~~~~~~~
The directory artifact is the actual content deposited. | |||||||
ardumontUnsubmitted Done Inline Actionsarchive(s)' raw content deposited. ardumont: `archive(s)' raw content` deposited. | |||||||
Done Inline Actions"Artifacts creation" should be enough here, SWH is kinda implicit anyway. zack: "Artifacts creation" should be enough here, SWH is kinda implicit anyway. | |||||||
Done Inline Actionssure. moranegg: sure. | |||||||
.. code-block:: json | |||||||
[ | |||||||
{ | |||||||
"dir_id": "fb13b51abbcfd13de85d9ba8d070a23679576cd7", | |||||||
"length": null, | |||||||
"name": "AffectationRO", | |||||||
"perms": 16384, | |||||||
"target": "fbc418f9ac2c39e8566b04da5dc24b14e65b23b1", | |||||||
"target_url": "/api/1/directory/fbc418f9ac2c39e8566b04da5dc24b14e65b23b1/", | |||||||
"type": "dir" | |||||||
} | |||||||
] | |||||||
Questions raised concerning loading | Questions raised concerning loading | ||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||||||
- A deposit has one origin, yet an origin can have multiple deposits? | - A deposit has one origin, yet an origin can have multiple deposits? | ||||||
No, an origin can have multiple requests for the same deposit. Which | No, an origin can have multiple requests for the same deposit. Which | ||||||
should end up in one single deposit (when the client pushes its final | should end up in one single deposit (when the client pushes its final | ||||||
:: | :: | ||||||
+ same origin | + same origin | ||||||
+ new revision | + new revision | ||||||
+ new directory | + new directory | ||||||
Technical details | |||||||
----------------- | |||||||
Requirements | |||||||
~~~~~~~~~~~~ | |||||||
* one dedicated database to store the deposit's state - swh-deposit | |||||||
* one dedicated temporary objstorage to store archives before loading | |||||||
* one client to test the communication with SWORD protocol | |||||||
Deposit reception schema | |||||||
~~~~~~~~~~~~~~~~~~~~~~~~ | |||||||
* SWORD imposes the use of basic authentication, so we need a way to
authenticate clients. Also, a client can access collections:
**deposit\_client** table:
* id (bigint): Client's identifier
* username (str): Client's username
* password (pass): Client's encrypted password
* collections ([id]): List of collections the client can access
* Collections group deposits together:
**deposit\_collection** table:
* id (bigint): Collection's identifier
* name (str): Collection's human-readable name
* A deposit is the main object the repository is all about: | |||||||
**deposit** table: | |||||||
* id (bigint): deposit's identifier | |||||||
* reception\_date (date): First deposit's reception date | |||||||
* complete\_date (date): Date when the deposit is deemed complete and ready | |||||||
for loading | |||||||
* collection (id): The collection the deposit belongs to | |||||||
* external\_id (text): client's internal identifier (e.g. HAL's id). | |||||||
* client\_id (id) : Client which did the deposit | |||||||
* swh\_id (str) : swh identifier result once the loading is complete | |||||||
* status (enum): The deposit's current status | |||||||
- As mentioned, a deposit can have a status, whose possible values are: | |||||||
.. code:: text | |||||||
'partial',   -- the deposit is new or partially received since it
             -- can be done in multiple requests
'expired',   -- deposit has been there too long and is now deemed
             -- ready to be garbage collected
'deposited', -- deposit complete, it is ready to be checked to ensure data consistency
'verified',  -- deposit is fully received, checked, and ready for loading
'loading',   -- loading is ongoing on swh's side
'done',      -- loading is successful
'failed'     -- loading is a failure
* A deposit is stateful and can be made in multiple requests: | |||||||
**deposit\_request** table: | |||||||
* id (bigint): identifier | |||||||
* type (id): deposit request's type (possible values: 'archive', 'metadata') | |||||||
* deposit\_id (id): deposit whose request belongs to | |||||||
* metadata: metadata associated to the request | |||||||
* date (date): date of the requests | |||||||
Information sent along with a request is stored in a ``deposit_request`` row.
A request is either of type ``metadata`` (atom entry, multipart's atom entry
part) or of type ``archive`` (binary upload, multipart's binary upload part).
When the deposit is complete (status ``deposited``), those ``metadata`` and | |||||||
``archive`` deposit requests will be read and aggregated. They will then be | |||||||
sent as parameters to the loading routine. | |||||||
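A minimal sketch of that aggregation step, assuming each request is a dict with a `type` key and either a `metadata` dict or an `archive` reference. The function name `aggregate_requests`, the `archive` key, and the merge rule (a later request overrides earlier values for the same key) are assumptions made for illustration.

```python
def aggregate_requests(requests: list) -> dict:
    """Merge metadata requests and collect archives (hypothetical helper)."""
    metadata: dict = {}
    archives: list = []
    for request in requests:
        if request["type"] == "metadata":
            # Assumption: a later request overrides earlier values per key.
            metadata.update(request["metadata"])
        elif request["type"] == "archive":
            archives.append(request["archive"])
        else:
            raise ValueError(f"unknown request type: {request['type']}")
    return {"metadata": metadata, "archives": archives}
```

The returned dict stands in for the parameters handed to the loading routine.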
During loading, some of those metadata are kept in the ``origin_metadata``
table and others are stored in the ``revision`` table (see `metadata
loading <#metadata-loading>`__).
The only updates occurring on the deposit table concern:

- the status, which changes as follows:

  - ``partial`` -> {``expired``/``deposited``}
  - ``deposited`` -> {``rejected``/``verified``}
  - ``verified`` -> ``loading``
  - ``loading`` -> {``done``/``failed``}

- ``complete_date``, set when the deposit is finalized (when the status is
  changed to ``deposited``)
- ``swh-id``, populated once we have the loading result
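The status transitions listed above can be sketched as a small lookup table. The names `ALLOWED_TRANSITIONS` and `next_status` are hypothetical; only the transitions themselves come from the spec.

```python
# Allowed status transitions, as listed in the spec.
ALLOWED_TRANSITIONS = {
    "partial": {"expired", "deposited"},
    "deposited": {"rejected", "verified"},
    "verified": {"loading"},
    "loading": {"done", "failed"},
}


def next_status(current: str, new: str) -> str:
    """Return `new` if the transition is allowed, else raise ValueError."""
    if new not in ALLOWED_TRANSITIONS.get(current, set()):
        raise ValueError(f"invalid transition: {current} -> {new}")
    return new
```

Statuses with no entry in the table (`expired`, `rejected`, `done`, `failed`) are treated as terminal in this sketch.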
SWH Identifier returned | |||||||
^^^^^^^^^^^^^^^^^^^^^^^ | |||||||
:: | |||||||
The synthetic revision id | |||||||
e.g.: swh:1:rev:47dc6b4636c7f6cba0df83e3d5490bf4334d987e | |||||||
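As an illustration, the identifier shown above can be built from a revision id by prefixing it with `swh:1:rev:`. The helper name `revision_pid` and the 40-character hexadecimal check are assumptions of this sketch.

```python
import re


def revision_pid(revision_id: str) -> str:
    """Format a revision id as swh:1:rev:<hex> (hypothetical helper)."""
    if not re.fullmatch(r"[0-9a-f]{40}", revision_id):
        raise ValueError("expected a 40-character lowercase hex id")
    return f"swh:1:rev:{revision_id}"


print(revision_pid("47dc6b4636c7f6cba0df83e3d5490bf4334d987e"))
# prints swh:1:rev:47dc6b4636c7f6cba0df83e3d5490bf4334d987e
```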
Scheduling loading | Scheduling loading | ||||||
~~~~~~~~~~~~~~~~~~ | ~~~~~~~~~~~~~~~~~~ | ||||||
All ``archive`` and ``metadata`` deposit requests should be aggregated before | All ``archive`` and ``metadata`` deposit requests should be aggregated before | ||||||
loading. | loading. | ||||||
The loading should be scheduled via the scheduler's api. | The loading should be scheduled via the scheduler's api. | ||||||
Only ``deposited`` deposits are concerned by the loading. | Only ``deposited`` deposits are concerned by the loading. | ||||||
When the loading is done and successful, the deposit entry is updated: - | When the loading is done and successful, the deposit entry is updated: | ||||||
``status`` is updated to ``done`` - ``swh-id`` is populated with the resulting | |||||||
hash (cf. `swh identifier <#swh-identifier-returned>`__) - ``complete_date`` is | |||||||
updated to the loading's finished time | |||||||
When the loading is failed, the deposit entry is updated: - ``status`` is | - ``status`` is updated to ``done`` | ||||||
updated to ``failed`` - ``swh-id`` and ``complete_date`` remain as is | - ``swh-id`` is populated with the resulting hash | ||||||
zack (Done): don't call this "hash", it's a "Software Heritage Persistent Identifier" (or SWH PID for short), and you can link that text to the PID doc instead of adding a cf. parenthesis
moranegg (Done): good call.
(cf. `swh identifier <#swh-identifier-returned>`__) | |||||||
- ``complete_date`` is updated to the loading's finished time | |||||||
When the loading has failed, the deposit entry is updated: | |||||||
- ``status`` is updated to ``failed`` | |||||||
- ``swh-id`` and ``complete_date`` remain as is | |||||||
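The two outcomes above can be sketched as a single update function. The deposit is modelled as a plain dict and the name `apply_loading_result` and its parameters are hypothetical; the real deposit entry lives in the deposit database.

```python
from typing import Optional


def apply_loading_result(deposit: dict, success: bool,
                         swh_id: Optional[str] = None,
                         finished_at: Optional[str] = None) -> dict:
    """Return a copy of the deposit entry updated with the loading outcome."""
    updated = dict(deposit)
    if success:
        updated["status"] = "done"
        updated["swh_id"] = swh_id
        updated["complete_date"] = finished_at
    else:
        # On failure only the status changes; the other fields stay as is.
        updated["status"] = "failed"
    return updated
```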
*Note:* As a further improvement, we may prefer having a retry policy with | *Note:* As a further improvement, we may prefer having a retry policy with | ||||||
graceful delays for further scheduling. | graceful delays for further scheduling. | ||||||
Metadata loading | Metadata loading | ||||||
~~~~~~~~~~~~~~~~ | ~~~~~~~~~~~~~~~~ | ||||||
- the metadata received with the deposit should be kept in the | - the metadata received with the deposit are also kept in the | ||||||
``origin_metadata`` table before translation as part of the loading process | ``origin_metadata`` table before translation as part of the loading process | ||||||
and an indexation process should be scheduled. | and an indexation process should be scheduled. | ||||||
- provider\_id and tool\_id are resolved by the prepare\_metadata method in the | - provider\_id and tool\_id are resolved by the prepare\_metadata method in the | ||||||
loader-core | loader-core | ||||||
- the origin\_metadata entry is sent to storage by the send\_origin\_metadata | - the origin\_metadata entry is sent to storage by the send\_origin\_metadata | ||||||
in the loader-core | in the loader-core | ||||||
"this part" ... of what?
Alternative beginning suggestion: "This specification describes the …"
Also, on first use I like to have Software Heritage in full, e.g., "Software Heritage (SWH) archive", so that the acronym is defined for later.