docs/specs/spec-loading.rst
Loading specification (draft) | Loading specification | ||||||
============================= | ===================== | ||||||
This part discusses the deposit loading part on the server side. | An important part of the deposit specifications is the loading procedure, whereby | ||||||
a deposit is ingested into the Software Heritage (SWH) archive, using | |||||||
the tarball loader and the complete schema of software artifacts creation | |||||||
in the archive. | |||||||
zack: "this part" ... of what?
Alternative beginning suggestion: "This specification describes the… | |||||||
Done Inline ActionsI agree. This page is a part of a collection of specifications, this is why it starts that way, but I prefer starting fresh as you suggest also with the full definition of the acronym. moranegg: I agree. This page is a part of a collection of specifications, this is why it starts that way… | |||||||
Tarball Loading | Tarball Loading | ||||||
--------------- | --------------- | ||||||
The ``swh-loader-tar`` module is already able to inject tarballs in swh | The ``swh-loader-tar`` module is already able to inject tarballs in swh | ||||||
with very limited metadata (mainly the origin). | with very limited metadata (mainly the origin). | ||||||
The loading of the deposit will use the deposit's associated data: | The loading of the deposit will use the deposit's associated data: | ||||||
* the metadata | * the metadata | ||||||
* the archive(s) | * the archive(s) | ||||||
We will use the ``synthetic`` revision notion. | |||||||
To that revision will be associated the metadata. Those will be included | Artifacts creation | ||||||
in the hash computation, thus resulting in a unique identifier. | ---------------------- | ||||||
Loading mapping | Deposit to artifacts mapping | ||||||
~~~~~~~~~~~~~~~ | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||||||
Some of those metadata will also be included in the ``origin_metadata`` | This is a global view of the deposit ingestion | ||||||
table. | |||||||
:: | +------------------------------------+-----------------------------------------+ | ||||||
| swh artifact | representation in deposit | | |||||||
+====================================+=========================================+ | |||||||
Done Inline ActionsIndentation of the url? ardumont: Indentation of the url? | |||||||
| origin | https://hal.inria.fr/hal-id | | |||||||
+------------------------------------+-----------------------------------------+ | |||||||
Done Inline ActionsI don't get what this means. "origin_visit" is an internal table name. You probably mean a specific field in that table? Either way, I suggest to use a descriptive name, with the SQL path as additional detail, e.g., "timestamp of the visit (origin_visit.field_name)". zack: I don't get what this means. "origin_visit" is an internal table name. You probably mean a… | |||||||
Done Inline ActionsI kept it from the original specs when there was no snapshots.
How does it sound? moranegg: I kept it from the original specs when there was no snapshots.
Maybe the best solution it to… | |||||||
| origin_metadata | aggregated metadata | | |||||||
+------------------------------------+-----------------------------------------+ | |||||||
| snapshot | reception of all occurrences (branches) | | |||||||
+------------------------------------+-----------------------------------------+ | |||||||
Done Inline Actionsoccur`r`ence ardumont: occur`r`ence | |||||||
Done Inline Actionsminor grammar issue: "at visit" ← not sure what this means zack: minor grammar issue: "at visit" ← not sure what this means | |||||||
Done Inline ActionsI agree. see comment above. moranegg: I agree. see comment above. | |||||||
| branches | master & | | |||||||
| | branch (optional): tag to release | | |||||||
+------------------------------------+-----------------------------------------+ | |||||||
Done Inline ActionsThose no longer exists in swh model, i think that should go away. ardumont: Those no longer exists in swh model, i think that should go away.
We are using snapshot now… | |||||||
Done Inline Actionsoccurrence isn't a branch in a snapshot? moranegg: `occurrence` isn't a branch in a snapshot?
and `occurrence history` is obsolete then? | |||||||
Done Inline ActionsYes, obsolete. origin -> origin-visit -> snapshot -> {revision, release} ardumont: Yes, obsolete.
They were replaced by snapshot.
origin -> origin-visit -> snapshot -> {revision… | |||||||
Not Done Inline ActionsAre the branches actually optional? Don't we always have at least one release here? Relatedly: this use case seems similar with package manager listers/loaders, we should compare with what they do and make sure we are consistent. In particular, I'm not sure we have a "master" branch there, most likely we have a "HEAD" branch, pointing to the most recent version at visit time + one branch for each release (the current one + all previous ones). zack: Are the branches actually optional? Don't we always have at least one release here?
Relatedly… | |||||||
Not Done Inline ActionsWe do not at the moment. Only master. With this specs we are introducing the concept of release to a deposit. @ardumont can you comment about package manager? moranegg: We do not at the moment. Only master.
With this specs we are introducing the concept of… | |||||||
Not Done Inline Actions
Good catch.
In the current state, we don't have release yet, only 1 revision.
We plan to refactor the deposit loader according to the package manager loader indeed. We can change it in the spec and in the implementation. ardumont: > Are the branches actually optional?
Good catch.
No, the snapshot branch is not optional. | |||||||
| release | (optional) synthetic release created | | |||||||
| | from metadata | | |||||||
+------------------------------------+-----------------------------------------+ | |||||||
| revision | synthetic revision pointing to | | |||||||
Done Inline Actionswhy do use _ here? ardumont: why do use `_` here? | |||||||
Done Inline Actionsthis is the way it was, don't mind deleting the _ moranegg: this is the way it was, don't mind deleting the `_` | |||||||
| | the expanded submitted tarball | | |||||||
+------------------------------------+-----------------------------------------+ | |||||||
| directory | root directory of the expanded submitted| | |||||||
Done Inline ActionsI see what you mean here, and I agree with the arrangement. But the description of these two can probably be simpler, how about:
zack: I see what you mean here, and I agree with the arrangement. But the description of these two… | |||||||
Done Inline Actionsack. moranegg: ack. | |||||||
| | tarball | | |||||||
+------------------------------------+-----------------------------------------+ | |||||||
Origin artifact | |||||||
~~~~~~~~~~~~~~~~ | |||||||
We create an origin using the url in the deposited metadata. | |||||||
Done Inline Actionsactive voice is generally preferable to passive voice (this is a general remark that applies to most of the paragraphs in the spec) zack: active voice is generally preferable to passive voice (this is a general remark that applies to… | |||||||
Done Inline ActionsCan you show me an example, I tried to avoid using We like "We create an origin from the url used in the metadata". moranegg: Can you show me an example, I tried to avoid using `We` like "We create an origin from the url… | |||||||
Not Done Inline ActionsI'm letting this one go for now. The style change will require quite some effort, and it's more important to land the spec than block on this. zack: I'm letting this one go for now. The style change will require quite some effort, and it's more… | |||||||
The current deposit and future deposits with the same url or external_id | |||||||
will be associated to this origin. | |||||||
.. code-block:: json | |||||||
origin: { | |||||||
"id": 89283768, | |||||||
"origin_visits_url": "/api/1/origin/89283768/visits/", | |||||||
"type": "deposit", | |||||||
"url": "https://hal.archives-ouvertes.fr/hal-02140606" | |||||||
} | |||||||
Visits | |||||||
~~~~~~~ | |||||||
Each deposit push with the same external_id is recorded as a visit. | |||||||
In the example below, two snapshots are identified by two different visits. | |||||||
.. code-block:: json | |||||||
visits: [ | |||||||
{ | |||||||
"date": "2019-06-03T09:28:10.223007+00:00", | |||||||
"origin": 89283768, | |||||||
"origin_visit_url": "/api/1/origin/89283768/visit/2/", | |||||||
"snapshot": "a3773941561cc557853898773a19c07cfe2efc5a", | |||||||
"snapshot_url": "/api/1/snapshot/a3773941561cc557853898773a19c07cfe2efc5a/", | |||||||
"status": "full", | |||||||
"type": "deposit", | |||||||
"visit": 2 | |||||||
}, | |||||||
{ | |||||||
"date": "2019-05-27T12:23:31.037273+00:00", | |||||||
"origin": 89283768, | |||||||
"origin_visit_url": "/api/1/origin/89283768/visit/1/", | |||||||
"snapshot": "43fdb8291f1bf6962211c370e394f6abb1cbe01d", | |||||||
"snapshot_url": "/api/1/snapshot/43fdb8291f1bf6962211c370e394f6abb1cbe01d/", | |||||||
"status": "full", | |||||||
"type": "deposit", | |||||||
"visit": 1 | |||||||
} | |||||||
] | |||||||
Snapshot artifact | |||||||
~~~~~~~~~~~~~~~~~ | |||||||
The snapshot represents one deposit push. The master branch points to a | |||||||
synthetic revision. We will create a second branch pointing to a release | |||||||
artifact, if the metadata indicate that the deposit is a release with a `releaseNotes` term. | |||||||
.. code-block:: json | |||||||
snapshot: { | |||||||
"branches": { | |||||||
"master": { | |||||||
"target": "396b1ff29f7c75a0a3cc36f30e24ff7bae70bb52", | |||||||
"target_type": "revision", | |||||||
"target_url": "/api/1/revision/396b1ff29f7c75a0a3cc36f30e24ff7bae70bb52/" | |||||||
} | |||||||
Done Inline ActionsIt's not exactly clear to me. Is this for information purposes? ardumont: It's not exactly clear to me.
What part generates those nice json output?
Is this for… | |||||||
Done Inline ActionsThis is the api output, I imagine you know. or maybe when adding the artifact name it is clearer: snapshot: { "branches": { "master": { "target": "396b1ff29f7c75a0a3cc36f30e24ff7bae70bb52", "target_type": "revision", "target_url": "/api/1/revision/396b1ff29f7c75a0a3cc36f30e24ff7bae70bb52/" } }, "id": "a3773941561cc557853898773a19c07cfe2efc5a", "next_branch": null } I've use this schema on all api artifacts, can you say if this works? moranegg: This is the api output, I imagine you know.
Which is the json representation of the artifact.
I… | |||||||
Done Inline Actions
Yes, i just realized that's the main api indeed.
The description works. ardumont: > This is the api output, I imagine you know.
Yes, i just realized that's the main api indeed. | |||||||
"refs/tags/v1.1": { | |||||||
"target": "a9f3396f372ed4a51d75e15ca16c1c2df1fc5c97", | |||||||
"target_type": "release", | |||||||
"target_url": "/api/1/release/a9f3396f372ed4a51d75e15ca16c1c2df1fc5c97/" | |||||||
Done Inline ActionsThe archive is deposited with a set of descriptive metadata, in the CodeMeta vocabulary. moranegg: The archive is deposited with a set of descriptive metadata, in the CodeMeta vocabulary.
The… | |||||||
} | |||||||
}, | |||||||
"id": "a3773941561cc557853898773a19c07cfe2efc5a", | |||||||
"next_branch": null | |||||||
} | |||||||
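The branch layout described above can be sketched as follows. This is a minimal illustration, not the loader's code: the function name ``build_branches`` and the ``refs/tags/v<version>`` naming scheme are assumptions modelled on the JSON example, and the identifiers are passed in rather than computed.

```python
from typing import Optional


def build_branches(revision_id: str,
                   release_id: Optional[str] = None,
                   software_version: Optional[str] = None) -> dict:
    """Return the branches of a deposit snapshot (hypothetical helper).

    The master branch always points to the synthetic revision. A tag
    branch is added only when a release artifact exists for the deposit.
    """
    branches = {
        "master": {"target": revision_id, "target_type": "revision"},
    }
    if release_id is not None and software_version:
        # Assumed naming scheme, copied from the "refs/tags/v1.1" example.
        branches[f"refs/tags/v{software_version}"] = {
            "target": release_id,
            "target_type": "release",
        }
    return branches
```

A deposit without release metadata would then yield a snapshot with the ``master`` branch only.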
Release artifact | |||||||
~~~~~~~~~~~~~~~~ | |||||||
The content is deposited with a set of descriptive metadata in the CodeMeta | |||||||
vocabulary. The following CodeMeta terms imply that the | |||||||
artifact is a release: | |||||||
- `releaseNotes` | |||||||
- `softwareVersion` | |||||||
If present, a release artifact will be created with the mapping below: | |||||||
+-------------------+-----------------------------------+-----------------+----------------+ | |||||||
| SWH release field | Description | CodeMeta term | Fallback value | | |||||||
+===================+===================================+=================+================+ | |||||||
| target | revision containing all metadata | X |X | | |||||||
Done Inline ActionsMaybe modify the column metadata term's title to metadata term (source). ardumont: Maybe modify the column metadata term's title to `metadata term (source)`.
That explicits the… | |||||||
Done Inline Actionshow about, specifying above from where we get the metadata and here change to CodeMeta term moranegg: how about, specifying above from where we get the metadata and here change to `CodeMeta term` | |||||||
Done Inline Actionsworks for me ;) Note that all tables are a bit misaligned. ardumont: works for me ;)
Note that all tables are a bit misaligned.
To align it properly, i'm letting… | |||||||
Done Inline Actionsit's very strange, I'm using Atom and not Emacs and I've openned the file in Emacs and it is misaligned. moranegg: it's very strange, I'm using `Atom` and not `Emacs` and I've openned the file in `Emacs` and it… | |||||||
+-------------------+-----------------------------------+-----------------+----------------+ | |||||||
| target_type | revision | X |X | | |||||||
+-------------------+-----------------------------------+-----------------+----------------+ | |||||||
| name | release or tag name (mandatory) | softwareVersion | X | | |||||||
+-------------------+-----------------------------------+-----------------+----------------+ | |||||||
| message | message associated with release | releaseNotes | X | | |||||||
+-------------------+-----------------------------------+-----------------+----------------+ | |||||||
| date | release date = publication date | datePublished | deposit_date | | |||||||
+-------------------+-----------------------------------+-----------------+----------------+ | |||||||
| author | deposit client | author | client | | |||||||
+-------------------+-----------------------------------+-----------------+----------------+ | |||||||
.. code-block:: json | |||||||
release: { | |||||||
Done Inline ActionsMaybe take a real id (that's a hash as well). ardumont: Maybe take a real id (that's a hash as well).
Because otherwise, it's not consistent with other… | |||||||
Done Inline ActionsSeconded. zack: Seconded. | |||||||
Done Inline Actionsack. moranegg: ack. | |||||||
"author": { | |||||||
"email": "hal@ccsd.cnrs.fr", | |||||||
"fullname": "HAL <phal@ccsd.cnrs.fr>", | |||||||
"id": x, | |||||||
"name": "HAL" | |||||||
}, | |||||||
"author_url": "/api/1/person/x/", | |||||||
"date": "2019-05-27T16:28:33+02:00", | |||||||
"id": "a9f3396f372ed4a51d75e15ca16c1c2df1fc5c97", | |||||||
"message": "AffectationRO Version 1.1 - added new feature\n", | |||||||
"name": "1.1", | |||||||
Done Inline ActionsThe sentence sounds a bit weird to my ears. I think there is no need to mention swh-loader-tar again. Also, that will change soon (thus this would need update everywhere it's mentioned, better keep it only once ;). ardumont: The sentence sounds a bit weird to my ears.
Maybe change this to: "the deposit `revision` is… | |||||||
Done Inline Actionsgood thinking! and you are right... it's not very readable.
moranegg: good thinking! and you are right... it's not very readable.
Going for:
> the deposit revision… | |||||||
"synthetic": true, | |||||||
"target": "396b1ff29f7c75a0a3cc36f30e24ff7bae70bb52", | |||||||
Done Inline ActionsI tend to try and shorten sentences where i can. The deposit revision is synthetically created. And then letting the readers understand the synthetic: True on their own. ardumont: I tend to try and shorten sentences where i can.
I'd go for a simpler:
```
The deposit… | |||||||
Done Inline ActionsJust remove this sentence. We have already said it's going to be a synthetic revision above, no need to further expand on the matter, IMO. zack: Just remove this sentence. We have already said it's going to be a synthetic revision above, no… | |||||||
Done Inline Actionssure. moranegg: sure. | |||||||
"target_type": "revision", | |||||||
"target_url": "/api/1/revision/396b1ff29f7c75a0a3cc36f30e24ff7bae70bb52/" | |||||||
} | |||||||
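The release mapping table can be read as a small function from deposit metadata to release fields. The sketch below is only an illustration under stated assumptions: the function name ``release_from_metadata`` is hypothetical, metadata keys follow the ``codemeta:`` prefix seen in the revision example, and a release is assumed to be created only when both ``softwareVersion`` and ``releaseNotes`` are present (the spec does not say whether one term alone is enough).

```python
from typing import Optional


def release_from_metadata(metadata: dict, revision_id: str,
                          deposit_date: str, client: str) -> Optional[dict]:
    """Map CodeMeta terms to release fields (hypothetical helper).

    Returns None when the metadata does not describe a release.
    """
    version = metadata.get("codemeta:softwareVersion")
    notes = metadata.get("codemeta:releaseNotes")
    if not version or not notes:
        # Assumption: both terms are needed, since neither has a fallback.
        return None
    return {
        "name": version,
        "message": notes,
        # Fallback values taken from the last column of the table.
        "date": metadata.get("codemeta:datePublished") or deposit_date,
        "author": metadata.get("author") or client,
        "target": revision_id,
        "target_type": "revision",
        "synthetic": True,
    }
```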
Revision artifact | |||||||
~~~~~~~~~~~~~~~~~ | |||||||
The metadata sent with the deposit is included in the revision which affects | |||||||
the hash computation, thus resulting in a unique identifier. | |||||||
Thus, depositing the same content with different metadata will result | |||||||
in two different revisions in the SWH archive. | |||||||
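The effect of including metadata in the hash can be illustrated with a toy identifier. The sketch below is deliberately NOT the Software Heritage revision hashing scheme: ``toy_revision_id`` is a hypothetical function that hashes a JSON serialization, used only to show that the same directory with different metadata yields different identifiers.

```python
import hashlib
import json


def toy_revision_id(directory_id: str, metadata: dict) -> str:
    """Toy identifier: SHA-1 over the directory id and the metadata.

    Illustration only; the real revision identifier is computed
    differently by the SWH data model.
    """
    payload = json.dumps(
        {"directory": directory_id, "metadata": metadata},
        sort_keys=True,  # stable key order, so equal inputs hash equally
    )
    return hashlib.sha1(payload.encode("utf-8")).hexdigest()


same_dir = "fb13b51abbcfd13de85d9ba8d070a23679576cd7"
first = toy_revision_id(same_dir, {"codemeta:softwareVersion": "1.0"})
second = toy_revision_id(same_dir, {"codemeta:softwareVersion": "1.1"})
```

Here ``first`` and ``second`` differ although the directory is identical, which mirrors the behaviour described for deposit revisions.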
Done Inline ActionsThat's a judgment call, i'm not sure we want to add that here. Also i'd remove this and amend the chapter when legacy software will actually happen. ardumont: That's a judgment call, i'm not sure we want to add that here.
Also i'd remove this and amend… | |||||||
Done Inline ActionsAgreed. Just remove the above paragraph; the subsequent (factual) paragraph is all we need in a spec. zack: Agreed. Just remove the above paragraph; the subsequent (factual) paragraph is all we need in a… | |||||||
Done Inline ActionsOk. You should know it is really complicating things and I'm pulling my hair on the Legacy Software. moranegg: Ok. You should know it is really complicating things and I'm pulling my hair on the Legacy… | |||||||
The date mapping | |||||||
^^^^^^^^^^^^^^^^ | |||||||
A deposit may contain 4 different dates concerning the software artifacts. | |||||||
The deposit's revision will reflect the most accurate point in time available. | |||||||
Done Inline ActionsI don't understand what "dates in a deposit" mean. Are these fields that are available via SWORD? if so, we should add a sentence specifying that this column should be interpreted in that context. zack: I don't understand what "dates in a deposit" mean. Are these fields that are available via… | |||||||
Done Inline Actionsdates in a deposit can be found via SWORD in the header or in the xml metadata. moranegg: dates in a deposit can be found via SWORD in the header or in the xml metadata.
for example the… | |||||||
Here are all dates that can be available in a deposit: | |||||||
Done Inline Actionstypo: "recieved" zack: typo: "recieved" | |||||||
Done Inline Actionsack. moranegg: ack. | |||||||
+-------------------+-----------------------------------+-----------------------------------------------+ | |||||||
| dates | location | Description | | |||||||
+===================+===================================+===============================================+ | |||||||
| reception_date | On SWORD reception (automatic) |the deposit was received at this ts | | |||||||
Done Inline ActionsThe distinction between the comitter_date and the date for a deposit should be specified and documented better. Here the date is committer_date populated from deposit with CodeMeta term publicationDate moranegg: The distinction between the comitter_date and the date for a deposit should be specified and… | |||||||
+-------------------+-----------------------------------+-----------------------------------------------+ | |||||||
| complete_date | On SWH ingestion (automatic) |the ingestion was completed by SWH at this ts | | |||||||
Done Inline ActionsHere the date is the date (aka. author_date) populated from a deposit with CodeMeta term creationDate. This is the date used by @grouss. moranegg: Here the date is the `date` (aka. `author_date`) populated from a deposit with CodeMeta term… | |||||||
+-------------------+-----------------------------------+-----------------------------------------------+ | |||||||
| dateCreated | metadata in codeMeta (optional) |the software artifact was created at this ts | | |||||||
+-------------------+-----------------------------------+-----------------------------------------------+ | |||||||
Done Inline Actionstypo "targetting" zack: typo "targetting" | |||||||
Done Inline Actionsack. moranegg: ack. | |||||||
| datePublished | metadata in codeMeta (optional) |the software was published (contributed in HAL)| | |||||||
+-------------------+-----------------------------------+-----------------------------------------------+ | |||||||
A visit targeting a snapshot contains one date: | |||||||
+-------------------+----------------------------------------------+----------------+ | |||||||
| SWH visit field | Description | value | | |||||||
+===================+==============================================+================+ | |||||||
| date | the origin pushed the deposit at this date | reception_date | | |||||||
+-------------------+----------------------------------------------+----------------+ | |||||||
A revision contains two dates: | |||||||
+-------------------+-----------------------------------------+----------------+----------------+ | |||||||
| SWH revision field| Description | CodeMeta term | Fallback value | | |||||||
+===================+=========================================+================+================+ | |||||||
| date | date of software artifact modification | dateCreated | reception_date | | |||||||
+-------------------+-----------------------------------------+----------------+----------------+ | |||||||
| committer_date | date of the commit in VCS | datePublished | reception_date | | |||||||
ardumontUnsubmitted Done Inline Actionscommitter ardumont: `committer` | |||||||
+-------------------+-----------------------------------------+----------------+----------------+ | |||||||
A release contains one date: | |||||||
+-------------------+----------------------------------+---------------+----------------+ | |||||||
| SWH release field |Description |CodeMeta term | Fallback value | | |||||||
+===================+==================================+===============+================+ | |||||||
| date |release date = publication date |datePublished |reception_date | | |||||||
+-------------------+----------------------------------+---------------+----------------+ | |||||||
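Taken together, the three date tables above amount to a lookup with fallbacks. The following sketch shows one way to express it, assuming metadata keys carry the ``codemeta:`` prefix; ``resolve_dates`` is a hypothetical name, and the sketch passes date strings through unchanged (it does not normalise a partial date such as ``"2012"`` into a full timestamp).

```python
def resolve_dates(metadata: dict, reception_date: str) -> dict:
    """Pick the date of each artifact, falling back to reception_date."""
    created = metadata.get("codemeta:dateCreated") or reception_date
    published = metadata.get("codemeta:datePublished") or reception_date
    return {
        "visit": {"date": reception_date},
        "revision": {"date": created, "committer_date": published},
        "release": {"date": published},
    }
```

With no CodeMeta dates in the deposit, every artifact date falls back to the reception date.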
.. code-block:: json | |||||||
revision: { | |||||||
"author": { | |||||||
"email": "robot@softwareheritage.org", | |||||||
"fullname": "Software Heritage", | |||||||
"id": 18233048, | |||||||
"name": "Software Heritage" | |||||||
}, | |||||||
"author_url": "/api/1/person/18233048/", | |||||||
"committer": { | |||||||
"email": "robot@softwareheritage.org", | |||||||
"fullname": "Software Heritage", | |||||||
"id": 18233048, | |||||||
"name": "Software Heritage" | |||||||
}, | |||||||
"committer_date": "2019-05-27T16:28:33+02:00", | |||||||
"committer_url": "/api/1/person/18233048/", | |||||||
"date": "2012-01-01T00:00:00+00:00", | |||||||
"directory": "fb13b51abbcfd13de85d9ba8d070a23679576cd7", | |||||||
"directory_url": "/api/1/directory/fb13b51abbcfd13de85d9ba8d070a23679576cd7/", | |||||||
"history_url": "/api/1/revision/396b1ff29f7c75a0a3cc36f30e24ff7bae70bb52/log/", | |||||||
"id": "396b1ff29f7c75a0a3cc36f30e24ff7bae70bb52", | |||||||
"merge": false, | |||||||
"message": "hal: Deposit 282 in collection hal", | |||||||
"metadata": { | |||||||
"@xmlns": "http://www.w3.org/2005/Atom", | |||||||
"@xmlns:codemeta": "https://doi.org/10.5063/SCHEMA/CODEMETA-2.0", | |||||||
"author": { | |||||||
"email": "hal@ccsd.cnrs.fr", | |||||||
"name": "HAL" | |||||||
}, | |||||||
"client": "hal", | |||||||
"codemeta:applicationCategory": "info", | |||||||
"codemeta:author": { | |||||||
"codemeta:name": "Morane Gruenpeter" | |||||||
Done Inline Actionsarchive(s)' raw content deposited. ardumont: `archive(s)' raw content` deposited. | |||||||
}, | |||||||
"codemeta:codeRepository": "www.code-repository.com", | |||||||
"codemeta:contributor": "Morane Gruenpeter", | |||||||
"codemeta:dateCreated": "2012", | |||||||
"codemeta:datePublished": "2019-05-27T16:28:33+02:00", | |||||||
"codemeta:description": "description\\_en test v2", | |||||||
"codemeta:developmentStatus": "Inactif", | |||||||
"codemeta:keywords": "mot_cle_en,mot_cle_2_en,mot_cle_fr", | |||||||
"codemeta:license": [ | |||||||
{ | |||||||
"codemeta:name": "MIT License" | |||||||
}, | |||||||
{ | |||||||
"codemeta:name": "CeCILL Free Software License Agreement v1.1" | |||||||
} | |||||||
], | |||||||
"codemeta:name": "Test\\_20190527\\_01", | |||||||
"codemeta:operatingSystem": "OS", | |||||||
"codemeta:programmingLanguage": "Java", | |||||||
"codemeta:referencePublication": null, | |||||||
"codemeta:relatedLink": null, | |||||||
"codemeta:releaseNotes": "releaseNote", | |||||||
"codemeta:runtimePlatform": "outil", | |||||||
"codemeta:softwareVersion": "1.0.1", | |||||||
"codemeta:url": "https://hal.archives-ouvertes.fr/hal-02140606", | |||||||
"codemeta:version": "2", | |||||||
"external_identifier": "hal-02140606", | |||||||
"id": "hal-02140606", | |||||||
"original_artifact": [ | |||||||
{ | |||||||
"archive_type": "zip", | |||||||
"blake2s256": "96be3ddedfcee9669ad9c42b0bb3a706daf23824d04311c63505a4d8db02df00", | |||||||
"length": 193072, | |||||||
"name": "archive.zip", | |||||||
"sha1": "5b6ecc9d5bb113ff69fc275dcc9b0d993a8194f1", | |||||||
"sha1_git": "bd10e4d3ede17162692d7e211e08e87e67994488", | |||||||
"sha256": "3e2ce93384251ce6d6da7b8f2a061a8ebdaf8a28b8d8513223ca79ded8a10948" | |||||||
} | |||||||
] | |||||||
}, | |||||||
"parents": [ | |||||||
{ | |||||||
"id": "a9fdc3937d2b704b915852a64de2ab1b4b481003", | |||||||
"url": "/api/1/revision/a9fdc3937d2b704b915852a64de2ab1b4b481003/" | |||||||
} | |||||||
], | |||||||
"synthetic": true, | |||||||
"type": "tar", | |||||||
"url": "/api/1/revision/396b1ff29f7c75a0a3cc36f30e24ff7bae70bb52/" | |||||||
} | |||||||
Directory artifact | |||||||
~~~~~~~~~~~~~~~~~~ | |||||||
The directory artifact is the archive(s)' raw content deposited. | |||||||
Done Inline Actions"Artifacts creation" should be enough here, SWH is kinda implicit anyway. zack: "Artifacts creation" should be enough here, SWH is kinda implicit anyway. | |||||||
Done Inline Actionssure. moranegg: sure. | |||||||
.. code-block:: json | |||||||
directory: [ | |||||||
{ | |||||||
"dir_id": "fb13b51abbcfd13de85d9ba8d070a23679576cd7", | |||||||
"length": null, | |||||||
"name": "AffectationRO", | |||||||
"perms": 16384, | |||||||
"target": "fbc418f9ac2c39e8566b04da5dc24b14e65b23b1", | |||||||
"target_url": "/api/1/directory/fbc418f9ac2c39e8566b04da5dc24b14e65b23b1/", | |||||||
"type": "dir" | |||||||
} | |||||||
] | |||||||
origin | https://hal.inria.fr/hal-id | | |||||||
------------------------------------|----------------------------------------| | |||||||
origin_visit | 1 :reception_date | | |||||||
origin_metadata | aggregated metadata | | |||||||
occurrence & occurrence_history | branch: client's version n° (e.g hal) | | |||||||
revision | synthetic_revision (tarball) | | |||||||
directory | upper level of the uncompressed archive| | |||||||
Questions raised concerning loading | Questions raised concerning loading | ||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||||||
- A deposit has one origin, yet an origin can have multiple deposits? | - A deposit has one origin, yet an origin can have multiple deposits? | ||||||
No, an origin can have multiple requests for the same deposit. Which | No, an origin can have multiple requests for the same deposit. Which | ||||||
should end up in one single deposit (when the client pushes its final | should end up in one single deposit (when the client pushes its final | ||||||
Show All 35 Lines | |||||||
:: | :: | ||||||
+ same origin | + same origin | ||||||
+ new revision | + new revision | ||||||
+ new directory | + new directory | ||||||
Technical details | |||||||
----------------- | |||||||
Requirements | |||||||
~~~~~~~~~~~~ | |||||||
* one dedicated database to store the deposit's state - swh-deposit | |||||||
* one dedicated temporary objstorage to store archives before loading | |||||||
* one client to test the communication with SWORD protocol | |||||||
Deposit reception schema | |||||||
~~~~~~~~~~~~~~~~~~~~~~~~ | |||||||
* SWORD imposes the use of basic authentication, so we need a way to | |||||||
authenticate clients. Also, a client can access collections: | |||||||
**deposit\_client** table: | |||||||
* id (bigint): Client's identifier | |||||||
* username (str): Client's username | |||||||
* password (pass): Client's crypted password | |||||||
* collections ([id]): List of collections the client can access | |||||||
* Collections group deposits together: | |||||||
**deposit\_collection** table: | |||||||
* id (bigint): Collection's identifier | |||||||
* name (str): Collection's human readable name | |||||||
* A deposit is the main object the repository is all about: | |||||||
**deposit** table: | |||||||
* id (bigint): deposit's identifier | |||||||
* reception\_date (date): First deposit's reception date | |||||||
* complete\_date (date): Date when the deposit is deemed complete and ready | |||||||
for loading | |||||||
* collection (id): The collection the deposit belongs to | |||||||
* external id (text): client's internal identifier (e.g hal's id, etc...). | |||||||
* client\_id (id) : Client which did the deposit | |||||||
* swh\_id (str) : swh identifier result once the loading is complete | |||||||
* status (enum): The deposit's current status | |||||||
- As mentioned, a deposit can have a status, whose possible values are: | |||||||
.. code:: text | |||||||
'partial', -- the deposit is new or partially received since it | |||||||
-- can be done in multiple requests | |||||||
'expired', -- deposit has been there too long and is now deemed | |||||||
-- ready to be garbage collected | |||||||
'deposited', -- deposit complete, it is ready to be checked to ensure data consistency | |||||||
'verified', -- deposit is fully received, checked, and ready for loading | |||||||
'loading', -- loading is ongoing on swh's side | |||||||
'done', -- loading is successful | |||||||
'failed' -- loading is a failure | |||||||
* A deposit is stateful and can be made in multiple requests: | |||||||
**deposit\_request** table: | |||||||
* id (bigint): identifier | |||||||
* type (id): deposit request's type (possible values: 'archive', 'metadata') | |||||||
* deposit\_id (id): deposit the request belongs to | |||||||
* metadata: metadata associated to the request | |||||||
* date (date): date of the request | |||||||
Information sent along with a request is stored in a ``deposit_request`` row. | |||||||
A request is either of type ``metadata`` (atom entry, multipart's atom entry | |||||||
part) or of type ``archive`` (binary upload, multipart's binary upload part). | |||||||
When the deposit is complete (status ``deposited``), those ``metadata`` and | |||||||
``archive`` deposit requests will be read and aggregated. They will then be | |||||||
sent as parameters to the loading routine. | |||||||
During loading, some of those metadata are kept in the ``origin_metadata`` | |||||||
table and others are stored in the ``revision`` table (see `metadata | |||||||
loading <#metadata-loading>`__). | |||||||
The only update actions occurring on the deposit table concern: | |||||||
- status changes: | |||||||
  - ``partial`` -> {``expired``/``deposited``} | |||||||
  - ``deposited`` -> {``rejected``/``verified``} | |||||||
  - ``verified`` -> ``loading`` | |||||||
  - ``loading`` -> {``done``/``failed``} | |||||||
- ``complete_date``, set when the deposit is finalized (when the status is changed to ``deposited``) | |||||||
- ``swh-id``, populated once we have the loading result | |||||||
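The allowed status changes listed above form a small state machine. A minimal sketch, assuming the transitions are exactly those listed (including ``rejected``, which appears in the transitions but not in the status enumeration); ``ALLOWED_TRANSITIONS`` and ``next_status`` are hypothetical names, not part of swh-deposit:

```python
# Transitions copied from the list of status changes in this section.
ALLOWED_TRANSITIONS = {
    "partial": {"expired", "deposited"},
    "deposited": {"rejected", "verified"},
    "verified": {"loading"},
    "loading": {"done", "failed"},
}


def next_status(current: str, new: str) -> str:
    """Return the new status, or raise if the transition is not allowed."""
    if new not in ALLOWED_TRANSITIONS.get(current, set()):
        raise ValueError(f"invalid status transition: {current} -> {new}")
    return new
```

Terminal statuses such as ``done`` or ``failed`` have no entry in the mapping, so any further change from them is rejected.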
SWH Identifier returned | |||||||
^^^^^^^^^^^^^^^^^^^^^^^ | |||||||
:: | |||||||
The synthetic revision id | |||||||
e.g.: swh:1:rev:47dc6b4636c7f6cba0df83e3d5490bf4334d987e | |||||||
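For illustration, the identifier in the example above can be built from the hexadecimal revision id as sketched below. The helper name ``revision_swh_id`` is hypothetical, and the check only covers the shape visible in the example (``swh:1:rev:`` followed by 40 lowercase hexadecimal characters).

```python
import re

# Shape of the example identifier: scheme, version 1, object type "rev".
_REV_ID = re.compile(r"swh:1:rev:[0-9a-f]{40}")


def revision_swh_id(revision_hex: str) -> str:
    """Format a revision id as an SWH identifier, validating its shape."""
    swh_id = f"swh:1:rev:{revision_hex}"
    if _REV_ID.fullmatch(swh_id) is None:
        raise ValueError(f"not a valid revision id: {revision_hex!r}")
    return swh_id
```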
Scheduling loading | Scheduling loading | ||||||
~~~~~~~~~~~~~~~~~~ | ~~~~~~~~~~~~~~~~~~ | ||||||
All ``archive`` and ``metadata`` deposit requests should be aggregated before | All ``archive`` and ``metadata`` deposit requests should be aggregated before | ||||||
loading. | loading. | ||||||
The loading should be scheduled via the scheduler's api. | The loading should be scheduled via the scheduler's api. | ||||||
Only ``deposited`` deposits are concerned by the loading. | Only ``deposited`` deposits are concerned by the loading. | ||||||
When the loading is done and successful, the deposit entry is updated: - | When the loading is done and successful, the deposit entry is updated: | ||||||
``status`` is updated to ``done`` - ``swh-id`` is populated with the resulting | |||||||
hash (cf. `swh identifier <#swh-identifier-returned>`__) - ``complete_date`` is | |||||||
updated to the loading's finished time | |||||||
When the loading is failed, the deposit entry is updated: - ``status`` is | - ``status`` is updated to ``done`` | ||||||
updated to ``failed`` - ``swh-id`` and ``complete_data`` remains as is | - ``swh-id`` is populated with the resulting `SWH persistent identifier <https://docs.softwareheritage.org/devel/swh-model/persistent-identifiers.html>`_ | ||||||
Done Inline Actionsdon't call this "hash", it's a "Software Heritage Peristent Identifier" (or SWH PID for short), and you can link that text to the PID doc instead of adding a cf. parenthesis zack: don't call this "hash", it's a "Software Heritage Peristent Identifier" (or SWH PID for short)… | |||||||
Done Inline Actionsgood call. moranegg: good call. | |||||||
- ``complete_date`` is updated to the loading's finished time | |||||||
When the loading has failed, the deposit entry is updated: | |||||||
- ``status`` is updated to ``failed`` | |||||||
- ``swh-id`` and ``complete_date`` remain as is | |||||||
*Note:* As a further improvement, we may prefer having a retry policy with | *Note:* As a further improvement, we may prefer having a retry policy with | ||||||
graceful delays for further scheduling. | graceful delays for further scheduling. | ||||||
Metadata loading | Metadata loading | ||||||
~~~~~~~~~~~~~~~~ | ~~~~~~~~~~~~~~~~ | ||||||
- the metadata received with the deposit should be kept in the | - the metadata received with the deposit are kept in the ``metadata`` field | ||||||
``origin_metadata`` table before translation as part of the loading process | of the revision and in the ``origin_metadata`` table to facilitate search | ||||||
and an indexation process should be scheduled. | over origin metadata. | ||||||
- provider\_id and tool\_id are resolved by the prepare\_metadata method in the | - provider\_id and tool\_id are resolved by the prepare\_metadata method in the | ||||||
loader-core | loader-core | ||||||
- the origin\_metadata entry is sent to storage by the send\_origin\_metadata | - the origin\_metadata entry is sent to storage by the send\_origin\_metadata | ||||||
in the loader-core | in the loader-core | ||||||
origin\_metadata table: | origin\_metadata table: | ||||||
Show All 9 Lines |
"this part" ... of what?
Alternative beginning suggestion: "This specification describes the …"
Also, on first use I like to have Software Heritage in full, e.g., "Software Heritage (SWH) archive", so that the acronym is defined for later.