Put information (client, collection and deposit-id) inside metadata for metadata-only deposit
Closed, Migrated (edits locked)

Authored By

	moranegg
	Nov 16 2020, 12:36 PM

Description

With the complete deposit, a revision is created with a commit message including the client, deposit number and collection.
These items will be lost with a metadata-only deposit in the ERMDS, since the revision or other elements aren't created in the archive.

To solve this discrepancy, the deposit message should be added in the XML.
Here is a proposal for what to add inside <swh:deposit>:

<swh:receipt>
   <swh:client>HAL</swh:client>
   <swh:collection>HAL</swh:collection>
   <swh:number>160</swh:number>
   <swh:date>reception date</swh:date>
</swh:receipt>

The reception date is equivalent to what is currently the commit date.
This information might be redundant with a property already used for ERMDS entries (if so, it can be dropped from the proposal).
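
For illustration, here is a minimal sketch of how a consumer could read the proposed receipt back out of a deposited document. It uses only Python's standard library. The namespace URI bound to the swh prefix, the enclosing Atom entry, the ISO date value and the helper name read_receipt are assumptions made for this example; none of them is specified in this task.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URIs for the example; the task itself does not state them.
NS = {
    "atom": "http://www.w3.org/2005/Atom",
    "swh": "https://www.softwareheritage.org/schema/2018/deposit",
}

# Illustrative document; the date value is made up for the example.
SAMPLE = """\
<entry xmlns="http://www.w3.org/2005/Atom"
       xmlns:swh="https://www.softwareheritage.org/schema/2018/deposit">
  <swh:deposit>
    <swh:receipt>
      <swh:client>HAL</swh:client>
      <swh:collection>HAL</swh:collection>
      <swh:number>160</swh:number>
      <swh:date>2020-11-16T12:36:00+00:00</swh:date>
    </swh:receipt>
  </swh:deposit>
</entry>
"""


def read_receipt(xml_text):
    """Return the proposed receipt fields as a dict, or None if absent."""
    root = ET.fromstring(xml_text)
    receipt = root.find("swh:deposit/swh:receipt", NS)
    if receipt is None:
        return None
    fields = ("client", "collection", "number", "date")
    # Values are returned as strings, exactly as written in the document.
    return {
        name: receipt.findtext(f"swh:{name}", namespaces=NS)
        for name in fields
    }


if __name__ == "__main__":
    print(read_receipt(SAMPLE))
```

A document without the receipt element simply yields None, which matches the idea that the block is additional information rather than something every deposit must carry.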

Revisions and Commits

rDDEP Push deposit
	D5239	rDDEP3a9b2fc4baa4 Add deposit info to objects added to swh-storage from metadata-only deposits

Related Objects

Status	Assigned	Task
Migrated	gitlab-migration	T2685 Test update of deposit with metadata-only in integration tests
Migrated	gitlab-migration	T2344 Build a connector for software deposit via Zenodo/InvenioRDM
Migrated	gitlab-migration	T2538 Add new option for the CLI swh-deposit for the metadata-only deposit
Migrated	gitlab-migration	T1021 SWORD deposit of metadata about an existing SWH object
Migrated	gitlab-migration	T2537 Extend new deposit endpoint to support metadata-only deposits
Migrated	gitlab-migration	T3128 Improve deposit integration, management and display
Migrated	gitlab-migration	T2540 support the loading of metadata-only deposits in the metadata storage
Migrated	gitlab-migration	T2779 Put information (client, collection and deposit-id) inside metadata for metadata-only deposit
Migrated	gitlab-migration	T2703 Use intrinsic identifiers/hashes for RawExtrinsicMetadata objects
Migrated	gitlab-migration	T3017 Use hashes as keys in swh.journal.objects.raw_extrinsic_metadata
Migrated	gitlab-migration	T3018 Allow querying raw_extrinsic_metadata by hash in swh-storage
Migrated	gitlab-migration	T3022 Deduplicate RawExtrinsicMetadata by hash instead of a subset of their fields
Migrated	gitlab-migration	T3019 Add an index for raw_extrinsic_metadata.id in swh.storage.postgresql
Migrated	gitlab-migration	T3020 Add an "index" for raw_extrinsic_metadata.id in swh.storage.cassandra
Migrated	gitlab-migration	T3074 Migrate all packages away from the old SWHID class
Migrated	gitlab-migration	T3034 generalize usage of SWHID for referencing SWH archive objects

Event Timeline

moranegg triaged this task as Normal priority. Nov 16 2020, 12:36 PM

moranegg created this task.

Possible option (discussion might be needed)
discovery_date from ERMDS = reception_date of the first deposit_request
swh:date = completed_date

I may have missed something (several actually) but where is this swh:deposit namespace specified?

I can see examples of the usage of the swh:deposit NS in docs/specs/spec-meta-deposit.rst in the context of the metadata deposit (to specify the targeted data in SWH the metadata is about), but no real specification.

To solve this discrepancy, the deposit message should be added in the XML.

Do you mean you want the (metadata) deposit loader to modify the deposited metadata file? Is that "acceptable"?

IMHO this is information (well, metadata) about the metadata loading process, so it should not be part of the original metadata. How would we then handle, say, a GPG signature?

In T2779#52735, @douardda wrote:

I may have missed something (several actually) but where is this swh:deposit namespace specified?

I guess it's the very purpose of T2625, isn't it?

yes

moranegg assigned this task to vlorentz. Dec 7 2020, 4:06 PM

I see three ways to do this:

1. Make the server parse the Atom document, insert this info in the document, and serialize it before writing to raw_extrinsic_metadata. But this means it will syntactically change the document provided by the client, and may also change it semantically if there is a bug anywhere in the process. It also means the client is no longer 100% the authority for that document. (And if we ever want to introduce signatures in the deposit, this is fundamentally incompatible.)

2. Make the client insert it in the document themselves, and have the server check it. But it's a burden for clients.

3. Wait for T2703 and allow depositing metadata about the metadata object itself (e.g. by creating SWHIDs for metadata objects). But it adds more delay for this feature, and we may not want to allow infinite reification like this...

Opinions on this?
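
To make the concern about the first option (server-side rewriting) concrete, here is a small sketch using only Python's standard library. It parses a client document, inserts a receipt and serializes it again, then compares hashes. The namespace URI, the sample document and the function names add_receipt and roundtrip are assumptions for the example, not the deposit server's actual code.

```python
import hashlib
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"
# Assumed URI for the swh prefix; not taken from this task.
SWH_NS = "https://www.softwareheritage.org/schema/2018/deposit"

# Keep readable prefixes in the serialized output.
ET.register_namespace("", ATOM_NS)
ET.register_namespace("swh", SWH_NS)

# Hypothetical document as sent by a client.
CLIENT_DOC = b"""<?xml version="1.0"?>
<entry xmlns="http://www.w3.org/2005/Atom">
  <title>Example</title>
</entry>
"""


def roundtrip(raw):
    """Parse and re-serialize without changing the content."""
    return ET.tostring(ET.fromstring(raw), encoding="utf-8")


def add_receipt(raw, client, collection, number):
    """Insert a swh:deposit/swh:receipt element and re-serialize."""
    root = ET.fromstring(raw)
    deposit = ET.SubElement(root, f"{{{SWH_NS}}}deposit")
    receipt = ET.SubElement(deposit, f"{{{SWH_NS}}}receipt")
    values = (("client", client), ("collection", collection),
              ("number", str(number)))
    for tag, value in values:
        ET.SubElement(receipt, f"{{{SWH_NS}}}{tag}").text = value
    return ET.tostring(root, encoding="utf-8")


if __name__ == "__main__":
    modified = add_receipt(CLIENT_DOC, "HAL", "HAL", 160)
    print("client sha1:  ", hashlib.sha1(CLIENT_DOC).hexdigest())
    print("stored sha1:  ", hashlib.sha1(modified).hexdigest())
    # Even a no-op parse/serialize cycle does not reproduce the bytes.
    print("roundtrip identical:", roundtrip(CLIENT_DOC) == CLIENT_DOC)
```

Since even the no-op round trip alters the XML declaration and whitespace, any hash or signature computed by the client over its own bytes would no longer match what is stored.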

I don't like 1/ at all, and 2 seems indeed a burden for clients...

I don't like this either.

Make the client insert it in the document themselves, and have the server check it. But it's a burden for clients

It seems that way.

Although, we are already asking clients to use the codemeta vocabulary and now we added some new tags for the origins... (the <swh:create_origin> / <swh:add_to_origin>).

So that might not be "such" a burden in the end (no idea really).

How far are we from T2703? I thought some work had started on this.

may not want to allow infinite reification like this...

what do you mean by that?

So that might not be "such" a burden in the end (no idea really).

It means that they:

need to care about the deposit id (currently, they don't need to)
can't create a deposit in a single query, because they don't yet know the deposit id when doing POST Col-IRI.
need to add these extra tags and understand what they mean.

How far are we from T2703

I don't know

what do you mean by that?

If we can have metadata on metadata, then we can also have metadata on metadata on metadata, and metadata on metadata on metadata on metadata, ...

Another drawback to option 2: it means that this info must be optional (which means most clients will omit it, making it useless) or it would break generic SWORD clients.

Thank you for your patience @douardda.
You are right, changing the deposited metadata should not be "acceptable", but this information is lost between a regular deposit and a metadata-only deposit, since we do not have a revision for the latter.
This task was the result of the discussion "do we create an origin-snapshot-revision for metadata-only deposit", which we concluded with a NO because of the upcoming ERMDS.

Thank you @vlorentz for your proposed solutions, here are my thoughts:
Option 3 is possible but adds a layer of metadata complexity.

Option 2 isn't possible, because the client doesn't initially know the deposit_id.

Option 1 is complicated because SWH's ethical obligation is to archive facts, and we could no longer state "X said Y about artifact W": the properties would be added to Y, and they are not something X said.
We should discuss the concept of a metadata file: do we treat it like a source code file, with the same sanctity of its fingerprint (the SWHID)?

Another idea, adding this metadata to the indexed metadata:

raw xml in ERMDS
json object in indexer metadata table containing translated metadata + administrative metadata

at the end the web app wants to read translated metadata

Let's see why we need this:

Why do we want these metadata properties saved and archived?

We want to know who stated this information and when.
Information without context is bad practice; we shouldn't lose the context.
This is something that we want for all metadata files, not only deposited files.

Isn't there a field in the ERMDS for the context of the metadata entry in json?

In T2779#55830, @moranegg wrote:

Another idea, adding this metadata to the indexed metadata:

raw xml in ERMDS

json object in indexer metadata table containing translated metadata + administrative metadata

at the end the web app wants to read translated metadata

That sounds like option 3, but using the indexer storage (which is not a permanent archive) instead of swh-storage (which is)

Isn't there a field in the ERMDS for the context of the metadata entry in json?

There is not. But we could add it. Let's call this option 4.

In T2779#55809, @vlorentz wrote:

If we can have metadata on metadata, then we can also have metadata on metadata on metadata, and metadata on metadata on metadata on metadata, ...

If we can have metadata on anything (and not must) then "problem solved" (well, you know, theoretically at least...)

Currently, we can't. But it's easy to allow: it's just a couple of lines to add in swh.model.model.

vlorentz changed the task status from Open to Work in Progress. Jan 7 2021, 11:14 AM

vlorentz moved this task from Backlog to In progress on the SWORD deposit board.

I'll assume for now that we're going with option 3, and add a dependency on T2703.

vlorentz added a subtask: T2703: Use intrinsic identifiers/hashes for RawExtrinsicMetadata objects. Jan 7 2021, 1:53 PM

Yeah let's go with option 3!
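
As a rough sketch of what option 3 amounts to: the client's document is stored untouched and gets an identifier derived from its content, and the deposit information becomes a second metadata object whose target is that identifier. The class and field names, the "meta:" identifier prefix, the JSON payload layout and the hashing scheme below are all hypothetical; this is not the swh.model API, nor the format implemented in D5239.

```python
import hashlib
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class MetadataObject:
    """Hypothetical stand-in for a raw extrinsic metadata entry."""
    target: str      # identifier of the object the metadata is about
    authority: str   # who made the statement
    fmt: str         # format of the payload
    payload: bytes   # raw bytes, never modified after reception

    def intrinsic_id(self) -> str:
        """Hash all fields, length-prefixed to avoid ambiguity."""
        h = hashlib.sha1()
        parts = (self.target.encode(), self.authority.encode(),
                 self.fmt.encode(), self.payload)
        for part in parts:
            h.update(len(part).to_bytes(8, "big"))
            h.update(part)
        return h.hexdigest()


def describe_deposit(original, client, collection, deposit_id):
    """Build a second metadata object that targets the first one."""
    info = json.dumps(
        {"client": client, "collection": collection,
         "deposit_id": deposit_id},
        sort_keys=True,
    ).encode()
    return MetadataObject(
        target="meta:" + original.intrinsic_id(),  # hypothetical ID scheme
        authority="deposit-server",                # hypothetical authority
        fmt="deposit-info-json",                   # hypothetical format name
        payload=info,
    )


# Placeholder target identifier and document, for illustration only.
original = MetadataObject(
    target="swh:1:dir:" + "0" * 40,
    authority="HAL",
    fmt="atom-xml",
    payload=b"<entry/>",
)
deposit_info = describe_deposit(original, "HAL", "HAL", 160)

if __name__ == "__main__":
    print(deposit_info.target)
    print(deposit_info.payload.decode())
```

The point of the design is that "X said Y about W" stays a fact about unmodified bytes, while "the deposit server received Y as deposit 160" is recorded as a separate statement by a different authority.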

moranegg added a parent task: T2540: support the loading of metadata-only deposits in the metadata storage. Jan 26 2021, 11:56 AM

vlorentz mentioned this in T3034: generalize usage of SWHID for referencing SWH archive objects. Feb 15 2021, 1:00 PM

vlorentz added a revision: D5239: Add deposit info to objects added to swh-storage from metadata-only deposits. Mar 15 2021, 1:55 PM

vlorentz added a commit: rDDEP3a9b2fc4baa4: Add deposit info to objects added to swh-storage from metadata-only deposits. Mar 15 2021, 3:55 PM

vlorentz closed this task as Resolved. Mar 16 2021, 10:53 AM

vlorentz closed subtask T2703: Use intrinsic identifiers/hashes for RawExtrinsicMetadata objects as Resolved. Mar 23 2021, 2:33 PM

moranegg moved this task from In progress to Landed/Tests/Validations (staging) on the SWORD deposit board. Mar 25 2021, 12:38 PM

vlorentz renamed this task from Put information (client, collection and deposit-id) inside metadata for metada-only deposit to Put information (client, collection and deposit-id) inside metadata for metadata-only deposit. Apr 27 2021, 2:09 PM

vlorentz moved this task from Landed/Tests/Validations (staging) to Deployed on the SWORD deposit board.

This task has been migrated to GitLab.

gitlab-migration changed the status of subtask T2703: Use intrinsic identifiers/hashes for RawExtrinsicMetadata objects from Resolved to Migrated. Jan 8 2023, 10:01 PM
