Put information (client, collection and deposit-id) inside metadata for metadata-only deposit
Closed, Migrated (Edits Locked)

Description

With the complete deposit, a revision is created with a commit message including the client, deposit number and collection.
These items will be lost with a metadata-only deposit in the ERMDS, since the revision or other elements aren't created in the archive.

To solve this discrepancy, deposit message should be added in the xml.
Here is a proposal for an element to add inside <swh:deposit>:

<swh:receipt>
   <swh:client>HAL</swh:client>
   <swh:collection>HAL</swh:collection>
   <swh:number>160</swh:number>
   <swh:date>reception date</swh:date>
</swh:receipt>

The reception date is equivalent to today's commit date.
This information might be redundant with a property already used for ERMDS entries (if so, it can be deleted).
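
As a rough illustration of how a consumer could read the proposed element, here is a minimal Python sketch. It is only a sketch under assumptions: the namespace URI bound to the swh prefix is assumed (the discussion below notes that the namespace is not formally specified), the date value is a placeholder, and read_receipt is a hypothetical helper, not part of swh-deposit.

```python
import xml.etree.ElementTree as ET

# Assumption: URI bound to the "swh" prefix; not confirmed by this task.
SWH_NS = "https://www.softwareheritage.org/schema/2018/deposit"
NS = {"swh": SWH_NS}

# Sample Atom entry carrying the proposed <swh:receipt> block.
# The date is a placeholder standing in for "reception date".
SAMPLE = f"""<entry xmlns="http://www.w3.org/2005/Atom" xmlns:swh="{SWH_NS}">
  <swh:deposit>
    <swh:receipt>
      <swh:client>HAL</swh:client>
      <swh:collection>HAL</swh:collection>
      <swh:number>160</swh:number>
      <swh:date>2021-01-01T00:00:00+00:00</swh:date>
    </swh:receipt>
  </swh:deposit>
</entry>"""


def read_receipt(xml_text):
    """Return the receipt fields as a dict, or None if there is no receipt."""
    root = ET.fromstring(xml_text)
    receipt = root.find("swh:deposit/swh:receipt", NS)
    if receipt is None:
        return None
    # Strip the "{namespace}" part of each child tag to get the bare field name.
    return {
        child.tag.split("}", 1)[1]: (child.text or "").strip()
        for child in receipt
    }


receipt = read_receipt(SAMPLE)
```

Note that all values come back as strings; a real consumer would presumably convert the number and parse the date.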

Event Timeline

moranegg triaged this task as Normal priority. Nov 16 2020, 12:36 PM
moranegg created this task.

Possible option (discussion might be needed)
discovery_date from ERMDS = reception_date of the first deposit_request
swh:date = completed_date

I may have missed something (several actually) but where is this swh:deposit namespace specified?

I can see examples of the usage of the swh:deposit NS in docs/specs/spec-meta-deposit.rst in the context of the metadata deposit (to specify the targeted data in SWH the metadata is about), but no real specification.

To solve this discrepancy, deposit message should be added in the xml.

Do you mean you want the (metadata) deposit loader to modify the deposited metadata file? Is it "acceptable"?

IMHO this is information (well, metadata) about the metadata loading process, so it should not be part of the original metadata. How would we then handle, say, a GPG signature?

I may have missed something (several actually) but where is this swh:deposit namespace specified?

I guess it's the very purpose of T2625, isn't it?

I see three ways to do this:

  1. Make the server parse the Atom document, insert this info in the document, and serialize it before writing to raw_extrinsic_metadata. But this means it will syntactically change the document provided by the client, and may also change it semantically if there is a bug anywhere in the process. It also means the client is no longer 100% the authority for that document. (And if we ever want to introduce signatures in the deposit, this is fundamentally incompatible.)
  2. Make the client insert it in the document themselves, and have the server check it. But it's a burden for clients.
  3. Wait for T2703 and allow depositing metadata about the metadata object itself (e.g. by creating SWHIDs for metadata objects). But it adds more delay for this feature, and we may not want to allow infinite reification like this...

Opinions on this?
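
To make the concern about the first option concrete, here is a minimal Python sketch of a parse, insert and serialize round trip. It is not the deposit server's code: add_receipt is a hypothetical helper, the swh namespace URI is an assumption, and the sample entry is invented. It only shows that the stored bytes, and therefore any checksum or signature computed over the client's document, no longer match after the server edits it.

```python
import hashlib
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"
# Assumption: URI bound to the "swh" prefix; not confirmed by this task.
SWH_NS = "https://www.softwareheritage.org/schema/2018/deposit"


def add_receipt(original: bytes, client: str, collection: str, number: int) -> bytes:
    """Hypothetical server-side step: parse, insert a receipt, re-serialize."""
    # Keep Atom as the default namespace and "swh" as a readable prefix on output.
    ET.register_namespace("", ATOM_NS)
    ET.register_namespace("swh", SWH_NS)
    root = ET.fromstring(original)
    deposit = root.find(f"{{{SWH_NS}}}deposit")
    if deposit is None:
        deposit = ET.SubElement(root, f"{{{SWH_NS}}}deposit")
    receipt = ET.SubElement(deposit, f"{{{SWH_NS}}}receipt")
    fields = (("client", client), ("collection", collection), ("number", str(number)))
    for tag, value in fields:
        ET.SubElement(receipt, f"{{{SWH_NS}}}{tag}").text = value
    return ET.tostring(root, encoding="utf-8")


# Invented example of what a client might send.
original = (
    b'<?xml version="1.0"?>\n'
    b'<entry xmlns="http://www.w3.org/2005/Atom">\n'
    b"  <title>Example</title>\n"
    b"</entry>\n"
)
modified = add_receipt(original, "HAL", "HAL", 160)

# Digests of the client's bytes and of the bytes the server would store.
original_digest = hashlib.sha256(original).hexdigest()
modified_digest = hashlib.sha256(modified).hexdigest()
```

The two digests differ, so a signature made by the client over its own bytes could not be verified against the stored document.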

I don't like 1/ at all, and 2 seems indeed a burden for clients...

I don't like this either.

  2. Make the client insert it in the document themselves, and have the server check it. But it's a burden for clients

It seems that way.

Although, we are already asking clients to use the codemeta vocabulary and now we added some new tags for the origins... (the <swh:create_origin> / <swh:add_to_origin>).

So that might not be "such" a burden in the end (no idea really).

How far are we from T2703? I thought some work had started on this.

may not want to allow infinite reification like this...

what do you mean by that?

So that might not be "such" a burden in the end (no idea really).

It means that they:

  1. need to care about the deposit id (currently, they don't need to)
  2. can't create a deposit in a single query, because they don't yet know the deposit id when doing POST Col-IRI.
  3. need to add these extra tags and understand what they mean.

How far are we from T2703

I don't know

what do you mean by that?

If we can have metadata on metadata, then we can also have metadata on metadata on metadata, and metadata on metadata on metadata on metadata, ...

Another drawback to option 2: it means that this info must be optional (which means most clients will omit it, making it useless) or it would break generic SWORD clients.

Thank you for your patience @douardda
You are right, changing the deposited metadata should not be "acceptable", but this information is lost between a regular deposit and a metadata-only deposit, since we do not have a revision for the latter.
This task was the result of the discussion "do we create an origin-snapshot-revision for metadata-only deposit", which we concluded with NO due to the upcoming ERMDS.

Thank you @vlorentz for your proposed solutions, here are my thoughts:
Option 3 is possible but adds a layer of metadata complexity.

Option 2 isn't possible, because the client doesn't initially know the deposit_id.

Option 1 is complicated because SWH's ethical obligation is to archive facts, and we can't state "X said Y on artifact W" since the properties would be added to Y and they aren't something that X said.
We should discuss the concept of a metadata file: do we regard it like a source code file, with the same sanctity of its fingerprint (the SWHID)?

Another idea, adding this metadata to the indexed metadata:

  1. raw xml in ERMDS
  2. json object in indexer metadata table containing translated metadata + administrative metadata

at the end the web app wants to read translated metadata

Let's see why we need this:

Why do we want these metadata properties saved and archived?

  1. We want to know who said this information and when.
  2. Information without context is bad practice; we shouldn't lose the context.
  3. This is something that we want for all metadata files, not only deposited files.

Isn't there a field in the ERMDS for the context of the metadata entry in json?

Another idea, adding this metadata to the indexed metadata:

  1. raw xml in ERMDS
  2. json object in indexer metadata table containing translated metadata + administrative metadata

at the end the web app wants to read translated metadata

That sounds like option 3, but using the indexer storage (which is not a permanent archive) instead of swh-storage (which is)

Isn't there a field in the ERMDS for the context of the metadata entry in json?

There is not. But we could add it. Let's call this option 4.
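
A minimal sketch of what option 4 could look like, assuming a context field stored next to the raw document rather than inside it. The class and field names below (DepositContext, MetadataEntry, context) are hypothetical and are not the actual swh.model classes; the target URL and document bytes are invented.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class DepositContext:
    """Hypothetical administrative context for a deposited metadata document."""
    client: str
    collection: str
    deposit_id: int


@dataclass(frozen=True)
class MetadataEntry:
    """Hypothetical ERMDS-like entry: raw bytes plus optional context."""
    target: str               # what the metadata is about (invented example below)
    discovery_date: datetime
    metadata: bytes           # the client's document, stored byte for byte
    context: Optional[DepositContext] = None


raw = b'<entry xmlns="http://www.w3.org/2005/Atom"><title>Example</title></entry>'
entry = MetadataEntry(
    target="https://example.org/project",
    discovery_date=datetime(2021, 1, 1, tzinfo=timezone.utc),
    metadata=raw,
    context=DepositContext(client="HAL", collection="HAL", deposit_id=160),
)
```

With this shape the client's document is never rewritten, so it stays compatible with checksums and signatures, while the client, collection and deposit id remain attached to the entry.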

If we can have metadata on metadata, then we can also have metadata on metadata on metadata, and metadata on metadata on metadata on metadata, ...

If we can have metadata on anything (and not must) then "problem solved" (well, you know, theoretically at least...)

Currently, we can't. But it's easy to allow it: it's just a couple of lines to add in swh.model.model.

vlorentz changed the task status from Open to Work in Progress. Jan 7 2021, 11:14 AM
vlorentz moved this task from Backlog to In progress on the SWORD deposit board.

I'll assume for now that we're going to option 3 and add a dependency on T2703

vlorentz renamed this task from Put information (client, collection and deposit-id) inside metadata for metada-only deposit to Put information (client, collection and deposit-id) inside metadata for metadata-only deposit. Apr 27 2021, 2:09 PM