
Separate origin-source-code and provenance-metadata in the deposit
Closed, Migrated

Description

Recently we have added the metadata-only deposit option to the deposit feature.
These deposits are added to the deposit storage in the same manner as the regular deposits.
If the target of the deposit is a SWHID with an origin as a parameter, that origin is stored in the origin column. The same might be true when the target is an origin-url.

This behavior is undesirable because the column origin in the deposit storage should be understood as the origin of the deposit, which can be different from the origin in the archive.

This is why we need two columns, and the information in these columns should be limited to the origin of the deposit (and not the targeted origin).

We need to change the protocol to extract the information about the provenance of the metadata from the Atom document.

Event Timeline

The same might be true when the target is an origin-url.

It's not

This behavior is undesirable because the column origin in the deposit storage should be understood as the origin of the deposit, which can be different from the origin in the archive.

To be clear, that column was defined from the beginning as being the context in which the metadata is valid.
What you are talking about is creating a new column, not repurposing the existing column, right?

This is why we need two columns, and the information in these columns should be limited to the origin of the deposit (and not the targeted origin).

Is this limited to the deposit, or do you want a generic solution for all metadata objects?

Don't we already have this for deposits, as the meta-metadata?

For a metadata-only deposit, the URL of the metadata can be taken from <codemeta:url>; see this example:

<codemeta:url>https://hal.halpreprod.archives-ouvertes.fr/hal-02526788</codemeta:url>.

To be pedantic, this is a (non-unique) URL of the described object, not the URL of the metadata.
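
To make the distinction concrete, here is a minimal sketch of pulling <codemeta:url> values out of an Atom entry. It uses only the Python standard library; the CodeMeta namespace URI and the helper name described_object_urls are assumptions made for illustration, not the deposit server's actual code. The function returns a list because, as noted above, this URL is not unique and describes the object rather than the metadata.

```python
import xml.etree.ElementTree as ET
from typing import List

# Assumed namespace URIs; check them against the deposit specification.
NS = {
    "atom": "http://www.w3.org/2005/Atom",
    "codemeta": "https://doi.org/10.5063/SCHEMA/CODEMETA-2.0",
}

# Example entry built around the URL quoted in this thread.
ENTRY = """<entry xmlns="http://www.w3.org/2005/Atom"
       xmlns:codemeta="https://doi.org/10.5063/SCHEMA/CODEMETA-2.0">
  <codemeta:url>https://hal.halpreprod.archives-ouvertes.fr/hal-02526788</codemeta:url>
</entry>"""


def described_object_urls(atom_xml: str) -> List[str]:
    """Return every <codemeta:url> that is a direct child of the entry.

    Hypothetical helper: these are URLs of the described object,
    not necessarily the location of the metadata itself.
    """
    root = ET.fromstring(atom_xml)
    return [
        el.text.strip()
        for el in root.findall("codemeta:url", NS)
        if el.text and el.text.strip()
    ]


print(described_object_urls(ENTRY))
```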

The same might be true when the target is an origin-url.

It's not

This should be tested with a deposit of metadata-only with the target origin.
When the SWHID has no context, the origin column in the deposit admin view shows the SWHID, which is very confusing, by the way.

This behavior is undesirable because the column origin in the deposit storage should be understood as the origin of the deposit, which can be different from the origin in the archive.

To be clear, that column was defined from the beginning as being the context in which the metadata is valid.
What you are talking about is creating a new column, not repurposing the existing column, right?

At the beginning we only had content deposits, so it was natural for this column to hold the origins created in the archive for deposits.
With the new use case (metadata-only), we used the column "abusively" whenever an origin was identified.
So this column will be renamed origin-source-code (which covers 100% of the deposits in production), and we will add a new column for metadata.
Another option (proposed by Roberto) was to keep all origins in one column and add a new column stating the type of origin.
That option cannot represent a deposit where the code's origin is X and the metadata's origin is Y.
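
A rough sketch of the two layouts, to show why the single typed column falls short. The class names, field names, and type labels below are hypothetical and only mirror the wording of this thread; they are not the actual deposit schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TwoColumnDeposit:
    # Hypothetical column names, following the naming used in this task.
    origin_source_code: Optional[str] = None
    provenance_metadata: Optional[str] = None


@dataclass
class TypedOriginDeposit:
    # Alternative layout: one origin column plus a column giving its type.
    origin: Optional[str] = None
    origin_type: Optional[str] = None  # "source-code" or "metadata"


def to_typed(deposit: TwoColumnDeposit) -> TypedOriginDeposit:
    """Convert to the single-column layout, failing when it cannot fit."""
    if deposit.origin_source_code and deposit.provenance_metadata:
        # Code origin X and metadata origin Y cannot share one column.
        raise ValueError("one typed column cannot hold two distinct origins")
    if deposit.origin_source_code:
        return TypedOriginDeposit(deposit.origin_source_code, "source-code")
    if deposit.provenance_metadata:
        return TypedOriginDeposit(deposit.provenance_metadata, "metadata")
    return TypedOriginDeposit()
```

A deposit with only one of the two values converts cleanly; one with both raises, which is the case the two-column layout is meant to cover.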

This is why we need two columns, and the information in these columns should be limited to the origin of the deposit (and not the targeted origin).

Is this limited to the deposit, or do you want a generic solution for all metadata objects?

I opened a discussion in T3681, possibly to revisit some decisions made for extrinsic metadata.

Don't we already have this for deposits, as the meta-metadata?

If we do, please show me.
And let's organize the meta-metadata better so I can see it in the ERMDS endpoint and in the deposit admin view.

For a metadata-only deposit, the URL of the metadata can be taken from <codemeta:url>; see this example:

<codemeta:url>https://hal.halpreprod.archives-ouvertes.fr/hal-02526788</codemeta:url>.

To be pedantic, this is a (non-unique) URL of the described object, not the URL of the metadata.

This is true.
Can you suggest something more suitable, aligned with <deposit> <create-origin>?
Or even use 'create-origin' for this purpose, without creating an origin in the archive?

Don't we already have this for deposits, as the meta-metadata?

If we do, please show me.

Hmm, my bad, it's indeed missing from the meta-metadata; but we could add it there.

And let's organize the meta-metadata better so I can see it in the ERMDS endpoint and in the deposit admin view.

For a metadata-only deposit, the URL of the metadata can be taken from <codemeta:url>; see this example:

<codemeta:url>https://hal.halpreprod.archives-ouvertes.fr/hal-02526788</codemeta:url>.

To be pedantic, this is a (non-unique) URL of the described object, not the URL of the metadata.

This is true.
Can you suggest something more suitable, aligned with <deposit> <create-origin>?
Or even use 'create-origin' for this purpose, without creating an origin in the archive?

'<swh:create-origin>' would be unfit, as it serves a different purpose.

This is authority-specific information, as it describes objects internal to the authority, so I don't have a generic answer.

Looking back, it seems the reason you want this "origin-metadata" is to recognize the deposit in the UI, right? What if we allowed deposit clients to provide a <swh:deposit-name> tag, that would only be used for this UI?
And it does not have to be a URL (since there is nothing we can /locate/ with it), it could be a paper name, etc.

Why do we want that:

  1. as you analyzed: recognize the deposit in the UI
  2. recognize in the deposit storage, what is the source (origin) of the deposit
    • here it happens that the deposit itself is only metadata
  3. identify which deposits are metadata-only deposits

Your proposition is a bit ambiguous with:
<swh:deposit-name> tag
because it is not about naming the deposit; it is about providing a URL for the deposit's birthplace :-)
It should be a URL for the location of the metadata in the wild... with HAL, the location is the specific HAL record.

Why do we want that:

  1. as you analyzed: recognize the deposit in the UI

After discussion with @moranegg, it appears the actual need isn't to find the location of the metadata itself or to identify an object, but to provide users with a link to the original object, for discovery.

  2. recognize in the deposit storage, what is the source (origin) of the deposit
    • here it happens that the deposit itself is only metadata

ditto

  3. identify which deposits are metadata-only deposits

@moranegg This is already possible, as metadata-only deposits have a <swh:reference> tag.
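
For illustration, a minimal sketch of that check using only the Python standard library. The swh namespace URI, the sample entries, and the helper name is_metadata_only are assumptions for this example, not the deposit server's actual implementation.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URI for the swh: prefix; verify against the deposit spec.
NS = {"swh": "https://www.softwareheritage.org/schema/2018/deposit"}

# Hypothetical metadata-only entry: it references an existing origin.
METADATA_ONLY = """<entry xmlns="http://www.w3.org/2005/Atom"
       xmlns:swh="https://www.softwareheritage.org/schema/2018/deposit">
  <swh:deposit>
    <swh:reference>
      <swh:origin url="https://example.org/some/origin"/>
    </swh:reference>
  </swh:deposit>
</entry>"""

# Hypothetical code deposit: it asks for an origin to be created.
CODE_DEPOSIT = """<entry xmlns="http://www.w3.org/2005/Atom"
       xmlns:swh="https://www.softwareheritage.org/schema/2018/deposit">
  <swh:deposit>
    <swh:create-origin>
      <swh:origin url="https://example.org/some/origin"/>
    </swh:create-origin>
  </swh:deposit>
</entry>"""


def is_metadata_only(atom_xml: str) -> bool:
    """True when the entry contains a <swh:reference> element anywhere."""
    root = ET.fromstring(atom_xml)
    return root.find(".//swh:reference", NS) is not None
```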

Your proposition is a bit ambiguous with:
<swh:deposit-name> tag
because it is not about naming the deposit; it is about providing a URL for the deposit's birthplace :-)
It should be a URL for the location of the metadata in the wild... with HAL, the location is the specific HAL record.

After the discussion mentioned above, I understand the need for linking to a webpage.

We kind of settled on the term "metadata source" for now, but I want to give it some more thought to make sure it makes sense and is future-proof.

Following our discussion, we might call the tag metadata-source, since using the term origin for metadata can be misleading.

@vlorentz: to revisit after a good night's sleep :-)

Oh and another use-case we discussed: when browsing an origin/directory/..., showing on the side the set of sources that provided metadata for that origin/directory/.... This is, again, to improve discoverability.

Oh and another use-case we discussed: when browsing an origin/directory/..., showing on the side the set of sources that provided metadata for that origin/directory/.... This is, again, to improve discoverability.

Yes exactly, when browsing the content, having a direct link to the metadata-source is also a valid use-case.

I'm not sure where we are at on this evolution.
Maybe we can discuss this next week.

moranegg renamed this task from Separate origin-source-code and origin-metadata in the deposit to Separate origin-source-code and provenance-metadata in the deposit. Feb 15 2022, 3:05 PM
moranegg updated the task description.
moranegg added a project: meta-task.
ardumont changed the status of subtask T3973: Deploy swh.deposit v0.17 from Open to Work in Progress. Feb 24 2022, 12:07 PM
moranegg changed the task status from Open to Work in Progress. Apr 7 2022, 12:04 PM
moranegg moved this task from Backlog to In progress on the SWORD deposit board.
moranegg changed the status of subtask T3376: Visualize metadata of a deposit in the admin (moderation) view from Open to Work in Progress.