Make the Slug header optional for the deposit server
Closed, Migrated, Edits Locked

Description

Currently, the deposit server requires the Slug header, so the client generates one when needed.

However, as @douardda pointed out to me, the SWORD specification says it is optional:

The client MAY supply a Slug header providing a suggested identifier for the deposited content

and

Server implementations MUST adopt the behaviour and requirements in Section 9.7 of [AtomPub] with respect to the Slug header.

and from the AtomPub spec:

Slug is an HTTP entity-header whose presence in a POST to a Collection constitutes a request by the client to [...]

which implies it is optional as well.

So we should make the server accept the absence of a Slug header and generate a slug on its own if it is missing. And then, we can remove the slug generation from the client, as it won't be needed anymore.
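A minimal sketch of that server-side fallback, assuming the request headers are available as a plain mapping. The function name `resolve_slug` and the choice of a random UUID as the generated value are illustrative assumptions, not the deposit server's actual code:

```python
import uuid
from typing import Mapping


def resolve_slug(headers: Mapping[str, str]) -> str:
    """Return the client-supplied Slug, or generate one when it is absent."""
    # HTTP header names are case-insensitive, so normalise them before lookup.
    normalised = {name.lower(): value for name, value in headers.items()}
    slug = normalised.get("slug", "").strip()
    if slug:
        # The client made a suggestion: use it as-is.
        return slug
    # Assumption: a random UUID is an acceptable generated slug.
    return str(uuid.uuid4())
```

With something like this in place, a client that omits the header still gets a deposit, and a client that sends one keeps today's behaviour.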

Event Timeline

vlorentz triaged this task as Normal priority. Nov 9 2020, 1:06 PM
vlorentz created this task.

I agree we should stick to the specification as much as possible.

Before generating the slug server side, we should clarify what its purpose is.
It's blurry to me today... I recall it ends up in the origin that the deposit
loader ingests, though. Hence why we made it mandatory in the first place, iirc.

So that means, for most users [1], the ingested origin is something referenced in
the real world. We can look at hal or ipol's deposits for example [2] [3].

@moranegg Do you have any more input?

[1] Except for the swh user, for which we use that slug generation to create
dummy deposits for monitoring and testing purposes.

[2] external_id: hal-01243065, provider_url: https://hal.archives-ouvertes.fr
https://hal.archives-ouvertes.fr/hal-01243065

[3] external_id: ipol.2011.ys-dct, provider_url: https://doi.org/10.5201/
https://doi.org/10.5201/ipol.2011.ys-dct (-redirects-to-> http://www.ipol.im/pub/art/2011/ys-dct/)
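The two footnotes above suggest the ingested origin is the provider URL followed by the external id. A small sketch of that reading follows. The helper name `origin_url` and the slash handling are assumptions made for illustration, not the loader's actual code:

```python
def origin_url(provider_url: str, external_id: str) -> str:
    """Join a provider URL and an external id with exactly one slash."""
    # Assumption: the origin is a plain concatenation, as the examples suggest.
    return provider_url.rstrip("/") + "/" + external_id.lstrip("/")


print(origin_url("https://hal.archives-ouvertes.fr", "hal-01243065"))
print(origin_url("https://doi.org/10.5201/", "ipol.2011.ys-dct"))
```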

I need to reflect on this.
There were two major reasons for which we use the slug:

  1. create an origin
  2. versions and deposit's paternity - the slug is the property letting us know that a deposit is another version of the same slug

This task should also be related to T2391, T2757 (actually seems like the same task) and T2752 (making it possible to have the same slug behavior with a metadata file property)

@moranegg It's not going away; users of the deposit clients won't notice the change. It's just moving the optionality further down the pipeline.

I need to reflect on this.
There were two major reasons for which we use the slug:

  1. create an origin

This is not a problem

  2. versions and deposit's paternity - the slug is the property letting us know that a deposit is another version of the same slug

This is a problem: we give a meaning to the slug that is not part of any spec (unless I missed something). The slug is not meant to be an identifier, it's a suggestion made by the user. We may not even take it into consideration and be perfectly fine w.r.t. the specifications.
So it looks to me that the identifier we need is in fact the origin, and the origin is "assigned" the first time a deposit is made (potentially taking the slug into consideration for this purpose).
Then for subsequent deposits that must have a paternity relation, the identifier we need is in fact the origin. Not sure if this is actually possible to do strictly using the SWORD specs or if we need some "custom extension" (in which case, the "overusage" of the slug can be seen as such).

However, using the slug to identify a deposited origin also has another weak point: if for any reason the provider url of a user changes, this will not work any more.

This task should also be related to T2391, T2757 (actually seems like the same task) and T2752 (making it possible to have the same slug behavior with a metadata file property)

@douardda That's a good point. I don't think adding an extension would be an issue, we are already doing that for the metadata-only deposit

It seems to make perfect sense to use the same logic as the metadata deposit to handle this problem, I think.

But is there something missing in the SWORD specification to cover this?
I was under the impression that "Replacing", "Adding" and "Deleting" the content of a deposit (sections 6.5, 6.6 and 6.7 of the SWORDv2 spec) were meant exactly for this.

Since all these operations are performed via the EM-IRI or the Edit-IRI of the resource, its "identity" is perfectly defined (and this "identity" is what the Origin is related to).

[edit] If I understand this correctly, this "identity" is what a "container" refers to in the SWORDv2 specs (but I'm not 100% sure I get the semantics behind "container" vs "resource" in that spec)[/edit]

Yes and no.

The problem is, the EM-IRI and the Edit-IRI are already used to allow changing a deposit as long as it's in the partial state. Then, it's loaded when it leaves that status, and the user is no longer allowed to use them.

But we could indeed require users to use them for deposit updates, making the deposit go back to the partial state. It would be much cleaner (and in the spirit of the SWORD spec) to do that.

This is a problem: we give a meaning to the slug that is not part of any spec (unless I missed something).

I thought that was referenced somewhere... Either I dreamt it, or it got
dropped and I missed that change.

The slug is not meant to be an identifier, it's a suggestion made by the user...

That we voluntarily made mandatory (that's in our deposit spec somewhere).

At the time we only had hal as a client and not enough background to see more
clearly. So it seems we made a wrong choice, which can be improved now \o/

So it looks to me that the identifier we need is in fact the origin, and the
origin is "assigned" the first time a deposit is made (potentially taking the
slug into consideration for this purpose).

yes, exactly. That's why I was inclined towards your proposal to make the
origin explicit in the deposit table (implementation detail: today it's only a
property of the Deposit model instance ¯\_(ツ)_/¯).

Then for subsequent deposits that must have a paternity relation, the
identifier we need is in fact the origin.

yes!

I don't think adding an extension would be an issue, we are already doing that for the metadata-only deposit

I recall we had the origin (as a URL XML entry) at some point. And we used
that as the origin, prior to doing some computation with the provider url and
the external id.

I think we can enhance the <swh:deposit /> entries to add that back there?
Is that what you are both suggesting?

However, using the slug to identify a deposited origin also has another weak
point: if for any reason the provider url of a user changes, this will not
work any more.

I'd play the status-quo card...
Let's see when that happens?

For a new version with content, we can't reuse the same deposit_id with an EM-IRI, because that would mean keeping a list of different SWHIDs for the same deposit_id.
This is why we need a new deposit for each new content, with deposits linked through a parent property.
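A rough sketch of that parent-linked model, assuming one SWHID per deposit. The `Deposit` dataclass, its field names and the `lineage` helper are hypothetical and only illustrate the idea, not the actual Deposit model:

```python
from dataclasses import dataclass
from typing import Iterator, Optional


@dataclass
class Deposit:
    deposit_id: int
    swhid: str  # exactly one SWHID per deposit (placeholder values below)
    parent: Optional["Deposit"] = None  # previous version, if any


def lineage(deposit: Deposit) -> Iterator[Deposit]:
    """Yield the deposit, then each earlier version, newest first."""
    current: Optional[Deposit] = deposit
    while current is not None:
        yield current
        current = current.parent


# Two versions of the same software: v2 points back at v1.
v1 = Deposit(deposit_id=1, swhid="swhid-of-v1")
v2 = Deposit(deposit_id=2, swhid="swhid-of-v2", parent=v1)
print([d.deposit_id for d in lineage(v2)])
```

Each deposit keeps a single SWHID, and the version history is recovered by walking the parent links rather than by storing a list on one deposit_id.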

vlorentz reopened this task as Open.

Actually, this is not finished: we still need <swh:create_origin> and <swh:add_to_origin> to land for it to be really optional.