
Update the usage of --slug or external_identifier in the deposit cli
Closed, Migrated

Description

During @douardda's tests of the deposit, the only way to create an origin without a UUID was to pass the --slug flag.
As described here: T2391

Falling back to the external_id from the metadata file when --slug isn't given would be better cli behavior.

I'm not sure it is feasible technically.

Event Timeline

moranegg triaged this task as Normal priority. Nov 3 2020, 11:49 AM
moranegg created this task.

I'm not sure it is feasible technically

It is.

The main change needed for this is server side, not cli side.
It's the server that expects the slug to be given (as an http request header).

The cli's --slug option just turns the value into that header when talking to the server.
That said, a change is also required on the cli side: if --slug is not provided and there is an
external_id entry in the metadata xml file, then there is no need to generate the slug (currently
the cli generates a uuid if no slug is provided).
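
The fallback order discussed here can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the actual code in swh/deposit/cli/client.py: the function names are hypothetical, and the element is assumed to be called `external_identifier` and matched regardless of XML namespace. Whether the resulting value is then sent as a header or left for the server to read from the metadata is the open question in this thread.

```python
import uuid
import xml.etree.ElementTree as ET
from typing import Optional


def external_id_from_metadata(xml_text: str) -> Optional[str]:
    """Return the text of the first external_identifier element, if any.

    Hypothetical helper: the element name is an assumption.
    """
    root = ET.fromstring(xml_text)
    for element in root.iter():
        # Drop any "{namespace}" prefix so the lookup is namespace-agnostic.
        local_name = element.tag.rsplit("}", 1)[-1]
        if local_name == "external_identifier" and element.text:
            return element.text.strip()
    return None


def resolve_slug(cli_slug: Optional[str], metadata_xml: Optional[str]) -> str:
    """Pick an identifier: explicit --slug, else metadata value, else a UUID."""
    if cli_slug:
        return cli_slug
    if metadata_xml:
        found = external_id_from_metadata(metadata_xml)
        if found:
            return found
    # Current behavior described above: generate a uuid as a last resort.
    return str(uuid.uuid4())
```

With this ordering, a UUID is generated only when neither the flag nor the metadata provides an identifier.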

Can you check whether anything verifies that the slug and external_identifier are consistent with each other?

I do think that the change is client side, where the cli should extract the id from the metadata file and send it as the Slug header.
Is that possible?

Also, we are thinking of removing external_identifier altogether and using only the atom:id identifier in the metadata file. (I'll update here when this discussion is finalized on IRC)

also note that making the slug a MUST (server-side) is not valid w.r.t. the specs ("The client MAY supply a Slug header")

So this should not be handled client-side (the generation of the slug in swh/deposit/cli/client.py)

I also think this external_identifier should go away, the spec is rich (aka complicated) enough without us adding more layers :-)

also note that making the slug a MUST (server-side) is not valid w.r.t. the
specs ("The client MAY supply a Slug header")

yes, this part is not compliant with the sword spec (it was done so the deposit
could start being developed, there was no api update part at first...).

We then need to push forward your proposal of carrying the external_identifier
(or slug, if your 2nd proposal below is adopted) within the metadata deposited
by clients.

That's because we need some way to distinguish deposit requests (of type
metadata or archive) that create new partial deposits without any prior
history from deposit requests that create a new deposit for the same
"origin".

As far as I remember, the mandatory slug in the http headers is what
allows this.

To create a deposit, you post to /<collection>/ (<- there is no deposit id in
there), and we use the external id to correctly join that information [1]

Having said that, I think I've finally figured out what needs changing (yeah)...

We need to allow providing the previous deposit id [2] as the historic deposit
id with the post collection api call, so we can actually know it's a new
deposit for an existing origin (we want that, it's part of the deposit
requirements).

[1] Implementation-wise, the synthetic revisions created in the archive
reference the previous synthetic revision.

[2] or swhid? I think a core swhid won't work (clashes could be possible on
directory swhids). A swhid with context could work though!
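
The role the external id plays in that join can be illustrated with a toy sketch. Everything in it (the `DepositIndex` class, its fields, the in-memory storage) is hypothetical and is not the swh-deposit implementation. It only shows how a (collection, external id) key lets a server tell a first deposit from a new deposit for an existing origin.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Deposit:
    deposit_id: int
    collection: str
    external_id: str
    parent_id: Optional[int]  # previous deposit for the same origin, if any


class DepositIndex:
    """Toy in-memory stand-in for a server-side deposit table (hypothetical)."""

    def __init__(self) -> None:
        self._latest: Dict[Tuple[str, str], int] = {}
        self._deposits: List[Deposit] = []

    def create(self, collection: str, external_id: str) -> Deposit:
        key = (collection, external_id)
        # A key seen before means "new deposit for an existing origin".
        parent_id = self._latest.get(key)
        deposit = Deposit(len(self._deposits) + 1, collection, external_id, parent_id)
        self._deposits.append(deposit)
        self._latest[key] = deposit.deposit_id
        return deposit
```

Passing the previous deposit id explicitly, as proposed above, would replace the lookup on `external_id` with a value supplied by the client.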


I can't help but think that adding full-fledged tests around deposit scenarios
would help with this.

I also think this external_identifier should go away, the spec is rich (aka
complicated) enough without us adding more layers :-)

yes, then we need to standardize on the term "slug" everywhere.

moranegg claimed this task.

Closing this given the recent changes in the protocol:
https://docs.softwareheritage.org/devel/swh-deposit/specs/protocol-reference.html

see also: T2860