Page MenuHomeSoftware Heritage

Remove the <external_identifier> tag from the protocol.
ClosedPublic

Authored by vlorentz on Thu, Nov 12, 2:47 PM.

Details

Summary

It didn't match the Atom specification, duplicates the Slug header,
while being optional and specified as 'SHOULD match the Slug'.

First half of the resolution of T2766.

Diff Detail

Repository
rDDEP Push deposit
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

vlorentz created this revision.Thu, Nov 12, 2:47 PM

Build is green

Patch application report for D4463 (id=15843)

Rebasing onto 125c4831b4...

Current branch diff-target is up to date.
Changes applied before test
commit 4793f2ef85561d473a0b9932367b40d4a449ae7f
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Nov 12 14:47:09 2020 +0100

    Remove the <external_identifier> tag from the protocol.
    
    It didn't match the Atom specification, duplicates the Slug header,
    while being optional and specified as 'SHOULD match the Slug'.

See https://jenkins.softwareheritage.org/job/DDEP/job/tests-on-diff/322/ for more details.

ardumont added inline comments.
swh/deposit/api/common.py
253

why remove it?

It's to retry in case of failing because of technical issues (communication for example), not trying 3 times to validate invalid deposits heh.

Something is amiss for me somewhere.

I thought we wanted to relax the slug's mandatory property (to respect the sword v2 spec).
If that's the case, we still need the external_identifier in the loop.

Because that's what's used at some point to build the origin for the loader.

What did i miss?

Because, from my understanding of another discussion with @douardda
or @moranegg, i thought it was the other way around.
Keep the external_identifier and use it if not provided by the slug.

So that way, we can go the optional slug road.

swh/deposit/api/common.py
1012

Nope, it's post alright

Something is amiss for me somewhere.

I thought we wanted to relax the slug's mandatory property (to respect the sword v2 spec).
If that's the case, we still need the external_identifier in the loop.

Because that's what's used at some point to build the origin for the loader.

What did i miss?

Because, from my understanding of another discussion with @douardda
or @moranegg, i thought it was the other way around.
Keep the external_identifier and use it if not provided by the slug.

So that way, we can go the optional slug road.

The fact that the slug is optional does not make the external_identifier in the metadata file required, it's just that if the slug is not sent, then the id is generated (like we currently do in the cli) but server side.

ardumont added a comment.EditedFri, Nov 13, 10:46 AM

if the slug is not sent, then the id is generated (like we currently do in the cli) but server side.

Let me clarify, that's a hack that was installed for a specific user (which does not use it in the end).

I don't want that hack from the cli ending up server side...

We should probably rework the origin logic first then.

if the slug is not sent, then the id is generated (like we currently do in the cli) but server side.

Let me clarify, that's a hack that was installed for a specific user (which does not use it in the end).

I don't want that hack from the cli ending up server side...

We should probably rework the origin logic first then.

This (generating an id) is not a hack, it's the SWORD specification. The Slug is optional. The Cont-IRI resulting from a deposit should use/contain this Slug. Whatever we choose as Cont-IRI scheme, it must have a server-generated part (a counter, a timestamp, an UUID...) to prevent any collision.

In fact, it's not the slug that we must be server-side, actually. It's this Cont-IRI, and this IRI may contain the slug if this later has been provided.

vlorentz added inline comments.Fri, Nov 13, 3:11 PM
swh/deposit/api/common.py
253

I don't remember changing that line

@ardumont Maybe it's clearer if we explicit the different steps.

Currently, the pipeline is:

  1. user may provide a slug
  2. client generates a slug if it's not provided
  3. client sends request to server (always with slug)
  4. server generates an URL from provider_url + slug

What @douardda and I want to do it:

  1. user may provide a slug
  2. client sends request to server (sometimes with slug)
  3. server generates a slug if it's not provided
  4. server generates an URL from provider_url + slug

1 and 4 are unchanged, we are only swapping 2 and 3.

ardumont added a comment.EditedFri, Nov 13, 3:32 PM

I understand from where we are to where you propose to go.
Thanks nonetheless ;)

Heads up, there are not much users of that generation cli (just check the db, our integration test does it,
and one user which never came through).

Full disclosure, I was reticent to have to develop that generation client side in the first place...
because the slug meant something for the origin creation. It still does.

I said as much some years ago and now we want to end up having that server side...
That bugs me.

server generates an URL from provider_url + slug

I understand that.
What bugs me is that said url does not make sense (when that slug is generated), that url means nothing...

Thus why i'd like we discuss what do we want as origin first.
And the suggestion of David about the Cont-IRI will break the current deposit behavior...

Hopefully, you'll be able to make me understand better or convince me i'm wrong ;)

Cheers,

What bugs me is that said url does not make sense (when that slug is generated), that url means nothing...

It does mean something, it's an URI (resource identifier) but not an URL (resource locator)

What bugs me is that said url does not make sense (when that slug is generated), that url means nothing...

It does mean something, it's an URI (resource identifier) but not an URL (resource locator)

IMHO, URI that uses standard protocol schemes (eg. http) but are not resolvable using this protocol are evil. Using unresolvable URI is not a problem, just don't use a URI scheme that would let you think it's something else (a URL).

Given our recent discussion about origin/external_id/slug, etc... I think this diff should be marked at least on planned changes :)

vlorentz added a comment.EditedFri, Nov 20, 11:13 AM

I disagree. We don't want to re-use it, but to create a new tag entirely.

I disagree. We don't want to re-use it, but to create a new tag entirely.

agreed

Build has FAILED

Patch application report for D4463 (id=16168)

Rebasing onto a8e86a92ee...

Current branch diff-target is up to date.
Changes applied before test
commit c5f89f7eb1e6e47d11f550a64e1d184036bb0ff6
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Nov 12 14:47:09 2020 +0100

    Remove the <external_identifier> tag from the protocol.
    
    It didn't match the Atom specification, duplicates the Slug header,
    while being optional and specified as 'SHOULD match the Slug'.

Link to build: https://jenkins.softwareheritage.org/job/DDEP/job/tests-on-diff/416/
See console output for more information: https://jenkins.softwareheritage.org/job/DDEP/job/tests-on-diff/416/console

vlorentz updated this revision to Diff 16171.Mon, Nov 23, 12:14 PM

fix rebase

vlorentz added inline comments.Mon, Nov 23, 12:18 PM
swh/deposit/api/common.py
1012

... why did i do that

Build is green

Patch application report for D4463 (id=16171)

Rebasing onto a8e86a92ee...

Current branch diff-target is up to date.
Changes applied before test
commit 4f51deebc413fcaac23730d4f3521ed2293f7f08
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Nov 12 14:47:09 2020 +0100

    Remove the <external_identifier> tag from the protocol.
    
    It didn't match the Atom specification, duplicates the Slug header,
    while being optional and specified as 'SHOULD match the Slug'.

See https://jenkins.softwareheritage.org/job/DDEP/job/tests-on-diff/419/ for more details.

Build is green

Patch application report for D4463 (id=16175)

Rebasing onto a8e86a92ee...

Current branch diff-target is up to date.
Changes applied before test
commit dcf3c6e390e00e34da5ac03469da9cf7733d758c
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Nov 12 14:47:09 2020 +0100

    Remove the <external_identifier> tag from the protocol.
    
    It didn't match the Atom specification, duplicates the Slug header,
    while being optional and specified as 'SHOULD match the Slug'.

See https://jenkins.softwareheritage.org/job/DDEP/job/tests-on-diff/422/ for more details.

douardda accepted this revision.Mon, Nov 23, 12:47 PM

lgtm (besides the retries_left=3 stuff)

This revision is now accepted and ready to land.Mon, Nov 23, 12:47 PM
vlorentz updated this revision to Diff 16180.Mon, Nov 23, 1:03 PM

re-add retries_left

Build is green

Patch application report for D4463 (id=16180)

Rebasing onto a8e86a92ee...

Current branch diff-target is up to date.
Changes applied before test
commit 78befb03cfd9c7b423b7e23b6ecd5580d0699941
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Nov 12 14:47:09 2020 +0100

    Remove the <external_identifier> tag from the protocol.
    
    It didn't match the Atom specification, duplicates the Slug header,
    while being optional and specified as 'SHOULD match the Slug'.

See https://jenkins.softwareheritage.org/job/DDEP/job/tests-on-diff/425/ for more details.