Page MenuHomeSoftware Heritage

Remove the <external_identifier> tag from the protocol.
ClosedPublic

Authored by vlorentz on Nov 12 2020, 2:47 PM.

Details

Summary

It didn't match the Atom specification, duplicates the Slug header,
while being optional and specified as 'SHOULD match the Slug'.

First half of the resolution of T2766.

Diff Detail

Repository
rDDEP Push deposit
Branch
master
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 17054
Build 26321: Phabricator diff pipeline on jenkinsJenkins console · Jenkins
Build 26320: arc lint + arc unit

Event Timeline

Build is green

Patch application report for D4463 (id=15843)

Rebasing onto 125c4831b4...

Current branch diff-target is up to date.
Changes applied before test
commit 4793f2ef85561d473a0b9932367b40d4a449ae7f
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Nov 12 14:47:09 2020 +0100

    Remove the <external_identifier> tag from the protocol.
    
    It didn't match the Atom specification, duplicates the Slug header,
    while being optional and specified as 'SHOULD match the Slug'.

See https://jenkins.softwareheritage.org/job/DDEP/job/tests-on-diff/322/ for more details.

ardumont added inline comments.
swh/deposit/api/common.py
214

why remove it?

It's to retry in case of failing because of technical issues (communication for example), not trying 3 times to validate invalid deposits heh.

Something is amiss for me somewhere.

I thought we wanted to relax the slug's mandatory property (to respect the sword v2 spec).
If that's the case, we still need the external_identifier in the loop.

Because that's what's used at some point to build the origin for the loader.

What did i miss?

Because, from my understanding of another discussion with @douardda
or @moranegg, i thought it was the other way around.
Keep the external_identifier and use it if not provided by the slug.

So that way, we can go the optional slug road.

swh/deposit/api/common.py
911

Nope, it's post alright

Something is amiss for me somewhere.

I thought we wanted to relax the slug's mandatory property (to respect the sword v2 spec).
If that's the case, we still need the external_identifier in the loop.

Because that's what's used at some point to build the origin for the loader.

What did i miss?

Because, from my understanding of another discussion with @douardda
or @moranegg, i thought it was the other way around.
Keep the external_identifier and use it if not provided by the slug.

So that way, we can go the optional slug road.

The fact that the slug is optional does not make the external_identifier in the metadata file required, it's just that if the slug is not sent, then the id is generated (like we currently do in the cli) but server side.

if the slug is not sent, then the id is generated (like we currently do in the cli) but server side.

Let me clarify, that's a hack that was installed for a specific user (which does not use it in the end).

I don't want that hack from the cli ending up server side...

We should probably rework the origin logic first then.

if the slug is not sent, then the id is generated (like we currently do in the cli) but server side.

Let me clarify, that's a hack that was installed for a specific user (which does not use it in the end).

I don't want that hack from the cli ending up server side...

We should probably rework the origin logic first then.

This (generating an id) is not a hack, it's the SWORD specification. The Slug is optional. The Cont-IRI resulting from a deposit should use/contain this Slug. Whatever we choose as Cont-IRI scheme, it must have a server-generated part (a counter, a timestamp, an UUID...) to prevent any collision.

In fact, it's not the slug that we must be server-side, actually. It's this Cont-IRI, and this IRI may contain the slug if this later has been provided.

swh/deposit/api/common.py
214

I don't remember changing that line

@ardumont Maybe it's clearer if we explicit the different steps.

Currently, the pipeline is:

  1. user may provide a slug
  2. client generates a slug if it's not provided
  3. client sends request to server (always with slug)
  4. server generates an URL from provider_url + slug

What @douardda and I want to do it:

  1. user may provide a slug
  2. client sends request to server (sometimes with slug)
  3. server generates a slug if it's not provided
  4. server generates an URL from provider_url + slug

1 and 4 are unchanged, we are only swapping 2 and 3.

I understand from where we are to where you propose to go.
Thanks nonetheless ;)

Heads up, there are not much users of that generation cli (just check the db, our integration test does it,
and one user which never came through).

Full disclosure, I was reticent to have to develop that generation client side in the first place...
because the slug meant something for the origin creation. It still does.

I said as much some years ago and now we want to end up having that server side...
That bugs me.

server generates an URL from provider_url + slug

I understand that.
What bugs me is that said url does not make sense (when that slug is generated), that url means nothing...

Thus why i'd like we discuss what do we want as origin first.
And the suggestion of David about the Cont-IRI will break the current deposit behavior...

Hopefully, you'll be able to make me understand better or convince me i'm wrong ;)

Cheers,

What bugs me is that said url does not make sense (when that slug is generated), that url means nothing...

It does mean something, it's an URI (resource identifier) but not an URL (resource locator)

What bugs me is that said url does not make sense (when that slug is generated), that url means nothing...

It does mean something, it's an URI (resource identifier) but not an URL (resource locator)

IMHO, URI that uses standard protocol schemes (eg. http) but are not resolvable using this protocol are evil. Using unresolvable URI is not a problem, just don't use a URI scheme that would let you think it's something else (a URL).

Given our recent discussion about origin/external_id/slug, etc... I think this diff should be marked at least on planned changes :)

I disagree. We don't want to re-use it, but to create a new tag entirely.

I disagree. We don't want to re-use it, but to create a new tag entirely.

agreed

Build has FAILED

Patch application report for D4463 (id=16168)

Rebasing onto a8e86a92ee...

Current branch diff-target is up to date.
Changes applied before test
commit c5f89f7eb1e6e47d11f550a64e1d184036bb0ff6
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Nov 12 14:47:09 2020 +0100

    Remove the <external_identifier> tag from the protocol.
    
    It didn't match the Atom specification, duplicates the Slug header,
    while being optional and specified as 'SHOULD match the Slug'.

Link to build: https://jenkins.softwareheritage.org/job/DDEP/job/tests-on-diff/416/
See console output for more information: https://jenkins.softwareheritage.org/job/DDEP/job/tests-on-diff/416/console

swh/deposit/api/common.py
911

... why did i do that

Build is green

Patch application report for D4463 (id=16171)

Rebasing onto a8e86a92ee...

Current branch diff-target is up to date.
Changes applied before test
commit 4f51deebc413fcaac23730d4f3521ed2293f7f08
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Nov 12 14:47:09 2020 +0100

    Remove the <external_identifier> tag from the protocol.
    
    It didn't match the Atom specification, duplicates the Slug header,
    while being optional and specified as 'SHOULD match the Slug'.

See https://jenkins.softwareheritage.org/job/DDEP/job/tests-on-diff/419/ for more details.

Build is green

Patch application report for D4463 (id=16175)

Rebasing onto a8e86a92ee...

Current branch diff-target is up to date.
Changes applied before test
commit dcf3c6e390e00e34da5ac03469da9cf7733d758c
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Nov 12 14:47:09 2020 +0100

    Remove the <external_identifier> tag from the protocol.
    
    It didn't match the Atom specification, duplicates the Slug header,
    while being optional and specified as 'SHOULD match the Slug'.

See https://jenkins.softwareheritage.org/job/DDEP/job/tests-on-diff/422/ for more details.

lgtm (besides the retries_left=3 stuff)

This revision is now accepted and ready to land.Nov 23 2020, 12:47 PM

Build is green

Patch application report for D4463 (id=16180)

Rebasing onto a8e86a92ee...

Current branch diff-target is up to date.
Changes applied before test
commit 78befb03cfd9c7b423b7e23b6ecd5580d0699941
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Nov 12 14:47:09 2020 +0100

    Remove the <external_identifier> tag from the protocol.
    
    It didn't match the Atom specification, duplicates the Slug header,
    while being optional and specified as 'SHOULD match the Slug'.

See https://jenkins.softwareheritage.org/job/DDEP/job/tests-on-diff/425/ for more details.