Page MenuHomeSoftware Heritage

Adapt existing POST to a collection to allow metadata-only deposit
ClosedPublic

Authored by ardumont on Nov 13 2020, 6:55 PM.

Details

Summary

Endpoint:

POST /1/<collection>
HEADER: Content-type: application/atom+xml;type=entry

And the xml provided by the client contains either:

<swh:deposit>
  <swh:reference>
    <swh:object swhid="{swhid}" />
  </swh:reference>
</swh:deposit>

or:

<swh:deposit>
  <swh:reference>
    <swh:origin url="{url}" />
  </swh:reference>
</swh:deposit>

Any invalid swhid raises a 400 bad request. If everything passes, a 201
response with usual deposit receipt is received by the deposit user. The
deposit row is updated with status "done", complete_date and reception_date to
the same and current date, the swhid columns are updated as well if any.

Note: Technically, this is reusing the existing code from the metadata update
scenario. The current metadata update still happens the same way as before.

There should be an extra diff to also relax the slug constraint D4494. I keep
this modification out of this diff as it would lead otherwise to a lot more
noise in that diff (and the current diff is already too long).

Related to T2537

Test Plan

tox

Diff Detail

Repository
rDDEP Push deposit
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

Adapt according to review:

  • Extract an swh.deposit.utils.compute_metadata_context and use it

Also, save deposit swhid and swhid context respectively in their own dedicated
column (matching what's done for deposit with contents)

ardumont retitled this revision from Allow metadata-only deposit scenario to Adapt existing POST to a collection to allow metadata-only deposit.Nov 17 2020, 2:45 PM

Build is green

Patch application report for D4475 (id=15934)

Rebasing onto c1a45162d3...

Current branch diff-target is up to date.
Changes applied before test
commit 58168a9164beb352861e492b2559b00f733d59b9
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Fri Nov 13 18:27:23 2020 +0100

    Adapt existing POST to a collection to allow metadata-only deposit
    
    The following endpoint is adapted to allow it:
    
    POST /1/<collection>
    HEADER: Content-type: application/atom+xml;type=entry
    And the xml provided by the client contains either:
    
    <swh:deposit>
      <swh:reference>
        <swh:object swhid="{swhid}" />
      </swh:reference>
    </swh:deposit>
    or:
    
    <swh:deposit>
      <swh:reference>
        <swh:origin url="{url}" />
      </swh:reference>
    </swh:deposit>
    
    Any invalid swhid raises a 400 bad request. If everything passes, a 201
    response with usual deposit receipt is received by the deposit user. The
    deposit row is updated with status "done", complete_date and reception_date to
    the same and current date, the swhid columns are updated as well if any.
    
    Note: This technically is reusing the existing code from the metadata update
    scenario. The current metadata update still happens the same way as before.
    
    Related to T2537

commit ce7ab89f267d84dabe032d2dcbe26f21b7e165f0
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Fri Nov 13 14:32:05 2020 +0100

    metadata-only: Start deposit
    
    It does nothing right now but add the detection of the metadata-only deposit.
    And fails in case of invalid input.
    
    Related to T2537

See https://jenkins.softwareheritage.org/job/DDEP/job/tests-on-diff/338/ for more details.

swh/deposit/api/common.py
674

Hmm actually, nevermind. I just re-read the metadata spec.

It would be an issue currently because the config is not part of the unique key, but olasd is working on that.

702

ok

708–711

but i don't know what to do for that case...

return an HTTP error

swh/deposit/tests/api/test_deposit_metadata.py
159–172

Write the expected data in pytest.mark.parametrize.

ardumont marked an inline comment as done.
  • Add tests on swh.deposit.utils.compute_metadata_context
  • Simplify main tests to reuse utils.compute_metadata_context

Build is green

Patch application report for D4475 (id=15937)

Rebasing onto c1a45162d3...

Current branch diff-target is up to date.
Changes applied before test
commit 85dc1f7f3d910ffff66ef00dd875794997c4b7b3
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Fri Nov 13 18:27:23 2020 +0100

    Adapt existing POST to a collection to allow metadata-only deposit
    
    The following endpoint is adapted to allow it:
    
    POST /1/<collection>
    HEADER: Content-type: application/atom+xml;type=entry
    And the xml provided by the client contains either:
    
    <swh:deposit>
      <swh:reference>
        <swh:object swhid="{swhid}" />
      </swh:reference>
    </swh:deposit>
    or:
    
    <swh:deposit>
      <swh:reference>
        <swh:origin url="{url}" />
      </swh:reference>
    </swh:deposit>
    
    Any invalid swhid raises a 400 bad request. If everything passes, a 201
    response with usual deposit receipt is received by the deposit user. The
    deposit row is updated with status "done", complete_date and reception_date to
    the same and current date, the swhid columns are updated as well if any.
    
    Note: This technically is reusing the existing code from the metadata update
    scenario. The current metadata update still happens the same way as before.
    
    Related to T2537

commit ce7ab89f267d84dabe032d2dcbe26f21b7e165f0
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Fri Nov 13 14:32:05 2020 +0100

    metadata-only: Start deposit
    
    It does nothing right now but add the detection of the metadata-only deposit.
    And fails in case of invalid input.
    
    Related to T2537

See https://jenkins.softwareheritage.org/job/DDEP/job/tests-on-diff/339/ for more details.

swh/deposit/tests/api/test_deposit_metadata.py
159–172

I understand your suggestion as write the expected RawExtrinsicMetadata for each input...

so "Ouch", that will be quite difficult to read.
(you are regularly saying to me how it is not that easy to read when i do that ;)

I settled on updating the diff to reuse the utils function compute_metadata_context (which now has dedicated tests, so we can trust it more somehow).

swh/deposit/api/common.py
708–711

Let's discuss that in D4493 which tries to do that ;)

A few more comments and it should be good

swh/deposit/api/common.py
626–629

What does "In the nominal scenario" mean?

690–694

Simpler IMO:

swhid_core: Union[str, SWHID]
if isinstance(swhid_reference, str):
    swhid_core = swhid_reference
else:
    swhid_core = attr.evolve(swhid_reference, metadata={})
714–718

I don't understand how this conditional is related to checking if it's a metadata-only deposit

719–725

Why does a function named _store_metadata_deposit update the deposit's status? Shouldn't this be done in the caller function? (and you might be able to drop the conditional that way)

swh/deposit/tests/api/test_deposit_metadata.py
159–172

I settled on updating the diff to reuse the utils function compute_metadata_context (which now has dedicated tests, so we can trust it more somehow).

Good!

This revision now requires changes to proceed.Nov 17 2020, 5:35 PM
swh/deposit/api/common.py
626–629

well, no bump in the road, all user inputs are correct so we can do the actual job properly.

714–718

well, aside from specifically add a flag "with_metadata_only_deposit" (or something), how do you do detect it?

For me we have currently 2 cases:

  • metadata-update: in that case, the deposit status is done (it's a completed one) and its swhid and swhid context are set
  • metadata-only deposit: in that case the deposit is new, so its status is something like 'partial' and swhid and swhid context are null.

But i'm fine if you ask me to open a flag instead, it may be clearer that way.

719–725

oh yeah, indeed, it crossed my mind and then i forget to act on it.
Thanks.

swh/deposit/api/common.py
719–725

Why does a function named _store_metadata_deposit update the deposit's status?

to be fair, i mentioned it in the docstring...

Shouldn't this be done in the caller function?

yeah that'd be more clean.
But, that's tedious with the current way of dealing with errors... that returns a dict... and not a tuple of stuff i'd like to have... The signature will become ugly...

(and you might be able to drop the conditional that way)

indeed, that's the advantage.

ardumont edited the summary of this revision. (Show Details)
  • Adapt according to latest good remarks from @vlorentz ;)
  • squash last 2 commits into 1 dedicated to the metadata-only deposit

Build is green

Patch application report for D4475 (id=15947)

Rebasing onto c1a45162d3...

Current branch diff-target is up to date.
Changes applied before test
commit 406790effb7080befe2eef6b0c7734a751e16870
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Fri Nov 13 14:32:05 2020 +0100

    Adapt existing POST to a collection to allow metadata-only deposit
    
    The following endpoint is adapted to allow it:
    
    POST /1/<collection>
    HEADER: Content-type: application/atom+xml;type=entry
    And the xml provided by the client contains either:
    
    <swh:deposit>
      <swh:reference>
        <swh:object swhid="{swhid}" />
      </swh:reference>
    </swh:deposit>
    or:
    
    <swh:deposit>
      <swh:reference>
        <swh:origin url="{url}" />
      </swh:reference>
    </swh:deposit>
    
    Any invalid swhid raises a 400 bad request. If everything passes, a 201
    response with usual deposit receipt is received by the deposit user. The
    deposit row is updated with status "done", complete_date and reception_date to
    the same and current date, the swhid columns are updated as well if any.
    
    Note: Technically, this is reusing the existing code from the metadata update
    scenario. The current metadata update still happens the same way as before.
    
    Related to T2537

See https://jenkins.softwareheritage.org/job/DDEP/job/tests-on-diff/346/ for more details.

swh/deposit/api/common.py
788–790

tuple = success, dict = error? meh

what about an exception instead?

swh/deposit/api/common.py
788–790

tuple = success, dict = error? meh

well, yeah...

what about an exception instead?

I don't really want to plunge into refactoring the current actual exception handling in the deposit right now.

swh/deposit/api/common.py
788–790

what about returning (boolean, result), like the metadata validation function?

Rework local exception handling and keep the type returned by the method
_store_metadata_deposit consistent

swh/deposit/api/common.py
788–790

I was ok with the last suggestion and i implemented it accordingly.

But in the end, that just added another indirection...
So in the end, i prefer adding some exception handling locally instead (as per your first proposal).

(We will adapt the other part of the deposit similarly later if we agree upon that ;)

Build is green

Patch application report for D4475 (id=15950)

Rebasing onto c1a45162d3...

Current branch diff-target is up to date.
Changes applied before test
commit ec8c23e3f314234d45c6f0d50c2e366084385c4f
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Fri Nov 13 14:32:05 2020 +0100

    Adapt existing POST to a collection to allow metadata-only deposit
    
    The following endpoint is adapted to allow it:
    
    POST /1/<collection>
    HEADER: Content-type: application/atom+xml;type=entry
    And the xml provided by the client contains either:
    
    <swh:deposit>
      <swh:reference>
        <swh:object swhid="{swhid}" />
      </swh:reference>
    </swh:deposit>
    or:
    
    <swh:deposit>
      <swh:reference>
        <swh:origin url="{url}" />
      </swh:reference>
    </swh:deposit>
    
    Any invalid swhid raises a 400 bad request. If everything passes, a 201
    response with usual deposit receipt is received by the deposit user. The
    deposit row is updated with status "done", complete_date and reception_date to
    the same and current date, the swhid columns are updated as well if any.
    
    Note: Technically, this is reusing the existing code from the metadata update
    scenario. The current metadata update still happens the same way as before.
    
    Related to T2537

See https://jenkins.softwareheritage.org/job/DDEP/job/tests-on-diff/349/ for more details.

Simplify the _store_metadata_deposit type [1]

[1] I kept the Union[Dict, Tuple] by mistake and mypy did not complain because
it was still true ¯\_(ツ)_/¯

Build is green

Patch application report for D4475 (id=15951)

Rebasing onto c1a45162d3...

Current branch diff-target is up to date.
Changes applied before test
commit 345dffe98b52d0a548f6dedc4499b8eedaa886ba
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Fri Nov 13 14:32:05 2020 +0100

    Adapt existing POST to a collection to allow metadata-only deposit
    
    The following endpoint is adapted to allow it:
    
    POST /1/<collection>
    HEADER: Content-type: application/atom+xml;type=entry
    And the xml provided by the client contains either:
    
    <swh:deposit>
      <swh:reference>
        <swh:object swhid="{swhid}" />
      </swh:reference>
    </swh:deposit>
    or:
    
    <swh:deposit>
      <swh:reference>
        <swh:origin url="{url}" />
      </swh:reference>
    </swh:deposit>
    
    Any invalid swhid raises a 400 bad request. If everything passes, a 201
    response with usual deposit receipt is received by the deposit user. The
    deposit row is updated with status "done", complete_date and reception_date to
    the same and current date, the swhid columns are updated as well if any.
    
    Note: Technically, this is reusing the existing code from the metadata update
    scenario. The current metadata update still happens the same way as before.
    
    Related to T2537

See https://jenkins.softwareheritage.org/job/DDEP/job/tests-on-diff/350/ for more details.

ardumont added inline comments.
swh/deposit/api/common.py
652

heh, that's not a possible situation...
(if we are here, that means we have extracted the swhid or origin reference so it's not an empty body request ¯\_(ツ)_/¯).

789

and now i missed some cases, fixing as well.

  • Update docstring appropriately (add the Raises chapter)
  • Add missing test case about functionally invalid metadata
  • Drop the empty body request check which was impossible to reach in the first place (put in back in place in the deposit update scenario only)
  • Change the with_deposit_origin to deposit_origin as Optional[str] (used in deposit update scenario)

I should be done now...

Build is green

Patch application report for D4475 (id=15952)

Rebasing onto c1a45162d3...

Current branch diff-target is up to date.
Changes applied before test
commit cf1e30350c7f5b61208595a85f352b805b0d3d7a
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Fri Nov 13 14:32:05 2020 +0100

    Adapt existing POST to a collection to allow metadata-only deposit
    
    The following endpoint is adapted to allow it:
    
    POST /1/<collection>
    HEADER: Content-type: application/atom+xml;type=entry
    And the xml provided by the client contains either:
    
    <swh:deposit>
      <swh:reference>
        <swh:object swhid="{swhid}" />
      </swh:reference>
    </swh:deposit>
    or:
    
    <swh:deposit>
      <swh:reference>
        <swh:origin url="{url}" />
      </swh:reference>
    </swh:deposit>
    
    Any invalid swhid raises a 400 bad request. If everything passes, a 201
    response with usual deposit receipt is received by the deposit user. The
    deposit row is updated with status "done", complete_date and reception_date to
    the same and current date, the swhid columns are updated as well if any.
    
    Note: Technically, this is reusing the existing code from the metadata update
    scenario. The current metadata update still happens the same way as before.
    
    Related to T2537

See https://jenkins.softwareheritage.org/job/DDEP/job/tests-on-diff/351/ for more details.

  • Drop <client> mention in test xml data (i recall some diff is opened to try and remove those)
  • Fix docstring old reference to with_deposit_origin

And i'm done now

ardumont marked 13 inline comments as done.

Build is green

Patch application report for D4475 (id=15953)

Rebasing onto c1a45162d3...

Current branch diff-target is up to date.
Changes applied before test
commit b86840cd1e2278ebfbcfdc6d57f3384f28fea96c
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Fri Nov 13 14:32:05 2020 +0100

    Adapt existing POST to a collection to allow metadata-only deposit
    
    The following endpoint is adapted to allow it:
    
    POST /1/<collection>
    HEADER: Content-type: application/atom+xml;type=entry
    And the xml provided by the client contains either:
    
    <swh:deposit>
      <swh:reference>
        <swh:object swhid="{swhid}" />
      </swh:reference>
    </swh:deposit>
    or:
    
    <swh:deposit>
      <swh:reference>
        <swh:origin url="{url}" />
      </swh:reference>
    </swh:deposit>
    
    Any invalid swhid raises a 400 bad request. If everything passes, a 201
    response with usual deposit receipt is received by the deposit user. The
    deposit row is updated with status "done", complete_date and reception_date to
    the same and current date, the swhid columns are updated as well if any.
    
    Note: Technically, this is reusing the existing code from the metadata update
    scenario. The current metadata update still happens the same way as before.
    
    Related to T2537

See https://jenkins.softwareheritage.org/job/DDEP/job/tests-on-diff/352/ for more details.

swh/deposit/api/common.py
788–790

T2792 to rework the exception handling.

This revision is now accepted and ready to land.Nov 18 2020, 10:14 AM

Looks good.
I have only a couple of naming remarks.

and a bigger question about identifying the same SWHID in the ERMDS.

swh/deposit/api/common.py
633

object referenced (SWhiD or origin url)

641

the word origin here doesn't mean a url origin, right?
maybe use with_original_deposit?

663

there is a notion of first_name and last_name for a client?

694

naming here is difficult to follow.
A suggestion:
use object_reference which can be an origin or a swhid_core

for the context without the swhid_core use swhid_only_context
for the swhid wth context swhid_with_context

797

looks like the same date, is that normal behavior? it's when there is only one deposit request without in-progress right?

swh/deposit/tests/api/test_deposit_metadata.py
145

is swhid_context a full SWHID or just the context?

swh/deposit/tests/data/atom/entry-data-with-swhid-fail-metadata-functional-checks.xml
6

is the ur:uuid supposed to be the external_id?
or the external_id is now only the slug?

swh/deposit/utils.py
109

isn't it called visit in the SWHID context? is that different on the ERMDS?

114

this is called anchor in a SWHID.
Looks like the context names sent to the ERMDS is a bit different than what is found in a SWHID.
Isn't it risky to not agree on the same thing that is identified?
@vlorentz

swh/deposit/tests/data/atom/entry-data-with-swhid-fail-metadata-functional-checks.xml
6

that's not an external_id, we don't do anything with this value

swh/deposit/utils.py
109

yes and yes.

114

this is called anchor in a SWHID.
Looks like the context names sent to the ERMDS is a bit different than what is found in a SWHID.

yes and it's intentional

Isn't it risky to not agree on the same thing that is identified?

what do you mean?

I'll land this in this state to avoid a gazillionth rebase again on the subsequent diffs.

swh/deposit/api/common.py
641

yes, it does mean an origin url as per the deposit update scenario.


and lol, I initially made that a bool and called this with_origin_deposit... pretty much similar.

swh/deposit/tests/api/test_deposit_metadata.py
145

a full swhid with context

here, well in all the diff, a:

  • swhid core is really just a core swhid (no qualifiers in there) as per the specification
  • swhid_context is a swhid with qualifiers
  • swhid_reference is just one or the other, completely dependent of what the deposit client sent to the server.
swh/deposit/tests/data/atom/entry-data-with-swhid-fail-metadata-functional-checks.xml
6

I think some users sent this along and to ensure it passed fine, that ended up in some of this xml test data.

and we can add another diff which discuss better names.
It will be simpler to follow through (that diff is too big).