Page MenuHomeSoftware Heritage

Allow deposit metadata update on deposit already complete
ClosedPublic

Authored by ardumont on Sep 22 2020, 7:14 PM.

Details

Summary

This allows the premise of the scenario [1] to occur.

Premise because the tids and bits about the metadata information to send to the
metadata storage is not completely clear. A first review from @moranegg and
@vlorentz to clarify that part would be nice.

[1] Update metadata for deposit whose archive and metadata are already stored
in the archive. Described in:
https://hackmd.io/yUZAp78KTdSvE4drf0_MqQ#2-Cas-d%E2%80%99utilisations-contributeur-HAL

Depends on D4057

Test Plan

tox

Diff Detail

Repository
rDDEP Push deposit
Branch
master
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 15440
Build 23790: Phabricator diff pipeline on jenkinsJenkins console · Jenkins
Build 23789: arc lint + arc unit

Unit TestsFailed

TimeTest
293 msJenkins > .tox.py3.lib.python3.7.site-packages.swh.deposit.tests.api.test_converters::test_convert_status_detail
deposit_config_path = '/tmp/pytest-of-jenkins/pytest-0/test_convert_status_detail0/deposit.yml' swh_scheduler_config = {'db': "user=postgres password=xxx dbname=tests host=127.0.0.1 port=17789 options=''"}
395 msJenkins > .tox.py3.lib.python3.7.site-packages.swh.deposit.tests.api.test_converters::test_convert_status_detail_2
deposit_config_path = '/tmp/pytest-of-jenkins/pytest-0/test_convert_status_detail_20/deposit.yml' swh_scheduler_config = {'db': "user=postgres password=xxx dbname=tests host=127.0.0.1 port=17789 options=''"}
397 msJenkins > .tox.py3.lib.python3.7.site-packages.swh.deposit.tests.api.test_converters::test_convert_status_detail_3
deposit_config_path = '/tmp/pytest-of-jenkins/pytest-0/test_convert_status_detail_30/deposit.yml' swh_scheduler_config = {'db': "user=postgres password=xxx dbname=tests host=127.0.0.1 port=17789 options=''"}
393 msJenkins > .tox.py3.lib.python3.7.site-packages.swh.deposit.tests.api.test_converters::test_convert_status_detail_edge_case
deposit_config_path = '/tmp/pytest-of-jenkins/pytest-0/test_convert_status_detail_edg0/deposit.yml' swh_scheduler_config = {'db': "user=postgres password=xxx dbname=tests host=127.0.0.1 port=17789 options=''"}
395 msJenkins > .tox.py3.lib.python3.7.site-packages.swh.deposit.tests.api.test_converters::test_convert_status_detail_empty
deposit_config_path = '/tmp/pytest-of-jenkins/pytest-0/test_convert_status_detail_emp0/deposit.yml' swh_scheduler_config = {'db': "user=postgres password=xxx dbname=tests host=127.0.0.1 port=17789 options=''"}
View Full Test Results (144 Failed)

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
swh/deposit/api/deposit_update.py
207

I'm kinda surprised that HAL-preprod had the Malformed xml metadata error
I saw you opened a task to improve the error messages.
+1 on this in a separate diff.

238

an update of the metadata is only in the REMD, not in the storage, so no metadata on revisions.
But I'm not sure I understand the question.

240

The authority is the entity that gave the information.
In this case it's the client.

Let's check with @vlorentz how to format correctly the authority, because it shouldn't be just a string, but maybe it is.

ardumont added inline comments.
swh/deposit/config.py
63

a priori irrelevant, so can go away.

swh/deposit/api/deposit_update.py
238

ok, so here is the kicker ;)

Tthe loader deposit, when sending the metadata to the metadata storage, should attach those to the directory SWHID and not the revision like it is currently doing.

@vlorentz heads up ^

Question: Is the current metadata revision migration script already doing that change?

swh/deposit/api/deposit_update.py
238

ooooh... now I get it.
With the content deposit, the metadata is sent to the loader to be added to the revision.

This is another big issue !! I didn't see coming, and I might need to check @vlorentz's script:
At the moment the metadata are on the revision because that's the place we had the metadata field.
But !!!
The content-deposit is a full tree with a snapshot, a revision, a directory. Metdata concerns all nodes of this tree.
The SWHID used in HAL is the directory SWHID, so most appropriate place is on the directory.

Maybe it should be on all nodes?

  • Add missing tests scenario (malformed xml, empty body request)
  • Fix corner case in empty existing scenario for empty request (which happens to be a possibility for the new scenario as well so it's fixed within the diff as well)
  • Drop wrong use of SWH_METADATA_AUTHORITY and send the metadata with the right metadata authority.
  • Clarify test docstrings and some comments

TODO:

  • runtime: Add functional checks on metadata received (same checks as what the checker does for the standard deposit scenario)
  • Tests: Checks the metadata written in the metadata storage are as expected

Build is green

Patch application report for D4013 (id=14265)

Rebasing onto 047588a03f...

Current branch diff-target is up to date.
Changes applied before test
commit f94c9effa2437ae5fdfd5311361dfff20d4f0a83
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Tue Sep 22 19:15:19 2020 +0200

    Allow deposit metadata update on deposit already completed
    
    This will allow to update the metadata of a deposit in "done" state. The
    metadata update occurs both in the deposit backend and in the metadata
    storage (provided some basic checks pass).

See https://jenkins.softwareheritage.org/job/DDEP/job/tests-on-diff/118/ for more details.

runtime: Add functional checks on metadata received (same checks as what the checker does for the standard deposit scenario)

D4047 (so the current diff does not grow too much)

swh/deposit/api/deposit_update.py
158

missing test for another diff.

swh/deposit/tests/api/test_deposit_update.py
312

@moranegg empty body use case i kept talking about in D4047 ;)

swh/deposit/api/deposit_update.py
158
ardumont retitled this revision from wip: Allow deposit metadata update on deposit already complete to Allow deposit metadata update on deposit already complete.Sep 25 2020, 3:25 PM
ardumont edited the test plan for this revision. (Show Details)
swh/deposit/api/deposit_update.py
216

^ shouldn't it be about the deposit client (hal, ipol, etc...) here?

@vlorentz @moranegg ^

vlorentz added inline comments.
swh/deposit/api/deposit_update.py
215–222

I'm not sure we should do that. It means we can't allow deposit updates while a storage is being replayed.

216

the provider is the deposit client

231

no, it should not replace, but add a new request.

swh/deposit/tests/api/test_deposit_update.py
138–139

it's not updated, it's added in both

146

you should check deposit.swh_id is a dir swhid, just to be sure there's no inconsistency

150–154

Remove this (see comment above)

161–162

move this at the top of the test (and also check there's a deposit request with the metadata)

177

2 (see comment above)

swh/deposit/tests/data/atom/entry-data-empty-body.xml
2 ↗(On Diff #14265)

why this change?

This revision now requires changes to proceed.Sep 25 2020, 4:58 PM
swh/deposit/api/deposit_update.py
215–222

i did not know the storage and storage metadata had the same requirements...
I'm using storage today to say storage-metadata, maybe that should change btw.

231

well, then it should be in another endpoint then, we are in the one that replaces stuff.

swh/deposit/tests/api/test_deposit_atom.py
60 ↗(On Diff #14265)

That test was green for the wrong reason...

When debugging that test because i did not understand something, i saw the 400 failure was about the missing slug, not because of the empty body request.

I needed to have another equivalent test for the new endpoint.
So I fixed it here nonetheless.

@vlorentz tl; dr; that test failed without the .xml change you asked about.

swh/deposit/tests/data/atom/entry-data-empty-body.xml
2 ↗(On Diff #14265)

because the tests was not correctly testing the empty body case.
I plan to have a follow up on that in another diff (but i forgot to mention it).

swh/deposit/api/deposit_update.py
216

yes, thanks for confirming.

So that part is wrong, thanks.
There is no self.provider to use then...

231

we are in the put endpoint not post...
which I realize now is not the right place, thanks for spotting it.

swh/deposit/tests/api/test_deposit_update.py
138–139

yes, i used the wrong endpoint, should be in the other iri that allows adding new metadata to be added.
Will change that on monday.

Thanks!

swh/deposit/tests/data/atom/entry-data-empty-body.xml
2 ↗(On Diff #14265)

I'm removing it from that diff as it's arguably not its place.

That's D4057 now which this diff will depend on (because it uses the same empty file to check something similar from another endpoint though).

Build has FAILED

Patch application report for D4013 (id=14313)

Could not rebase; Attempt merge onto b820c61d67...

Merge made by the 'recursive' strategy.
 swh/deposit/__init__.py                            |  11 +
 swh/deposit/api/__init__.py                        |   7 -
 swh/deposit/api/common.py                          |  38 ++-
 swh/deposit/api/deposit_update.py                  | 220 +++++++++++--
 swh/deposit/api/private/deposit_read.py            |  10 -
 swh/deposit/config.py                              |   7 +
 .../api/test_deposit_private_read_metadata.py      |   2 +-
 swh/deposit/tests/api/test_deposit_update.py       | 362 ++++++++++++++++++++-
 swh/deposit/tests/conftest.py                      |  24 +-
 swh/deposit/tests/test_init.py                     |   2 +-
 10 files changed, 609 insertions(+), 74 deletions(-)
Changes applied before test
commit f7613cb430b6e73e7f8e9cdd0b0787719ce65e34
Merge: b820c61d e66a4a77
Author: Jenkins user <jenkins@localhost>
Date:   Sat Sep 26 16:57:58 2020 +0000

    Merge branch 'diff-target' into HEAD

commit e66a4a77e77b4a1a4cb2d9f6820a6f131eb1c7d9
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Tue Sep 22 19:15:19 2020 +0200

    Allow deposit metadata update on deposit already completed
    
    This will allow to update the metadata of a deposit in "done" state. The
    metadata update occurs both in the deposit backend and in the metadata
    storage (provided some basic checks pass).

commit 7271e750bca43692fb88b4fbd9126e88462460ac
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Sat Sep 26 18:37:50 2020 +0200

    Fix deposit atom test to actually check the response message
    
    Prior to this commit, the actual test was wrongfully ok because the slug was
    missing. This resulted in a 400 response which passed because we missed
    checking the actual response content. This now fixes it.

commit bb0167a8945c12a22e8d63e42b291dc414b59525
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Fri Sep 25 15:15:37 2020 +0200

    Fix transition deposit to deposited state with empty post on SE IRI
    
    Prior to this the content-length could be None instead of 0, failing the
    transition.
    
    As the coverage was missing, that peculiar case was missed. This fixes it.

commit c6debbd53a1c055be30abbf738fb81aed382da49
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Fri Sep 25 14:28:10 2020 +0200

    test: Update missing scenario on update metadata with multipart (post)

commit 8c40c1e093913734549590a5893d4d610e331de3
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Fri Sep 25 14:14:20 2020 +0200

    test: Update missing scenario on update metadata with multipart (put)

Link to build: https://jenkins.softwareheritage.org/job/DDEP/job/tests-on-diff/134/
See console output for more information: https://jenkins.softwareheritage.org/job/DDEP/job/tests-on-diff/134/console

Build is green

Patch application report for D4013 (id=14314)

Could not rebase; Attempt merge onto b820c61d67...

Merge made by the 'recursive' strategy.
 swh/deposit/__init__.py                            |  11 +
 swh/deposit/api/__init__.py                        |   7 -
 swh/deposit/api/common.py                          |  38 ++-
 swh/deposit/api/deposit_update.py                  | 220 +++++++++++--
 swh/deposit/api/private/deposit_read.py            |  10 -
 swh/deposit/config.py                              |   7 +
 .../api/test_deposit_private_read_metadata.py      |   2 +-
 swh/deposit/tests/api/test_deposit_update.py       | 362 ++++++++++++++++++++-
 swh/deposit/tests/conftest.py                      |  24 +-
 .../tests/data/atom/entry-data-empty-body.xml      |   2 +-
 swh/deposit/tests/test_init.py                     |   2 +-
 11 files changed, 610 insertions(+), 75 deletions(-)
Changes applied before test
commit e7e5a24e64de82bca972db3c7c88e9714d97ca72
Merge: b820c61d 96ada866
Author: Jenkins user <jenkins@localhost>
Date:   Sat Sep 26 17:05:29 2020 +0000

    Merge branch 'diff-target' into HEAD

commit 96ada86633e473064904bab7ed1ea7168a965fe2
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Tue Sep 22 19:15:19 2020 +0200

    Allow deposit metadata update on deposit already completed
    
    This will allow to update the metadata of a deposit in "done" state. The
    metadata update occurs both in the deposit backend and in the metadata
    storage (provided some basic checks pass).

commit 7271e750bca43692fb88b4fbd9126e88462460ac
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Sat Sep 26 18:37:50 2020 +0200

    Fix deposit atom test to actually check the response message
    
    Prior to this commit, the actual test was wrongfully ok because the slug was
    missing. This resulted in a 400 response which passed because we missed
    checking the actual response content. This now fixes it.

commit bb0167a8945c12a22e8d63e42b291dc414b59525
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Fri Sep 25 15:15:37 2020 +0200

    Fix transition deposit to deposited state with empty post on SE IRI
    
    Prior to this the content-length could be None instead of 0, failing the
    transition.
    
    As the coverage was missing, that peculiar case was missed. This fixes it.

commit c6debbd53a1c055be30abbf738fb81aed382da49
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Fri Sep 25 14:28:10 2020 +0200

    test: Update missing scenario on update metadata with multipart (post)

commit 8c40c1e093913734549590a5893d4d610e331de3
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Fri Sep 25 14:14:20 2020 +0200

    test: Update missing scenario on update metadata with multipart (put)

See https://jenkins.softwareheritage.org/job/DDEP/job/tests-on-diff/135/ for more details.

Build is green

Patch application report for D4013 (id=14320)

Could not rebase; Attempt merge onto b820c61d67...

Updating b820c61d..221e4155
Fast-forward
 swh/deposit/__init__.py                            |  11 +
 swh/deposit/api/__init__.py                        |   7 -
 swh/deposit/api/common.py                          |  38 ++-
 swh/deposit/api/deposit_update.py                  | 220 +++++++++++--
 swh/deposit/api/private/deposit_read.py            |  10 -
 swh/deposit/config.py                              |   7 +
 .../api/test_deposit_private_read_metadata.py      |   2 +-
 swh/deposit/tests/api/test_deposit_update.py       | 362 ++++++++++++++++++++-
 swh/deposit/tests/conftest.py                      |  24 +-
 .../tests/data/atom/entry-data-empty-body.xml      |   2 +-
 swh/deposit/tests/test_init.py                     |   2 +-
 11 files changed, 610 insertions(+), 75 deletions(-)
Changes applied before test
commit 221e415573d1356049103eff26152076d9f23d1a
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Tue Sep 22 19:15:19 2020 +0200

    Allow deposit metadata update on deposit already completed
    
    This will allow to update the metadata of a deposit in "done" state. The
    metadata update occurs both in the deposit backend and in the metadata
    storage (provided some basic checks pass).

commit 45c7c35e471ce28529bc495a076d10e2c67b9cf7
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Sat Sep 26 18:37:50 2020 +0200

    Fix deposit atom test to actually check the response message
    
    Prior to this commit, the actual test was wrongfully ok because the slug was
    missing. This resulted in a 400 response which passed because we missed
    checking the actual response content. This now fixes it.

commit b04c457d65689ae5e278edac7faee34bb3a2908f
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Fri Sep 25 15:15:37 2020 +0200

    Fix transition deposit to deposited state with empty post on SE IRI
    
    Prior to this the content-length could be None instead of 0, failing the
    transition.
    
    As the coverage was missing, that peculiar case was missed. This fixes it.

commit f26ea4492695238164e8ac1bcec984feed68ed9d
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Fri Sep 25 14:28:10 2020 +0200

    test: Update missing scenario on update metadata with multipart (post)

commit 251a0031eada21011d1683e6d755c1813830b97b
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Fri Sep 25 14:14:20 2020 +0200

    test: Update missing scenario on update metadata with multipart (put)

See https://jenkins.softwareheritage.org/job/DDEP/job/tests-on-diff/141/ for more details.

Adapt according to remaining unadapted suggestions (and more):

  • Actually add new metadata information during update instead of replacing those (technically also move the code from the PUT to the POST method part)
  • Update docstrings accordingly
  • Rename storage attribute to storage_metadata
  • Drop self.provider setup to actually reference the deposit client as the provider (and adapt the other part already using it)
  • Send raw metadata into the metadata storage (instead of the parsed json one)

Remaining part is to actually check the metadata format stored (I'll do that in another diff so we can focus the discussion on that if need be). That diff is now way too long due to the discussions...

Build is green

Patch application report for D4013 (id=14324)

Could not rebase; Attempt merge onto f26ea44926...

Updating f26ea449..f23ad6cb
Fast-forward
 swh/deposit/__init__.py                            |  11 +
 swh/deposit/api/__init__.py                        |   7 -
 swh/deposit/api/common.py                          |  38 ++--
 swh/deposit/api/deposit_update.py                  | 221 +++++++++++++++++--
 swh/deposit/api/private/deposit_read.py            |  23 +-
 swh/deposit/config.py                              |   6 +
 .../api/test_deposit_private_read_metadata.py      |   2 +-
 swh/deposit/tests/api/test_deposit_update.py       | 236 ++++++++++++++++++++-
 swh/deposit/tests/conftest.py                      |  30 +--
 .../tests/data/atom/entry-data-empty-body.xml      |   2 +-
 swh/deposit/tests/test_init.py                     |   2 +-
 11 files changed, 500 insertions(+), 78 deletions(-)
Changes applied before test
commit f23ad6cb2a550d94b0a81b1d718f24256acb94e7
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Tue Sep 22 19:15:19 2020 +0200

    Allow deposit metadata update on deposit already completed
    
    This will allow to update the metadata of a deposit in "done" state. The
    metadata update occurs both in the deposit backend and in the metadata
    storage (provided some basic checks pass).

commit 45c7c35e471ce28529bc495a076d10e2c67b9cf7
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Sat Sep 26 18:37:50 2020 +0200

    Fix deposit atom test to actually check the response message
    
    Prior to this commit, the actual test was wrongfully ok because the slug was
    missing. This resulted in a 400 response which passed because we missed
    checking the actual response content. This now fixes it.

commit b04c457d65689ae5e278edac7faee34bb3a2908f
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Fri Sep 25 15:15:37 2020 +0200

    Fix transition deposit to deposited state with empty post on SE IRI
    
    Prior to this the content-length could be None instead of 0, failing the
    transition.
    
    As the coverage was missing, that peculiar case was missed. This fixes it.

See https://jenkins.softwareheritage.org/job/DDEP/job/tests-on-diff/143/ for more details.

@moranegg @vlorentz heads up (for tomorrow), I should have adressed all concerns now ;)

Like i said in the update diff message, there remains the check on the actual metadata stored (FIXME for now).
I'd like to dedicate to this in another iteration (diff) though.
So the eventual discussion on that can happen there.

This current diff is way too long now (due to the good discussions we had).

Cheers,

Like i said in the update diff message, there remains the check on the actual metadata stored (FIXME for now).
I'd like to dedicate to this in another iteration (diff) though.
So the eventual discussion on that can happen there.

D4058

as discussed on side-channels, it should be a PUT

This revision now requires changes to proceed.Sep 28 2020, 12:03 PM

Build is green

Patch application report for D4013 (id=14329)

Rebasing onto 45c7c35e47...

Current branch diff-target is up to date.
Changes applied before test
commit d1f3f69c9016ad6bb7d49515fc187dd1c85039eb
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Tue Sep 22 19:15:19 2020 +0200

    Allow deposit metadata update on deposit already completed
    
    This will allow to update the metadata of a deposit in "done" state. The
    metadata update occurs both in the deposit backend and in the metadata
    storage (provided some basic checks pass).

See https://jenkins.softwareheritage.org/job/DDEP/job/tests-on-diff/146/ for more details.

Let's not forget to remove the directory_missing check later

This revision is now accepted and ready to land.Sep 28 2020, 1:14 PM
swh/deposit/tests/data/atom/entry-data-empty-body.xml
2 ↗(On Diff #14265)

because the tests was not correctly testing the empty body case.
I plan to have a follow up on that in another diff (but i forgot to mention it).

D4067 ;)