Page MenuHomeSoftware Heritage

Add validation step after metadata ingestion of a metadata-only deposit
AbandonedPublic

Authored by ardumont on Oct 14 2020, 11:45 AM.

Details

Summary

Related to T2681

Depends on D4245

Test Plan

tox

Diff Detail

Repository
rDDEP Push deposit
Branch
add-validation-step
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 16312
Build 25112: Phabricator diff pipeline on jenkinsJenkins console · Jenkins
Build 25111: arc lint + arc unit

Event Timeline

Build is green

Patch application report for D4255 (id=15029)

Could not rebase; Attempt merge onto a8e6b830bb...

Updating a8e6b830..acf202c5
Fast-forward
 swh/deposit/api/deposit_update.py    |  24 +++++
 swh/deposit/cli/admin.py             |  65 ++++++------
 swh/deposit/tests/cli/conftest.py    |  12 +++
 swh/deposit/tests/cli/test_admin.py  | 189 +++++++++++++++++++++++++++++++++++
 swh/deposit/tests/cli/test_client.py |   8 +-
 5 files changed, 263 insertions(+), 35 deletions(-)
 create mode 100644 swh/deposit/tests/cli/conftest.py
 create mode 100644 swh/deposit/tests/cli/test_admin.py
Changes applied before test
commit acf202c5bb810f5700f1a317b72b8d2734f47c49
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed Oct 14 11:44:08 2020 +0200

    Add validation step after metadata ingestion of a metadata-only deposit
    
    Related to T2681

commit f17263e2ad2ac5f52fd4f2eb0deac58e9b9dc374
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Tue Oct 13 14:59:50 2020 +0200

    deposit.cli.admin: Add types and coverage on module
    
    This will allow to be more confident when we will refactor the configuration
    part.

See https://jenkins.softwareheritage.org/job/DDEP/job/tests-on-diff/234/ for more details.

ardumont added inline comments.
swh/deposit/api/deposit_update.py
302

No real idea on how to check that part... nor how to convey that information to the client.
That'd be an error on our side so around the 5xx code...

@moranegg @vlorentz any idea?

Build is green

Patch application report for D4255 (id=15040)

Could not rebase; Attempt merge onto 2f5300e60a...

Updating 2f5300e6..06f43725
Fast-forward
 swh/deposit/api/deposit_update.py    |  24 +++++
 swh/deposit/cli/admin.py             |  65 ++++++------
 swh/deposit/tests/cli/conftest.py    |  12 +++
 swh/deposit/tests/cli/test_admin.py  | 189 +++++++++++++++++++++++++++++++++++
 swh/deposit/tests/cli/test_client.py |   8 +-
 5 files changed, 263 insertions(+), 35 deletions(-)
 create mode 100644 swh/deposit/tests/cli/conftest.py
 create mode 100644 swh/deposit/tests/cli/test_admin.py
Changes applied before test
commit 06f437255a222dbd42d788f6f143366ce1b41979
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed Oct 14 11:44:08 2020 +0200

    Add validation step after metadata ingestion of a metadata-only deposit
    
    Related to T2681

commit 51446a11df5cbd6608b282135ea5f2ff30791397
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Tue Oct 13 14:59:50 2020 +0200

    deposit.cli.admin: Add types and coverage on module
    
    This will allow to be more confident when we will refactor the configuration
    part.

See https://jenkins.softwareheritage.org/job/DDEP/job/tests-on-diff/238/ for more details.

Build is green

Patch application report for D4255 (id=15051)

Could not rebase; Attempt merge onto 51446a11df...

Updating 51446a11..92634764
Fast-forward
 swh/deposit/api/deposit_update.py   |  24 +++++++
 swh/deposit/cli/admin.py            |  15 +++--
 swh/deposit/tests/cli/test_admin.py | 128 ++++++++++++++++++++++++++++++++++++
 3 files changed, 160 insertions(+), 7 deletions(-)
Changes applied before test
commit 92634764f00a0c561523f4fed8073cd9132a5c81
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed Oct 14 11:44:08 2020 +0200

    Add validation step after metadata ingestion of a metadata-only deposit
    
    Related to T2681

commit 4bb8a36d7a65ce65cf7302fd7a6b7b137fe7531e
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed Oct 14 12:40:03 2020 +0200

    deposit.cli.admin: Add remaining coverage on `swh deposit reschedule`

See https://jenkins.softwareheritage.org/job/DDEP/job/tests-on-diff/247/ for more details.

I don't see why we need this validation. The same function already called raw_extrinsic_metadata_add. If _add is unreliable (which it is not), then we have a lot of other "validations" to do everywhere else.

Additionally, the check may fail for the wrong reasons as there is no guarantee that an added object is immediately gettable, for example:

  • eventually consistent DBs, such as Cassandra
  • buffer proxy in the middle
  • a "multiplexing" proxy I'm considering to solve T2602
swh/deposit/api/deposit_update.py
291–292

why this assumption?

It has the potential to cause bugs, so just use swh.core.api.classes.stream_results. It fetches new pages lazily, so if extra pages aren't needed, it won't fetch them.

I don't see why we need this validation. The same function already called raw_extrinsic_metadata_add. If _add is unreliable (which it is not), then we have a lot of other "validations" to do everywhere else.

Additionally, the check may fail for the wrong reasons as there is no guarantee that an added object is immediately gettable, for example:

  • eventually consistent DBs, such as Cassandra
  • buffer proxy in the middle
  • a "multiplexing" proxy I'm considering to solve T2602

Yes, i was dubious to the utility of it...

But yesterday, I was in no mood to argue (I could not find relevant counter
arguments to it either).

So thanks for those \m/

@moranegg ^

swh/deposit/api/deposit_update.py
291–292

I think that since we start at "around" now (after=date_now), the
probabibility that happens tends towards 0.

But it's mostly because I thought of calling stream_results but could not
make sense on how to properly call it ¯\_(ツ)_/¯

I would not mind a sample if you have that in store ;)

swh/deposit/api/deposit_update.py
291–292

it's not 0 if for some reason HAL sent us a burst of updates at the same time (eg. a user hammering an update button in their UI)

results: Iterable[RawExtrinsicMetadata] = stream_results(
    self.storage_metadata.raw_extrinsic_metadata_get,MetadataTargetType.DIRECTORY,
            deposit_swhid,
            metadata_authority,
            after=date_now,
        )

Improve implementation (use stream_results)

swh/deposit/api/deposit_update.py
291–292

i did not say 0, i said tends towards 0.

Even with a user bursting updates, i'm not convinced we'll hit the 10k limits (default limit to the get call).
but meh...

Thanks for stream_results snippet .oO(d'oh)
I'll adapt the diff so the implementation is better ;)

Build is green

Patch application report for D4255 (id=15086)

Rebasing onto 4bb8a36d7a...

Current branch diff-target is up to date.
Changes applied before test
commit 237e75c6e5b05c8bb38cec3fe94f28d3f40ec112
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed Oct 14 11:44:08 2020 +0200

    Add validation step after metadata ingestion of a metadata-only deposit
    
    Related to T2681

See https://jenkins.softwareheritage.org/job/DDEP/job/tests-on-diff/263/ for more details.

Rebase on top of stacked diffs (which improves the overall test module)

Build is green

Patch application report for D4255 (id=15100)

Could not rebase; Attempt merge onto 6f9b0d99ee...

Updating 6f9b0d9..f20b1bc
Fast-forward
 requirements-swh.txt                     |   2 +-
 swh/icinga_plugins/deposit.py            |  57 ++++++-
 swh/icinga_plugins/tests/test_deposit.py | 274 ++++++++++++++++++++++++-------
 3 files changed, 270 insertions(+), 63 deletions(-)
Changes applied before test
commit f20b1bc8ce07668d081c2a735064af6bf59074b6
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Tue Oct 13 18:23:22 2020 +0200

    Add deposit metadata update step in deposit icinga check
    
    This extends the current icinga checks on deposit to trigger a metadata update
    when the main deposit is done. It retrieves the swhid from the deposit status
    calls. Then triggers a metadata update on the deposit using the latest deposit
    client abilities (0.3).
    
    Related to T2685

commit 5d43396856cf15f7989d5ce4e2ea8d289e9c05aa
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Thu Oct 15 14:50:52 2020 +0200

    test_deposit: Clarify unexpected output in case of assertion failures

commit 2c326c4a0a8b2f77bc7f73332301f8352b81b2c9
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Thu Oct 15 14:44:42 2020 +0200

    test_deposit: Introduce status_template helper function
    
    To ease incoming scenario updates (notably the deposit update metadata).

commit be18e1f9dd4ac7220d573443492d66a57b4981b9
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Thu Oct 15 14:42:18 2020 +0200

    test_deposit: Use f-string instead of concatenate strings

See https://jenkins.softwareheritage.org/job/DICP/job/tests-on-diff/14/ for more details.

Build is green

Patch application report for D4255 (id=15102)

Rebasing onto 7e05dbe497...

Current branch diff-target is up to date.
Changes applied before test
commit d8f1d946656e3bc896a26406ca85f06de4c756db
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed Oct 14 11:44:08 2020 +0200

    Add validation step after metadata ingestion of a metadata-only deposit
    
    Related to T2681

See https://jenkins.softwareheritage.org/job/DDEP/job/tests-on-diff/264/ for more details.

Thanks @ardumont for this diff.

Thanks @vlorentz for the comment.

I asked for that task because with the content deposit we know that an ingested deposit returns a SWHID, the SWHID is the certificate that the ingestion has succeeded.
With metadata, for now, we don't have this kind of certificate and we assume that it worked if it didn't return an error (which is in my opinion not a sufficient guarantee).

For the meantime, maybe returning something with the add function that can be similar to a certificate, for example the primary key, can play this role.
If metadata will be hashed and an identifier could be returned as a certificate this can solve my issue, maybe this diff and task needs to wait for that implementation.

swh/deposit/api/deposit_update.py
295

@vlorentz I saw somewhere :
sword-v2-atom-codemeta-in-json
should this be:
sword-v2-atom-codemeta-in-xml

If yes, this means that the format inserted to the ERMDS on the deposit side should match the chosen format.

I asked for that task because with the content deposit we know that an ingested deposit returns a SWHID, the SWHID is the certificate that the ingestion has succeeded.

The SWHID is computer by the loader. If we had a storage that silently dropped all objects, the SWHID would still be computed.

What you are saying is that the SWHID is no guarantee, so we shouldn't create guarantee for a metadata insert.

So what I would prefer is a better guarantee.

We can keep this diff or drop this diff.

In any case, seems we need to open a task about better insertion guarantees, before changing status to done.

stand-bye

this needs to be clarified.