Page MenuHomeSoftware Heritage

Check document bodies contain mandatory fields, instead of just checking for emptiness.
AbandonedPublic

Authored by vlorentz on Nov 12 2020, 6:37 PM.

Details

Reviewers
ardumont
Group Reviewers
Reviewers
Summary
  1. in one of the cases it was redundant to check for emptiness before checking the fields anyway
  2. checking for emptiness does not make sense, as there are a lot of 'errors' it does not catch in non-obvious ways: a) xmlns attributes count as non-emptiness b) unknown tags/attributes count as non-emptiness

Diff Detail

Repository
rDDEP Push deposit
Branch
master
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 17070
Build 26346: Phabricator diff pipeline on jenkinsJenkins console · Jenkins
Build 26345: arc lint + arc unit

Event Timeline

Build is green

Patch application report for D4469 (id=15861)

Could not rebase; Attempt merge onto 4d7d80c1b8...

Removing swh/deposit/tests/data/atom/tei-sample.xml
Merge made by the 'recursive' strategy.
 docs/endpoints/content.rst                         |  2 -
 docs/metadata.rst                                  |  9 ---
 docs/specs/metadata_example.xml                    |  1 -
 docs/specs/spec-loading.rst                        |  3 +-
 docs/specs/spec-meta-deposit.rst                   |  2 -
 docs/specs/spec-sparse-deposit.rst                 |  2 -
 docs/user-manual.rst                               |  2 -
 swh/deposit/api/common.py                          | 27 ++++++--
 swh/deposit/api/deposit_update.py                  | 10 +--
 swh/deposit/tests/api/test_deposit.py              |  6 +-
 swh/deposit/tests/api/test_deposit_atom.py         | 75 ++++++++--------------
 .../api/test_deposit_private_read_metadata.py      |  2 +-
 swh/deposit/tests/api/test_deposit_update.py       | 14 ++--
 swh/deposit/tests/conftest.py                      |  2 +-
 swh/deposit/tests/data/atom/codemeta-sample.xml    |  3 +-
 .../entry-data-fail-metadata-functional-checks.xml |  1 -
 swh/deposit/tests/data/atom/entry-data-minimal.xml |  3 +-
 swh/deposit/tests/data/atom/entry-data0.xml        |  2 -
 swh/deposit/tests/data/atom/entry-data1.xml        |  1 -
 swh/deposit/tests/data/atom/entry-data2.xml        |  2 +-
 swh/deposit/tests/data/atom/entry-data3.xml        |  1 +
 swh/deposit/tests/data/atom/error-with-decimal.xml |  2 -
 .../data/atom/error-with-external-identifier.xml   |  7 ++
 swh/deposit/tests/data/atom/metadata.xml           |  2 -
 swh/deposit/tests/data/atom/tei-sample.xml         |  1 -
 25 files changed, 73 insertions(+), 109 deletions(-)
 create mode 100644 swh/deposit/tests/data/atom/error-with-external-identifier.xml
 delete mode 100644 swh/deposit/tests/data/atom/tei-sample.xml
Changes applied before test
commit 6bbc72af330eabb4365ee3603048fd2c839457ba
Merge: 4d7d80c1 d3f03178
Author: Jenkins user <jenkins@localhost>
Date:   Thu Nov 12 17:37:39 2020 +0000

    Merge branch 'diff-target' into HEAD

commit d3f03178191fa65bbe7bf23826a5e43b47173ab4
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Nov 12 18:37:20 2020 +0100

    Check document bodies contain mandatory fields, instead of just checking for emptiness.
    
    1. in one of the cases it was redundant to check for emptiness before
       checking the fields anyway
    2. checking for emptiness does not make sense, as there are a lot of
       'errors' it does not catch in non-obvious ways:
       a) xmlns attributes count as non-emptiness
       b) unknown tags/attributes count as non-emptiness

commit 21b667dc79142e2bd7560ead2324543406531800
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Nov 12 18:34:08 2020 +0100

    Remove test_post_deposit_atom_entry_tei.
    
    We barely support it, and no one uses it.
    
    Additionally, a future commit will outright refuse documents without
    an atom:author or atom:title.

commit a85b22ca7e5f3e2e47c3b06b4ea323bb830f1565
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Nov 12 15:01:38 2020 +0100

    Remove the <client> tag from the protocol.
    
    It didn't match the Atom specification, and isn't used anywhere.

commit 4793f2ef85561d473a0b9932367b40d4a449ae7f
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Nov 12 14:47:09 2020 +0100

    Remove the <external_identifier> tag from the protocol.
    
    It didn't match the Atom specification, duplicates the Slug header,
    while being optional and specified as 'SHOULD match the Slug'.

See https://jenkins.softwareheritage.org/job/DDEP/job/tests-on-diff/327/ for more details.

ardumont added a subscriber: ardumont.

Maybe the check for emptiness "does not make sense", i don't really see that.

In any case, I disagree with those technical changes.

With this the current deposit behavior changes...

There is a reason for the metadata checks to happen later (as in an asynchronous check task
done by a worker when the status is in deposited state).

It's to be less restrictive on the metadata input.
Because users could decide to send lots of deposit requests.
And only on the last one could there be the required metadata.
As long as the deposit state stays partial, it's fine not to send immediately
those required metadata.

With this, you are forcing the users to provide immediately the minimum required metadata.
The change is thus not exhaustive enough.

And also, what happens if that deposit request metadata is not a final one?
That then forces the subsequent eventual deposit metadata request to also send the requisite metadata.
(or to change the current deposit code to actually check previous deposit metadata request to check if
we already have those mandatory one)...

This revision now requires changes to proceed.Nov 13 2020, 9:47 AM

ack, I didn't consider partial deposits. I'll remove the check I added