Page MenuHomeSoftware Heritage

Check CFF Value Types
ClosedPublic

Authored by VickyMerzOwn on Jun 24 2022, 8:01 PM.

Details

Summary

Fixes T4275

Test Plan

Test with:

  • Empty values to certain fields (eg. name)
  • Values of invalid types to some fields (eg. if name set to integer, the name field should be skipped)
  • Values that do not conform to regular expression of the field should be skipped (eg. email, date)

Diff Detail

Repository
rDCIDX Metadata indexer
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

Build has FAILED

Patch application report for D8036 (id=28937)

Rebasing onto bbf9dc8047...

Current branch diff-target is up to date.
Changes applied before test
commit b47215aa548e38983979bdd689eacb0f93df9099
Author: Satvik Vemuganti <vemugantisesha@iitbhilai.ac.in>
Date:   Fri Jun 24 23:20:41 2022 +0530

    cff: Add checks for value types

Link to build: https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/273/
See console output for more information: https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/273/console

Harbormaster returned this revision to the author for changes because remote builds failed.Jun 24 2022, 8:06 PM
Harbormaster failed remote builds in B30030: Diff 28937!

Fix introduced type errors

  • Check that content_dict is a dict prior to checking field-value-types
  • Verify presence of key in dictionary prior to checking value of the key

Build is green

Patch application report for D8036 (id=28939)

Rebasing onto bbf9dc8047...

Current branch diff-target is up to date.
Changes applied before test
commit a1dccdde953ed3169bf95717f33cc4a8cf6924be
Author: Satvik Vemuganti <vemugantisesha@iitbhilai.ac.in>
Date:   Sat Jun 25 00:16:23 2022 +0530

    cff: Add checks for value types

See https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/274/ for more details.

you can move the inner for loop in normalize_authors and combine the conditionals with the existing ones. This will considerably simplify your code

swh/indexer/metadata_dictionary/cff.py
63

bool() is redundant, and you forgot to check it is a string before running the regexp.

too restrictive; this rejects valid addresses like foo+bar@example.com. Validating email addresses with regexps (or even recursive parsers) is generally a lost cause, and in this particular case we don't really need to reject invalid email addresses anyway (as long as they are strings)

This revision now requires changes to proceed.Jun 25 2022, 7:18 AM
swh/indexer/metadata_dictionary/cff.py
69–75

you are changing the behavior here, by allowing arbitrary fieldes in the output; this would add a new source of potentially invalid data. And even when valid, CFF field names are never valid in expanded JSON-LD, so they would most likely be discarded anyway.

Checks value types in normalize methods
Deletes the extra function added for type checking

Build is green

Patch application report for D8036 (id=28965)

Could not rebase; Attempt merge onto bbf9dc8047...

Updating bbf9dc8..9d2f1c9
Fast-forward
 swh/indexer/metadata_dictionary/cff.py | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)
Changes applied before test
commit 9d2f1c9888475ccbd8b95779c429a25e578ba36a
Author: Satvik Vemuganti <vemugantisesha@iitbhilai.ac.in>
Date:   Tue Jun 28 13:49:57 2022 +0530

    cff: Check value types in normalize methods

commit a1dccdde953ed3169bf95717f33cc4a8cf6924be
Author: Satvik Vemuganti <vemugantisesha@iitbhilai.ac.in>
Date:   Sat Jun 25 00:16:23 2022 +0530

    cff: Add checks for value types

See https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/275/ for more details.

Checks value types in normalize methods
Deletes the extra function added for type checking

Build is green

Patch application report for D8036 (id=28967)

Rebasing onto bbf9dc8047...

Current branch diff-target is up to date.
Changes applied before test
commit 9d2f1c9888475ccbd8b95779c429a25e578ba36a
Author: Satvik Vemuganti <vemugantisesha@iitbhilai.ac.in>
Date:   Tue Jun 28 13:49:57 2022 +0530

    cff: Check value types in normalize methods

commit a1dccdde953ed3169bf95717f33cc4a8cf6924be
Author: Satvik Vemuganti <vemugantisesha@iitbhilai.ac.in>
Date:   Sat Jun 25 00:16:23 2022 +0530

    cff: Add checks for value types

See https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/276/ for more details.

Adds test for empty fields in CFFMapping

Build is green

Patch application report for D8036 (id=28969)

Rebasing onto bbf9dc8047...

Current branch diff-target is up to date.
Changes applied before test
commit 9a281c41d93ad8e092a68bbaea7ca55c14fd6421
Author: Satvik Vemuganti <vemugantisesha@iitbhilai.ac.in>
Date:   Tue Jun 28 15:15:38 2022 +0530

    Test cffmapping when empty fields are present

commit 9d2f1c9888475ccbd8b95779c429a25e578ba36a
Author: Satvik Vemuganti <vemugantisesha@iitbhilai.ac.in>
Date:   Tue Jun 28 13:49:57 2022 +0530

    cff: Check value types in normalize methods

commit a1dccdde953ed3169bf95717f33cc4a8cf6924be
Author: Satvik Vemuganti <vemugantisesha@iitbhilai.ac.in>
Date:   Sat Jun 25 00:16:23 2022 +0530

    cff: Add checks for value types

See https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/277/ for more details.

Build is green

Patch application report for D8036 (id=28973)

Rebasing onto bbf9dc8047...

Current branch diff-target is up to date.
Changes applied before test
commit 4d6798c766a5fe5d9250cfbf1ee71192b55f5420
Author: Satvik Vemuganti <vemugantisesha@iitbhilai.ac.in>
Date:   Sat Jun 25 00:16:23 2022 +0530

    cff: Add checks for value types

See https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/280/ for more details.

Adds test for invalid value for a field.
If a list(or any other type) is provided where a string is expected, the field is simply ignored.

Build is green

Patch application report for D8036 (id=28979)

Rebasing onto bbf9dc8047...

Current branch diff-target is up to date.
Changes applied before test
commit e564baf9504df744afc03fad21164c8a095bec67
Author: Satvik Vemuganti <vemugantisesha@iitbhilai.ac.in>
Date:   Sat Jun 25 00:16:23 2022 +0530

    cff: Add checks for value types

See https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/283/ for more details.

This revision is now accepted and ready to land.Jun 29 2022, 10:23 AM

Applied commit on master (using cherry-pick)

Build is green

Patch application report for D8036 (id=28982)

Rebasing onto 70c7e91fa8...

Current branch diff-target is up to date.
Changes applied before test
commit 1be4e184d4b9425b10715e17b783d103c852015e
Author: Satvik Vemuganti <vemugantisesha@iitbhilai.ac.in>
Date:   Sat Jun 25 00:16:23 2022 +0530

    cff: Add checks for value types

See https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/284/ for more details.

This revision was automatically updated to reflect the committed changes.