metadata_dictionary: Systematically check input URLs before adding to graph
Closed, Public

Authored by vlorentz on Oct 25 2022, 4:03 PM.

Details

Summary

This is hopefully the definitive workaround for the PyLD issue.

Closes T4656

Diff Detail

Repository
rDCIDX Metadata indexer
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

Build has FAILED

Patch application report for D8772 (id=31621)

Rebasing onto a51cbf3965...

Current branch diff-target is up to date.
Changes applied before test
commit 4148da3ef7f6b7a5b98151aa6179502842f215ba
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Tue Oct 25 16:02:16 2022 +0200

    metadata_dictionary: Systematically check input URLs before adding to graph
    
    This is hopefully the definitive workaround for the PyLD issue.

Link to build: https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/517/
See console output for more information: https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/517/console

Harbormaster returned this revision to the author for changes because remote builds failed.
Oct 25 2022, 4:07 PM
Harbormaster failed remote builds in B32574: Diff 31621!

Build is green

Patch application report for D8772 (id=31622)

Rebasing onto a51cbf3965...

Current branch diff-target is up to date.
Changes applied before test
commit a66d5b240ab77e6d8d1b9accf43d571489a3f7f0
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Tue Oct 25 16:02:16 2022 +0200

    metadata_dictionary: Systematically check input URLs before adding to graph
    
    This is hopefully the definitive workaround for the PyLD issue.

See https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/518/ for more details.

anlambert added a subscriber: anlambert.

LGTM, added some nitpicks about typing as inline comments.

swh/indexer/metadata_dictionary/utils.py
Line 79

Typing could be more precise here.

url: Optional[str]

Lines 107–110

Could be merged into a single if block.

if url is None or " " in url or not urllib.parse.urlparse(url).netloc:
    return
This revision is now accepted and ready to land.
Oct 26 2022, 3:14 PM
swh/indexer/metadata_dictionary/utils.py
Line 79

Not really: url comes from arbitrary JSON, YAML, or XML files, so it is not guaranteed to be a string or None.
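
For illustration, here is a minimal sketch of the kind of check discussed in this thread. It is not the code from swh/indexer/metadata_dictionary/utils.py; the function name `is_valid_url` is hypothetical. It assumes the value may be of any type (since it comes from parsed JSON, YAML, or XML), so it tests the type first and then applies the merged space/netloc condition suggested above.

```python
import urllib.parse
from typing import Any


def is_valid_url(url: Any) -> bool:
    """Hypothetical helper: return True if ``url`` passes the checks
    discussed in this review (string, no spaces, has a netloc)."""
    # The value comes from arbitrary parsed documents, so it may be
    # None, a list, a number, a dict, etc. rather than a string.
    if not isinstance(url, str):
        return False
    # Merged condition from the inline comment: reject values that
    # contain a space or that have no network location.
    if " " in url or not urllib.parse.urlparse(url).netloc:
        return False
    return True
```

A caller would then add the value to the graph only when `is_valid_url(url)` is true. Checking `isinstance(url, str)` rather than `url is None` is what keeps the annotation broad: a non-string value such as a list would otherwise reach the `" " in url` test and be evaluated as a membership check instead of a substring check.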