metadata_dictionary: Systematically check input URLs before adding to graph
ClosedPublic

Authored by vlorentz on Oct 25 2022, 4:03 PM.

Details

Summary

This is hopefully the definitive workaround for the PyLD issue.

Closes T4656

Diff Detail

Repository
rDCIDX Metadata indexer
Branch
master
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 32575
Build 51028: Phabricator diff pipeline on Jenkins
Build 51027: arc lint + arc unit

Event Timeline

Build has FAILED

Patch application report for D8772 (id=31621)

Rebasing onto a51cbf3965...

Current branch diff-target is up to date.
Changes applied before test
commit 4148da3ef7f6b7a5b98151aa6179502842f215ba
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Tue Oct 25 16:02:16 2022 +0200

    metadata_dictionary: Systematically check input URLs before adding to graph
    
    This is hopefully the definitive workaround for the PyLD issue.

Link to build: https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/517/
See console output for more information: https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/517/console

Harbormaster returned this revision to the author for changes because remote builds failed. Oct 25 2022, 4:07 PM
Harbormaster failed remote builds in B32574: Diff 31621!

Build is green

Patch application report for D8772 (id=31622)

Rebasing onto a51cbf3965...

Current branch diff-target is up to date.
Changes applied before test
commit a66d5b240ab77e6d8d1b9accf43d571489a3f7f0
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Tue Oct 25 16:02:16 2022 +0200

    metadata_dictionary: Systematically check input URLs before adding to graph
    
    This is hopefully the definitive workaround for the PyLD issue.

See https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/518/ for more details.

anlambert added a subscriber: anlambert.

LGTM, added some nitpicks about typing as inline comments.

swh/indexer/metadata_dictionary/utils.py
80

Typing could be more precise here.

url: Optional[str]
108–111

Could be merged into a single if block.

if url is None or " " in url or not urllib.parse.urlparse(url).netloc:
    return
This revision is now accepted and ready to land. Oct 26 2022, 3:14 PM
swh/indexer/metadata_dictionary/utils.py
80

Not really: url comes from arbitrary JSON, YAML, or XML files, so it is not guaranteed to be a string or None.
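
To illustrate the two inline comments together, here is a minimal sketch of a URL check that tolerates input of any type. The helper name is_valid_url is hypothetical and this is not the code in swh/indexer/metadata_dictionary/utils.py. It only assumes, as in the suggestion above, that a usable URL is a string with no spaces and a non-empty netloc.

```python
import urllib.parse
from typing import Any


def is_valid_url(url: Any) -> bool:
    """Hypothetical helper: True if url is a str that parses with a netloc."""
    # Values decoded from JSON, YAML, or XML may be lists, dicts, numbers, or None.
    if not isinstance(url, str) or " " in url:
        return False
    try:
        # urlparse raises ValueError on some malformed inputs,
        # such as an unbalanced IPv6 bracket.
        return bool(urllib.parse.urlparse(url).netloc)
    except ValueError:
        return False
```

A caller would skip adding the value to the graph whenever is_valid_url returns False. The isinstance check is what makes the Any annotation safe, since `" " in url` would otherwise raise TypeError on non-iterable values such as integers.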