Page MenuHomeSoftware Heritage

package/utils: Handle downloads for urls with missing schema
AcceptedPublic

Authored by KShivendu on Aug 26 2021, 6:21 AM.

Details

Reviewers
olasd
Group Reviewers
Reviewers
Summary

Some of the URLs don't have a schema (ex: http) and it blocks the loader from downloading the corresponding repos.
This diff should fix the issue.

Related T3294

Diff Detail

Repository
rDLDBASE Generic VCS/Package Loader
Branch
missing-url-schema
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 23367
Build 36463: Phabricator diff pipeline on jenkinsJenkins console · Jenkins
Build 36462: arc lint + arc unit

Event Timeline

Build has FAILED

Patch application report for D6138 (id=22211)

Rebasing onto 63a8a61cbc...

Current branch diff-target is up to date.
Changes applied before test
commit c0873e9a435a0cdc523cf71e048f4dafed2fb556
Author: KShivendu <shivendu@iitbhilai.ac.in>
Date:   Thu Aug 26 09:46:48 2021 +0530

    package/utils: Handle downloads for urls with missing schema
    
    Summary:
    Some of the URLs don't have a schema (ex: http) and it blocks the loader from downloading the corresponding repos.
    This diff should fix the issue.
    
    Related T3294
    
    Test Plan:
    
    Reviewers:
    
    Subscribers:

Link to build: https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/508/
See console output for more information: https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/508/console

Harbormaster returned this revision to the author for changes because remote builds failed.Aug 26 2021, 6:34 AM
Harbormaster failed remote builds in B23169: Diff 22211!

Build is green

Patch application report for D6138 (id=22211)

Rebasing onto 50b062adc7...

First, rewinding head to replay your work on top of it...
Applying: package/utils: Handle downloads for urls with missing schema
Changes applied before test
commit 3ef8c5213311f658d288cff4f9e39020429b9bdb
Author: KShivendu <shivendu@iitbhilai.ac.in>
Date:   Thu Aug 26 09:46:48 2021 +0530

    package/utils: Handle downloads for urls with missing schema
    
    Summary:
    Some of the URLs don't have a schema (ex: http) and it blocks the loader from downloading the corresponding repos.
    This diff should fix the issue.
    
    Related T3294
    
    Test Plan:
    
    Reviewers:
    
    Subscribers:

See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/523/ for more details.

I've triggered back the build and now it's in a happy place.

swh/loader/package/utils.py
84 ↗(On Diff #22211)

what about https?

olasd requested changes to this revision.Sep 1 2021, 2:05 PM
olasd added a subscriber: olasd.

I don't think this is something that we should do by default, because the semantics of a URI with no scheme are ambiguous in general.

If the semantics of the nix sources.json is that download URIs with no protocol have an implicit http scheme added, then such url mangling should be implemented in the nix loader directly.

This revision now requires changes to proceed.Sep 1 2021, 2:05 PM

I don't think this is something that we should do by default, because the semantics of
a URI with no scheme are ambiguous in general. If the semantics of the nix
sources.json is that download URIs with no protocol have an implicit http scheme
added, then such url mangling should be implemented in the nix loader directly.

Good point. The function is indeed used by all package loaders. As the issue is only
raised for the nixguix loader, let's try to fix it only for it first. We'll see if
that's enough.

So, that probably means you want to override the main loader implementation which calls
the download function [1] in swh/loader/package/nixguix/loader.py with something like:

def download_package(
    self, p_info: TPackageInfo, tmpdir: str
) -> List[Tuple[str, Mapping]]:
    if not urlparse(p_info.url).scheme:
        p_info = attr.evolve(p_info, url=f"http://{url}")  # or https... ;)
    return super().download_package(p_info, tmpdir)

[1] https://forge.softwareheritage.org/source/swh-loader-core/browse/master/swh/loader/package/loader.py$343-378

You could either override download_package to try https, then fall back to http if that fails, or change the NixGuixPackageInfo to add http:// to urls missing a scheme (https://forge.softwareheritage.org/source/swh-loader-core/browse/master/swh/loader/package/nixguix/loader.py$52). Changing NixGuixPackageInfo feels cleaner/more explicit, but overriding the download method (to get the https -> http fallback) may be more flexible.

You could either override download_package to try https, then fall back to http if that fails, or change the NixGuixPackageInfo to add http:// to urls missing a scheme (https://forge.softwareheritage.org/source/swh-loader-core/browse/master/swh/loader/package/nixguix/loader.py$52). Changing NixGuixPackageInfo feels cleaner/more explicit, but overriding the download method (to get the https -> http fallback) may be more flexible.

I agree as well. I liked your cleaner approach indeed.

Note that we have some future improvments regarding ftp [1] which is unsupported as well (not for that diff).
Just pointing out that issue which may help in deciding how to implement best the missing or unsupported schema
(That may also not help at all ¯\_(ツ)_/¯).

oh there is even a task for it ;)

[1] https://sentry.softwareheritage.org/share/issue/7dc92745d96442d482a493c64b6eae91/

[2] T2687

  • Handle missing schema only for niguix (instead of doing it for all types of downloads)

Build is green

Patch application report for D6138 (id=22413)

Rebasing onto 50b062adc7...

First, rewinding head to replay your work on top of it...
Applying: nixguix: Handle downloads for urls with missing schema
Changes applied before test
commit 87211e3a078590a1a6b8e7d38df6e39ff848d2ac
Author: KShivendu <shivendu@iitbhilai.ac.in>
Date:   Thu Aug 26 09:46:48 2021 +0530

    nixguix: Handle downloads for urls with missing schema
    
    Some of the URLs don't have a schema (ex: http) and it blocks the loader from downloading the corresponding repos.
    This diff should fix the issue.
    
    Related T3294

See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/533/ for more details.

@ardumont @olasd can we do this:

  • Add a new property fallback_schema = ['https', 'http'] in PackageLoader
  • Change download_package to try different schemas from fallback_schema list (if url/schema is valid)
  • Change this (fallback_schema) property in the inherited classes (from PackageLoader) as per requirements

Thanks for this change!

If we're to generalize this change for other loaders, then I think the default should be "no fallback" (i.e. the fallback behavior should be opt-in).

swh/loader/package/nixguix/loader.py
53

Could this get a comment/pointer to the nix/guix specification where this fallback is defined?

This revision is now accepted and ready to land.Sep 15 2021, 11:42 AM

Is it safe to land this now?

And should we implement multiple fallbacks feature?

swh/loader/package/nixguix/loader.py
53

Are you talking about Package loader specifications, or something else?

vlorentz added inline comments.
swh/loader/package/nixguix/loader.py
53

documentation of Nix and/or Guix themselves, not documentation of Software Heritage