Page MenuHomeSoftware Heritage

common/archive: Handle single slash after protocol in lookup_origin
ClosedPublic

Authored by anlambert on Dec 11 2020, 5:44 PM.

Details

Summary

Some external resolver or software might mangle the "://" character sequence
into ":/" for the origin qualifier value of a SWHID.

This can lead to 404 errors when trying to resolve such a SWHID as the origin
will be found missing due to the typo.

So handle that special case to avoid such errors.

Related to T2797

Closes T2878

Diff Detail

Repository
rDWAPPS Web applications
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

Build is green

Patch application report for D4729 (id=16751)

Rebasing onto ed7a4a3705...

Current branch diff-target is up to date.
Changes applied before test
commit 2c98fe5f6f36530700544347d94dbfd895c9b823
Author: Antoine Lambert <antoine.lambert@inria.fr>
Date:   Fri Dec 11 17:38:23 2020 +0100

    common/archive: Handle single slash after protocol in lookup_origin
    
    Some external resolver or software might mangle the "://" character sequence
    into ":/" for the origin qualifier value of a SWHID.
    
    This can lead to 404 errors when trying to resolve such a SWHID as the origin
    will be found missing due to the typo.
    
    So handle that special case to avoid such errors.
    
    Related to T2797
    
    Closes T2878

See https://jenkins.softwareheritage.org/job/DWAPPS/job/tests-on-diff/534/ for more details.

ardumont added inline comments.
swh/web/tests/common/test_archive.py
982

Maybe add some other checks on urls which contains swhid with context in there.

afaiu, this will replace all mispatterns, might as well make that explicit in that test.

looks to do what it sets to do (and also *ugh*, come on "external resolvers!" ¯\_(ツ)_/¯ ;)

Maybe, add some more data in your test as suggested inline.

This revision is now accepted and ready to land.Dec 11 2020, 6:19 PM

This won't work if the origin has a double slash in the path or parameters, though. But still better than nothing, I guess

This won't work if the origin has a double slash in the path or parameters, though. But still better than nothing, I guess

I guess we could use urlparse then:

10:47 $ python3
Python 3.7.3 (default, Jul 25 2020, 13:03:44) 
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from urllib.parse import urlparse
>>> urlparse("http://github.com/SoftwareHeritage/swh-web")
ParseResult(scheme='http', netloc='github.com', path='/SoftwareHeritage/swh-web', params='', query='', fragment='')
>>> urlparse("http:/github.com/SoftwareHeritage/swh-web")
ParseResult(scheme='http', netloc='', path='/github.com/SoftwareHeritage/swh-web', params='', query='', fragment='')

If the scheme is not empty but the netloc is, it means the URL is malformed and we can do the ":/" replacement to "://".

If the scheme is not empty but the netloc is, it means the URL is malformed and we can do the ":/" replacement to "://".

sounds like a plan ;)

Update:

  • Robustify implementation
  • Correct malformed origin URL when resolving a SWHID
  • Update and add more tests

Build is green

Patch application report for D4729 (id=16762)

Rebasing onto ed7a4a3705...

Current branch diff-target is up to date.
Changes applied before test
commit 9a1169466fe109f1f20373aa7b809c540184f2cc
Author: Antoine Lambert <antoine.lambert@inria.fr>
Date:   Fri Dec 11 17:38:23 2020 +0100

    common/archive: Handle single slash after protocol in lookup_origin
    
    Some external resolver or software might mangle the "://" character sequence
    into ":/" for the origin qualifier value of a SWHID.
    
    This can lead to 404 errors when trying to resolve such a SWHID as the origin
    will be found missing due to the typo.
    
    So handle that special case to avoid such errors.
    
    Related to T2797
    
    Closes T2878

See https://jenkins.softwareheritage.org/job/DWAPPS/job/tests-on-diff/535/ for more details.

Update: Improve malformed URL checks

Build is green

Patch application report for D4729 (id=16763)

Rebasing onto ed7a4a3705...

Current branch diff-target is up to date.
Changes applied before test
commit 126e3ea6c9bfaa9b93940d000182d4deab472b9a
Author: Antoine Lambert <antoine.lambert@inria.fr>
Date:   Fri Dec 11 17:38:23 2020 +0100

    common/archive: Handle single slash after protocol in lookup_origin
    
    Some external resolver or software might mangle the "://" character sequence
    into ":/" for the origin qualifier value of a SWHID.
    
    This can lead to 404 errors when trying to resolve such a SWHID as the origin
    will be found missing due to the typo.
    
    So handle that special case to avoid such errors.
    
    Related to T2797
    
    Closes T2878

See https://jenkins.softwareheritage.org/job/DWAPPS/job/tests-on-diff/536/ for more details.

swh/web/common/archive.py
266

any particular reason for this snippet of code to raise?

And if it raises, do we really want to continue silently, won't that pose problems further down the line?

swh/web/tests/common/test_archive.py
982

addressed by the new test_resolve_swhid_with_malformed_origin_url below.

swh/web/common/archive.py
266

[urlparse](vhttps://docs.python.org/3/library/urllib.parse.html) can raise exceptions and any origin URLs can be passed in a SWHID so ...

We can silent the exception as if the origin URL does not exist in the archive, an exception will be raised at the end of that function.