Page MenuHomeSoftware Heritage

package/utils: Improve downloaded filename extraction
ClosedPublic

Authored by anlambert on Sep 14 2021, 1:52 PM.

Details

Summary

That diff improves the filename extraction for a download URL.

Two specific cases are considered, each in a dedicated commit:

  • requests follows URL redirection by default for GET requests so filename should be extracted from targetted URL when a redirection has been performed

This should fix that kind of sentry reported issue.

  • some URLs for downloading a file do not contain any filename but rather provide it in the "content-disposition" response header so ensure to extract the filename from that response header when available to avoid possible file processing issues afterwards.

This should fix the extraction of some tarballs downloaded by the opam loader for instance.

anlambert@carnavalet:/tmp$ curl -i https://codeload.github.com/abella-prover/abella/tar.gz/v2.0.2
HTTP/2 200 
access-control-allow-origin: https://render.githubusercontent.com
content-disposition: attachment; filename=abella-2.0.2.tar.gz
content-security-policy: default-src 'none'; style-src 'unsafe-inline'; sandbox
content-type: application/x-gzip
etag: "66393ca915087abb7e474f0d976918630ebb8d23250de2bd70bab0752c01708a"
strict-transport-security: max-age=31536000
vary: Authorization,Accept-Encoding,Origin
x-content-type-options: nosniff
x-frame-options: deny
x-xss-protection: 1; mode=block
date: Tue, 14 Sep 2021 11:51:16 GMT
x-github-request-id: CE86:E15A:45CCD8:573AE8:61408CB3

Warning: Binary output can mess up your terminal. Use "--output -" to tell 
Warning: curl to output it to your terminal anyway, or consider "--output 
Warning: <FILE>" to save to a file.

Related to T3468

Diff Detail

Repository
rDLDBASE Generic VCS/Package Loader
Branch
download-filename-extraction-fixes
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 23582
Build 36790: Phabricator diff pipeline on jenkinsJenkins console · Jenkins
Build 36789: arc lint + arc unit

Event Timeline

Build is green

Patch application report for D6252 (id=22641)

Rebasing onto d5e54a5eea...

Current branch diff-target is up to date.
Changes applied before test
commit a60ba533470fd4d46a753f8bc765ad706b071384
Author: Antoine Lambert <anlambert@softwareheritage.org>
Date:   Tue Sep 14 13:22:37 2021 +0200

    package/utils: Try to extract download filename from response headers
    
    Some URLs for downloading a file do not contain any filename but
    rather provide it in the "content-disposition" response header.
    
    So ensure to extract the filename from that response header when
    available to avoid possible file processing issues afterwards.

commit 0e9cd127168e3fdd7b3ee78ccb6ea55cceb3f94f
Author: Antoine Lambert <anlambert@softwareheritage.org>
Date:   Tue Sep 14 12:08:53 2021 +0200

    package/utils: Use download response URL to extract filename
    
    requests follows URL redirection by default for GET requests so
    update input URL to response one to ensure correct filename will
    be extracted from it.

See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/546/ for more details.

ardumont added a subscriber: ardumont.

Awesome. Thanks a lot.

I guess the 2nd part should take care of the P1159 opam origins ;)

For information, there is a kibana board [1] dedicated to the last run for opam (staging) where you can found those.

[1] http://kibana0.internal.softwareheritage.org:5601/goto/079dbfb481d31f3a86b8f41c3133e884

This revision is now accepted and ready to land.Sep 14 2021, 2:03 PM
olasd requested changes to this revision.Sep 14 2021, 2:07 PM
olasd added a subscriber: olasd.

Looks like the format you're expecting for the content-disposition header isn't quite standards-compliant.

https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Content-Disposition says the content-disposition filename entry is supposed to be a quoted string.

This revision now requires changes to proceed.Sep 14 2021, 2:07 PM
In D6252#161761, @olasd wrote:

Looks like the format you're expecting for the content-disposition header isn't quite standards-compliant.

https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Content-Disposition says the content-disposition filename entry is supposed to be a quoted string.

Apparently, unquoted (see github example in diff description) and quoted strings can be used so we must handle both cases. Thanks for spotting !

Update: Also handle quoted filename in content-disposition header parsing.

Please note that I have simplified tests afterwards in a separate diff (D6254).

Build is green

Patch application report for D6252 (id=22646)

Rebasing onto d5e54a5eea...

Current branch diff-target is up to date.
Changes applied before test
commit a11695f9cebc49cdbe1972de8ed89b98b5fd1311
Author: Antoine Lambert <anlambert@softwareheritage.org>
Date:   Tue Sep 14 13:22:37 2021 +0200

    package/utils: Try to extract download filename from response headers
    
    Some URLs for downloading a file do not contain any filename but
    rather provide it in the "content-disposition" response header.
    
    So ensure to extract the filename from that response header when
    available to avoid possible file processing issues afterwards.

commit 0e9cd127168e3fdd7b3ee78ccb6ea55cceb3f94f
Author: Antoine Lambert <anlambert@softwareheritage.org>
Date:   Tue Sep 14 12:08:53 2021 +0200

    package/utils: Use download response URL to extract filename
    
    requests follows URL redirection by default for GET requests so
    update input URL to response one to ensure correct filename will
    be extracted from it.

See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/548/ for more details.

Thanks!

I've suggested some refinements to the regexps, and I'd like the tests to have some more "adversarial" examples (e.g. filenames with spaces, headers with out of order fields, etc.), but this is still a good improvement over the status quo!

I think it should be possible to coalesce the two regexes into a single one too, but that's probably not critical.

swh/loader/package/utils.py
102

The first one will match until the next double quote. The second one will stop at the first space in the header (which may happen depending how broken it is).

This revision is now accepted and ready to land.Sep 14 2021, 3:06 PM
vlorentz added a subscriber: vlorentz.

The filename should be sanitized. What about using just the extension, and only if it matches [a-zA-Z0-9_-]+?

This revision now requires changes to proceed.Sep 14 2021, 3:10 PM

Update: Also handle quoted filename in content-disposition header parsing.

And if we want to really follow the spec, we need to support unescaping characters: https://www.w3.org/Protocols/rfc2616/rfc2616-sec2.html (ctrl-f "quoted-string =")

Update:

  • Use single regexp and strip quotes and spaces
  • Handle UTF-8 encoding defined in rfc5987
  • Add more test cases

This should be sufficient to handle most of the content-disposition header value.

Build is green

Patch application report for D6252 (id=22655)

Rebasing onto d5e54a5eea...

Current branch diff-target is up to date.
Changes applied before test
commit 716951dc7284968a49ed028bd33697da6b818675
Author: Antoine Lambert <anlambert@softwareheritage.org>
Date:   Tue Sep 14 13:22:37 2021 +0200

    package/utils: Try to extract download filename from response headers
    
    Some URLs for downloading a file do not contain any filename but
    rather provide it in the "content-disposition" response header.
    
    So ensure to extract the filename from that response header when
    available to avoid possible file processing issues afterwards.

commit 0e9cd127168e3fdd7b3ee78ccb6ea55cceb3f94f
Author: Antoine Lambert <anlambert@softwareheritage.org>
Date:   Tue Sep 14 12:08:53 2021 +0200

    package/utils: Use download response URL to extract filename
    
    requests follows URL redirection by default for GET requests so
    update input URL to response one to ensure correct filename will
    be extracted from it.

See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/550/ for more details.

Update: Add extracted filename extra sanitization pass.

Build is green

Patch application report for D6252 (id=22684)

Rebasing onto d5e54a5eea...

Current branch diff-target is up to date.
Changes applied before test
commit cc73c630bd9be291705e34938cd146696106af29
Author: Antoine Lambert <anlambert@softwareheritage.org>
Date:   Tue Sep 14 13:22:37 2021 +0200

    package/utils: Try to extract download filename from response headers
    
    Some URLs for downloading a file do not contain any filename but
    rather provide it in the "content-disposition" response header.
    
    So ensure to extract the filename from that response header when
    available to avoid possible file processing issues afterwards.

commit 0e9cd127168e3fdd7b3ee78ccb6ea55cceb3f94f
Author: Antoine Lambert <anlambert@softwareheritage.org>
Date:   Tue Sep 14 12:08:53 2021 +0200

    package/utils: Use download response URL to extract filename
    
    requests follows URL redirection by default for GET requests so
    update input URL to response one to ensure correct filename will
    be extracted from it.

See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/552/ for more details.

Build is green

Patch application report for D6252 (id=22730)

Rebasing onto 7329998421...

Current branch diff-target is up to date.
Changes applied before test
commit 1bd1827fa27466365bee22d9e10d759667242990
Author: Antoine Lambert <anlambert@softwareheritage.org>
Date:   Tue Sep 14 13:22:37 2021 +0200

    package/utils: Try to extract download filename from response headers
    
    Some URLs for downloading a file do not contain any filename but
    rather provide it in the "content-disposition" response header.
    
    So ensure to extract the filename from that response header when
    available to avoid possible file processing issues afterwards.

commit 049c41f6144d9cf80462c82e4f03f2a25fc45ebb
Author: Antoine Lambert <anlambert@softwareheritage.org>
Date:   Tue Sep 14 12:08:53 2021 +0200

    package/utils: Use download response URL to extract filename
    
    requests follows URL redirection by default for GET requests so
    update input URL to response one to ensure correct filename will
    be extracted from it.

See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/557/ for more details.