Details

Reviewers

ardumont
olasd
vlorentz

Group Reviewers

Reviewers

Maniphest Tasks

T3468: staging: current opam loading issues

Commits

rDLDBASE1bd1827fa274: package/utils: Try to extract download filename from response headers
rDLDBASE049c41f6144d: package/utils: Use download response URL to extract filename

Summary

That diff improves the filename extraction for a download URL.

Two specific cases are considered, each in a dedicated commit:

requests follows URL redirection by default for GET requests so filename should be extracted from targetted URL when a redirection has been performed

This should fix that kind of sentry reported issue.

some URLs for downloading a file do not contain any filename but rather provide it in the "content-disposition" response header so ensure to extract the filename from that response header when available to avoid possible file processing issues afterwards.

This should fix the extraction of some tarballs downloaded by the opam loader for instance.

anlambert@carnavalet:/tmp$ curl -i https://codeload.github.com/abella-prover/abella/tar.gz/v2.0.2
HTTP/2 200 
access-control-allow-origin: https://render.githubusercontent.com
content-disposition: attachment; filename=abella-2.0.2.tar.gz
content-security-policy: default-src 'none'; style-src 'unsafe-inline'; sandbox
content-type: application/x-gzip
etag: "66393ca915087abb7e474f0d976918630ebb8d23250de2bd70bab0752c01708a"
strict-transport-security: max-age=31536000
vary: Authorization,Accept-Encoding,Origin
x-content-type-options: nosniff
x-frame-options: deny
x-xss-protection: 1; mode=block
date: Tue, 14 Sep 2021 11:51:16 GMT
x-github-request-id: CE86:E15A:45CCD8:573AE8:61408CB3

Warning: Binary output can mess up your terminal. Use "--output -" to tell 
Warning: curl to output it to your terminal anyway, or consider "--output 
Warning: <FILE>" to save to a file.

Related to T3468

Diff Detail

Repository

rDLDBASE Generic VCS/Package Loader

Branch

download-filename-extraction-fixes

Lint

No Linters Available

Unit

No Unit Test Coverage

Build Status

Buildable 23662
Build 36936: Phabricator diff pipeline on jenkins	Jenkins console · Jenkins
Build 36935: arc lint + arc unit

Event Timeline

anlambert created this revision.Sep 14 2021, 1:52 PM

Herald added a reviewer: Reviewers. · View Herald TranscriptSep 14 2021, 1:53 PM

Build is green

Patch application report for D6252 (id=22641)

Rebasing onto d5e54a5eea...

Current branch diff-target is up to date.

Changes applied before test

commit a60ba533470fd4d46a753f8bc765ad706b071384
Author: Antoine Lambert <anlambert@softwareheritage.org>
Date:   Tue Sep 14 13:22:37 2021 +0200

    package/utils: Try to extract download filename from response headers
    
    Some URLs for downloading a file do not contain any filename but
    rather provide it in the "content-disposition" response header.
    
    So ensure to extract the filename from that response header when
    available to avoid possible file processing issues afterwards.

commit 0e9cd127168e3fdd7b3ee78ccb6ea55cceb3f94f
Author: Antoine Lambert <anlambert@softwareheritage.org>
Date:   Tue Sep 14 12:08:53 2021 +0200

    package/utils: Use download response URL to extract filename
    
    requests follows URL redirection by default for GET requests so
    update input URL to response one to ensure correct filename will
    be extracted from it.

See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/546/ for more details.

Harbormaster completed remote builds in B23577: Diff 22641.Sep 14 2021, 1:57 PM

anlambert requested review of this revision.Sep 14 2021, 1:57 PM

ardumont edited the summary of this revision. (Show Details)Sep 14 2021, 1:59 PM

ardumont added a task: T3468: staging: current opam loading issues.

anlambert added a child revision: D6254: package/tests/test_utils: Remove code duplication.Sep 14 2021, 2:00 PM

Awesome. Thanks a lot.

I guess the 2nd part should take care of the P1159 opam origins ;)

For information, there is a kibana board [1] dedicated to the last run for opam (staging) where you can found those.

[1] http://kibana0.internal.softwareheritage.org:5601/goto/079dbfb481d31f3a86b8f41c3133e884

This revision is now accepted and ready to land.Sep 14 2021, 2:03 PM

Looks like the format you're expecting for the content-disposition header isn't quite standards-compliant.

https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Content-Disposition says the content-disposition filename entry is supposed to be a quoted string.

This revision now requires changes to proceed.Sep 14 2021, 2:07 PM

In D6252#161761, @olasd wrote:

Looks like the format you're expecting for the content-disposition header isn't quite standards-compliant.

https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Content-Disposition says the content-disposition filename entry is supposed to be a quoted string.

Apparently, unquoted (see github example in diff description) and quoted strings can be used so we must handle both cases. Thanks for spotting !

Update: Also handle quoted filename in content-disposition header parsing.

Please note that I have simplified tests afterwards in a separate diff (D6254).

Build is green

Patch application report for D6252 (id=22646)

Rebasing onto d5e54a5eea...

Current branch diff-target is up to date.

Changes applied before test

commit a11695f9cebc49cdbe1972de8ed89b98b5fd1311
Author: Antoine Lambert <anlambert@softwareheritage.org>
Date:   Tue Sep 14 13:22:37 2021 +0200

    package/utils: Try to extract download filename from response headers
    
    Some URLs for downloading a file do not contain any filename but
    rather provide it in the "content-disposition" response header.
    
    So ensure to extract the filename from that response header when
    available to avoid possible file processing issues afterwards.

commit 0e9cd127168e3fdd7b3ee78ccb6ea55cceb3f94f
Author: Antoine Lambert <anlambert@softwareheritage.org>
Date:   Tue Sep 14 12:08:53 2021 +0200

    package/utils: Use download response URL to extract filename
    
    requests follows URL redirection by default for GET requests so
    update input URL to response one to ensure correct filename will
    be extracted from it.

See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/548/ for more details.

Harbormaster completed remote builds in B23582: Diff 22646.Sep 14 2021, 2:31 PM

Thanks!

I've suggested some refinements to the regexps, and I'd like the tests to have some more "adversarial" examples (e.g. filenames with spaces, headers with out of order fields, etc.), but this is still a good improvement over the status quo!

I think it should be possible to coalesce the two regexes into a single one too, but that's probably not critical.

swh/loader/package/utils.py
116	The first one will match until the next double quote. The second one will stop at the first space in the header (which may happen depending how broken it is).

This revision is now accepted and ready to land.Sep 14 2021, 3:06 PM

The filename should be sanitized. What about using just the extension, and only if it matches [a-zA-Z0-9_-]+?

This revision now requires changes to proceed.Sep 14 2021, 3:10 PM

In D6252#161774, @anlambert wrote:

Update: Also handle quoted filename in content-disposition header parsing.

And if we want to really follow the spec, we need to support unescaping characters: https://www.w3.org/Protocols/rfc2616/rfc2616-sec2.html (ctrl-f "quoted-string =")

Update:

Use single regexp and strip quotes and spaces
Handle UTF-8 encoding defined in rfc5987
Add more test cases

This should be sufficient to handle most of the content-disposition header value.

Build is green

Patch application report for D6252 (id=22655)

Rebasing onto d5e54a5eea...

Current branch diff-target is up to date.

Changes applied before test

commit 716951dc7284968a49ed028bd33697da6b818675
Author: Antoine Lambert <anlambert@softwareheritage.org>
Date:   Tue Sep 14 13:22:37 2021 +0200

    package/utils: Try to extract download filename from response headers
    
    Some URLs for downloading a file do not contain any filename but
    rather provide it in the "content-disposition" response header.
    
    So ensure to extract the filename from that response header when
    available to avoid possible file processing issues afterwards.

commit 0e9cd127168e3fdd7b3ee78ccb6ea55cceb3f94f
Author: Antoine Lambert <anlambert@softwareheritage.org>
Date:   Tue Sep 14 12:08:53 2021 +0200

    package/utils: Use download response URL to extract filename
    
    requests follows URL redirection by default for GET requests so
    update input URL to response one to ensure correct filename will
    be extracted from it.

See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/550/ for more details.

Harbormaster completed remote builds in B23591: Diff 22655.Sep 14 2021, 4:17 PM

ardumont accepted this revision.Sep 14 2021, 6:53 PM

Update: Add extracted filename extra sanitization pass.

Build is green

Patch application report for D6252 (id=22684)

Rebasing onto d5e54a5eea...

Current branch diff-target is up to date.

Changes applied before test

commit cc73c630bd9be291705e34938cd146696106af29
Author: Antoine Lambert <anlambert@softwareheritage.org>
Date:   Tue Sep 14 13:22:37 2021 +0200

    package/utils: Try to extract download filename from response headers
    
    Some URLs for downloading a file do not contain any filename but
    rather provide it in the "content-disposition" response header.
    
    So ensure to extract the filename from that response header when
    available to avoid possible file processing issues afterwards.

commit 0e9cd127168e3fdd7b3ee78ccb6ea55cceb3f94f
Author: Antoine Lambert <anlambert@softwareheritage.org>
Date:   Tue Sep 14 12:08:53 2021 +0200

    package/utils: Use download response URL to extract filename
    
    requests follows URL redirection by default for GET requests so
    update input URL to response one to ensure correct filename will
    be extracted from it.

See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/552/ for more details.

Harbormaster completed remote builds in B23617: Diff 22684.Sep 15 2021, 1:52 PM

Rebase

Build is green

Patch application report for D6252 (id=22730)

Rebasing onto 7329998421...

Current branch diff-target is up to date.

Changes applied before test

commit 1bd1827fa27466365bee22d9e10d759667242990
Author: Antoine Lambert <anlambert@softwareheritage.org>
Date:   Tue Sep 14 13:22:37 2021 +0200

    package/utils: Try to extract download filename from response headers
    
    Some URLs for downloading a file do not contain any filename but
    rather provide it in the "content-disposition" response header.
    
    So ensure to extract the filename from that response header when
    available to avoid possible file processing issues afterwards.

commit 049c41f6144d9cf80462c82e4f03f2a25fc45ebb
Author: Antoine Lambert <anlambert@softwareheritage.org>
Date:   Tue Sep 14 12:08:53 2021 +0200

    package/utils: Use download response URL to extract filename
    
    requests follows URL redirection by default for GET requests so
    update input URL to response one to ensure correct filename will
    be extracted from it.

See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/557/ for more details.

Harbormaster completed remote builds in B23662: Diff 22730.Sep 16 2021, 10:45 AM

This revision was not accepted when it landed; it landed in state Needs Review.Sep 16 2021, 10:48 AM

Closed by commit rDLDBASE049c41f6144d: package/utils: Use download response URL to extract filename (authored by anlambert). · Explain Why

This revision was automatically updated to reflect the committed changes.

anlambert added a commit: rDLDBASE049c41f6144d: package/utils: Use download response URL to extract filename.

anlambert added a commit: rDLDBASE1bd1827fa274: package/utils: Try to extract download filename from response headers.

package/utils: Improve downloaded filename extraction
ClosedPublic
Actions

Details

Diff Detail

Event Timeline

Patch application report for D6252 (id=22641)

Changes applied before test

Patch application report for D6252 (id=22646)

Changes applied before test

Patch application report for D6252 (id=22655)

Changes applied before test

Patch application report for D6252 (id=22684)

Changes applied before test

Patch application report for D6252 (id=22730)

Changes applied before test

Revision Contents
Changeset List

Diff 22730

swh/loader/package/tests/test_utils.py

swh/loader/package/utils.py

package/utils: Improve downloaded filename extractionClosedPublicActions

Details

Diff Detail

Event Timeline

Patch application report for D6252 (id=22641)

Changes applied before test

Patch application report for D6252 (id=22646)

Changes applied before test

Patch application report for D6252 (id=22655)

Changes applied before test

Patch application report for D6252 (id=22684)

Changes applied before test

Patch application report for D6252 (id=22730)

Changes applied before test

Revision ContentsChangeset List

Diff 22730

swh/loader/package/tests/test_utils.py

swh/loader/package/utils.py

package/utils: Improve downloaded filename extraction
ClosedPublic
Actions

Revision Contents
Changeset List