
package.loader: Handle tarball download erroneously marked as gzipped
ClosedPublic

Authored by anlambert on Jun 10 2021, 3:22 PM.

Details

Summary

There are cases where a tarball to download is marked as gzipped in the
Content-Encoding HTTP response header when in fact it is not.

So handle the ContentDecodingError exception that can be raised by the
download method: try to download the tarball's raw bytes again without
attempting to decompress the input stream.
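
The retry described above can be sketched roughly as follows. This is a minimal illustration using `requests`, not the loader's actual `download` function: `download_bytes`, `CHUNK_SIZE`, and the injectable `get` parameter are hypothetical names introduced here, and the whole body is buffered in memory for brevity.

```python
import requests
from requests.exceptions import ContentDecodingError

# Hypothetical block size for this sketch (not the loader's HASH_BLOCK_SIZE).
CHUNK_SIZE = 64 * 1024


def download_bytes(url, get=requests.get, chunk_size=CHUNK_SIZE):
    """Return the body at `url`, falling back to undecoded bytes
    when the advertised Content-Encoding cannot be decoded."""
    response = get(url, stream=True)
    response.raise_for_status()
    try:
        # iter_content() decodes the body according to Content-Encoding.
        return b"".join(response.iter_content(chunk_size=chunk_size))
    except ContentDecodingError:
        # The server announced gzip but the payload is not gzipped:
        # request the resource again and read the stream without decoding.
        response.close()
        response = get(url, stream=True)
        response.raise_for_status()
        return b"".join(response.raw.stream(chunk_size, decode_content=False))
```

Passing `decode_content=False` to `response.raw.stream` skips Content-Encoding decoding, so a payload wrongly labelled as gzip is returned unchanged.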

Real-world example encountered:

swh-loader_1                    | [2021-06-10 09:18:08,876: DEBUG/ForkPoolWorker-1] package_info: ArchivePackageInfo(url='http://www.columbia.edu/kermit/ftp/archives/cpm80.tar.gz', filename='cpm80.tar.gz', directory_extrinsic_metadata=[], raw_info={'url': 'http://www.columbia.edu/kermit/ftp/archives/cpm80.tar.gz', 'time': '2011-08-13T23:05:09', 'length': 1894400, 'version': 'cpm80'}, length=1894400, time='2011-08-13T23:05:09', version='cpm80')
swh-loader_1                    | [2021-06-10 09:18:09,039: DEBUG/ForkPoolWorker-1] filename: cpm80.tar.gz
swh-loader_1                    | [2021-06-10 09:18:09,039: DEBUG/ForkPoolWorker-1] filepath: /tmp/tmpqydd_7xw/cpm80.tar.gz
swh-loader_1                    | [2021-06-10 09:18:09,044: ERROR/ForkPoolWorker-1] Failed loading branch releases/cpm80 for https://www.kermitproject.org/archive.html
swh-loader_1                    | Traceback (most recent call last):
swh-loader_1                    |   File "/srv/softwareheritage/venv/lib/python3.7/site-packages/urllib3/response.py", line 401, in _decode
swh-loader_1                    |     data = self._decoder.decompress(data)
swh-loader_1                    |   File "/srv/softwareheritage/venv/lib/python3.7/site-packages/urllib3/response.py", line 88, in decompress
swh-loader_1                    |     ret += self._obj.decompress(data)
swh-loader_1                    | zlib.error: Error -3 while decompressing data: incorrect header check
swh-loader_1                    | 
swh-loader_1                    | During handling of the above exception, another exception occurred:
swh-loader_1                    | 
swh-loader_1                    | Traceback (most recent call last):
swh-loader_1                    |   File "/srv/softwareheritage/venv/lib/python3.7/site-packages/requests/models.py", line 753, in generate
swh-loader_1                    |     for chunk in self.raw.stream(chunk_size, decode_content=True):
swh-loader_1                    |   File "/srv/softwareheritage/venv/lib/python3.7/site-packages/urllib3/response.py", line 576, in stream
swh-loader_1                    |     data = self.read(amt=amt, decode_content=decode_content)
swh-loader_1                    |   File "/srv/softwareheritage/venv/lib/python3.7/site-packages/urllib3/response.py", line 548, in read
swh-loader_1                    |     data = self._decode(data, decode_content, flush_decoder)
swh-loader_1                    |   File "/srv/softwareheritage/venv/lib/python3.7/site-packages/urllib3/response.py", line 407, in _decode
swh-loader_1                    |     e,
swh-loader_1                    | urllib3.exceptions.DecodeError: ('Received response with content-encoding: gzip, but failed to decode it.', error('Error -3 while decompressing data: incorrect header check'))
swh-loader_1                    | 
swh-loader_1                    | During handling of the above exception, another exception occurred:
swh-loader_1                    | 
swh-loader_1                    | Traceback (most recent call last):
swh-loader_1                    |   File "/src/swh-loader-core/swh/loader/package/loader.py", line 576, in load
swh-loader_1                    |     res = self._load_revision(p_info, origin)
swh-loader_1                    |   File "/src/swh-loader-core/swh/loader/package/loader.py", line 713, in _load_revision
swh-loader_1                    |     dl_artifacts = self.download_package(p_info, tmpdir)
swh-loader_1                    |   File "/src/swh-loader-core/swh/loader/package/loader.py", line 364, in download_package
swh-loader_1                    |     return [download(p_info.url, dest=tmpdir, filename=p_info.filename)]
swh-loader_1                    |   File "/src/swh-loader-core/swh/loader/package/utils.py", line 93, in download
swh-loader_1                    |     for chunk in response.iter_content(chunk_size=HASH_BLOCK_SIZE):
swh-loader_1                    |   File "/srv/softwareheritage/venv/lib/python3.7/site-packages/requests/models.py", line 758, in generate
swh-loader_1                    |     raise ContentDecodingError(e)
swh-loader_1                    | requests.exceptions.ContentDecodingError: ('Received response with content-encoding: gzip, but failed to decode it.', error('Error -3 while decompressing data: incorrect header check'))
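
The innermost error in the traceback above (`zlib.error: Error -3 ... incorrect header check`) can be reproduced offline with the standard library alone. The sketch below feeds an uncompressed tar archive to a gzip-mode zlib decompressor; the `16 + zlib.MAX_WBITS` setting is an assumption about how the HTTP client configures its gzip decoder, and `make_plain_tar`, `gzip_decode`, and `looks_gzipped` are hypothetical helper names.

```python
import io
import tarfile
import zlib


def make_plain_tar() -> bytes:
    """Build a small, uncompressed tar archive in memory."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        data = b"hello"
        info = tarfile.TarInfo(name="hello.txt")
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def gzip_decode(data: bytes) -> bytes:
    """Decode `data` as gzip; raises zlib.error if it is not gzip."""
    # 16 + MAX_WBITS makes zlib expect a gzip header and trailer.
    decoder = zlib.decompressobj(16 + zlib.MAX_WBITS)
    return decoder.decompress(data) + decoder.flush()


def looks_gzipped(data: bytes) -> bool:
    """Check for the two-byte gzip magic number."""
    return data[:2] == b"\x1f\x8b"


if __name__ == "__main__":
    tar_bytes = make_plain_tar()
    print("gzip magic present:", looks_gzipped(tar_bytes))
    try:
        gzip_decode(tar_bytes)
    except zlib.error as exc:
        print("decoding failed:", exc)
```

Checking the magic number is only a cheap heuristic; the change under review relies on catching the decoding exception rather than inspecting the payload.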

Diff Detail

Repository
rDLDBASE Generic VCS/Package Loader
Branch
non-gzipped-tarballs-handling
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 21912
Build 34077: Phabricator diff pipeline on jenkins
Build 34076: arc lint + arc unit

Event Timeline

Build is green

Patch application report for D5852 (id=20927)

Rebasing onto 644134a86d...

Current branch diff-target is up to date.
Changes applied before test
commit bdad1e75dddfab78fbc2d34a422a3ac1f22c1e6f
Author: Antoine Lambert <antoine.lambert@inria.fr>
Date:   Thu Jun 10 15:16:47 2021 +0200

    package.loader: Handle tarball download erroneously marked as gzipped
    
    There are cases where a tarball to download is marked as gzipped in the
    Content-Encoding HTTP response header when in fact it is not.
    
    So handle the ContentDecodingError exception that can be raised by the
    download method: try to download the tarball's raw bytes again without
    attempting to decompress the input stream.

See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/479/ for more details.

ardumont added a subscriber: ardumont.
ardumont added inline comments.
swh/loader/package/archive/tests/test_archive.py
432
This revision is now accepted and ready to land. Jun 10 2021, 3:29 PM

Fix typo

swh/loader/package/archive/tests/test_archive.py
432

Good catch, thanks!

Build is green

Patch application report for D5852 (id=20931)

Rebasing onto 644134a86d...

Current branch diff-target is up to date.
Changes applied before test
commit 4448faf3d8bdd251c228924b967a5b38b6b31cc7
Author: Antoine Lambert <antoine.lambert@inria.fr>
Date:   Thu Jun 10 15:16:47 2021 +0200

    package.loader: Handle tarball download erroneously marked as gzipped
    
    There are cases where a tarball to download is marked as gzipped in the
    Content-Encoding HTTP response header when in fact it is not.
    
    So handle the ContentDecodingError exception that can be raised by the
    download method: try to download the tarball's raw bytes again without
    attempting to decompress the input stream.

See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/480/ for more details.

Build is green

Patch application report for D5852 (id=20933)

Rebasing onto 644134a86d...

Current branch diff-target is up to date.
Changes applied before test
commit ad79654a531674a2be954a9aa6823a7387f7d96e
Author: Antoine Lambert <antoine.lambert@inria.fr>
Date:   Thu Jun 10 15:16:47 2021 +0200

    package/loader: Handle tarball download erroneously marked as gzipped
    
    There are cases where a tarball to download is marked as gzipped in the
    Content-Encoding HTTP response header when in fact it is not.
    
    So handle the ContentDecodingError exception that can be raised by the
    download method: try to download the tarball's raw bytes again without
    attempting to decompress the input stream.

See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/481/ for more details.