
tarball: fallback using tar command when shutil.unpack_archive failed
Closed, Public

Authored by anlambert on Oct 11 2022, 6:05 PM.

Details

Summary

When shutil.unpack_archive fails to unpack a tarball, fall back to the
tar command to perform that task.

Such issues were encountered when trying to unpack old tarballs coming
from CPAN.
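
For illustration, the fallback described above could look roughly like the sketch below. This is not the code from the diff: the function name unpack_with_fallback, the broad exception handling, and the exact tar invocation are assumptions made for the example.

```python
import os
import shutil
import subprocess


def unpack_with_fallback(archive_path, dest, fmt=None):
    """Unpack archive_path into dest, falling back to the tar command.

    Hypothetical helper: the name and error handling are assumptions,
    not the actual swh.core.tarball implementation.
    """
    try:
        # First attempt: pure-Python extraction through the standard library.
        shutil.unpack_archive(archive_path, extract_dir=dest, format=fmt)
    except Exception:
        # Second attempt: delegate to the system tar command, which accepts
        # some archives that the tarfile module rejects. Any partial output
        # from the first attempt is left in place.
        os.makedirs(dest, exist_ok=True)
        subprocess.run(["tar", "xf", archive_path, "-C", dest], check=True)
    return dest
```

If the tar command fails as well, subprocess.CalledProcessError propagates to the caller, so a tarball that neither tool can read is still reported as an error.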

[2022-10-11 12:02:07,753: DEBUG/ForkPoolWorker-71] package_info: CpanPackageInfo(url='https://cpan.metacpan.org/authors/id/S/SO/SOFTDIA/Tie-Layers-0.06.tar.gz', filename='Tie-Layers-0.06.tar.gz', directory_extrinsic_metadata=[RawExtrinsicMetadataCore(format='origin-artifacts-json', metadata=b'[{"url": "https://cpan.metacpan.org/authors/id/S/SO/SOFTDIA/Tie-Layers-0.06.tar.gz", "length": 71373, "filename": "Tie-Layers-0.06.tar.gz", "checksums": {"sha256": "4ce51555efbaf4760c4deed7877efa6117e665ddce216eb7d56244ad47ed421e"}}]', discovery_date=None), RawExtrinsicMetadataCore(format='cpan-module-json', metadata=b'{\n   "total" : 1,\n   "took" : 2,\n   "release" : {\n      "maturity" : "released",\n      "changes_file" : "",\n      "resources" : {},\n      "checksum_md5" : "9b55170f8cbd2b25e6c2d48b04c1a6f3",\n      "authorized" : true,\n      "tests" : {\n         "pass" : 85,\n         "unknown" : 1,\n         "fail" : 9,\n         "na" : 0\n      },\n      "status" : "latest",\n      "abstract" : "test script for Tie::Layers",\n      "download_url" : "https://cpan.metacpan.org/authors/id/S/SO/SOFTDIA/Tie-Layers-0.06.tar.gz",\n      "first" : false,\n      "distribution" : "Tie-Layers",\n      "dependency" : [],\n      "archive" : "Tie-Layers-0.06.tar.gz",\n      "date" : "2004-05-28T20:21:29",\n      "checksum_sha256" : "4ce51555efbaf4760c4deed7877efa6117e665ddce216eb7d56244ad47ed421e",\n      "license" : "unknown",\n      "stat" : {\n         "gid" : 1009,\n         "mtime" : 1085775689,\n         "mode" : 33204,\n         "size" : 71373,\n         "uid" : 1009\n      },\n      "author" : "SOFTDIA",\n      "provides" : [\n         "Docs::Site_SVD::Tie_Layers",\n         "Tie::Layers"\n      ],\n      "deprecated" : false,\n      "version" : "0.06",\n      "id" : "sIm3r625C34wLbOjiQqSJu5f7yw",\n      "version_numified" : 0.06,\n      "metadata" : {\n         "version" : "0.06",\n         "abstract" : "unknown",\n         "dynamic_config" : 1,\n         "release_status" 
: "stable",\n         "name" : "Tie-Layers",\n         "meta-spec" : {\n            "version" : "2",\n            "url" : "http://search.cpan.org/perldoc?CPAN::Meta::Spec"\n         },\n         "prereqs" : {},\n         "license" : [\n            "unknown"\n         ],\n         "generated_by" : "CPAN::Meta::Converter version 2.150005",\n         "author" : [\n            "unknown"\n         ],\n         "no_index" : {\n            "directory" : [\n               "t",\n               "xt",\n               "inc",\n               "local",\n               "perl5",\n               "fatlib",\n               "example",\n               "blib",\n               "examples",\n               "eg"\n            ]\n         }\n      },\n      "name" : "Tie-Layers-0.06",\n      "main_module" : "Tie::Layers"\n   }\n}\n', discovery_date=None)], checksums={'sha256': '4ce51555efbaf4760c4deed7877efa6117e665ddce216eb7d56244ad47ed421e'}, name='Tie-Layers', version='0.06', last_modified=datetime.datetime(2004, 5, 28, 20, 21, 29, tzinfo=datetime.timezone.utc), author=Person(fullname=b'SOFTDIA', name=b'SOFTDIA', email=None))
docker-swh-loader-1  | [2022-10-11 12:02:07,888: DEBUG/ForkPoolWorker-71] filename: Tie-Layers-0.06.tar.gz
docker-swh-loader-1  | [2022-10-11 12:02:07,888: DEBUG/ForkPoolWorker-71] filepath: /tmp/tmprzukutm6/Tie-Layers-0.06.tar.gz
docker-swh-loader-1  | [2022-10-11 12:02:07,901: DEBUG/ForkPoolWorker-71] extrinsic_metadata
docker-swh-loader-1  | [2022-10-11 12:02:07,905: ERROR/ForkPoolWorker-71] Failed to load branch releases/0.06 for https://metacpan.org/dist/Tie-Layers
docker-swh-loader-1  | Traceback (most recent call last):
docker-swh-loader-1  |   File "/tmp/tmp.OQGDOsvCsy/swh-loader-core/swh/loader/package/loader.py", line 688, in load
docker-swh-loader-1  |     res = self._load_release(p_info, origin)
docker-swh-loader-1  |   File "/tmp/tmp.OQGDOsvCsy/swh-loader-core/swh/loader/package/loader.py", line 873, in _load_release
docker-swh-loader-1  |     (uncompressed_path, directory) = self._load_directory(dl_artifacts, tmpdir)
docker-swh-loader-1  |   File "/tmp/tmp.OQGDOsvCsy/swh-loader-core/swh/loader/package/loader.py", line 826, in _load_directory
docker-swh-loader-1  |     uncompressed_path = self.uncompress(dl_artifacts, dest=tmpdir)
docker-swh-loader-1  |   File "/tmp/tmp.OQGDOsvCsy/swh-loader-core/swh/loader/package/loader.py", line 452, in uncompress
docker-swh-loader-1  |     uncompress(a_path, dest=uncompressed_path)
docker-swh-loader-1  |   File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/core/tarball.py", line 160, in uncompress
docker-swh-loader-1  |     shutil.unpack_archive(tarpath, extract_dir=dest, format=format)
docker-swh-loader-1  |   File "/usr/local/lib/python3.7/shutil.py", line 993, in unpack_archive
docker-swh-loader-1  |     func(filename, extract_dir, **dict(format_info[2]))
docker-swh-loader-1  |   File "/usr/local/lib/python3.7/shutil.py", line 937, in _unpack_tarfile
docker-swh-loader-1  |     tarobj.extractall(extract_dir)
docker-swh-loader-1  |   File "/usr/local/lib/python3.7/tarfile.py", line 2002, in extractall
docker-swh-loader-1  |     numeric_owner=numeric_owner)
docker-swh-loader-1  |   File "/usr/local/lib/python3.7/tarfile.py", line 2044, in extract
docker-swh-loader-1  |     numeric_owner=numeric_owner)
docker-swh-loader-1  |   File "/usr/local/lib/python3.7/tarfile.py", line 2114, in _extract_member
docker-swh-loader-1  |     self.makefile(tarinfo, targetpath)
docker-swh-loader-1  |   File "/usr/local/lib/python3.7/tarfile.py", line 2155, in makefile
docker-swh-loader-1  |     with bltn_open(targetpath, "wb") as target:
docker-swh-loader-1  | NotADirectoryError: [Errno 20] Not a directory: '/tmp/tmprzukutm6/src/Tie-Layers-0.06/lib'

Related to T2833

Diff Detail

Repository
rDCORE Foundations and core functionalities
Branch
uncompress-tar-fallback
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 32242
Build 50503: Phabricator diff pipeline on jenkins
Build 50502: arc lint + arc unit

Event Timeline

Build is green

Patch application report for D8661 (id=31280)

Rebasing onto be9403c676...

Current branch diff-target is up to date.
Changes applied before test
commit fb85dc6dc1e0c6fefd0cd3e5a9ea486b5a2f9504
Author: Antoine Lambert <anlambert@softwareheritage.org>
Date:   Tue Oct 11 18:00:37 2022 +0200

    tarball: fallback using tar command when shutil.unpack_archive failed
    
    When shutil.unpack_archive failed to unpack a tarball, fallback using
    the tar command to perform that task.
    
    Such issues were encountered when trying to unpack old tarballs coming
    from CPAN.
    
    Related to T2833

See https://jenkins.softwareheritage.org/job/DCORE/job/tests-on-diff/479/ for more details.

Here is how to reproduce the test without the 25KB tarball:

>>> import io
>>> import tarfile
>>> tf = tarfile.open("repro.tar.gz", "w:gz")
>>> ti = tarfile.TarInfo("dir")
>>> ti.mode = 0o777
>>> tf.addfile(ti)
>>> ti = tarfile.TarInfo("dir/file")
>>> tf.addfile(ti, io.BytesIO(b"hello world"))
>>> tf.close()

>>> from swh.core import tarball
>>> tarball.uncompress("repro.tar.gz", "/tmp/foo2")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/dev/swh-environment/swh-core/swh/core/tarball.py", line 161, in uncompress
    shutil.unpack_archive(tarpath, extract_dir=dest, format=format)
  File "/usr/lib/python3.9/shutil.py", line 1236, in unpack_archive
    func(filename, extract_dir, **dict(format_info[2]))
  File "/usr/lib/python3.9/shutil.py", line 1178, in _unpack_tarfile
    tarobj.extractall(extract_dir)
  File "/usr/lib/python3.9/tarfile.py", line 2036, in extractall
    self.extract(tarinfo, path, set_attrs=not tarinfo.isdir(),
  File "/usr/lib/python3.9/tarfile.py", line 2077, in extract
    self._extract_member(tarinfo, os.path.join(path, tarinfo.name),
  File "/usr/lib/python3.9/tarfile.py", line 2150, in _extract_member
    self.makefile(tarinfo, targetpath)
  File "/usr/lib/python3.9/tarfile.py", line 2191, in makefile
    with bltn_open(targetpath, "wb") as target:
NotADirectoryError: [Errno 20] Not a directory: '/tmp/foo2/dir/file'

(though we should probably commit the repro.tar.gz file this generates, in case tarfile becomes smarter in future Python versions and prevents generating tarballs like this)

Thanks, I managed to produce a tarball that can be extracted by tar but not by tarfile the following way:

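import io
import os
import tarfile

# tmp_path is assumed to be a temporary directory (e.g. pytest's tmp_path fixture)
archive_path = os.path.join(tmp_path, "repro.tar.gz")
tf = tarfile.open(archive_path, "w:gz")
# Directory entry, explicitly typed as a directory
ti = tarfile.TarInfo("dir")
ti.mode = 0o777
ti.type = tarfile.DIRTYPE
tf.addfile(ti)
# File entry inside that directory (ti.size stays 0, so no payload bytes are stored)
ti = tarfile.TarInfo("dir/file")
tf.addfile(ti, io.BytesIO(b"hello world"))
tf.close()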

Without the ti.type = tarfile.DIRTYPE instruction, tar also fails to extract it.
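
The difference between the two reproductions comes down to the entry type recorded for "dir": a fresh tarfile.TarInfo is a regular file unless its type is changed. The sketch below, with hypothetical helper names build_archive and member_kinds, builds both variants in memory and lists how each member is typed. It only inspects the archive metadata and does not run either extractor.

```python
import io
import tarfile


def build_archive(mark_as_directory):
    """Build the repro archive in memory (hypothetical helper)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tf:
        ti = tarfile.TarInfo("dir")
        ti.mode = 0o777
        if mark_as_directory:
            # Without this, "dir" keeps the default regular-file type.
            ti.type = tarfile.DIRTYPE
        tf.addfile(ti)
        payload = b"hello world"
        ti = tarfile.TarInfo("dir/file")
        ti.size = len(payload)  # addfile copies exactly ti.size bytes
        tf.addfile(ti, io.BytesIO(payload))
    return buf.getvalue()


def member_kinds(data):
    """Return (name, kind) pairs for every member of a gzipped tarball."""
    with tarfile.open(fileobj=io.BytesIO(data), mode="r:gz") as tf:
        return [(m.name, "dir" if m.isdir() else "file") for m in tf.getmembers()]


print(member_kinds(build_archive(False)))  # [('dir', 'file'), ('dir/file', 'file')]
print(member_kinds(build_archive(True)))   # [('dir', 'dir'), ('dir/file', 'file')]
```

In the first variant "dir" is written to disk as a regular file, so creating "dir/file" afterwards raises NotADirectoryError, matching the traceback above.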

hah, good catch, I didn't think of that

This revision is now accepted and ready to land. Oct 12 2022, 11:24 AM

Build is green

Patch application report for D8661 (id=31288)

Rebasing onto be9403c676...

Current branch diff-target is up to date.
Changes applied before test
commit 7dd209df68a22947ad89720923cfdfae05b61dc0
Author: Antoine Lambert <anlambert@softwareheritage.org>
Date:   Tue Oct 11 18:00:37 2022 +0200

    tarball: fallback using tar command when shutil.unpack_archive failed
    
    When shutil.unpack_archive failed to unpack a tarball, fallback using
    the tar command to perform that task.
    
    Such issues were encountered when trying to unpack old tarballs coming
    from CPAN.
    
    Related to T2833

See https://jenkins.softwareheritage.org/job/DCORE/job/tests-on-diff/480/ for more details.

Build is green

Patch application report for D8661 (id=31291)

Rebasing onto 6012be5499...

Current branch diff-target is up to date.
Changes applied before test
commit 6a5ad7618587c806d6a9a149de3b42375362e9e6
Author: Antoine Lambert <anlambert@softwareheritage.org>
Date:   Tue Oct 11 18:00:37 2022 +0200

    tarball: fallback using tar command when shutil.unpack_archive failed
    
    When shutil.unpack_archive failed to unpack a tarball, fallback using
    the tar command to perform that task.
    
    Such issues were encountered when trying to unpack old tarballs coming
    from CPAN.
    
    Related to T2833

See https://jenkins.softwareheritage.org/job/DCORE/job/tests-on-diff/482/ for more details.