
package/archive: Handle tarball artifact with null time
Closed, Public

Authored by anlambert on Tue, Jun 7, 3:14 PM.

Details

Summary

An artifact without time info can be provided in the artifacts list
parameter of the loader.

For instance, the last modification date is not available for tarballs coming from GitHub tags
(the date header below corresponds to the request time, not to the tarball's last modification time).

15:09 $ curl -Li https://github.com/chromium/chromium/archive/refs/tags/104.0.5106.1.tar.gz
HTTP/2 302 
server: GitHub.com
date: Tue, 07 Jun 2022 13:10:44 GMT
content-type: text/html; charset=utf-8
vary: X-PJAX, X-PJAX-Container, Turbo-Visit, Turbo-Frame, Accept-Encoding, Accept, X-Requested-With
permissions-policy: interest-cohort=()
location: https://codeload.github.com/chromium/chromium/tar.gz/refs/tags/104.0.5106.1
cache-control: max-age=0, private
strict-transport-security: max-age=31536000; includeSubdomains; preload
x-frame-options: deny
x-content-type-options: nosniff
x-xss-protection: 0
referrer-policy: no-referrer-when-downgrade
expect-ct: max-age=2592000, report-uri="https://api.github.com/_private/browser/errors"
content-security-policy: default-src 'none'; base-uri 'self'; block-all-mixed-content; child-src github.com/assets-cdn/worker/ gist.github.com/assets-cdn/worker/; connect-src 'self' uploads.github.com objects-origin.githubusercontent.com www.githubstatus.com collector.github.com raw.githubusercontent.com api.github.com github-cloud.s3.amazonaws.com github-production-repository-file-5c1aeb.s3.amazonaws.com github-production-upload-manifest-file-7fdce7.s3.amazonaws.com github-production-user-asset-6210df.s3.amazonaws.com cdn.optimizely.com logx.optimizely.com/v1/events *.actions.githubusercontent.com wss://*.actions.githubusercontent.com online.visualstudio.com/api/v1/locations github-production-repository-image-32fea6.s3.amazonaws.com github-production-release-asset-2e65be.s3.amazonaws.com insights.github.com wss://alive.github.com; font-src github.githubassets.com; form-action 'self' github.com gist.github.com objects-origin.githubusercontent.com; frame-ancestors 'none'; frame-src render.githubusercontent.com viewscreen.githubusercontent.com notebooks.githubusercontent.com; img-src 'self' data: github.githubassets.com identicons.github.com github-cloud.s3.amazonaws.com secured-user-images.githubusercontent.com/ github-production-user-asset-6210df.s3.amazonaws.com *.githubusercontent.com; manifest-src 'self'; media-src github.com user-images.githubusercontent.com/; script-src github.githubassets.com; style-src 'unsafe-inline' github.githubassets.com; worker-src github.com/assets-cdn/worker/ gist.github.com/assets-cdn/worker/
content-length: 0
x-github-request-id: D358:4A4C:9CBB6E:BCAB87:629F4E54

HTTP/2 200 
access-control-allow-origin: https://render.githubusercontent.com
content-disposition: attachment; filename=chromium-104.0.5106.1.tar.gz
content-security-policy: default-src 'none'; style-src 'unsafe-inline'; sandbox
content-type: application/x-gzip
etag: "2ebec60c73390de10b6e84d75838466d939f03a7b468f10873c9023f549a5242"
strict-transport-security: max-age=31536000
vary: Authorization,Accept-Encoding,Origin
x-content-type-options: nosniff
x-frame-options: deny
x-xss-protection: 1; mode=block
date: Tue, 07 Jun 2022 13:10:45 GMT
x-github-request-id: 867A:7031:7EED7:179E4C:629F4E54

Warning: Binary output can mess up your terminal. Use "--output -" to tell 
Warning: curl to output it to your terminal anyway, or consider "--output 
Warning: <FILE>" to save to a file.
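
As a rough illustration of why no usable time comes out of the response above, here is a minimal stdlib-only sketch. The helper name `last_modified_from_headers` is hypothetical and is not part of the loader. It assumes the caller already has the response headers as a mapping, and it deliberately refuses to fall back to the `date` header, since that header only reflects the request time.

```python
from datetime import datetime
from email.utils import parsedate_to_datetime
from typing import Mapping, Optional


def last_modified_from_headers(headers: Mapping[str, str]) -> Optional[datetime]:
    """Return the Last-Modified date of a response, or None when it is absent."""
    # HTTP header names are case-insensitive, so normalize them first.
    normalized = {name.lower(): value for name, value in headers.items()}
    value = normalized.get("last-modified")
    if value is None:
        # Do not fall back to "date": it is the request time, not the
        # tarball's last modification time.
        return None
    return parsedate_to_datetime(value)


# A subset of the headers from the final (HTTP 200) response shown above.
github_headers = {
    "content-type": "application/x-gzip",
    "date": "Tue, 07 Jun 2022 13:10:45 GMT",
}
print(last_modified_from_headers(github_headers))  # prints None
```

With no `last-modified` header present, whoever builds the artifacts list has nothing better than `None` to pass as the artifact time.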

That case was not handled by the archive loader, which resulted
in a loading error, so add a fix for it.

swh-loader_1                        | [2022-06-07 10:00:56,998: INFO/MainProcess] Task swh.loader.package.archive.tasks.LoadArchive[d61d54e5-3163-439a-95a5-2ab57bd75a7d] received
swh-loader_1                        | [2022-06-07 10:00:57,001: DEBUG/ForkPoolWorker-1] Loading config file /loader.yml
swh-loader_1                        | [2022-06-07 10:00:59,059: DEBUG/ForkPoolWorker-1] last snapshot: None
swh-loader_1                        | [2022-06-07 10:00:59,064: DEBUG/ForkPoolWorker-1] package_info: ArchivePackageInfo(url='https://github.com/chromium/chromium/archive/refs/tags/104.0.5106.1.tar.gz', filename='104.0.5106.1.tar.gz', version='104.0.5106.1', directory_extrinsic_metadata=[], raw_info={'url': 'https://github.com/chromium/chromium/archive/refs/tags/104.0.5106.1.tar.gz', 'time': None, 'length': None, 'version': '104.0.5106.1'}, length=None, time=None)
swh-loader_1                        | [2022-06-07 10:01:00,790: DEBUG/ForkPoolWorker-1] filename: 104.0.5106.1.tar.gz
swh-loader_1                        | [2022-06-07 10:01:00,791: DEBUG/ForkPoolWorker-1] filepath: /tmp/tmpgnd1w9fy/104.0.5106.1.tar.gz
swh-loader_1                        | [2022-06-07 10:08:40,664: DEBUG/ForkPoolWorker-1] extrinsic_metadata
swh-loader_1                        | [2022-06-07 10:10:02,826: DEBUG/ForkPoolWorker-1] uncompressed_path: /tmp/tmpgnd1w9fy/src
swh-loader_1                        | [2022-06-07 10:11:38,076: DEBUG/ForkPoolWorker-1] Number of skipped contents: 0
swh-loader_1                        | [2022-06-07 10:11:38,076: DEBUG/ForkPoolWorker-1] Number of contents: 367501
swh-loader_1                        | [2022-06-07 10:11:38,558: DEBUG/ForkPoolWorker-1] Flushing 367501 objects of type content (3423607967 bytes)
swh-loader_1                        | [2022-06-07 10:32:41,504: DEBUG/ForkPoolWorker-1] Number of directories: 34530
swh-loader_1                        | [2022-06-07 10:32:41,542: DEBUG/ForkPoolWorker-1] Flushing 34530 objects of type directory (432087 entries)
swh-loader_1                        | [2022-06-07 10:33:20,750: ERROR/ForkPoolWorker-1] Failed to load branch releases/104.0.5106.1 for https://github.com/chromium/chromium/tags
swh-loader_1                        | Traceback (most recent call last):
swh-loader_1                        |   File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/loader/package/loader.py", line 648, in load
swh-loader_1                        |     res = self._load_release(p_info, origin)
swh-loader_1                        |   File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/loader/package/loader.py", line 826, in _load_release
swh-loader_1                        |     p_info, uncompressed_path, directory=directory.hash
swh-loader_1                        |   File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/loader/package/archive/loader.py", line 148, in build_release
swh-loader_1                        |     normalized_time = TimestampWithTimezone.from_datetime(parsed_time)
swh-loader_1                        |   File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/model/model.py", line 488, in from_datetime
swh-loader_1                        |     return cls.from_dict(dt)
swh-loader_1                        |   File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/model/model.py", line 482, in from_dict
swh-loader_1                        |     f"TimestampWithTimezone.from_dict received non-integer timestamp: "
swh-loader_1                        | ValueError: TimestampWithTimezone.from_dict received non-integer timestamp: None
swh-loader_1                        | [2022-06-07 10:33:20,752: DEBUG/ForkPoolWorker-1] default version: 104.0.5106.1
swh-loader_1                        | [2022-06-07 10:33:20,755: DEBUG/ForkPoolWorker-1] extra branches: {}
swh-loader_1                        | [2022-06-07 10:33:20,755: DEBUG/ForkPoolWorker-1] releases: {'104.0.5106.1': []}
swh-loader_1                        | [2022-06-07 10:33:20,755: DEBUG/ForkPoolWorker-1] snapshot: {'branches': {}}
swh-loader_1                        | [2022-06-07 10:33:20,755: DEBUG/ForkPoolWorker-1] snapshot: Snapshot(branches=ImmutableDict({}), id=hash_to_bytes('1a8893e6a86f444e8be8e7bda6cb34fb1735a00e'))
swh-loader_1                        | [2022-06-07 10:33:20,755: DEBUG/ForkPoolWorker-1] Flushing 1 objects of type snapshot
swh-loader_1                        | [2022-06-07 10:33:22,355: WARNING/ForkPoolWorker-1] 1 failed branches
swh-loader_1                        | [2022-06-07 10:33:22,356: WARNING/ForkPoolWorker-1] Failed branches: releases/104.0.5106.1
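
The traceback above comes from converting a `None` time into a timestamp. The diff itself is not shown on this page, so the following is only a sketch of the general idea, not the actual patch: guard the conversion and let a missing time propagate as `None`. The function name `normalize_artifact_time` is hypothetical, the sketch uses only the standard library rather than `swh.model`, and treating naive timestamps as UTC is an assumption made for illustration.

```python
from datetime import datetime, timezone
from typing import Optional


def normalize_artifact_time(time: Optional[str]) -> Optional[datetime]:
    """Parse an artifact's ISO 8601 time; a missing time stays None."""
    if time is None:
        # Nothing to parse: the caller is expected to build the release
        # without a date instead of failing.
        return None
    parsed = datetime.fromisoformat(time)
    if parsed.tzinfo is None:
        # Assumption for this sketch: naive timestamps are treated as UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


# Same shape as the raw_info logged for the failing artifact.
raw_info = {
    "url": "https://github.com/chromium/chromium/archive/refs/tags/104.0.5106.1.tar.gz",
    "time": None,
    "length": None,
    "version": "104.0.5106.1",
}
print(normalize_artifact_time(raw_info["time"]))  # prints None
```

The point of returning `None` rather than substituting the current time is that the archive should not record a date it cannot justify.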

Diff Detail

Repository
rDLDBASE Generic VCS/Package Loader
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

Build is green

Patch application report for D7967 (id=28706)

Rebasing onto ba69cab5a5...

Current branch diff-target is up to date.
Changes applied before test
commit d925d06e6f1a51a4a7e8f0d1250a1c3bc45db891
Author: Antoine Lambert <anlambert@softwareheritage.org>
Date:   Tue Jun 7 15:05:09 2022 +0200

    package/archive: Handle tarball artifact with null time
    
    An artifact without time info can be provided in the artifacts list
    parameter of the loader (for instance last modification date
    is not available for tarballs coming from github releases).
    
    That case was not handled by the archive loader which was resulting
    in loading error so add fix for it.

See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/797/ for more details.

This revision is now accepted and ready to land. Tue, Jun 7, 3:24 PM