Page MenuHomeSoftware Heritage

Disambiguate uploaded files with a prefix instead of a suffix.
AbandonedPublic

Authored by vlorentz on May 28 2020, 1:31 PM.

Details

Summary

Django tries to add the suffix before the extension, but it doesn't
account for compound extensions, so 'foo.tar.gz' becomes
'foo.tar_XXXXXXX.gz', which messes with shutil's file type
detection.

Using a prefix instead solves this issue.

Resolves T2393.

Test Plan

I can't reproduce the exact issue in a test for some reason,
but we already have a function checking the name, so that should be
good enough.

Diff Detail

Repository
rDDEP Push deposit
Branch
master
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 12535
Build 19040: Phabricator diff pipeline on jenkinsJenkins console · Jenkins
Build 19039: arc lint + arc unit

Unit TestsFailed

TimeTest
598 msJenkins > .tox.py3.lib.python3.7.site-packages.swh.deposit.tests.api.test_deposit_binary::test_post_deposit_binary_and_post_to_add_another_archive
authenticated_client = <rest_framework.test.APIClient object at 0x7f8bcd474be0> deposit_collection = <DepositCollection: {'id': 26, 'name': 'test'}> sample_archive = {'data': b'PK\x03\x04\x14\x00\x00\x00\x00\x00\x97\\\xbcP\xcba\xb4c\x14\x00\x00\x00\x14\x00\x00\x00\x05\x00\x00\x00file...ytest-0/test_post_deposit_binary_and_p0/tmppopa8o1z', 'length': 128, 'md5sum': 'a1144f40c0ef1281bbbe7a297028e4e1', ...}
573 msJenkins > .tox.py3.lib.python3.7.site-packages.swh.deposit.tests.api.test_deposit_binary::test_post_deposit_binary_upload_ok
authenticated_client = <rest_framework.test.APIClient object at 0x7f8bcd5da278> deposit_collection = <DepositCollection: {'id': 20, 'name': 'test'}> sample_archive = {'data': b'PK\x03\x04\x14\x00\x00\x00\x00\x00\x96\\\xbcP\xcba\xb4c\x14\x00\x00\x00\x14\x00\x00\x00\x05\x00\x00\x00file...ytest-0/test_post_deposit_binary_uploa0/tmpagusghof', 'length': 128, 'md5sum': 'b6e7116692069765f21f83ba1492fadb', ...}
614 msJenkins > .tox.py3.lib.python3.7.site-packages.swh.deposit.tests.api.test_deposit_binary::test_post_deposit_then_update_refused
authenticated_client = <rest_framework.test.APIClient object at 0x7f8bcd481898> deposit_collection = <DepositCollection: {'id': 27, 'name': 'test'}> sample_archive = {'data': b'PK\x03\x04\x14\x00\x00\x00\x00\x00\x98\\\xbcP\xcba\xb4c\x14\x00\x00\x00\x14\x00\x00\x00\x05\x00\x00\x00file...ytest-0/test_post_deposit_then_update_0/tmps30y42c7', 'length': 128, 'md5sum': 'f77d810f710e718bae934e48e4b20824', ...}
606 msJenkins > .tox.py3.lib.python3.7.site-packages.swh.deposit.tests.api.test_deposit_multipart::test_post_deposit_multipart_put_to_replace_metadata
authenticated_client = <rest_framework.test.APIClient object at 0x7f8bcd5019e8> deposit_collection = <DepositCollection: {'id': 38, 'name': 'test'}> atom_dataset = {'codemeta-sample': '<?xml version="1.0"?>\n <entry xmlns="http://www.w3.org/2005/Atom"\n xmlns:d...ntry>\n', 'entry-data-empty-body': '<?xml version="1.0"?>\n<entry xmlns="http://www.w3.org/2005/Atom"></entry>\n', ...}
501 msJenkins > .tox.py3.lib.python3.7.site-packages.swh.deposit.tests.api.test_deposit_multipart::test_post_deposit_multipart_tar
authenticated_client = <rest_framework.test.APIClient object at 0x7f8bcd506ac8> deposit_collection = <DepositCollection: {'id': 37, 'name': 'test'}> atom_dataset = {'codemeta-sample': '<?xml version="1.0"?>\n <entry xmlns="http://www.w3.org/2005/Atom"\n xmlns:d...ntry>\n', 'entry-data-empty-body': '<?xml version="1.0"?>\n<entry xmlns="http://www.w3.org/2005/Atom"></entry>\n', ...}
View Full Test Results (9 Failed · 132 Passed)

Event Timeline

Build has FAILED

Patch application report for D3192 (id=11341)

Rebasing onto 4b4cebc477...

Current branch diff-target is up to date.
Changes applied before test
commit b0c7d0e36a5761637a1ec679a12bbe82e605506e
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu May 28 13:30:39 2020 +0200

    Disambiguate uploaded files with a prefix instead of a suffix.
    
    Django tries to add the suffix before the extension, but it doesn't
    account for compound extensions, so 'foo.tar.gz' becomes
    'foo.tar_XXXXXXX.gz', which messes with shutil's file type
    detection.
    
    Using a prefix instead solves this issue.

Link to build: https://jenkins.softwareheritage.org/job/DDEP/job/tests-on-diff/36/
See console output for more information: https://jenkins.softwareheritage.org/job/DDEP/job/tests-on-diff/36/console

Build has FAILED

Patch application report for D3192 (id=11342)

Rebasing onto 4b4cebc477...

Current branch diff-target is up to date.
Changes applied before test
commit e387bdab2ad0d145b3992a5366ded588d62975ae
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu May 28 13:30:39 2020 +0200

    Disambiguate uploaded files with a prefix instead of a suffix.
    
    Django tries to add the suffix before the extension, but it doesn't
    account for compound extensions, so 'foo.tar.gz' becomes
    'foo.tar_XXXXXXX.gz', which messes with shutil's file type
    detection.
    
    Using a prefix instead solves this issue.

Link to build: https://jenkins.softwareheritage.org/job/DDEP/job/tests-on-diff/37/
See console output for more information: https://jenkins.softwareheritage.org/job/DDEP/job/tests-on-diff/37/console

The tests are failing. I recall there is a something on the model about filename in the deposit model... [1] [2]

It's possibly related.

[1] definition: https://forge.softwareheritage.org/source/swh-deposit/browse/master/swh/deposit/models.py$164-176

[2] use: https://forge.softwareheritage.org/source/swh-deposit/browse/master/swh/deposit/models.py$196

Wondering if there isn't a simpler implementation leveraging that field's upload_to after all [1]
(instead of modifying the django system storage...)

[1] https://docs.djangoproject.com/en/2.2/ref/models/fields/#django.db.models.FileField.upload_to

Wondering if there isn't a simpler implementation leveraging that field's upload_to after all [1]
(instead of modifying the django system storage...)

[1] https://docs.djangoproject.com/en/2.2/ref/models/fields/#django.db.models.FileField.upload_to

D3197 as an example of that suggestion.

Although if we are going the initial proposed path, providing a dedicated
storage implementation...

We may want to have a django storage implementation with an objstorage-like
collaborator, to store the archive with no duplication. At the moment, i'd
expect a lot of duplication there (user failing trying back their deposit,
current deposit infra checks, ...)

(Not necessarily to act on it, just mentioning it because it's on my mind since
the beginning of the deposit... There may even be mention of this in some
deposit specification draft somewhere in the forge ;)