
Two deposits of files with the same name and a "double" extension fail the checker
Open, High, Public

Description

If I upload the same file name to the deposit twice, Django adds a random string before the extension in the file name, so we have things like:

swh@4f4897bb636a:/$ ls /tmp/swh-deposit/uploads/client_1/
swh-deposit.tar.gz  swh-deposit.tar_9bEbkyF.gz  swh-deposit.tar_LHM1Y7e.gz  swh-deposit.tar_ZdRLQAZ.gz  swh-deposit.tar_axXOAqS.gz

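While this is fine for "single" extensions (such as .tgz), it is not for .tar.gz and the like, because the random string ends up between .tar and .gz and the file name no longer ends in .tar.gz.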
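To illustrate where the random string lands, here is a minimal sketch that assumes the renaming splits off only the last extension, as `os.path.splitext` does, and joins the parts as root, underscore, token, extension. The helper name `django_style_alternative` is made up for this example, and the token is passed in explicitly rather than generated.

```python
import os


def django_style_alternative(name, token):
    # os.path.splitext only separates the LAST extension, so for
    # "x.tar.gz" the root is "x.tar" and the extension is ".gz".
    root, ext = os.path.splitext(name)
    # The token is appended to the root, i.e. before the last extension.
    return "%s_%s%s" % (root, token, ext)


# Double extension: the token splits ".tar" from ".gz".
print(django_style_alternative("swh-deposit.tar.gz", "9bEbkyF"))
# Single extension: the extension stays intact.
print(django_style_alternative("swh-deposit.tgz", "9bEbkyF"))
```

With a double extension the result is `swh-deposit.tar_9bEbkyF.gz`, matching the listing above, while a .tgz file keeps a recognizable extension.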

I believe the right fix would be to change Django's storage class to deduplicate file names in a different way (add the random string as a prefix, or create a directory): https://docs.djangoproject.com/en/3.0/topics/files/#the-built-in-filesystem-storage-class

In Django 3, we could simply subclass FileSystemStorage and override get_alternative_name; but the deposit currently uses Django 2, so we would need to override this entire function: https://github.com/django/django/blob/98ef3829e96ebc73d4d446f92465e671ff520d2b/django/core/files/storage.py#L63-L92
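As a rough sketch of the "random string as prefix" option, the naming logic could look like the following. This is plain Python with no Django dependency; the function name `prefixed_alternative_name`, the 7-character token length, and the alphabet are assumptions for illustration, not what the deposit actually ships.

```python
import os
import secrets
import string

# Assumed alphabet for the random token (letters and digits).
ALPHABET = string.ascii_letters + string.digits


def prefixed_alternative_name(name, token=None, length=7):
    """Return `name` with a token prepended to the base file name.

    All extensions (".tar.gz", ".tgz", ...) are left untouched.
    """
    if token is None:
        # Generate a random token when the caller does not supply one.
        token = "".join(secrets.choice(ALPHABET) for _ in range(length))
    # Keep the directory part as-is and only rename the base name.
    directory, base = os.path.split(name)
    return os.path.join(directory, "%s_%s" % (token, base))


print(prefixed_alternative_name("swh-deposit.tar.gz", token="9bEbkyF"))
```

The same string manipulation could live in a FileSystemStorage subclass, in get_alternative_name on Django 3 or in the larger name-selection function on Django 2. That wiring is not shown here and has not been tested against the deposit.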

Event Timeline

vlorentz triaged this task as Normal priority. (Wed, May 6, 12:35 PM)
vlorentz created this task.
vlorentz raised the priority of this task from Normal to High. (Thu, May 28, 11:54 AM)
vlorentz updated the task description.

Raising priority, as we are hitting this issue in production

Thanks @vlorentz.
Is this something that can be fixed easily or should I ask the client to change the names on the deposited archives?

It should be easy to fix