Software Heritage

Two deposits of files with the same name and a "compound" extension fail the checker
Closed, Migrated

Description

If I upload the same file name to the deposit twice, Django adds a random string before the extension in the file name, so we have things like:

swh@4f4897bb636a:/$ ls /tmp/swh-deposit/uploads/client_1/
swh-deposit.tar.gz  swh-deposit.tar_9bEbkyF.gz  swh-deposit.tar_LHM1Y7e.gz  swh-deposit.tar_ZdRLQAZ.gz  swh-deposit.tar_axXOAqS.gz

While this is fine for "single" extensions (such as .tgz), it is not for .tar.gz and the like: the random string ends up between .tar and .gz, so the file name no longer ends in .tar.gz.
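
To make the failure mode concrete, here is a minimal sketch that mimics the renaming pattern visible in the listing above. It is not Django's actual code: `alternative_name` and `looks_like_tar_gz` are hypothetical helpers, and the suffix test is only an assumed stand-in for however the checker recognizes archives.

```python
import os
import random
import string


def alternative_name(file_name, length=7):
    """Insert a random string before the last extension (hypothetical helper)."""
    # os.path.splitext only splits off the final extension, so
    # "swh-deposit.tar.gz" becomes ("swh-deposit.tar", ".gz").
    root, ext = os.path.splitext(file_name)
    alphabet = string.ascii_letters + string.digits
    suffix = "".join(random.choices(alphabet, k=length))
    return f"{root}_{suffix}{ext}"


def looks_like_tar_gz(file_name):
    """Assumed extension-based check; the real checker may differ."""
    return file_name.endswith(".tar.gz")


renamed_tgz = alternative_name("swh-deposit.tgz")
renamed_tar_gz = alternative_name("swh-deposit.tar.gz")
```

With a single extension the renamed file still ends in .tgz, whereas the compound case yields something like swh-deposit.tar_XXXXXXX.gz, which a suffix-based check no longer recognizes as .tar.gz.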

I believe the right fix would be to change Django's storage class to deduplicate file names in a different way (add the random string as a prefix, or create a directory): https://docs.djangoproject.com/en/3.0/topics/files/#the-built-in-filesystem-storage-class

In Django 3, we could simply subclass FileSystemStorage and override get_alternative_name; but the deposit currently uses Django 2, so we would need to override this entire function: https://github.com/django/django/blob/98ef3829e96ebc73d4d446f92465e671ff520d2b/django/core/files/storage.py#L63-L92
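
As an illustration of the "random string as prefix" idea, here is a minimal, framework-free sketch. `prefixed_alternative_name` is a hypothetical helper, not part of Django or the deposit code. Under Django 3, similar logic could presumably live in an override of `get_alternative_name(file_root, file_ext)` on a `FileSystemStorage` subclass, but that wiring is not shown here.

```python
import posixpath
import random
import string


def prefixed_alternative_name(file_name, length=7):
    """Prepend a random string to the base name, leaving all extensions intact.

    Hypothetical helper: storage names are assumed to use forward slashes.
    """
    directory, base = posixpath.split(file_name)
    alphabet = string.ascii_letters + string.digits
    prefix = "".join(random.choices(alphabet, k=length))
    return posixpath.join(directory, f"{prefix}_{base}")


deduplicated = prefixed_alternative_name("client_1/swh-deposit.tar.gz")
```

Because the random part sits in front of the base name, the result still ends in .tar.gz whatever the number of extensions.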

Event Timeline

vlorentz triaged this task as Normal priority. May 6 2020, 12:35 PM
vlorentz created this task.
vlorentz raised the priority of this task from Normal to High. May 28 2020, 11:54 AM
vlorentz updated the task description.

Raising priority, as we are hitting this issue in production.

Thanks @vlorentz.
Is this something that can be fixed easily or should I ask the client to change the names on the deposited archives?

vlorentz renamed this task from Two deposits of files with the same name and a "double" extension fail the checker to Two deposits of files with the same name and a "compound" extension fail the checker. May 29 2020, 9:58 AM

I believe the right fix would be to change Django's storage class to deduplicate file
names in a different way (add the random string as a prefix, or create a directory):
https://docs.djangoproject.com/en/3.0/topics/files/#the-built-in-filesystem-storage-class

We (@vlorentz and I) agreed on a simpler fix: changing the upload file path on
the deposit model (which was already defined). The current implementation is
thus D3197.

Cheers,

Deployed.

Triggered a check (deposit id 650) and everything is still fine.

Here is the path where archives are uploaded:

softwareheritage-deposit=> select dr.archive from deposit d inner join deposit_request dr on d.id=dr.deposit_id where d.id=650 and dr.archive is not null;
                    archive
-----------------------------------------------
 client_3/20200601-105256.359626/jesuisgpl.tgz

The timestamp 20200601-105256.359626 is built from the deposit reception
date, so file names should no longer clash.
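
For illustration, a path of that shape could be built as in the sketch below. This is not the code from D3197: `upload_path` is a hypothetical helper, the `client_<id>` directory naming is inferred from the paths shown in this task, and the format string is inferred from the timestamp in the query result.

```python
from datetime import datetime


def upload_path(client_id, reception_date, file_name):
    """Build <client dir>/<reception timestamp>/<original file name>.

    Hypothetical helper; the layout is inferred from the example path.
    """
    # %f renders microseconds as six zero-padded digits.
    stamp = reception_date.strftime("%Y%m%d-%H%M%S.%f")
    return f"client_{client_id}/{stamp}/{file_name}"


path = upload_path(3, datetime(2020, 6, 1, 10, 52, 56, 359626), "jesuisgpl.tgz")
```

Since each deposit gets its own timestamped directory, the original file name is stored unchanged and compound extensions such as .tar.gz are left alone.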

Closing this task. Feel free to reopen or comment if anything is wrong.

Cheers,