Page MenuHomeSoftware Heritage

swh.deposit.models: Upload deposit archives to dedicated folder
ClosedPublic

Authored by ardumont on May 29 2020, 2:31 PM.

Details

Summary

This allows to avoid the clash names when depositing multiple archives with the
same name.

This serves the same purpose as file disambiguation needed to solve the issue.
In a, most possibly, simpler manner than changing altogether the current file
storage implementation.

The folder name is a timestamp to the microsecond. So it might not be completely
bulletproof though.

Related to T2393
Related to D3192

Test Plan

tox

Diff Detail

Repository
rDDEP Push deposit
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

Build is green

Patch application report for D3197 (id=11354)

Rebasing onto 4b4cebc477...

Current branch diff-target is up to date.
Changes applied before test
commit 366ac7aa2fe6cc909b6176008d5dd3dd2d90311d
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Fri May 29 14:27:07 2020 +0200

    swh.deposit.models: Upload deposit archives to dedicated folder
    
    This allows to avoid the clash names when depositing multiple archives with the
    same name.
    
    This serves the same purpose as file disambiguation needed to solve the issue.
    In a most possibly simpler manner than changing altogether the current file
    storage.
    
    The folder name is a timestamp to the second. So it might not be completely
    bulletproof though.
    
    Related to T2393
    Related to D3192

See https://jenkins.softwareheritage.org/job/DDEP/job/tests-on-diff/38/ for more details.

This revision is now accepted and ready to land.May 29 2020, 3:42 PM

I think you should change check_archive to detect Django's name changes (see D3192)

This revision now requires changes to proceed.May 29 2020, 3:44 PM

No test?

It's a proposal for now so no test indeed (+ other tests pass here already).

Also, i figure there would need for some migration.

The proposal is good imo. Maybe add milliseconds just to be sure though, in case the client is scripting multiple uploads at the same time

The proposal is good imo. Maybe add milliseconds just to be sure though, in
case the client is scripting multiple uploads at the same time

Unfortunately, I did not find the millisecond stanza in the strftime
documentation [1] the django docs [2] refers to.

Thus the "not bulletproof" solution part in the description.

[1] https://docs.python.org/3/library/time.html#time.strftime

[2] https://docs.djangoproject.com/en/2.2/ref/models/fields/#django.db.models.FileField.upload_to

Or you could just use a random value every time.

Oh also, since we'll be creating a lot of directories with this, we need something to remove them. (Or switch to prefixing a random string to the file name)

Another documentation mentions %f for micro-seconds... which seems to work:

In [1]: from datetime import datetime

In [2]: from time import strftime

In [3]: datetime.now().strftime("%Y%m%d-%H%M%S.%f")
Out[3]: '20200529-160744.270565'

Oh also, since we'll be creating a lot of directories with this, we need
something to remove them. (Or switch to prefixing a random string to the file
name)

I'm not sure I follow.

We do not clean those up, we store them. A reference to the archive path is
kept in the model (db: DepositRequest).

Thus, why i'm mentioning that your initial proposal could be the right
direction in the long run [2]

[2] https://forge.softwareheritage.org/D3192#78011

Also, i figure there would need for some migration.

We do not clean those up, they stay and a reference is kept to them in the
deposit db (as far as i understood django on that part).

Hey, this means no migration (I thought of migrating from the old values to the
new but no need). The old archive path will target the old files so we should
be good.

  • Use f-string, it's clearer
  • Add ".%f" to go to the microseconds

TODO:

  • Need to check where to add assertions around the change (it's already covered by other tests since it's blue in the diff)

Build is green

Patch application report for D3197 (id=11370)

Rebasing onto 4b4cebc477...

Current branch diff-target is up to date.
Changes applied before test
commit 9835b8ba367b9c8c2fe5f6949e652b51cdec8a51
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Fri May 29 14:27:07 2020 +0200

    swh.deposit.models: Upload deposit archives to dedicated folder
    
    This allows to avoid the clash names when depositing multiple archives with the
    same name.
    
    This serves the same purpose as file disambiguation needed to solve the issue.
    In a most possibly simpler manner than changing altogether the current file
    storage.
    
    The folder name is a timestamp to the second. So it might not be completely
    bulletproof though.
    
    Related to T2393
    Related to D3192

See https://jenkins.softwareheritage.org/job/DDEP/job/tests-on-diff/39/ for more details.

swh/deposit/models.py
180–183

lol, pdb-ing this, that does not work...
The strftime thing described in the docs must work if we don't use a callable.

Nonetheless, we can do this ourselves...

reception_date = instance.deposit.reception_date
folder = reception_date.strftime('%Y%m%d-%H%M%S.%f')
return f"client_{instance.deposit.client.id}/{folder}/{filename}"

which I found clearer in the end.

swh/deposit/models.py
180–183

Now we are talking ;)

'client_1/20200601-092624.421886/archive1.zip'

Note: Also I was uneasy about the %f which was not in the official doc, so now i know it's working \m/

  • Add tests
  • Improve implementation

Need to check where to add assertions around the change (it's already covered
by other tests since it's blue in the diff)

done

Build is green

Patch application report for D3197 (id=11374)

Rebasing onto 4b4cebc477...

Current branch diff-target is up to date.
Changes applied before test
commit c586ff17402790e5de44aa59d69092fb35cd28b7
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Fri May 29 14:27:07 2020 +0200

    swh.deposit.models: Upload deposit archives to dedicated folder
    
    This allows to avoid the clash names when depositing multiple archives with the
    same name.
    
    The folder name is a timestamp to the microsecond so there should no longer be
    any clash (famous last work and all that).
    
    Related to T2393
    Related to D3192

See https://jenkins.softwareheritage.org/job/DDEP/job/tests-on-diff/40/ for more details.

This revision was not accepted when it landed; it landed in state Needs Review.Jun 1 2020, 12:25 PM
This revision was automatically updated to reflect the committed changes.

We do not clean those up, we store them. A reference to the archive path is
kept in the model (db: DepositRequest).

We do both. Django uploads files to /tmp, which we copy to a permanent location and clean.

We do both. Django uploads files to /tmp, which we copy to a permanent location and clean.

Ack. So it's missing a clean up routine somewhere (around the temporary location)?

We do both. Django uploads files to /tmp, which we copy to a permanent location and clean.

Ack. So it's missing a clean up routine somewhere (around the temporary location)?

Everything is fine. Checked the deposit server, no dangling files in /tmp/swh-deposit/archive.