Page MenuHomeSoftware Heritage

tar.loader: Make the loader-tar able to download remote artifact
ClosedPublic

Authored by ardumont on Dec 8 2018, 5:54 PM.

Details

Summary

Also:

  • Keep the legacy behavior (use the revision and snapshot's branch provided by the caller, e.g. the deposit loader depends on this)
  • Increase code coverage (50 -> 93%)
  • Remove dir loader inheritance (nightmare to maintain). We still need the dir loader for some functions but the main issue is resolved).
  • Remove deprecated producer code (we use the scheduler now)

Related T1351
Related T1431
Depends on D794

Use sample:

>>> url = 'https://ftp.gnu.org/gnu/8sync/8sync-0.1.0.tar.gz'
>>> origin = {'url': url, 'type': 'tar'}
>>> visit_date = 'Tue, 3 May 2017 17:16:32 +0200'
>>> last_modified = '2016-04-22 16:35'
>>> import logging
>>> logging.basicConfig(level=logging.DEBUG)
>>>
>>> from swh.loader.tar.tasks import LoadTarRepository
>>> l = LoadTarRepository()
>>> l.run_task(origin=origin, visit_date=visit_date,
...            last_modified=last_modified)
DEBUG:swh.scheduler.task.LoadTarRepository:Creating tar origin for https://ftp.gnu.org/gnu/8sync/8sync-0.1.0.tar.gz
DEBUG:swh.scheduler.task.LoadTarRepository:Done creating tar origin for https://ftp.gnu.org/gnu/8sync/8sync-0.1.0.tar.gz
DEBUG:swh.scheduler.task.LoadTarRepository:Creating origin_visit for origin 3 at time Tue, 3 May 2017 17:16:32 +0200
DEBUG:swh.scheduler.task.LoadTarRepository:Done Creating origin_visit for origin 3 at time Tue, 3 May 2017 17:16:32 +0200
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): ftp.gnu.org:443
DEBUG:urllib3.connectionpool:https://ftp.gnu.org:443 "GET /gnu/8sync/8sync-0.1.0.tar.gz HTTP/1.1" 200 221837
INFO:swh.scheduler.task.LoadTarRepository:Uncompress /tmp/swh.loader.tar._nh29j5e-18075/8sync-0.1.0.tar.gz to /tmp/swh.loader.tar._nh29j5e-18075/swh.loader.tar-7wox5_rx
DEBUG:swh.scheduler.task.LoadTarRepository:Sending 28 contents
DEBUG:swh.scheduler.task.LoadTarRepository:Done sending 28 contents
DEBUG:swh.scheduler.task.LoadTarRepository:Sending 7 directories
DEBUG:swh.scheduler.task.LoadTarRepository:Done sending 7 directories
DEBUG:swh.scheduler.task.LoadTarRepository:Sending 1 revisions
DEBUG:swh.scheduler.task.LoadTarRepository:Done sending 1 revisions
DEBUG:swh.scheduler.task.LoadTarRepository:Updating origin_visit for origin 3 with status full
DEBUG:swh.scheduler.task.LoadTarRepository:Done updating origin_visit for origin 3 with status full
DEBUG:swh.scheduler.task.LoadTarRepository:Clean up /tmp/swh.loader.tar._nh29j5e-18075
{'status': 'eventful'}
Test Plan

tox

Diff Detail

Repository
rDLDTAR Tarball Loader
Branch
improve-loader-tar
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 2955
Build 3765: tox-on-jenkinsJenkins
Build 3764: arc lint + arc unit

Event Timeline

Remove print statement T.T

Fix d/control tab instead of spaces T.T

Dependent diff landed so trigger the new build for this

ardumont marked an inline comment as done.
  • tar.tasks: Improve docstring and test the task
  • README: Update documentation and remove obsolete documentation
ardumont added inline comments.
swh/loader/tar/loader.py
169

That will need to be provided by the caller (so the lister here).

swh/loader/tar/tests/test_build.py
48

Apparently my initial mtime was wrong, thus the slight change here.
Other than that, i used the same date as before with another readable format.

52

I used the same date as before with another format.

swh/loader/tar/tests/test_loader.py
74

Same i use the same date as before but with another format.

Heads up, it's cool and all but:

  • the diff is too big so i'll split this in multiple ones.
  • i need to update the diff (done locally and tests ok \m/) to add back the legacy behavior as deposit depends on it.
  • loader.tar: Add back the legacy behavior (deposit depends on it)
  • tar.loader: Fix typos
  • resources/producer: Remove dead configuration
vlorentz added a subscriber: vlorentz.
vlorentz added inline comments.
debian/rules
4

Why this change?

swh/loader/tar/loader.py
157

Why not self.log.debug(...)?

167

with urllib.parse

169

Maybe rename the method to get_tarball_url_to_retrieve, to make it obvious it returns the value (rather than setting an object attribute)

169–170

raise NotImplementedError() (or make the class abstract)

204

s/tarballs/tarball/

218

rename it to LocalTarLoader?

This revision is now accepted and ready to land.Dec 10 2018, 9:34 AM
ardumont added inline comments.
debian/rules
4

The initial version is nose related.
That new one is pytest ;)

swh/loader/tar/loader.py
157

because i want the user to be warned it will mess up it's local environment for debug reasons.
That matches other implementations btw (pypi, etc...)

167

Thank you, will look into it.

169

Right

169–170

Right, i forgot!

204

indeed.

218

I wanted to avoid having to migrate the loader deposit.
I could do as you did with aliasing if you want.

ardumont marked an inline comment as done.
  • tar.loader: Rename method with clearer name
  • tar.loader: Explicit the need for an implementation
  • tar.loader: Fix typo
  • tar.loader: Improve class name and use alias for retro-compatibility
  • tar.loader: Improve class docstrings
  • loader: Only explicit needed params & adapt docstring accordingly
  • tar.loader: Improve docstring
  • tar.loader: Use urllib.parse.urlparse to actually parse url
This revision was automatically updated to reflect the committed changes.