Page MenuHomeSoftware Heritage

tar.loader: Make the loader-tar able to download remote artifact
ClosedPublic

Authored by ardumont on Dec 8 2018, 5:54 PM.

Details

Summary

Also:

  • Keep the legacy behavior (use the revision and snapshot's branch provided by the caller, e.g. the deposit loader depends on this)
  • Increase code coverage (50 -> 93%)
  • Remove dir loader inheritance (nightmare to maintain). We still need the dir loader for some functions but the main issue is resolved).
  • Remove deprecated producer code (we use the scheduler now)

Related T1351
Related T1431
Depends on D794

Use sample:

>>> url = 'https://ftp.gnu.org/gnu/8sync/8sync-0.1.0.tar.gz'
>>> origin = {'url': url, 'type': 'tar'}
>>> visit_date = 'Tue, 3 May 2017 17:16:32 +0200'
>>> last_modified = '2016-04-22 16:35'
>>> import logging
>>> logging.basicConfig(level=logging.DEBUG)
>>>
>>> from swh.loader.tar.tasks import LoadTarRepository
>>> l = LoadTarRepository()
>>> l.run_task(origin=origin, visit_date=visit_date,
...            last_modified=last_modified)
DEBUG:swh.scheduler.task.LoadTarRepository:Creating tar origin for https://ftp.gnu.org/gnu/8sync/8sync-0.1.0.tar.gz
DEBUG:swh.scheduler.task.LoadTarRepository:Done creating tar origin for https://ftp.gnu.org/gnu/8sync/8sync-0.1.0.tar.gz
DEBUG:swh.scheduler.task.LoadTarRepository:Creating origin_visit for origin 3 at time Tue, 3 May 2017 17:16:32 +0200
DEBUG:swh.scheduler.task.LoadTarRepository:Done Creating origin_visit for origin 3 at time Tue, 3 May 2017 17:16:32 +0200
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): ftp.gnu.org:443
DEBUG:urllib3.connectionpool:https://ftp.gnu.org:443 "GET /gnu/8sync/8sync-0.1.0.tar.gz HTTP/1.1" 200 221837
INFO:swh.scheduler.task.LoadTarRepository:Uncompress /tmp/swh.loader.tar._nh29j5e-18075/8sync-0.1.0.tar.gz to /tmp/swh.loader.tar._nh29j5e-18075/swh.loader.tar-7wox5_rx
DEBUG:swh.scheduler.task.LoadTarRepository:Sending 28 contents
DEBUG:swh.scheduler.task.LoadTarRepository:Done sending 28 contents
DEBUG:swh.scheduler.task.LoadTarRepository:Sending 7 directories
DEBUG:swh.scheduler.task.LoadTarRepository:Done sending 7 directories
DEBUG:swh.scheduler.task.LoadTarRepository:Sending 1 revisions
DEBUG:swh.scheduler.task.LoadTarRepository:Done sending 1 revisions
DEBUG:swh.scheduler.task.LoadTarRepository:Updating origin_visit for origin 3 with status full
DEBUG:swh.scheduler.task.LoadTarRepository:Done updating origin_visit for origin 3 with status full
DEBUG:swh.scheduler.task.LoadTarRepository:Clean up /tmp/swh.loader.tar._nh29j5e-18075
{'status': 'eventful'}
Test Plan

tox

Diff Detail

Repository
rDLDTAR Tarball Loader
Branch
improve-loader-tar
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 2961
Build 3777: tox-on-jenkinsJenkins
Build 3776: arc lint + arc unit

Event Timeline

Remove print statement T.T

Fix d/control tab instead of spaces T.T

Dependent diff landed so trigger the new build for this

ardumont marked an inline comment as done.
  • tar.tasks: Improve docstring and test the task
  • README: Update documentation and remove obsolete documentation
ardumont added inline comments.
swh/loader/tar/loader.py
175

That will need to be provided by the caller (so the lister here).

swh/loader/tar/tests/test_build.py
48

Apparently my initial mtime was wrong, thus the slight change here.
Other than that, i used the same date as before with another readable format.

52

I used the same date as before with another format.

swh/loader/tar/tests/test_loader.py
168

Same i use the same date as before but with another format.

Heads up, it's cool and all but:

  • the diff is too big so i'll split this in multiple ones.
  • i need to update the diff (done locally and tests ok \m/) to add back the legacy behavior as deposit depends on it.
  • loader.tar: Add back the legacy behavior (deposit depends on it)
  • tar.loader: Fix typos
  • resources/producer: Remove dead configuration
vlorentz added a subscriber: vlorentz.
vlorentz added inline comments.
debian/rules
4

Why this change?

swh/loader/tar/loader.py
79

with urllib.parse

153

Why not self.log.debug(...)?

175

Maybe rename the method to get_tarball_url_to_retrieve, to make it obvious it returns the value (rather than setting an object attribute)

179

raise NotImplementedError() (or make the class abstract)

228

s/tarballs/tarball/

284

rename it to LocalTarLoader?

This revision is now accepted and ready to land.Dec 10 2018, 9:34 AM
ardumont added inline comments.
debian/rules
4

The initial version is nose related.
That new one is pytest ;)

swh/loader/tar/loader.py
79

Thank you, will look into it.

153

because i want the user to be warned it will mess up it's local environment for debug reasons.
That matches other implementations btw (pypi, etc...)

175

Right

179

Right, i forgot!

228

indeed.

284

I wanted to avoid having to migrate the loader deposit.
I could do as you did with aliasing if you want.

ardumont marked an inline comment as done.
  • tar.loader: Rename method with clearer name
  • tar.loader: Explicit the need for an implementation
  • tar.loader: Fix typo
  • tar.loader: Improve class name and use alias for retro-compatibility
  • tar.loader: Improve class docstrings
  • loader: Only explicit needed params & adapt docstring accordingly
  • tar.loader: Improve docstring
  • tar.loader: Use urllib.parse.urlparse to actually parse url
This revision was automatically updated to reflect the committed changes.