
nixguix: Add support for downloads over FTP
Closed, Migrated (edits locked)


python-requests does not support downloads over FTP by default, which causes errors like this one:

This could possibly be done using

Event Timeline

vlorentz triaged this task as Normal priority. Oct 12 2020, 2:18 PM
vlorentz created this task.
vlorentz added a project: Nixguix loader.

Hey @vlorentz, can you please give me some hints for this and an example URL for testing the code?

Run the nixguix loader with url=, and you'll get a bunch of FTP-related errors, e.g. this one

Hey @vlorentz, the sentry link isn't working (or maybe isn't publicly accessible).

I tried using sudo docker-compose exec swh-loader swh loader run nixguix "" on my self-hosted swh instance and got the following error:

ERROR:swh.loader.package.loader:Failed to initialize origin_visit for
Traceback (most recent call last):
  File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/loader/package/", line 389, in load[origin])
  File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/core/api/", line 181, in meth_
    return, post_data)
  File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/core/api/", line 278, in post
    return self._decode_response(response)
  File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/core/api/", line 354, in _decode_response
  File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/storage/api/", line 29, in raise_for_status
  File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/core/api/", line 344, in raise_for_status
    raise exception from None
swh.core.api.RemoteException: <RemoteException 500 KafkaDeliveryError: ['flush() exceeded timeout (120s)', [['origin', {'url': ''}, 'No delivery before flush() timeout', 'SWH_FLUSH_TIMEOUT']]]>
{'status': 'failed'}

The error is coming from Kafka. I initiated the containers using docker-compose up -f docker-compose.yml. Did I do anything wrong here?

That's an unrelated error, could you open a task for this?

The second URL [2] that val pointed to works for me (the first one [1] does not, it returns a 404).
Here is the content of the second URL [2], inlined just in case:

InvalidSchema: No connection adapters were found for ''

EXCEPTION(most recent call first)
InvalidSchema: No connection adapters were found for ''
  File "swh/loader/package/", line 576, in load
    res = self._load_revision(p_info, origin)
  File "swh/loader/package/", line 713, in _load_revision
    dl_artifacts = self.download_package(p_info, tmpdir)
  File "swh/loader/package/", line 364, in download_package
    return [download(p_info.url, dest=tmpdir, filename=p_info.filename)]
  File "swh/loader/package/", line 79, in download
    response = requests.get(url, **params, timeout=timeout, stream=True)
  File "requests/", line 75, in get
    return request('get', url, params=params, **kwargs)
  File "requests/", line 60, in request
    return session.request(method=method, url=url, **kwargs)
  File "requests/", line 533, in request
    resp = self.send(prep, **send_kwargs)
  File "requests/", line 640, in send
    adapter = self.get_adapter(url=request.url)
  File "requests/", line 731, in get_adapter
    raise InvalidSchema("No connection adapters were found for '%s'" % url)

You might be able to reproduce it directly in a unit test, though,
without necessarily running a full-fledged nixguix loader in docker.
Copy, paste, and adapt an existing test with an ftp url like.


[2] T2687#62313

This could possibly be done using

I'm not entirely sold on using that repository,
given what its description says and its lack of tests (again according to the readme/description):

This library is not intended to be an example of Transport Adapters best practices. This library was cowboyed together in about 4 hours of total work, has no tests, and relies on a few ugly hacks. Instead, it is intended as both a starting point for future development and a useful example for how to implement transport adapters.

Possibly a combination of a requests adapter [1] and urllib.request.urlretrieve [2]
could be enough instead.



$ cd swh-environment
(swh) $ ipython
Python 3.7.3 (default, Apr  3 2019, 05:39:12)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.20.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import urllib.request

In [3]: url = ''

In [4]: import os

In [7]: os.path.exists('file')
Out[7]: False

In [10]: urllib.request.urlretrieve(url, 'file')
Out[10]: ('file', <email.message.Message at 0x7fd9c73b5c50>)

In [11]: os.path.exists('file')
Out[11]: True
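
The session above suggests urllib can fetch what requests cannot. Below is a minimal sketch of how a download helper might route such URLs to urllib. The names `URLLIB_SCHEMES`, `needs_urllib` and `download_with_urllib` are hypothetical and not part of swh-loader-core, and `file` is only included so the sketch can be tried without network access.

```python
import os
import tempfile
import urllib.request
from pathlib import Path
from urllib.parse import urlparse

# Schemes that requests cannot fetch out of the box but urllib can.
# "file" is an assumption added only so the sketch can run offline.
URLLIB_SCHEMES = {"ftp", "file"}


def needs_urllib(url: str) -> bool:
    """Return True when the URL scheme should bypass requests."""
    return urlparse(url).scheme.lower() in URLLIB_SCHEMES


def download_with_urllib(url: str, dest: str, filename: str) -> str:
    """Fetch url into dest/filename with urlretrieve and return the local path."""
    path = os.path.join(dest, filename)
    urllib.request.urlretrieve(url, path)
    return path


if __name__ == "__main__":
    # Offline demonstration: a local file:// URL stands in for an FTP one.
    with tempfile.TemporaryDirectory() as tmpdir:
        source = Path(tmpdir) / "source.txt"
        source.write_text("hello")
        copy = download_with_urllib(source.as_uri(), tmpdir, "copy.txt")
        print(Path(copy).read_text())
```

A caller could check `needs_urllib(url)` first and fall back to the existing requests-based path for http(s) URLs. Note that, unlike the requests call in the traceback, `urlretrieve` takes no per-call timeout, so that would need separate handling.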

Also, I just realized I should have mentioned this earlier, @KShivendu.
To reproduce the issue, there's no need for a loader or anything else,
just ipython in your venv:

$ workon swh
(swh) $ ipython
Python 3.7.3 (default, Apr  3 2019, 05:39:12)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.20.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import requests

In [2]: url = ''

In [3]: requests.get(url)
InvalidSchema                             Traceback (most recent call last)
<ipython-input-3-b80aa89477da> in <module>
----> 1 requests.get(url)

~/.virtualenvs/swh/lib/python3.7/site-packages/requests/ in get(url, params, **kwargs)
     73     """
---> 75     return request('get', url, params=params, **kwargs)

~/.virtualenvs/swh/lib/python3.7/site-packages/requests/ in request(method, url, **kwargs)
     59     # cases, and look like a memory leak in others.
     60     with sessions.Session() as session:
---> 61         return session.request(method=method, url=url, **kwargs)

~/.virtualenvs/swh/lib/python3.7/site-packages/requests/ in request(self, method, url, params, data, headers, cookies, files, auth, timeout, allow_redirects, proxies, hooks, stream, verify, cert, json)
    540         }
    541         send_kwargs.update(settings)
--> 542         resp = self.send(prep, **send_kwargs)
    544         return resp

~/.virtualenvs/swh/lib/python3.7/site-packages/requests/ in send(self, request, **kwargs)
    648         # Get the appropriate adapter to use
--> 649         adapter = self.get_adapter(url=request.url)
    651         # Start time (approximately) of the request

~/.virtualenvs/swh/lib/python3.7/site-packages/requests/ in get_adapter(self, url)
    741         # Nothing matches :-/
--> 742         raise InvalidSchema("No connection adapters were found for {!r}".format(url))
    744     def close(self):

InvalidSchema: No connection adapters were found for ''