nixguix: Add support for downloads over FTP
Closed, Resolved, Public

Description

python-requests does not support download via FTP by default, so this causes errors like this one: https://sentry.softwareheritage.org/share/issue/96e6c7c0442e4343919d510c9a0fc977/

This could possibly be done using https://pypi.org/project/requests-ftp/

Event Timeline

vlorentz triaged this task as Normal priority. Oct 12 2020, 2:18 PM
vlorentz created this task.
vlorentz added a project: Nixguix loader.

Hey @vlorentz, can you please give me some hints for this and an example URL for testing the code?

Run the nixguix loader with url=https://guix.gnu.org/sources.json and you'll get a bunch of FTP-related errors, e.g. this one: https://sentry.softwareheritage.org/share/issue/7dc92745d96442d482a493c64b6eae91/

Hey @vlorentz, the sentry link isn't working (or maybe isn't publicly accessible).

I tried running sudo docker-compose exec swh-loader swh loader run nixguix "https://guix.gnu.org/sources.json" on my self-hosted swh instance and got the following error:

ERROR:swh.loader.package.loader:Failed to initialize origin_visit for https://guix.gnu.org/sources.json
Traceback (most recent call last):
  File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/loader/package/loader.py", line 389, in load
    self.storage.origin_add([origin])
  File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/core/api/__init__.py", line 181, in meth_
    return self.post(meth._endpoint_path, post_data)
  File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/core/api/__init__.py", line 278, in post
    return self._decode_response(response)
  File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/core/api/__init__.py", line 354, in _decode_response
    self.raise_for_status(response)
  File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/storage/api/client.py", line 29, in raise_for_status
    super().raise_for_status(response)
  File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/core/api/__init__.py", line 344, in raise_for_status
    raise exception from None
swh.core.api.RemoteException: <RemoteException 500 KafkaDeliveryError: ['flush() exceeded timeout (120s)', [['origin', {'url': 'https://guix.gnu.org/sources.json'}, 'No delivery before flush() timeout', 'SWH_FLUSH_TIMEOUT']]]>
{'status': 'failed'}

The error is coming from Kafka. I initiated the containers using docker-compose up -f docker-compose.search.yml docker-compose.yml. Did I do anything wrong here?

That's an unrelated error, could you open a task for this?

The second URL [2] from val works for me (the first one [1] returns a 404, though).
Here is the content from URL [2], inlined just in case:

InvalidSchema: No connection adapters were found for 'ftp://ftp.ourproject.org/pub/ytalk/ytalk-3.3.0.tar.gz'

EXCEPTION (most recent call first)
InvalidSchema: No connection adapters were found for 'ftp://ftp.ourproject.org/pub/ytalk/ytalk-3.3.0.tar.gz'
  File "swh/loader/package/loader.py", line 576, in load
    res = self._load_revision(p_info, origin)
  File "swh/loader/package/loader.py", line 713, in _load_revision
    dl_artifacts = self.download_package(p_info, tmpdir)
  File "swh/loader/package/loader.py", line 364, in download_package
    return [download(p_info.url, dest=tmpdir, filename=p_info.filename)]
  File "swh/loader/package/utils.py", line 79, in download
    response = requests.get(url, **params, timeout=timeout, stream=True)
  File "requests/api.py", line 75, in get
    return request('get', url, params=params, **kwargs)
  File "requests/api.py", line 60, in request
    return session.request(method=method, url=url, **kwargs)
  File "requests/sessions.py", line 533, in request
    resp = self.send(prep, **send_kwargs)
  File "requests/sessions.py", line 640, in send
    adapter = self.get_adapter(url=request.url)
  File "requests/sessions.py", line 731, in get_adapter
    raise InvalidSchema("No connection adapters were found for '%s'" % url)

You might be able to reproduce it directly in a unit test, though,
without running a full-fledged nixguix loader in docker.
Copy and adapt an existing test, using an ftp URL like the one in the traceback above.

[1] https://sentry.softwareheritage.org/share/issue/96e6c7c0442e4343919d510c9a0fc977/

[2] T2687#62313

This could possibly be done using https://pypi.org/project/requests-ftp/

I'm not entirely sold on using that library, given what its description says
and its lack of tests (again, according to its readme/description):

This library is not intended to be an example of Transport Adapters best practices. This library was cowboyed together in about 4 hours of total work, has no tests, and relies on a few ugly hacks. Instead, it is intended as both a starting point for future development and a useful example for how to implement transport adapters.

Possibly a composition of a requests transport adapter [1] and urllib.request.urlretrieve [2]
could be enough instead.

[1] https://3.python-requests.org/user/advanced/#transport-adapters

[2]

$ cd swh-environment
(swh) $ ipython
Python 3.7.3 (default, Apr  3 2019, 05:39:12)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.20.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import urllib.request

In [3]: url = 'ftp://ftp.ourproject.org/pub/ytalk/ytalk-3.3.0.tar.gz'

In [4]: import os

In [7]: os.path.exists('file')
Out[7]: False

In [10]: urllib.request.urlretrieve(url, 'file')
Out[10]: ('file', <email.message.Message at 0x7fd9c73b5c50>)

In [11]: os.path.exists('file')
Out[11]: True
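
To make the idea above more concrete, here is a minimal sketch of dispatching on the URL scheme: urllib.request.urlretrieve for schemes requests cannot route, requests for everything else. The names `URLLIB_SCHEMES`, `pick_backend` and `download_any` are hypothetical and are not part of swh.loader.package. The sketch does not use a transport adapter and omits the checks that the real download helper may perform.

```python
import os
import urllib.request
from urllib.parse import urlparse

# Schemes delegated to urllib rather than requests (an assumption of this sketch).
URLLIB_SCHEMES = {"ftp", "file"}


def pick_backend(url: str) -> str:
    """Return "urllib" for schemes requests has no adapter for, else "requests"."""
    scheme = urlparse(url).scheme.lower()
    return "urllib" if scheme in URLLIB_SCHEMES else "requests"


def download_any(url: str, dest: str, filename: str, timeout: int = 60) -> str:
    """Hypothetical helper: fetch url into dest/filename and return the local path."""
    filepath = os.path.join(dest, filename)
    if pick_backend(url) == "urllib":
        # urlretrieve copies the remote object into the given local file.
        urllib.request.urlretrieve(url, filepath)
    else:
        # Imported lazily so the urllib branch works even without requests.
        import requests

        response = requests.get(url, timeout=timeout, stream=True)
        response.raise_for_status()
        with open(filepath, "wb") as f:
            for chunk in response.iter_content(chunk_size=8192):
                f.write(chunk)
    return filepath
```

Calling `download_any('ftp://ftp.ourproject.org/pub/ytalk/ytalk-3.3.0.tar.gz', tmpdir, 'ytalk-3.3.0.tar.gz')` would go through the urllib branch, as in the session above. Note that urlretrieve takes no timeout argument, so the `timeout` parameter only applies to the requests branch here.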

Also, I just realized I should have mentioned this earlier, @KShivendu.
To reproduce the issue, there is no need for a loader or anything else,
just ipython in your venv:

$ workon swh
(swh) $ ipython
Python 3.7.3 (default, Apr  3 2019, 05:39:12)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.20.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import requests

In [2]: url = 'ftp://ftp.ourproject.org/pub/ytalk/ytalk-3.3.0.tar.gz'

In [3]: requests.get(url)
---------------------------------------------------------------------------
InvalidSchema                             Traceback (most recent call last)
<ipython-input-3-b80aa89477da> in <module>
----> 1 requests.get(url)

~/.virtualenvs/swh/lib/python3.7/site-packages/requests/api.py in get(url, params, **kwargs)
     73     """
     74
---> 75     return request('get', url, params=params, **kwargs)
     76
     77

~/.virtualenvs/swh/lib/python3.7/site-packages/requests/api.py in request(method, url, **kwargs)
     59     # cases, and look like a memory leak in others.
     60     with sessions.Session() as session:
---> 61         return session.request(method=method, url=url, **kwargs)
     62
     63

~/.virtualenvs/swh/lib/python3.7/site-packages/requests/sessions.py in request(self, method, url, params, data, headers, cookies, files, auth, timeout, allow_redirects, proxies, hooks, stream, verify, cert, json)
    540         }
    541         send_kwargs.update(settings)
--> 542         resp = self.send(prep, **send_kwargs)
    543
    544         return resp

~/.virtualenvs/swh/lib/python3.7/site-packages/requests/sessions.py in send(self, request, **kwargs)
    647
    648         # Get the appropriate adapter to use
--> 649         adapter = self.get_adapter(url=request.url)
    650
    651         # Start time (approximately) of the request

~/.virtualenvs/swh/lib/python3.7/site-packages/requests/sessions.py in get_adapter(self, url)
    740
    741         # Nothing matches :-/
--> 742         raise InvalidSchema("No connection adapters were found for {!r}".format(url))
    743
    744     def close(self):

InvalidSchema: No connection adapters were found for 'ftp://ftp.ourproject.org/pub/ytalk/ytalk-3.3.0.tar.gz'
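
The same failure can probably be checked without any network access, since the traceback shows the error being raised during adapter lookup, before a connection is attempted. A minimal sketch, assuming requests is installed; `has_adapter` is a hypothetical helper name:

```python
import requests

FTP_URL = "ftp://ftp.ourproject.org/pub/ytalk/ytalk-3.3.0.tar.gz"


def has_adapter(url: str) -> bool:
    """Return True if a default requests session can route this URL."""
    with requests.Session() as session:
        try:
            # Looks up the mounted adapter by URL prefix; no request is sent.
            session.get_adapter(url)
        except requests.exceptions.InvalidSchema:
            return False
    return True
```

A unit test could then assert that `has_adapter(FTP_URL)` is false on a default session, and becomes true once FTP support is added in a way that mounts an adapter on the session.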