There is another issue: sometimes the worker just hangs forever. For example,
a nixguix process (running on worker0.internal.staging.swh.network) is
currently hanging on a download connection [1].
The only solution I see is to kill the process, which will leave the visit
unfinished (stuck in the ongoing state). That lends credit to the origin visit
reaper proposal, btw (T2310#43199).
Adding a timeout to the download connection sounds sensible [2] to avoid
this kind of pitfall [3]. Quoting the requests documentation [2]: "Failure to
do so can cause your program to hang indefinitely". Well, we had been warned :D
Note: this is probably shared with the other package loaders. Right now, it's
more obvious with this one as it processes a lot of artifacts in one run.
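For what it's worth, here is a minimal sketch of what such a timeout could look like with requests. The `download` helper, its parameters and the timeout values are made up for illustration; this is not the loader's actual code.

```python
import requests

# Hypothetical defaults, in seconds: (connect timeout, read timeout).
DEFAULT_TIMEOUT = (10, 60)


def download(url, dest_path, timeout=DEFAULT_TIMEOUT, chunk_size=8192):
    """Stream `url` to `dest_path`.

    With a timeout set, a server that stops sending data makes requests
    raise an exception instead of blocking forever in recvfrom().
    """
    with requests.get(url, stream=True, timeout=timeout) as response:
        response.raise_for_status()
        with open(dest_path, "wb") as f:
            # Each read from the socket is bounded by the read timeout.
            for chunk in response.iter_content(chunk_size=chunk_size):
                f.write(chunk)
    return dest_path
```

Note that the read timeout bounds the wait between bytes received, not the total download time, which is what a recvfrom() that never returns calls for. A stall while waiting for the response surfaces as requests.exceptions.ReadTimeout; a stall in the middle of the body surfaces from iter_content as requests.exceptions.ConnectionError. Catching requests.exceptions.RequestException covers both, so the caller could then mark the visit as failed instead of leaving it ongoing.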
[1]
Last log entry as of now:
Apr 09 17:37:09 worker0 python3[1914]: [2020-04-09 17:37:09,838: DEBUG/ForkPoolWorker-1] package_info: {'url': 'http://ftp.ebi.ac.uk/pub/software/vertebrategenomics/exonerate/exonerate-2.4.0.tar.gz', 'raw': {'url': 'http://ftp.ebi.ac.uk/pub/software/vertebrategenomics/exonerate/exonerate-2.4.0.tar.gz', 'integrity': 'sha256-+EkmHcfJfvHxXyIulVsNPa+ZTsE8nbd2bxrH53uqQEI='}}
Stracing the issue, it's currently waiting on file descriptor 95:
# strace -p 1914
strace: Process 1914 attached
recvfrom(95,
Which points to a socket:
# file /proc/1914/fd/95
/proc/1914/fd/95: symbolic link to socket:[74794390]
Indeed, it's stuck on the HTTP connection:
root@worker0:~# lsof -p 1914 | grep 74794390
python3 1914 swhworker 95u IPv4 74794390 0t0 TCP worker0.internal.staging.swh.network:58952->hx-xfer-prod.ebi.ac.uk:http (ESTABLISHED)
[2] https://2.python-requests.org/en/master/user/quickstart/#timeouts
[3] Also, related to downloads: we discussed with @lewo the possibility of
improving the download process by running downloads in parallel.