
nixguix: fails to finish as downloading artifacts step hangs
Closed, Resolved (Public)

Description

Another issue exists: sometimes the worker just hangs forever. For example,
a nixguix process (running on worker0.internal.staging.swh.network) is
currently hanging on a download connection [1].

The only solution I see is to kill the process, which will leave the visit
unfinished (stuck in the ongoing state). That lends credit to the origin visit
reaper proposal, by the way (T2310#43199).

Adding a timeout to the download connection sounds sensible [2] to avoid
this kind of pitfall [3]. Quoting the requests documentation [2]: "Failure to
do so can cause your program to hang indefinitely". Well, we had been warned :D
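
As a minimal sketch of what such a timeout could look like with requests (this
is not the loader's actual code; the function name `download` and the two
default values are hypothetical): the `timeout` parameter accepts a
`(connect, read)` tuple, where the read value bounds the wait between bytes
received from the server, not the total download time. The hang shown in [1]
sits in `recvfrom`, so the read timeout is the one that would apply here.

```python
import requests

# Hypothetical defaults, not taken from the loader: seconds allowed to
# establish the connection, then seconds allowed between received bytes.
CONNECT_TIMEOUT = 5.0
READ_TIMEOUT = 30.0


def download(url, dest_path, timeout=(CONNECT_TIMEOUT, READ_TIMEOUT)):
    """Stream url into dest_path.

    Raises a requests.exceptions.RequestException subclass if the server
    cannot be reached or stops sending data for longer than the read timeout.
    """
    with requests.get(url, stream=True, timeout=timeout) as response:
        response.raise_for_status()
        with open(dest_path, "wb") as f:
            # Write the body chunk by chunk instead of loading it in memory.
            for chunk in response.iter_content(chunk_size=64 * 1024):
                f.write(chunk)
    return dest_path
```

A caller could then catch `requests.exceptions.RequestException` around
`download(...)` and mark the artifact as failed instead of blocking the whole
visit.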

Note: This problem is probably shared with other package loaders. Right now,
it's more obvious with this one as it processes a lot of artifacts in one run.

[1]

Last log entry as of now:

Apr 09 17:37:09 worker0 python3[1914]: [2020-04-09 17:37:09,838: DEBUG/ForkPoolWorker-1] package_info: {'url': 'http://ftp.ebi.ac.uk/pub/software/vertebrategenomics/exonerate/exonerate-2.4.0.tar.gz', 'raw': {'url': 'http://ftp.ebi.ac.uk/pub/software/vertebrategenomics/exonerate/exonerate-2.4.0.tar.gz', 'integrity': 'sha256-+EkmHcfJfvHxXyIulVsNPa+ZTsE8nbd2bxrH53uqQEI='}}

Running strace on the process shows it is currently waiting on file descriptor 95:

# strace -p 1914
strace: Process 1914 attached
recvfrom(95,

That file descriptor is a socket:

# file /proc/1914/fd/95
/proc/1914/fd/95: symbolic link to socket:[74794390]

Indeed, it's stuck on the HTTP connection:

root@worker0:~# lsof -p 1914 | grep 74794390
python3 1914 swhworker   95u     IPv4           74794390      0t0      TCP worker0.internal.staging.swh.network:58952->hx-xfer-prod.ebi.ac.uk:http (ESTABLISHED)
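
The `/proc/<pid>/fd` lookup above can also be scripted. The following is only
a sketch, assuming a Linux host and enough privileges to read the target
process's `/proc` entries; the helper name `socket_fds` is hypothetical.

```python
import os


def socket_fds(pid):
    """Return {fd number: link target} for the sockets held open by pid."""
    fd_dir = f"/proc/{pid}/fd"
    sockets = {}
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            # The descriptor was closed between listdir() and readlink().
            continue
        # Socket descriptors resolve to "socket:[<inode>]".
        if target.startswith("socket:["):
            sockets[int(name)] = target
    return sockets
```

The inode in the link target is the value to search for in the `lsof` output,
as done by hand above.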

[2] https://2.python-requests.org/en/master/user/quickstart/#timeouts

[3] Also related to downloads: we discussed with @lewo the possibility of
improving the download process by running downloads in parallel.

Event Timeline

ardumont triaged this task as Normal priority. Apr 11 2020, 11:58 AM
ardumont created this task.
ardumont renamed this task from nixguix: fails to finish as download hanging to nixguix: fails to finish as downloading artifacts step. Apr 11 2020, 12:58 PM
ardumont renamed this task from nixguix: fails to finish as downloading artifacts step to nixguix: fails to finish as downloading artifacts step hangs. Apr 11 2020, 4:50 PM
ardumont changed the task status from Open to Work in Progress. Apr 14 2020, 6:16 PM