
The docker-dev build is often failing
Closed, Migrated, Edits Locked

Description

The docker-dev build is regularly failing for no obvious reason.

Two problems are identified:

  • timeouts during directory_ls:
09:56:25 swh-web_1                           | [06/Dec/2021 08:56:20] [DEBUG] Traceback (most recent call last):
09:56:25 swh-web_1                           |   File "/srv/softwareheritage/venv/lib/python3.7/site-packages/urllib3/connectionpool.py", line 445, in _make_request
09:56:25 swh-web_1                           |     six.raise_from(e, None)
09:56:25 swh-web_1                           |   File "<string>", line 3, in raise_from
09:56:25 swh-web_1                           | 172.19.0.6 - - [06/Dec/2021:08:56:20 +0000] "GET /api/1/directory/846f39f7fddd09c3c1ad2c7c9d3149b4fa055ff1/ HTTP/1.0" 500 4996 "-" "curl/7.64.0"
09:56:25 swh-web_1                           |   File "/srv/softwareheritage/venv/lib/python3.7/site-packages/urllib3/connectionpool.py", line 440, in _make_request
09:56:25 swh-web_1                           |     httplib_response = conn.getresponse()
09:56:25 swh-web_1                           |   File "/usr/local/lib/python3.7/http/client.py", line 1373, in getresponse
09:56:25 swh-web_1                           |     response.begin()
09:56:25 swh-web_1                           |   File "/usr/local/lib/python3.7/http/client.py", line 319, in begin
09:56:25 swh-web_1                           |     version, status, reason = self._read_status()
09:56:25 swh-web_1                           |   File "/usr/local/lib/python3.7/http/client.py", line 280, in _read_status
09:56:25 swh-web_1                           |     line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
09:56:25 swh-web_1                           |   File "/usr/local/lib/python3.7/socket.py", line 589, in readinto
09:56:25 swh-web_1                           |     return self._sock.recv_into(b)
09:56:25 swh-web_1                           | socket.timeout: timed out
  • concurrency issue between the swh-web and swh-web-cron startup, probably not a blocker:
09:53:18 swh-web-cron_1                      | Start periodic save code now refresh statuses routine (in background)
09:53:18 swh-web-cron_1                      | Mon Dec  6 08:52:56 UTC 2021
09:53:18 swh-web-cron_1                      | [06/Dec/2021 08:52:58] [DEBUG] urllib3.connectionpool._new_conn:231 - Starting new HTTP connection (1): swh-scheduler:5008
09:53:18 swh-web-cron_1                      | [06/Dec/2021 08:52:58] [DEBUG] urllib3.connectionpool._make_request:461 - http://swh-scheduler:5008 "POST /task_type/get_all HTTP/1.1" 200 5640
09:53:18 swh-web-cron_1                      | Traceback (most recent call last):
09:53:18 swh-web-cron_1                      |   File "/srv/softwareheritage/venv/lib/python3.7/site-packages/django/db/backends/utils.py", line 84, in _execute
09:53:18 swh-web-cron_1                      |     return self.cursor.execute(sql, params)
09:53:18 swh-web-cron_1                      | psycopg2.errors.UndefinedColumn: column save_origin_request.visit_status does not exist
09:53:18 swh-web-cron_1                      | LINE 1: ...ND ("save_origin_request"."visit_date" IS NULL OR "save_orig...
09:53:18 swh-web-cron_1                      |                                                              ^
09:53:18 swh-web-cron_1                      | The above exception was the direct cause of the following exception:
09:53:18 swh-web-cron_1                      | Traceback (most recent call last):
09:53:18 swh-web-cron_1                      |   File "/srv/softwareheritage/venv/bin/django-admin", line 8, in <module>
09:53:18 swh-web-cron_1                      |     sys.exit(execute_from_command_line())
09:53:18 swh-web-cron_1                      |   File "/srv/softwareheritage/venv/lib/python3.7/site-packages/django/core/management/__init__.py", line 381, in execute_from_command_line
09:53:18 swh-web-cron_1                      |     utility.execute()
09:53:18 swh-web-cron_1                      |   File "/srv/softwareheritage/venv/lib/python3.7/site-packages/django/core/management/__init__.py", line 375, in execute
09:53:18 swh-web-cron_1                      |     self.fetch_command(subcommand).run_from_argv(self.argv)
...
09:53:18 swh-web_1                           | starting the swh-web server
09:53:18 swh-web_1                           | [2021-12-06 08:53:02 +0000] [1] [INFO] Starting gunicorn 20.1.0
09:53:18 swh-web_1                           | [2021-12-06 08:53:02 +0000] [1] [INFO] Listening at: http://0.0.0.0:5004 (1)
09:53:18 swh-web_1                           | [2021-12-06 08:53:02 +0000] [1] [INFO] Using worker: gthread
09:53:18 swh-web_1                           | [2021-12-06 08:53:02 +0000] [42] [INFO] Booting worker with pid: 42
09:53:18 swh-web_1                           | [2021-12-06 08:53:02 +0000] [43] [INFO] Booting worker with pid: 43

Event Timeline

vsellier changed the task status from Open to Work in Progress. Dec 7 2021, 2:30 PM
vsellier triaged this task as Normal priority.
vsellier created this task.

concurrency issue between the swh-web and swh-web-cron startup, probably not a blocker

Adding wait-for-it swh-web:5004 -s --timeout=0 here should fix that issue.
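
For illustration, a minimal sketch of what that change could look like, assuming the swh-web-cron service is started from a shell entrypoint (file names and the exact startup command in swh-docker-dev may differ):

# hypothetical excerpt of the swh-web-cron entrypoint
# block until swh-web accepts connections on port 5004, with no time limit
wait-for-it swh-web:5004 -s --timeout=0
# ... then start the periodic "save code now" refresh loop as before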

Thanks @anlambert for the tip, it works as expected.

The last builds were successful and do not show any overly long response times.
Let's see tomorrow whether the response times are slower at the usual build time.

The running builds will be recorded every 30s with this command:

while true; do curl -s 'https://jenkins.softwareheritage.org/queue/api/json?pretty=true' > output-$(date +%Y%m%d-%H%M).json; sleep 30; date; done
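
The snapshots can then be summarized to see how many builds were queued at each point in time; a possible one-liner, assuming jq is installed (the Jenkins queue JSON lists pending builds under the items key):

for f in output-*.json; do echo "$f: $(jq '.items | length' "$f") queued builds"; done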

The timeout occurs after 1s on the swh-web side on a directory/ls call.

04:11:10 nginx_1                             | 172.23.0.1 - - [08/Dec/2021:03:11:09 +0000] "GET /api/1/directory/877df54c7dda406e9ad56ca09f793799aedbb26b/ HTTP/1.1" 500 4996 "-" "curl/7.64.0" 1.013

There is a high I/O load on the build server during the build. I could not determine whether it is caused by a high number of concurrent builds or by this build alone.

The configuration will be updated to increase the timeout of the queries to swh-storage; we could improve the swh-storage (or, more probably, the database) configuration later, once the build is more stable.
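
A possible shape of that change, assuming the timeout is the one set on the remote storage client in the swh-web configuration (the 1s cutoff seen in the logs suggests a value in seconds; the exact file and keys used in swh-docker-dev may differ):

storage:
  cls: remote
  url: http://swh-storage:5002/   # illustrative endpoint
  timeout: 5                      # illustrative value, raised from the presumed 1s setting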

No request took more than 1s during last night's build.
I will continue to monitor the builds and try to diagnose the problem more accurately.

It seems that since the 8th of December there have been no requests > 1s in the builds.
I will monitor it during the current week; if it does not occur again, I will change the status to resolved.

The build can be quickly checked with this small script: P1243
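
For reference, a rough reconstruction of what such a script can look like (the actual content of P1243 may differ; the last field is assumed to be a duration in milliseconds for the storage access log and in seconds for the nginx log, as the output below suggests):

#!/bin/bash
# Sketch of a build checker: download the console log of a swh-docker-dev
# build and list the requests that took more than 1s.
BUILD=$1
echo "Getting the logs..."
wget -O /tmp/log "https://jenkins.softwareheritage.org/job/swh-docker-dev/${BUILD}/consoleText"
echo "Extract storage logs"
grep 'swh-storage_1' /tmp/log > /tmp/storage.log
echo "Extract nginx logs"
grep 'nginx_1' /tmp/log > /tmp/nginx.log
echo "Requests >1s in storage"
# last field of the gunicorn access log: duration, assumed to be in ms
awk '$NF ~ /^[0-9]+$/ && $NF > 1000' /tmp/storage.log
echo
echo "Requests > 1s in nginx"
# last field of the nginx access log: duration in seconds
awk '$NF ~ /^[0-9.]+$/ && $NF > 1.0' /tmp/nginx.log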

The last failing build:

~/wip ❯ ./check_request_durations.sh 1049                                                                                                                                                                                                     09:45:27
Getting the logs...
--2021-12-14 09:48:37--  https://jenkins.softwareheritage.org/job/swh-docker-dev/1049/consoleText
Resolving jenkins.softwareheritage.org (jenkins.softwareheritage.org)... 128.93.166.12
Connecting to jenkins.softwareheritage.org (jenkins.softwareheritage.org)|128.93.166.12|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/plain]
Saving to: ‘/tmp/log’

/tmp/log                                                          [   <=>                                                                                                                                           ]   6.47M  12.9MB/s    in 0.5s    

2021-12-14 09:48:37 (12.9 MB/s) - ‘/tmp/log’ saved [6785854]

Extract storage logs
Extract nginx logs
Requests >1s in storage
swh-storage_1                       | [08/Dec/2021:03:07:47 +0000] POST /origin/visit/add HTTP/1.1 200 143 1966
swh-storage_1                       | [08/Dec/2021:03:08:27 +0000] POST /content/add HTTP/1.1 200 39 37919
swh-storage_1                       | [08/Dec/2021:03:08:30 +0000] POST /directory/add HTTP/1.1 200 18 1423
swh-storage_1                       | [08/Dec/2021:03:08:35 +0000] POST /revision/add HTTP/1.1 200 17 1319
swh-storage_1                       | [08/Dec/2021:03:08:38 +0000] POST /revision/add HTTP/1.1 200 17 1401
swh-storage_1                       | [08/Dec/2021:03:08:39 +0000] POST /snapshot/add HTTP/1.1 200 15 1002


Requests > 1s in nginx
nginx_1                             | 172.23.0.1 - - [08/Dec/2021:03:11:09 +0000] "GET /api/1/directory/877df54c7dda406e9ad56ca09f793799aedbb26b/ HTTP/1.1" 500 4996 "-" "curl/7.64.0" 1.013 <---- WRONG

A recent build, for comparison:

~/wip ❯ ./check_request_durations.sh 1056                                                                                                                                                                                                     09:48:37
Getting the logs...
--2021-12-14 09:49:24--  https://jenkins.softwareheritage.org/job/swh-docker-dev/1056/consoleText
Resolving jenkins.softwareheritage.org (jenkins.softwareheritage.org)... 128.93.166.12
Connecting to jenkins.softwareheritage.org (jenkins.softwareheritage.org)|128.93.166.12|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/plain]
Saving to: ‘/tmp/log’

/tmp/log                                                          [     <=>                                                                                                                                         ]  11.31M  12.1MB/s    in 0.9s    

2021-12-14 09:49:25 (12.1 MB/s) - ‘/tmp/log’ saved [11861712]

Extract storage logs
Extract nginx logs
Requests >1s in storage
swh-storage_1                       | [14/Dec/2021:08:30:31 +0000] POST /origin/visit/add HTTP/1.1 200 143 1980
swh-storage_1                       | [14/Dec/2021:08:30:55 +0000] POST /content/add HTTP/1.1 200 39 22040
swh-storage_1                       | [14/Dec/2021:08:30:58 +0000] POST /directory/add HTTP/1.1 200 18 1358


Requests > 1s in nginx

During the week, only one request took more than 1s.
As this looks rare and seems related to the load on the server during the build, I'm not sure it is worth investigating further.

The request:

~/wip ❯ ./check_request_durations.sh 1058                                                                                                                                  16:21:09
Getting the logs...
...

Requests > 1s in nginx
nginx_1                             | 172.22.0.1 - - [16/Dec/2021:03:04:00 +0000] "GET /api/1/content/sha1_git:f609a026cebedcf85a76f5367703c924f445523d/ HTTP/1.1" 200 750 "-" "curl/7.74.0" 7.723