Page MenuHomeSoftware Heritage

Fix vault end-to-end check
Closed, MigratedEdits Locked

Description

Something is amiss in that check somehow [1] [2]

It's working from the main archive, taking the directory id from the stacktrace, make that a swhid, lookup for it, trigger the cooking.
Result, it's cooked and the download can happen so it works afaiui.

[1] https://icinga.softwareheritage.org/dashboard#!/monitoring/service/show?host=pergamon.softwareheritage.org&service=Check%20vault%20end-to-end

[2] Reproduced directly from the host triggering the check:

swhworker@pergamon:~$ /usr/bin/swh icinga_plugins check-vault --swh-storage-url http://uffizi.internal.softwareheritage.org:5002 --swh-web-url https://archive.softwareheritage.org directory
Traceback (most recent call last):
  File "/usr/bin/swh", line 11, in <module>
    load_entry_point('swh.core==0.8.0', 'console_scripts', 'swh')()
  File "/usr/lib/python3/dist-packages/swh/core/cli/__init__.py", line 135, in main
    return swh(auto_envvar_prefix="SWH")
  File "/usr/lib/python3/dist-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/usr/lib/python3/dist-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/lib/python3/dist-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/lib/python3/dist-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/lib/python3/dist-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/lib/python3/dist-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/click/decorators.py", line 17, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/usr/lib/python3/dist-packages/swh/icinga_plugins/cli.py", line 56, in check_vault_directory
    sys.exit(VaultCheck(ctx.obj).main())
  File "/usr/lib/python3/dist-packages/swh/icinga_plugins/vault.py", line 61, in main
    assert response.status_code == 200, (response, response.text)
AssertionError: (<Response [404]>, '{"exception":"NotFoundExc","reason":"Cooking of directory \'a62945e49c922e2eeded1cbf4c957396a1e7e321\' was never requested."}')
swhworker@pergamon:~$ /usr/bin/swh icinga_plugins check-vault --swh-storage-url http://uffizi.internal.softwareheritage.org:5002 --swh-web-url https://archive.softwareheritage.org directory
Traceback (most recent call last):
  File "/usr/bin/swh", line 11, in <module>
    load_entry_point('swh.core==0.8.0', 'console_scripts', 'swh')()
  File "/usr/lib/python3/dist-packages/swh/core/cli/__init__.py", line 135, in main
    return swh(auto_envvar_prefix="SWH")
  File "/usr/lib/python3/dist-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/usr/lib/python3/dist-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/lib/python3/dist-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/lib/python3/dist-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/lib/python3/dist-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/lib/python3/dist-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/click/decorators.py", line 17, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/usr/lib/python3/dist-packages/swh/icinga_plugins/cli.py", line 56, in check_vault_directory
    sys.exit(VaultCheck(ctx.obj).main())
  File "/usr/lib/python3/dist-packages/swh/icinga_plugins/vault.py", line 61, in main
    assert response.status_code == 200, (response, response.text)
AssertionError: (<Response [404]>, '{"exception":"NotFoundExc","reason":"Cooking of directory \'436807ce5f6c5e98cf05c85b1f6058db38bf1c8c\' was never requested."}')

Event Timeline

ardumont triaged this task as Normal priority.Nov 13 2020, 12:41 PM
ardumont created this task.

After an upgrade of the packages on pergamon and vangogh, the error is now :

Feb 17 10:49:38 vangogh python3[1990225]: 2021-02-17 10:49:38 [1990225] root:ERROR <RemoteException 500 InvalidDatetimeFormat: ['invalid input syntax for type timestamp with time zone: "Timestamp(seconds=1613558977, nanoseconds=999614000)"\nCONTEXT:  COPY tmp_task, line 1, column next_run: "Timestamp(seconds=1613558977, nanoseconds=999614000)"\n']>
                                          Traceback (most recent call last):
                                            File "/usr/lib/python3/dist-packages/swh/core/api/asynchronous.py", line 71, in middleware_handler
                                              return await handler(request)
                                            File "/usr/lib/python3/dist-packages/swh/core/api/asynchronous.py", line 178, in decorated_meth
                                              result = obj_meth(**kw)
                                            File "/usr/lib/python3/dist-packages/swh/core/db/common.py", line 62, in _meth
                                              return meth(self, *args, db=db, cur=cur, **kwargs)
                                            File "/usr/lib/python3/dist-packages/swh/vault/backend.py", line 220, in cook
                                              self.create_task(obj_type, obj_id, sticky)
                                            File "/usr/lib/python3/dist-packages/swh/core/db/common.py", line 62, in _meth
                                              return meth(self, *args, db=db, cur=cur, **kwargs)
                                            File "/usr/lib/python3/dist-packages/swh/vault/backend.py", line 163, in create_task
                                              task_id = self._send_task(obj_type, hex_id)
                                            File "/usr/lib/python3/dist-packages/swh/vault/backend.py", line 139, in _send_task
                                              added_tasks = self.scheduler.create_tasks([task])
                                            File "/usr/lib/python3/dist-packages/swh/core/api/__init__.py", line 181, in meth_
                                              return self.post(meth._endpoint_path, post_data)
                                            File "/usr/lib/python3/dist-packages/swh/core/api/__init__.py", line 278, in post
                                              return self._decode_response(response)
                                            File "/usr/lib/python3/dist-packages/swh/core/api/__init__.py", line 354, in _decode_response
                                              self.raise_for_status(response)
                                            File "/usr/lib/python3/dist-packages/swh/core/api/__init__.py", line 344, in raise_for_status
                                              raise exception from None
                                          swh.core.api.RemoteException: <RemoteException 500 InvalidDatetimeFormat: ['invalid input syntax for type timestamp with time zone: "Timestamp(seconds=1613558977, nanoseconds=999614000)"\nCONTEXT:  COPY tmp_task, line 1, column next_run: "Timestamp(seconds=1613558977, nanoseconds=999614000)"\n']>
vsellier changed the task status from Open to Work in Progress.Feb 17 2021, 11:51 AM
vsellier moved this task from Backlog to in-progress on the System administration board.

It seems the scheduler has missed some updates. After an upgrade of the python3-swh-.* packages, the error is again the initial one.

after digging, it seems the request with a 404 return code are cached by varnish.
When the test is launched, a first request is done which returns a 404 and the post is issued. When the check try to get the status of the cooking, the initial 404 is returned by varnish

For the archive access logs

::1 - - [17/Feb/2021:11:36:28 +0000] "GET /api/1/vault/directory/b029c7a9da9d1e6fe28c852ab1c5329c24ff4bf3/ HTTP/1.1" 404 123 "-" "python-requests/2.21.0" 105
::1 - - [17/Feb/2021:11:36:28 +0000] "POST /api/1/vault/directory/b029c7a9da9d1e6fe28c852ab1c5329c24ff4bf3/ HTTP/1.1" 200 184 "-" "python-requests/2.21.0" 483

The header of the responses:
200 no cache:

HTTP/1.1 200 OK
Accept-Ranges: bytes
Age: 0
Allow: GET, POST, OPTIONS
Cache-Control: max-age=0, no-cache, no-store, must-revalidate
Connection: keep-alive
Content-Encoding: gzip
Content-Length: 189
Content-Type: application/json
Date: Wed, 17 Feb 2021 11:40:03 GMT
Expires: Wed, 17 Feb 2021 11:40:03 GMT
Server: gunicorn/19.9.0
Strict-Transport-Security: max-age=15768000;
Vary: Accept,Cookie,Accept-Encoding
Via: 1.1 archive.softwareheritage.org
Via: 1.1 varnish (Varnish/6.1)
X-Frame-Options: SAMEORIGIN
X-RateLimit-Limit: 60
X-RateLimit-Remaining: 58
X-RateLimit-Reset: 1613562004
X-Varnish: 33878

With a 404:

HTTP/1.1 404 Not Found
Age: 1
Allow: GET, POST, OPTIONS
Connection: keep-alive
Content-Length: 123
Content-Type: application/json
Date: Wed, 17 Feb 2021 11:39:45 GMT
Server: gunicorn/19.9.0
Strict-Transport-Security: max-age=15768000;
Vary: Accept,Cookie
Via: 1.1 archive.softwareheritage.org
Via: 1.1 varnish (Varnish/6.1)
X-Frame-Options: SAMEORIGIN
X-RateLimit-Limit: 60
X-RateLimit-Remaining: 59
X-RateLimit-Reset: 1613561985
X-Varnish: 33841 33836

With tcpdump, it seems swh-web don't add the headers to don't cache the response in case of a 404:

GET /api/1/vault/directory/a317baff051f68e83557d51e59539dac2ff55b34/ HTTP/1.1
Host: archive.softwareheritage.org
User-Agent: python-requests/2.21.0
Accept: */*
X-Forwarded-For: 128.93.166.14
X-Forwarded-Proto: https
Accept-Encoding: gzip
X-Varnish: 230399

HTTP/1.1 404 Not Found
Date: Wed, 17 Feb 2021 11:43:42 GMT
Server: gunicorn/19.9.0
Content-Type: application/json
Vary: Accept,Cookie
Allow: GET, POST, OPTIONS
X-RateLimit-Limit: 60
X-RateLimit-Remaining: 59
X-RateLimit-Reset: 1613562223
X-Frame-Options: SAMEORIGIN
Content-Length: 123
Via: 1.1 archive.softwareheritage.org

{"exception":"NotFoundExc","reason":"Cooking of directory 'a317baff051f68e83557d51e59539dac2ff55b34' was never requested."}

@anlambert agrees with the previous hypothesis ^ and is working on a fix

Thanks @anlambert, the monitoring comes back to green

vsellier claimed this task.