Page MenuHomeSoftware Heritage

Investigate why logging is inconsistent
Open, NormalPublic

Description

Logging investigation

Storage service crashed on connection reset with objstorage service.
https://forge.softwareheritage.org/P837
Thought it was an objstorage crash with no objstorage logs, but after investigation it seems not the case (all but HTTP 500 errors are not logged).

Observation: objstorage server only logs internal server errors, whereas storage also logs access.
How is it so?

TL;DR:
The aiohttp-based services use an aiohttp Gunicorn worker which implements a different feature set than the builtin Gunicorn workers used by the other servers.
The same configuration yield different logging behavior out-of-the box depending on the workers used.
In order to have consistent logging accross WSGI services, a fix is proposed as a common configuration file that would lead to uniforn logging.

Library server implementations

Aiohttp

https://docs.aiohttp.org/en/stable/logging.html
https://docs.aiohttp.org/en/stable/web_quickstart.html#aiohttp-web-exceptions
https://docs.aiohttp.org/en/stable/web_advanced.html#middlewares
"If the exception is not an instance of HTTPException it is converted to
500 HTTPInternalServerError after processing the middlewares chain."

Access and error logs active by default, given that logging is configured (not default).
All logs defaults >=DEBUG and to stderr if no handlers.
Access logs are actually out as INFO, whereas errors as ERROR.

Flask

https://flask.palletsprojects.com/en/1.1.x/logging/

Main logger app.logger. Access logger werkzeug.
Flask default adds a StreamHandler to app.logger, if logging not configured.
Werkzeug default adds a StreamHandler to logging.root, if logging not configured.
Python stream handlers log to stderr by default.

Gunicorn

https://docs.aiohttp.org/en/stable/deployment.html#aiohttp-deployment-gunicorn
https://docs.gunicorn.org/en/stable/settings.html#accesslog

Defaults: accesslog: None, errorlog: -
Gunicorn main logging options: --error-logfile --access-logfile --loglevel --log-config --capture-output --log-syslog
Example logging config file https://github.com/benoitc/gunicorn/blob/master/examples/logging.conf

SWH server implementations

Find everything related to logs:
git grep -iP "[\s_.]log(?:ger|ging)?[\s_.]"

core.api.asynchronous

Base: aiohttp
Logger: no logging config or getting
Both error and access logs enabled.
Error handler:
bubbles up aiohttp.web.HTTPException errors (not only 4XX) but log and send 5XX errors.
HTTP errors not used beyond the defaults of the framework (403, 404, 405, 413, 417)
grep -n "raise HTTP" ~/.virtualenvs/swh/lib/python3.7/site-packages/aiohttp*/**
-> remove special aiohttp.web.HTTPException handling?

core.api.__init__

Base: Flask
logger = logging.getLogger(__name__)
Error handler:
handles all errors by logging and sending

objstorage

Base: core.api.asynchronous
No special logging or error handler.
Absolutely no relevant logging code.

storage

Base: core.api.__init__
2 error_handlers: StorageArgumentException = 400, other exceptions = 500
Logging handler is added, seems useless because the same as Flask default.

swh/storage/api/server.py:    handler = logging.StreamHandler(); app.logger.addHandler(handler) // useless?
bin/swh-storage-add-dir:    logging.basicConfig(level=logging.INFO)
swh/storage/cli.py:        logging.getLogger("werkzeug").setLevel(ctx.obj["log_level"])
swh/storage/cli.py:    logging.basicConfig()

Loggers mostly used for warnings (2 occurences) or replay/backfill (irrelevant).
Where does the displayed API access logs DEBUG come from?

vault

Base: core.api.asynchronous
No special logging or error handler.

swh/vault/__init__.py:logger = logging.getLogger(__name__)
swh/vault/__init__.py:    logger.debug("Instantiating
swh/vault/cli.py:    setup_log_handler(loglevel=ctx.obj.get("log_level", logging.INFO),
swh/vault/cookers/base.py:            logging.exception("Bundle cooking failed.")

Testing objstorage logging

Setup

Have a valid FS tree for objstorage:
sudo mkdir -p /srv/softwareheritage/objects && sudo chown -R me:me /srv

Devel server:
swh objstorage -C conf/objstorage.yml rpc-serve

Production server:

SWH_CONFIG_FILENAME=conf/objstorage.yml gunicorn --bind 0.0.0.0:5003
    --worker-class aiohttp.worker.GunicornWebWorker --log-level DEBUG
    'swh.objstorage.api.server:make_app_from_configfile()'

Added prints in core.api.asynchronous:middleware_handler except block.

Base request:
curl -s -X POST http://127.1:5003/check_config -H "Content-Type: application/json" -d '{"check_write": false}'

Base logs:

ERROR:root:...
Traceback (most recent call last):
...
TypeError: check_config() got an unexpected keyword argument 'check_writ'
INFO:aiohttp.access:127.0.0.1 [29/Oct/2020:16:22:21 +0000] "POST /check_config HTTP/1.1" 500 897 "-" "curl/7.64.0"

Objstorage devel server log *all* access like above access log, log *some* errors like above error log.
Server: always above access log, sometimes above logging.root error log

Tests

  • POST, /check_config, json, {"check_write": false}

no error log, 200, server no traceback, client "."

  • POST, /check_confi, json, {"check_writ": false}

no error log, 404, server no traceback, client "404: Not Found"

  • GET, /check_config

no error log, 405, server no traceback, client "405: Method Not Allowed"

  • POST, /check_config, json, {"check_writ": false}

error log, 500, server+client traceback "TypeError: .*unexpected keyword argument"

  • POST, /check_config, jso, {"check_write": false}

error log, 500, server+client traceback "ValueError: Wrong content type"

Error 500 are logged and sent to client. They come not as aiohttp exceptions but as any other Python exception.
Error 4XX are not logged but sent to client. They are the only aiohttp.web.HTTPExceptions raised.
Other statuses are not logged.

Testing storage logging

Added marker "TEST" in gunicorn.worker.pre_request log output to verify this is the log source.
-> Which it was.

`SWH_CONFIG_FILENAME=conf/storage.yml gunicorn --bind 0.0.0.0:5002 --log-level DEBUG
--error-logfile None 'swh.storage.api.server:make_app_from_configfile()'`

-> No more logging, pre_request` logs go to stderr along with errors, as it uses the same logger.

Conclusion

Gunicorn (default) workers call pre_request at each handle_request. This does log through errorlog.
aiohttp GunicornWebWorker inherits Gunicorn BaseWorker but does not implement some features like Gunicorn server hooks like pre_request

Our configs disable access logs but Gunicorn workers do "accesslog" through errorlog...
To have consistant logging accross SWH API servers, we could disable the pre_request hook through the Gunicorn config file, and enable accesslog.
Should we remove aiohttp error special case in core.api.async error handler?

Proposed config file would be similar to:

pre_request = lambda *args: None
accesslog = "-"

Do we want to keep special handling of aiohttp.web.HTTPExceptions in core.api.asynchronous?

Annex: Docker configurations findings

Searched through Docker configurations for logging.

  • docker-compose files: no logging
  • SWH service configs (conf/*.yml): no logging
  • env/common_python.env: no logging
  • services/swh-{service}/entrypoint.sh:
    • All same gunicorn config: "accesslog: None, errorlog: -, loglevel: DEBUG" which is default except for loglevel
    • swh.core/web.gunicorn_config defines only sentry config, including Flask (core) Django (web) integration.
    • --worker-class used to specify async (aiohttp) instead of sync, in objstorage
    • Inconsistency in options passing: --worker-class, --reload, --log-level
    • All use Gunicorn except Vault which uses aiohttp devel server!
    • Deposit is Django != architecture schema, uses swh core gunicorn_config which does not handle Django integration.

Related Objects

Event Timeline

tenma triaged this task as Normal priority.Oct 29 2020, 6:19 PM
tenma created this task.
tenma updated the task description. (Show Details)Oct 30 2020, 2:46 PM
tenma updated the task description. (Show Details)
tenma updated the task description. (Show Details)Tue, Nov 3, 10:23 AM
tenma updated the task description. (Show Details)Tue, Nov 3, 10:30 AM
tenma updated the task description. (Show Details)Tue, Nov 3, 10:45 AM