After an upgrade to version 20.8.0, it seems the errors sent to Sentry are simply ignored.
Description
| Status | Assigned | Task |
|---|---|---|
| Migrated | gitlab-migration | T2899 Sentry doesn't react to new errors |
| Migrated | gitlab-migration | T2910 Sentry: Increase disk space |
Event Timeline
@olasd, if you have some details about the version upgrades you performed yesterday, perhaps that could help with the diagnosis.
This is a diagram of the Sentry ingestion pipeline: https://getsentry.github.io/event-ingestion-graph/
and of the global Sentry architecture: https://develop.sentry.dev/architecture/
Current state of the diagnosis:
- all the services are running
```
root@riverside:/var/lib/sentry-onpremise# docker-compose ps
                     Name                                   Command               State                Ports
----------------------------------------------------------------------------------------------------------------------
sentry_onpremise_clickhouse_1                    /entrypoint.sh                   Up      8123/tcp, 9000/tcp, 9009/tcp
sentry_onpremise_cron_1                          /bin/sh -c exec /docker-en ...   Up      9000/tcp
sentry_onpremise_ingest-consumer_1               /bin/sh -c exec /docker-en ...   Up      9000/tcp
sentry_onpremise_kafka_1                         /etc/confluent/docker/run        Up      9092/tcp
sentry_onpremise_memcached_1                     docker-entrypoint.sh memcached   Up      11211/tcp
sentry_onpremise_nginx_1                         nginx -g daemon off;             Up      0.0.0.0:9000->80/tcp
sentry_onpremise_post-process-forwarder_1        /bin/sh -c exec /docker-en ...   Up      9000/tcp
sentry_onpremise_redis_1                         docker-entrypoint.sh redis ...   Up      6379/tcp
sentry_onpremise_relay_1                         /bin/bash /docker-entrypoi ...   Up      3000/tcp
sentry_onpremise_sentry-cleanup_1                /entrypoint.sh 0 0 * * * g ...   Up      9000/tcp
sentry_onpremise_snuba-api_1                     ./docker_entrypoint.sh api       Up      1218/tcp
sentry_onpremise_snuba-cleanup_1                 /entrypoint.sh */5 * * * * ...   Up      1218/tcp
sentry_onpremise_snuba-consumer_1                ./docker_entrypoint.sh con ...   Up      1218/tcp
sentry_onpremise_snuba-outcomes-consumer_1       ./docker_entrypoint.sh con ...   Up      1218/tcp
sentry_onpremise_snuba-replacer_1                ./docker_entrypoint.sh rep ...   Up      1218/tcp
sentry_onpremise_snuba-sessions-consumer_1       ./docker_entrypoint.sh con ...   Up      1218/tcp
sentry_onpremise_snuba-transactions-consumer_1   ./docker_entrypoint.sh con ...   Up      1218/tcp
sentry_onpremise_symbolicator-cleanup_1          /entrypoint.sh 55 23 * * * ...   Up      3021/tcp
sentry_onpremise_symbolicator_1                  /bin/bash /docker-entrypoi ...   Up      3021/tcp
sentry_onpremise_web_1                           /bin/sh -c exec /docker-en ...   Up      9000/tcp
sentry_onpremise_worker_1                        /bin/sh -c exec /docker-en ...   Up      9000/tcp
sentry_onpremise_zookeeper_1                     /etc/confluent/docker/run        Up      2181/tcp, 2888/tcp, 3888/tcp
```
- there are no suspicious logs in the containers
- inbound filtering was disabled for webapp and storage, but with no effect: https://docs.sentry.io/product/error-monitoring/filtering/
- the events are correctly posted and acknowledged:
```
nginx_1 | 192.168.100.29 - - [17/Dec/2020:14:03:51 +0000] "POST /api/13/store/ HTTP/1.1" 200 41 "-" "sentry.python/0.13.5"
nginx_1 | 192.168.100.29 - - [17/Dec/2020:14:03:51 +0000] "POST /api/13/store/ HTTP/1.1" 200 41 "-" "sentry.python/0.13.5"
nginx_1 | 192.168.100.29 - - [17/Dec/2020:14:03:51 +0000] "POST /api/13/store/ HTTP/1.1" 200 41 "-" "sentry.python/0.13.5"
```
- the Kafka topics look correct, and there are consumers listening:
```
root@riverside:/var/lib/sentry-onpremise# docker-compose exec kafka bash
root@731b50428583:/# kafka-consumer-groups --bootstrap-server localhost:9092 --describe --all-groups

GROUP            TOPIC                PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG  CONSUMER-ID                                   HOST          CLIENT-ID
ingest-consumer  ingest-events        0          418886          418890          4    rdkafka-bf8df2f1-a7d5-4230-b95b-e183618fc89a  /172.25.0.8   rdkafka
ingest-consumer  ingest-transactions  0          -               0               -    rdkafka-bf8df2f1-a7d5-4230-b95b-e183618fc89a  /172.25.0.8   rdkafka
ingest-consumer  ingest-attachments   0          -               0               -    rdkafka-bf8df2f1-a7d5-4230-b95b-e183618fc89a  /172.25.0.8   rdkafka

GROUP            TOPIC            PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG  CONSUMER-ID                                   HOST          CLIENT-ID
snuba-consumers  events           0          35448           35448           0    rdkafka-ee6b70fa-4d66-4e9d-b67a-585f84e2933f  /172.25.0.18  rdkafka
snuba-consumers  ingest-sessions  0          -               0               -    rdkafka-7dd7628a-652b-4dbb-a0c3-01b64acef88f  /172.25.0.21  rdkafka
snuba-consumers  outcomes         0          49245           49245           0    rdkafka-eb77848b-ef87-4815-9ea2-addfb80bef18  /172.25.0.17  rdkafka

GROUP                 TOPIC   PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG  CONSUMER-ID                                   HOST          CLIENT-ID
snuba-post-processor  events  0          35448           35448           0    rdkafka-fed535ec-97fa-4603-9ba9-0e52aaa71d3a  /172.25.0.6   rdkafka

GROUP                                                        TOPIC             PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG  CONSUMER-ID                                   HOST          CLIENT-ID
snuba-post-processor:sync:0fef0994406d11eb839b0242ac190006   snuba-commit-log  0          -               5925            -    rdkafka-c83cff42-56d7-47f1-b381-f8d8a5ebbf89  /172.25.0.6   rdkafka

GROUP            TOPIC               PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG  CONSUMER-ID                                   HOST          CLIENT-ID
snuba-replacers  event-replacements  0          -               0               -    rdkafka-9ad0b174-45c0-427d-a0e8-319c4f0e7a39  /172.25.0.23  rdkafka

GROUP               TOPIC   PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG  CONSUMER-ID                                   HOST          CLIENT-ID
transactions_group  events  0          35448           35448           0    rdkafka-2c1e749b-d53f-484b-b634-86ce07423ab4  /172.25.0.13  rdkafka
```
- observation: the offset is moving on the ingest-events topic and there is no consumer lag. However, it seems there are no new messages on the events topic itself, as its offset hasn't moved since this morning (a small offset-check sketch follows this list)
- the ingest consumer seems to ignore the events, but the events are still stored in Redis:
```
ingest-consumer_1 | 14:12:37 [INFO] batching-kafka-consumer: Worker flush took 34ms
ingest-consumer_1 | 14:12:39 [INFO] batching-kafka-consumer: Flushing 18 items (from {(u'ingest-events', 0): [421646L, 421663L]}): forced:False size:False time:True
ingest-consumer_1 | 14:12:39 [INFO] batching-kafka-consumer: Worker flush took 141ms
ingest-consumer_1 | 14:12:40 [INFO] batching-kafka-consumer: Flushing 16 items (from {(u'ingest-events', 0): [421664L, 421679L]}): forced:False size:False time:True
ingest-consumer_1 | 14:12:40 [INFO] batching-kafka-consumer: Worker flush took 126ms
```
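For reference, here is a minimal sketch of how the offset comparison above could be automated. This is an illustration, not something that was run during the investigation: it assumes the kafka-python package is installed and that the broker is reachable on localhost:9092 (e.g. from inside the kafka container or through a port forward).

```python
import time

from kafka import KafkaConsumer, TopicPartition

# Assumed connection details; adjust bootstrap_servers to wherever the
# broker is reachable from.
consumer = KafkaConsumer(bootstrap_servers="localhost:9092")


def end_offset(topic, partition=0):
    """Return the log-end offset of a topic partition (single partition here)."""
    tp = TopicPartition(topic, partition)
    return consumer.end_offsets([tp])[tp]


previous = {topic: end_offset(topic) for topic in ("ingest-events", "events")}
while True:
    time.sleep(30)
    for topic, old in list(previous.items()):
        new = end_offset(topic)
        # If ingest-events keeps growing while events stays flat, the events
        # are being lost between the ingest consumer and the events topic.
        print(f"{topic}: {old} -> {new} (+{new - old})")
        previous[topic] = new
```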
At the moment the investigation is focused on the ingest-consumer, as it seems the events are lost somewhere around this step.
We have traced the event path through the consumer code without finding anything suspicious.
As a last try, we fully rebooted the VM, but as expected, it changed nothing at all.
Switching to another subject for today
To eliminate another possible root cause, a test was done in a temporary project with the latest version of the Python SDK; it doesn't work either.
The small Python script used to perform the test:
```
import sentry_sdk

sentry_sdk.init("https://xxxxxx@sentry.softwareheritage.org/18")

division_by_zero = 1 / 0
```
The message is sent correctly, but not ingested:
nginx_1 | 192.168.100.29 - - [18/Dec/2020:09:48:55 +0000] "POST /api/18/store/ HTTP/1.1" 200 41 "-" "sentry.python/0.16.2"
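A slightly more verbose variant of that test could also help (this is a sketch, not the exact script that was run): it enables the SDK's debug logging and flushes explicitly, so the client-side transport activity and the server response are visible on stderr. The DSN is the same redacted placeholder as above.

```python
import sentry_sdk

sentry_sdk.init(
    "https://xxxxxx@sentry.softwareheritage.org/18",
    debug=True,  # log SDK transport activity to stderr
)

try:
    1 / 0
except ZeroDivisionError:
    sentry_sdk.capture_exception()

# Block until the queued event has been handed off (or give up after 5 seconds).
sentry_sdk.flush(timeout=5)
```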
I guess the next step is to try an upgrade to a more recent version of Sentry.
@olasd I can handle that on a clone of the original VM, to try it in a constrained environment.
Thanks for the thorough investigation so far!
I don't think making a clone of the current VM is going to be very fruitful: the Sentry database is external to it, so it's going to be annoying to get a clone of that going too and to isolate the clone from the prod instance.
I thought I had checked that new events were ingesting fine (and showing up in the web interface) when I did the upgrades, but looking at the date of the last events I don't think that's the case (damn).
Looks like the events are/were getting stuck at the Celery stage.
Unfortunately, the default Celery worker logs in the onpremise docker-compose setup are not verbose enough to notice this.
Applying the following diff:
```
diff --git a/docker-compose.yml b/docker-compose.yml
index 3a80383..c7d81e3 100644
--- a/docker-compose.yml
+++ b/docker-compose.yml
@@ -147,7 +147,7 @@ services:
     command: run cron
   worker:
     << : *sentry_defaults
-    command: run worker
+    command: run worker --loglevel INFO
   ingest-consumer:
     << : *sentry_defaults
     command: run ingest-consumer --all-consumer-types
```
We can notice that the worker gets stuck after a while.
I've bumped parallelism and increased max_tasks_per_child:
```
diff --git a/docker-compose.yml b/docker-compose.yml
index 3a80383..c7d81e3 100644
--- a/docker-compose.yml
+++ b/docker-compose.yml
@@ -147,7 +147,7 @@ services:
     command: run cron
   worker:
     << : *sentry_defaults
-    command: run worker
+    command: run worker --loglevel INFO -c 10 --max-tasks-per-child 1000000000
   ingest-consumer:
     << : *sentry_defaults
     command: run ingest-consumer --all-consumer-types
```
And the worker was able to clear the backlog.
This issue was indeed noticed upstream for 20.8.0 (https://github.com/getsentry/sentry/commit/1c81aa88ec63e5bfb1c458d053f52af2eda6cc06) so upgrading to 20.9.0 would probably have fixed it. I'll continue the upgrades up to 20.12.0 now.
To look at the state of the Celery queues, following https://docs.celeryproject.org/en/stable/userguide/monitoring.html#monitoring-redis-queues:
```
root@riverside:/var/lib/sentry-onpremise# docker-compose run redis redis-cli -h redis

# Get the list of queues and some garbage. All sentry redis queues have a dot in them.
# Empty queues do not appear in this list!
redis:6379> keys *.*
1) "_kombu.binding.celery.pidbox"
2) "_kombu.binding.default"
3) "b:k:sentry.group:8e6ab22a0499e9dce076eb1042105058"
4) "b:k:sentry.group:051240e922870ea5edea4c2e12fc20fd"
5) "_kombu.binding.triggers"
6) "b:k:sentry.group:b45f83e14f889324a0e09cb8d1aeaeaf"
7) "_kombu.binding.counters"
8) "events.store_event"

# Get the length of the given queue; it should be 0 or close to 0...
redis:6379> llen events.store_event
(integer) 123456
```
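As a companion to the redis-cli check, here is a minimal sketch (an illustration, assuming the redis-py package and network access to the Redis container; the host name "redis" is the compose service name and may need adjusting) to watch whether the workers are draining the events.store_event backlog over time:

```python
import time

import redis

# "redis" is the docker-compose service name; adjust host/port as needed.
r = redis.Redis(host="redis", port=6379)

while True:
    backlog = r.llen("events.store_event")
    # The backlog should trend towards 0 once the workers keep up again.
    print(f"events.store_event backlog: {backlog}")
    time.sleep(10)
```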
After pushing through the updates up to 20.12.1, it seems the events are being processed correctly. I'm somewhat confident that the updated Celery in the Sentry image will not exhibit the same processing bug, but I'll keep an eye on the logs for a bit...
I confirm that Sentry is still happily processing events, more than 12 hours after the last upgrade :)