After an upgrade to version 20.8.0, it seems the errors sent to Sentry are simply ignored.
Description
| Status | Assigned | Task |
|---|---|---|
| Migrated | gitlab-migration | T2899 Sentry doesn't react to new errors |
| Migrated | gitlab-migration | T2910 Sentry: Increase disk space |
Event Timeline
@olasd, if you have some details about the version upgrades you performed yesterday, perhaps that could help with the diagnosis.
This is a diagram of the Sentry ingestion pipeline: https://getsentry.github.io/event-ingestion-graph/
and of the global Sentry architecture: https://develop.sentry.dev/architecture/
Current state of the diagnosis:
- all the services are running
```
root@riverside:/var/lib/sentry-onpremise# docker-compose ps
                     Name                                   Command               State                Ports
----------------------------------------------------------------------------------------------------------------------
sentry_onpremise_clickhouse_1                    /entrypoint.sh                   Up      8123/tcp, 9000/tcp, 9009/tcp
sentry_onpremise_cron_1                          /bin/sh -c exec /docker-en ...   Up      9000/tcp
sentry_onpremise_ingest-consumer_1               /bin/sh -c exec /docker-en ...   Up      9000/tcp
sentry_onpremise_kafka_1                         /etc/confluent/docker/run        Up      9092/tcp
sentry_onpremise_memcached_1                     docker-entrypoint.sh memcached   Up      11211/tcp
sentry_onpremise_nginx_1                         nginx -g daemon off;             Up      0.0.0.0:9000->80/tcp
sentry_onpremise_post-process-forwarder_1        /bin/sh -c exec /docker-en ...   Up      9000/tcp
sentry_onpremise_redis_1                         docker-entrypoint.sh redis ...   Up      6379/tcp
sentry_onpremise_relay_1                         /bin/bash /docker-entrypoi ...   Up      3000/tcp
sentry_onpremise_sentry-cleanup_1                /entrypoint.sh 0 0 * * * g ...   Up      9000/tcp
sentry_onpremise_snuba-api_1                     ./docker_entrypoint.sh api       Up      1218/tcp
sentry_onpremise_snuba-cleanup_1                 /entrypoint.sh */5 * * * * ...   Up      1218/tcp
sentry_onpremise_snuba-consumer_1                ./docker_entrypoint.sh con ...   Up      1218/tcp
sentry_onpremise_snuba-outcomes-consumer_1       ./docker_entrypoint.sh con ...   Up      1218/tcp
sentry_onpremise_snuba-replacer_1                ./docker_entrypoint.sh rep ...   Up      1218/tcp
sentry_onpremise_snuba-sessions-consumer_1       ./docker_entrypoint.sh con ...   Up      1218/tcp
sentry_onpremise_snuba-transactions-consumer_1   ./docker_entrypoint.sh con ...   Up      1218/tcp
sentry_onpremise_symbolicator-cleanup_1          /entrypoint.sh 55 23 * * * ...   Up      3021/tcp
sentry_onpremise_symbolicator_1                  /bin/bash /docker-entrypoi ...   Up      3021/tcp
sentry_onpremise_web_1                           /bin/sh -c exec /docker-en ...   Up      9000/tcp
sentry_onpremise_worker_1                        /bin/sh -c exec /docker-en ...   Up      9000/tcp
sentry_onpremise_zookeeper_1                     /etc/confluent/docker/run        Up      2181/tcp, 2888/tcp, 3888/tcp
```
- there are no suspicious logs in the containers
- inbound filtering was disabled for webapp and storage, but with no effect: https://docs.sentry.io/product/error-monitoring/filtering/
- the events are correctly posted and acknowledged:
```
nginx_1 | 192.168.100.29 - - [17/Dec/2020:14:03:51 +0000] "POST /api/13/store/ HTTP/1.1" 200 41 "-" "sentry.python/0.13.5"
nginx_1 | 192.168.100.29 - - [17/Dec/2020:14:03:51 +0000] "POST /api/13/store/ HTTP/1.1" 200 41 "-" "sentry.python/0.13.5"
nginx_1 | 192.168.100.29 - - [17/Dec/2020:14:03:51 +0000] "POST /api/13/store/ HTTP/1.1" 200 41 "-" "sentry.python/0.13.5"
```
- the Kafka topics look correct, and there are consumers listening:
```
root@riverside:/var/lib/sentry-onpremise# docker-compose exec kafka bash
root@731b50428583:/# kafka-consumer-groups --bootstrap-server localhost:9092 --describe --all-groups

GROUP            TOPIC                PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG  CONSUMER-ID                                   HOST          CLIENT-ID
ingest-consumer  ingest-events        0          418886          418890          4    rdkafka-bf8df2f1-a7d5-4230-b95b-e183618fc89a  /172.25.0.8   rdkafka
ingest-consumer  ingest-transactions  0          -               0               -    rdkafka-bf8df2f1-a7d5-4230-b95b-e183618fc89a  /172.25.0.8   rdkafka
ingest-consumer  ingest-attachments   0          -               0               -    rdkafka-bf8df2f1-a7d5-4230-b95b-e183618fc89a  /172.25.0.8   rdkafka

GROUP            TOPIC            PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG  CONSUMER-ID                                   HOST          CLIENT-ID
snuba-consumers  events           0          35448           35448           0    rdkafka-ee6b70fa-4d66-4e9d-b67a-585f84e2933f  /172.25.0.18  rdkafka
snuba-consumers  ingest-sessions  0          -               0               -    rdkafka-7dd7628a-652b-4dbb-a0c3-01b64acef88f  /172.25.0.21  rdkafka
snuba-consumers  outcomes         0          49245           49245           0    rdkafka-eb77848b-ef87-4815-9ea2-addfb80bef18  /172.25.0.17  rdkafka

GROUP                 TOPIC   PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG  CONSUMER-ID                                   HOST          CLIENT-ID
snuba-post-processor  events  0          35448           35448           0    rdkafka-fed535ec-97fa-4603-9ba9-0e52aaa71d3a  /172.25.0.6   rdkafka

GROUP                                                        TOPIC             PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG  CONSUMER-ID                                   HOST          CLIENT-ID
snuba-post-processor:sync:0fef0994406d11eb839b0242ac190006   snuba-commit-log  0          -               5925            -    rdkafka-c83cff42-56d7-47f1-b381-f8d8a5ebbf89  /172.25.0.6   rdkafka

GROUP            TOPIC               PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG  CONSUMER-ID                                   HOST          CLIENT-ID
snuba-replacers  event-replacements  0          -               0               -    rdkafka-9ad0b174-45c0-427d-a0e8-319c4f0e7a39  /172.25.0.23  rdkafka

GROUP               TOPIC   PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG  CONSUMER-ID                                   HOST          CLIENT-ID
transactions_group  events  0          35448           35448           0    rdkafka-2c1e749b-d53f-484b-b634-86ce07423ab4  /172.25.0.13  rdkafka
```
- observation: the offset is moving on the ingest-events topic and there is no consumer lag. However, it seems there are no new messages on the events topic itself, as its offset hasn't moved since this morning (a small offset-check sketch follows this list)
- the ingest consumer seems to ignore the events, but the events are still stored in Redis:
```
ingest-consumer_1 | 14:12:37 [INFO] batching-kafka-consumer: Worker flush took 34ms
ingest-consumer_1 | 14:12:39 [INFO] batching-kafka-consumer: Flushing 18 items (from {(u'ingest-events', 0): [421646L, 421663L]}): forced:False size:False time:True
ingest-consumer_1 | 14:12:39 [INFO] batching-kafka-consumer: Worker flush took 141ms
ingest-consumer_1 | 14:12:40 [INFO] batching-kafka-consumer: Flushing 16 items (from {(u'ingest-events', 0): [421664L, 421679L]}): forced:False size:False time:True
ingest-consumer_1 | 14:12:40 [INFO] batching-kafka-consumer: Worker flush took 126ms
```
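For reference, here is a minimal sketch of how the offset comparison above could be automated. This is an illustration, not something that was run during the investigation: it assumes the kafka-python package is installed and that the broker is reachable on localhost:9092 (e.g. from inside the kafka container or through a port forward).

```python
import time

from kafka import KafkaConsumer, TopicPartition

# Assumed connection details; adjust bootstrap_servers to wherever the
# broker is reachable from.
consumer = KafkaConsumer(bootstrap_servers="localhost:9092")


def end_offset(topic, partition=0):
    """Return the log-end offset of a topic partition (single partition here)."""
    tp = TopicPartition(topic, partition)
    return consumer.end_offsets([tp])[tp]


previous = {topic: end_offset(topic) for topic in ("ingest-events", "events")}
while True:
    time.sleep(30)
    for topic, old in list(previous.items()):
        new = end_offset(topic)
        # If ingest-events keeps growing while events stays flat, the events
        # are being lost between the ingest consumer and the events topic.
        print(f"{topic}: {old} -> {new} (+{new - old})")
        previous[topic] = new
```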
At the moment the investigation is focused on the ingest-consumer, as it seems the events are lost somewhere around this step.
We have traced the event path through the consumer code without finding anything suspicious.
As a last try, we fully rebooted the VM, but as expected, it changed nothing at all.
Switching to another subject for today
To eliminate another possible root cause, a test was done in a temporary project with the latest version of the Python SDK; it doesn't work either.
The small Python script used to perform the test:
```
import sentry_sdk

sentry_sdk.init("https://xxxxxx@sentry.softwareheritage.org/18")

division_by_zero = 1 / 0
```
The message is sent correctly, but not ingested:
nginx_1 | 192.168.100.29 - - [18/Dec/2020:09:48:55 +0000] "POST /api/18/store/ HTTP/1.1" 200 41 "-" "sentry.python/0.16.2"
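A slightly more verbose variant of that test could also help (this is a sketch, not the exact script that was run): it enables the SDK's debug logging and flushes explicitly, so the client-side transport activity and the server response are visible on stderr. The DSN is the same redacted placeholder as above.

```python
import sentry_sdk

sentry_sdk.init(
    "https://xxxxxx@sentry.softwareheritage.org/18",
    debug=True,  # log SDK transport activity to stderr
)

try:
    1 / 0
except ZeroDivisionError:
    sentry_sdk.capture_exception()

# Block until the queued event has been handed off (or give up after 5 seconds).
sentry_sdk.flush(timeout=5)
```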
I guess the next step is to try an upgrade to a more recent version of Sentry.
@olasd I can handle that on a clone of the original VM, to try it in a constrained environment.
Thanks for the thorough investigation so far!
I don't think making a clone of the current VM is going to be very fruitful: the Sentry database is external to it, so it's going to be annoying to get a clone of that going too and to isolate the clone from the prod instance.
I thought I had checked that new events were ingesting fine (and showing up in the web interface) when I did the upgrades, but looking at the date of the last events I don't think that's the case (damn).
Looks like the events are/were getting stuck at the Celery stage.
Unfortunately, the default Celery worker logs in the onpremise docker-compose setup are not verbose enough to notice this.
Applying the following diff:
```
diff --git a/docker-compose.yml b/docker-compose.yml
index 3a80383..c7d81e3 100644
--- a/docker-compose.yml
+++ b/docker-compose.yml
@@ -147,7 +147,7 @@ services:
     command: run cron
   worker:
     << : *sentry_defaults
-    command: run worker
+    command: run worker --loglevel INFO
   ingest-consumer:
     << : *sentry_defaults
     command: run ingest-consumer --all-consumer-types
```
We can notice that the worker gets stuck after a while.
I've bumped parallelism and increased max_tasks_per_child:
```
diff --git a/docker-compose.yml b/docker-compose.yml
index 3a80383..c7d81e3 100644
--- a/docker-compose.yml
+++ b/docker-compose.yml
@@ -147,7 +147,7 @@ services:
     command: run cron
   worker:
     << : *sentry_defaults
-    command: run worker
+    command: run worker --loglevel INFO -c 10 --max-tasks-per-child 1000000000
   ingest-consumer:
     << : *sentry_defaults
     command: run ingest-consumer --all-consumer-types
```
And the worker was able to clear the backlog.
This issue was indeed noticed upstream for 20.8.0 (https://github.com/getsentry/sentry/commit/1c81aa88ec63e5bfb1c458d053f52af2eda6cc06) so upgrading to 20.9.0 would probably have fixed it. I'll continue the upgrades up to 20.12.0 now.
To look at the state of the Celery queues, following https://docs.celeryproject.org/en/stable/userguide/monitoring.html#monitoring-redis-queues:
```
root@riverside:/var/lib/sentry-onpremise# docker-compose run redis redis-cli -h redis

# Get the list of queues and some garbage. All sentry redis queues have a dot in them.
# Empty queues do not appear in this list!
redis:6379> keys *.*
1) "_kombu.binding.celery.pidbox"
2) "_kombu.binding.default"
3) "b:k:sentry.group:8e6ab22a0499e9dce076eb1042105058"
4) "b:k:sentry.group:051240e922870ea5edea4c2e12fc20fd"
5) "_kombu.binding.triggers"
6) "b:k:sentry.group:b45f83e14f889324a0e09cb8d1aeaeaf"
7) "_kombu.binding.counters"
8) "events.store_event"

# Get the length of the given queue; it should be 0 or close to 0...
redis:6379> llen events.store_event
(integer) 123456
```
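As a companion to the redis-cli check, here is a minimal sketch (an illustration, assuming the redis-py package and network access to the Redis container; the host name "redis" is the compose service name and may need adjusting) to watch whether the workers are draining the events.store_event backlog over time:

```python
import time

import redis

# "redis" is the docker-compose service name; adjust host/port as needed.
r = redis.Redis(host="redis", port=6379)

while True:
    backlog = r.llen("events.store_event")
    # The backlog should trend towards 0 once the workers keep up again.
    print(f"events.store_event backlog: {backlog}")
    time.sleep(10)
```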
After pushing through the updates up to 20.12.1, it seems the events are being processed correctly. I'm somewhat confident that the updated Celery in the Sentry image will not exhibit the same processing bug, but I'll keep an eye on the logs for a bit...
I confirm that Sentry is still happily processing events, more than 12 hours after the last upgrade :)