Software Heritage

Sentry doesn't react to new errors
Closed, ResolvedPublic

Description

After an upgrade to version 20.8.0, it seems the errors sent to Sentry are simply ignored.

Event Timeline

vsellier changed the task status from Open to Work in Progress.Dec 17 2020, 2:59 PM
vsellier triaged this task as High priority.
vsellier created this task.

@olasd, if you have some details of the version upgrades you performed yesterday, perhaps it could help with the diagnosis.

This is a diagram of the Sentry event ingestion pipeline: https://getsentry.github.io/event-ingestion-graph/
and of the global Sentry architecture: https://develop.sentry.dev/architecture/

Current state of the diagnosis:

  • all the services are running:
root@riverside:/var/lib/sentry-onpremise# docker-compose ps
                     Name                                   Command               State              Ports            
----------------------------------------------------------------------------------------------------------------------
sentry_onpremise_clickhouse_1                    /entrypoint.sh                   Up      8123/tcp, 9000/tcp, 9009/tcp
sentry_onpremise_cron_1                          /bin/sh -c exec /docker-en ...   Up      9000/tcp                    
sentry_onpremise_ingest-consumer_1               /bin/sh -c exec /docker-en ...   Up      9000/tcp                    
sentry_onpremise_kafka_1                         /etc/confluent/docker/run        Up      9092/tcp                    
sentry_onpremise_memcached_1                     docker-entrypoint.sh memcached   Up      11211/tcp                   
sentry_onpremise_nginx_1                         nginx -g daemon off;             Up      0.0.0.0:9000->80/tcp        
sentry_onpremise_post-process-forwarder_1        /bin/sh -c exec /docker-en ...   Up      9000/tcp                    
sentry_onpremise_redis_1                         docker-entrypoint.sh redis ...   Up      6379/tcp                    
sentry_onpremise_relay_1                         /bin/bash /docker-entrypoi ...   Up      3000/tcp                    
sentry_onpremise_sentry-cleanup_1                /entrypoint.sh 0 0 * * * g ...   Up      9000/tcp                    
sentry_onpremise_snuba-api_1                     ./docker_entrypoint.sh api       Up      1218/tcp                    
sentry_onpremise_snuba-cleanup_1                 /entrypoint.sh */5 * * * * ...   Up      1218/tcp                    
sentry_onpremise_snuba-consumer_1                ./docker_entrypoint.sh con ...   Up      1218/tcp                    
sentry_onpremise_snuba-outcomes-consumer_1       ./docker_entrypoint.sh con ...   Up      1218/tcp                    
sentry_onpremise_snuba-replacer_1                ./docker_entrypoint.sh rep ...   Up      1218/tcp                    
sentry_onpremise_snuba-sessions-consumer_1       ./docker_entrypoint.sh con ...   Up      1218/tcp                    
sentry_onpremise_snuba-transactions-consumer_1   ./docker_entrypoint.sh con ...   Up      1218/tcp                    
sentry_onpremise_symbolicator-cleanup_1          /entrypoint.sh 55 23 * * * ...   Up      3021/tcp                    
sentry_onpremise_symbolicator_1                  /bin/bash /docker-entrypoi ...   Up      3021/tcp                    
sentry_onpremise_web_1                           /bin/sh -c exec /docker-en ...   Up      9000/tcp                    
sentry_onpremise_worker_1                        /bin/sh -c exec /docker-en ...   Up      9000/tcp                    
sentry_onpremise_zookeeper_1                     /etc/confluent/docker/run        Up      2181/tcp, 2888/tcp, 3888/tcp
nginx_1                        | 192.168.100.29 - - [17/Dec/2020:14:03:51 +0000] "POST /api/13/store/ HTTP/1.1" 200 41 "-" "sentry.python/0.13.5"
nginx_1                        | 192.168.100.29 - - [17/Dec/2020:14:03:51 +0000] "POST /api/13/store/ HTTP/1.1" 200 41 "-" "sentry.python/0.13.5"
nginx_1                        | 192.168.100.29 - - [17/Dec/2020:14:03:51 +0000] "POST /api/13/store/ HTTP/1.1" 200 41 "-" "sentry.python/0.13.5"
  • the Kafka topics look correct, and there are consumers listening:
root@riverside:/var/lib/sentry-onpremise# docker-compose exec kafka bash
root@731b50428583:/# kafka-consumer-groups --bootstrap-server localhost:9092 --describe --all-groups

GROUP           TOPIC               PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG             CONSUMER-ID                                  HOST            CLIENT-ID
ingest-consumer ingest-events       0          418886          418890          4               rdkafka-bf8df2f1-a7d5-4230-b95b-e183618fc89a /172.25.0.8     rdkafka
ingest-consumer ingest-transactions 0          -               0               -               rdkafka-bf8df2f1-a7d5-4230-b95b-e183618fc89a /172.25.0.8     rdkafka
ingest-consumer ingest-attachments  0          -               0               -               rdkafka-bf8df2f1-a7d5-4230-b95b-e183618fc89a /172.25.0.8     rdkafka

GROUP           TOPIC           PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG             CONSUMER-ID                                  HOST            CLIENT-ID
snuba-consumers events          0          35448           35448           0               rdkafka-ee6b70fa-4d66-4e9d-b67a-585f84e2933f /172.25.0.18    rdkafka
snuba-consumers ingest-sessions 0          -               0               -               rdkafka-7dd7628a-652b-4dbb-a0c3-01b64acef88f /172.25.0.21    rdkafka
snuba-consumers outcomes        0          49245           49245           0               rdkafka-eb77848b-ef87-4815-9ea2-addfb80bef18 /172.25.0.17    rdkafka

GROUP                TOPIC           PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG             CONSUMER-ID                                  HOST            CLIENT-ID
snuba-post-processor events          0          35448           35448           0               rdkafka-fed535ec-97fa-4603-9ba9-0e52aaa71d3a /172.25.0.6     rdkafka

GROUP                                                      TOPIC            PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG             CONSUMER-ID                                  HOST            CLIENT-ID
snuba-post-processor:sync:0fef0994406d11eb839b0242ac190006 snuba-commit-log 0          -               5925            -               rdkafka-c83cff42-56d7-47f1-b381-f8d8a5ebbf89 /172.25.0.6     rdkafka

GROUP           TOPIC              PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG             CONSUMER-ID                                  HOST            CLIENT-ID
snuba-replacers event-replacements 0          -               0               -               rdkafka-9ad0b174-45c0-427d-a0e8-319c4f0e7a39 /172.25.0.23    rdkafka

GROUP              TOPIC           PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG             CONSUMER-ID                                  HOST            CLIENT-ID
transactions_group events          0          35448           35448           0               rdkafka-2c1e749b-d53f-484b-b634-86ce07423ab4 /172.25.0.13    rdkafka
  • observation: the offset is moving on the ingest-events topic and there is no consumer lag. It seems there are no new messages on the events topic itself, as its offset hasn't moved since this morning
  • the ingest consumer seems to ignore the events, but the events are still stored in Redis:
ingest-consumer_1              | 14:12:37 [INFO] batching-kafka-consumer: Worker flush took 34ms
ingest-consumer_1              | 14:12:39 [INFO] batching-kafka-consumer: Flushing 18 items (from {(u'ingest-events', 0): [421646L, 421663L]}): forced:False size:False time:True
ingest-consumer_1              | 14:12:39 [INFO] batching-kafka-consumer: Worker flush took 141ms
ingest-consumer_1              | 14:12:40 [INFO] batching-kafka-consumer: Flushing 16 items (from {(u'ingest-events', 0): [421664L, 421679L]}): forced:False size:False time:True
ingest-consumer_1              | 14:12:40 [INFO] batching-kafka-consumer: Worker flush took 126ms
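The LAG column in the kafka-consumer-groups output above is just LOG-END-OFFSET minus CURRENT-OFFSET. As a quick way to re-check it across all groups at once, here is a minimal parsing sketch (illustrative only, not part of Sentry's or Kafka's tooling; the sample rows in the docstring come from the output above):

```python
def parse_consumer_lag(describe_output):
    """Parse `kafka-consumer-groups --describe` output and return a
    {(group, topic): lag} mapping, where lag is log-end-offset minus
    current-offset. Rows without a committed offset ('-') are skipped."""
    lags = {}
    for line in describe_output.splitlines():
        fields = line.split()
        # Data rows look like:
        # GROUP TOPIC PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG ...
        if len(fields) < 6 or not fields[2].isdigit():
            continue  # header, blank, or malformed line
        group, topic, current, end = fields[0], fields[1], fields[3], fields[4]
        if current == "-":
            continue  # no committed offset yet
        lags[(group, topic)] = int(end) - int(current)
    return lags
```

Run against the captured output, this confirms the observation above: a small, moving lag on ingest-events and zero lag everywhere else.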

At the moment the investigation is focused on the ingest-consumer, as it seems the events are lost somewhere around this step.

We have followed the events' path through the consumer code without finding anything suspicious.
As a last try, we fully rebooted the VM but, as expected, it changed nothing at all.

Switching to another subject for today.

To eliminate another possible root cause, a test was done in a temporary project with the latest version of the Python library; it doesn't work either.

The small Python snippet used to perform the test:

import sentry_sdk

# DSN secret redacted; project 18 is the temporary test project
sentry_sdk.init("https://xxxxxx@sentry.softwareheritage.org/18")

# Trigger an unhandled ZeroDivisionError so the SDK reports an event
division_by_zero = 1 / 0

The message is sent successfully, but never ingested:

nginx_1                        | 192.168.100.29 - - [18/Dec/2020:09:48:55 +0000] "POST /api/18/store/ HTTP/1.1" 200 41 "-" "sentry.python/0.16.2"

I guess the next step is to try an upgrade to a more recent version of Sentry.
@olasd I can handle that on a clone of the original VM, to try it in a constrained environment.

Thanks for the thorough investigation so far!

I don't think making a clone of the current VM is going to be very fruitful: the Sentry database is external to it, so it would be annoying to get a clone of that going too and to isolate it from the prod instance.

I thought I had checked that new events were ingesting fine (and showing up in the web interface) when I did the upgrades, but looking at the date of the last events I don't think that's the case (damn).

Looks like the events are/were getting stuck at the celery stage.

Unfortunately, the default Celery worker log level in the onpremise docker-compose setup is not verbose enough to notice this.

Applying the following diff:

diff --git a/docker-compose.yml b/docker-compose.yml
index 3a80383..c7d81e3 100644
--- a/docker-compose.yml
+++ b/docker-compose.yml
@@ -147,7 +147,7 @@ services:
     command: run cron
   worker:
     << : *sentry_defaults
-    command: run worker
+    command: run worker --loglevel INFO
   ingest-consumer:
     << : *sentry_defaults
     command: run ingest-consumer --all-consumer-types

We can see that the worker gets stuck after a while.

I've bumped parallelism and increased max_tasks_per_child:

diff --git a/docker-compose.yml b/docker-compose.yml
index 3a80383..c7d81e3 100644
--- a/docker-compose.yml
+++ b/docker-compose.yml
@@ -147,7 +147,7 @@ services:
     command: run cron
   worker:
     << : *sentry_defaults
-    command: run worker
+    command: run worker --loglevel INFO -c 10 --max-tasks-per-child 1000000000
   ingest-consumer:
     << : *sentry_defaults
     command: run ingest-consumer --all-consumer-types

And the worker could clear the backlog.
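One plausible reading of why this workaround helps: with the default --max-tasks-per-child, Celery's prefork pool periodically recycles its child processes after a fixed number of tasks, and if that recycling hangs, the worker stalls. A toy model of that hypothesis (this is not Celery's code; the restart_hangs flag is my stand-in for the suspected bug) shows why an effectively infinite --max-tasks-per-child sidesteps it:

```python
def run_worker(tasks, concurrency, max_tasks_per_child, restart_hangs=False):
    """Toy model of a prefork worker pool: each child is recycled after
    serving max_tasks_per_child tasks. If recycling hangs (restart_hangs),
    the child never comes back. Returns how many tasks get processed."""
    children = [{"served": 0, "alive": True} for _ in range(concurrency)]
    done = 0
    for _ in range(tasks):
        child = next((c for c in children if c["alive"]), None)
        if child is None:
            break  # every child is stuck: the queue backlog grows from here
        child["served"] += 1
        done += 1
        if child["served"] >= max_tasks_per_child:
            if restart_hangs:
                child["alive"] = False  # recycling never completes
            else:
                child["served"] = 0     # a fresh child takes over
    return done
```

Under a buggy restart, the pool stalls after roughly concurrency × max_tasks_per_child tasks, while a huge limit (as in the diff above) never triggers the recycling path at all.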

This issue was indeed noticed upstream for 20.8.0 (https://github.com/getsentry/sentry/commit/1c81aa88ec63e5bfb1c458d053f52af2eda6cc06), so upgrading to 20.9.0 would probably have fixed it. I'll continue the upgrades up to 20.12.0 now.

To look at the state of the Celery queues, following https://docs.celeryproject.org/en/stable/userguide/monitoring.html#monitoring-redis-queues:

root@riverside:/var/lib/sentry-onpremise# docker-compose run redis redis-cli -h redis
# Get list of queues and some garbage. All sentry redis queues have a dot in them. Empty queues do not appear in this list!
redis:6379> keys *.*
1) "_kombu.binding.celery.pidbox"
2) "_kombu.binding.default"
3) "b:k:sentry.group:8e6ab22a0499e9dce076eb1042105058"
4) "b:k:sentry.group:051240e922870ea5edea4c2e12fc20fd"
5) "_kombu.binding.triggers"
6) "b:k:sentry.group:b45f83e14f889324a0e09cb8d1aeaeaf"
7) "_kombu.binding.counters"
8) "events.store_event"

# Get length of the given queue; should be 0 or close to 0...
redis:6379> llen events.store_event
(integer) 123456
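The same check can be automated, e.g. for a monitoring probe. A hedged sketch (my own helper, not Sentry tooling; the key-filtering assumptions come from the comments above, and `client` is anything exposing Redis-style `keys`/`llen` methods, such as a redis-py client):

```python
def backlogged_queues(client, threshold=100):
    """Return {queue_name: length} for Sentry Celery queues whose backlog
    exceeds `threshold`. All Sentry queue names contain a dot; kombu
    bookkeeping keys (_kombu.*) and Sentry cache keys (b:k:*) are skipped."""
    backlog = {}
    for key in client.keys("*.*"):
        name = key.decode() if isinstance(key, bytes) else key
        if name.startswith("_kombu.") or name.startswith("b:k:"):
            continue  # not a Celery task queue
        length = client.llen(name)
        if length > threshold:
            backlog[name] = length
    return backlog
```

With the state shown above, this would have flagged events.store_event immediately instead of requiring an interactive redis-cli session.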
olasd claimed this task.

After pushing through the updates up to 20.12.1, it seems the events are being processed correctly. I'm somewhat confident that the updated celery in the sentry image will not exhibit the same processing bug, but I'll keep an eye on the logs for a bit...

I confirm that sentry is still happily processing events, more than 12 hours after the last upgrade :)