Jan 8 2023
Oct 19 2022
Oct 4 2022
Closing, as there have been no alerts for almost a month.
Sep 15 2022
Sep 6 2022
The root cause is a swh-graph experiment that generated a lot of gRPC errors, which are huge.
No consumer seems to have a big lag on these topics, so it should be possible to reduce the lag to unblock the server and then have a look at which service is sending the events:
root@riverside:/var/lib/sentry-onpremise# docker-compose-1.29.2 run --rm kafka kafka-consumer-groups --bootstrap-server kafka:9092 --list | tr -d '\r' | xargs -t -n1 docker-compose-1.29.2 run --rm kafka kafka-consumer-groups --bootstrap-server kafka:9092 --describe --group | grep -e GROUP -e " events "
Creating sentry-self-hosted_kafka_run ... done
docker-compose-1.29.2 run --rm kafka kafka-consumer-groups --bootstrap-server kafka:9092 --describe --group snuba-consumers
Creating sentry-self-hosted_kafka_run ... done
GROUP            TOPIC   PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG   CONSUMER-ID  HOST  CLIENT-ID
snuba-consumers  events  0          82585390        82587094        1704  -            -     -
docker-compose-1.29.2 run --rm kafka kafka-consumer-groups --bootstrap-server kafka:9092 --describe --group snuba-post-processor:sync:6fa9928e1d6911edac290242ac170014
Creating sentry-self-hosted_kafka_run ... done
GROUP  TOPIC  PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG  CONSUMER-ID  HOST  CLIENT-ID
docker-compose-1.29.2 run --rm kafka kafka-consumer-groups --bootstrap-server kafka:9092 --describe --group ingest-consumer
Creating sentry-self-hosted_kafka_run ... done
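The same check can be scripted if needed; here is a rough sketch using kafka-python (an assumption, it is not necessarily installed there) against the compose-internal broker address:

```python
# Rough sketch: print the lag of every consumer group/partition with a
# non-zero backlog. Assumes kafka-python is available and the broker is
# reachable as kafka:9092 (the docker-compose internal address).
from kafka import KafkaAdminClient, KafkaConsumer

admin = KafkaAdminClient(bootstrap_servers="kafka:9092")
consumer = KafkaConsumer(bootstrap_servers="kafka:9092")

for group_id, _protocol in admin.list_consumer_groups():
    committed = admin.list_consumer_group_offsets(group_id)
    if not committed:
        continue
    end_offsets = consumer.end_offsets(list(committed))
    for tp, meta in committed.items():
        if meta.offset < 0:  # no committed offset for this partition
            continue
        lag = end_offsets[tp] - meta.offset
        if lag:
            print(f"{group_id} {tp.topic}:{tp.partition} lag={lag}")
```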
The biggest topics are:
root@riverside:/var/lib/docker/volumes/sentry-kafka/_data# du -sch * | sort -h | tail -n 5
31M   snuba-commit-log-0
291M  outcomes-0
30G   ingest-events-0
43G   events-0
73G   total
Aug 24 2022
Aug 17 2022
Aug 9 2022
Feb 9 2022
In the end it seems everything is already ok (another cooking task issue was reported [1]), so closing this.
I don't currently know how to trigger an error in the vault; you'd have to change the code to do that manually :|
Also, that's a cooker worker issue reported by Sentry about 18h ago (as of the time of this comment).
I'll let you trigger some cooking and report your findings here.
Something is bothering me: isn't there already some catch-all exception handling somewhere in the vault source code?
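For illustration only, a hypothetical handler (not actual swh-vault code) of the kind I have in mind; a broad catch-all like this keeps the error out of Sentry unless the exception is re-raised or captured explicitly:

```python
# Hypothetical sketch, not swh-vault code: shows why a catch-all handler can
# hide errors from Sentry.
import logging

import sentry_sdk

logger = logging.getLogger(__name__)


def run_cooking(bundle_id):
    # Stand-in for the real cooking logic; always fails so the handler runs.
    raise RuntimeError(f"simulated cooking failure for {bundle_id}")


def cook(bundle_id):
    try:
        run_cooking(bundle_id)
    except Exception:
        # Logging alone is not enough for Sentry to notice the failure; it has
        # to be re-raised or reported explicitly.
        logger.exception("cooking of %s failed", bundle_id)
        sentry_sdk.capture_exception()
        raise
```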
Feb 8 2022
I wonder if replacing @worker_init.connect with @worker_process_init.connect at https://forge.softwareheritage.org/source/swh-scheduler/browse/master/swh/scheduler/celery_backend/config.py$157 would work.
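For reference, a minimal sketch of what that change could look like, assuming a plain sentry_sdk.init() call and a placeholder DSN (the actual initialization logic in config.py may differ):

```python
# Sketch only: initialize the Sentry SDK in each forked worker process rather
# than only once in the parent worker. The DSN is a placeholder; the real one
# comes from the scheduler configuration.
import sentry_sdk
from celery.signals import worker_process_init


@worker_process_init.connect
def init_sentry(**kwargs):
    # worker_process_init fires in every prefork child, i.e. in the processes
    # that actually execute tasks, so each of them gets its own Sentry client.
    sentry_sdk.init(dsn="https://<key>@<sentry-host>/<project-id>")
```

The difference is that worker_init only fires in the parent worker process before forking, so the prefork children may end up without a working Sentry client.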
So I'm guessing Celery is eating the logs somehow, which is why Sentry doesn't see them.
So I don't currently know what's wrong (if anything is).
So I was wrong, it is correctly set [1].
And there are sentry issues about workers [2].
Yes, I confirm (from the #swh-sysadm discussion).
Looking at the Puppet configuration, my guess is that the sentry_dsn is not set for the vault cookers.
Oct 15 2021
Sep 29 2021
In the meantime, logs can be reached in the dedicated dashboard.
Sep 17 2021
Sep 8 2021
Sep 3 2021
Aug 5 2021
uh, indeed
Does it?
Jul 29 2021
Feb 18 2021
Feb 11 2021
Well, my concern was about having different versions running at the same time, but Sentry is able to detect the version, so that's not it.
I'm not sure I understand the real problem here.
As the indexer and indexer-storage are in the same source repository, their versions should match or increase in parallel. Sentry should be able to deal with that as with any other version upgrade.
Feb 8 2021
Feb 2 2021
Jan 6 2021
Dec 21 2020
- before:
root@riverside:~# pvscan
  PV /dev/sda1   VG riverside-vg   lvm2 [<63.98 GiB / 0    free]
  Total: 1 [<63.98 GiB] / in use: 1 [<63.98 GiB] / in no VG: 0 [0   ]
root@riverside:~# df -h /
Filesystem                      Size  Used Avail Use% Mounted on
/dev/mapper/riverside--vg-root   60G   56G  1.4G  98% /
(2% free: some cleanup seems to have occurred since the creation of the task :) )
- disk extended by 16GB on Proxmox
(extract of dmesg on riverside)
[350521.461023] sd 2:0:0:0: Capacity data has changed
[350521.461339] sd 2:0:0:0: [sda] 167772160 512-byte logical blocks: (85.9 GB/80.0 GiB)
[350521.461484] sda: detected capacity change from 68719476736 to 85899345920
- partition resized:
root@riverside:~# parted /dev/sda
GNU Parted 3.2
Using /dev/sda
Welcome to GNU Parted! Type 'help' to view a list of commands.
(parted) print free
Model: QEMU QEMU HARDDISK (scsi)
Disk /dev/sda: 85.9GB
Sector size (logical/physical): 512B/512B
Partition Table: msdos
Disk Flags:
Dec 19 2020
I confirm that sentry is still happily processing events, more than 12 hours after the last upgrade :)
Dec 18 2020
\o/
After pushing through the updates up to 20.12.1, it seems the events are being processed correctly. I'm somewhat confident that the updated celery in the sentry image will not exhibit the same processing bug, but I'll keep an eye on the logs for a bit...
To look at the state of the celery queues, from https://docs.celeryproject.org/en/stable/userguide/monitoring.html#monitoring-redis-queues:
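The exact redis-cli invocation from that page isn't reproduced here, but a rough Python equivalent (assumptions: redis-py is available, Sentry's Redis is reachable on localhost:6379 db 0, and the workers use the default 'celery' queue name) would be:

```python
# Rough sketch: report the backlog of Celery task queues stored as Redis
# lists. Host, port, db and queue names are assumptions; adjust to the actual
# Sentry deployment (its Redis runs inside the docker-compose stack).
import redis

r = redis.Redis(host="localhost", port=6379, db=0)

for queue in ("celery",):  # add any other queue names used by Sentry's workers
    print(queue, r.llen(queue))
```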
Looks like the events are/were getting stuck at the celery stage.
Thanks for the thorough investigation so far!
To eliminate another possible root cause, a test was done in a temporary project with the latest version of the Python library; it doesn't work either.
Dec 17 2020
We have followed the event's path through the consumer code without finding anything suspicious.
As a last resort, we fully rebooted the VM, but as expected, it changed nothing at all.
@olasd, if you have some details of the version upgrades you performed yesterday, they might help with the diagnosis.