
swh-search / staging: transient timeouts on elasticsearch queries
Closed, Migrated · Edits Locked

Description

Priority set to normal, as it only happens in staging and is probably related to the elasticsearch configuration in this environment.

Sentry link: https://sentry.softwareheritage.org/share/issue/bb5a04156b8b4b1696a50cf8e24349d2/

Event Timeline

vsellier triaged this task as Normal priority. Dec 9 2021, 12:38 PM
vsellier created this task.

Looks like the server is short on heap:

[2022-02-17T15:26:30,847][INFO ][o.e.m.j.JvmGcMonitorService] [search-esnode0] [gc][5965188] overhead, spent [408ms] collecting in the last [1s]
[2022-02-17T15:27:08,154][INFO ][o.e.m.j.JvmGcMonitorService] [search-esnode0] [gc][5965225] overhead, spent [296ms] collecting in the last [1s]
[2022-02-17T15:29:31,383][WARN ][o.e.m.j.JvmGcMonitorService] [search-esnode0] [gc][young][5965368][3283] duration [1s], collections [1]/[1.1s], total [1s]/[5.8m], memory [8.2gb]->[5.4gb]/[16gb], all_pools {[young] [2.8gb]->[0b]/[0b]}{[old] [4.7gb]->[5.3gb]/[16gb]}{[survivor] [652mb]->[184mb]/[0b]}
[2022-02-17T15:29:31,384][WARN ][o.e.m.j.JvmGcMonitorService] [search-esnode0] [gc][5965368] overhead, spent [1s] collecting in the last [1.1s]
[2022-02-17T15:31:49,449][INFO ][o.e.m.j.JvmGcMonitorService] [search-esnode0] [gc][5965506] overhead, spent [260ms] collecting in the last [1s]
[2022-02-17T15:33:46,505][INFO ][o.e.m.j.JvmGcMonitorService] [search-esnode0] [gc][5965623] overhead, spent [256ms] collecting in the last [1s]
[2022-02-17T15:37:11,728][INFO ][o.e.m.j.JvmGcMonitorService] [search-esnode0] [gc][5965828] overhead, spent [372ms] collecting in the last [1s]
[2022-02-17T15:47:19,087][INFO ][o.e.m.j.JvmGcMonitorService] [search-esnode0] [gc][5966435] overhead, spent [289ms] collecting in the last [1s]
[2022-02-17T15:49:56,439][INFO ][o.e.m.j.JvmGcMonitorService] [search-esnode0] [gc][5966592] overhead, spent [315ms] collecting in the last [1.1s]
[2022-02-17T15:55:40,579][INFO ][o.e.m.j.JvmGcMonitorService] [search-esnode0] [gc][5966936] overhead, spent [274ms] collecting in the last [1s]
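
For reference, the heap pressure can also be checked directly through the _cat API (a quick sketch, using the same node address as in the cleanup commands below):

# heap usage per node; 192.168.130.80:9200 is the address used later in this task
curl -s "http://192.168.130.80:9200/_cat/nodes?v&h=name,heap.percent,heap.current,heap.max"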
vsellier changed the task status from Open to Work in Progress. Feb 21 2022, 11:59 AM
vsellier claimed this task.
vsellier moved this task from Backlog to in-progress on the System administration board.

First, clean up the unused resources, even if it will not free a lot of space:

  • aliases cleanup
vsellier@search-esnode0 ~ % export ES_SERVER=192.168.130.80:9200
vsellier@search-esnode0 ~ % curl -XGET http://$ES_SERVER/_cat/aliases
origin-read         origin-v0.11  - - - -
origin-write        origin-v0.11  - - - -
origin-v0.9.0-read  origin-v0.9.0 - - - -
origin-v0.9.0-write origin-v0.9.0 - - - -
vsellier@search-esnode0 ~ % curl -XDELETE http://$ES_SERVER/origin-v0.9.0/_alias/origin-v0.9.0-read
{"acknowledged":true}%
vsellier@search-esnode0 ~ % curl -XDELETE -H "Content-Type: application/json" http://$ES_SERVER/origin-v0.9.0/_alias/origin-v0.9.0-write
{"acknowledged":true}%
vsellier@search-esnode0 ~ % curl http://$ES_SERVER/_cat/indices                           
green open  origin-v0.11                HljzsdD9SmKI7-8ekB_q3Q 80 0 4206243 569646 4.2gb 4.2gb
green close origin                      HthJj42xT5uO7w3Aoxzppw 80 0                           
green close origin-v0.9.0               o7FiYJWnTkOViKiAdCXCuA 80 0                           
green close origin-v0.10.0              -fvf4hK9QDeN8qYTJBBlxQ 80 0                           
green close origin-backup-20210209-1736 P1CKjXW0QiWM5zlzX46-fg 80 0                           
green close origin-v0.5.0               SGplSaqPR_O9cPYU4ZsmdQ 80 0
  • indexes cleanup:
search-esnode0 ~ % df -h  /srv/elasticsearch                                                                     
Filesystem                           Size  Used Avail Use% Mounted on
/dev/mapper/base--template--vg-root   17G  5.2G   11G  33% /
tmpfs                                 16G     0   16G   0% /dev/shm
tmpfs                                5.0M     0  5.0M   0% /run/lock
/dev/vda1                            236M  123M  101M  55% /boot
elasticsearch-volume                 194G   12G  182G   7% /srv/elasticsearch
tmpfs                                3.1G     0  3.1G   0% /run/user/1025
vsellier@search-esnode0 ~ % curl -XDELETE http://$ES_SERVER/origin
{"acknowledged":true}%
vsellier@search-esnode0 ~ % curl -XDELETE http://$ES_SERVER/origin-v0.9.0
{"acknowledged":true}%
vsellier@search-esnode0 ~ % curl -XDELETE http://$ES_SERVER/origin-v0.10.0
{"acknowledged":true}%
vsellier@search-esnode0 ~ % curl -XDELETE http://$ES_SERVER/origin-backup-20210209-1736
{"acknowledged":true}%
vsellier@search-esnode0 ~ % curl -XDELETE http://$ES_SERVER/origin-v0.5.0
{"acknowledged":true}%
search-esnode0 ~ % df -h /srv/elasticsearch
Filesystem            Size  Used Avail Use% Mounted on
elasticsearch-volume  194G  4.4G  189G   3% /srv/elasticsearch
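
For the record, a quick way to double-check what remains after the cleanup (same $ES_SERVER as above):

# remaining indices and per-node disk allocation after the deletions
curl -s "http://$ES_SERVER/_cat/indices?v"
curl -s "http://$ES_SERVER/_cat/allocation?v"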

Elasticsearch was restarted and the sentry issues were closed.
Let's monitor whether the GCs come back.
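
A possible way to keep an eye on the GC activity in the meantime (a sketch; the log file path is an assumption to adapt to this node):

# GC counters exposed by the node itself (jvm.gc.collectors.young/old in the response)
curl -s "http://$ES_SERVER/_nodes/stats/jvm?pretty"
# JvmGcMonitorService warnings in the logs, if they come back
grep 'JvmGcMonitorService' /var/log/elasticsearch/*.log | tail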

The index managed by this server is not that big (~4.2gb), so the 16GB of the VM should be enough.
It's a lot smaller than what the production cluster handles (~330gb per node), which doesn't have this GC problem.

After the elasticsearch restart, there are no more messages about GC overhead in the logs, but there were a couple of timeouts during the night.
Further investigation is needed.
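
A possible next step (not done yet) would be to enable the search slow log on the active index, so that the slow queries behind the timeouts leave a server-side trace; the thresholds below are arbitrary examples:

# enable the search slowlog on the index currently served (origin-v0.11)
curl -XPUT -H "Content-Type: application/json" "http://$ES_SERVER/origin-v0.11/_settings" -d '
{
  "index.search.slowlog.threshold.query.warn": "5s",
  "index.search.slowlog.threshold.query.info": "2s"
}'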