Too many open files error on kafka
Closed, MigratedEdits Locked
Actions

Assigned To

Authored By

	vsellier
	Aug 25 2021, 9:04 AM

Description

The kafka server are in a crashing loop due to a Too many open files error.

[2021-08-25 06:58:14,898] ERROR Error while accepting connection (kafka.network.Acceptor)
java.io.IOException: Too many open files
        at java.base/sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
        at java.base/sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:533)
        at java.base/sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:285)
        at kafka.network.Acceptor.accept(SocketServer.scala:671)
        at kafka.network.Acceptor.run(SocketServer.scala:601)
        at java.base/java.lang.Thread.run(Thread.java:829)

Revisions and Commits

rSPSITE puppet-swh-site
	D6130	rSPSITEaa2e550eb111 kafka: increase the open file limit

Related Objects

Mentioned Here: D6130: kafka: increase the open file limit

Event Timeline

vsellier changed the task status from Open to Work in Progress.Aug 25 2021, 9:04 AM

vsellier triaged this task as High priority.

vsellier created this task.

Incident created on status.io
Loader disabled:

root@pergamon:~# clush -b -w @swh-workers 'puppet agent --disable "Kafka incident T3501"; systemctl stop cron; cd /etc/systemd/system/multi-user.target.wants; for unit in swh-worker@loader_*; do systemctl disable $unit; done; systemctl stop "swh-worker@loader_*"'

The open file limit was manually increased to stabilize the cluster:

# puppet agent --disable T3501
# diff -U3 /tmp/kafka.service kafka.service
--- /tmp/kafka.service	2021-08-25 07:32:28.068928972 +0000
+++ kafka.service	2021-08-25 07:32:31.384955246 +0000
@@ -15,7 +15,7 @@
 Environment='LOG_DIR=/var/log/kafka'
 Type=simple
 ExecStart=/opt/kafka/bin/kafka-server-start.sh /opt/kafka/config/server.properties
-LimitNOFILE=65536
+LimitNOFILE=131072
 
 [Install]
 WantedBy=multi-user.target

# systemctl daemon-reload
# systemctl restart kafka

kafka has correctly restarted after that

It seems we have reached a limit, perhaps with the extid topic extension. After the restart, with only the s3 content replayer and the journal clients active:

root@kafka2:~# cd /proc/1978593/fd
root@kafka2:/proc/1978593/fd# ls -al | wc -l
65374

which is almost the limit

I will restart a worker to check the behavior and prepare the diff to puppetize the change

all the loaders are restarted on worker01 and workers02, it seems the cluster is ok.

The number of open files looks to be stable:

root@kafka2:/proc/1978593/fd# ls -al | wc -l
65419

The kafka servers are only running kafka and zookeeper, so the limit of open files isn't that critical. I think we can bump the limit more substantially than just x2 (maybe go directly with x8?), as I expect we'll still be adding more topics in the future.

ok roger that :).
I will increase to 524288 in the diff

vsellier added a revision: D6130: kafka: increase the open file limit.Aug 25 2021, 10:25 AM

vsellier added a commit: rSPSITEaa2e550eb111: kafka: increase the open file limit.Aug 25 2021, 10:32 AM

D6130 landed and applied one kafka at a time

root@kafka4:~# cat /proc/4125205/limits
Limit                     Soft Limit           Hard Limit           Units     
Max cpu time              unlimited            unlimited            seconds   
Max file size             unlimited            unlimited            bytes     
Max data size             unlimited            unlimited            bytes     
Max stack size            8388608              unlimited            bytes     
Max core file size        0                    unlimited            bytes     
Max resident set          unlimited            unlimited            bytes     
Max processes             514030               514030               processes 
Max open files            524288               524288               files     
Max locked memory         65536                65536                bytes     
Max address space         unlimited            unlimited            bytes     
Max file locks            unlimited            unlimited            locks     
Max pending signals       514030               514030               signals   
Max msgqueue size         819200               819200               bytes     
Max nice priority         0                    0                    
Max realtime priority     0                    0                    
Max realtime timeout      unlimited            unlimited            us

all the workers are restarted
Several save code now requests look stuck in the scheduled status, currently looking how to unblock them

Save code now requests rescheduled:

swh-web=> select * from save_origin_request where loading_task_status='scheduled' limit 100;
...
<output loast due to the psql pager :(
...

softwareheritage-scheduler=> select * from task where id in (398244739, 398244740, 398244742, 398244744, 398244745, 398244748, 398095676, 397470401, 397470402, 397470404, 397470399);

few minutes later:

swh-web=> select * from save_origin_request where loading_task_status='scheduled' limit 100;
 id | request_date | visit_type | origin_url | status | loading_task_id | visit_date | loading_task_status | visit_status | user_ids 
----+--------------+------------+------------+--------+-----------------+------------+---------------------+--------------+----------
(0 rows)

status.io incident closed

This task has been migrated to GitLab.

Too many open files error on kafkaClosed, MigratedEdits LockedActions

Description

Revisions and Commits

Related Objects

Event Timeline

Too many open files error on kafka
Closed, MigratedEdits Locked
Actions