Page MenuHomeSoftware Heritage

Too many open files error on kafka
Closed, ResolvedPublic

Description

The kafka server are in a crashing loop due to a Too many open files error.

[2021-08-25 06:58:14,898] ERROR Error while accepting connection (kafka.network.Acceptor)
java.io.IOException: Too many open files
        at java.base/sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
        at java.base/sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:533)
        at java.base/sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:285)
        at kafka.network.Acceptor.accept(SocketServer.scala:671)
        at kafka.network.Acceptor.run(SocketServer.scala:601)
        at java.base/java.lang.Thread.run(Thread.java:829)

Event Timeline

vsellier changed the task status from Open to Work in Progress.Aug 25 2021, 9:04 AM
vsellier triaged this task as High priority.
vsellier created this task.
  • Incident created on status.io
  • Loader disabled:
root@pergamon:~# clush -b -w @swh-workers 'puppet agent --disable "Kafka incident T3501"; systemctl stop cron; cd /etc/systemd/system/multi-user.target.wants; for unit in swh-worker@loader_*; do systemctl disable $unit; done; systemctl stop "swh-worker@loader_*"'

The open file limit was manually increased to stabilize the cluster:

# puppet agent --disable T3501
# diff -U3 /tmp/kafka.service kafka.service
--- /tmp/kafka.service	2021-08-25 07:32:28.068928972 +0000
+++ kafka.service	2021-08-25 07:32:31.384955246 +0000
@@ -15,7 +15,7 @@
 Environment='LOG_DIR=/var/log/kafka'
 Type=simple
 ExecStart=/opt/kafka/bin/kafka-server-start.sh /opt/kafka/config/server.properties
-LimitNOFILE=65536
+LimitNOFILE=131072
 
 [Install]
 WantedBy=multi-user.target

# systemctl daemon-reload
# systemctl restart kafka
  • kafka has correctly restarted after that

It seems we have reached a limit, perhaps with the extid topic extension. After the restart, with only the s3 content replayer and the journal clients active:

root@kafka2:~# cd /proc/1978593/fd
root@kafka2:/proc/1978593/fd# ls -al | wc -l
65374

which is almost the limit

I will restart a worker to check the behavior and prepare the diff to puppetize the change

all the loaders are restarted on worker01 and workers02, it seems the cluster is ok.

The number of open files looks to be stable:

root@kafka2:/proc/1978593/fd# ls -al | wc -l
65419

The kafka servers are only running kafka and zookeeper, so the limit of open files isn't that critical. I think we can bump the limit more substantially than just x2 (maybe go directly with x8?), as I expect we'll still be adding more topics in the future.

ok roger that :).
I will increase to 524288 in the diff

vsellier moved this task from Backlog to done on the System administration board.

D6130 landed and applied one kafka at a time

root@kafka4:~# cat /proc/4125205/limits
Limit                     Soft Limit           Hard Limit           Units     
Max cpu time              unlimited            unlimited            seconds   
Max file size             unlimited            unlimited            bytes     
Max data size             unlimited            unlimited            bytes     
Max stack size            8388608              unlimited            bytes     
Max core file size        0                    unlimited            bytes     
Max resident set          unlimited            unlimited            bytes     
Max processes             514030               514030               processes 
Max open files            524288               524288               files     
Max locked memory         65536                65536                bytes     
Max address space         unlimited            unlimited            bytes     
Max file locks            unlimited            unlimited            locks     
Max pending signals       514030               514030               signals   
Max msgqueue size         819200               819200               bytes     
Max nice priority         0                    0                    
Max realtime priority     0                    0                    
Max realtime timeout      unlimited            unlimited            us
  • all the workers are restarted
  • Several save code now requests look stuck in the scheduled status, currently looking how to unblock them

Save code now requests rescheduled:

swh-web=> select * from save_origin_request where loading_task_status='scheduled' limit 100;
...
<output loast due to the psql pager :(
...
softwareheritage-scheduler=> select * from task where id in (398244739, 398244740, 398244742, 398244744, 398244745, 398244748, 398095676, 397470401, 397470402, 397470404, 397470399);

few minutes later:

swh-web=> select * from save_origin_request where loading_task_status='scheduled' limit 100;
 id | request_date | visit_type | origin_url | status | loading_task_id | visit_date | loading_task_status | visit_status | user_ids 
----+--------------+------------+------------+--------+-----------------+------------+---------------------+--------------+----------
(0 rows)