

I have a Kubernetes cluster with 2 slave nodes and 1 master node. When a node goes down, it takes approximately 5 minutes for Kubernetes to notice the failure. I am using dynamic provisioning for volumes, and this delay is too long for me. How can I reduce the failure detection time? I found a post about it: https://fatalfailure.wordpress.com/2016/06/10/improving-kubernetes-reliability-quicker-detection-of-a-node-down/


At the bottom of the post, it says the detection time can be reduced by changing these parameters:

kubelet: node-status-update-frequency=4s (from 10s)
controller-manager: node-monitor-period=2s (from 5s)
controller-manager: node-monitor-grace-period=16s (from 40s)
controller-manager: pod-eviction-timeout=30s (from 5m)
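To see why the defaults add up to roughly 5 minutes, here is a rough sketch of the arithmetic. It assumes a simplified model in which the controller manager marks a node as not ready once node-monitor-grace-period has elapsed without a status update, and starts evicting pods pod-eviction-timeout after that. The class and method names are made up for illustration, and real timings also vary with when the last heartbeat landed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeFailureTimings:
    """Timing parameters from the post, all in seconds (illustrative names)."""
    node_status_update_frequency: int  # kubelet: how often node status is posted
    node_monitor_period: int           # controller-manager: how often node status is checked
    node_monitor_grace_period: int     # controller-manager: silence tolerated before NotReady
    pod_eviction_timeout: int          # controller-manager: wait after NotReady before eviction

    def seconds_until_not_ready(self) -> int:
        # Simplified: the node is flagged once the grace period passes with no update.
        return self.node_monitor_grace_period

    def seconds_until_eviction(self) -> int:
        # Simplified: eviction starts after the grace period plus the eviction timeout.
        return self.node_monitor_grace_period + self.pod_eviction_timeout


DEFAULTS = NodeFailureTimings(10, 5, 40, 300)
TUNED = NodeFailureTimings(4, 2, 16, 30)

if __name__ == "__main__":
    for label, timings in (("defaults", DEFAULTS), ("tuned", TUNED)):
        print(label, timings.seconds_until_not_ready(), timings.seconds_until_eviction())
```

Under this model the defaults give 40 s + 300 s = 340 s before pods are rescheduled, which matches the "approximately 5 minutes" observed, while the tuned values give 16 s + 30 s = 46 s.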


I can change the node-status-update-frequency parameter on the kubelet, but I don't have any controller manager program or command on the CLI. How can I change those parameters? Any other suggestions for reducing the failure detection time would be appreciated.


You can change or add those parameters in the controller-manager systemd unit file and then restart the daemon. Please check the man pages for controller-manager here.


If you deploy controller-manager as a microservice (pod), check the manifest file for that pod and change the parameters in the container's command section (for example, like this).
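As a rough illustration of the pod case, the fragment below shows where the flags from the question would sit in the container's command section. It is a sketch, not a complete manifest: it assumes the controller manager runs as a static pod (on kubeadm-style clusters the file is usually /etc/kubernetes/manifests/kube-controller-manager.yaml, which may not match your setup), and it leaves out the image and all the flags your existing manifest already carries. Whether each flag is accepted depends on your Kubernetes version.

```yaml
# Illustrative excerpt only; keep the image and existing flags from your own manifest.
apiVersion: v1
kind: Pod
metadata:
  name: kube-controller-manager
  namespace: kube-system
spec:
  containers:
  - name: kube-controller-manager
    command:
    - kube-controller-manager
    # ...existing flags stay here unchanged...
    - --node-monitor-period=2s        # default 5s
    - --node-monitor-grace-period=16s # default 40s
    - --pod-eviction-timeout=30s      # default 5m
```

With a static pod, the kubelet normally picks up the edited manifest and restarts the container on its own. The kubelet's node-status-update-frequency is set on each node's kubelet, not in this manifest.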


08-13 12:55