Question
Is there a way to monitor the pod status and restart count of pods running in a GKE cluster with Stackdriver?
While I can see CPU, memory and disk usage metrics for all pods in Stackdriver, there seems to be no way of getting metrics about crashing pods, or about pods in a replica set being restarted due to crashes.
I'm using a Kubernetes replica set to manage the pods, so when they crash they are respawned under a new name. As far as I can tell, the metrics in Stackdriver are keyed by pod name (which is unique only for the lifetime of the pod), which doesn't seem very sensible.
Alerting on pod failures seems like such a natural thing to want that I find it hard to believe it is not supported at the moment. As they stand, the monitoring and alerting capabilities that I get from Stackdriver for Google Container Engine seem rather useless, because they are all bound to pods whose lifetime can be very short.
So if this doesn't work out of the box, are there known workarounds or best practices for monitoring continuously crashing pods?
Recommended answer
You can achieve this manually with the following:
- In Logs Viewer, create the following filter:
resource.labels.project_id="<PROJECT_ID>"
resource.labels.cluster_name="<CLUSTER_NAME>"
resource.labels.namespace_name="<NAMESPACE, or default>"
jsonPayload.message:"failed liveness probe"
- Create a metric by clicking the Create Metric button above the filter input and filling in the details.
You may now track this metric in Stackdriver.
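For repeated setups, the same filter can be assembled in code rather than typed into Logs Viewer. The following is a minimal sketch, not part of the original answer: the function names and the metric name `liveness_probe_failures` are hypothetical, the clauses are joined with an explicit `AND`, and the command line it prints assumes the Cloud SDK's `gcloud logging metrics create` command is available in your environment. It only builds and prints the command; it does not run it.

```python
import shlex


def build_liveness_filter(project_id: str, cluster_name: str,
                          namespace: str = "default") -> str:
    """Return the log filter from the answer with the placeholders filled in."""
    clauses = [
        f'resource.labels.project_id="{project_id}"',
        f'resource.labels.cluster_name="{cluster_name}"',
        f'resource.labels.namespace_name="{namespace}"',
        # ":" is a substring match, exactly as in the original filter.
        'jsonPayload.message:"failed liveness probe"',
    ]
    # Join with an explicit AND so the filter fits on a single line.
    return " AND ".join(clauses)


def build_metric_command(metric_name: str, log_filter: str) -> list[str]:
    """Return an argv list for creating a log-based metric (hypothetical helper)."""
    return [
        "gcloud", "logging", "metrics", "create", metric_name,
        "--description=Count of failed liveness probes",
        f"--log-filter={log_filter}",
    ]


if __name__ == "__main__":
    # "my-project" and "my-cluster" are placeholder values.
    log_filter = build_liveness_filter("my-project", "my-cluster")
    command = build_metric_command("liveness_probe_failures", log_filter)
    # Print a shell-quoted command line for review before running it by hand.
    print(shlex.join(command))
```

Printing the command instead of executing it keeps the sketch free of side effects; review the output and adjust the names before running it against a real project.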
I would be happy to hear of a built-in metric that could replace this workaround.