Problem description
I'm trying to write a Prometheus query that can tell me how much, as a percentage, CPU (and another for memory and network) each namespace has used over a time frame, say a week.
The metrics I'm trying to use are
container_spec_cpu_shares
and container_memory_working_set_bytes,
but I can't figure out how to sum them over time. Whatever I try either returns 0 or errors.
Any help on how to write a query for this would be greatly appreciated.
Solution
To check the percentage of memory used by each namespace you will need a query similar to the one below:
sum( container_memory_working_set_bytes{container="", namespace=~".+"} ) by (namespace) / ignoring (namespace) group_left sum( machine_memory_bytes{}) * 100
The above query should produce a graph similar to the one shown in the screenshot (taken from Grafana for better visibility).
Disclaimer: this query does not account for changes in the amount of available RAM (node changes, node autoscaling, etc.).
To get the metric over a period of time in PromQL you will need to use an additional function such as:
avg_over_time(EXPR[time])
To go back in time and calculate resources from a specific point in time you will need to use:
offset TIME
Using the above pointers, the query combines into:
avg_over_time( sum(container_memory_working_set_bytes{container="", namespace=~".+"} offset 45m) by (namespace)[120m:]) / ignoring (namespace) group_left sum( machine_memory_bytes{})
The above query will calculate the average memory used by each namespace and divide it by all memory in the cluster, over a span of 120 minutes, with the whole window shifted 45 minutes back from the present time.
Example:
- Time of running the query: 20:00
avg_over_time(EXPR[2h:])
offset 45 min
The above example will start at 17:15 and run the query up to 19:15. You can modify it to cover a whole week, for example:
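A minimal sketch of a week-long variant (my assumption: a 1w subquery range and no offset, so the window ends at the present time):
# average memory share per namespace over the last week, as a percentage
avg_over_time( sum( container_memory_working_set_bytes{container="", namespace=~".+"} ) by (namespace)[1w:]) / ignoring (namespace) group_left sum( machine_memory_bytes{}) * 100
Note that a week-long subquery can be expensive to evaluate, so a recording rule may be preferable for regular use.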
If you want to calculate the CPU usage by namespace you can replace the memory metrics with the ones below:
container_cpu_usage_seconds_total{}
- please check the rate() function when using this metric (it is a counter)
machine_cpu_cores{}
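A sketch of what such a CPU query could look like, built by analogy with the memory query above (the 5m rate() window is my assumption, not something from the original answer):
# average CPU share per namespace over the last week, as a percentage of all cores
avg_over_time( sum( rate(container_cpu_usage_seconds_total{container="", namespace=~".+"}[5m]) ) by (namespace)[1w:]) / ignoring (namespace) group_left sum( machine_cpu_cores{}) * 100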
You could also look at these network metrics:
container_network_receive_bytes_total
- please check the rate() function when using this metric (it is a counter)
container_network_transmit_bytes_total
- please check the rate() function when using this metric (it is a counter)
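As a sketch (again, the 5m rate() window is my assumption), the received bandwidth per namespace could be queried like this; swapping in container_network_transmit_bytes_total gives the transmit side:
# average bytes received per second, per namespace
sum( rate(container_network_receive_bytes_total{namespace=~".+"}[5m]) ) by (namespace)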
I've included more explanation below with examples (memory), methodology of testing and dissection of used queries.
Let's assume:
- Kubernetes cluster 1.18.6 (Kubespray) with 12GB of memory in total:
  - master node with 2GB of memory
  - worker-one node with 8GB of memory
  - worker-two node with 2GB of memory
- Prometheus and Grafana installed with: Github.com: Coreos: Kube-prometheus
- Namespace kruk with a single ubuntu pod set to generate artificial load with the command below:
$ stress-ng --vm 1 --vm-bytes <AMOUNT_OF_RAM_USED> --vm-method all -t 60m -v
The artificial load was generated with stress-ng two times:
- 60 minutes - 1GB of memory used
- 60 minutes - 2GB of memory used
The percentage of memory used by namespace kruk in this timespan:
- 1GB, which accounts for about ~8.5% of all memory in the cluster (12GB)
- 2GB, which accounts for about ~17.5% of all memory in the cluster (12GB)
The load from the Prometheus query for the kruk namespace looked like the graph in the original screenshot.
Calculation using avg_over_time(EXPR[time:]) / memory in the cluster showed a usage of about 13% ((17.5+8.5)/2) when querying the time the artificial load was generated. This should indicate that the query is correct.
As for the used query:
avg_over_time( sum( container_memory_working_set_bytes{container="", namespace="kruk"} offset 1380m ) by (namespace)[120m:]) / ignoring (namespace) group_left sum( machine_memory_bytes{}) * 100
The above query is really similar to the one at the beginning, but I've made some changes to show only the kruk namespace.
I divided the query explanation into 2 parts (dividend/divisor).
Dividend
container_memory_working_set_bytes{container="", namespace="kruk"}
This metric will output records of memory usage in the kruk namespace. If you were to query all namespaces, look at this additional explanation:
namespace=~".+" <- this regexp will match only when the value of the namespace label contains 1 or more characters. This is to avoid empty namespace results with aggregated metrics.
container="" <- this part is used to filter the metrics. If you were to query without it, you would get multiple memory usage metrics for each container/pod, like in the citation below. container="" will match only when the container value is empty (the last row in the citation below); the container="POD" row belongs to the pause container.
container_memory_working_set_bytes{container="POD",endpoint="https-metrics",id="/kubepods/podab1ed1fb-dc8c-47db-acc8-4a01e3f9ea1b/e249c12010a27f82389ebfff3c7c133f2a5da19799d2f5bb794bcdb5dc5f8bca",image="k8s.gcr.io/pause:3.2",instance="192.168.0.124:10250",job="kubelet",metrics_path="/metrics/cadvisor",name="k8s_POD_ubuntu_kruk_ab1ed1fb-dc8c-47db-acc8-4a01e3f9ea1b_0",namespace="kruk",node="worker-one",pod="ubuntu",service="kubelet"} 692224 container_memory_working_set_bytes{container="ubuntu",endpoint="https-metrics",id="/kubepods/podab1ed1fb-dc8c-47db-acc8-4a01e3f9ea1b/fae287e7043ff00da16b6e6a8688bfba0bfe30634c52e7563fcf18ac5850f6d9",image="ubuntu@sha256:5d1d5407f353843ecf8b16524bc5565aa332e9e6a1297c73a92d3e754b8a636d",instance="192.168.0.124:10250",job="kubelet",metrics_path="/metrics/cadvisor",name="k8s_ubuntu_ubuntu_kruk_ab1ed1fb-dc8c-47db-acc8-4a01e3f9ea1b_0",namespace="kruk",node="worker-one",pod="ubuntu",service="kubelet"} 2186403840 container_memory_working_set_bytes{endpoint="https-metrics",id="/kubepods/podab1ed1fb-dc8c-47db-acc8-4a01e3f9ea1b",instance="192.168.0.124:10250",job="kubelet",metrics_path="/metrics/cadvisor",namespace="kruk",node="worker-one",pod="ubuntu",service="kubelet"} 2187096064
sum( container_memory_working_set_bytes{container="", namespace="kruk"} offset 1380m ) by (namespace)
This query will sum the results by their respective namespaces.
offset 1380m is used to go back in time, as the tests were made in the past.
avg_over_time( sum( container_memory_working_set_bytes{container="", namespace="kruk"} offset 1380m ) by (namespace)[120m:])
This query will calculate the average of the memory metric across namespaces over the specified 120m window, shifted 1380m earlier than the present time.
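If you want to control how often the inner expression is sampled, you can add an explicit resolution after the colon of the subquery (the 1m step here is my assumption; when the resolution is omitted, Prometheus falls back to the global evaluation interval):
# the dividend from above, sampled every minute inside the 120m window
avg_over_time( sum( container_memory_working_set_bytes{container="", namespace="kruk"} offset 1380m ) by (namespace)[120m:1m])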
You can read more about avg_over_time() in the Prometheus documentation.
Divisor
sum( machine_memory_bytes{})
This expression will sum the memory available across all nodes in the cluster.
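As a quick sanity check for the assumed 2GB + 8GB + 2GB cluster described above, running the divisor on its own should return a single sample of roughly 12GB:
# total memory across all nodes in the cluster
sum( machine_memory_bytes{} )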
EXPR / ignoring (namespace) group_left sum( machine_memory_bytes{}) * 100
Focusing on:
/ ignoring (namespace) group_left
<- this expression will allow you to divide each "record" in the dividend (each namespace with its memory average across time) by the divisor (all memory in the cluster). You can read more about it here: Prometheus.io: Vector matching
* 100 is rather self-explanatory and will multiply the result by 100 so that it looks more like a percentage.