Question
I am doing some monitoring with Prometheus and am trying to understand how to properly use the rate function.
The premise is this: I have a counter, and Prometheus is configured to ingest new values for it every 15s.
Now I am trying to graph the per-second rate of this counter, so using the rate function I write:
rate(pgbouncer_sent_bytes_total{job="pgbouncer", database="worker"}[1m])
As I interpret the rate function, the query will give me a rolling average rate (over 1m look-back windows) at each point in time that is queried. The spacing of the points is determined by the resolution used.
Below is a screenshot from the Prometheus console including the raw data graph and the plot from the rate query above using a 1m resolution. The resulting rate graph does not really match what I would expect from the raw data in the bottom graph.
The interesting bit is also that the resulting graph looks very different depending on the point in time at which it is loaded. Simply reloading the same graph a couple of times in a row completely changes its appearance, to the point where it no longer looks like it represents the same data. The image below is the same dataset a few minutes later, but the same thing happens even seconds later.
Could someone shed some light on what is really going on here?
Recommended answer
AFAICT the cause of the weird results is (1) the fact that your counter actually only increases once every minute, even though you collect it every 15 seconds, combined with (2) Prometheus' rate() implementation discarding every 4th counter increase (in your particular setup).
More precisely, you appear to be computing a 1 minute rate, evaluated every 1 minute, over a counter that is scraped at 15 second resolution and increases every 1 minute (on average).
What this means is that Prometheus will essentially slice your 1 hour interval into disjoint 1 minute ranges and estimate the rate over each range. The first value will be the extrapolated rate of increase between points 0 and 3, the second will be the extrapolated rate between points 4 and 7, and so on. Because your counter only actually increases once a minute, you can run into 2 different situations:
- Your counter increases happen between point pairs 3-4, 7-8 etc. In this case Prometheus sees a rate of increase of zero (because there is no increase between points 0 and 3, points 4 and 7 etc.). This seems to be happening in the first half of your first graph.
- Your counter increases happen somewhere between points 0-3, 4-7 etc. In this case Prometheus takes the difference between the last and first points in each interval (your actual counter increase), divides it by the time difference between the 2 points (on average 45 seconds), then extrapolates that to 1 minute, essentially overestimating it by a factor of 1.33. I'm eyeballing an increase of ~200k over ~50 minutes, so an average rate of about 67 QPS, whereas rate() returns something closer to 90 QPS. This is what happens in the second half of your graph.
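The two situations can be reproduced with a small simulation. This is a minimal sketch in Python, not Prometheus' actual code: `simple_rate` is a hypothetical helper that divides the first-to-last difference in a window by the time between those two samples, which is what the extrapolation described above works out to when it stretches over the whole window. The jump size of 4000 per minute is an assumption, chosen to match the eyeballed ~200k over ~50 minutes.

```python
SCRAPE = 15   # seconds between samples
STEP = 60     # the counter jumps once every STEP seconds
JUMP = 4000   # size of each jump (assumed: ~200k over ~50 minutes)


def counter_at(t, jump_offset):
    """Counter value at time t if it jumps at jump_offset, jump_offset + 60, ..."""
    if t < jump_offset:
        return 0
    return ((t - jump_offset) // STEP + 1) * JUMP


def simple_rate(samples):
    """Simplified stand-in for rate(): (last - first) / (t_last - t_first)."""
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (t1 - t0)


def disjoint_window_rates(jump_offset, windows=3):
    """Rates over disjoint groups of 4 samples: points 0-3, 4-7, 8-11, ..."""
    rates = []
    for w in range(windows):
        times = [w * STEP + i * SCRAPE for i in range(4)]
        samples = [(t, counter_at(t, jump_offset)) for t in times]
        rates.append(simple_rate(samples))
    return rates


true_rate = JUMP / STEP
# Jump falls between points 3 and 4: no window ever sees an increase.
between_windows = disjoint_window_rates(jump_offset=50)
# Jump falls between points 0 and 3: each window sees the full jump over 45s.
inside_windows = disjoint_window_rates(jump_offset=20)

print("true rate:      ", true_rate)
print("between windows:", between_windows)
print("inside windows: ", inside_windows)
```

With these assumed numbers the true rate is about 67 per second, while the second case reports 4000 / 45, the same 1.33x overestimate described above.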
This is also why your graph looks wildly different across refreshes. The argument for the current implementation of rate() is that it is "correct on average". Which, if you look at the whole of your graph, across refreshes, is true. </sarcasm>
Essentially, graphing a Prometheus rate() or increase() over a time range R with resolution R will result in aliasing, either overestimating (1.33x in your case) or underestimating (zero in your case), on anything but a smoothly increasing counter.
You can work around it by replacing your expression with:
rate(foo[75s]) / 75 * 60
This way you'll actually get the rate of increase between data points 1 minute apart (a 75 second range will almost always return exactly 5 points, so 4 counter increases) and reverse the extrapolation to 75 seconds that Prometheus does. There will be some noise in edge cases (e.g. if your evaluation is aligned with scraping times, it's possible to get 6 points in one range and 4 in the next due to scrape interval jitter), but you're getting that anyway with rate().
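The point-counting argument is easy to check. The sketch below is in Python and only counts samples; it does not model rate() itself. It assumes perfectly regular scrapes at 0, 15, 30, ... seconds with no jitter, and it treats the range as open on the left and closed on the right, which is an assumption of this sketch. `samples_in_window` is a hypothetical helper.

```python
SCRAPE = 15  # seconds between scrapes (assumed perfectly regular, no jitter)


def samples_in_window(eval_time, window):
    """Scrape timestamps that fall in (eval_time - window, eval_time]."""
    first_idx = (eval_time - window) // SCRAPE + 1  # first scrape after the window start
    last_idx = eval_time // SCRAPE                  # last scrape at or before eval_time
    return [i * SCRAPE for i in range(first_idx, last_idx + 1)]


def counts_and_spans(window, eval_times):
    """Distinct sample counts and first-to-last spans seen across eval_times."""
    counts, spans = set(), set()
    for t in eval_times:
        ts = samples_in_window(t, window)
        counts.add(len(ts))
        spans.add(ts[-1] - ts[0])
    return counts, spans


eval_times = range(600, 660)  # every second of one minute
counts_60, spans_60 = counts_and_spans(60, eval_times)
counts_75, spans_75 = counts_and_spans(75, eval_times)

print("60s range:", counts_60, "samples spanning", spans_60, "seconds")
print("75s range:", counts_75, "samples spanning", spans_75, "seconds")
```

Under these assumptions a 60 second range always holds 4 samples that are 45 seconds apart end to end, while a 75 second range holds 5 samples that are exactly 60 seconds apart, so it contains 4 consecutive scrape-to-scrape differences. Real scrapes have jitter, which is where the occasional 6 or 4 points come from.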
BTW, you can see the aliasing by increasing the resolution of your graph to something like 1 second (anything 15 seconds or below should show it clearly).
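What a 1 second resolution would show can be sketched as follows. This is a simplified Python model, not Prometheus' implementation: `rate_at` is a hypothetical helper that takes the samples in a 60 second range (assumed open on the left), and divides the first-to-last difference by the time between those two samples. The jump size and offset are made-up values.

```python
SCRAPE = 15       # seconds between samples
STEP = 60         # the counter jumps once every STEP seconds
JUMP = 4000       # size of each jump (made-up value)
JUMP_OFFSET = 20  # jumps happen at 20, 80, 140, ... seconds (made-up value)


def counter_at(t):
    """Counter value at time t."""
    if t < JUMP_OFFSET:
        return 0
    return ((t - JUMP_OFFSET) // STEP + 1) * JUMP


def rate_at(eval_time, window=60):
    """Simplified rate() over the samples in (eval_time - window, eval_time]."""
    t0 = ((eval_time - window) // SCRAPE + 1) * SCRAPE  # first sample in range
    t1 = (eval_time // SCRAPE) * SCRAPE                 # last sample in range
    return (counter_at(t1) - counter_at(t0)) / (t1 - t0)


# Evaluate once per second over ten minutes, as a 1 second resolution graph would.
rates = [rate_at(t) for t in range(600, 1200)]

true_rate = JUMP / STEP
zero_fraction = sum(1 for r in rates if r == 0) / len(rates)
mean_rate = sum(rates) / len(rates)

print("true rate:    ", true_rate)
print("distinct rates:", sorted(set(rates)))
print("zero fraction: ", zero_fraction)
print("mean rate:     ", mean_rate)
```

In this model the plotted value flips between zero and 1.33x the true rate depending on where the evaluation time falls relative to the scrapes: one counter increase in four is missed, and the average over all evaluation times still comes out at the true rate.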