Introduction to Alertmanager
Core Concepts in Alertmanager
Grouping: batches related alerts into a single notification instead of sending each one separately.
Inhibition: suppresses notifications for certain alerts while other, related alerts are already firing.
Silences: temporarily mute alerts that match a given set of matchers.
Client behavior: Alertmanager places special requirements on its clients (normally Prometheus).
High Availability: Alertmanager supports clustering to build highly available deployments.
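The Grouping concept can be sketched in a few lines. This is an illustrative example only, not Alertmanager's actual implementation: alerts that share the same values for the `group_by` labels end up in one notification group.

```python
from collections import defaultdict

def group_alerts(alerts, group_by):
    """Group alert label-sets by the values of the group_by labels."""
    groups = defaultdict(list)
    for alert in alerts:
        # Alerts missing a group_by label are grouped under the empty value.
        key = tuple(alert.get(label, "") for label in group_by)
        groups[key].append(alert)
    return dict(groups)

alerts = [
    {"alertname": "LatencyHigh", "cluster": "A", "instance": "i1"},
    {"alertname": "LatencyHigh", "cluster": "A", "instance": "i2"},
    {"alertname": "LatencyHigh", "cluster": "B", "instance": "i3"},
]
groups = group_alerts(alerts, ["alertname", "cluster"])
# Two groups: ("LatencyHigh", "A") holds two alerts, ("LatencyHigh", "B") one.
```

With this grouping, the two cluster-A alerts would produce a single notification rather than two.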
The Alertmanager Configuration File
global
global:
  # The mail server used to send notifications
  smtp_smarthost: 'localhost:25'
  # The sender address for outgoing mail
  smtp_from: 'alertmanager@example.org'
  # SMTP auth username (the exact value depends on your mail server)
  smtp_auth_username: 'alertmanager'
  # SMTP auth password; many providers require an app-specific
  # authorization code here rather than your regular login password
  smtp_auth_password: 'password'
  # Whether SMTP requires TLS
  smtp_require_tls: false
templates
templates:
  - '/etc/alertmanager/template/*.tmpl'
route
route:
  # Group alerts by label; e.g. multiple alerts with cluster=A and
  # alertname=LatencyHigh are batched into a single group.
  # To aggregate by all possible labels instead, use '...' as the sole
  # label name. That effectively disables aggregation entirely, passing
  # every alert through as-is, which is unlikely to be what you want
  # unless your alert volume is very low or your upstream notification
  # system performs its own grouping.
  group_by: ['alertname', 'cluster', 'service']
  # When an incoming alert creates a new group, wait at least group_wait
  # before sending the initial notification. This lets other alerts for
  # the same group that begin firing shortly after the first one be
  # batched into the first notification.
  group_wait: 30s
  # After the first notification has been sent, wait group_interval
  # before sending a notification about new alerts added to the group.
  group_interval: 5m
  # If an alert has already been sent successfully, wait repeat_interval
  # before re-sending it.
  repeat_interval: 3h
  # The default receiver
  receiver: team-X-mails
  # All attributes above are inherited by child routes and can be
  # overridden on each route.
  # Child routes
  routes:
    # This route matches alert labels against a regular expression to
    # capture alerts related to a list of services.
    - matchers:
        - service=~"foo1|foo2|baz"
      receiver: team-X-mails
      # The service has a child route for critical alerts. Any alert
      # that does not match (i.e. severity is not critical) falls back
      # to the parent node and is sent to team-X-mails.
      routes:
        - matchers:
            - severity="critical"
          receiver: team-X-pager
    - matchers:
        - service="files"
      receiver: team-Y-mails
      routes:
        - matchers:
            - severity="critical"
          receiver: team-Y-pager
    # This route handles all alerts coming from the database service.
    # If no owning team matches, the DB team handles it by default.
    - matchers:
        - service="database"
      receiver: team-DB-pager
      # Also group alerts by the affected database.
      group_by: [alertname, cluster, database]
      routes:
        - matchers:
            - owner="team-X"
          receiver: team-X-pager
          continue: true
        - matchers:
            - owner="team-Y"
          receiver: team-Y-pager
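The routing semantics above can be sketched as follows. This is a minimal illustration, not Alertmanager's actual code: children are tried in order, the first match wins, `continue: true` lets evaluation fall through to later siblings, and an alert matched by no child is handled by the route's own receiver. Matchers are modeled as `(label, op, value)` tuples where `op` is `=` or `=~`.

```python
import re

def matches(alert, matchers):
    """Check an alert's labels against a list of (label, op, value) matchers."""
    for label, op, value in matchers:
        actual = alert.get(label, "")
        if op == "=" and actual != value:
            return False
        # Regex matchers are fully anchored, as in Alertmanager.
        if op == "=~" and not re.fullmatch(value, actual):
            return False
    return True

def route_alert(alert, route, inherited=None):
    """Return the receivers an alert is delivered to."""
    # A route without its own receiver inherits the parent's.
    receiver = route.get("receiver", inherited)
    receivers = []
    for child in route.get("routes", []):
        if matches(alert, child["matchers"]):
            receivers.extend(route_alert(alert, child, receiver))
            if not child.get("continue"):
                break
    # If no child matched, this route's own receiver handles the alert.
    return receivers or [receiver]

# A trimmed version of the routing tree from the configuration above.
root = {
    "receiver": "team-X-mails",
    "routes": [
        {"matchers": [("service", "=~", "foo1|foo2|baz")],
         "receiver": "team-X-mails",
         "routes": [{"matchers": [("severity", "=", "critical")],
                     "receiver": "team-X-pager"}]},
        {"matchers": [("service", "=", "database")],
         "receiver": "team-DB-pager"},
    ],
}

# A critical foo1 alert reaches the pager; a warning falls back to mail.
assert route_alert({"service": "foo1", "severity": "critical"}, root) == ["team-X-pager"]
assert route_alert({"service": "foo1", "severity": "warning"}, root) == ["team-X-mails"]
```

An alert matching no child at all (e.g. `service="web"`) would land on the root's default receiver, `team-X-mails`.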
inhibit_rules
# Inhibition rules mute a set of alerts while another alert is firing.
# Here we mute any warning-level notification when the same alert is
# already firing at critical severity.
inhibit_rules:
  - source_matchers: [severity="critical"]
    target_matchers: [severity="warning"]
    # The rule applies when the labels listed in "equal" have identical
    # values in both the source and the target alert. Caution: a label
    # that is missing from BOTH alerts counts as equal (empty), so the
    # rule would still apply!
    equal: [alertname, cluster, service]
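The inhibition semantics can be sketched like this. Again this is an illustrative model, not the real implementation: a firing "source" alert mutes a "target" alert when the source matches `source_matchers`, the target matches `target_matchers`, and every label in `equal` carries the same value in both.

```python
def is_inhibited(target, firing, source_match, target_match, equal):
    """Return True if `target` is muted by any alert in `firing`."""
    if not all(target.get(k) == v for k, v in target_match.items()):
        return False
    for source in firing:
        if not all(source.get(k) == v for k, v in source_match.items()):
            continue
        # Labels missing from BOTH alerts compare as equal (empty),
        # which is why the `equal` list must be chosen carefully.
        if all(source.get(k, "") == target.get(k, "") for k in equal):
            return True
    return False

firing = [{"alertname": "LatencyHigh", "cluster": "A",
           "service": "api", "severity": "critical"}]
warning = {"alertname": "LatencyHigh", "cluster": "A",
           "service": "api", "severity": "warning"}

muted = is_inhibited(warning, firing,
                     {"severity": "critical"}, {"severity": "warning"},
                     ["alertname", "cluster", "service"])
# muted is True: the warning stays silent while the critical alert fires.
```

The same warning on `cluster: B` would not be inhibited, because the `cluster` values differ.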
receivers
receivers:
  - name: 'team-X-mails'
    email_configs:
      - to: 'team-X+alerts@example.org'
  - name: 'team-X-pager'
    email_configs:
      - to: 'team-X+alerts-critical@example.org'
    pagerduty_configs:
      - service_key: <team-X-key>
  - name: 'team-Y-mails'
    email_configs:
      - to: 'team-Y+alerts@example.org'
  - name: 'team-Y-pager'
    pagerduty_configs:
      - service_key: <team-Y-key>
  - name: 'team-DB-pager'
    pagerduty_configs:
      - service_key: <team-DB-key>
Deploying Alertmanager
Create the ConfigMap
---
apiVersion: v1
data:
  alertmanager.yml: |
    global:
      resolve_timeout: 5m
      smtp_smarthost: 'localhost:25'
      smtp_from: 'alertmanager@example.org'
      smtp_auth_username: 'alertmanager'
      smtp_auth_password: 'alertmanager'
      smtp_require_tls: false
    templates:
      - '/app/config/email.tmpl'
    receivers:
      - name: default-receiver
        email_configs:
          - to: "imcxsen@163.com"
            html: '{{ template "email.to.html" . }}'
            headers: { Subject: " {{ .CommonAnnotations.summary }}" }
            send_resolved: true
    route:
      group_interval: 15m
      group_wait: 30s
      receiver: default-receiver
      repeat_interval: 15m
      routes:
        - match:
            severity: warning
          receiver: default-receiver
          continue: true
        - match:
            severity: error
          receiver: default-receiver
          continue: true
  email.tmpl: |-
    {{ define "email.to.html" }}
    {{ range .Alerts }}
    ========= {{ .StartsAt.Format "2006-01-02T15:04:05" }} ==========<br>
    Alerting program: prometheus_alert <br>
    Alert type: {{ .Labels.alertname }} <br>
    Affected host: {{ .Labels.instance }} <br>
    Summary: {{ .Annotations.summary }} <br>
    Description: {{ .Annotations.description }} <br>
    {{ end }}
    {{ end }}
kind: ConfigMap
metadata:
  name: alertmanager-cm
  namespace: monitor
Create the Service
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: alertmanager
  name: alertmanager-svc
  namespace: monitor
spec:
  ports:
    - name: http
      protocol: TCP
      port: 9093
  selector:
    app: alertmanager
  type: ClusterIP
Create the StatefulSet
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  labels:
    app: alertmanager
  name: alertmanager
  namespace: monitor
spec:
  replicas: 1
  selector:
    matchLabels:
      app: alertmanager
  serviceName: alertmanager-svc
  template:
    metadata:
      labels:
        app: alertmanager
    spec:
      containers:
        - args:
            - "--config.file=/app/config/alertmanager.yml"
            - "--storage.path=/alertmanager/data"
          image: prom/alertmanager:v0.27.0
          livenessProbe:
            failureThreshold: 60
            initialDelaySeconds: 5
            periodSeconds: 10
            successThreshold: 1
            tcpSocket:
              port: service
            timeoutSeconds: 1
          name: alertmanager
          ports:
            - containerPort: 9093
              name: service
              protocol: TCP
            - containerPort: 8002
              name: cluster
              protocol: TCP
          resources:
            limits:
              cpu: 1000m
              memory: 1024Mi
            requests:
              cpu: 1000m
              memory: 1024Mi
          volumeMounts:
            - mountPath: /app/config
              name: config-volume
      volumes:
        - configMap:
            name: alertmanager-cm
          name: config-volume
Configuring Alerting in Prometheus
Add the Alertmanager target to the Prometheus configuration file
rule_files:
  - /etc/prometheus/rules/*.yml
alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager-svc.monitor.svc.cluster.local:9093"]
Add an alerting rule to Prometheus
groups:
  - name: test-rule
    rules:
      - alert: NodeMemoryUsage
        expr: (node_memory_MemTotal_bytes - (node_memory_MemFree_bytes + node_memory_Buffers_bytes + node_memory_Cached_bytes)) / node_memory_MemTotal_bytes * 100 > 15
        for: 2m
        # The severity label lets the Alertmanager routes above match this alert.
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.instance }}: High memory usage detected"
          description: "{{ $labels.instance }}: Memory usage is above 15% (current value: {{ $value }})"
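The arithmetic in the rule's expression is simply "used = total - (free + buffers + cached)", expressed as a percentage of total memory. A worked example with made-up byte counts:

```python
def memory_used_percent(total, free, buffers, cached):
    """Mirror the PromQL expression: used memory as a percentage of total."""
    return (total - (free + buffers + cached)) / total * 100

# Hypothetical node: 8 GiB total, 5 GiB free, 1 GiB buffers, 1 GiB cached.
pct = memory_used_percent(
    total=8 * 1024**3,
    free=5 * 1024**3,
    buffers=1 * 1024**3,
    cached=1 * 1024**3,
)
# pct == 12.5, below the 15% threshold, so NodeMemoryUsage would not fire.
```

With only 1 GiB actually in use, the alert stays quiet; pushing usage past 15% for more than the `for: 2m` window would fire it.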