1 Installing with Prometheus Operator

On machine xxxxx:
ssh xxxx@server01
Change to the following directory and connect to the VPN first:
cd /home/xxxx/openvpn
sudo openvpn xxxx.ovpn

cd /home/weichuang/ec-prod/prometheus-operator-deployment/
[weichuang@server01 prometheus-operator-deployment]$ ll
total 8
drwxrwxr-x 10 weichuang weichuang 4096 Jun 4 15:10 prometheus-operator-0.27.0 # deployment manifests
drwxrwxr-x 2 weichuang weichuang 4096 Jun 4 17:16 prometheus-storage # creates the Grafana storage

[weichuang@server01 prometheus-operator-0.27.0]$ ll
total 40
drwxr-xr-x 2 weichuang weichuang 4096 Jun 4 17:03 alertmanager
-rw-rw-r-- 1 weichuang weichuang 487 Jun 4 15:10 delete-prometheus-operation.sh
-rw-rw-r-- 1 weichuang weichuang 460 Jun 4 15:10 deploy-prometheus-operation.sh
drwxr-xr-x 2 weichuang weichuang 4096 Jun 4 17:03 grafana
drwxr-xr-x 2 weichuang weichuang 4096 Jun 4 17:03 kube-state-metrics
drwxr-xr-x 2 weichuang weichuang 4096 Jun 4 17:03 node-exporter
drwxr-xr-x 2 weichuang weichuang 4096 Sep 4 17:56 prometheus
drwxr-xr-x 2 weichuang weichuang 4096 Jun 4 17:03 prometheus-adapter
drwxr-xr-x 2 weichuang weichuang 4096 Jul 30 17:30 prometheus-operator
drwxr-xr-x 2 weichuang weichuang 4096 Jul 30 17:18 serviceMonitor

1.1 Create the PV and PVC for Grafana storage

cd /home/weichuang/ec-prod/prometheus-operator-deployment/prometheus-storage
kubectl apply -f storage-nas-grafana.yaml

1.2 Deploy prometheus-operator

The prometheus-operator-0.27.0 directory contains all of our resource manifests. Run the create commands directly from that directory:

kubectl apply -f prometheus-operator/
kubectl apply -f prometheus-adapter/
kubectl apply -f serviceMonitor/
kubectl apply -f prometheus/
kubectl apply -f kube-state-metrics/
kubectl apply -f node-exporter/
kubectl apply -f grafana/

sh alertmanager/alertmanager-main.sh
kubectl apply -f alertmanager/alertmanager-serviceAccount.yaml
kubectl apply -f alertmanager/alertmanager-service.yaml
kubectl apply -f alertmanager/alertmanager-alertmanager.yaml

When the deployment finishes, a namespace named monitoring exists and all resource objects are deployed into it. The Operator also creates four CRDs automatically:

[weichuang@server01 prometheus-operator-0.27.0]$ kubectl get crd |grep coreos
alertmanagers.monitoring.coreos.com 2019-05-16T07:54:33Z
prometheuses.monitoring.coreos.com 2019-05-16T07:54:33Z
prometheusrules.monitoring.coreos.com 2019-05-16T07:54:33Z
servicemonitors.monitoring.coreos.com 2019-05-16T07:54:33Z
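
The ServiceMonitor, Prometheus, and Alertmanager manifests are instances of these CRDs, so applying them before the Operator has registered the CRDs can fail. The following is a minimal sketch of a polling helper that could be run between applying prometheus-operator/ and the remaining directories. The names wait_for_crd and CRD_POLL_INTERVAL are hypothetical and not part of the original scripts.

```shell
#!/bin/sh
# Hypothetical helper: poll until a CRD exists, giving up after a number of attempts.
# $1 = CRD name, $2 = maximum attempts (default 30)
wait_for_crd() {
    name="$1"
    attempts="${2:-30}"
    i=0
    while [ "$i" -lt "$attempts" ]; do
        # "kubectl get crd NAME" exits non-zero while the CRD is missing.
        if kubectl get crd "$name" >/dev/null 2>&1; then
            return 0
        fi
        i=$((i + 1))
        sleep "${CRD_POLL_INTERVAL:-2}"
    done
    echo "timed out waiting for CRD $name" >&2
    return 1
}

# Example use (commented out so that sourcing this file has no side effects):
# kubectl apply -f prometheus-operator/
# for crd in alertmanagers prometheuses prometheusrules servicemonitors; do
#     wait_for_crd "${crd}.monitoring.coreos.com" || exit 1
# done
# kubectl apply -f serviceMonitor/
```

Polling is used here rather than a single status check because the CRD objects may not exist at all right after the Operator Deployment is applied.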

All Pods can be listed in the monitoring namespace. alertmanager and prometheus are managed by StatefulSet controllers, and the central prometheus-operator Pod controls the other resource objects and watches them for changes:

[weichuang@server01 prometheus-operator-0.27.0]$ kubectl get pods -n monitoring
NAME                                   READY   STATUS    RESTARTS   AGE
alertmanager-main-0                    2/2     Running   5          72d
alertmanager-main-1                    2/2     Running   0          126d
grafana-7669f89c98-7glxs               1/1     Running   0          84d
kube-state-metrics-67cc9bb496-9wwf6    4/4     Running   0          72d
node-exporter-2z9hh                    2/2     Running   0          127d
node-exporter-56tr2                    2/2     Running   0          127d
node-exporter-5th7p                    2/2     Running   0          127d
node-exporter-chnpn                    2/2     Running   0          127d
node-exporter-hpgj2                    2/2     Running   0          127d
node-exporter-zp9pr                    2/2     Running   0          127d
prometheus-adapter-87f698958-gtzbn     1/1     Running   0          72d
prometheus-k8s-0                       3/3     Running   1          126d
prometheus-k8s-1                       3/3     Running   2          84d
prometheus-operator-7d5fc9ccb6-9hsw8   1/1     Running   0          126d
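
With this many Pods it is easy to miss one that is not fully ready. As a rough sketch, the listing above can be filtered so that only problem Pods are printed. The function name not_ready_pods is hypothetical, and the filter assumes the default column layout of kubectl get pods (NAME, READY, STATUS, ...).

```shell
#!/bin/sh
# Hypothetical filter: read "kubectl get pods" output on stdin and print the
# names of Pods whose READY count is incomplete or whose STATUS is not Running.
not_ready_pods() {
    awk 'NR > 1 {
        split($2, ready, "/")              # READY column looks like "2/2"
        if (ready[1] != ready[2] || $3 != "Running") print $1
    }'
}

# Example use (commented out so that sourcing this file has no side effects):
# kubectl get pods -n monitoring | not_ready_pods
```

An empty result means every Pod in the listing reports all containers ready and a Running status.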

List the Services that were created:

[weichuang@server01 prometheus-operator-0.27.0]$ kubectl get svc -n monitoring
NAME                    TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)             AGE
alertmanager-main       ClusterIP   172.21.13.181   <none>        9093/TCP            127d
alertmanager-operated   ClusterIP   None            <none>        9093/TCP,6783/TCP   126d
grafana                 ClusterIP   172.21.13.5     <none>        3000/TCP            126d
kube-state-metrics      ClusterIP   None            <none>        8443/TCP,9443/TCP   127d
node-exporter           ClusterIP   None            <none>        9100/TCP            127d
prometheus-adapter      ClusterIP   172.21.14.182   <none>        443/TCP             126d
prometheus-k8s          ClusterIP   172.21.11.15    <none>        9090/TCP            126d
prometheus-operated     ClusterIP   None            <none>        9090/TCP            126d
prometheus-operator     ClusterIP   None            <none>        8080/TCP            126d

As shown above, grafana and prometheus each get a Service of type ClusterIP. To reach these two services from outside the cluster, create the corresponding Ingress objects:

cd /home/weichuang/ec-prod/ingress-nginx-dev
kubectl apply -f ec-prometheus-ingress.yaml  -f ec-grafana-ingress.yaml

Once the Ingress objects are applied, both services can be reached through them, for example to view the Prometheus targets page.
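
For a quick look at the targets page from the admin machine without going through the Ingress, a port-forward to the prometheus-k8s Service listed above is one option. This is only a sketch: the function name forward_prometheus and the default local port are assumptions, while the Service name and port 9090 come from the Service listing.

```shell
#!/bin/sh
# Hypothetical helper: forward a local port to the prometheus-k8s Service.
# $1 = local port (default 9090). Blocks until interrupted with Ctrl-C.
forward_prometheus() {
    local_port="${1:-9090}"
    kubectl port-forward -n monitoring svc/prometheus-k8s "${local_port}:9090"
}

# Example use (commented out so that sourcing this file has no side effects):
# forward_prometheus 9090
# then open http://localhost:9090/targets in a browser
```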

2 Key points to note

2.1 kubelet monitoring errors

The kubelet shows up as not monitored because it is not running with both of the flags --authentication-token-webhook=true and --authorization-mode=Webhook.

Edit /etc/systemd/system/kubelet.service.d/10-kubeadm.conf and change the original line
Environment="KUBELET_AUTHZ_ARGS=--authorization-mode=Webhook --client-ca-file=/etc/kubernetes/pki/ca.crt"
to
Environment="KUBELET_AUTHZ_ARGS=--authorization-mode=Webhook --authentication-token-webhook=true --client-ca-file=/etc/kubernetes/pki/ca.crt"
Then reload systemd and restart the kubelet:
systemctl daemon-reload && systemctl restart kubelet
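
Since this edit has to be repeated on every node, it may help to script it. The sketch below makes the same change in an idempotent way and keeps a backup copy. The function name enable_token_webhook is hypothetical, and it assumes the file contains the KUBELET_AUTHZ_ARGS line in the form shown above.

```shell
#!/bin/sh
# Hypothetical helper: add --authentication-token-webhook=true after
# --authorization-mode=Webhook in the given kubelet drop-in file.
# Does nothing if the flag is already present.
enable_token_webhook() {
    conf="$1"
    if grep -q -- '--authentication-token-webhook=true' "$conf"; then
        return 0    # already enabled, leave the file untouched
    fi
    cp "$conf" "$conf.bak" || return 1
    # "&" in the replacement stands for the matched text.
    sed 's|--authorization-mode=Webhook|& --authentication-token-webhook=true|' \
        "$conf.bak" > "$conf"
}

# Example use on a node (commented out so that sourcing this file has no side effects):
# enable_token_webhook /etc/systemd/system/kubelet.service.d/10-kubeadm.conf
# systemctl daemon-reload && systemctl restart kubelet
```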

2.2 Configuring PrometheusRules

The concrete alerting rules are defined in the following YAML file: /home/weichuang/ec-prod/prometheus-operator-deployment/prometheus-operator-0.27.0/prometheus/prometheus-rules.yaml

2.3 Configuring email and the alert template

2.3.1 Email configuration file: /home/weichuang/ec-prod/prometheus-operator-deployment/prometheus-operator-0.27.0/alertmanager/alertmanager.yaml

[weichuang@server01 alertmanager]$ cat alertmanager.yaml
global:
  smtp_smarthost: ''
  smtp_from: ''
  smtp_auth_username: ''
  smtp_auth_password: ''
  smtp_require_tls: true
route:
  group_by: ['instance']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 2h
  receiver: email
  routes:
  - match:
      severity: critical
    receiver: email
  - match_re:
      severity: ^(warning|critical)$
    receiver: support_team  # not defined under receivers in this excerpt; Alertmanager rejects undefined receivers

receivers:
- name: 'email'
  email_configs:
  - to: '[email protected];[email protected];[email protected];[email protected]'
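
Before loading this file into the cluster, it can be worth validating it locally, since problems such as a route pointing at an undefined receiver only surface when Alertmanager loads the configuration. A minimal sketch, assuming the amtool binary that ships with Alertmanager is installed on the admin machine; the wrapper name check_alertmanager_config is hypothetical.

```shell
#!/bin/sh
# Hypothetical wrapper: validate an Alertmanager config with amtool when it is
# available, and skip with a warning when it is not.
# $1 = config file (default alertmanager.yaml)
check_alertmanager_config() {
    conf="${1:-alertmanager.yaml}"
    if ! command -v amtool >/dev/null 2>&1; then
        echo "amtool not found; skipping validation of $conf" >&2
        return 0
    fi
    amtool check-config "$conf"
}

# Example use (commented out so that sourcing this file has no side effects):
# check_alertmanager_config alertmanager.yaml
```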

2.3.2 Alert email template: /home/weichuang/ec-prod/prometheus-operator-deployment/prometheus-operator-0.27.0/alertmanager/mail-template.tmpl

[weichuang@server01 alertmanager]$ cat mail-template.tmpl

{{ define "mail.default.message" }}
{{ range .Alerts }}
========start==========
Alert source: prometheus_alert
Severity: {{ .Labels.severity }}
Alert name: {{ .Labels.alertname }}
Affected host: {{ .Labels.instance }}
Summary: {{ .Annotations.summary }}
Details: {{ .Annotations.description }}
Triggered at: {{ .StartsAt.Format "2006-01-02 15:04:05" }}
========end==========
{{ end }}
{{ end }}

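2.3.3 Run the commands

Recreate the alertmanager-main Secret from both files (kd appears to be a local shell alias for kubectl; use kubectl directly if the alias is not defined):

kd delete secret alertmanager-main -n monitoring
kd create secret generic alertmanager-main --from-file=alertmanager.yaml --from-file=mail-template.tmpl -n monitoring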
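
Deleting and recreating the Secret leaves a short window in which it does not exist. One alternative is to render the Secret client-side and apply it in place. The sketch below assumes a kubectl version that supports --dry-run=client (older releases use a bare --dry-run instead); update_secret and DRY_RUN_FLAG are hypothetical names, not part of the original scripts.

```shell
#!/bin/sh
# Hypothetical helper: create or update a generic Secret from files without
# deleting it first.
# $1 = secret name, $2 = namespace, remaining arguments = files to include
update_secret() {
    name="$1"
    ns="$2"
    shift 2
    # Rewrite each remaining argument FILE as --from-file=FILE.
    for f in "$@"; do
        shift
        set -- "$@" "--from-file=$f"
    done
    # Render the Secret locally, then apply it to the cluster.
    kubectl create secret generic "$name" -n "$ns" "$@" \
        "${DRY_RUN_FLAG:---dry-run=client}" -o yaml |
        kubectl apply -f -
}

# Example use (commented out so that sourcing this file has no side effects):
# update_secret alertmanager-main monitoring alertmanager.yaml mail-template.tmpl
```

On an older kubectl, setting DRY_RUN_FLAG=--dry-run before calling the function should give the same behavior.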

2.4 Adding scrape targets

cd /home/weichuang/ec-prod/prometheus-operator-deployment/prometheus-operator-0.27.0/prometheus

2.4.1 Add additionalScrapeConfigs to prometheus-prometheus.yaml

[weichuang@server01 prometheus]$ cat prometheus-prometheus.yaml
# Add these three lines at the end of the file (under spec)
  additionalScrapeConfigs:
    name: additional-configs
    key: prometheus-additional.yaml

2.4.2 Add the prometheus-additional.yaml configuration

[weichuang@server01 prometheus]$ cat prometheus-additional.yaml
- job_name: 'consul-prometheus'
  consul_sd_configs:
  - server: 'consul.byton-prod:8500'
    services: []
  relabel_configs:
  - source_labels: [__meta_consul_service]
    target_label: job
    action: replace
  - source_labels: ['__metrics_path__']
    regex: '/metrics'
    target_label: __metrics_path__
    replacement: '/actuator/prometheus'

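2.4.3 Run the following commands

Recreate the additional-configs Secret from the file (kp appears to be a local shell alias for kubectl; use kubectl directly if the alias is not defined):

kp delete secret additional-configs -n monitoring
kp create secret generic additional-configs --from-file=prometheus-additional.yaml -n monitoring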
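
To confirm that the Secret now holds the intended scrape configuration, its content can be read back and decoded. A minimal sketch; the function name show_secret_key is hypothetical, and the dot in the key name has to be escaped inside the jsonpath expression.

```shell
#!/bin/sh
# Hypothetical helper: print the decoded value of one key of a Secret.
# $1 = secret name, $2 = key, $3 = namespace (default monitoring)
show_secret_key() {
    name="$1"
    key="$2"
    ns="${3:-monitoring}"
    # Escape dots so jsonpath treats the key as a single field name.
    escaped=$(printf '%s' "$key" | sed 's/\./\\./g')
    kubectl get secret "$name" -n "$ns" -o "jsonpath={.data.${escaped}}" |
        base64 --decode
}

# Example use (commented out so that sourcing this file has no side effects):
# show_secret_key additional-configs prometheus-additional.yaml
```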