1 Installing with Prometheus-Operator
On the xxxxx machine, log in with ssh xxxx@server01. Connect to the VPN first, then change into the deployment directory:
cd /home/xxxx/openvpn
sudo openvpn xxxx.ovpn
cd /home/weichuang/ec-prod/prometheus-operator-deployment/
[weichuang@server01 prometheus-operator-deployment]$ ll
total 8
drwxrwxr-x 10 weichuang weichuang 4096 Jun  4 15:10 prometheus-operator-0.27.0   # deployment manifests
drwxrwxr-x  2 weichuang weichuang 4096 Jun  4 17:16 prometheus-storage           # creates the Grafana storage
[weichuang@server01 prometheus-operator-0.27.0]$ ll
total 40
drwxr-xr-x 2 weichuang weichuang 4096 Jun  4 17:03 alertmanager
-rw-rw-r-- 1 weichuang weichuang  487 Jun  4 15:10 delete-prometheus-operation.sh
-rw-rw-r-- 1 weichuang weichuang  460 Jun  4 15:10 deploy-prometheus-operation.sh
drwxr-xr-x 2 weichuang weichuang 4096 Jun  4 17:03 grafana
drwxr-xr-x 2 weichuang weichuang 4096 Jun  4 17:03 kube-state-metrics
drwxr-xr-x 2 weichuang weichuang 4096 Jun  4 17:03 node-exporter
drwxr-xr-x 2 weichuang weichuang 4096 Sep  4 17:56 prometheus
drwxr-xr-x 2 weichuang weichuang 4096 Jun  4 17:03 prometheus-adapter
drwxr-xr-x 2 weichuang weichuang 4096 Jul 30 17:30 prometheus-operator
drwxr-xr-x 2 weichuang weichuang 4096 Jul 30 17:18 serviceMonitor
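1.1 Create the PV and PVC for Grafana storage
cd /home/weichuang/ec-prod/prometheus-operator-deployment/prometheus-storage
kubectl apply -f storage-nas-grafana.yaml
1.2 Deploy prometheus-operator
The prometheus-operator-0.27.0 directory contains all of our resource manifests. Run the following commands from inside that directory to create the resources: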
kubectl apply -f prometheus-operator/
kubectl apply -f prometheus-adapter/
kubectl apply -f serviceMonitor/
kubectl apply -f prometheus/
kubectl apply -f kube-state-metrics/
kubectl apply -f node-exporter/
kubectl apply -f grafana/
sh alertmanager/alertmanager-main.sh
kubectl apply -f alertmanager/alertmanager-serviceAccount.yaml
kubectl apply -f alertmanager/alertmanager-service.yaml
kubectl apply -f alertmanager/alertmanager-alertmanager.yaml
Once the deployment finishes, a namespace named monitoring exists and all of the resource objects are deployed into it. In addition, the Operator automatically creates four CRDs:
[weichuang@server01 prometheus-operator-0.27.0]$ kubectl get crd |grep coreos
alertmanagers.monitoring.coreos.com 2019-05-16T07:54:33Z
prometheuses.monitoring.coreos.com 2019-05-16T07:54:33Z
prometheusrules.monitoring.coreos.com 2019-05-16T07:54:33Z
servicemonitors.monitoring.coreos.com 2019-05-16T07:54:33Z
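Because the Operator creates these CRDs itself, manifests that depend on them (for example the ones under serviceMonitor/ and prometheus/) can fail if they are applied before the CRDs exist. The following is a minimal sketch of a polling helper, not part of the original deployment scripts. The function name wait_for_crds and the MAX_TRIES and SLEEP_SECS variables are hypothetical and only for illustration.

```shell
# Hypothetical helper: poll until the four CRDs listed above exist.
# MAX_TRIES and SLEEP_SECS are illustrative defaults, not values taken
# from the original setup.
wait_for_crds() (
    max_tries="${MAX_TRIES:-30}"
    sleep_secs="${SLEEP_SECS:-2}"
    for crd in alertmanagers prometheuses prometheusrules servicemonitors; do
        tries=0
        # "kubectl get crd NAME" exits non-zero while the CRD is missing.
        until kubectl get crd "${crd}.monitoring.coreos.com" >/dev/null 2>&1; do
            tries=$((tries + 1))
            if [ "$tries" -ge "$max_tries" ]; then
                echo "timed out waiting for CRD ${crd}.monitoring.coreos.com" >&2
                return 1
            fi
            sleep "$sleep_secs"
        done
        echo "found ${crd}.monitoring.coreos.com"
    done
)
```

One way to use it would be to call wait_for_crds right after kubectl apply -f prometheus-operator/ and continue with the remaining directories only if it returns 0.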
You can list all Pods in the monitoring namespace. alertmanager and prometheus are managed by StatefulSet controllers. There is also the core prometheus-operator Pod, which controls the other resource objects and watches them for changes:
[weichuang@server01 prometheus-operator-0.27.0]$ kubectl get pods -n monitoring
NAME                                   READY   STATUS    RESTARTS   AGE
alertmanager-main-0                    2/2     Running   5          72d
alertmanager-main-1                    2/2     Running   0          126d
grafana-7669f89c98-7glxs               1/1     Running   0          84d
kube-state-metrics-67cc9bb496-9wwf6    4/4     Running   0          72d
node-exporter-2z9hh                    2/2     Running   0          127d
node-exporter-56tr2                    2/2     Running   0          127d
node-exporter-5th7p                    2/2     Running   0          127d
node-exporter-chnpn                    2/2     Running   0          127d
node-exporter-hpgj2                    2/2     Running   0          127d
node-exporter-zp9pr                    2/2     Running   0          127d
prometheus-adapter-87f698958-gtzbn     1/1     Running   0          72d
prometheus-k8s-0                       3/3     Running   1          126d
prometheus-k8s-1                       3/3     Running   2          84d
prometheus-operator-7d5fc9ccb6-9hsw8   1/1     Running   0          126d
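With this many Pods it can be easier to print only the ones that are not healthy. The sketch below is one possible way to do that and is not part of the original procedure. The function name unready_pods is hypothetical, and it assumes the default column layout of kubectl get pods (NAME, READY, STATUS, ...).

```shell
# Hypothetical helper: print the names of Pods in the monitoring namespace
# that are not Running or whose READY count (e.g. 2/3) is incomplete.
unready_pods() {
    kubectl get pods -n monitoring --no-headers |
        awk '{
            split($2, ready, "/")      # READY column, e.g. "2/2"
            if ($3 != "Running" || ready[1] != ready[2]) print $1
        }'
}
```

An empty result means every Pod reports Running with all containers ready.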
List the Services that were created:
[weichuang@server01 prometheus-operator-0.27.0]$ kubectl get svc -n monitoring
NAME                    TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)             AGE
alertmanager-main       ClusterIP   172.21.13.181   <none>        9093/TCP            127d
alertmanager-operated   ClusterIP   None            <none>        9093/TCP,6783/TCP   126d
grafana                 ClusterIP   172.21.13.5     <none>        3000/TCP            126d
kube-state-metrics      ClusterIP   None            <none>        8443/TCP,9443/TCP   127d
node-exporter           ClusterIP   None            <none>        9100/TCP            127d
prometheus-adapter      ClusterIP   172.21.14.182   <none>        443/TCP             126d
prometheus-k8s          ClusterIP   172.21.11.15    <none>        9090/TCP            126d
prometheus-operated     ClusterIP   None            <none>        9090/TCP            126d
prometheus-operator     ClusterIP   None            <none>        8080/TCP            126d
As shown above, a Service of type ClusterIP is created for both grafana and prometheus. To reach these two services from outside the cluster, create the corresponding Ingress objects:
cd /home/weichuang/ec-prod/ingress-nginx-dev
kubectl apply -f ec-prometheus-ingress.yaml -f ec-grafana-ingress.yaml
Once the change is applied, the two services above can be reached through the Ingress, for example to view the Prometheus targets page.
2 Important notes
2.1 kubelet monitoring error
kubelet shows up as not being monitored because it is not running with both of the flags --authentication-token-webhook=true and --authorization-mode=Webhook.
Edit /etc/systemd/system/kubelet.service.d/10-kubeadm.conf and change the original line
Environment="KUBELET_AUTHZ_ARGS=--authorization-mode=Webhook --client-ca-file=/etc/kubernetes/pki/ca.crt"
to
Environment="KUBELET_AUTHZ_ARGS=--authorization-mode=Webhook --authentication-token-webhook=true --client-ca-file=/etc/kubernetes/pki/ca.crt"
Then reload systemd and restart kubelet:
systemctl daemon-reload && systemctl restart kubelet
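Since this change has to be made on every node, a quick check of the drop-in file can help before restarting kubelet. The following is a rough sketch under the assumption that both flags are set in the kubeadm drop-in file named above. The function name check_kubelet_flags is hypothetical. It only searches the file text, so it would also match a commented-out line.

```shell
# Hypothetical helper: report whether the kubelet drop-in file mentions
# both flags. Pass a different path as the first argument if needed.
check_kubelet_flags() (
    conf="${1:-/etc/systemd/system/kubelet.service.d/10-kubeadm.conf}"
    missing=0
    for flag in '--authorization-mode=Webhook' \
                '--authentication-token-webhook=true'; do
        # -F: fixed string, -e: pattern may start with a dash
        if grep -qF -e "$flag" "$conf"; then
            echo "ok: $flag"
        else
            echo "missing: $flag"
            missing=1
        fi
    done
    return "$missing"
)
```

The function returns 0 only when both flags are present, so it can gate the systemctl restart in a script.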
2.2 Configuring PrometheusRules
The concrete alerting rules are configured in the following YAML file: /home/weichuang/ec-prod/prometheus-operator-deployment/prometheus-operator-0.27.0/prometheus/prometheus-rules.yaml
2.3 Configuring email and the alert template
2.3.1 Email configuration file: /home/weichuang/ec-prod/prometheus-operator-deployment/prometheus-operator-0.27.0/alertmanager/alertmanager.yaml
[weichuang@server01 alertmanager]$ cat alertmanager.yaml
global:
  smtp_smarthost: ''
  smtp_from: ''
  smtp_auth_username: ''
  smtp_auth_password: ''
  smtp_require_tls: true
route:
  group_by: ['instance']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 2h
  receiver: email
  routes:
  - match:
      severity: critical
    receiver: email
  - match_re:
      severity: ^(warning|critical)$
    receiver: support_team
receivers:
- name: 'email'
  email_configs:
  - to: '[email protected];[email protected];[email protected];[email protected]'
Note: the second sub-route sends to a receiver named support_team, which this listing does not define under receivers. Alertmanager refuses to load a configuration that routes to an undefined receiver, so the deployed file needs a matching receivers entry.
2.3.2 Alert message template: /home/weichuang/ec-prod/prometheus-operator-deployment/prometheus-operator-0.27.0/alertmanager/mail-template.tmpl
[weichuang@server01 alertmanager]$ cat mail-template.tmpl
{{ define "mail.default.message" }}
{{ range .Alerts }}
========start==========
Alerting program: prometheus_alert
Severity: {{ .Labels.severity }}
Alert type: {{ .Labels.alertname }}
Affected host: {{ .Labels.instance }}
Summary: {{ .Annotations.summary }}
Details: {{ .Annotations.description }}
Triggered at: {{ .StartsAt.Format "2006-01-02 15:04:05" }}
========end==========
{{ end }}
{{ end }}
2.3.3 Run the commands
kd delete secret alertmanager-main -n monitoring
kd create secret generic alertmanager-main --from-file=alertmanager.yaml --from-file=mail-template.tmpl -n monitoring
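Deleting and recreating the Secret leaves a short window in which it does not exist. As an alternative, the Secret can be rendered locally and applied in place. The sketch below is only one possible approach, not the procedure used above. It assumes that kd resolves to kubectl for the target cluster and that the client is kubectl 1.18 or newer, where --dry-run=client is available (older clients use a bare --dry-run). The function name apply_alertmanager_secret is hypothetical.

```shell
# Hypothetical helper: update the alertmanager-main Secret in place.
# Run it from the alertmanager directory, where both files live.
apply_alertmanager_secret() {
    # Render the Secret as YAML without contacting the cluster ...
    kubectl create secret generic alertmanager-main \
        --from-file=alertmanager.yaml \
        --from-file=mail-template.tmpl \
        -n monitoring --dry-run=client -o yaml |
        # ... then create or update it from that YAML.
        kubectl apply -f -
}
```

kubectl may warn about a missing last-applied-configuration annotation the first time a Secret created with kubectl create is updated this way.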
2.4 Adding monitoring targets
cd /home/weichuang/ec-prod/prometheus-operator-deployment/prometheus-operator-0.27.0/prometheus
2.4.1 Add additionalScrapeConfigs to prometheus-prometheus.yaml
[weichuang@server01 prometheus]$ cat prometheus-prometheus.yaml
# add the last three lines (under spec)
  additionalScrapeConfigs:
    name: additional-configs
    key: prometheus-additional.yaml
2.4.2 Add the prometheus-additional.yaml configuration
[weichuang@server01 prometheus]$ cat prometheus-additional.yaml
- job_name: 'consul-prometheus'
  consul_sd_configs:
  - server: 'consul.byton-prod:8500'
    services: []
  relabel_configs:
  - source_labels: [__meta_consul_service]
    target_label: job
    action: replace
  - source_labels: ['__metrics_path__']
    regex: '/metrics'
    target_label: __metrics_path__
    replacement: '/actuator/prometheus'
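The two relabel rules can be read as follows: the first copies the Consul service name into the job label, and the second rewrites the scrape path to /actuator/prometheus, but only when the path is exactly /metrics, because Prometheus anchors relabel regexes at both ends. The snippet below is purely an illustration of that logic in shell and does not talk to Prometheus. The function name relabel_target and the service name order-service are made up for the example.

```shell
# Illustration only: mimic what the two relabel rules do to one target.
relabel_target() (
    consul_service="$1"   # value of __meta_consul_service
    metrics_path="$2"     # value of __metrics_path__ before relabeling

    # Rule 1: job takes the Consul service name.
    job="$consul_service"

    # Rule 2: the regex '/metrics' is anchored, so only an exact match
    # is replaced; any other path is left untouched.
    case "$metrics_path" in
        /metrics) metrics_path='/actuator/prometheus' ;;
    esac

    echo "job=${job} __metrics_path__=${metrics_path}"
)

relabel_target order-service /metrics
relabel_target order-service /custom/metrics
```

The first call prints the rewritten path /actuator/prometheus, while the second keeps /custom/metrics unchanged.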
2.4.3 Run the following commands
kp delete secret additional-configs -n monitoring
kp create secret generic additional-configs --from-file=prometheus-additional.yaml -n monitoring