Basic monitoring and alerting configuration with Prometheus, Grafana, and Alertmanager

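This configuration is mainly for monitoring host status. Prometheus can also monitor the status of many services; each one only needs the corresponding exporter.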

prometheus

Install prometheus on the monitoring node; monitored nodes only need prometheus-node-exporter.

$ sudo apt install prometheus

Add the monitored nodes to /etc/prometheus/prometheus.yml, under scrape_configs:

  - job_name: node
    # If prometheus-node-exporter is installed, grab stats about the local
    # machine by default.
    static_configs:
      # Monitored node (the default, local machine)
      - targets: ['localhost:9100']
        labels:
          hostname: 'vmin'
      # Additional monitored node
      - targets: ['10.100.0.31:9100']
        labels:
          hostname: 'vmsvr02'

The hostname label is added to make the targets easier to manage.
Prometheus is reachable on port 9090 of the monitoring host: http://ip_of_monitor:9090/
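Besides the web UI, target health can also be read from the Prometheus HTTP API. The following is a minimal sketch, assuming Prometheus listens on port 9090 as above. The function names are illustrative, and the sample payload is hand-written in the shape of a `/api/v1/targets` response, not captured output.

```python
import json
from urllib.request import urlopen


def fetch_targets(base_url):
    """GET /api/v1/targets from a running Prometheus server."""
    with urlopen(base_url.rstrip("/") + "/api/v1/targets", timeout=5) as resp:
        return json.load(resp)


def unhealthy_targets(payload):
    """Return "hostname (instance)" for every active target that is not up."""
    result = []
    for target in payload["data"]["activeTargets"]:
        if target["health"] != "up":
            labels = target["labels"]
            # "hostname" is the custom label set in static_configs.
            result.append("%s (%s)" % (labels.get("hostname", "?"), labels["instance"]))
    return result


# Hand-written sample, trimmed to the fields used above.
sample = {
    "status": "success",
    "data": {
        "activeTargets": [
            {"labels": {"instance": "localhost:9100", "job": "node",
                        "hostname": "vmin"}, "health": "up"},
            {"labels": {"instance": "10.100.0.31:9100", "job": "node",
                        "hostname": "vmsvr02"}, "health": "down"},
        ]
    },
}
print(unhealthy_targets(sample))  # ['vmsvr02 (10.100.0.31:9100)']
```

Against a live server, `unhealthy_targets(fetch_targets("http://ip_of_monitor:9090"))` would give the same kind of list.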

alertmanager
Install it on the monitoring host:

$ sudo apt install prometheus-alertmanager

Add a rule file for node monitoring, /etc/prometheus/node-alert.rules:

# hostStatsAlert
groups:
  - name: hostStatsAlert
    rules:
      - alert: InstanceDown
        expr: up == 0
        for: 1m
        labels:
          severity: page
        annotations:
          summary: "Instance {{ $labels.instance }} down"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 1 minute."
      - alert: hostCpuUsageAlert
        expr: sum(avg without (cpu)(irate(node_cpu_seconds_total{mode!='idle'}[5m]))) by (instance) > 0.85
        for: 1m
        labels:
          severity: page
        annotations:
          summary: "Instance {{ $labels.instance }} CPU usage high"
          description: "{{ $labels.instance }} CPU usage above 85% (current value: {{ $value }})"
      - alert: hostMemUsageAlert
        expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes > 0.85
        for: 1m
        labels:
          severity: page
        annotations:
          summary: "Instance {{ $labels.instance }} MEM usage high"
          description: "{{ $labels.instance }} MEM usage above 85% (current value: {{ $value }})"
      - alert: filesystemUsageAlert
        expr: 100 - ((node_filesystem_avail_bytes{mountpoint="/",fstype=~"ext4|xfs"} * 100) / node_filesystem_size_bytes{mountpoint="/",fstype=~"ext4|xfs"}) > 85
        for: 1m
        labels:
          severity: page
        annotations:
          summary: "Instance {{ $labels.instance }} root DISK usage high"
          description: "{{ $labels.instance }} root DISK usage above 85% (current value: {{ $value }})"

This rule file checks whether each host is up and monitors CPU, memory, and root filesystem usage.
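To make the thresholds concrete, the arithmetic behind the memory and filesystem expressions can be reproduced by hand. The sketch below is illustrative only: the function names and the host figures are made up, and `re.fullmatch` stands in for the fully anchored matching that Prometheus applies to `=~` label matchers.

```python
import re


def mem_usage_ratio(mem_total_bytes, mem_available_bytes):
    """(MemTotal - MemAvailable) / MemTotal, as in hostMemUsageAlert."""
    return (mem_total_bytes - mem_available_bytes) / mem_total_bytes


def root_fs_usage_percent(avail_bytes, size_bytes):
    """100 - (avail * 100 / size), as in filesystemUsageAlert."""
    return 100 - (avail_bytes * 100) / size_bytes


def fstype_selected(fstype):
    """Mimic fstype=~"ext4|xfs": the whole label value must match."""
    return re.fullmatch(r"ext4|xfs", fstype) is not None


GIB = 1024 ** 3
# Hypothetical host: 8 GiB RAM with 1 GiB available,
# 100 GiB root filesystem with 10 GiB free.
mem = mem_usage_ratio(8 * GIB, 1 * GIB)
disk = root_fs_usage_percent(10 * GIB, 100 * GIB)

print(mem, mem > 0.85)   # 0.875 True
print(disk, disk > 85)   # 90.0 True
print(fstype_selected("xfs"), fstype_selected("tmpfs"))  # True False
```

Both sample values exceed their thresholds, so the corresponding alerts would fire once the condition has held for the 1m `for` period.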

Reference the rule file from /etc/prometheus/prometheus.yml:

rule_files:
  - "node-alert.rules"
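Once Prometheus has reloaded its configuration, one way to confirm that the rule group was picked up is the `/api/v1/rules` endpoint of the HTTP API. This is a minimal sketch with illustrative function names; the sample payload is hand-written and trimmed to the fields the code reads.

```python
import json
from urllib.request import urlopen


def fetch_rules(base_url):
    """GET /api/v1/rules from a running Prometheus server."""
    with urlopen(base_url.rstrip("/") + "/api/v1/rules", timeout=5) as resp:
        return json.load(resp)


def alert_names_by_group(payload):
    """Map each rule group name to the names of its alerting rules."""
    groups = {}
    for group in payload["data"]["groups"]:
        groups[group["name"]] = [
            rule["name"] for rule in group["rules"]
            if rule.get("type") == "alerting"
        ]
    return groups


# Hand-written sample in the shape of an API response.
sample = {
    "status": "success",
    "data": {
        "groups": [
            {
                "name": "hostStatsAlert",
                "rules": [
                    {"name": "InstanceDown", "type": "alerting"},
                    {"name": "hostCpuUsageAlert", "type": "alerting"},
                    {"name": "hostMemUsageAlert", "type": "alerting"},
                    {"name": "filesystemUsageAlert", "type": "alerting"},
                ],
            }
        ]
    },
}
print(alert_names_by_group(sample))
```

If the hostStatsAlert group is missing from a live response, the rule file was probably not loaded, for example because of a path or syntax problem.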

Configure email notifications for Alertmanager in /etc/prometheus/alertmanager.yml:

global:
  # The smarthost and SMTP sender used for mail notifications.
  smtp_smarthost: 'smtp.163.com:25'
  smtp_from: 'abc@163.com'
  smtp_auth_username: 'abc@163.com'
  smtp_auth_password: 'password'
...
  # A default receiver (this key belongs to the route section)
  receiver: team-X-mails
...
receivers:
  - name: 'team-X-mails'
    email_configs:
      - to: '123@163.com'
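Before relying on Alertmanager to deliver mail, it can help to check the SMTP settings on their own. The sketch below uses Python's standard smtplib with the placeholder account values from the configuration above. It assumes the server accepts STARTTLS on the configured port, which is also what Alertmanager expects by default (`smtp_require_tls` defaults to true). The function names are illustrative.

```python
import smtplib
from email.message import EmailMessage


def build_test_message(sender, recipient):
    """Build a short plain-text test mail."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = "Alertmanager SMTP test"
    msg.set_content("If you can read this, the SMTP settings work.")
    return msg


def send_test_message(smarthost, username, password, msg):
    """Send msg via 'host:port' using STARTTLS and password login."""
    host, port = smarthost.rsplit(":", 1)
    with smtplib.SMTP(host, int(port), timeout=10) as smtp:
        smtp.starttls()
        smtp.login(username, password)
        smtp.send_message(msg)


# Placeholder addresses taken from the configuration above.
msg = build_test_message("abc@163.com", "123@163.com")
# To actually send it (requires real credentials and network access):
# send_test_message("smtp.163.com:25", "abc@163.com", "password", msg)
```

If the login or STARTTLS step fails here, Alertmanager is likely to fail in the same way with the same settings.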

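Reload the prometheus and alertmanager services, then stop the monitoring service on one of the monitored nodes; an alert email should arrive shortly afterwards.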
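The mail route can also be exercised without taking a node down, by posting a test alert directly to Alertmanager's v2 HTTP API. This is a sketch under assumptions: Alertmanager is taken to listen on its default port 9093 on the monitoring host, and the alert name `ManualTestAlert` and the function names are made up for illustration.

```python
import json
from urllib.request import Request, urlopen


def build_test_alert(hostname):
    """Return a one-element alert list for POST /api/v2/alerts."""
    return [{
        "labels": {
            "alertname": "ManualTestAlert",  # hypothetical name
            "severity": "page",
            "hostname": hostname,
        },
        "annotations": {"summary": "Manual test alert for " + hostname},
    }]


def post_alerts(alertmanager_url, alerts):
    """POST the alerts to Alertmanager and return the HTTP status code."""
    req = Request(
        alertmanager_url.rstrip("/") + "/api/v2/alerts",
        data=json.dumps(alerts).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urlopen(req, timeout=5) as resp:
        return resp.status


alerts = build_test_alert("vmsvr02")
print(json.dumps(alerts, indent=2))
# To send it to a running Alertmanager (assumed address):
# post_alerts("http://localhost:9093", alerts)
```

An alert posted this way goes through the same routing and receivers as alerts sent by Prometheus, so it should end up at the default receiver.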

grafana

Grafana can be used to display the metrics collected by Prometheus.
Install Grafana:

$ sudo apt-get install -y software-properties-common
$ sudo add-apt-repository "deb https://packages.grafana.com/oss/deb stable main"
$ wget -q -O - https://packages.grafana.com/gpg.key | sudo apt-key add -
$ sudo apt-get update
$ sudo apt-get install grafana

Grafana is served over HTTP on port 3000. After logging in, add Prometheus as a data source, then add a dashboard that displays the Prometheus data.
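The data source can be added in the web UI, or through Grafana's HTTP API (`POST /api/datasources`). The sketch below only builds the request. It assumes Grafana and Prometheus run on the same host, and that the admin account still has the initial `admin`/`admin` credentials of a fresh install. The function name and the data source name are illustrative.

```python
import base64
import json
from urllib.request import Request


def build_datasource_request(grafana_url, prometheus_url, user, password):
    """Build a POST /api/datasources request that registers Prometheus."""
    payload = {
        "name": "Prometheus",
        "type": "prometheus",
        "url": prometheus_url,
        "access": "proxy",  # Grafana's backend queries Prometheus
        "isDefault": True,
    }
    token = base64.b64encode((user + ":" + password).encode()).decode()
    return Request(
        grafana_url.rstrip("/") + "/api/datasources",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Basic " + token,
        },
        method="POST",
    )


req = build_datasource_request(
    "http://localhost:3000", "http://localhost:9090", "admin", "admin"
)
print(req.get_method(), req.full_url)  # POST http://localhost:3000/api/datasources
# To send it: urllib.request.urlopen(req)
```

With the data source in place, a dashboard can be created by hand or imported in the UI.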
