Building a Monitoring Stack with Prometheus and Grafana: A Hands-On Guide - 育师

This article walks through building a Prometheus + Grafana monitoring stack that covers servers, applications, and databases.

Introduction

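A production environment needs monitoring in order to:

Detect problems promptly
Trace back through historical data
Provide a basis for capacity planning
Send alert notifications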

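Prometheus + Grafana is one of the most widely used open-source monitoring stacks:

Prometheus: collects and stores metrics
Grafana: visualizes them
A rich ecosystem: exporters exist for most common systems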

This guide builds a complete monitoring stack step by step.

1. Architecture

1.1 Overall Architecture

┌─────────────────────────────────────────────────────┐
│                       Grafana                       │
│                   (visualization)                   │
└─────────────────────────────────────────────────────┘
                           ↑
┌─────────────────────────────────────────────────────┐
│                     Prometheus                      │
│           (scraping + storage + querying)           │
└─────────────────────────────────────────────────────┘
        ↑                  ↑                  ↑
┌──────────────┐   ┌──────────────┐   ┌──────────────┐
│ Node Exporter│   │MySQL Exporter│   │Redis Exporter│
│   (hosts)    │   │   (MySQL)    │   │   (Redis)    │
└──────────────┘   └──────────────┘   └──────────────┘

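1.2 Data Flow

1. Exporters collect metrics and expose them over HTTP (:9100 and similar ports)
2. Prometheus scrapes them on a schedule and stores the time series
3. Grafana queries Prometheus and renders charts
4. Prometheus evaluates the alert rules and pushes firing alerts to Alertmanager, which sends the notifications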
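To make step 1 of the data flow concrete, here is a minimal sketch of what an exporter does, written with the Python standard library only. The metric names (`demo_up`, `demo_queue_length`) and port 9200 are made up for illustration; real exporters such as Node Exporter expose far more metrics, and production Python code would normally use an official client library instead.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def render_metrics(samples):
    """Render {metric_name: value} in the Prometheus text exposition format."""
    lines = []
    for name, value in sorted(samples.items()):
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    """Serves hypothetical demo metrics on /metrics, as an exporter would."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics({"demo_up": 1, "demo_queue_length": 3}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=9200):
    """Block forever, answering scrapes on the given (hypothetical) port."""
    HTTPServer(("0.0.0.0", port), MetricsHandler).serve_forever()
```

Calling `serve()` and adding `localhost:9200` as a scrape target would let Prometheus pull these two values on every scrape interval.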

2. Deploying Prometheus

2.1 Docker Compose Deployment

    # docker-compose.yml
    version: '3.8'

    services:
      prometheus:
        image: prom/prometheus:latest
        container_name: prometheus
        ports:
          - "9090:9090"
        volumes:
          - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
          - ./prometheus/rules:/etc/prometheus/rules
          - prometheus_data:/prometheus
        command:
          - '--config.file=/etc/prometheus/prometheus.yml'
          - '--storage.tsdb.path=/prometheus'
          - '--storage.tsdb.retention.time=30d'
          # Enables the /-/reload endpoint used for hot reloads
          - '--web.enable-lifecycle'
        restart: unless-stopped

      grafana:
        image: grafana/grafana:latest
        container_name: grafana
        ports:
          - "3000:3000"
        environment:
          # Demo password; change it before exposing Grafana
          - GF_SECURITY_ADMIN_PASSWORD=admin123
          - GF_USERS_ALLOW_SIGN_UP=false
        volumes:
          - grafana_data:/var/lib/grafana
        restart: unless-stopped

      alertmanager:
        image: prom/alertmanager:latest
        container_name: alertmanager
        ports:
          - "9093:9093"
        volumes:
          - ./alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml
        restart: unless-stopped

    volumes:
      prometheus_data:
      grafana_data:

2.2 Prometheus Configuration

    # prometheus/prometheus.yml
    global:
      scrape_interval: 15s
      evaluation_interval: 15s

    alerting:
      alertmanagers:
        - static_configs:
            - targets:
                - alertmanager:9093

    rule_files:
      - /etc/prometheus/rules/*.yml

    scrape_configs:
      # Prometheus itself
      - job_name: 'prometheus'
        static_configs:
          - targets: ['localhost:9090']

      # Host monitoring
      - job_name: 'node'
        static_configs:
          - targets:
              - '192.168.1.101:9100'
              - '192.168.1.102:9100'
              - '192.168.1.103:9100'

      # MySQL monitoring
      - job_name: 'mysql'
        static_configs:
          - targets: ['192.168.1.101:9104']

      # Redis monitoring
      - job_name: 'redis'
        static_configs:
          - targets: ['192.168.1.101:9121']

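2.3 Starting the Services

    # Create the directories
    mkdir -p prometheus/rules alertmanager

    # prometheus/prometheus.yml and alertmanager/alertmanager.yml must exist
    # before the first start; otherwise Docker creates directories in their place.

    # Start the stack
    docker compose up -d

    # Access
    # Prometheus: http://localhost:9090
    # Grafana: http://localhost:3000 (admin/admin123)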

3. Node Exporter (Host Monitoring)

3.1 Installation

    # Option 1: Docker
    docker run -d --name node_exporter \
      --net="host" \
      --pid="host" \
      -v "/:/host:ro,rslave" \
      prom/node-exporter:latest \
      --path.rootfs=/host

    # Option 2: binary install
    wget https://github.com/prometheus/node_exporter/releases/download/v1.6.1/node_exporter-1.6.1.linux-amd64.tar.gz
    tar xvfz node_exporter-*.tar.gz
    cd node_exporter-*/
    ./node_exporter &

3.2 Verification

    curl http://localhost:9100/metrics

    # Sample output
    # node_cpu_seconds_total{cpu="0",mode="idle"} 12345.67
    # node_memory_MemTotal_bytes 8.3e+09
    # node_filesystem_size_bytes{device="/dev/sda1"} 1.0e+11
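Each line of that output is one sample: a metric name, an optional set of labels in braces, and a numeric value. The following is a small sketch of a parser for such lines. It is deliberately simplified: it ignores optional timestamps, and it keeps escaped characters inside label values as they are written.

```python
import re

# metric name, optional {labels}, then a single value token
_SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
_LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_sample(line):
    """Parse one exposition-format line into (name, labels, value).

    Returns None for blank lines and for comment lines (# HELP, # TYPE).
    """
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    match = _SAMPLE_RE.match(line)
    if match is None:
        raise ValueError(f"not a sample line: {line!r}")
    name, raw_labels, raw_value = match.groups()
    labels = dict(_LABEL_RE.findall(raw_labels or ""))
    return name, labels, float(raw_value)


if __name__ == "__main__":
    print(parse_sample('node_cpu_seconds_total{cpu="0",mode="idle"} 12345.67'))
    print(parse_sample("node_memory_MemTotal_bytes 8.3e+09"))
```

A quick check like this can be handy when writing a custom exporter, to confirm that its output has the shape Prometheus expects.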

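3.3 Common Metrics

Metric	Description
node_cpu_seconds_total	Cumulative CPU time in seconds, per core and mode (counter)
node_memory_MemTotal_bytes	Total memory
node_memory_MemAvailable_bytes	Available memory
node_filesystem_size_bytes	Filesystem size
node_filesystem_avail_bytes	Filesystem space available
node_network_receive_bytes_total	Bytes received over the network (counter)
node_network_transmit_bytes_total	Bytes sent over the network (counter)
node_load1 / node_load5 / node_load15	System load average over 1, 5, and 15 minutes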
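Several of these metrics are counters or raw byte values, so a usage percentage has to be derived from them. The sketch below shows the arithmetic behind the usual CPU and memory formulas. It assumes two readings of the idle counter taken some seconds apart and, unlike PromQL's `rate()`, it does not handle counter resets; the function names are illustrative only.

```python
def cpu_usage_percent(idle_start, idle_end, elapsed_seconds, cores=1):
    """CPU usage from two readings of node_cpu_seconds_total{mode="idle"}.

    idle_start and idle_end are the idle-seconds counters summed over all
    cores; elapsed_seconds is the wall-clock time between the two readings.
    """
    if elapsed_seconds <= 0 or cores <= 0:
        raise ValueError("elapsed_seconds and cores must be positive")
    idle_fraction = (idle_end - idle_start) / elapsed_seconds / cores
    return 100.0 * (1.0 - idle_fraction)


def memory_usage_percent(available_bytes, total_bytes):
    """Memory usage from node_memory_MemAvailable_bytes and MemTotal_bytes."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return 100.0 * (1.0 - available_bytes / total_bytes)


if __name__ == "__main__":
    # One core that was idle for 240 of the last 300 seconds: about 20% busy.
    print(cpu_usage_percent(1000.0, 1240.0, 300.0))
    # 2 GB available out of 8 GB: 75% used.
    print(memory_usage_percent(2e9, 8e9))
```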

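4. Application Monitoring

4.1 MySQL Exporter

    # Deploy. The image is pinned because mysqld_exporter 0.15 removed the
    # DATA_SOURCE_NAME variable; newer releases read credentials from a my.cnf
    # file or command-line flags instead. The hostname "mysql" assumes the
    # exporter shares a Docker network with the database.
    docker run -d --name mysql_exporter \
      -p 9104:9104 \
      -e DATA_SOURCE_NAME="exporter:password@(mysql:3306)/" \
      prom/mysqld-exporter:v0.14.0

    # Create the monitoring user (run these statements in MySQL)
    CREATE USER 'exporter'@'%' IDENTIFIED BY 'password';
    GRANT PROCESS, REPLICATION CLIENT, SELECT ON *.* TO 'exporter'@'%';
    FLUSH PRIVILEGES;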

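Common metrics:

mysql_up: whether MySQL is reachable
mysql_global_status_connections: cumulative connection attempts (for connections open right now, use mysql_global_status_threads_connected)
mysql_global_status_slow_queries: cumulative number of slow queries
mysql_global_status_questions: cumulative number of statements executed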

4.2 Redis Exporter

    docker run -d --name redis_exporter \
      -p 9121:9121 \
      -e REDIS_ADDR=redis://192.168.1.101:6379 \
      oliver006/redis_exporter

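Common metrics:

redis_up: whether Redis is reachable
redis_connected_clients: number of connected clients
redis_memory_used_bytes: memory in use
redis_commands_processed_total: cumulative number of commands processed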

4.3 Nginx Exporter

    # The Nginx status module must be enabled first.
    # Add to nginx.conf:
    # location /nginx_status {
    #     stub_status on;
    # }

    # Recent exporter releases require the double-dash flag form
    docker run -d --name nginx_exporter \
      -p 9113:9113 \
      nginx/nginx-prometheus-exporter \
      --nginx.scrape-uri=http://192.168.1.101/nginx_status

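4.4 Java Applications (Micrometer)

    <!-- pom.xml (spring-boot-starter-actuator must also be a dependency) -->
    <dependency>
        <groupId>io.micrometer</groupId>
        <artifactId>micrometer-registry-prometheus</artifactId>
    </dependency>

    # application.yml
    management:
      endpoints:
        web:
          exposure:
            include: prometheus,health
      metrics:
        export:
          prometheus:
            # Spring Boot 2.x key; Spring Boot 3 uses
            # management.prometheus.metrics.export.enabled
            enabled: true

The metrics are then available at http://localhost:8080/actuator/prometheus.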

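5. Configuring Grafana

5.1 Adding the Data Source

1. Configuration → Data Sources → Add data source (the menu path may differ between Grafana versions)
2. Select Prometheus
3. URL: http://prometheus:9090 (inside the Docker network) or http://192.168.1.100:9090 (from outside)
4. Save & Test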

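5.2 Importing Dashboards

Recommended dashboards (IDs from the Grafana website):

ID	Name	Purpose
1860	Node Exporter Full	Host monitoring
7362	MySQL Overview	MySQL monitoring
763	Redis Dashboard	Redis monitoring
12708	Nginx Exporter	Nginx monitoring

To import one:

1. Dashboards → Import
2. Enter the ID, for example 1860
3. Load → select the data source → Import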

5.3 Custom Panels

    # CPU usage (%), per host
    100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

    # Memory usage (%)
    (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100

    # Disk usage (%), per filesystem
    (1 - (node_filesystem_avail_bytes / node_filesystem_size_bytes)) * 100

    # Network throughput (bytes per second)
    rate(node_network_receive_bytes_total[5m])
    rate(node_network_transmit_bytes_total[5m])
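The same expressions can be run outside Grafana through the Prometheus HTTP API, which is useful for checking a query before putting it on a panel. The sketch below builds the URL for an instant query and flattens the JSON answer into a dictionary keyed by instance. The helper names and the sample payload are illustrative; only the `/api/v1/query` endpoint and the response layout come from Prometheus itself.

```python
from urllib.parse import urlencode


def instant_query_url(base_url, promql):
    """Build the URL for an instant query against /api/v1/query."""
    query = urlencode({"query": promql})
    return f"{base_url.rstrip('/')}/api/v1/query?{query}"


def vector_by_instance(payload):
    """Map each series' instance label to its value in an instant-vector result."""
    if payload.get("status") != "success":
        raise ValueError("query did not succeed")
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError("expected an instant vector")
    # Each value is a [timestamp, "value as string"] pair.
    return {
        series["metric"].get("instance", ""): float(series["value"][1])
        for series in data["result"]
    }


# Hand-written example of the response shape, not real measurements.
SAMPLE_RESPONSE = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"instance": "192.168.1.101:9100"}, "value": [1700000000, "42.5"]},
        ],
    },
}

if __name__ == "__main__":
    print(instant_query_url("http://localhost:9090", "up"))
    print(vector_by_instance(SAMPLE_RESPONSE))
```

Fetching the URL with `urllib.request.urlopen` and passing the decoded JSON to `vector_by_instance` would complete the round trip.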

6. Alerting

6.1 Alert Rules

    # prometheus/rules/alert.yml
    groups:
      - name: host-alerts
        rules:
          - alert: HostDown
            expr: up{job="node"} == 0
            for: 1m
            labels:
              severity: critical
            annotations:
              summary: "Host {{ $labels.instance }} is down"
              description: "The host has been unreachable for more than 1 minute"

          - alert: HighCpuUsage
            expr: 100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance) * 100) > 80
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "High CPU usage on host {{ $labels.instance }}"
              description: "CPU usage is above 80%, current value: {{ $value }}%"

          - alert: HighMemoryUsage
            expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 80
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "High memory usage on host {{ $labels.instance }}"
              description: "Memory usage is above 80%, current value: {{ $value }}%"

          - alert: LowDiskSpace
            expr: (1 - (node_filesystem_avail_bytes / node_filesystem_size_bytes)) * 100 > 85
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "Low disk space on host {{ $labels.instance }}"
              description: "Disk usage is above 85%, current value: {{ $value }}%"
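The `for:` field in each rule is what keeps short spikes from paging anyone: an alert stays pending until its expression has been true for the whole duration, and only then fires. The following simulation is a simplified model of that behaviour, assuming one boolean result per evaluation cycle; it is not Prometheus code, and the function name is made up for this example.

```python
def alert_states(condition_by_eval, interval_seconds, for_seconds):
    """Simulate the state of one alert over successive rule evaluations.

    condition_by_eval holds one boolean per evaluation: whether the rule
    expression returned a result at that point in time.
    """
    states = []
    active_since = None
    for index, condition in enumerate(condition_by_eval):
        now = index * interval_seconds
        if not condition:
            # Any evaluation where the condition is false resets the timer.
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        held_long_enough = now - active_since >= for_seconds
        states.append("firing" if held_long_enough else "pending")
    return states


if __name__ == "__main__":
    # evaluation_interval 15s with "for: 1m": fires on the fifth true evaluation.
    print(alert_states([True] * 6, 15, 60))
    # A single false evaluation sends the alert back to inactive.
    print(alert_states([True, True, False, True], 15, 60))
```

With the 15s evaluation interval from the configuration above, a host therefore has to be down for four further consecutive evaluations before HostDown reaches Alertmanager.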

6.2 Alertmanager Configuration

    # alertmanager/alertmanager.yml
    global:
      resolve_timeout: 5m

    route:
      group_by: ['alertname', 'instance']
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 4h
      receiver: 'webhook'

    receivers:
      - name: 'webhook'
        webhook_configs:
          - url: 'http://your-webhook-url/alert'
            send_resolved: true

      # Or use email
      # - name: 'email'
      #   email_configs:
      #     - to: 'admin@example.com'
      #       from: 'alert@example.com'
      #       smarthost: 'smtp.example.com:587'
      #       auth_username: 'alert@example.com'
      #       auth_password: 'password'
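The webhook receiver gets a JSON document per notification, with the grouped alerts in an `alerts` array. As a rough sketch of what the service behind `http://your-webhook-url/alert` might do with it, the function below turns a request body into one readable line per alert. The sample body is hand-written and shortened to the fields used here; the message format is an arbitrary choice.

```python
import json


def summarize_webhook(body):
    """Turn an Alertmanager webhook request body into one line per alert."""
    payload = json.loads(body)
    lines = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        lines.append(
            "[{status}] {name} on {instance}: {summary}".format(
                status=alert.get("status", "unknown").upper(),
                name=labels.get("alertname", "?"),
                instance=labels.get("instance", "?"),
                summary=annotations.get("summary", ""),
            )
        )
    return lines


# Shortened, hand-written example of a webhook body.
SAMPLE_BODY = json.dumps({
    "version": "4",
    "status": "firing",
    "receiver": "webhook",
    "alerts": [
        {
            "status": "firing",
            "labels": {"alertname": "HostDown", "instance": "192.168.1.101:9100"},
            "annotations": {"summary": "Host 192.168.1.101:9100 is down"},
        }
    ],
})

if __name__ == "__main__":
    for line in summarize_webhook(SAMPLE_BODY):
        print(line)
```

Because `send_resolved: true` is set, the same endpoint also receives alerts whose status is `resolved`, which this sketch would print as `[RESOLVED]`.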

6.3 Testing Alerts

    # Check alert status
    curl http://localhost:9090/api/v1/alerts

    # Check rule status
    curl http://localhost:9090/api/v1/rules
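The JSON returned by `/api/v1/alerts` lists every active alert together with its state, so a small script can pick out the ones that are actually firing. Here is a sketch of that filtering step; the sample payload is hand-written to mirror the response layout, and the helper name is an assumption of this example.

```python
def firing_alerts(payload):
    """Return (alertname, instance) pairs for alerts in the firing state."""
    if payload.get("status") != "success":
        raise ValueError("Prometheus API call did not succeed")
    alerts = payload.get("data", {}).get("alerts", [])
    return [
        (alert["labels"].get("alertname", "?"), alert["labels"].get("instance", "?"))
        for alert in alerts
        if alert.get("state") == "firing"
    ]


# Hand-written example of the response shape.
SAMPLE_ALERTS = {
    "status": "success",
    "data": {
        "alerts": [
            {
                "labels": {"alertname": "HostDown", "instance": "192.168.1.102:9100"},
                "state": "firing",
                "value": "0e+00",
            },
            {
                "labels": {"alertname": "HighCpuUsage", "instance": "192.168.1.101:9100"},
                "state": "pending",
                "value": "9.1e+01",
            },
        ]
    },
}

if __name__ == "__main__":
    print(firing_alerts(SAMPLE_ALERTS))
```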

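7. Multi-Site Monitoring

7.1 Scenario

Monitoring requirements:

- 10 servers in the headquarters data center
- 5 servers in the branch A data center
- 3 servers in the branch B data center
- 2 servers in the cloud

The challenge: the sites cannot reach each other over the network.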

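7.2 Traditional Approaches

Option 1: deploy a Prometheus instance at each site

Pros: each site runs independently
Cons: no unified view, and alerts are scattered

Option 2: expose the exporters on the public internet

Pros: centralized scraping
Cons: high security risk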

7.3 Overlay Network Approach (Recommended)

Use an overlay networking tool (such as 星空组网) to connect all the nodes:

Architecture after connecting the sites:

                ┌──────────────────────┐
                │  Central Prometheus  │
                │      10.10.0.1       │
                └──────────────────────┘
                           ↑
        ┌──────────────────┼──────────────────┐
        ↑                  ↑                  ↑
┌──────────────┐   ┌──────────────┐   ┌──────────────┐
│ Headquarters │   │   Branch A   │   │   Branch B   │
│  10.10.0.2   │   │  10.10.0.3   │   │  10.10.0.4   │
│ Node Exporter│   │ Node Exporter│   │ Node Exporter│
│    :9100     │   │    :9100     │   │    :9100     │
└──────────────┘   └──────────────┘   └──────────────┘

Prometheus configuration:

    scrape_configs:
      # Headquarters servers (overlay network IPs)
      - job_name: 'node-headquarters'
        static_configs:
          - targets:
              - '10.10.0.10:9100'
              - '10.10.0.11:9100'
              - '10.10.0.12:9100'
        relabel_configs:
          - source_labels: [__address__]
            target_label: location
            replacement: 'headquarters'

      # Branch A servers (overlay network IPs)
      - job_name: 'node-branch-a'
        static_configs:
          - targets:
              - '10.10.0.20:9100'
              - '10.10.0.21:9100'
        relabel_configs:
          - source_labels: [__address__]
            target_label: location
            replacement: 'branch-a'

      # Branch B servers (overlay network IPs)
      - job_name: 'node-branch-b'
        static_configs:
          - targets:
              - '10.10.0.30:9100'
        relabel_configs:
          - source_labels: [__address__]
            target_label: location
            replacement: 'branch-b'
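The relabel rules in this configuration rely on defaults: the action is `replace`, and the regex is `(.*)`, which matches any address, so the constant `replacement` is always written to the `location` label. The sketch below models that one action in Python to show why it works. It is an approximation for illustration only: Prometheus uses RE2 rather than Python's `re`, and the real implementation covers more actions and edge cases.

```python
import re


def apply_replace(labels, source_labels, target_label,
                  replacement="$1", regex="(.*)", separator=";"):
    """Approximate the relabel 'replace' action on one target's labels."""
    value = separator.join(labels.get(name, "") for name in source_labels)
    match = re.fullmatch(regex, value)
    if match is None:
        # No match: the labels are left untouched.
        return dict(labels)
    # Translate $1 / ${1} references into Python's \g<1> form.
    template = re.sub(r"\$\{?(\w+)\}?", r"\\g<\1>", replacement)
    result = dict(labels)
    result[target_label] = match.expand(template)
    return result


if __name__ == "__main__":
    target = {"__address__": "10.10.0.10:9100"}
    # Constant replacement, as in the configuration above.
    print(apply_replace(target, ["__address__"], "location", "headquarters"))
    # Capturing part of the address instead (a hypothetical variation).
    print(apply_replace(target, ["__address__"], "host", "$1", r"([^:]+):\d+"))
```

When the label is a fixed value per job, attaching it with a `labels:` block under `static_configs` is an equivalent and arguably simpler choice.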

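Advantages:

A single monitoring entry point
All data displayed in one place
Alerts managed centrally
No exposure to the public internet
Simple configuration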

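8. High-Availability Deployment

8.1 Prometheus Federation

    # Central Prometheus configuration
    scrape_configs:
      - job_name: 'federate'
        scrape_interval: 15s
        honor_labels: true
        metrics_path: '/federate'
        params:
          'match[]':
            # Pulls every series; narrow the selector if the volume is too high
            - '{job=~".+"}'
        static_configs:
          - targets:
              - '10.10.0.2:9090'  # Headquarters Prometheus
              - '10.10.0.3:9090'  # Branch Prometheus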
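Under this configuration the central server simply issues HTTP GET requests to each downstream `/federate` endpoint, passing every selector as a repeated `match[]` parameter. The helper below sketches how such a URL is put together, which can help when testing a selector by hand before adding it to the config. The function name is illustrative.

```python
from urllib.parse import urlencode


def federate_url(base_url, matchers):
    """Build a /federate URL with one match[] parameter per selector."""
    query = urlencode([("match[]", matcher) for matcher in matchers])
    return f"{base_url.rstrip('/')}/federate?{query}"


if __name__ == "__main__":
    # Only node metrics from the headquarters Prometheus.
    print(federate_url("http://10.10.0.2:9090", ['{job="node"}']))
```

Opening the printed URL with curl or a browser shows exactly the series the central Prometheus would ingest for that selector.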

8.2 Grafana High Availability

    # Store Grafana's data in an external MySQL database
    services:
      grafana:
        image: grafana/grafana:latest
        environment:
          - GF_DATABASE_TYPE=mysql
          - GF_DATABASE_HOST=mysql:3306
          - GF_DATABASE_NAME=grafana
          - GF_DATABASE_USER=grafana
          - GF_DATABASE_PASSWORD=password

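9. Common Problems

9.1 High Prometheus Memory Usage

    # Shorten the data retention period
    --storage.tsdb.retention.time=15d

    # Scrape less frequently
    global:
      scrape_interval: 30s

Memory use is driven mostly by the number of active series and the ingestion rate, so scraping less often and dropping unneeded metrics usually helps more than shortening retention, which mainly saves disk space.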
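To get a feel for what these two settings change, the ingestion rate and on-disk size can be estimated with simple arithmetic. The sketch below follows the rule of thumb from the Prometheus storage documentation (retention time × samples per second × bytes per sample, at roughly 1 to 2 bytes per sample). The series count of 30,000 is a made-up example, and real usage depends on churn and compression.

```python
def samples_per_second(active_series, scrape_interval_seconds):
    """Ingestion rate if every active series is scraped once per interval."""
    return active_series / scrape_interval_seconds


def estimated_disk_bytes(retention_days, active_series,
                         scrape_interval_seconds, bytes_per_sample=2.0):
    """Rough TSDB size: retention seconds x samples per second x bytes per sample."""
    retention_seconds = retention_days * 86400
    rate = samples_per_second(active_series, scrape_interval_seconds)
    return retention_seconds * rate * bytes_per_sample


if __name__ == "__main__":
    series = 30_000  # hypothetical number of active series
    print(samples_per_second(series, 15))   # rate at a 15s interval
    print(samples_per_second(series, 30))   # half of that at 30s
    before = estimated_disk_bytes(30, series, 15)
    after = estimated_disk_bytes(15, series, 30)
    print(f"{before / 1e9:.1f} GB -> {after / 1e9:.1f} GB")
```

Halving the scrape frequency halves the ingestion rate, while halving retention halves only the stored volume.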

9.2 Slow Queries

    # Precompute expensive expressions with recording rules
    groups:
      - name: recording
        rules:
          - record: job:node_cpu_usage:avg
            expr: avg(rate(node_cpu_seconds_total{mode!="idle"}[5m])) by (job)

9.3 Hot-Reloading the Configuration

    # Requires Prometheus to be started with --web.enable-lifecycle
    curl -X POST http://localhost:9090/-/reload
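The same reload can be triggered from a deployment script instead of curl. This is a minimal sketch using only the standard library; the function names are illustrative, and it assumes Prometheus is reachable at the given base URL with the lifecycle endpoint enabled.

```python
import urllib.request


def build_reload_request(base_url="http://localhost:9090"):
    """Build the POST request for the /-/reload lifecycle endpoint."""
    url = f"{base_url.rstrip('/')}/-/reload"
    return urllib.request.Request(url, data=b"", method="POST")


def reload_prometheus(base_url="http://localhost:9090", timeout=5.0):
    """Ask Prometheus to re-read its configuration and rule files.

    Returns True on HTTP 200. urlopen raises urllib.error.HTTPError when
    the server answers with an error status, for example if the new
    configuration is invalid.
    """
    request = build_reload_request(base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status == 200
```

A typical use would be to call `reload_prometheus()` right after writing a new rule file, and to treat an exception as a signal to roll the file back.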

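10. Summary

Key points for building the monitoring stack:

Core stack: Prometheus + Grafana + Alertmanager
Host monitoring: Node Exporter on every host
Application monitoring: pick exporters to match your technology stack
Dashboards: import existing ones first, then customize
Alert rules: set severities by priority
Multiple sites: connect them with an overlay network, then monitor centrally
High availability: federation plus external storage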

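My monitoring checklist:

Must-monitor items:

- CPU, memory, disk, and network
- Service liveness
- Database connections and slow queries
- Application response time and error rate

Monitoring is the eyes of operations; a system without monitoring is running blind.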


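💡 Tip: Start by monitoring the core metrics and expand gradually. Keep the number of alerts low; too many alerts lead to alert fatigue.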
