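Continuous Monitoring and Alert Management in Kubernetes: Building a Real-Time Monitoring System
1. Monitoring Overview
Monitoring is central to keeping a Kubernetes cluster stable. It covers three tasks: collecting metrics, visualizing them, and sending alert notifications.
1.1 Monitoring Architecture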
```
┌───────────────────────────────────────────────────────────────┐
│                      Monitoring Targets                       │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐       │
│  │   Node   │  │   Pod    │  │ Service  │  │ Cluster  │       │
│  └────┬─────┘  └────┬─────┘  └────┬─────┘  └────┬─────┘       │
└───────┼─────────────┼─────────────┼─────────────┼─────────────┘
        │             │             │             │
        ▼             ▼             ▼             ▼
┌───────────────────────────────────────────────────────────────┐
│                   Metrics Collection Layer                    │
│                   Node Exporter / cAdvisor                    │
│                      ┌──────────────────┐                     │
│                      │   Metrics API    │                     │
│                      └────────┬─────────┘                     │
└───────────────────────────────┼───────────────────────────────┘
                                │
                                ▼
┌───────────────────────────────────────────────────────────────┐
│                     Metrics Storage Layer                     │
│                          Prometheus                           │
│                      ┌──────────────────┐                     │
│                      │   Time Series    │                     │
│                      └────────┬─────────┘                     │
└───────────────────────────────┼───────────────────────────────┘
                                │
                ┌───────────────┼───────────────┐
                ▼               ▼               ▼
         ┌─────────────┐ ┌─────────────┐ ┌─────────────┐
         │Alertmanager │ │   Grafana   │ │    Rule     │
         │  Alerting   │ │Visualization│ │    Rules    │
         └─────────────┘ └─────────────┘ └─────────────┘
```
1.2 Monitoring Components
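| Component | Role |
|---|---|
| Prometheus | Metric storage and querying |
| Grafana | Dashboards and visualization |
| Alertmanager | Alert management |
| Node Exporter | Node-level metrics |
| cAdvisor | Container-level metrics |
2. Prometheus Configuration
2.1 Deploying Prometheus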
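```yaml
# Prometheus custom resource; requires the Prometheus Operator CRDs.
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus
  namespace: monitoring
spec:
  replicas: 2
  resources:
    requests:
      memory: 4Gi
  serviceAccountName: prometheus
  # Only ServiceMonitors carrying this label are scraped.
  serviceMonitorSelector:
    matchLabels:
      app: prometheus
  # An empty selector picks up every PrometheusRule in this namespace;
  # when the field is omitted, no rules are loaded.
  ruleSelector: {}
  alerting:
    alertmanagers:
      - namespace: monitoring
        name: alertmanager   # Service that fronts Alertmanager
        port: web
```
2.2 ServiceMonitor Configuration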
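A ServiceMonitor picks Services by label, and the Prometheus resource picks ServiceMonitors the same way. As a rough mental model of how `matchLabels` behaves, the sketch below checks a selector against a label set. The function name and the sample labels are hypothetical; this is not the operator's actual code.

```python
def match_labels(selector: dict[str, str], labels: dict[str, str]) -> bool:
    """Return True if every key/value pair in the selector appears in labels."""
    return all(labels.get(key) == value for key, value in selector.items())


# Hypothetical labels on a Service, for illustration only.
service_labels = {"app": "node-exporter", "tier": "infra"}

print(match_labels({"app": "node-exporter"}, service_labels))  # True
print(match_labels({"app": "prometheus"}, service_labels))     # False
print(match_labels({}, service_labels))                        # True
```

Extra labels on the object do not prevent a match, and an empty selector matches everything, which is why a missing or mistyped label is a common reason for a target not showing up.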
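```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: node-exporter
  namespace: monitoring
  labels:
    app: prometheus   # matches serviceMonitorSelector on the Prometheus resource
spec:
  selector:
    matchLabels:
      app: node-exporter   # label on the node-exporter Service
  endpoints:
    - port: metrics   # named port on that Service
      interval: 30s
```
2.3 Prometheus Rule Configuration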
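The CPU recording rule in this section computes usage as one minus the average idle rate of the CPU counters. The arithmetic can be sketched in plain Python as follows. The function names and sample values are made up for illustration, and the sketch ignores the extrapolation that Prometheus applies inside `rate()`.

```python
def idle_rate(first: float, last: float, window_seconds: float) -> float:
    """Per-second increase of an idle-seconds counter over the window."""
    return (last - first) / window_seconds


def cpu_usage(idle_samples: dict[str, tuple[float, float]], window_seconds: float) -> float:
    """One minus the average idle rate across cores, as a 0-1 ratio."""
    rates = [idle_rate(first, last, window_seconds) for first, last in idle_samples.values()]
    return 1 - sum(rates) / len(rates)


# Hypothetical idle-second counters for two cores, sampled 300 s apart.
samples = {"cpu0": (1000.0, 1240.0), "cpu1": (2000.0, 2180.0)}
print(round(cpu_usage(samples, 300), 2))  # 0.3
```

Here cpu0 was idle 80% of the window and cpu1 60%, so the average idle share is 0.7 and the usage ratio is 0.3.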
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: cluster-alerts
  namespace: monitoring
spec:
  groups:
    - name: node.rules
      rules:
        # Fraction of CPU time spent non-idle, per node (0 to 1).
        - record: node_cpu_usage
          expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))
        # Fraction of memory in use, per node (0 to 1).
        - record: node_memory_usage
          expr: 1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)
```
3. Alerting Configuration
3.1 Alertmanager Configuration
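The configuration in this section sends every alert to a webhook at `http://alert-webhook:8080/webhook`. A minimal sketch of what the receiving side could look like, using only the Python standard library, is shown here. The names `summarize`, `WebhookHandler`, and `serve` are hypothetical; the `alerts`, `status`, `labels`, and `annotations` fields come from Alertmanager's webhook JSON payload.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def summarize(payload: dict) -> list[str]:
    """Turn an Alertmanager webhook payload into one line per alert."""
    lines = []
    for alert in payload.get("alerts", []):
        name = alert.get("labels", {}).get("alertname", "?")
        summary = alert.get("annotations", {}).get("summary", "")
        lines.append(f"[{alert.get('status', 'unknown')}] {name}: {summary}")
    return lines


class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Only the path configured in the receiver URL is accepted.
        if self.path != "/webhook":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        for line in summarize(payload):
            print(line)
        self.send_response(200)
        self.end_headers()


def serve(port: int = 8080) -> None:
    """Start the receiver; call serve() to run it (blocks until interrupted)."""
    HTTPServer(("", port), WebhookHandler).serve_forever()
```

A real receiver would forward the summary to a chat tool or ticketing system instead of printing it, and would need authentication if exposed outside the cluster.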
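```yaml
apiVersion: monitoring.coreos.com/v1
kind: Alertmanager
metadata:
  name: alertmanager
  namespace: monitoring
spec:
  replicas: 2
  serviceAccountName: alertmanager
---
# The Alertmanager resource has no inline config field. By default the
# operator reads the configuration from a Secret named alertmanager-<name>,
# under the key alertmanager.yaml.
apiVersion: v1
kind: Secret
metadata:
  name: alertmanager-alertmanager
  namespace: monitoring
stringData:
  alertmanager.yaml: |
    global:
      resolve_timeout: 5m
    route:
      group_by: ['alertname']
      group_wait: 10s        # delay before the first notification for a new group
      group_interval: 10s    # delay before notifying about new alerts in an existing group
      repeat_interval: 1h    # delay before re-sending a notification that is still firing
      receiver: 'webhook'
    receivers:
      - name: 'webhook'
        webhook_configs:
          - url: 'http://alert-webhook:8080/webhook'
```
3.2 Alert Rules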
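The rules in this section use a `for` clause, which means a condition has to hold continuously for that long before the alert fires. Until then the alert is pending, and a single failed evaluation resets the timer. The state logic can be sketched roughly as below. The function name and the 60-second evaluation interval are assumptions for illustration.

```python
def alert_state(condition_history: list[bool], interval_s: int, for_s: int) -> str:
    """Return 'inactive', 'pending', or 'firing' after the last evaluation.

    condition_history holds one boolean per rule evaluation, oldest first.
    """
    active_for = None  # seconds since the condition first became true
    for holds in condition_history:
        if not holds:
            active_for = None          # any false evaluation resets the alert
        elif active_for is None:
            active_for = 0             # first true evaluation: alert becomes pending
        else:
            active_for += interval_s
    if active_for is None:
        return "inactive"
    return "firing" if active_for >= for_s else "pending"


# Evaluations every 60 s against a rule with `for: 5m` (300 s).
print(alert_state([True] * 5, 60, 300))            # pending
print(alert_state([True] * 6, 60, 300))            # firing
print(alert_state([True] * 5 + [False], 60, 300))  # inactive
```

A longer `for` value filters out short spikes at the cost of slower notification, which is the trade-off behind using 5m for a node outage and 10m for high CPU.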
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: alert-rules
  namespace: monitoring
spec:
  groups:
    - name: critical-alerts
      rules:
        - alert: NodeDown
          expr: up{job="node-exporter"} == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Node {{ $labels.instance }} is down"
        - alert: HighCPU
          # Fires when a node's average idle share stays below 10%.
          # Grouping by instance keeps the label used in the summary.
          expr: avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) < 0.1
          for: 10m
          labels:
            severity: critical
          annotations:
            summary: "High CPU usage on {{ $labels.instance }}"
```
4. Grafana Configuration
4.1 Deploying Grafana
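```yaml
apiVersion: grafana.integreatly.org/v1beta1
kind: Grafana
metadata:
  name: grafana
  namespace: monitoring
  labels:
    dashboards: grafana   # illustrative label that the datasource below selects on
spec:
  config:
    log:
      mode: "console"
---
# In the v1beta1 API, datasources are declared as separate
# GrafanaDatasource resources rather than inside the Grafana spec.
apiVersion: grafana.integreatly.org/v1beta1
kind: GrafanaDatasource
metadata:
  name: prometheus
  namespace: monitoring
spec:
  instanceSelector:
    matchLabels:
      dashboards: grafana
  datasource:
    name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090   # assumes a Service named prometheus in this namespace
```
4.2 Custom Dashboards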
```json
{
  "title": "Cluster Overview",
  "panels": [
    {
      "type": "graph",
      "title": "CPU Usage",
      "targets": [
        {
          "expr": "100 * (1 - avg(rate(node_cpu_seconds_total{mode=\"idle\"}[5m])))",
          "legendFormat": "Total CPU"
        }
      ],
      "yaxes": [{ "format": "percent" }]
    },
    {
      "type": "graph",
      "title": "Memory Usage",
      "targets": [
        {
          "expr": "sum(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)",
          "legendFormat": "Used Memory"
        }
      ],
      "yaxes": [{ "format": "bytes" }]
    },
    {
      "type": "stat",
      "title": "Active Pods",
      "targets": [
        { "expr": "sum(kube_pod_status_phase{phase=\"Running\"})" }
      ]
    }
  ]
}
```
5. Monitoring Best Practices
5.1 Custom Metrics
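Cumulative totals such as request and error counts are usually modelled as counters, because query functions like `rate()` can compensate when a counter drops back to zero after a process restart. A simplified sketch of that reset handling follows. It is not Prometheus's exact algorithm, which also extrapolates to the window edges, and the sample values are invented.

```python
def increase(samples: list[float]) -> float:
    """Total increase across samples, treating any drop as a counter reset."""
    total = 0.0
    for previous, current in zip(samples, samples[1:]):
        # After a reset the counter restarts from zero, so the new value is all increase.
        total += (current - previous) if current >= previous else current
    return total


def rate(samples: list[float], window_seconds: float) -> float:
    """Average per-second increase over the window."""
    return increase(samples) / window_seconds


# Five scrapes 15 s apart; the process restarted before the fourth scrape.
requests_seen = [100.0, 130.0, 160.0, 10.0, 40.0]
print(increase(requests_seen))            # 100.0
print(round(rate(requests_seen, 60), 2))  # 1.67
```

The same correction cannot be applied to a gauge, since a gauge is allowed to go down, so a total stored in a gauge produces misleading rates after a restart.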
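```python
import time

from flask import Flask  # assumed framework, implied by @app.route
from prometheus_client import Counter, Histogram, start_http_server

app = Flask(__name__)

# Counters only go up, which suits cumulative totals; they are exposed
# as app_requests_total and app_errors_total.
REQUESTS = Counter('app_requests_total', 'Total requests')
ERRORS = Counter('app_errors_total', 'Total errors')
# A histogram records the latency distribution rather than only the last value.
LATENCY = Histogram('app_request_latency_seconds', 'Request latency')


@app.route('/')
def index():
    REQUESTS.inc()
    start_time = time.time()
    try:
        # Handle the request
        return 'OK'
    except Exception:
        ERRORS.inc()
        raise
    finally:
        LATENCY.observe(time.time() - start_time)


if __name__ == '__main__':
    start_http_server(8000)  # serve /metrics on port 8000
    app.run()
```
5.2 Monitoring Service Configuration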
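```yaml
apiVersion: v1
kind: Service
metadata:
  name: app-metrics
  # These annotations take effect only when Prometheus has a scrape job that
  # relabels on them; with the Prometheus Operator, add a ServiceMonitor
  # that selects this Service instead.
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8000"
spec:
  selector:
    app: my-app
  ports:
    - port: 8000
      name: metrics
```
5.3 Alert Notification Configuration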
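The route in this section groups alerts by `alertname`, so alerts that share that label value are bundled into one notification instead of one message each. The grouping step can be sketched as follows. The function name and the sample alerts are hypothetical, and the sketch leaves out Alertmanager's timing logic.

```python
from collections import defaultdict


def group_alerts(alerts: list[dict], group_by: list[str]) -> dict[tuple, list[dict]]:
    """Bucket alerts by the values of the group_by labels."""
    groups = defaultdict(list)
    for alert in alerts:
        key = tuple(alert["labels"].get(name, "") for name in group_by)
        groups[key].append(alert)
    return dict(groups)


# Hypothetical firing alerts from three nodes.
alerts = [
    {"labels": {"alertname": "HighCPU", "instance": "node-a"}},
    {"labels": {"alertname": "HighCPU", "instance": "node-b"}},
    {"labels": {"alertname": "NodeDown", "instance": "node-c"}},
]

for key, members in sorted(group_alerts(alerts, ["alertname"]).items()):
    print(key, len(members))
# ('HighCPU',) 2
# ('NodeDown',) 1
```

Adding more labels to the grouping key, for example `instance`, yields smaller groups and more notifications; fewer labels yield larger, less frequent ones.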
```yaml
# AlertmanagerConfig is served under v1alpha1, not v1.
apiVersion: monitoring.coreos.com/v1alpha1
kind: AlertmanagerConfig
metadata:
  name: alertmanager-config
  namespace: monitoring
spec:
  route:
    groupBy: ['alertname']
    receiver: 'email'
  receivers:
    - name: 'email'
      emailConfigs:
        - to: 'admin@example.com'
          from: 'alerts@example.com'
          smarthost: 'smtp.example.com:587'
          authUsername: 'alerts'
          authPassword:
            name: smtp-password   # Secret in the same namespace
            key: password
```
6. Summary
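In practice, monitoring and alerting comes down to:
- Metric collection: gather metrics with Node Exporter and cAdvisor
- Metric storage: store time series data in Prometheus
- Visualization: build dashboards in Grafana
- Alert rules: define alert conditions and notification channels
- Custom metrics: expose application-level metrics
Build these pieces into a complete monitoring system to get real-time monitoring and intelligent alerting.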