WeKnora Monitoring: Performance Monitoring with Prometheus + Grafana
If you run your own intelligent knowledge base on WeKnora, you may have hit this situation: the system has been running fine, then suddenly a feature slows down, or users report that Q&A responses are taking longer, and you have no idea where the problem is. Is the document parsing service stuck? Has vector retrieval slowed down? Or is LLM inference the bottleneck?
This is exactly why we need a monitoring system. In this post I'll show how to build a complete performance-monitoring stack for WeKnora, so you can see the system's state in real time and locate problems quickly, instead of finding out only after users complain.
1. Why does WeKnora need a monitoring system?
Let me start with my own experience. I once deployed WeKnora internally at my company for Q&A over technical documentation. It ran smoothly at first, but after a while a colleague reported that Q&A response times had gone from 2-3 seconds to more than 10. I logged into the server: CPU and memory looked normal, but I couldn't tell what was slow.
It took me a full day of manually combing through each service's logs to discover that the document parsing service was stuck on a particularly complex PDF file. With a monitoring system in place, I could have found the problem in minutes.
WeKnora is a microservice-architecture system made up of several components:
- Go backend service (business logic)
- Python document parsing service
- PostgreSQL database (with the pgvector extension)
- Redis cache and message queue
- Vue frontend
- Possibly LLM services such as Ollama
With this many services working together, a problem in any one of them can degrade overall performance. Running without monitoring is like driving in the dark: you can't see the obstacles ahead.
2. Choosing a monitoring stack: why Prometheus + Grafana?
There are plenty of monitoring tools on the market. I chose the Prometheus + Grafana combination mainly for these reasons:
Prometheus strengths:
- Designed for cloud-native, microservice systems, a natural fit for an architecture like WeKnora's
- The pull model works well with service discovery
- PromQL, a powerful query language for flexible metric analysis (see the sample query after this list)
- An active community and rich ecosystem
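To give a taste of that flexibility: once WeKnora exposes the `http_requests_total` counter we define later in this guide, a single PromQL expression answers a question like "what fraction of requests failed over the last 5 minutes?":

```promql
# Ratio of 5xx responses to all requests over the last 5 minutes
sum(rate(http_requests_total{status=~"5.."}[5m]))
  /
sum(rate(http_requests_total[5m]))
```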
Grafana strengths:
- First-class visualization with a wide range of chart types
- Supports many data sources, not just Prometheus
- Flexible alert configuration
- Dashboards can be shared and reused
Best of all, the whole stack is open source and free, relatively simple to deploy, and integrates well with Docker environments.
3. Environment preparation and quick deployment
3.1 Prerequisites
Before starting, make sure you have (the commands after this list help you verify):
- A working WeKnora deployment (see the official documentation)
- Docker and Docker Compose installed
- Basic Linux command-line experience
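A quick sanity check, assuming your WeKnora stack created a Docker network named weknora_default (the name the Compose file below relies on; adjust if yours differs):

```bash
# Verify Docker and Docker Compose are installed
docker --version
docker-compose --version

# Verify the WeKnora network exists (the monitoring stack will join it)
docker network ls | grep weknora
```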
3.2 Creating the monitoring configuration files
First, create a new directory in the WeKnora project to hold the monitoring configuration:
```bash
# Enter the WeKnora project directory
cd WeKnora
# Create the monitoring configuration directory
mkdir -p monitoring
cd monitoring
```

Create the Prometheus configuration file prometheus.yml:
```yaml
# monitoring/prometheus.yml
global:
  scrape_interval: 15s      # scrape metrics every 15 seconds
  evaluation_interval: 15s  # evaluate alerting rules every 15 seconds

# Alerting rule files
rule_files:
  - "alerts.yml"

# Scrape targets
scrape_configs:
  # Prometheus self-monitoring
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # WeKnora application monitoring
  - job_name: 'weknora-app'
    static_configs:
      - targets: ['weknora-app:8080']  # WeKnora backend service
    metrics_path: '/metrics'           # metrics endpoint
    scrape_interval: 10s               # scrape application metrics more frequently

  # Document parsing service monitoring
  - job_name: 'weknora-docreader'
    static_configs:
      - targets: ['weknora-docreader:8000']  # metrics port added in section 4.2
                                             # (the gRPC port 50051 does not serve metrics)
    scrape_interval: 10s

  # Database monitoring (via the exporter container)
  - job_name: 'postgres'
    static_configs:
      - targets: ['postgres-exporter:9187']  # PostgreSQL Exporter
    scrape_interval: 30s

  # Redis monitoring (via the exporter container)
  - job_name: 'redis'
    static_configs:
      - targets: ['redis-exporter:9121']  # Redis Exporter
    scrape_interval: 30s

  # Node monitoring (server resources)
  - job_name: 'node'
    static_configs:
      - targets: ['node-exporter:9100']
    scrape_interval: 30s
```

Create the alerting rules file alerts.yml:
```yaml
# monitoring/alerts.yml
groups:
  - name: weknora_alerts
    rules:
      # High error rate alert
      - alert: HighErrorRate
        expr: rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) > 0.05
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High error rate detected"
          description: "Error rate for {{ $labels.job }} exceeds 5%; current value: {{ $value }}"

      # High response time alert
      - alert: HighResponseTime
        expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 5
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High response time detected"
          description: "95th percentile response time for {{ $labels.job }} exceeds 5 seconds; current value: {{ $value }}s"

      # Service down alert
      - alert: ServiceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service {{ $labels.job }} is down"
          description: "{{ $labels.job }} has been down for more than 1 minute"
```
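Optionally, you can validate both files with promtool, which ships inside the Prometheus image, before starting anything (checking the config also validates the rule files it references):

```bash
# Validate prometheus.yml and the alerts.yml it references
docker run --rm -v "$PWD:/work" --entrypoint promtool \
  prom/prometheus:latest check config /work/prometheus.yml
```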
3.3 Creating the Docker Compose file
Create a docker-compose.monitoring.yml file that defines the monitoring services:
```yaml
# monitoring/docker-compose.monitoring.yml
version: '3.8'

services:
  # Prometheus - metrics collection and storage
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    restart: unless-stopped
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - ./alerts.yml:/etc/prometheus/alerts.yml
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.console.libraries=/etc/prometheus/console_libraries'
      - '--web.console.templates=/etc/prometheus/consoles'
      - '--storage.tsdb.retention.time=30d'  # keep 30 days of data
      - '--web.enable-lifecycle'
    ports:
      - "9090:9090"
    networks:
      - weknora-network
    depends_on:
      - node-exporter
      - postgres-exporter
      - redis-exporter

  # Grafana - visualization
  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    restart: unless-stopped
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin123  # change to your own password
      - GF_INSTALL_PLUGINS=grafana-piechart-panel
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning
    ports:
      - "3000:3000"
    networks:
      - weknora-network
    depends_on:
      - prometheus

  # Node Exporter - server resource monitoring
  node-exporter:
    image: prom/node-exporter:latest
    container_name: node-exporter
    restart: unless-stopped
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.rootfs=/rootfs'
      - '--path.sysfs=/host/sys'
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
    ports:
      - "9100:9100"
    networks:
      - weknora-network

  # PostgreSQL Exporter - database monitoring
  # (weknora-postgres is defined in WeKnora's own compose file,
  # so it cannot be listed in depends_on here)
  postgres-exporter:
    image: prometheuscommunity/postgres-exporter:latest
    container_name: postgres-exporter
    restart: unless-stopped
    environment:
      - DATA_SOURCE_NAME=postgresql://weknora:your_password@weknora-postgres:5432/weknora?sslmode=disable
    ports:
      - "9187:9187"
    networks:
      - weknora-network

  # Redis Exporter - Redis monitoring
  redis-exporter:
    image: oliver006/redis_exporter:latest
    container_name: redis-exporter
    restart: unless-stopped
    environment:
      - REDIS_ADDR=redis://weknora-redis:6379
      - REDIS_PASSWORD=your_redis_password
    ports:
      - "9121:9121"
    networks:
      - weknora-network

  # Alertmanager - alert management
  alertmanager:
    image: prom/alertmanager:latest
    container_name: alertmanager
    restart: unless-stopped
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
      - alertmanager_data:/alertmanager
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'
      - '--storage.path=/alertmanager'
    ports:
      - "9093:9093"
    networks:
      - weknora-network

networks:
  weknora-network:
    external: true
    name: weknora_default  # join WeKnora's existing network

volumes:
  prometheus_data:
  grafana_data:
  alertmanager_data:
```

Create the Alertmanager configuration file alertmanager.yml:
```yaml
# monitoring/alertmanager.yml
global:
  smtp_smarthost: 'smtp.gmail.com:587'  # configure your SMTP server
  smtp_from: 'alertmanager@yourdomain.com'
  smtp_auth_username: 'your-email@gmail.com'
  smtp_auth_password: 'your-password'

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'email-notifications'

receivers:
  - name: 'email-notifications'
    email_configs:
      - to: 'admin@yourdomain.com'
        send_resolved: true
```

3.4 Configuring the Grafana data source and dashboards
Create the Grafana provisioning directories:

```bash
mkdir -p grafana/provisioning/datasources
mkdir -p grafana/provisioning/dashboards
```

Create the data source configuration file grafana/provisioning/datasources/prometheus.yml:
```yaml
# grafana/provisioning/datasources/prometheus.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: true
```

Create the dashboard provider configuration file grafana/provisioning/dashboards/dashboards.yml:
```yaml
# grafana/provisioning/dashboards/dashboards.yml
apiVersion: 1
providers:
  - name: 'default'
    orgId: 1
    folder: ''
    type: file
    disableDeletion: false
    editable: true
    options:
      path: /etc/grafana/provisioning/dashboards
```

3.5 Starting the monitoring services
Now start the monitoring stack:

```bash
# Make sure you are in the monitoring directory
cd monitoring
# Start the monitoring services
docker-compose -f docker-compose.monitoring.yml up -d
```

Check the service status:
```bash
docker-compose -f docker-compose.monitoring.yml ps
```

You should see all services up and running:
- Prometheus: http://localhost:9090
- Grafana: http://localhost:3000 (username: admin, password: admin123)
- Alertmanager: http://localhost:9093
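You can also confirm that Prometheus has registered its scrape targets through its HTTP API (this assumes jq is installed; the raw JSON is readable without it):

```bash
# List each scrape target and its current health
curl -s http://localhost:9090/api/v1/targets \
  | jq '.data.activeTargets[] | {job: .labels.job, health: .health}'
```

Note that the weknora-app and weknora-docreader targets will stay down until we add their metrics endpoints in the next section.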
4. Adding monitoring metrics to WeKnora
The monitoring stack is now running, but WeKnora itself does not expose any metrics yet. We need to add Prometheus instrumentation to WeKnora.
4.1 Instrumenting the Go backend
Modify WeKnora's Go backend to collect Prometheus metrics. First install the required dependencies:
```bash
# Run in the WeKnora project root
go get github.com/prometheus/client_golang/prometheus
go get github.com/prometheus/client_golang/prometheus/promauto
go get github.com/prometheus/client_golang/prometheus/promhttp
```

Create a new file internal/monitoring/metrics.go:
```go
// internal/monitoring/metrics.go
package monitoring

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

var (
	// HTTP request metrics
	HttpRequestsTotal = promauto.NewCounterVec(
		prometheus.CounterOpts{
			Name: "http_requests_total",
			Help: "Total number of HTTP requests",
		},
		[]string{"method", "path", "status"},
	)

	HttpRequestDuration = promauto.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    "http_request_duration_seconds",
			Help:    "HTTP request duration in seconds",
			Buckets: prometheus.DefBuckets,
		},
		[]string{"method", "path"},
	)

	// Business metrics
	KnowledgeUploadTotal = promauto.NewCounterVec(
		prometheus.CounterOpts{
			Name: "knowledge_upload_total",
			Help: "Total number of knowledge uploads",
		},
		[]string{"type", "status"},
	)

	KnowledgeUploadDuration = promauto.NewHistogram(
		prometheus.HistogramOpts{
			Name:    "knowledge_upload_duration_seconds",
			Help:    "Knowledge upload duration in seconds",
			Buckets: []float64{0.1, 0.5, 1, 5, 10, 30, 60},
		},
	)

	ChatRequestsTotal = promauto.NewCounterVec(
		prometheus.CounterOpts{
			Name: "chat_requests_total",
			Help: "Total number of chat requests",
		},
		[]string{"knowledge_base", "status"},
	)

	ChatResponseDuration = promauto.NewHistogram(
		prometheus.HistogramOpts{
			Name:    "chat_response_duration_seconds",
			Help:    "Chat response duration in seconds",
			Buckets: []float64{0.1, 0.5, 1, 2, 5, 10, 30},
		},
	)

	// Vector search metrics
	VectorSearchDuration = promauto.NewHistogram(
		prometheus.HistogramOpts{
			Name:    "vector_search_duration_seconds",
			Help:    "Vector search duration in seconds",
			Buckets: []float64{0.01, 0.05, 0.1, 0.5, 1, 2, 5},
		},
	)

	// Document processing metrics
	DocumentProcessingDuration = promauto.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    "document_processing_duration_seconds",
			Help:    "Document processing duration in seconds",
			Buckets: []float64{1, 5, 10, 30, 60, 120, 300},
		},
		[]string{"format"},
	)

	// System metrics
	ActiveSessions = promauto.NewGauge(
		prometheus.GaugeOpts{
			Name: "active_sessions",
			Help: "Number of active chat sessions",
		},
	)

	KnowledgeCount = promauto.NewGaugeVec(
		prometheus.GaugeOpts{
			Name: "knowledge_count",
			Help: "Number of knowledge items",
		},
		[]string{"knowledge_base"},
	)

	ChunkCount = promauto.NewGaugeVec(
		prometheus.GaugeOpts{
			Name: "chunk_count",
			Help: "Number of document chunks",
		},
		[]string{"knowledge_base"},
	)
)

// Init registers custom collectors or performs other one-time setup.
func Init() {
	// Custom collectors can be registered here.
}
```
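The HTTP metrics are recorded by the middleware below, but the business histograms need to be observed wherever the work actually happens. As one illustration, here is a minimal sketch of timing a vector search with the VectorSearchDuration histogram; the searchVectors function and its package are hypothetical stand-ins, not WeKnora's real code:

```go
// Hypothetical usage sketch: time an operation into the histogram above.
package retrieval

import (
	"github.com/prometheus/client_golang/prometheus"

	"github.com/weknora/internal/monitoring"
)

// searchVectors is an illustrative function, not WeKnora's actual API.
func searchVectors(query []float32) ([]string, error) {
	// NewTimer records the elapsed time into the histogram when
	// ObserveDuration is called (here, deferred to function exit).
	timer := prometheus.NewTimer(monitoring.VectorSearchDuration)
	defer timer.ObserveDuration()

	// ... run the pgvector similarity search here ...
	return nil, nil
}
```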
Next, create a Prometheus middleware in internal/middleware/prometheus.go:

```go
// internal/middleware/prometheus.go
package middleware

import (
	"strconv"
	"time"

	"github.com/gin-gonic/gin"

	"github.com/weknora/internal/monitoring"
)

func PrometheusMiddleware() gin.HandlerFunc {
	return func(c *gin.Context) {
		// Skip the metrics and health-check endpoints
		if c.Request.URL.Path == "/metrics" || c.Request.URL.Path == "/health" {
			c.Next()
			return
		}

		start := time.Now()
		method := c.Request.Method
		// Use the matched route template (e.g. /api/v1/knowledge-chat/:session_id)
		// rather than the raw URL, to keep the path label's cardinality low.
		path := c.FullPath()
		if path == "" {
			path = c.Request.URL.Path // unmatched routes (e.g. 404s)
		}

		// Handle the request
		c.Next()

		// Record the metrics
		duration := time.Since(start).Seconds()
		status := strconv.Itoa(c.Writer.Status())

		monitoring.HttpRequestsTotal.WithLabelValues(method, path, status).Inc()
		monitoring.HttpRequestDuration.WithLabelValues(method, path).Observe(duration)

		// Record business metrics based on the matched route
		recordBusinessMetrics(c, path, duration)
	}
}

func recordBusinessMetrics(c *gin.Context, path string, duration float64) {
	// Knowledge uploads
	if path == "/api/v1/knowledge-bases/:id/knowledge/file" {
		status := "success"
		if c.Writer.Status() >= 400 {
			status = "error"
		}
		monitoring.KnowledgeUploadTotal.WithLabelValues("file", status).Inc()
		monitoring.KnowledgeUploadDuration.Observe(duration)
	}

	// Chat requests
	if path == "/api/v1/knowledge-chat/:session_id" {
		// Read the knowledge base ID from the request context
		if kbID, exists := c.Get("knowledge_base_id"); exists {
			status := "success"
			if c.Writer.Status() >= 400 {
				status = "error"
			}
			monitoring.ChatRequestsTotal.WithLabelValues(kbID.(string), status).Inc()
			monitoring.ChatResponseDuration.Observe(duration)
		}
	}
}
```

Modify the main entry point cmd/server/main.go to expose the metrics endpoint:
```go
// cmd/server/main.go: add the following
import (
	"github.com/prometheus/client_golang/prometheus/promhttp"

	"github.com/weknora/internal/monitoring"
)

func main() {
	// ... existing code ...

	// Initialize monitoring
	monitoring.Init()

	// Expose the Prometheus metrics endpoint
	router.GET("/metrics", gin.WrapH(promhttp.Handler()))

	// ... existing code ...
}
```

Register the Prometheus middleware in the router configuration:
```go
// internal/router/router.go
import (
	"github.com/weknora/internal/middleware"
)

func SetupRouter() *gin.Engine {
	router := gin.New()

	// Register the Prometheus middleware
	router.Use(middleware.PrometheusMiddleware())

	// ... existing code ...
}
```
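After rebuilding and restarting the Go service, a quick curl confirms the endpoint is live:

```bash
# Should print Prometheus text-format metrics
curl -s http://localhost:8080/metrics | head -n 20
```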
4.2 Instrumenting the Python document parsing service
Modify the document parsing service to expose Prometheus metrics as well. First add the dependency to services/docreader/requirements.txt:

```
prometheus-client==0.20.0
```

Create services/docreader/src/metrics.py:
```python
# services/docreader/src/metrics.py
from prometheus_client import Counter, Histogram, Gauge, start_http_server
import time

# Document parsing metrics
document_parse_total = Counter(
    'document_parse_total',
    'Total number of document parses',
    ['format', 'status']
)

document_parse_duration = Histogram(
    'document_parse_duration_seconds',
    'Document parse duration in seconds',
    ['format'],
    buckets=(0.1, 0.5, 1, 5, 10, 30, 60, 120, 300)
)

# OCR metrics
ocr_process_total = Counter(
    'ocr_process_total',
    'Total number of OCR processes',
    ['status']
)

ocr_process_duration = Histogram(
    'ocr_process_duration_seconds',
    'OCR process duration in seconds',
    buckets=(0.1, 0.5, 1, 5, 10, 30)
)

# Image processing metrics
image_process_total = Counter(
    'image_process_total',
    'Total number of image processes',
    ['status']
)

image_process_duration = Histogram(
    'image_process_duration_seconds',
    'Image process duration in seconds',
    buckets=(0.1, 0.5, 1, 5, 10)
)

# System metrics
active_parsers = Gauge(
    'active_parsers',
    'Number of active document parsers'
)

queue_size = Gauge(
    'queue_size',
    'Size of processing queue'
)


def start_metrics_server(port=8000):
    """Start the Prometheus metrics HTTP server."""
    start_http_server(port)
    print(f"Metrics server started on port {port}")


def monitor_parse(func):
    """Decorator that records parse duration and outcome counts."""
    def wrapper(*args, **kwargs):
        start_time = time.time()
        file_format = kwargs.get('format', 'unknown')
        try:
            result = func(*args, **kwargs)
            duration = time.time() - start_time
            # Record success metrics
            document_parse_total.labels(format=file_format, status='success').inc()
            document_parse_duration.labels(format=file_format).observe(duration)
            return result
        except Exception:
            duration = time.time() - start_time
            # Record failure metrics
            document_parse_total.labels(format=file_format, status='error').inc()
            document_parse_duration.labels(format=file_format).observe(duration)
            raise
    return wrapper
```

Modify services/docreader/src/grpc_server.py to integrate the metrics:
```python
# services/docreader/src/grpc_server.py
import grpc
from concurrent import futures
import threading

# Import the generated gRPC bindings and the metrics module
# (the docreader_pb2_grpc module name follows the generated code)
from . import docreader_pb2_grpc
from . import metrics


class DocReaderService(docreader_pb2_grpc.DocReaderServiceServicer):

    @metrics.monitor_parse
    def ParseDocument(self, request, context):
        """Parse a document (instrumented with metrics)."""
        # Track the number of in-flight parsers
        metrics.active_parsers.inc()
        try:
            return self._parse_document_internal(request)
        finally:
            metrics.active_parsers.dec()

    def _parse_document_internal(self, request):
        """Internal parsing logic."""
        # Existing parsing code...
        pass


def serve():
    # Start the metrics server in a background thread (separate port)
    metrics_thread = threading.Thread(
        target=metrics.start_metrics_server,
        args=(8000,)  # serve metrics on port 8000
    )
    metrics_thread.daemon = True
    metrics_thread.start()

    # Start the gRPC server
    server = grpc.server(futures.ThreadPoolExecutor(max_workers=10))
    docreader_pb2_grpc.add_DocReaderServiceServicer_to_server(
        DocReaderService(), server
    )
    server.add_insecure_port('[::]:50051')
    server.start()
    print("gRPC server started on port 50051")
    server.wait_for_termination()


if __name__ == '__main__':
    serve()
```
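The OCR and image metrics defined in metrics.py can be wired up in the same way. A minimal sketch, where run_ocr is a hypothetical stand-in for the service's real OCR call:

```python
# Hypothetical sketch: instrumenting an OCR step with the metrics above.
import time

from . import metrics


def ocr_page(image_bytes):
    """Run OCR on one page image, recording duration and outcome."""
    start = time.time()
    try:
        text = run_ocr(image_bytes)  # run_ocr is a stand-in, not the real API
        metrics.ocr_process_total.labels(status='success').inc()
        return text
    except Exception:
        metrics.ocr_process_total.labels(status='error').inc()
        raise
    finally:
        metrics.ocr_process_duration.observe(time.time() - start)
```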
4.3 Updating the Docker configuration
Update WeKnora's Docker Compose configuration to expose the metrics ports.
Modify docker-compose.yml to add the metrics ports to the relevant services:
```yaml
# In the weknora-app service
weknora-app:
  # ... existing configuration ...
  ports:
    - "8080:8080"  # /metrics is served on the existing API port,
                   # so no extra mapping is needed
  # ... existing configuration ...

# In the weknora-docreader service, add the metrics port
weknora-docreader:
  # ... existing configuration ...
  ports:
    - "50051:50051"
    - "8000:8000"  # metrics endpoint
  # ... existing configuration ...
```
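Apply the changes by rebuilding and restarting the two instrumented services, then verify both metrics endpoints from the host:

```bash
# Rebuild and restart only the services that changed
docker-compose up -d --build weknora-app weknora-docreader

# Verify both metrics endpoints respond
curl -s http://localhost:8080/metrics | head -n 5
curl -s http://localhost:8000/metrics | grep document_parse
```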
5. Configuring Grafana dashboards
Metrics are now being collected; next we build Grafana dashboards to visualize them.
5.1 Logging in to Grafana
Open http://localhost:3000 and log in with username admin and password admin123.
5.2 Importing community dashboards
The Grafana community offers many ready-made dashboards we can import directly. A few recommendations:
- Node Exporter Full (ID: 1860) - server resource monitoring
- PostgreSQL (ID: 9628) - database monitoring
- Redis (ID: 763) - Redis monitoring
- Spring Boot Statistics (ID: 6756) - can be adapted for monitoring the Go application
To import a dashboard:
- In Grafana's left-hand menu, choose "Create" → "Import"
- Enter the dashboard ID
- Select the Prometheus data source
- Click "Import"
5.3 Creating a custom WeKnora dashboard
Beyond the imported dashboards, we also want one dedicated to WeKnora's own business metrics.
Create a new dashboard and add the following panels:
Panel 1: System overview
- Current active sessions
- Number of knowledge bases
- Total number of document chunks
- System health status
Panel 2: Performance
- HTTP request response time (P50, P95, P99)
- Document parsing time
- Vector search time
- Chat response time
Panel 3: Business traffic
- HTTP request rate (by status code)
- Document upload rate
- Chat request rate
- Error rate
Panel 4: Resource usage
- CPU usage
- Memory usage
- Disk usage
- Network traffic
Panel 5: Database performance
- Database connections
- Query performance
- Cache hit ratio (see the example queries after this list)
- Lock wait time
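As a starting point for two of these panels, here are example queries; the Redis metric names are the ones exposed by redis_exporter:

```promql
# Panel 3: HTTP request rate broken down by status code
sum by (status) (rate(http_requests_total[5m]))

# Panel 5: Redis cache hit ratio (redis_exporter metrics)
rate(redis_keyspace_hits_total[5m])
  / (rate(redis_keyspace_hits_total[5m]) + rate(redis_keyspace_misses_total[5m]))
```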
Here is a sample dashboard JSON you can import directly:
{ "dashboard": { "title": "WeKnora监控仪表盘", "panels": [ { "title": "HTTP请求响应时间", "targets": [ { "expr": "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))", "legendFormat": "P95响应时间" }, { "expr": "histogram_quantile(0.50, rate(http_request_duration_seconds_bucket[5m]))", "legendFormat": "P50响应时间" } ], "type": "graph" }, { "title": "聊天响应时间", "targets": [ { "expr": "histogram_quantile(0.95, rate(chat_response_duration_seconds_bucket[5m]))", "legendFormat": "P95聊天响应时间" } ], "type": "graph" }, { "title": "文档处理性能", "targets": [ { "expr": "histogram_quantile(0.95, rate(document_processing_duration_seconds_bucket[5m]))", "legendFormat": "P95文档处理时间" }, { "expr": "histogram_quantile(0.95, rate(vector_search_duration_seconds_bucket[5m]))", "legendFormat": "P95向量检索时间" } ], "type": "graph" } ] } }6. 设置告警规则
Monitoring should not only show us problems; it should notify us promptly when they occur. Prometheus's Alertmanager handles exactly that.
6.1 Configuring alerting rules
We already defined some basic rules in alerts.yml; now let's add more rules specific to WeKnora:
```yaml
# Append these to the rules: list of the weknora_alerts group in alerts.yml

# Document processing failure alert
- alert: DocumentProcessingFailed
  expr: rate(document_parse_total{status="error"}[5m]) / rate(document_parse_total[5m]) > 0.1
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Document processing failure rate too high"
    description: "Document processing failure rate exceeds 10%; current value: {{ $value }}"

# Slow chat response alert
- alert: ChatResponseSlow
  expr: histogram_quantile(0.95, rate(chat_response_duration_seconds_bucket[5m])) > 10
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Chat responses are too slow"
    description: "95th percentile chat response time exceeds 10 seconds; current value: {{ $value }}s"

# Vector search degradation alert
- alert: VectorSearchSlow
  expr: histogram_quantile(0.95, rate(vector_search_duration_seconds_bucket[5m])) > 2
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Vector search performance degraded"
    description: "95th percentile vector search time exceeds 2 seconds; current value: {{ $value }}s"

# Low storage space alert
- alert: StorageSpaceLow
  expr: (node_filesystem_avail_bytes{mountpoint="/data"} / node_filesystem_size_bytes{mountpoint="/data"}) * 100 < 20
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Storage space is running low"
    description: "Storage usage is above 80%; only {{ $value }}% remaining"

# High database connection count alert
- alert: HighDatabaseConnections
  expr: pg_stat_database_numbackends > 50
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: "Database connection count too high"
    description: "Database connections exceed 50; current value: {{ $value }}"
```

6.2 Configuring alert notifications
Modify alertmanager.yml to support multiple notification channels:
```yaml
# monitoring/alertmanager.yml
global:
  smtp_smarthost: 'smtp.gmail.com:587'
  smtp_from: 'alerts@yourcompany.com'
  smtp_auth_username: 'your-email@gmail.com'
  smtp_auth_password: 'your-password'

route:
  group_by: ['alertname', 'severity']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'default-receiver'
  # Route alerts to different receivers by severity
  routes:
    - match:
        severity: critical
      receiver: 'critical-alerts'
      continue: false
    - match:
        severity: warning
      receiver: 'warning-alerts'

receivers:
  - name: 'default-receiver'
    email_configs:
      - to: 'team@yourcompany.com'
        send_resolved: true

  - name: 'critical-alerts'
    email_configs:
      - to: 'oncall@yourcompany.com'
        send_resolved: true
    webhook_configs:
      - url: 'https://hooks.slack.com/services/your/slack/webhook'
        send_resolved: true

  - name: 'warning-alerts'
    email_configs:
      - to: 'dev-team@yourcompany.com'
        send_resolved: true
```

6.3 Testing alerts
Restart Alertmanager so the new configuration takes effect:

```bash
docker-compose -f docker-compose.monitoring.yml restart alertmanager
```

Then trigger an alert to test the pipeline:
```bash
# Reload the Prometheus configuration so the new rules take effect
# (works because we started Prometheus with --web.enable-lifecycle)
curl -X POST http://localhost:9090/-/reload
```

To actually fire an alert, the simplest approach is to temporarily stop one of the monitored services so the ServiceDown rule triggers. Then check the Alertmanager UI at http://localhost:9093, where you should see the firing alert.
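You can also push a synthetic alert straight into Alertmanager's v2 API, which exercises the notification routing without touching Prometheus at all:

```bash
# Post a fake warning-severity alert to test notification delivery
curl -X POST http://localhost:9093/api/v2/alerts \
  -H 'Content-Type: application/json' \
  -d '[{"labels": {"alertname": "TestAlert", "severity": "warning"},
        "annotations": {"summary": "Test alert sent via curl"}}]'
```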
7. Maintaining and optimizing the monitoring system
With the monitoring stack in place, the work isn't over: it needs regular maintenance and tuning.
7.1 Data retention policy
Adjust the retention window to match your storage capacity and needs:
```bash
# Set via the Prometheus startup flags in docker-compose.monitoring.yml
# (this is a command-line flag, not a prometheus.yml setting)
--storage.tsdb.retention.time=30d  # keep 30 days
```

For long-term trend analysis, consider:
- Prometheus remote storage (e.g. Thanos or Cortex)
- Periodically exporting key metrics to external storage
- Generating periodic reports with Grafana's reporting features
7.2 Performance tuning
If your metric volume becomes large, a few optimizations help:
- Reduce scrape frequency: metrics that change slowly can be scraped less often
- Use recording rules: precompute frequently used queries
- Optimize labels: avoid high-cardinality labels (such as user IDs or session IDs)
```yaml
# Add the recording rules file to rule_files in prometheus.yml
rule_files:
  - "alerts.yml"
  - "recording_rules.yml"
```

Create recording_rules.yml:
```yaml
# monitoring/recording_rules.yml
groups:
  - name: weknora_recording_rules
    interval: 30s
    rules:
      - record: job:http_requests:rate5m
        expr: rate(http_requests_total[5m])

      - record: job:chat_response_time:percentile95
        expr: histogram_quantile(0.95, rate(chat_response_duration_seconds_bucket[5m]))

      - record: job:error_rate:ratio
        expr: rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m])
```

Remember to mount recording_rules.yml into the Prometheus container in docker-compose.monitoring.yml, just like alerts.yml, then reload Prometheus.
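Dashboards and alerts can then reference the precomputed series instead of re-evaluating the raw expression each time, for example:

```promql
# Alert or panel expression using the precomputed error ratio
job:error_rate:ratio > 0.05
```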
7.3 Monitoring the monitoring system
Don't forget that the monitoring system itself needs watching:
- Prometheus self-monitoring: Prometheus exposes metrics about itself
- Grafana monitoring: Grafana also has a built-in metrics endpoint
- Storage monitoring: check the size of Prometheus's data directory regularly (see the command after this list)
- Alert-channel monitoring: make sure alerts can actually be delivered
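For the storage check, a one-liner against the running container does the job:

```bash
# On-disk size of the Prometheus TSDB inside the container
docker exec prometheus du -sh /prometheus
```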
8. Real-world cases
Here are a few examples of how the monitoring system helped solve real problems:
Case 1: document parsing performance. A user reported that uploading PDF files had become extremely slow. The monitoring dashboard quickly showed that:
- Document parsing time had jumped from the usual 1-2 seconds to over 30
- The OCR failure rate had spiked
- The number of active parsers had hit its limit
Digging further, a PDF full of scanned images was causing OCR timeouts. Raising the OCR timeout and increasing processing concurrency fixed the problem.
Case 2: slow chat responses. A monitoring alert showed that P95 chat response time had climbed from 3 seconds to 15. The detailed metrics told the story:
- Vector search time was normal
- LLM inference time was normal
- But database query time had clearly grown
The database dashboard revealed a slow query in progress. After optimizing that query, response times returned to normal.
Case 3: memory leak. Monitoring showed the WeKnora application's memory usage climbing steadily and never being released, even with no traffic. Analyzing memory metrics and GC logs pointed to a buggy cache implementation that kept objects from being garbage-collected. Fixing the cache logic brought memory usage back to normal.
9. Summary
Setting up monitoring for WeKnora may look involved, but once it's done it becomes one of the most valuable tools in your operations toolbox. With the Prometheus + Grafana combination you can:
- See system state in real time: know how every service is doing at any moment
- Locate problems quickly: when performance degrades, find the bottleneck fast
- Do preventive maintenance: spot potential problems early through trend analysis
- Optimize with data: base architecture and configuration decisions on real measurements
This approach is not limited to WeKnora; the same design and techniques apply to other microservice systems. Monitoring is not a one-off task but an ongoing process: as the business evolves, you will keep adjusting the monitoring strategy, adding new metrics, and refining alert rules.
Configuring all of this may feel tedious at first, but the first time you resolve a production issue in minutes thanks to the monitoring system, the investment pays for itself. Monitoring gives your WeKnora deployment a pair of "eyes", letting you see every detail of the system instead of groping in the dark.