前言
MySQL的CPU或内存突然飙升是生产环境中最紧急的故障之一,可能导致服务雪崩。本文提供一套从症状到根因的完整排查流程,涵盖实时监控、问题定位、紧急处理、根因分析四个阶段,帮助你快速响应并解决MySQL性能危机。
一、紧急响应:30秒内确定问题类型
- 快速检查系统负载
bash
检查整体负载(立即执行)
top -c
或使用更清晰的htop
htop
查看MySQL进程资源占用
pidstat -p $(pgrep mysqld) 1 5
检查内存使用情况
free -h
检查SWAP使用
swapon --show
2. 判断问题类型:CPU vs 内存
bash
如果是CPU问题(top显示CPU%持续>80%):
特征:系统响应慢,查询超时增多
如果是内存问题(free显示可用内存极低):
特征:可能出现OOM,大量磁盘IO
- 快速缓解措施
bash
临时限制MySQL资源(如有多个实例)
systemctl set-property mysqld.service CPUShares=512 MemoryLimit=4G
紧急重启(最后手段,按顺序执行)
sudo systemctl stop mysql
等待30秒
sudo systemctl start mysql
二、CPU飙升排查:定位高消耗SQL
- 实时查看活跃连接
sql
– 查看当前所有连接和状态
SHOW FULL PROCESSLIST;
– 或更详细的查询
SELECT
id,
USER,
HOST,
DB,
COMMAND,
TIME,
STATE,
LEFT(INFO, 100) AS query_preview,
ROWS_SENT,
ROWS_EXAMINED
FROM information_schema.PROCESSLIST
WHERE COMMAND != ‘Sleep’
AND INFO IS NOT NULL
ORDER BY TIME DESC
LIMIT 20; - 定位消耗CPU的线程
bash
Linux层面查看MySQL线程
top -H -p $(pgrep mysqld)
将线程ID转换为十六进制(用于MySQL内部查询)
printf “%x\n” 12345
输出:3039
sql
– 在MySQL中查看线程对应操作
SELECT
THREAD_ID,
PROCESSLIST_ID,
NAME AS thread_name,
PROCESSLIST_INFO AS current_query,
PROCESSLIST_TIME AS query_time,
PROCESSLIST_STATE AS state
FROM performance_schema.threads
WHERE THREAD_OS_ID = 12345; – 替换为实际线程ID
3. 识别问题SQL特征
sql
– 查看当前运行时间最长的查询
SELECT
ps.THREAD_ID,
esh.EVENT_NAME,
esh.SQL_TEXT,
esh.TIMER_WAIT/1000000000 as seconds,
esh.LOCK_TIME/1000000000 as lock_seconds,
esh.ROWS_EXAMINED,
esh.ROWS_AFFECTED
FROM performance_schema.events_statements_history esh
JOIN performance_schema.threads ps ON esh.THREAD_ID = ps.THREAD_ID
WHERE ps.PROCESSLIST_ID IS NOT NULL
ORDER BY esh.TIMER_WAIT DESC
LIMIT 10;
4. 检查锁争用情况
sql
– 查看锁等待
SELECT
r.trx_id AS waiting_trx_id,
r.trx_mysql_thread_id AS waiting_thread,
r.trx_query AS waiting_query,
b.trx_id AS blocking_trx_id,
b.trx_mysql_thread_id AS blocking_thread,
b.trx_query AS blocking_query,
TIMESTAMPDIFF(SECOND, r.trx_wait_started, NOW()) AS wait_seconds
FROM information_schema.INNODB_LOCK_WAITS w
JOIN information_schema.INNODB_TRX b ON b.trx_id = w.blocking_trx_id
JOIN information_schema.INNODB_TRX r ON r.trx_id = w.requesting_trx_id;
三、内存飙升排查:识别内存泄漏点
- 查看MySQL内存分配
sql
– 查看全局内存分配
SELECT
SUBSTRING_INDEX(event_name,‘/’,2) AS code_area,
SUM(current_alloc) AS current_alloc
FROM sys.memory_global_by_current_bytes
GROUP BY SUBSTRING_INDEX(event_name,‘/’,2)
ORDER BY SUM(current_alloc) DESC;
– 查看具体组件内存使用
SELECT * FROM sys.memory_global_by_current_bytes
WHERE current_count > 0
ORDER BY current_alloc DESC
LIMIT 15;
2. 检查缓冲区使用情况
sql
– 查看InnoDB缓冲池状态
SHOW ENGINE INNODB STATUS\G
– 重点关注BUFFER POOL AND MEMORY部分
– 查看缓冲池命中率
SELECT
(1 - (Variable_value / (SELECT Variable_value
FROM information_schema.global_status
WHERE Variable_name = ‘Innodb_buffer_pool_read_requests’))) * 100 AS hit_rate
FROM information_schema.global_status
WHERE Variable_name = ‘Innodb_buffer_pool_reads’;
– 查看排序和临时表内存使用
SHOW GLOBAL STATUS LIKE ‘Sort%’;
SHOW GLOBAL STATUS LIKE ‘%tmp%table%’;
3. 识别内存泄漏会话
sql
– 查看每个连接的内存使用
SELECT
thread_id,
processlist_id,
processlist_user,
processlist_host,
memory_current_allocated,
memory_total_allocated
FROM sys.session
WHERE memory_current_allocated > 10010241024 – 大于100MB
ORDER BY memory_current_allocated DESC;
– 查看内存事件
SELECT
event_name,
current_alloc,
high_alloc
FROM sys.memory_by_thread_by_current_bytes
ORDER BY current_alloc DESC
LIMIT 10;
四、深度分析:使用性能Schema
- 开启性能监控
sql
– 确保性能Schema已开启
SHOW VARIABLES LIKE ‘performance_schema’;
– 开启所有监控(如果未开启)
UPDATE performance_schema.setup_instruments
SET ENABLED = ‘YES’, TIMED = ‘YES’;
UPDATE performance_schema.setup_consumers
SET ENABLED = ‘YES’;
2. 分析SQL执行统计
sql
– 查看消耗资源最多的SQL
SELECT
DIGEST_TEXT AS query,
COUNT_STAR AS exec_count,
SUM_TIMER_WAIT/1000000000 AS total_time_sec,
AVG_TIMER_WAIT/1000000000 AS avg_time_sec,
SUM_ROWS_EXAMINED AS rows_examined,
SUM_ROWS_SENT AS rows_sent,
SUM_SORT_ROWS AS sort_rows,
SUM_CREATED_TMP_TABLES AS tmp_tables,
FIRST_SEEN,
LAST_SEEN
FROM performance_schema.events_statements_summary_by_digest
WHERE DIGEST_TEXT IS NOT NULL
ORDER BY SUM_TIMER_WAIT DESC
LIMIT 10;
3. 检查表扫描情况
sql
– 查看全表扫描的表
SELECT
object_schema,
object_name,
count_read,
count_write,
count_fetch,
count_insert,
count_update,
count_delete
FROM performance_schema.table_io_waits_summary_by_table
ORDER BY count_read DESC
LIMIT 10;
五、紧急处理脚本集
- 快速诊断脚本
bash
#!/bin/bash
mysql_emergency_check.sh
echo “========== 系统状态 ==========”
top -bn1 | grep -E “^(%Cpu|Tasks|Mem|Swap)”
echo -e “\n========== MySQL进程状态 ==========”
ps aux | grep mysqld | grep -v grep
echo -e “\n========== MySQL连接数 ==========”
mysql -e “SHOW GLOBAL STATUS LIKE ‘Threads_connected’;”
mysql -e “SHOW GLOBAL STATUS LIKE ‘Max_used_connections’;”
echo -e “\n========== 运行时间最长的查询 ==========”
mysql -e "
SELECT
ID,
USER,
HOST,
DB,
COMMAND,
TIME,
STATE,
LEFT(INFO, 100) AS query,
ROWS_SENT,
ROWS_EXAMINED
FROM information_schema.PROCESSLIST
WHERE COMMAND != ‘Sleep’
AND TIME > 60
ORDER BY TIME DESC
LIMIT 5\G"
echo -e “\n========== 锁等待 ==========”
mysql -e “SHOW ENGINE INNODB STATUS\G” | grep -A 20 “LATEST DETECTED DEADLOCK”
2. 自动终止问题查询
bash
#!/bin/bash
kill_long_queries.sh
TIMEOUT=300 # 5分钟
终止运行超过指定时间的查询
mysql -e "
SELECT CONCAT('KILL ', ID, ‘;’) AS kill_command
FROM information_schema.PROCESSLIST
WHERE COMMAND != ‘Sleep’
AND TIME > $TIMEOUT
AND USER NOT IN (‘system user’, ‘event_scheduler’)
AND INFO NOT LIKE ‘%PROCESSLIST%’
" | grep ‘KILL’ | mysql
3. 内存泄漏监控脚本
bash
#!/bin/bash
monitor_mysql_memory.sh
LOG_FILE=“/var/log/mysql_memory.log”
THRESHOLD=80 # 内存使用率阈值%
while true; do
# 获取MySQL内存使用百分比
MEM_USAGE=$(ps -o %mem= -p $(pgrep mysqld) | awk ‘{sum+=$1} END {print sum}’)
if (( $(echo "$MEM_USAGE > $THRESHOLD" | bc -l) )); then echo "$(date): MySQL内存使用率 ${MEM_USAGE}% 超过阈值" >> $LOG_FILE # 捕获内存快照 mysql -e "SHOW ENGINE INNODB STATUS\G" >> $LOG_FILE mysql -e "SELECT * FROM sys.memory_global_by_current_bytes LIMIT 10;" >> $LOG_FILE # 发送告警 echo "MySQL内存告警: ${MEM_USAGE}%" | mail -s "MySQL内存异常" admin@example.com fi sleep 60 # 每分钟检查一次done
六、根因分析与解决方案
- CPU飙升常见原因及解决
原因 特征 解决方案
全表扫描 ROWS_EXAMINED远大于ROWS_SENT 添加合适索引,优化查询条件
排序操作 大量filesort,Sort_merge_passes增加 优化ORDER BY,增加sort_buffer_size
子查询优化 使用DEPENDENT SUBQUERY 改写为JOIN或使用EXISTS
锁争用 大量锁等待,Lock wait timeout 优化事务,减少锁持有时间
并发连接高 Threads_connected接近max_connections 增加连接池,优化连接管理
优化示例:
sql
– 从低效查询
SELECT * FROM orders WHERE DATE(create_time) = ‘2023-10-01’;
– 优化为
SELECT * FROM orders WHERE create_time >= ‘2023-10-01’ AND create_time < ‘2023-10-02’;
– 并在create_time上建立索引
2. 内存飙升常见原因及解决
原因 特征 解决方案
连接过多 每个连接占用大量内存 减少空闲连接,优化连接池
缓冲池过大 缓冲池占用物理内存80%以上 适当减小innodb_buffer_pool_size
临时表过多 Created_tmp_disk_tables高 优化查询,增加tmp_table_size
内存泄漏 内存持续增长不释放 检查bug,升级MySQL版本
排序缓冲区大 sort_buffer_size设置过大 适当减小,监控实际使用
配置优化示例:
ini
my.cnf 优化配置
[mysqld]
缓冲池大小(物理内存的50-70%)
innodb_buffer_pool_size = 4G
连接相关内存
max_connections = 200
thread_cache_size = 50
临时表内存
tmp_table_size = 64M
max_heap_table_size = 64M
排序缓冲区
sort_buffer_size = 2M
join_buffer_size = 2M
查询缓存(MySQL 8.0已移除,5.7可考虑禁用)
query_cache_size = 0
七、预防与监控体系建设
- 建立监控仪表盘
bash
使用Prometheus + Grafana监控
部署mysqld_exporter
docker run -d
–name mysqld-exporter
-p 9104:9104
-e DATA_SOURCE_NAME=“user:password@(hostname:3306)/”
prom/mysqld-exporter
Grafana仪表盘导入ID:7362(MySQL Overview)
- 关键监控指标告警
yaml
Prometheus告警规则示例
groups:
- name: mysql_alerts
rules:alert: HighCPUUsage
expr: rate(process_cpu_seconds_total{job=“mysql”}[5m]) * 100 > 80
for: 5m
labels:
severity: warning
annotations:
summary: “MySQL CPU使用率过高”alert: HighMemoryUsage
expr: process_resident_memory_bytes{job=“mysql”} / node_memory_MemTotal_bytes * 100 > 85
for: 5m
labels:
severity: critical
annotations:
summary: “MySQL内存使用率过高”
定期健康检查
sql
– 每日健康检查SQL
SELECT
– 连接状态
(SELECT VARIABLE_VALUE FROM performance_schema.global_status
WHERE VARIABLE_NAME = ‘Threads_connected’) AS current_connections,
(SELECT VARIABLE_VALUE FROM performance_schema.global_variables
WHERE VARIABLE_NAME = ‘max_connections’) AS max_connections,– 缓冲池命中率
ROUND((1 - (
SELECT VARIABLE_VALUE FROM performance_schema.global_status
WHERE VARIABLE_NAME = ‘Innodb_buffer_pool_reads’
) / (
SELECT VARIABLE_VALUE FROM performance_schema.global_status
WHERE VARIABLE_NAME = ‘Innodb_buffer_pool_read_requests’
)) * 100, 2) AS buffer_pool_hit_rate,– 临时表使用
(SELECT VARIABLE_VALUE FROM performance_schema.global_status
WHERE VARIABLE_NAME = ‘Created_tmp_disk_tables’) AS disk_temp_tables,– 慢查询数量
(SELECT VARIABLE_VALUE FROM performance_schema.global_status
WHERE VARIABLE_NAME = ‘Slow_queries’) AS slow_queries_today;自动化优化建议
sql
– 使用MySQLTuner或pt-variable-advisor
下载并运行MySQLTuner
wget https://raw.githubusercontent.com/major/MySQLTuner-perl/master/mysqltuner.pl
perl mysqltuner.pl --host 127.0.0.1 --user root --pass ‘yourpassword’
或使用Percona工具
pt-variable-advisor localhost --user root --password ‘yourpassword’
八、排查流程总结
快速排查决策树
text
CPU/内存飙升
↓
- top/htop确认MySQL是罪魁祸首
↓ - SHOW PROCESSLIST查看活跃查询
├── 有大量慢查询 → 优化SQL/添加索引
├── 大量锁等待 → 分析锁争用,优化事务
└── 连接数爆满 → 调整连接池配置
↓ - 检查内存分配
├── 缓冲池过大 → 调整innodb_buffer_pool_size
├── 临时表过多 → 优化查询,调整tmp_table_size
└── 内存泄漏 → 升级版本,检查bug
↓ - 分析性能Schema数据
↓ - 实施优化措施并监控效果
必备命令速查表
bash
系统层面
top -c # 整体资源监控
iotop # IO监控
vmstat 1 # 虚拟内存统计
MySQL层面
mysqladmin processlist # 查看进程列表
mysqladmin status # 查看状态摘要
mysql -e “SHOW ENGINE INNODB STATUS\G” # InnoDB详细信息
性能分析
pt-query-digest slow.log # 分析慢查询日志
mysqldumpslow -s t slow.log # 慢查询排序
九、高级工具推荐
- 实时监控工具
bash
Percona Monitoring and Management (PMM)
docker run -d
–name pmm-server
-p 80:80
-p 443:443
-v /srv/pmm-data:/srv
percona/pmm-server:latest
2. SQL审核工具
bash
SOAR - SQL Optimizer And Rewriter
git clone https://github.com/XiaoMi/soar.git
cd soar
./soar -query “你的SQL” -report-type html
3. 压力测试工具
bash
sysbench基准测试
sysbench oltp_read_write
–mysql-host=127.0.0.1
–mysql-port=3306
–mysql-user=root
–mysql-password=xxx
–table-size=1000000
–threads=32
–time=300
run
问题解决验证清单
✅ CPU问题解决验证:
CPU使用率降至正常水平(<60%)
查询响应时间恢复正常
慢查询数量显著减少
锁等待时间降低
✅ 内存问题解决验证:
内存使用率稳定在合理范围
SWAP使用率为0或极低
缓冲池命中率>99%
临时表磁盘使用显著减少
✅ 预防措施就绪:
监控告警已配置
慢查询日志已开启并定期分析
定期健康检查脚本已部署
备份和恢复方案已验证
黄金法则:
当MySQL出现资源飙升时,保持冷静,按流程排查。先应急恢复服务,再分析根因,最后实施长效预防措施。记住:预防永远比救火更重要。建立一个完善的监控体系,是避免此类问题的最佳实践。