ChatTTS CPU Resource Optimization: Docker Deployment in Practice and a Performance Tuning Guide - 平芜编程栈

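This note covers how to fit a large speech-synthesis model onto a 4-core, 8 GB machine and keep it alive under concurrent load, along with every pitfall hit along the way.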

1. Background: ChatTTS Struggles on CPU

ChatTTS ships with GPU-oriented scripts by default. Running `python app.py` directly on a CPU-only machine leads to:

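A single 10 s audio clip drives CPU usage to 250%, the 4 cores saturate, and the SSH session slows to a crawl;
Cold start takes 30 s or more, because operators are JIT-compiled again every time the container restarts;
With more than 3 concurrent requests, Load Average exceeds 5 and the process is killed by the OOM Killer.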

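In short: the CPU can run the model; the problem is that "running" and "saving" were never treated as separate goals.

Running: give the model enough compute to get through inference;
Saving: leave the system enough headroom to stay responsive.

The setup below packages both into Docker so it works out of the box.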

2. Technology Choice: Native vs Docker vs K8s

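Dimension	Native systemd	Docker	Kubernetes
Dependency isolation	Manual venv	Packaged in the image	Packaged in the image
NUMA affinity	Manual `taskset`	`--cpuset-cpus`	Topology Manager
Fast scaling	Hand-written scripts	`docker compose up -d`	Declarative YAML
Resource oversubscription	None	`cpu_quota`	Requests/limits
Learning curve	Low	Medium	High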

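Conclusion:

For individuals and small teams, Docker Compose gives the best return for the effort;
Migrating to K8s later for elasticity remains straightforward.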

3. Core Implementation

3.1 Dockerfile Best Practices

A multi-stage build with a minimal runtime stage shrinks the image from 4.2 GB to 1.1 GB.

```dockerfile
# Stage 1: build and download dependencies
FROM python:3.10-slim AS builder
WORKDIR /build
COPY requirements.txt .
RUN pip install --user -r requirements.txt

# Stage 2: runtime
FROM python:3.10-slim
RUN apt-get update && apt-get install -y --no-install-recommends \
    libgomp1 numactl && rm -rf /var/lib/apt/lists/*
COPY --from=builder /root/.local /usr/local
COPY . /app
WORKDIR /app
ENV NUMBA_CACHE_DIR=/tmp/numba
ENTRYPOINT ["numactl", "--cpunodebind=0", "--membind=0", \
            "python", "server.py"]
```

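Key points

`numactl` binds the process to NUMA node 0, which reduces cross-node memory access;
`NUMBA_CACHE_DIR` points to a dedicated directory (mounted as tmpfs in the compose file), so JIT output is reused for as long as the container runs; a tmpfs mount is emptied when the container stops, so mount a named volume there instead if the cache must survive restarts;
`--no-install-recommends` trims about 120 MB of unneeded packages.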

3.2 Resource Limits

`docker-compose.yml` excerpt:

```yaml
deploy:
  resources:
    limits:
      cpus: '3.5'
      memory: 3G
    reservations:
      cpus: '2'
      memory: 2G
cpuset: "0-3"  # physical cores 0-3, avoids hyper-threading jitter
```

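Explanation

`cpus: '3.5'` is equivalent to `--cpu-quota=350000` with the default `--cpu-period=100000`;
`cpuset` pins the container to CPUs on a single NUMA node, matching the `numactl` binding in the Dockerfile;
The remaining 0.5 core is left to the host so SSH stays reachable.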
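The relationship between `cpus` and the CFS quota is plain arithmetic, and it can help to see it written out. A minimal sketch, assuming Docker's default CFS period of 100000 µs; the helper names `cpu_quota_us` and `reserved_for_host` are made up for illustration and are not part of any Docker tooling:

```python
def cpu_quota_us(cpus: float, period_us: int = 100_000) -> int:
    """Return the CFS quota (microseconds per period) equivalent to a `cpus` limit."""
    if cpus <= 0:
        raise ValueError("cpus must be positive")
    return round(cpus * period_us)


def reserved_for_host(total_cores: int, cpus: float) -> float:
    """Cores left for the host (SSH, dockerd, ...) after the container limit."""
    return total_cores - cpus


if __name__ == "__main__":
    # Hypothetical 4-core host with the 3.5-core limit used in this article
    print(cpu_quota_us(3.5))
    print(reserved_for_host(4, 3.5))
```

If the period is changed with `--cpu-period`, the quota scales with it, so the same helper can be reused by passing a different `period_us`.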

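3.3 Model Loading Optimization

Warm-up: right after the container starts, synthesize 3 blank 1 s clips to trigger Numba/JIT compilation once, up front;
Memory mapping: load the `*.pt` model file with `torch.load(..., mmap=True)` (available since PyTorch 2.1), which cuts RSS by about 30%;
Sentence-level cache: an LRU cache keyed on the input text reached a 42% hit rate and lowered CPU usage by 18%.

Code snippet (`server.py`):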

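```python
import hashlib

import torch
from lru import LRU  # provided by the lru-dict package

# Assumes chatts.pt holds a pickled model object that exposes infer()
model = torch.load('chatts.pt', mmap=True)
cache = LRU(256)


def synthesize(text: str) -> bytes:
    # Key the cache on a hash of the text so long inputs stay cheap to look up
    key = hashlib.sha256(text.encode()).hexdigest()
    if key in cache:
        return cache[key]
    with torch.no_grad():
        wav = model.infer(text)
    cache[key] = wav
    return wav
```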
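To check a hit-rate figure like the one quoted above on your own traffic, the cache needs to count hits and misses. The following is a rough sketch using only the standard library rather than `lru-dict`; the class `CountingLRU` and the stand-in `fake_infer` are hypothetical names introduced for illustration, and `fake_infer` merely takes the place of the real model call:

```python
from collections import OrderedDict
import hashlib


class CountingLRU:
    """Tiny LRU cache that also tracks hits and misses."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self._data = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get_or_compute(self, text: str, compute):
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key in self._data:
            self.hits += 1
            self._data.move_to_end(key)  # mark as most recently used
            return self._data[key]
        self.misses += 1
        value = compute(text)
        self._data[key] = value
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)  # evict the least recently used entry
        return value

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


def fake_infer(text: str) -> bytes:
    """Stand-in for the model call; returns dummy bytes."""
    return text.encode("utf-8")


if __name__ == "__main__":
    cache = CountingLRU(capacity=2)
    for t in ["a", "b", "a", "c", "b"]:
        cache.get_or_compute(t, fake_infer)
    print(f"hit rate: {cache.hit_rate:.0%}")
```

Logging `hit_rate` periodically would show whether a capacity of 256 is actually a good fit for the text distribution the service sees.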

4. Code Example: A Repository Layout That Runs in One Command

```
chatts-cpu-docker/
├── docker-compose.yml
├── Dockerfile
├── server.py
├── warmup.py
└── bench.py
```

`docker-compose.yml` (complete)

```yaml
version: "3.9"
services:
  chats:
    build: .
    ports:
      - "8090:8090"
    deploy:
      resources:
        limits:
          cpus: '3.5'
          memory: 3G
    cpuset: "0-3"
    tmpfs:
      - /tmp/numba:rw,noexec,nosuid,size=100m
    ulimits:
      memlock: -1
    environment:
      - PYTHONUNBUFFERED=1
      - NUMBA_CACHE_DIR=/tmp/numba
```

`server.py` (trimmed version, PEP 8 compliant and pylint-clean)

```python
#!/usr/bin/env python3
"""ChatTTS CPU service: single process + LRU cache."""
import hashlib
import logging
from pathlib import Path

import torch
from flask import Flask, Response, request
from lru import LRU  # provided by the lru-dict package

logging.basicConfig(level=logging.INFO)
app = Flask(__name__)

MODEL_PATH = Path("chatts.pt")
CACHE_SIZE = 256
WARM_TXT = ["hello world.", "123.", ""]


def load_model():
    """Load the model on CPU, memory-mapping the checkpoint file."""
    logging.info("Loading model with mmap...")
    model = torch.load(MODEL_PATH, mmap=True, map_location="cpu")
    model.eval()
    return model


def warmup(m):
    """Run a few throwaway inferences so JIT compilation happens at startup."""
    logging.info("Warming up...")
    with torch.no_grad():
        for t in WARM_TXT:
            _ = m.infer(t)


model = load_model()
warmup(model)
cache = LRU(CACHE_SIZE)


def tts(text: str) -> bytes:
    """Return synthesized audio, served from the LRU cache when possible."""
    key = hashlib.sha256(text.encode("utf8")).hexdigest()
    if key in cache:
        return cache[key]
    with torch.no_grad():
        wav = model.infer(text)
    cache[key] = wav
    return wav


@app.route("/synthesize", methods=["POST"])
def synthesize():
    """Accept POST {"text": "..."} and respond with WAV audio."""
    # silent=True yields None instead of raising on a missing or invalid JSON body
    payload = request.get_json(silent=True) or {}
    text = payload.get("text", "")
    if not text:
        return Response("missing text", 400)
    wav = tts(text)
    return Response(wav, mimetype="audio/wav")


if __name__ == "__main__":
    # Single process, one request at a time, to avoid GIL contention
    app.run(host="0.0.0.0", port=8090, threaded=False)
```

Start command

docker compose up -d --build

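5. Performance Test: Before vs After Optimization

Test machine: Intel i5-8400 4C8T, 32 GB DDR4, Ubuntu 22.04
Tool: wrk with a Lua script that POSTs JSON in a loop, 20 Chinese characters per request.

Metric	Native bare run	Docker before optimization	Docker after optimization
Cold start	31 s	33 s	9 s
Peak CPU, single request	380%	390%	160%
Mean latency at 5 concurrent requests	5.2 s	5.7 s	1.9 s
Load Average	6.1	6.3	1.8
OOM kills after 3 h	2	3	0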

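Test script (`bench.py`)

```python
#!/usr/bin/env python3
"""Fire concurrent requests at the TTS service and report p50/p95 latency."""
import concurrent.futures
import statistics
import time

import requests

URL = "http://localhost:8090/synthesize"
TEXT = "你好，这是一条性能测试语音。" * 4


def once():
    """Send one request and return its latency in seconds."""
    t0 = time.perf_counter()
    resp = requests.post(URL, json={"text": TEXT}, timeout=30)
    resp.raise_for_status()
    return time.perf_counter() - t0


def main(n=20):
    """Run n requests across 5 worker threads, then print p50 and p95."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=5) as pool:
        lat = list(pool.map(lambda _: once(), range(n)))
    # quantiles(n=20) returns 19 cut points; index 18 is the 95th percentile
    print("p50=%.2fs p95=%.2fs" % (
        statistics.median(lat),
        statistics.quantiles(lat, n=20)[18]))


if __name__ == "__main__":
    main()
```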

Run it 3 times and average the results to reproduce the numbers.
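Averaging several rounds by hand gets tedious, so here is one way it could be scripted. This is only a sketch: it assumes each round's latencies (in seconds) have already been collected into a list, and the function names `summarize` and `average_rounds` are illustrative rather than part of `bench.py`:

```python
import statistics


def summarize(latencies):
    """Return (p50, p95) in seconds for one round of latencies."""
    p50 = statistics.median(latencies)
    # quantiles(n=20) yields 19 cut points; index 18 is the 95th percentile
    p95 = statistics.quantiles(latencies, n=20)[18]
    return p50, p95


def average_rounds(rounds):
    """Average the per-round p50 and p95 across several rounds."""
    summaries = [summarize(r) for r in rounds]
    mean_p50 = statistics.mean(s[0] for s in summaries)
    mean_p95 = statistics.mean(s[1] for s in summaries)
    return mean_p50, mean_p95


if __name__ == "__main__":
    # Made-up latencies purely to exercise the functions
    demo_rounds = [[1.0] * 20, [2.0] * 20, [3.0] * 20]
    p50, p95 = average_rounds(demo_rounds)
    print("mean p50=%.2fs mean p95=%.2fs" % (p50, p95))
```

Averaging percentiles across rounds is a coarse summary; pooling all latencies into one list before computing p95 is an equally reasonable choice when the rounds are the same size.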

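6. Pitfalls

Forgetting to limit memory → OOM Killer
Fix: set `memory: 3G` in the compose file and enable cgroup v2.
cpuset conflicts with numactl
Fix: bind both to the same node; never give the container cpuset 0-7 while numactl pins to 0-3.
/tmp on overlayfs → Numba cache stops working
Fix: mount a tmpfs at `/tmp/numba`. The cache then stays valid while the container runs; because a tmpfs mount is emptied when the container stops, use a named volume if the cache must survive restarts.
Flask `threaded=True` causes CPU jitter
Fix: a single process plus an external queue (Celery/RQ) is more stable.
Model file permission 0777 → mmap fails
Fix: `chmod 644`, and keep the container user's UID the same as the host's.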
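The cpuset/numactl mismatch is easy to catch before deployment with a small consistency check. A minimal sketch, assuming both settings are available as Linux-style CPU list strings (for a NUMA node, such a list can be read from `/sys/devices/system/node/node0/cpulist`); the function names are hypothetical:

```python
def parse_cpu_list(spec: str) -> set:
    """Parse a CPU list such as "0-3,8" into a set of CPU ids."""
    cpus = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-", 1)
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus


def binding_matches(cpuset: str, pinned: str) -> bool:
    """True when the container cpuset and the pinned CPUs are exactly the same set."""
    return parse_cpu_list(cpuset) == parse_cpu_list(pinned)


if __name__ == "__main__":
    print(binding_matches("0-3", "0,1,2,3"))  # same CPUs, written differently
    print(binding_matches("0-7", "0-3"))      # the mismatch described above
```

The check here is strict equality, which mirrors the advice to keep both settings on the same CPUs; a looser subset check would also be defensible if leaving some cpuset cores idle is acceptable.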

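7. Going Further: Squeezing Out More CPU Performance

Dynamic quantization: converting FP32 weights to INT8 cuts model size by 50% and inference latency by 35%; `torch.quantization.quantize_dynamic` in PyTorch 1.13 is all it takes.
Thread-pool affinity: run `taskset -c 0-3 gunicorn -k gthread` so the I/O threads are also pinned to the same node.
Batched synthesis: merging five 2 s sentences into one 10 s batch for inference can raise utilization by another 20%.
Use onnxruntime (CPU build) with the OpenVINO plugin; an official ChatTTS export script is already available, and latency can drop by another 15%.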
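For the batched-synthesis idea, the grouping step might look something like the following. It is only a sketch of the packing logic, not of the inference call: it assumes each sentence arrives with an estimated duration in seconds (how that estimate is obtained is left open), and `group_into_batches` is a name invented here:

```python
def group_into_batches(items, budget_s=10.0):
    """Greedily pack (text, est_seconds) pairs into batches of at most budget_s seconds.

    A single item longer than the budget still gets a batch of its own.
    """
    batches, current, used = [], [], 0.0
    for text, seconds in items:
        # Close the current batch when the next sentence would exceed the budget
        if current and used + seconds > budget_s:
            batches.append(current)
            current, used = [], 0.0
        current.append(text)
        used += seconds
    if current:
        batches.append(current)
    return batches


if __name__ == "__main__":
    # Seven hypothetical 2 s sentences against a 10 s budget
    sentences = [("sentence %d" % i, 2.0) for i in range(7)]
    for batch in group_into_batches(sentences):
        print(len(batch), batch)
```

Greedy packing keeps the original sentence order, which matters if the batch output has to be split back into per-request audio afterwards.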