How to Do Emotion Clustering? An Emotion2Vec+ Large + Scikit-learn Tutorial - 平芜编程栈

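1. What problem does emotion clustering actually solve?

Have you ever been in one of these situations? You have a few hundred customer-service call recordings and want a quick picture of how customer emotions are distributed. Or you have collected over a thousand short-video voice-overs and need to sort them automatically into voice styles such as "enthusiastic", "calm", and "anxious". Or you are analyzing the user experience of a voice assistant and want to uncover emotional patterns that users never state explicitly.

None of these can be solved by simply attaching a label: they require grouping similar emotional expressions into natural clusters. Traditional approaches rely on manual listening and hand-written rules, which are slow, subjective, and hard to scale.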

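Emotion2Vec+ Large addresses exactly this. Beyond a single-point verdict such as "happy 85%", it produces a fixed-length emotion embedding for every utterance (384 dimensions in the deployment used here; confirm this against your own embedding.npy, since the size depends on the model variant). The vector works like the utterance's "emotional DNA": it turns an abstract emotion into a numeric fingerprint that can be computed on, compared, and clustered.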
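To make "comparable" concrete, here is a minimal sketch that compares vectors with cosine similarity. The three vectors are synthetic stand-ins generated with NumPy, not real model output, and the helper name cosine_sim is only illustrative.

```python
import numpy as np


def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Synthetic stand-ins for three utterance embeddings (not real model output)
rng = np.random.default_rng(0)
base = rng.normal(size=384)
similar = base + 0.1 * rng.normal(size=384)   # small perturbation of `base`
unrelated = rng.normal(size=384)              # independent random vector

sim_close = cosine_sim(base, similar)
sim_far = cosine_sim(base, unrelated)
print(f"base vs similar:   {sim_close:.3f}")
print(f"base vs unrelated: {sim_far:.3f}")
```

Two utterances with a similar emotional delivery are expected to behave like the first pair, and clustering exploits exactly that kind of closeness.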

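This article skips theoretical derivations and parameter lists. Instead, it walks you through one complete pipeline from scratch:
Start an already deployed Emotion2Vec+ Large WebUI
Extract the audio embedding.npy files in batch
Run KMeans clustering with scikit-learn, plus visual analysis
Interpret the clusters and trace them back to typical audio samples

No GPU is needed, and the only extra package is umap-learn for the visualization in section 3.3 (NumPy, scikit-learn, and matplotlib are assumed to be installed already). All code can be copied and run as-is. Even a beginner can get a first emotion clustering plot in about 20 minutes.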

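2. Preparation: start the system and extract embeddings in batch

2.1 Quickly verify that the environment is ready

Open a terminal and run the startup command (check that the path matches your deployment):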

/bin/bash /root/run.sh

Wait until the terminal prints something like Running on local URL: http://localhost:7860, then open the following address in your browser:

http://localhost:7860

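If the WebUI appears, the service is running normally. Note that the script in section 2.3 is only semi-automated: it collects the embeddings the WebUI has already written to outputs/, so each sample still has to go through the WebUI once. Full automation requires calling the API instead.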

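2.2 Build a test audio set (5–10 clips are enough)

Prepare a set of speech samples that differ from each other, ideally covering different emotional tendencies, for example:

A sales call (fast speech, rising pitch → probably leans "happy" or "surprised")
A complaint recording (slow speech, many pauses, falling pitch → probably leans "sad" or "angry")
A product introduction (steady, clear, even rhythm → probably leans "neutral")
A surprised reaction ("Wow! Really?" → clearly "surprised")
A tired reply ("Mm… okay… got it…" → probably "neutral" mixed with "sad")

Put these audio files in one local directory, for example ./audio_samples/. Supported formats are WAV/MP3/M4A/FLAC/OGG.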

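Tip: for a first test, use the WebUI's "Load example audio" feature. After recognition, open outputs/ and inspect the generated embedding.npy to confirm that its shape is (384,), which the clustering steps below depend on.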
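The shape check in the tip above can be sketched in a few lines. The helper name check_embedding is hypothetical, and the demo writes a synthetic file to a temporary directory; in practice you would pass the path of the real embedding.npy under outputs/.

```python
import tempfile
from pathlib import Path

import numpy as np


def check_embedding(path, expected_dim):
    """Load a .npy file and report whether it is a 1-D vector of expected_dim."""
    vec = np.load(path)
    ok = vec.ndim == 1 and vec.shape[0] == expected_dim
    return ok, vec.shape


# Demo with a synthetic file standing in for the WebUI's embedding.npy
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "embedding.npy"
    np.save(demo_path, np.zeros(384, dtype=np.float32))
    ok_384, shape = check_embedding(demo_path, expected_dim=384)
    ok_768, _ = check_embedding(demo_path, expected_dim=768)

print(ok_384, shape, ok_768)
```

If the real file reports a different shape, use that dimension everywhere this tutorial writes 384.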

2.3 Write the batch extraction script (Python)

Create extract_embeddings.py with the following content (adapted to the Emotion2Vec+ Large output layout):

    import os
    from pathlib import Path

    import numpy as np

    # Paths
    AUDIO_DIR = Path("./audio_samples")
    RAW_DIR = Path("./outputs")        # where the WebUI writes outputs_*/embedding.npy
    OUTPUT_DIR = Path("./embeddings")
    AUDIO_EXTS = {".wav", ".mp3", ".m4a", ".flac", ".ogg"}

    # Create the output directory
    OUTPUT_DIR.mkdir(exist_ok=True)

    # Semi-automated: this script does not call the WebUI. It ASSUMES that every
    # audio file was already processed in the WebUI, one at a time and in
    # file-name order, so the i-th oldest outputs_* directory belongs to the
    # i-th audio file. For production, upload through the API (for example with
    # requests + multipart/form-data) or a batch script instead.
    audio_files = sorted(p for p in AUDIO_DIR.glob("*.*") if p.suffix.lower() in AUDIO_EXTS)
    raw_outputs = sorted(RAW_DIR.glob("outputs_*"), key=os.path.getmtime)

    if not raw_outputs:
        print("No recognition results found under outputs/")
    elif len(raw_outputs) != len(audio_files):
        print(f"Found {len(audio_files)} audio files but {len(raw_outputs)} result "
              "directories; cannot pair them reliably")
    else:
        for audio_file, out_dir in zip(audio_files, raw_outputs):
            print(f"Processing: {audio_file.name}")
            emb_path = out_dir / "embedding.npy"
            if not emb_path.exists():
                print(f"  embedding.npy not found in {out_dir}")
                continue
            try:
                emb = np.load(emb_path)
                # Save under a standard name: original file stem + _emb.npy
                save_name = f"{audio_file.stem}_emb.npy"
                np.save(OUTPUT_DIR / save_name, emb)
                print(f"  ✓ Saved: {save_name} (shape: {emb.shape})")
            except Exception as e:
                print(f"  ✗ Failed to load {emb_path}: {e}")

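Key notes:

The script above is a "semi-automated" approach, suited to quickly validating the pipeline.
For full automation, see the official ModelScope SDK, or use requests to simulate the WebUI form submission (covered later).
Every .npy file must be a one-dimensional vector of length 384 (the output dimension this tutorial assumes for Emotion2Vec+ Large). If your files report a different shape, use that value consistently in the checks below.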
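Because the semi-automated route matches files by hand or by timestamp, it is easy to end up with the same vector saved under two different names. A small sanity check, sketched here on synthetic vectors, can catch that before clustering. The function name find_duplicate_embeddings and the sample names are illustrative only.

```python
from itertools import combinations

import numpy as np


def find_duplicate_embeddings(named_vectors, atol=1e-8):
    """Return pairs of names whose vectors are numerically identical."""
    duplicates = []
    for (name_a, vec_a), (name_b, vec_b) in combinations(named_vectors.items(), 2):
        if vec_a.shape == vec_b.shape and np.allclose(vec_a, vec_b, atol=atol):
            duplicates.append((name_a, name_b))
    return duplicates


# Synthetic example: "complaint" is accidentally a copy of "sales_call"
rng = np.random.default_rng(1)
vectors = {
    "sales_call": rng.normal(size=384),
    "product_intro": rng.normal(size=384),
}
vectors["complaint"] = vectors["sales_call"].copy()

dupes = find_duplicate_embeddings(vectors)
print(dupes)
```

Any pair reported here most likely means one audio file was matched to the wrong result directory.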

2.4 Assemble the embedding dataset

After running the script, you will have a file layout like this:

    ./embeddings/
    ├── sales_call_emb.npy      # (384,)
    ├── complaint_emb.npy       # (384,)
    ├── product_intro_emb.npy   # (384,)
    └── ...

Now merge these vectors into one two-dimensional array to use as the clustering input:

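    from pathlib import Path

    import numpy as np

    EMB_DIM = 384  # dimension assumed in this tutorial; change it if your embedding.npy differs

    emb_files = sorted(Path("./embeddings").glob("*_emb.npy"))
    if not emb_files:
        raise FileNotFoundError("No embedding files found; run the extraction script first")

    # Load all embeddings
    embeddings = []
    file_names = []
    for f in emb_files:
        try:
            vec = np.load(f)
        except Exception as e:
            print(f"Skipping {f.name}: failed to load ({e})")
            continue
        if vec.shape == (EMB_DIM,):  # strict shape check
            embeddings.append(vec)
            file_names.append(f.stem[:-len("_emb")])  # drop only the trailing "_emb"
        else:
            print(f"Skipping {f.name}: shape mismatch, expected ({EMB_DIM},), got {vec.shape}")

    if not embeddings:
        raise ValueError("No embedding passed the shape check")

    X = np.vstack(embeddings)  # shape: (n_samples, EMB_DIM)
    print(f"Loaded {len(X)} embeddings, data shape: {X.shape}")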

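3. Emotion clustering in practice: KMeans + visual analysis

3.1 Why KMeans rather than DBSCAN or HDBSCAN?

Fast: at 384 dimensions × a few hundred samples, KMeans finishes in seconds
Easy to interpret: every cluster has an explicit center, so you can look up its "most typical sample"
Aligned with the business: stakeholders usually care about "a handful of dominant emotions", not endlessly fine subdivisions
When it does not fit: if you expect a lot of noise (silent segments, invalid recordings), consider DBSCAN instead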
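To illustrate the last point, the sketch below runs scikit-learn's DBSCAN on toy data with one far-away point standing in for an invalid recording. The data is synthetic, and the eps and min_samples values are picked for this toy example only; real embeddings would need their own tuning.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# Two tight synthetic groups plus one distant point (the "invalid recording")
group_a = rng.normal(loc=0.0, scale=0.1, size=(10, 8))
group_b = rng.normal(loc=5.0, scale=0.1, size=(10, 8))
outlier = np.full((1, 8), 50.0)
X_noise_demo = np.vstack([group_a, group_b, outlier])

# eps and min_samples are illustrative values chosen for this toy data
db_labels = DBSCAN(eps=1.0, min_samples=3).fit_predict(X_noise_demo)
print(db_labels)
```

DBSCAN marks points that belong to no dense region with the label -1, whereas KMeans would force the outlier into one of its K clusters and pull that cluster's center toward it.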

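We start with K=3 (a common emotion grouping: positive / neutral / negative) and can tune it later using the silhouette score.

3.2 Standardization + clustering (5 core lines)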
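The tuning step mentioned above can be sketched as a simple sweep over candidate K values, keeping the one with the highest silhouette score. The helper name best_k_by_silhouette is hypothetical and the data is synthetic (three well-separated groups), so treat this as an outline rather than a recipe for real embeddings.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def best_k_by_silhouette(X, k_values):
    """Fit KMeans for each K and return (best_k, {k: silhouette score})."""
    scores = {}
    for k in k_values:
        labels_k = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X)
        scores[k] = silhouette_score(X, labels_k)
    return max(scores, key=scores.get), scores


# Synthetic data: three well-separated groups in 16 dimensions (not real embeddings)
rng = np.random.default_rng(0)
X_k_demo = np.vstack(
    [rng.normal(loc=c, scale=0.5, size=(30, 16)) for c in (-10.0, 0.0, 10.0)]
)

best_k, k_scores = best_k_by_silhouette(X_k_demo, k_values=range(2, 7))
for k, s in k_scores.items():
    print(f"K={k}: silhouette={s:.3f}")
print("best K:", best_k)
```

With only 5–10 clips the scores are noisy, so a sweep like this becomes informative mainly once you have a few dozen samples or more.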

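    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score
    from sklearn.preprocessing import StandardScaler

    # 1. Standardize (important: keeps large-variance dimensions from dominating the distances)
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)

    # 2. KMeans clustering (n_init="auto" requires scikit-learn >= 1.2; use n_init=10 on older versions)
    kmeans = KMeans(n_clusters=3, random_state=42, n_init="auto")
    labels = kmeans.fit_predict(X_scaled)

    # 3. Evaluate clustering quality (requires more samples than clusters)
    silhouette_avg = silhouette_score(X_scaled, labels)
    print(f"Silhouette score at K=3: {silhouette_avg:.3f} (closer to 1 is better)")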

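How to read the silhouette score:
> 0.7: excellent clustering
0.5–0.7: reasonable
0.25–0.5: weak structure; treat the clusters with caution
< 0.25: K is probably a poor fit; try K=2 or K=4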

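3.3 Dimensionality reduction for visualization: UMAP instead of PCA (better at preserving local structure)

PCA is a linear projection and tends to blur cluster boundaries in high-dimensional emotion embeddings. UMAP usually preserves local neighborhood structure better, which makes it a good choice for visualizing emotion clusters. It needs one extra package: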

pip install umap-learn

    import matplotlib.pyplot as plt
    import numpy as np
    import umap

    # Reduce to 2D with UMAP; keep n_neighbors below the sample count on tiny datasets
    n_neighbors = min(5, len(X_scaled) - 1)
    reducer = umap.UMAP(n_components=2, random_state=42, n_neighbors=n_neighbors, min_dist=0.1)
    X_umap = reducer.fit_transform(X_scaled)

    # Plot
    plt.figure(figsize=(10, 8))
    scatter = plt.scatter(X_umap[:, 0], X_umap[:, 1], c=labels, cmap="tab10", s=100, alpha=0.8)
    plt.colorbar(scatter, ticks=np.unique(labels), label="Cluster label")
    plt.title("Emotion2Vec+ Large embeddings: UMAP view of the clusters", fontsize=14, fontweight="bold")
    plt.xlabel("UMAP1")
    plt.ylabel("UMAP2")
    plt.grid(True, alpha=0.3)

    # Annotate sample names (first 10 only, to avoid overlap)
    for i, name in enumerate(file_names[:10]):
        plt.annotate(name[:8] + "..", (X_umap[i, 0], X_umap[i, 1]),
                     xytext=(5, 5), textcoords="offset points", fontsize=9, ha="left")

    plt.tight_layout()
    plt.savefig("emotion_clustering_umap.png", dpi=300, bbox_inches="tight")
    plt.show()

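You will get a scatter plot in which each point is one utterance and its color marks the cluster it was assigned to. If the clusters are clearly separated, the embeddings carry strong discriminative information about emotion. Keep in mind that UMAP distorts global distances, so read the plot for grouping rather than for exact distances.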
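For comparison, a PCA baseline needs no extra package, and its explained-variance ratio gives a rough sense of how much a 2D linear projection keeps. The sketch below uses a random matrix as a stand-in for X_scaled, so the printed number says nothing about real embeddings.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for X_scaled: 20 samples, 384 dimensions
rng = np.random.default_rng(0)
X_pca_demo = rng.normal(size=(20, 384))

pca = PCA(n_components=2, random_state=42)
X_pca = pca.fit_transform(X_pca_demo)

# Fraction of the total variance retained by the two plotted axes
retained = float(pca.explained_variance_ratio_.sum())
print(X_pca.shape, f"variance retained: {retained:.1%}")
```

When the retained fraction is small, a 2D PCA plot hides most of the structure, which is one reason to look at a nonlinear method such as UMAP as well.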

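3.4 Deeper interpretation: which emotion does each cluster represent?

The labels alone (0/1/2) mean nothing. We need to explain each cluster in reverse, using the original emotion scores: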

    import json
    import os

    import numpy as np

    # Load each sample's result JSON to recover the original emotion scores
    cluster_profiles = {}
    for cluster_id in np.unique(labels):
        cluster_files = [name for name, lab in zip(file_names, labels) if lab == cluster_id]

        # Assumes each result JSON shares the embedding's base name and lives in ./results/
        scores_list = []
        emotion_names = None
        for fname in cluster_files:
            json_path = f"./results/{fname}.json"
            if os.path.exists(json_path):
                with open(json_path, "r", encoding="utf-8") as f:
                    data = json.load(f)
                if emotion_names is None:
                    emotion_names = list(data["scores"].keys())
                # Read scores in a fixed key order so columns line up across files
                scores_list.append([data["scores"][name] for name in emotion_names])

        if scores_list:
            avg_scores = np.mean(scores_list, axis=0)
            dominant_emotion = emotion_names[int(np.argmax(avg_scores))]
            cluster_profiles[int(cluster_id)] = {
                "dominant": dominant_emotion,
                "avg_scores": {n: round(float(s), 3) for n, s in zip(emotion_names, avg_scores)},
            }

    for cid, profile in cluster_profiles.items():
        print(f"\nCluster {cid} dominant emotion: '{profile['dominant']}'")
        print("  Average scores:", profile["avg_scores"])

Example output:

    Cluster 0 dominant emotion: 'happy'
      Average scores: {'angry': 0.021, 'disgusted': 0.015, 'fearful': 0.032, 'happy': 0.712, ...}

    Cluster 1 dominant emotion: 'sad'
      Average scores: {'angry': 0.043, 'disgusted': 0.028, 'fearful': 0.051, 'happy': 0.019, 'sad': 0.624, ...}

Now you know not only "which clips are similar" but also "which emotional traits they share".
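Averages can hide a mixed cluster, so it may also help to count how often each emotion is the top-scoring one per sample. The sketch below does that on hypothetical scores shaped like the 'scores' field used above; the function name top_emotion_counts and all numbers are made up for illustration.

```python
from collections import Counter, defaultdict


def top_emotion_counts(cluster_labels, sample_scores):
    """Count, per cluster, how often each emotion is a sample's highest-scoring one."""
    counts = defaultdict(Counter)
    for cluster_id, scores in zip(cluster_labels, sample_scores):
        top_emotion = max(scores, key=scores.get)
        counts[cluster_id][top_emotion] += 1
    return {cid: dict(c) for cid, c in counts.items()}


# Hypothetical per-sample scores (illustrative values only)
demo_labels = [0, 0, 1, 1, 1]
demo_scores = [
    {"happy": 0.80, "sad": 0.05, "neutral": 0.15},
    {"happy": 0.60, "sad": 0.10, "neutral": 0.30},
    {"happy": 0.05, "sad": 0.70, "neutral": 0.25},
    {"happy": 0.10, "sad": 0.30, "neutral": 0.60},
    {"happy": 0.02, "sad": 0.90, "neutral": 0.08},
]

emotion_counts = top_emotion_counts(demo_labels, demo_scores)
print(emotion_counts)
```

A cluster whose counts are split across several emotions is a hint that K is too small or that the cluster captures something other than emotion, such as speaker or recording conditions.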

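4. Going further: putting the clustering results to work

4.1 Find each cluster's "spokesperson" audio

Business colleagues do not want to read numbers. They want to hear: "Which clip best represents the 'anxious customer'?"
Use cosine similarity to find the sample closest to each cluster center: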

    import numpy as np
    from sklearn.metrics.pairwise import cosine_similarity

    # Cluster centers (in the standardized space)
    centers_scaled = kmeans.cluster_centers_

    typical_samples = {}  # cluster id -> name of its most typical sample
    for cid in range(len(centers_scaled)):
        center = centers_scaled[cid].reshape(1, -1)

        # Cosine similarity between every sample in this cluster and the center
        cluster_mask = (labels == cid)
        cluster_X = X_scaled[cluster_mask]
        similarities = cosine_similarity(cluster_X, center).flatten()

        # Index of the most similar sample (within cluster_X)
        top_idx_in_cluster = np.argmax(similarities)

        # Map back to the global index
        global_indices = np.where(cluster_mask)[0]
        top_global_idx = global_indices[top_idx_in_cluster]
        typical_samples[cid] = file_names[top_global_idx]

        print(f"🏆 Cluster {cid} most typical sample: {file_names[top_global_idx]} "
              f"(similarity: {similarities[top_idx_in_cluster]:.3f})")
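KMeans itself minimizes Euclidean distance, so an alternative is to rank samples by that same distance; the two criteria usually agree but are not guaranteed to. Below is a minimal sketch using KMeans.transform on toy 2D data in which one point of each group sits exactly at the group mean. The helper name nearest_to_centers is hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans


def nearest_to_centers(kmeans_model, X, labels):
    """For each cluster, return the global index of the member closest to its center."""
    distances = kmeans_model.transform(X)  # shape (n_samples, n_clusters), Euclidean
    result = {}
    for cid in range(kmeans_model.n_clusters):
        members = np.where(labels == cid)[0]
        result[cid] = int(members[np.argmin(distances[members, cid])])
    return result


# Toy data: indices 0 and 5 sit exactly at the mean of their group
offsets = np.array([[0, 0], [1, 0], [-1, 0], [0, 1], [0, -1]], dtype=float)
X_center_demo = np.vstack([offsets, offsets + 10.0])

km_demo = KMeans(n_clusters=2, random_state=42, n_init=10).fit(X_center_demo)
typical_idx = nearest_to_centers(km_demo, X_center_demo, km_demo.labels_)
print(typical_idx)
```

Which criterion to use is a judgment call: cosine similarity ignores vector length, while Euclidean distance matches what the clustering actually optimized.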

4.2 Automated report generation (Markdown)

Write the results into a briefing you can hand over:

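    from datetime import datetime

    import numpy as np
    from sklearn.metrics.pairwise import cosine_similarity


    def typical_sample(cid):
        """Name of the sample in cluster `cid` most cosine-similar to its center."""
        idx = np.where(labels == cid)[0]
        center = kmeans.cluster_centers_[cid].reshape(1, -1)
        sims = cosine_similarity(X_scaled[idx], center).ravel()
        return file_names[idx[np.argmax(sims)]]


    with open("emotion_clustering_report.md", "w", encoding="utf-8") as f:
        f.write("# Emotion Clustering Report\n\n")
        f.write(f"Analyzed {len(X)} utterances, grouped into {len(np.unique(labels))} clusters.\n\n")

        for cid, profile in cluster_profiles.items():
            f.write(f"## Cluster {cid}: dominated by {profile['dominant']}\n")
            # Look up the typical sample per cluster, not a value left over from an earlier loop
            f.write(f"- Typical sample: `{typical_sample(cid)}`\n")
            f.write(f"- Main emotion scores: {profile['avg_scores']}\n\n")

        f.write("![Cluster visualization](emotion_clustering_umap.png)\n")
        f.write("\n> Report generated at: " + datetime.now().strftime("%Y-%m-%d %H:%M:%S"))

    print("Report saved to emotion_clustering_report.md")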

4.3 Integrating with business systems (a lightweight API)

Wrap the clustering capability in a function that other systems can call:

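    def get_audio_cluster(audio_path: str) -> dict:
        """
        Given an audio path, return the cluster it belongs to and an explanation.
        """
        # Steps:
        # 1. Call the WebUI API to extract the embedding
        # 2. Load the previously fitted scaler and KMeans model
        # 3. Predict and return a structured result
        raise NotImplementedError("Implementation omitted; the core logic is the same as above")


    # Example call (once the function is implemented):
    # result = get_audio_cluster("./audio_samples/complaint.mp3")
    # print(result)
    # Intended result format: {"cluster_id": 1, "dominant_emotion": "sad", "similarity": 0.87}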
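Steps 2 and 3 of that outline can be sketched without the WebUI by starting from an embedding that has already been extracted. Everything here is an assumption made for illustration: the function name assign_cluster, the optional cluster_names mapping, and the synthetic training data standing in for the real X.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler


def assign_cluster(vec, scaler, kmeans_model, cluster_names=None):
    """Map one embedding to a cluster using an already fitted scaler and KMeans."""
    x = scaler.transform(np.asarray(vec, dtype=float).reshape(1, -1))
    cluster_id = int(kmeans_model.predict(x)[0])
    center = kmeans_model.cluster_centers_[cluster_id]
    # Cosine similarity between the standardized vector and its cluster center
    similarity = float(np.dot(x[0], center) / (np.linalg.norm(x[0]) * np.linalg.norm(center)))
    result = {"cluster_id": cluster_id, "similarity": round(similarity, 3)}
    if cluster_names is not None:
        result["dominant_emotion"] = cluster_names.get(cluster_id)
    return result


# Synthetic training data standing in for the stacked embeddings
rng = np.random.default_rng(0)
X_train_demo = np.vstack(
    [rng.normal(loc=c, scale=0.5, size=(20, 16)) for c in (-5.0, 5.0)]
)
demo_scaler = StandardScaler().fit(X_train_demo)
demo_km = KMeans(n_clusters=2, random_state=42, n_init=10).fit(
    demo_scaler.transform(X_train_demo)
)

new_vec = rng.normal(loc=5.0, scale=0.5, size=16)  # resembles the second group
assigned = assign_cluster(new_vec, demo_scaler, demo_km)
print(assigned)
```

In a real service the fitted scaler and KMeans model would be saved once (for example with joblib.dump) and loaded at startup with joblib.load, and cluster_names could be built from the dominant emotions computed in section 3.4.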