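GLM-4-9B-Chat-1M Dialogue Log Analysis: Building an Evaluation System with Python

1. Introduction

Have you run into this situation: you have deployed a large language model, users have generated a huge volume of conversation records, yet you have no idea how good those conversations are, what topics users are actually discussing, or in which scenarios the model performs poorly?

GLM-4-9B-Chat-1M is an open-source large language model that supports a context length of up to one million tokens, and real deployments generate large amounts of dialogue data. That data is a gold mine: it holds information about user needs, model performance, and trending topics. But how do you extract that value from such a large volume of logs?

This article walks through building a complete dialogue log analysis system in Python, one that automatically evaluates dialogue quality, mines hot topics, and analyzes user intent, turning raw data into insight.

2. Environment Setup and Understanding the Data

2.1 Required Tools and Libraries

First, make sure these basic libraries are installed in your Python environment: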

pip install pandas numpy matplotlib seaborn scikit-learn jieba wordcloud

For more advanced text analysis, you can also install:

pip install transformers sentence-transformers umap-learn

2.2 Dialogue Log Data Structure

A dialogue log for a GLM-4-9B-Chat-1M deployment typically contains the following fields:

# Example of a typical dialogue log record
sample_log = {
    "timestamp": "2024-01-15 10:30:25",
    "session_id": "sess_123456",
    "user_input": "请问如何用Python处理大数据？",  # "How do I process big data with Python?"
    "model_response": "处理大数据时可以使用PySpark或Dask等分布式计算框架...",
    "response_time": 2.5,  # response time in seconds
    "tokens_used": 150,    # number of tokens consumed
    "rating": 4            # user rating, if available
}

In a real project, logs may be stored as JSON files, database records, or CSV. We assume you already have such a data source.
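Because logs can arrive in different shapes, it may help to validate records before analysis. The following is a minimal sketch, assuming the logs are exported as JSON Lines (one JSON object per line) and use the field names from the sample record in section 2.2. The names `parse_log_lines` and `REQUIRED_FIELDS` are hypothetical helpers introduced here, not part of any library.

```python
import json

# Fields every record is assumed to carry (taken from the sample record in section 2.2).
REQUIRED_FIELDS = ("timestamp", "session_id", "user_input", "model_response")


def parse_log_lines(lines):
    """Parse JSON Lines text into records, skipping malformed or incomplete rows.

    Returns a tuple (records, skipped_count).
    """
    records, skipped = [], 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # blank lines are ignored, not counted as skipped
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            skipped += 1
            continue
        if not isinstance(record, dict) or any(f not in record for f in REQUIRED_FIELDS):
            skipped += 1
            continue
        records.append(record)
    return records, skipped


raw = [
    '{"timestamp": "2024-01-15 10:30:25", "session_id": "sess_1", '
    '"user_input": "hi", "model_response": "hello", "rating": 4}',
    '{"timestamp": "2024-01-15 10:31:00", "session_id": "sess_1"}',  # missing fields
    'not json',
]
records, skipped = parse_log_lines(raw)
print(len(records), "valid,", skipped, "skipped")
```

To read from a file, pass the open file object directly, since iterating over it yields one line at a time.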

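3. Implementing the Core Analysis Features

3.1 Dialogue Quality Evaluation Module

Dialogue quality evaluation is the core of the analysis system. We can evaluate it along several dimensions: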

import jieba
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


class DialogueQualityEvaluator:
    def __init__(self):
        # Keyword lists stay in Chinese because they are matched against Chinese log text.
        self.quality_keywords = {
            'helpful': ['有用', '帮助', '解决', '感谢', '明白了'],
            'unhelpful': ['不懂', '不对', '错误', '没用', '重新问'],
            'detailed': ['详细', '具体', '例子', '步骤', '说明'],
            'vague': ['简单', '大概', '不清楚', '模糊']
        }

    def calculate_relevance_score(self, query, response):
        """Compute the TF-IDF cosine similarity between the query and the response."""
        # Tokenize with jieba: the default token pattern does not segment Chinese text.
        vectorizer = TfidfVectorizer(tokenizer=jieba.lcut, token_pattern=None)
        try:
            tfidf_matrix = vectorizer.fit_transform([query, response])
        except ValueError:
            # Raised when no vocabulary can be built, e.g. both texts are empty.
            return 0.0
        similarity = cosine_similarity(tfidf_matrix[0:1], tfidf_matrix[1:2])[0][0]
        return round(float(similarity), 2)

    def analyze_response_quality(self, response):
        """Score keyword-based quality signals and length appropriateness."""
        quality_scores = {}
        response_lower = response.lower()

        # Fraction of each category's keywords that appear in the response
        for category, keywords in self.quality_keywords.items():
            score = sum(1 for keyword in keywords if keyword in response_lower)
            quality_scores[category] = min(score / len(keywords), 1.0)

        # Length score: a moderate length is better, with 100 words as the ideal.
        # Count jieba tokens, since str.split() does not segment Chinese text.
        word_count = len(jieba.lcut(response))
        length_score = 1 - abs(word_count - 100) / 200
        quality_scores['length_appropriate'] = max(0, min(length_score, 1))

        return quality_scores


# Usage example
evaluator = DialogueQualityEvaluator()
sample_response = "处理大数据时可以使用PySpark，它是一个强大的分布式计算框架。首先需要安装PySpark，然后创建SparkSession，接着就可以使用DataFrame API来处理数据了。"
quality_scores = evaluator.analyze_response_quality(sample_response)
print("Quality scores:", quality_scores)

3.2 Hot Topic Mining

Automatically discover the topics users care about most across a large volume of dialogues:

from collections import Counter

import jieba
import matplotlib.pyplot as plt
from wordcloud import WordCloud


class TopicAnalyzer:
    def __init__(self):
        # A small Chinese stopword list; extend it for real data
        self.stopwords = set(['的', '了', '在', '是', '我', '有', '和', '就', '不', '人', '都',
                              '一', '一个', '上', '也', '很', '吗', '可以', '如何', '怎么',
                              '什么', '为什么'])

    def extract_topics(self, dialogues, top_n=10):
        """Extract the most frequent keywords as hot topics."""
        all_text = ' '.join([d['user_input'] for d in dialogues])

        # Segment Chinese text with jieba
        words = jieba.cut(all_text)
        filtered_words = [word for word in words
                          if len(word) > 1 and word not in self.stopwords]

        # Count word frequencies
        word_freq = Counter(filtered_words)
        return word_freq.most_common(top_n)

    def generate_wordcloud(self, dialogues, output_path='wordcloud.png',
                           font_path='SimHei.ttf'):
        """Render a word cloud image; font_path must point to a font with CJK glyphs."""
        all_text = ' '.join([d['user_input'] for d in dialogues])
        words = jieba.cut(all_text)
        filtered_text = ' '.join([word for word in words
                                  if len(word) > 1 and word not in self.stopwords])

        wordcloud = WordCloud(
            font_path=font_path,  # the font file must exist on your system
            width=800,
            height=600,
            background_color='white'
        ).generate(filtered_text)

        plt.figure(figsize=(10, 8))
        plt.imshow(wordcloud, interpolation='bilinear')
        plt.axis('off')
        plt.savefig(output_path, dpi=300, bbox_inches='tight')
        plt.close()


# Usage example: `dialogues` is a list of log records shaped like sample_log
dialogues = [
    {"user_input": "请问如何用Python处理大数据？"},
    {"user_input": "Python数据处理有哪些常用库？"},
]
analyzer = TopicAnalyzer()
hot_topics = analyzer.extract_topics(dialogues)
print("Hot topics:", hot_topics)
analyzer.generate_wordcloud(dialogues)  # requires the font file at font_path
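Hot topics shift over time, so it can be useful to break keyword counts down by day. The sketch below assumes timestamps follow the `YYYY-MM-DD HH:MM:SS` format of the sample record in section 2.2. The function `daily_keyword_counts` and its `tokenize` parameter are hypothetical names introduced here. The default tokenizer splits on whitespace to keep the example dependency-free; for Chinese logs you would pass `jieba.lcut` instead.

```python
from collections import Counter, defaultdict
from datetime import datetime


def daily_keyword_counts(dialogues, tokenize=str.split, stopwords=frozenset()):
    """Count keyword frequency per calendar day.

    Returns a dict mapping datetime.date to a Counter of keywords.
    """
    per_day = defaultdict(Counter)
    for d in dialogues:
        day = datetime.strptime(d["timestamp"], "%Y-%m-%d %H:%M:%S").date()
        # Same filter as the article's topic analyzer: drop one-character tokens and stopwords.
        words = [w for w in tokenize(d["user_input"]) if len(w) > 1 and w not in stopwords]
        per_day[day].update(words)
    return dict(per_day)


logs = [
    {"timestamp": "2024-01-15 10:30:25", "user_input": "python pandas merge"},
    {"timestamp": "2024-01-15 11:02:10", "user_input": "pandas groupby"},
    {"timestamp": "2024-01-16 09:15:00", "user_input": "spark cluster setup"},
]
trend = daily_keyword_counts(logs)
for day, counts in sorted(trend.items()):
    print(day, counts.most_common(2))
```

Comparing the top keywords of consecutive days gives a rough view of which topics are rising or fading.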

3.3 User Intent Analysis

Understand the real intent behind each user query:

class IntentAnalyzer:
    def __init__(self):
        # Keyword lists stay in Chinese because they are matched against Chinese queries.
        self.intent_categories = {
            'information': ['是什么', '什么是', '介绍', '解释', '定义'],
            'howto': ['怎么', '如何', '步骤', '方法', '教程'],
            'troubleshooting': ['错误', '问题', '解决', '修复', '无法'],
            'comparison': ['区别', '对比', '哪个好', '优缺点'],
            'opinion': ['觉得', '看法', '观点', '建议', '推荐']
        }

    def classify_intent(self, query):
        """Classify the user's intent by keyword matching."""
        query_lower = query.lower()
        intent_scores = {}

        for intent, keywords in self.intent_categories.items():
            score = sum(1 for keyword in keywords if keyword in query_lower)
            intent_scores[intent] = score

        # Return the highest-scoring intent, or 'other' if no keyword matched.
        # On a tie, the intent listed first in intent_categories wins.
        if sum(intent_scores.values()) == 0:
            return 'other'
        else:
            return max(intent_scores.items(), key=lambda x: x[1])[0]

    def analyze_intent_patterns(self, dialogues):
        """Count how many dialogues fall into each intent category."""
        intent_counts = {}
        for dialogue in dialogues:
            intent = self.classify_intent(dialogue['user_input'])
            intent_counts[intent] = intent_counts.get(intent, 0) + 1
        return intent_counts


# Usage example: `dialogues` is a list of log records shaped like sample_log
intent_analyzer = IntentAnalyzer()
intent_distribution = intent_analyzer.analyze_intent_patterns(dialogues)
print("Intent distribution:", intent_distribution)

4. Assembling the Complete System

Now we combine the modules into a complete analysis system:

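import html
import json

import pandas as pd


class DialogueAnalysisSystem:
    def __init__(self):
        self.quality_evaluator = DialogueQualityEvaluator()
        self.topic_analyzer = TopicAnalyzer()
        self.intent_analyzer = IntentAnalyzer()
        self.dialogues = []

    def load_data(self, data_path, file_format='json'):
        """Load dialogue records from a JSON or CSV file."""
        if file_format == 'json':
            with open(data_path, 'r', encoding='utf-8') as f:
                self.dialogues = json.load(f)
        elif file_format == 'csv':
            self.dialogues = pd.read_csv(data_path).to_dict('records')
        else:
            raise ValueError(f"Unsupported file format: {file_format}")
        print(f"Loaded {len(self.dialogues)} dialogue records")

    def run_complete_analysis(self):
        """Run quality, topic, and intent analysis over the loaded dialogues."""
        print("Starting dialogue log analysis...")

        # Quality evaluation
        print("\n1. Dialogue quality evaluation")
        quality_results = []
        for dialogue in self.dialogues:
            scores = self.quality_evaluator.analyze_response_quality(dialogue['model_response'])
            relevance = self.quality_evaluator.calculate_relevance_score(
                dialogue['user_input'], dialogue['model_response']
            )
            scores['relevance'] = relevance
            quality_results.append(scores)

        # Topic analysis
        print("2. Hot topic mining")
        hot_topics = self.topic_analyzer.extract_topics(self.dialogues)
        self.topic_analyzer.generate_wordcloud(self.dialogues)

        # Intent analysis
        print("3. User intent analysis")
        intent_distribution = self.intent_analyzer.analyze_intent_patterns(self.dialogues)

        return {
            'quality_analysis': quality_results,
            'hot_topics': hot_topics,
            'intent_distribution': intent_distribution
        }

    def generate_report(self, analysis_results, output_path='analysis_report.html'):
        """Write a minimal HTML report; extend it with charts and key findings."""
        topics = ''.join(
            f"<li>{html.escape(word)}: {count}</li>"
            for word, count in analysis_results['hot_topics']
        )
        intents = ''.join(
            f"<li>{html.escape(intent)}: {count}</li>"
            for intent, count in analysis_results['intent_distribution'].items()
        )
        page = (
            "<html><head><meta charset='utf-8'><title>Dialogue Log Analysis</title></head>"
            f"<body><h1>Hot topics</h1><ul>{topics}</ul>"
            f"<h1>Intent distribution</h1><ul>{intents}</ul></body></html>"
        )
        with open(output_path, 'w', encoding='utf-8') as f:
            f.write(page)
        print(f"Analysis report written to: {output_path}")


# Using the complete system
analysis_system = DialogueAnalysisSystem()
analysis_system.load_data('dialogues.json')
results = analysis_system.run_complete_analysis()
analysis_system.generate_report(results)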
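The `quality_analysis` entry returned by `run_complete_analysis` is a list of per-dialogue score dictionaries, which is hard to read at a glance. One possible way to condense it is sketched below. The function `summarize_quality` is a hypothetical helper, and the relevance threshold of 0.1 is an arbitrary illustrative value, not a recommendation from the original system.

```python
from statistics import mean


def summarize_quality(quality_results, relevance_threshold=0.1):
    """Average each metric and list the indices of dialogues with low relevance."""
    if not quality_results:
        return {"averages": {}, "low_relevance": []}
    # Collect every metric name that appears in at least one record.
    metrics = sorted({key for scores in quality_results for key in scores})
    averages = {
        m: round(mean(scores.get(m, 0.0) for scores in quality_results), 3)
        for m in metrics
    }
    low_relevance = [
        i for i, scores in enumerate(quality_results)
        if scores.get("relevance", 0.0) < relevance_threshold
    ]
    return {"averages": averages, "low_relevance": low_relevance}


sample_results = [
    {"detailed": 0.4, "length_appropriate": 0.8, "relevance": 0.35},
    {"detailed": 0.0, "length_appropriate": 0.6, "relevance": 0.05},
]
summary = summarize_quality(sample_results)
print(summary["averages"])
print("Low-relevance dialogue indices:", summary["low_relevance"])
```

The flagged indices point back into the loaded dialogue list, so the weakest exchanges can be pulled up for manual review.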

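5. A Practical Use Case

Let's look at a concrete scenario. Suppose you run an AI assistant platform that uses GLM-4-9B-Chat-1M to serve users. With this analysis system, you can:

Identify the model's strengths: quality evaluation might show, for example, that the model scores high on technical Q&A and programming help but is relatively weaker at creative writing.

Refine your service focus: topic analysis might show that users most often ask about "Python data processing" and "machine learning basics", so you can consider providing more specialized answers in those areas.

Improve the user experience: intent analysis might show that many users ask how to phrase better questions, so you can add a prompt suggestion feature.

Optimize resource allocation: analyzing response time and token usage lets you tune resource configuration and improve service efficiency.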
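The response time and token usage analysis mentioned in the last point is not covered by the modules above. A minimal sketch using only the standard library follows, assuming each record carries the `response_time` and `tokens_used` fields from section 2.2. The function name `latency_and_token_stats` is hypothetical, and the throughput figure is only a rough proxy, since `tokens_used` may include prompt tokens depending on how your logging is set up.

```python
from statistics import mean, quantiles


def latency_and_token_stats(dialogues):
    """Summarize response time and token usage across log records."""
    # Keep only records that carry both fields and a positive response time.
    rows = [
        (d["response_time"], d["tokens_used"])
        for d in dialogues
        if d.get("response_time") and d.get("tokens_used") is not None
    ]
    if len(rows) < 2:
        raise ValueError("need at least two records with response_time and tokens_used")
    times = [t for t, _ in rows]
    tokens = [k for _, k in rows]
    # quantiles(n=20) returns 19 cut points: index 9 is the median, index 18 is p95.
    cuts = quantiles(times, n=20, method="inclusive")
    return {
        "count": len(rows),
        "mean_response_time": mean(times),
        "p50_response_time": cuts[9],
        "p95_response_time": cuts[18],
        "mean_tokens": mean(tokens),
        "tokens_per_second": sum(tokens) / sum(times),
    }


logs = [
    {"response_time": 2.5, "tokens_used": 150},
    {"response_time": 1.0, "tokens_used": 60},
    {"response_time": 4.0, "tokens_used": 300},
    {"response_time": 1.5, "tokens_used": 90},
    {"response_time": None, "tokens_used": 40},  # incomplete record, ignored
]
stats = latency_and_token_stats(logs)
for name, value in stats.items():
    print(f"{name}: {value:.2f}")
```

Tail latency (p95) is often more useful for capacity planning than the mean, because a few slow long-context requests can dominate user-perceived responsiveness.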

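6. Suggested Advanced Features

Once you are comfortable with the basic analysis, consider these advanced features:

Sentiment analysis: analyze the sentiment of user queries to gauge user satisfaction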

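from transformers import pipeline

# Note: the pipeline's default checkpoint is trained on English text;
# pass model=... with a Chinese sentiment model when analyzing Chinese logs.
sentiment_analyzer = pipeline('sentiment-analysis')


def analyze_sentiment(text):
    # Let the tokenizer truncate to the model's maximum length
    # (slicing text[:512] would cut characters, not tokens).
    result = sentiment_analyzer(text, truncation=True)
    return result[0]['label'], result[0]['score']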

Dialogue flow analysis: analyze the coherence of multi-turn dialogues and how well context is maintained
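A starting point for dialogue flow analysis is to group records by `session_id` and look at each session as a whole. The sketch below reports the turn count and duration per session, plus a crude coherence proxy: the Jaccard overlap of character bigrams between consecutive user inputs. Character bigrams are used because they work on Chinese text without a tokenizer. This overlap measure is an assumption made for illustration and is far weaker than an embedding-based similarity; the function names are hypothetical.

```python
from collections import defaultdict
from datetime import datetime

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"  # assumed to match the sample record's timestamp format


def char_bigrams(text):
    """Return the set of character bigrams in text, ignoring whitespace."""
    text = "".join(text.split())
    return {text[i:i + 2] for i in range(len(text) - 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 if either set is empty."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def summarize_sessions(dialogues):
    """Group records by session_id and report turns, duration, and topical overlap."""
    sessions = defaultdict(list)
    for d in dialogues:
        sessions[d["session_id"]].append(d)

    summary = {}
    for session_id, turns in sessions.items():
        turns.sort(key=lambda d: datetime.strptime(d["timestamp"], TIME_FORMAT))
        start = datetime.strptime(turns[0]["timestamp"], TIME_FORMAT)
        end = datetime.strptime(turns[-1]["timestamp"], TIME_FORMAT)
        overlaps = [
            jaccard(char_bigrams(a["user_input"]), char_bigrams(b["user_input"]))
            for a, b in zip(turns, turns[1:])
        ]
        summary[session_id] = {
            "turns": len(turns),
            "duration_seconds": (end - start).total_seconds(),
            # None for single-turn sessions, where no consecutive pair exists
            "mean_overlap": sum(overlaps) / len(overlaps) if overlaps else None,
        }
    return summary


logs = [
    {"session_id": "sess_a", "timestamp": "2024-01-15 10:02:30", "user_input": "安装PySpark时报错"},
    {"session_id": "sess_a", "timestamp": "2024-01-15 10:00:00", "user_input": "如何安装PySpark"},
    {"session_id": "sess_b", "timestamp": "2024-01-15 11:00:00", "user_input": "推荐一本机器学习的书"},
]
for session_id, info in summarize_sessions(logs).items():
    print(session_id, info)
```

Sessions with many turns but very low overlap may indicate topic switches, while long sessions with high overlap can point to users rephrasing the same question because earlier answers did not help.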

Anomaly detection: automatically identify abnormal dialogue patterns or abnormal model behavior
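As one simple form of anomaly detection, unusually slow responses can be flagged with a robust outlier test. The sketch below uses the modified z-score based on the median and the median absolute deviation (MAD), with the commonly used constant 0.6745 and cutoff 3.5. It is an illustration applied to `response_time` values only; the function name `find_outliers` is hypothetical, and detecting abnormal dialogue content would need different features.

```python
from statistics import median


def find_outliers(values, threshold=3.5):
    """Return indices of values whose modified z-score exceeds the threshold."""
    if not values:
        return []
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # MAD is zero when more than half the values are identical;
        # this sketch does not attempt to flag anything in that case.
        return []
    return [
        i for i, v in enumerate(values)
        if abs(0.6745 * (v - med) / mad) > threshold
    ]


response_times = [2.1, 2.5, 1.9, 2.3, 2.4, 30.0, 2.2]
outlier_indices = find_outliers(response_times)
print("Outlier indices:", outlier_indices)
```

The median and MAD are used instead of the mean and standard deviation because a single extreme value, such as a 30-second response, would otherwise inflate the spread and hide itself.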

Personalized recommendation: recommend relevant resources or features based on a user's dialogue history