GLM-4.7-Flash Code Example: Streaming Responses from the vLLM API in Python - 平芜编程栈

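Have you run into this? You call a large-model API, type a question, and a spinner turns for ten seconds or more before the whole answer suddenly lands at once. In the meantime you have no idea what the model is doing: is it stuck, or is it still working on the answer?

That is not a great experience. Fortunately there is a better option: streaming responses.

In this article I'll walk you through calling the vLLM API for GLM-4.7-Flash from Python, step by step, to get typewriter-style streaming output. You type a question and the answer appears piece by piece as the model generates it, much like chatting with a real person.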

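1. Why do you need streaming responses?

Before writing any code, let's settle one question: what is a streaming response actually good for?

A traditional one-shot response is like waiting for a parcel: after you order, all you can do is wait until the courier hands over the whole package, and only then can you open it and see what is inside.

A streaming response is like watching a live broadcast: you see the content while the host is still talking, without waiting for the broadcast to finish.

For large-model applications, streaming brings several concrete benefits:

Better user experience: users watch the answer take shape instead of staring at a blank screen
Faster perceived speed: even when total generation time is the same, the first text arrives sooner, so the response feels faster
Less buffering on the client: the text can be processed as it is generated instead of being held until the whole long answer is complete
Room for real-time interaction: in some scenarios the user can interrupt generation midway or steer it in a new direction

For a 30B-parameter model such as GLM-4.7-Flash, these advantages become more noticeable when generating long text.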
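
The "faster perceived speed" point can be made concrete by measuring two numbers: the time until the first fragment arrives and the time until the answer is complete. The sketch below is only an illustration. fake_token_stream is a hypothetical stand-in for a model (it is not part of vLLM or of this article's service), so the timings it prints say nothing about GLM-4.7-Flash itself.

```python
import time
from typing import Iterable, Iterator, Tuple


def fake_token_stream(tokens: Iterable[str], delay: float = 0.01) -> Iterator[str]:
    """Hypothetical stand-in for a model: yields one token every `delay` seconds."""
    for token in tokens:
        time.sleep(delay)
        yield token


def measure_latency(stream: Iterable[str]) -> Tuple[float, float, str]:
    """Return (time_to_first_chunk, total_time, full_text) for a stream of chunks."""
    start = time.perf_counter()
    first_chunk_at = None
    parts = []
    for chunk in stream:
        if first_chunk_at is None:
            # The user starts reading here, long before the stream ends.
            first_chunk_at = time.perf_counter() - start
        parts.append(chunk)
    total = time.perf_counter() - start
    if first_chunk_at is None:  # empty stream
        first_chunk_at = total
    return first_chunk_at, total, "".join(parts)


ttfc, total, text = measure_latency(fake_token_stream(["Hello", ", ", "world", "!"]))
print(f"first chunk after {ttfc:.3f}s, full answer after {total:.3f}s: {text}")
```

With a non-streaming call the user only ever experiences the second number; with streaming, the first number is what shapes the impression of speed.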

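2. Environment setup and quick checks

2.1 Confirm the status of your GLM-4.7-Flash image

Before writing code, make sure the service is running. If you are using a preconfigured GLM-4.7-Flash image, the service has usually been started automatically.

Open a terminal and run the following commands to check the service status: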

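# Check that the vLLM inference service is running (port 8000).
# /health normally answers with an empty body, so print the HTTP status code
# (200 means healthy) instead of piping the body into a JSON parser.
curl -s -o /dev/null -w "%{http_code}\n" http://127.0.0.1:8000/health

# Or check the supervisor status directly
supervisorctl status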

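You should see output similar to this:

glm_vllm    RUNNING   pid 12345, uptime 0:10:00
glm_ui      RUNNING   pid 12346, uptime 0:10:00

If the services are not running, you can start them manually:

# Start all services
supervisorctl start all

# Wait about 30 seconds for the model to finish loading
sleep 30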
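
A fixed 30-second sleep is a guess: the model may load faster or slower on your machine. As an alternative, here is a minimal sketch that polls until the service answers. wait_until_ready and vllm_is_healthy are hypothetical helper names introduced for this illustration, and the probe assumes that the /health endpoint used above returns HTTP 200 once the server is ready.

```python
import time
from typing import Callable


def wait_until_ready(check: Callable[[], bool], timeout: float = 120.0,
                     interval: float = 2.0) -> bool:
    """Call `check` every `interval` seconds until it returns True or `timeout` elapses."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            if check():
                return True
        except Exception:
            pass  # Any error (for example, connection refused) means "not ready yet"
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


def vllm_is_healthy(base_url: str = "http://127.0.0.1:8000") -> bool:
    """Assumed readiness probe: HTTP 200 from the /health endpoint."""
    import requests  # Imported lazily so wait_until_ready has no third-party dependency
    return requests.get(f"{base_url}/health", timeout=5).status_code == 200
```

Typical use would be a single line before the first chat request, for example: if not wait_until_ready(vllm_is_healthy): raise SystemExit("service did not come up").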

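2.2 Install the required Python library

The main dependency is the requests library, which sends the HTTP requests. If it is not installed yet, install it with:

pip install requests

More complex applications may need other libraries, but this article focuses on the core streaming implementation.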

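3. A basic API call: starting simple

Before implementing streaming, let's look at how a traditional non-streaming call works. That makes the difference between the two much easier to see.

3.1 The traditional non-streaming call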

import requests
import time


def chat_without_streaming(question):
    """Traditional non-streaming chat function."""
    # API endpoint
    url = "http://127.0.0.1:8000/v1/chat/completions"

    # Request parameters
    payload = {
        "model": "/root/.cache/huggingface/ZhipuAI/GLM-4.7-Flash",
        "messages": [
            {"role": "user", "content": question}
        ],
        "temperature": 0.7,  # Controls randomness; lower values are more deterministic
        "max_tokens": 1024,  # Maximum number of tokens to generate
        "stream": False      # Key setting: streaming is off
    }

    # Record the start time
    start_time = time.time()

    # Send the request
    print(f"Asking GLM-4.7-Flash: {question}")
    print("Waiting for the response...")
    response = requests.post(url, json=payload)

    # Record the end time
    end_time = time.time()

    if response.status_code == 200:
        result = response.json()
        answer = result["choices"][0]["message"]["content"]

        print(f"\nFull answer received (took {end_time - start_time:.2f}s):")
        print("-" * 50)
        print(answer)
        print("-" * 50)
        return answer
    else:
        print(f"Request failed, status code: {response.status_code}")
        print(f"Error message: {response.text}")
        return None


# Test the non-streaming call
if __name__ == "__main__":
    question = "Please write a short essay of about 200 words on the future of artificial intelligence."
    chat_without_streaming(question)

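Run this code and you will find that the whole answer comes back at once. While the model is generating, all you can do is wait.

3.2 Limitations of the non-streaming call

The drawbacks of this approach are clear:

Waiting anxiety: the user cannot tell whether the model is working
Memory pressure: a long answer has to be held in memory all at once
No interruption: once generation starts, you can only wait for it to finish
Perceived latency: even when generation is actually fast, it feels slow to the user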
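
The two modes also differ in the shape of the response body. The sketch below uses made-up payloads in the OpenAI-compatible shape that the code in this article reads: a non-streaming body carries the whole answer in choices[0].message.content, while each streaming chunk carries one fragment in choices[0].delta.content. The sample text and the helper names full_answer and joined_deltas are illustrative assumptions, not output from a real server.

```python
import json

# Illustrative bodies only; the text values are made up.
NON_STREAMING_BODY = (
    '{"choices": [{"message": {"role": "assistant", "content": "Hello world"}}]}'
)
STREAMING_CHUNKS = [
    '{"choices": [{"delta": {"role": "assistant"}}]}',   # first chunk: role, no text
    '{"choices": [{"delta": {"content": "Hello"}}]}',
    '{"choices": [{"delta": {"content": " world"}}]}',
]


def full_answer(body: str) -> str:
    """Non-streaming: the whole answer sits in choices[0].message.content."""
    return json.loads(body)["choices"][0]["message"]["content"]


def joined_deltas(chunks) -> str:
    """Streaming: concatenate the fragment found in each chunk's delta."""
    parts = []
    for chunk in chunks:
        delta = json.loads(chunk)["choices"][0].get("delta", {})
        parts.append(delta.get("content") or "")  # some chunks carry no text
    return "".join(parts)


print(full_answer(NON_STREAMING_BODY))
print(joined_deltas(STREAMING_CHUNKS))
```

Concatenating the fragments gives the same text as the one-shot body; streaming changes when the text arrives, not what it says.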

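4. Implementing streaming responses: the core code explained

Now for the main topic: how to implement a streaming response.

4.1 A basic streaming implementation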

import requests
import json


def chat_with_streaming_basic(question):
    """Basic streaming chat function."""
    url = "http://127.0.0.1:8000/v1/chat/completions"

    payload = {
        "model": "/root/.cache/huggingface/ZhipuAI/GLM-4.7-Flash",
        "messages": [
            {"role": "user", "content": question}
        ],
        "temperature": 0.7,
        "max_tokens": 1024,
        "stream": True  # Key setting: streaming is on
    }

    print(f"Question: {question}")
    print("The model is thinking...\n")

    # Send the streaming request
    response = requests.post(url, json=payload, stream=True)

    full_response = ""

    if response.status_code == 200:
        # Read the streamed response line by line
        for line in response.iter_lines():
            if line:
                # Decode bytes into a string
                line_text = line.decode('utf-8')

                # Only lines with the SSE "data: " prefix carry a payload
                if line_text.startswith("data: "):
                    data_str = line_text[6:]  # Strip the "data: " prefix

                    # Stop at the end marker
                    if data_str == "[DONE]":
                        break

                    try:
                        # Parse the JSON chunk
                        data = json.loads(data_str)

                        # Extract the generated text
                        if "choices" in data and len(data["choices"]) > 0:
                            delta = data["choices"][0].get("delta", {})
                            content = delta.get("content", "")

                            if content:
                                # Print the fragment immediately
                                print(content, end="", flush=True)
                                full_response += content
                    except json.JSONDecodeError:
                        # Ignore chunks that fail to parse
                        pass
    else:
        print(f"Request failed, status code: {response.status_code}")
        print(f"Error message: {response.text}")
        return None

    print("\n\n--- Answer complete ---")
    return full_response


# Test the basic streaming call
if __name__ == "__main__":
    question = "Please explain what machine learning is, in plain language."
    chat_with_streaming_basic(question)

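Run this code and you will see the text appear piece by piece, as if someone were typing.

4.2 Key points in the code

Let's look closely at the key parts of this code:

1. The stream=True parameter

# Set in requests.post
response = requests.post(url, json=payload, stream=True)

This parameter tells the requests library not to download the whole body at once, but to keep the connection open and read data as the server sends it. It works together with "stream": True in the payload, which is what asks the server to stream in the first place.

2. Reading the response line by line

for line in response.iter_lines():

The iter_lines() method reads the response body line by line and returns one line at a time, as bytes.

3. Handling the SSE format

if line_text.startswith("data: "):
    data_str = line_text[6:]  # Strip the "data: " prefix

The vLLM API uses the Server-Sent Events (SSE) format, in which each data line starts with "data: ". The final line is "data: [DONE]", which the code treats as the end marker.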
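
The SSE handling above is repeated in every example in this article, so it can help to see it isolated in one small function. The following is a minimal sketch rather than the article's own code: parse_sse_line and the DONE sentinel are hypothetical names, and the sample lines are made up to match the chunk shape the examples read.

```python
import json

DONE = object()  # Sentinel returned when the stream signals completion


def parse_sse_line(raw: bytes):
    """Parse one line as returned by response.iter_lines().

    Returns the text fragment (possibly ""), DONE for the end marker,
    or None for lines without usable data (blank lines, comments, bad JSON).
    """
    line = raw.decode("utf-8").strip()
    if not line.startswith("data:"):
        return None
    data_str = line[len("data:"):].strip()
    if data_str == "[DONE]":
        return DONE
    try:
        data = json.loads(data_str)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    choices = data.get("choices") or []
    if not choices:
        return None
    # The first chunk often carries only the role, so content may be missing.
    return choices[0].get("delta", {}).get("content") or ""


# Made-up sample lines in the shape the examples expect
for raw in [b'data: {"choices": [{"delta": {"content": "Hi"}}]}', b"", b"data: [DONE]"]:
    print(repr(parse_sse_line(raw)))
```

A read loop built on this helper only has to distinguish three cases: stop on DONE, skip None, and print anything else.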

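4. Displaying content in real time

print(content, end="", flush=True)

end="" prevents a newline after each fragment
flush=True forces immediate output instead of waiting for the buffer to fill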

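5. Going further: building a more complete streaming chat application

The basic version works, but there is plenty of room for improvement. Let's build a more complete streaming chat application.

5.1 Adding conversation history

Real applications usually need multi-turn conversations. Here is a version that keeps the conversation history: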

import requests
import json


class StreamingChatClient:
    """Streaming chat client with multi-turn conversation support."""

    def __init__(self, api_url="http://127.0.0.1:8000/v1/chat/completions"):
        self.api_url = api_url
        self.conversation_history = []
        self.model_name = "/root/.cache/huggingface/ZhipuAI/GLM-4.7-Flash"

    def add_message(self, role: str, content: str):
        """Append a message to the conversation history."""
        self.conversation_history.append({
            "role": role,
            "content": content
        })

    def chat_stream(self, user_input: str, temperature: float = 0.7, max_tokens: int = 1024):
        """Main streaming chat method."""
        # Add the user input to the history
        self.add_message("user", user_input)

        print(f"\nYou: {user_input}")
        print("\nGLM-4.7-Flash: ", end="", flush=True)

        # Build the request payload
        payload = {
            "model": self.model_name,
            "messages": self.conversation_history,
            "temperature": temperature,
            "max_tokens": max_tokens,
            "stream": True
        }

        # Send the request
        response = requests.post(self.api_url, json=payload, stream=True)

        assistant_response = ""

        if response.status_code == 200:
            for line in response.iter_lines():
                if line:
                    line_text = line.decode('utf-8')

                    if line_text.startswith("data: "):
                        data_str = line_text[6:]

                        if data_str == "[DONE]":
                            break

                        try:
                            data = json.loads(data_str)

                            if "choices" in data and len(data["choices"]) > 0:
                                delta = data["choices"][0].get("delta", {})
                                content = delta.get("content", "")

                                if content:
                                    print(content, end="", flush=True)
                                    assistant_response += content
                        except json.JSONDecodeError:
                            continue
        else:
            print(f"\nRequest failed: {response.status_code}")
            # Drop the unanswered user message so the history stays consistent
            self.conversation_history.pop()
            return None

        # Add the assistant reply to the history
        self.add_message("assistant", assistant_response)

        print()  # Newline
        return assistant_response

    def clear_history(self):
        """Clear the conversation history."""
        self.conversation_history = []
        print("Conversation history cleared")

    def show_history(self):
        """Print the conversation history."""
        print("\n=== Conversation history ===")
        for i, msg in enumerate(self.conversation_history):
            role = "User" if msg["role"] == "user" else "Assistant"
            # Show only a preview of the first 100 characters
            preview = msg["content"][:100] + "..." if len(msg["content"]) > 100 else msg["content"]
            print(f"{i+1}. [{role}] {preview}")
        print("================\n")


# Usage example
if __name__ == "__main__":
    # Create the client
    client = StreamingChatClient()

    # First turn
    client.chat_stream("Hello, please introduce yourself")

    # Second turn (builds on the history)
    client.chat_stream("You just said you are a large language model. What can you do?")

    # Show the history
    client.show_history()

    # Clear the history
    client.clear_history()
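
One practical consequence of this design is that the history grows with every turn and is resent in full on each request. A simple way to bound it is to keep only the most recent messages. The sketch below is an assumption-laden illustration: trim_history is a hypothetical helper, and counting messages is a crude stand-in for counting tokens, which is what actually determines whether a request fits the model's context window.

```python
from typing import Dict, List


def trim_history(messages: List[Dict[str, str]], max_messages: int = 10) -> List[Dict[str, str]]:
    """Keep every system message plus the most recent `max_messages` other messages."""
    system = [m for m in messages if m["role"] == "system"]
    others = [m for m in messages if m["role"] != "system"]
    recent = others[-max_messages:] if max_messages > 0 else []
    return system + recent


# Made-up history: one system prompt followed by six alternating turns
history = [{"role": "system", "content": "You are a helpful assistant."}]
for i in range(6):
    role = "user" if i % 2 == 0 else "assistant"
    history.append({"role": role, "content": f"message {i}"})

trimmed = trim_history(history, max_messages=4)
print([m["content"] for m in trimmed])
```

In the client above, such a helper could be applied to self.conversation_history when the payload is built. Choosing an even max_messages keeps user and assistant turns paired.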

5.2 Adding rate control and progress display

Output that streams too fast or too slowly both hurt the experience. We can add some control:

import json
import time

import requests


def chat_with_streaming_enhanced(question, words_per_minute=200):
    """Enhanced streaming chat with output rate control."""
    url = "http://127.0.0.1:8000/v1/chat/completions"

    payload = {
        "model": "/root/.cache/huggingface/ZhipuAI/GLM-4.7-Flash",
        "messages": [{"role": "user", "content": question}],
        "temperature": 0.7,
        "max_tokens": 1024,
        "stream": True
    }

    print(f"Question: {question}")
    print("The model is thinking...\n")

    response = requests.post(url, json=payload, stream=True)

    full_response = ""
    char_count = 0
    start_time = time.time()

    if response.status_code == 200:
        print("GLM-4.7-Flash: ", end="", flush=True)

        for line in response.iter_lines():
            if line:
                line_text = line.decode('utf-8')

                if line_text.startswith("data: "):
                    data_str = line_text[6:]

                    if data_str == "[DONE]":
                        break

                    try:
                        data = json.loads(data_str)

                        if "choices" in data and len(data["choices"]) > 0:
                            delta = data["choices"][0].get("delta", {})
                            content = delta.get("content", "")

                            if content:
                                # Throttle the output rate
                                for char in content:
                                    print(char, end="", flush=True)
                                    full_response += char
                                    char_count += 1

                                    # Ideal delay per character (based on words per minute)
                                    if words_per_minute > 0:
                                        # Assumes one Chinese character counts as one English word
                                        delay_per_char = 60.0 / words_per_minute
                                        time.sleep(delay_per_char)
                    except json.JSONDecodeError:
                        continue

        end_time = time.time()
        duration = end_time - start_time

        print("\n\n--- Statistics ---")
        print(f"Total characters: {char_count}")
        print(f"Total time: {duration:.2f}s")
        print(f"Average speed: {char_count/duration:.1f} characters/s")
    else:
        print(f"Request failed: {response.status_code}")
        return None

    return full_response


# Test different rates
if __name__ == "__main__":
    question = "Please write a poem about spring"

    print("=== Fast mode (300 characters/minute) ===")
    chat_with_streaming_enhanced(question, words_per_minute=300)

    print("\n\n=== Natural mode (150 characters/minute) ===")
    chat_with_streaming_enhanced(question, words_per_minute=150)
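
The pacing rule above is simply 60 seconds divided by the target number of characters per minute, so 300 characters per minute means a pause of 0.2 seconds after each character. A minimal sketch of the same idea as a reusable generator follows; typewriter is a hypothetical name, and the injectable sleep parameter is an assumption added here so the pacing can be checked without actually waiting.

```python
import time
from typing import Callable, Iterable, Iterator


def typewriter(chunks: Iterable[str], chars_per_minute: float = 300,
               sleep: Callable[[float], None] = time.sleep) -> Iterator[str]:
    """Yield characters one at a time, pausing 60 / chars_per_minute seconds after each."""
    delay = 60.0 / chars_per_minute if chars_per_minute > 0 else 0.0
    for chunk in chunks:
        for char in chunk:
            yield char
            if delay:
                sleep(delay)


# Demo with made-up chunks at a fast rate (0.01s per character)
for char in typewriter(["Spring ", "rain"], chars_per_minute=6000):
    print(char, end="", flush=True)
print()
```

Separating pacing from network reading keeps each part easy to test. Note that sleeping while reading the stream, as both versions do, cannot make the model faster; it only slows the display down to the chosen rate.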

5.3 Adding error handling and retries

In production the network can be unreliable, so we need error handling:

import requests
import json
import time


def robust_chat_stream(question, max_retries=3):
    """Robust streaming chat with retries."""
    url = "http://127.0.0.1:8000/v1/chat/completions"

    payload = {
        "model": "/root/.cache/huggingface/ZhipuAI/GLM-4.7-Flash",
        "messages": [{"role": "user", "content": question}],
        "temperature": 0.7,
        "max_tokens": 1024,
        "stream": True
    }

    for attempt in range(max_retries):
        try:
            print(f"Attempt {attempt + 1}/{max_retries}...")

            # Set a timeout. In requests this limits connecting and each wait
            # for data, not the duration of the whole stream.
            response = requests.post(
                url,
                json=payload,
                stream=True,
                timeout=30  # 30-second timeout
            )

            response.raise_for_status()  # Raise on HTTP errors

            full_response = ""
            print(f"\nQuestion: {question}")
            print("GLM-4.7-Flash: ", end="", flush=True)

            for line in response.iter_lines():
                if line:
                    line_text = line.decode('utf-8', errors='ignore')

                    if line_text.startswith("data: "):
                        data_str = line_text[6:]

                        if data_str == "[DONE]":
                            print("\n\n--- Streaming response complete ---")
                            return full_response

                        try:
                            data = json.loads(data_str)

                            if "choices" in data and len(data["choices"]) > 0:
                                delta = data["choices"][0].get("delta", {})
                                content = delta.get("content", "")

                                if content:
                                    print(content, end="", flush=True)
                                    full_response += content
                        except json.JSONDecodeError:
                            # Skip the malformed chunk and keep going
                            continue

            return full_response

        except requests.exceptions.Timeout:
            print(f"Request timed out, attempt {attempt + 1}/{max_retries}")
            if attempt < max_retries - 1:
                time.sleep(2 ** attempt)  # Exponential backoff
            continue

        except requests.exceptions.RequestException as e:
            print(f"Network error: {e}")
            if attempt < max_retries - 1:
                time.sleep(2 ** attempt)
            continue

        except Exception as e:
            print(f"Unexpected error: {e}")
            break

    print(f"Giving up after at most {max_retries} attempts")
    return None


# Test the error handling
if __name__ == "__main__":
    # Normal case
    result = robust_chat_stream("What is deep learning?")

    if result:
        print(f"\nFinal answer length: {len(result)} characters")
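
The retry loop above waits 2 ** attempt seconds between attempts: 1 second, then 2, and so on. To make that schedule easy to inspect, here is a minimal sketch that computes the waits up front. backoff_delays is a hypothetical helper; the cap and the optional random jitter are additions that the article's code does not have, shown as common refinements rather than requirements.

```python
import random
from typing import List, Optional


def backoff_delays(max_retries: int, base: float = 1.0, cap: float = 30.0,
                   jitter: float = 0.0, rng: Optional[random.Random] = None) -> List[float]:
    """Return the waits (in seconds) between consecutive attempts.

    Attempt k (0-based) is followed by base * 2**k seconds, limited to `cap`,
    plus a random extra of up to `jitter` times that value. There is no wait
    after the last attempt, so the list has max_retries - 1 entries.
    """
    rng = rng or random.Random()
    delays = []
    for attempt in range(max(0, max_retries - 1)):
        delay = min(cap, base * (2 ** attempt))
        delay += rng.uniform(0, jitter * delay)
        delays.append(delay)
    return delays


print(backoff_delays(3))          # the schedule used by robust_chat_stream
print(backoff_delays(6, cap=8))   # longer schedule with a ceiling
```

Jitter matters mostly when many clients retry at once, since it spreads their retries out instead of letting them hit the server in the same instant.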

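6. Real-world scenarios with code examples

Now let's look at a few typical scenarios for streaming responses in real applications.

6.1 Scenario 1: a customer service bot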

import requests
import json
import threading


class CustomerServiceBot:
    """Customer service bot with streaming responses."""

    def __init__(self):
        self.api_url = "http://127.0.0.1:8000/v1/chat/completions"
        self.system_prompt = """You are a professional customer service assistant. Answer user questions in a friendly, professional manner.
Keep answers concise and to the point. If you are unsure about something, do not make things up; suggest that the user contact a human agent."""

        # Initialize the conversation
        self.messages = [
            {"role": "system", "content": self.system_prompt}
        ]

    def stream_response(self, user_query, callback):
        """Generate a streaming response and deliver it through a callback."""
        # Add the user message
        self.messages.append({"role": "user", "content": user_query})

        payload = {
            "model": "/root/.cache/huggingface/ZhipuAI/GLM-4.7-Flash",
            "messages": self.messages,
            "temperature": 0.3,  # Customer service needs more stable output
            "max_tokens": 512,
            "stream": True
        }

        def generate():
            response_text = ""
            try:
                response = requests.post(self.api_url, json=payload, stream=True, timeout=60)
                response.raise_for_status()  # Report HTTP errors through the ERROR callback

                for line in response.iter_lines():
                    if line:
                        line_text = line.decode('utf-8')

                        if line_text.startswith("data: "):
                            data_str = line_text[6:]

                            if data_str == "[DONE]":
                                # Add the assistant reply to the history
                                self.messages.append({"role": "assistant", "content": response_text})
                                callback("DONE", response_text)
                                break

                            try:
                                data = json.loads(data_str)

                                if "choices" in data and len(data["choices"]) > 0:
                                    delta = data["choices"][0].get("delta", {})
                                    content = delta.get("content", "")

                                    if content:
                                        response_text += content
                                        callback("DATA", content)
                            except json.JSONDecodeError:
                                continue
            except Exception as e:
                callback("ERROR", str(e))

        # Generate the response in a new thread
        thread = threading.Thread(target=generate)
        thread.daemon = True
        thread.start()

        return thread


# Usage example
if __name__ == "__main__":
    bot = CustomerServiceBot()

    def handle_stream_event(event_type, data):
        if event_type == "DATA":
            print(data, end="", flush=True)
        elif event_type == "DONE":
            print("\n\n[Answer complete]")
        elif event_type == "ERROR":
            print(f"\n[Error] {data}")

    print("Customer service bot started! Type 'exit' to end the conversation\n")

    while True:
        user_input = input("\nUser: ")

        if user_input.lower() in ["退出", "exit", "quit"]:
            print("Thanks for using the bot. Goodbye!")
            break

        print("Agent: ", end="", flush=True)

        # Start the streaming response
        thread = bot.stream_response(user_input, handle_stream_event)

        # Wait for the response to finish before prompting again
        # (a fixed one-second sleep would return before most answers complete)
        thread.join()
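
The callback in this example runs on the worker thread, which is fine for printing but awkward when the main thread needs to own the output, as in many UI frameworks. A common alternative is to hand fragments over through queue.Queue. The following is a minimal sketch of that pattern with a made-up list of chunks standing in for the HTTP stream; stream_in_background and consume are hypothetical names.

```python
import queue
import threading
from typing import Iterable, Iterator

_SENTINEL = object()  # Marks the end of the stream on the queue


def stream_in_background(chunks: Iterable[str]) -> "queue.Queue":
    """Start a producer thread that puts each chunk on a queue, then the sentinel."""
    q: "queue.Queue" = queue.Queue()

    def worker() -> None:
        try:
            for chunk in chunks:
                q.put(chunk)
        finally:
            q.put(_SENTINEL)  # Always signal completion, even after an error

    threading.Thread(target=worker, daemon=True).start()
    return q


def consume(q: "queue.Queue", timeout: float = 5.0) -> Iterator[str]:
    """Yield chunks on the calling thread until the sentinel arrives."""
    while True:
        item = q.get(timeout=timeout)  # Raises queue.Empty if the producer stalls
        if item is _SENTINEL:
            return
        yield item


# Made-up chunks standing in for fragments read from the API
text = "".join(consume(stream_in_background(["How ", "can ", "I ", "help?"])))
print(text)
```

In a real client the producer would iterate over the parsed SSE fragments instead of a list, and the consumer loop would replace the thread.join() call.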

6.2 Scenario 2: a real-time translation tool

import requests
import json


class StreamingTranslator:
    """Streaming translation tool."""

    def __init__(self):
        self.api_url = "http://127.0.0.1:8000/v1/chat/completions"

    def translate_stream(self, text, target_language="English", source_language="Chinese"):
        """Translate text with streaming output."""
        prompt = f"""Translate the following {source_language} text into {target_language}.
Requirements: the translation must be accurate, natural, and faithful to the style of the original.

Source text: {text}

Translation:"""

        print(f"Source ({source_language}): {text}")
        print(f"\nTranslation ({target_language}): ", end="", flush=True)

        payload = {
            "model": "/root/.cache/huggingface/ZhipuAI/GLM-4.7-Flash",
            "messages": [{"role": "user", "content": prompt}],
            "temperature": 0.1,  # Translation needs high accuracy
            "max_tokens": 1024,
            "stream": True
        }

        translation = ""

        try:
            response = requests.post(self.api_url, json=payload, stream=True, timeout=30)
            response.raise_for_status()  # Treat HTTP errors as failures

            for line in response.iter_lines():
                if line:
                    line_text = line.decode('utf-8')

                    if line_text.startswith("data: "):
                        data_str = line_text[6:]

                        if data_str == "[DONE]":
                            break

                        try:
                            data = json.loads(data_str)

                            if "choices" in data and len(data["choices"]) > 0:
                                delta = data["choices"][0].get("delta", {})
                                content = delta.get("content", "")

                                if content:
                                    print(content, end="", flush=True)
                                    translation += content
                        except json.JSONDecodeError:
                            continue
        except Exception as e:
            print(f"\nTranslation failed: {e}")
            return None

        print("\n\n--- Translation complete ---")
        return translation


# Usage example
if __name__ == "__main__":
    translator = StreamingTranslator()

    # Chinese to English
    chinese_text = "人工智能是未来科技发展的重要方向，它将深刻改变我们的生活和工作方式。"
    translator.translate_stream(chinese_text, "English", "Chinese")

    print("\n" + "="*50 + "\n")

    # English to Chinese
    english_text = "Machine learning is a subset of artificial intelligence that enables systems to learn and improve from experience without being explicitly programmed."
    translator.translate_stream(english_text, "Chinese", "English")

6.3 Scenario 3: a code generation assistant

import requests
import json


class CodeGenerator:
    """Streaming code generation assistant."""

    def __init__(self):
        self.api_url = "http://127.0.0.1:8000/v1/chat/completions"

    def generate_code_stream(self, requirement, language="Python"):
        """Generate code with streaming output."""
        prompt = f"""Write {language} code for the requirement below.
The code should be concise, efficient, commented, and follow best practices.

Requirement: {requirement}

Code:"""

        print(f"Requirement: {requirement}")
        print(f"\nGenerating {language} code...\n")
        print("```" + language.lower())

        payload = {
            "model": "/root/.cache/huggingface/ZhipuAI/GLM-4.7-Flash",
            "messages": [{"role": "user", "content": prompt}],
            "temperature": 0.2,  # Code generation needs stability
            "max_tokens": 2048,
            "stream": True
        }

        full_code = ""

        try:
            response = requests.post(self.api_url, json=payload, stream=True, timeout=60)
            response.raise_for_status()  # Treat HTTP errors as failures

            for line in response.iter_lines():
                if line:
                    line_text = line.decode('utf-8')

                    if line_text.startswith("data: "):
                        data_str = line_text[6:]

                        if data_str == "[DONE]":
                            break

                        try:
                            data = json.loads(data_str)

                            if "choices" in data and len(data["choices"]) > 0:
                                delta = data["choices"][0].get("delta", {})
                                content = delta.get("content", "")

                                if content:
                                    print(content, end="", flush=True)
                                    full_code += content
                        except json.JSONDecodeError:
                            continue
        except Exception as e:
            print(f"\nCode generation failed: {e}")
            return None

        print("\n```")
        print("\n--- Code generation complete ---")

        # Return the code for later use
        return full_code


# Usage example
if __name__ == "__main__":
    coder = CodeGenerator()

    # Generate Python code
    requirement = "Write a function that computes the n-th Fibonacci number, optimized with dynamic programming"
    code = coder.generate_code_stream(requirement, "Python")

    if code:
        print(f"\nGenerated code length: {len(code)} characters")

        # The code can be processed further, for example saved to a file
        with open("generated_fibonacci.py", "w", encoding="utf-8") as f:
            f.write(code)
        print("Code saved to generated_fibonacci.py")

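7. Performance optimization and best practices

In real use you may also need to think about performance. Here are a few practical techniques:

7.1 Connection pool management

For applications that call the API frequently, a connection pool reuses existing connections instead of opening a new one for every request: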

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def create_http_session(pool_connections=10, pool_maxsize=10, max_retries=3):
    """Create a configured HTTP session."""
    session = requests.Session()

    # Configure the retry strategy
    retry_strategy = Retry(
        total=max_retries,
        backoff_factor=1,
        status_forcelist=[429, 500, 502, 503, 504],
        # By default urllib3 does not retry POST on these status codes, so it
        # has to be allowed explicitly (parameter name as of urllib3 1.26+).
        allowed_methods=["POST"]
    )

    # Configure the adapter
    adapter = HTTPAdapter(
        pool_connections=pool_connections,
        pool_maxsize=pool_maxsize,
        max_retries=retry_strategy
    )

    session.mount("http://", adapter)
    session.mount("https://", adapter)

    return session


# Use the connection pool
session = create_http_session()


def chat_with_pool(question):
    """Chat function that uses the connection pool."""
    url = "http://127.0.0.1:8000/v1/chat/completions"

    payload = {
        "model": "/root/.cache/huggingface/ZhipuAI/GLM-4.7-Flash",
        "messages": [{"role": "user", "content": question}],
        "temperature": 0.7,
        "max_tokens": 1024,
        "stream": True
    }

    # Use the session instead of requests.post
    response = session.post(url, json=payload, stream=True, timeout=30)

    # ... handle the streaming response ...

7.2 Batch streaming

If you need to handle several requests, consider processing them as a batch:

import concurrent.futures

# Requires chat_with_streaming_basic from section 4.1


def batch_chat_stream(questions, max_workers=3):
    """Batch streaming chat."""
    results = {}

    def process_question(q):
        """Handle a single question."""
        print(f"\nProcessing question: {q[:50]}...")
        # Note: concurrent workers print to the same terminal,
        # so their streamed output will interleave
        result = chat_with_streaming_basic(q)
        return q, result

    # Process concurrently with a thread pool
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Submit all tasks
        future_to_question = {
            executor.submit(process_question, q): q
            for q in questions
        }

        # Collect the results
        for future in concurrent.futures.as_completed(future_to_question):
            question = future_to_question[future]
            try:
                q, result = future.result()
                results[q] = result
                print(f"✓ Done: {q[:30]}...")
            except Exception as e:
                print(f"✗ Error: {question[:30]}... - {e}")
                results[question] = None

    return results


# Batch example
if __name__ == "__main__":
    questions = [
        "What is artificial intelligence?",
        "What are the main types of machine learning?",
        "What is the difference between deep learning and machine learning?",
        "What is the basic principle of a neural network?"
    ]

    print("Starting batch processing...")
    results = batch_chat_stream(questions, max_workers=2)

    print(f"\nFinished, succeeded: {sum(1 for r in results.values() if r)}/{len(questions)}")
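
Because every worker prints as it streams, the output of concurrent questions gets mixed together on screen. If the batch only needs the final answers, one option is to collect each stream silently and print whole answers afterwards. The sketch below illustrates that idea under stated assumptions: collect_answers is a hypothetical helper, and fake_stream is a made-up stand-in for any function that yields the fragments of one answer.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, List


def fake_stream(question: str) -> Iterable[str]:
    """Hypothetical stand-in for a streaming call: yields an answer in pieces."""
    yield "Answer to: "
    yield question


def collect_answers(questions: List[str],
                    stream_fn: Callable[[str], Iterable[str]],
                    max_workers: int = 3) -> Dict[str, str]:
    """Run the streams concurrently and return {question: full_answer}."""
    def run(question: str) -> str:
        # Join fragments without printing, so concurrent output cannot interleave
        return "".join(stream_fn(question))

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        answers = list(pool.map(run, questions))  # map keeps the input order
    return dict(zip(questions, answers))


results = collect_answers(["What is AI?", "What is ML?"], fake_stream, max_workers=2)
for question, answer in results.items():
    print(f"{question} -> {answer}")
```

The trade-off is that nothing appears until an answer is complete, so this suits background jobs better than interactive use.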

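8. Summary

In this walkthrough we looked in depth at how to call the vLLM API for GLM-4.7-Flash from Python and get streaming responses. Let's review the key points:

8.1 Key takeaways

The value of streaming: a much better user experience and more natural, fluid AI interaction
The technical core: the stream=True parameter + SSE parsing + real-time output control
Strengths of GLM-4.7-Flash: the capability of a 30B-parameter model, combined with vLLM's efficient inference, gives streaming a solid foundation
Wide applicability: from customer service to code generation, streaming is useful in many scenarios

8.2 Practical advice

If you plan to use streaming responses in a real project, I have a few suggestions:

Start simple: implement the basic version first and make sure the core functionality is stable
Enhance step by step: add error handling, rate control, history management, and other features as you need them
Test performance: test under real load and tune the connection pool size and concurrency
Put user experience first: keep an eye on response speed and output quality, and keep improving

8.3 Where to go next

Once you have the basics down, you can explore further:

Front-end integration: combine streaming responses with a web front end to build a complete chat application
Multimodal extensions: if GLM-4.7-Flash supports multimodal input, try streaming with images and speech
Performance monitoring: add detailed performance metrics to understand how the model actually behaves
Custom tuning: adjust model parameters and generation strategy to fit your specific needs

Streaming is more than a technical feature; it marks an important step in how people interact with AI. You now have the key techniques for implementing it. Go build AI applications that impress your users!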

