DeepSeek-R1-Distill-Qwen-1.5B Hands-On Tutorial: Adding Voice Input/Output Modules to Build a Multimodal Assistant - 平芜编程栈

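1. Project overview: upgrading from text-only to multimodal

DeepSeek-R1-Distill-Qwen-1.5B is an ultra-lightweight conversational model that already runs well in a local environment. To make the assistant more practical, we give it "ears" and a "mouth": speech input and speech output.

Picture the scenario: instead of typing, you simply talk to the AI, and instead of only showing text, the AI answers you aloud. An assistant like that fits far more naturally into everyday use.

This tutorial walks step by step through adding voice features to an existing DeepSeek-R1 setup, producing an assistant that handles both text and speech. Model inference stays entirely on your machine. Note, however, that the two speech components used below, recognize_google and gTTS, call Google's online services, so the voice steps need a network connection and send audio or text off the device.

What the upgraded assistant can do:

Voice input: ask questions by speaking, no typing needed
Voice output: the AI answers in a natural-sounding voice
Text transcript: the conversation is also shown as text for easy review
Local inference: the DeepSeek model itself runs on your own computer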

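2. Environment setup: installing the speech libraries

Before adding voice features, install the speech-processing libraries. Open a terminal and run:

pip install SpeechRecognition pyaudio gTTS pygame

Each library has its own job:

SpeechRecognition: speech recognition, turning what you say into text
pyaudio: audio input/output support, used to record from the microphone
gTTS: text-to-speech, turning the AI's reply into audio
pygame: plays the generated audio file (the code in section 4 uses pygame.mixer)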

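Troubleshooting the installation:

If installing pyaudio fails, try the following for your platform:

# Windows
pip install pipwin
pipwin install pyaudio

# macOS
brew install portaudio
pip install pyaudio

# Linux (Debian/Ubuntu): the PortAudio headers are needed to build pyaudio
sudo apt-get install portaudio19-dev python3-pyaudio
pip install pyaudio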

Once everything is installed, a short test confirms that the environment works:

import speech_recognition as sr
print("SpeechRecognition version:", sr.__version__)

import pyaudio
print("PyAudio imported successfully")

from gtts import gTTS
print("gTTS is ready")

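3. Voice input module: letting the AI understand you

The core of the voice input module is converting your speech to text and handing that text to the DeepSeek model. We use the SpeechRecognition library for this.

3.1 Basic speech recognition

import threading

import speech_recognition as sr


class VoiceInput:
    def __init__(self):
        self.recognizer = sr.Recognizer()
        self.microphone = sr.Microphone()
        self.is_listening = False

        # Calibrate the energy threshold against ambient noise
        with self.microphone as source:
            self.recognizer.adjust_for_ambient_noise(source)

    def start_listening(self, callback):
        """Start listening in a background thread; callback(text) runs on that thread."""
        if self.is_listening:
            return
        # Set the flag before the thread starts so callers see the new state at once
        self.is_listening = True

        def listen_thread():
            while self.is_listening:
                try:
                    with self.microphone as source:
                        print("Speak now...")
                        audio = self.recognizer.listen(source, timeout=5, phrase_time_limit=10)

                    # recognize_google sends the audio to Google's Web Speech API (needs network)
                    text = self.recognizer.recognize_google(audio, language='zh-CN')
                    print(f"Recognized: {text}")
                    callback(text)
                except sr.WaitTimeoutError:
                    continue
                except sr.UnknownValueError:
                    print("Could not understand the audio")
                except Exception as e:
                    print(f"Recognition error: {e}")

        thread = threading.Thread(target=listen_thread, daemon=True)
        thread.start()

    def stop_listening(self):
        """Stop listening (the loop exits once the current listen call returns)."""
        self.is_listening = False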

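3.2 Integrating with the Streamlit interface

Now we add voice input to the existing chat interface. The recognition callback runs on the listener thread, and Streamlit's session state should only be touched from the script thread, so the callback hands its result over through a thread-safe queue that the script drains on its next run:

import queue

import streamlit as st

# Initialize the voice input module and a queue for recognized text
if 'voice_input' not in st.session_state:
    st.session_state.voice_input = VoiceInput()
if 'voice_queue' not in st.session_state:
    st.session_state.voice_queue = queue.Queue()

# Local reference: the listener thread must not touch st.session_state
voice_queue = st.session_state.voice_queue

def voice_callback(text):
    """Recognition callback; runs on the listener thread."""
    voice_queue.put(text)

def drain_voice_queue():
    """Run on the script thread, before the text input widget is created."""
    while not voice_queue.empty():
        st.session_state.user_input = voice_queue.get_nowait()
        # Send automatically if enabled (process_user_input is defined in section 4.2)
        if st.session_state.get('auto_send', True):
            process_user_input()

# Voice controls in the sidebar
with st.sidebar:
    st.header("🎤 Voice settings")
    st.session_state.auto_send = st.checkbox("Send recognized text automatically", value=True)

    if st.button("🎤 Start voice input"):
        if not st.session_state.voice_input.is_listening:
            st.session_state.voice_input.start_listening(voice_callback)
            st.success("Voice listening started")

    if st.button("⏹️ Stop voice input"):
        st.session_state.voice_input.stop_listening()
        st.info("Voice listening stopped")

# Streamlit re-runs the script only on interaction, so queued text is picked up on the next rerun
drain_voice_queue()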
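The handoff between the listener thread and the main script can be tried without a microphone or Streamlit. The following is a minimal sketch of that pattern using only the standard library; fake_recognizer and drain are hypothetical names introduced for illustration, with fake_recognizer standing in for the real listening loop.

```python
import queue
import threading
import time


def fake_recognizer(callback, phrases, stop_event):
    """Stand-in for the listening thread: 'recognizes' each phrase, then exits."""
    for phrase in phrases:
        if stop_event.is_set():
            break
        time.sleep(0.01)  # pretend to listen for a moment
        callback(phrase)  # runs on the worker thread, like the real callback


def drain(q):
    """Collect everything currently queued without blocking (main thread only)."""
    items = []
    while True:
        try:
            items.append(q.get_nowait())
        except queue.Empty:
            return items


results = queue.Queue()
stop_event = threading.Event()
worker = threading.Thread(
    target=fake_recognizer,
    args=(results.put, ["你好", "今天天气怎么样"], stop_event),
    daemon=True,
)
worker.start()
worker.join()  # a real app would poll on each rerun instead of joining
recognized = drain(results)
print(recognized)
```

The worker only ever calls results.put, so all UI state stays on the main thread; queue.Queue handles the locking.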

4. Voice output module: letting the AI speak

The voice output module converts the AI's text reply into speech and plays it. We use gTTS (Google Text-to-Speech) to generate the audio file and pygame's mixer to play it (run pip install pygame if it is not installed yet).

4.1 Text-to-speech implementation

import os
import tempfile
import threading

import pygame
from gtts import gTTS


class VoiceOutput:
    def __init__(self):
        pygame.mixer.init()
        self.is_playing = False

    def text_to_speech(self, text, lang='zh-CN'):
        """Convert text to speech and play it."""
        try:
            # gTTS sends the text to Google's TTS service (needs network)
            tts = gTTS(text=text, lang=lang, slow=False)

            # Use a uniquely named temp file so overlapping calls do not collide
            with tempfile.NamedTemporaryFile(suffix=".mp3", delete=False) as f:
                tts.write_to_fp(f)
                audio_file = f.name

            self.play_audio(audio_file)
        except Exception as e:
            print(f"Speech synthesis failed: {e}")

    def play_audio(self, audio_file):
        """Play an audio file in a background thread, then delete it."""
        # Set the flag before the thread starts so callers can wait on it reliably
        self.is_playing = True

        def play_thread():
            try:
                pygame.mixer.music.load(audio_file)
                pygame.mixer.music.play()
                while pygame.mixer.music.get_busy():
                    pygame.time.wait(100)
            finally:
                # Release the file before deleting it (music.unload needs pygame 2.0+)
                pygame.mixer.music.unload()
                self.is_playing = False
                if os.path.exists(audio_file):
                    os.remove(audio_file)

        thread = threading.Thread(target=play_thread, daemon=True)
        thread.start()

    def stop_audio(self):
        """Stop the current playback."""
        if self.is_playing:
            pygame.mixer.music.stop()
            self.is_playing = False

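4.2 Integrating voice output into the chat flow

After the AI generates a reply, speech playback is triggered automatically:

# Initialize the voice output module
if 'voice_output' not in st.session_state:
    st.session_state.voice_output = VoiceOutput()

def process_user_input():
    """Handle the user's input and generate a reply."""
    user_input = st.session_state.user_input
    if not user_input:
        return

    # Add to the chat history
    st.session_state.messages.append({"role": "user", "content": user_input})

    # Generate the reply with the existing DeepSeek model
    # (generate_ai_response comes from the original text-only app)
    with st.spinner("Thinking..."):
        ai_response = generate_ai_response(user_input)

    # Add to the chat history
    st.session_state.messages.append({"role": "assistant", "content": ai_response})

    # Clear the input box. Streamlit only allows this before the widget is
    # created in the current run, or from inside a widget callback.
    st.session_state.user_input = ""

    # Read the reply aloud
    if st.session_state.get('voice_output_enabled', True):
        st.session_state.voice_output.text_to_speech(ai_response)

# Voice output controls in the sidebar
with st.sidebar:
    st.session_state.voice_output_enabled = st.checkbox("Enable voice output", value=True)
    if st.button("🔇 Stop current speech"):
        st.session_state.voice_output.stop_audio()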
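DeepSeek-R1 style models usually emit their reasoning inside <think>...</think> tags and often format answers with Markdown, neither of which reads well aloud. If generate_ai_response returns the raw model output (an assumption about your existing app), a small cleanup step before text_to_speech may help. The sketch below uses only the standard library; clean_for_speech is a hypothetical helper name, and the set of stripped characters is a judgment call rather than a complete Markdown parser.

```python
import re


def clean_for_speech(text):
    """Return text with reasoning blocks and common Markdown markup removed."""
    # Drop <think>...</think> reasoning blocks (DOTALL lets '.' span newlines)
    text = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL)
    # Drop fenced code blocks, which rarely make sense when spoken
    text = re.sub(r"`{3}.*?`{3}", "", text, flags=re.DOTALL)
    # Strip emphasis markers, inline-code backticks and heading hashes
    text = re.sub(r"[*_`#]+", "", text)
    # Collapse the whitespace left behind
    return re.sub(r"\s+", " ", text).strip()


raw = "<think>用户在打招呼。</think>\n**你好！** 我是`DeepSeek`助手。"
spoken = clean_for_speech(raw)
print(spoken)
```

In process_user_input, the call would then become text_to_speech(clean_for_speech(ai_response)), while the chat history keeps the original formatted reply.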

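5. Full integration: building the multimodal chat interface

Now we combine everything into one complete chat interface. Recognized speech reaches the script through a queue, the text box submits through an on_change callback (so that clearing it is allowed), and the script polls once a second while listening is active:

import queue
import time

import streamlit as st

from voice_input import VoiceInput
from voice_output import VoiceOutput

# process_user_input() from section 4.2 and generate_ai_response() from the
# existing app are assumed to be defined or imported before this point.

# Initialize session state
if 'messages' not in st.session_state:
    st.session_state.messages = []
if 'voice_input' not in st.session_state:
    st.session_state.voice_input = VoiceInput()
if 'voice_output' not in st.session_state:
    st.session_state.voice_output = VoiceOutput()
if 'voice_output_enabled' not in st.session_state:
    st.session_state.voice_output_enabled = True
if 'voice_queue' not in st.session_state:
    st.session_state.voice_queue = queue.Queue()

# The listener thread writes to this queue; it must not touch st.session_state
voice_queue = st.session_state.voice_queue

# Pick up recognized speech before the input widget is created
while not voice_queue.empty():
    st.session_state.user_input = voice_queue.get_nowait()
    if st.session_state.get('auto_send', True):
        process_user_input()

# Page title and description
st.title("🎤 DeepSeek Multimodal Assistant")
st.markdown("""
A local chat assistant with **voice input** and **voice output**

- 🎤 Click the microphone button to start speaking
- 🔊 The AI answers your question aloud
- 📝 The text transcript is shown alongside
- 🔒 The model itself runs on your machine
""")

# Chat history
for message in st.session_state.messages:
    with st.chat_message(message["role"]):
        st.markdown(message["content"])

# Input area at the bottom
col1, col2 = st.columns([6, 1])
with col1:
    # on_change runs process_user_input as a widget callback, where
    # clearing st.session_state.user_input is permitted
    st.text_input(
        "Type your question or click the microphone to speak...",
        key="user_input",
        label_visibility="collapsed",
        on_change=process_user_input,
    )
with col2:
    if st.button("🎤", help="Start voice input"):
        if not st.session_state.voice_input.is_listening:
            st.session_state.voice_input.start_listening(voice_queue.put)

# Sidebar control panel
with st.sidebar:
    st.header("⚙️ Control panel")

    # Voice settings
    st.subheader("🎤 Voice settings")
    st.session_state.voice_output_enabled = st.checkbox("Enable voice output", value=True)
    st.session_state.auto_send = st.checkbox("Send speech automatically", value=True)

    # Voice control buttons
    col1, col2 = st.columns(2)
    with col1:
        if st.button("Start listening", type="primary"):
            if not st.session_state.voice_input.is_listening:
                st.session_state.voice_input.start_listening(voice_queue.put)
    with col2:
        if st.button("Stop listening"):
            st.session_state.voice_input.stop_listening()

    if st.button("Stop current speech"):
        st.session_state.voice_output.stop_audio()

    # Clear the chat history
    if st.button("🧹 Clear conversation"):
        st.session_state.messages = []
        st.session_state.voice_output.stop_audio()
        st.rerun()

    # System status
    st.subheader("📊 System status")
    st.write(f"Listening: {'running' if st.session_state.voice_input.is_listening else 'stopped'}")
    st.write(f"Playback: {'running' if st.session_state.voice_output.is_playing else 'idle'}")
    st.write(f"Messages: {len(st.session_state.messages)}")

# Streamlit re-runs only on interaction; poll while listening so queued speech is picked up
if st.session_state.voice_input.is_listening:
    time.sleep(1)
    st.rerun()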

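6. Tips and best practices

6.1 Improving recognition accuracy

import speech_recognition as sr


def optimize_voice_recognition():
    """Return a recognizer and microphone tuned for better recognition."""
    recognizer = sr.Recognizer()
    microphone = sr.Microphone()

    # Calibrate the ambient-noise threshold over a 2-second sample
    with microphone as source:
        recognizer.adjust_for_ambient_noise(source, duration=2)

    # Recognition parameters
    recognizer.dynamic_energy_threshold = True
    recognizer.pause_threshold = 0.8   # seconds of silence that end a phrase
    recognizer.phrase_threshold = 0.3  # minimum seconds of speech to count as a phrase

    return recognizer, microphone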

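6.2 Handling long replies

For longer AI replies, synthesize and play the text in chunks to avoid problems with generating one very long audio file. Note that speak_long_text blocks the calling thread until playback finishes:

import time

import streamlit as st


def split_long_text(text, max_length=100):
    """Split long text into chunks suitable for speech synthesis."""
    chunks = []
    current_chunk = ""

    for sentence in text.split('。'):
        sentence = sentence.strip()
        if not sentence:
            continue  # skip the empty piece left after a trailing '。'
        if len(current_chunk) + len(sentence) < max_length:
            current_chunk += sentence + "。"
        else:
            if current_chunk:
                chunks.append(current_chunk)
            current_chunk = sentence + "。"

    if current_chunk:
        chunks.append(current_chunk)
    return chunks


def speak_long_text(text):
    """Play long text chunk by chunk."""
    voice_output = st.session_state.voice_output
    for chunk in split_long_text(text):
        voice_output.text_to_speech(chunk)
        # Give the playback thread a moment to start, then wait for it to finish
        time.sleep(0.2)
        while voice_output.is_playing:
            time.sleep(0.1)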
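Replies rarely end every sentence with '。'; questions and exclamations are common, and mixed Chinese/English answers use Western punctuation too. As one possible variation on the splitter above, the sketch below breaks on several sentence-ending marks with a regular expression and keeps each mark attached to its sentence. The names split_sentences and chunk_text are hypothetical, and the ASCII full stop is deliberately left out so that strings such as "1.5B" stay intact.

```python
import re

# A run of non-terminator characters followed by any terminators that close it
_SENTENCE = re.compile(r"[^。！？!?]+[。！？!?]*")


def split_sentences(text):
    """Split text into sentences, keeping the closing punctuation on each one."""
    return [s.strip() for s in _SENTENCE.findall(text) if s.strip()]


def chunk_text(text, max_length=100):
    """Group whole sentences into chunks of at most max_length characters.

    A single sentence longer than max_length is kept as its own chunk.
    """
    chunks, current = [], ""
    for sentence in split_sentences(text):
        if current and len(current) + len(sentence) > max_length:
            chunks.append(current)
            current = sentence
        else:
            current += sentence
    if current:
        chunks.append(current)
    return chunks


sample = "你好！今天天气不错。我们出发吧"
sentences = split_sentences(sample)
chunks = chunk_text(sample, max_length=10)
print(sentences)
print(chunks)
```

Each chunk can then be passed to text_to_speech in turn, in the same way as the chunks from split_long_text.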

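6.3 Adding audio feedback cues

To improve the user experience, you can add short notification sounds:

import os

import pygame


def play_notification_sound(sound_type):
    """Play a notification sound (pygame.mixer.init() must have been called first)."""
    sounds = {
        "start": "audio/start.wav",
        "stop": "audio/stop.wav",
        "error": "audio/error.wav",
    }

    path = sounds.get(sound_type)
    if path and os.path.exists(path):
        pygame.mixer.Sound(path).play()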
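The function above expects WAV files such as audio/start.wav to exist already. If you have no sound files at hand, placeholder tones can be generated with the standard library alone. This is a minimal sketch: write_tone is a hypothetical helper, and the frequencies and durations are arbitrary choices, not values from the original project. The demo writes into a temporary directory; in the app you would pass paths like "audio/start.wav" instead.

```python
import math
import os
import struct
import tempfile
import wave


def write_tone(path, freq_hz=880.0, duration_s=0.15, sample_rate=44100, volume=0.4):
    """Write a mono 16-bit PCM sine tone to path and return the frame count."""
    n_frames = round(sample_rate * duration_s)
    frames = bytearray()
    for i in range(n_frames):
        sample = volume * math.sin(2 * math.pi * freq_hz * i / sample_rate)
        frames += struct.pack("<h", int(sample * 32767))  # little-endian int16

    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)        # mono
        wf.setsampwidth(2)        # 2 bytes per sample = 16-bit
        wf.setframerate(sample_rate)
        wf.writeframes(bytes(frames))
    return n_frames


# Demo: a higher tone for "start" and a lower one for "stop"
out_dir = tempfile.mkdtemp()
start_path = os.path.join(out_dir, "audio", "start.wav")
stop_path = os.path.join(out_dir, "audio", "stop.wav")
start_frames = write_tone(start_path, freq_hz=880.0)
stop_frames = write_tone(stop_path, freq_hz=440.0)
print(start_path, start_frames)
```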