news 2026/4/15 8:04:14

AutoGen Studio Step by Step: Adding a Tool in Team Builder and Authorizing Qwen3-4B to Call It

张小明 (Zhang Xiaoming), front-end development engineer

1. What Is AutoGen Studio: A Low-Code Tool for Building AI Agent Teams

AutoGen Studio is not a development environment where you write everything from scratch; it is a low-code interface designed to get AI applications running quickly. It turns multi-agent collaboration workflows that would otherwise demand heavy engineering into something you assemble by clicking through a UI and filling in parameters.

Think of it as an "AI agent assembly shop": you don't hand-write Agent classes, wrestle with message-routing logic, or repeatedly debug LLM call chains. You define roles (say, a product manager, an engineer, and a tester), configure the model behind each one, equip them with working tools (weather lookup, web search, Python code execution), and let them sit around a table and "hold a meeting" — the task moves forward on its own.

Under the hood it is built on AutoGen AgentChat, Microsoft's open-source multi-agent framework, which has been widely validated in industry. AutoGen Studio adds one crucial layer of packaging: API calls, state management, conversation-history visualization, and tool registration with permission control — the error-prone, highly repetitive parts — are all folded into a graphical interface. For developers, this means going from "writing framework code" back to "thinking about the business problem"; for product or business people with no technical background, a first-time user can get a three-agent collaboration flow with tool calls running in about 30 minutes.

More importantly, it is not a toy demo platform. It supports connecting to a local vLLM service, is compatible with the OpenAI API format, allows custom Tool Schemas, and provides full session replay and log export — capabilities that let it fit into day-to-day development workflows and serve as a lightweight hub for putting AI into production.
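To make "tool" concrete before we touch the UI: in AutoGen Studio a tool is typically just a Python function whose name, type hints, and docstring are exposed to the model as the Tool Schema. The sketch below is a hypothetical example — `get_weather` and its canned data are illustrative, not a built-in of AutoGen Studio:

```python
# Hypothetical tool definition: a plain Python function. The function
# name, parameter types, and docstring become the schema the model sees.
# (get_weather and its fake data are illustrative, not part of any API.)

def get_weather(city: str, unit: str = "celsius") -> str:
    """Return a short weather report for the given city.

    Args:
        city: City name, e.g. "Beijing".
        unit: "celsius" or "fahrenheit".
    """
    # A real tool would call a weather API here; this stub returns
    # canned data so the tool-calling flow can be exercised offline.
    fake_db = {"Beijing": 21, "Shanghai": 24}
    temp = fake_db.get(city)
    if temp is None:
        return f"No data for {city}"
    if unit == "fahrenheit":
        temp = temp * 9 / 5 + 32
    return f"{city}: {temp:.0f}°{'F' if unit == 'fahrenheit' else 'C'}"
```

When the model decides to call the tool, the framework invokes the function with the arguments the model produced and feeds the return string back into the conversation.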

2. Qwen3-4B-Instruct-2507 Pre-Deployed via vLLM: A Ready-to-Use Chinese Reasoning Base

This walkthrough assumes a pre-provisioned environment: a Qwen3-4B-Instruct-2507 model service has already been deployed in the system via vLLM. This model is the latest 4B-parameter instruction-tuned release in the Tongyi Qianwen (Qwen) series. Compared with its predecessor, it shows clear gains in long-form Chinese comprehension, multi-step logical reasoning, and tool-call intent recognition. It is particularly well suited to serve as an agent's "brain": fast responses (first-token latency under 120 ms with vLLM), a stable context window (32K tokens), and accurate instruction following (the Instruct suffix marks it as optimized for dialogue).

It is not a black-box API hanging in the cloud; it runs in a container on your own machine. That means:

  • No network dependency — data never leaves the local environment;
  • Full control over inputs and outputs, which simplifies debugging and auditing;
  • Freedom to extend the toolchain without third-party platform limits;
  • A clear upgrade path — you can later swap in Qwen3-14B or a mixture-of-experts model seamlessly.

Everything below builds on this ready-made service. We will not train the model, tune inference parameters, or touch Docker commands. The focus is a single question: how to get the agents inside AutoGen Studio to actually use this local LLM, and how to give them the ability to call external tools.
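Since the service speaks the OpenAI API format, AutoGen Studio talks to it the same way any OpenAI-compatible client would: a POST to `/v1/chat/completions`. The sketch below only builds and prints the request payload rather than sending it, since the service may not be running where this snippet executes; the base URL (vLLM's default port 8000) and the model name string are assumptions for this environment — check your vLLM launch arguments:

```python
import json

# Assumed endpoint: vLLM serves an OpenAI-compatible API, by default on
# port 8000. Verify the actual port and served model name in your setup.
BASE_URL = "http://localhost:8000/v1"

# The same request shape AutoGen Studio sends on the agent's behalf.
payload = {
    "model": "Qwen3-4B-Instruct-2507",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Introduce AutoGen Studio in one sentence."},
    ],
    "temperature": 0.7,
}

# POSTing json.dumps(payload) to f"{BASE_URL}/chat/completions" would
# return a standard OpenAI-format response object.
print(json.dumps(payload, ensure_ascii=False, indent=2))
```

If this payload round-trips successfully with `curl` or a Python client, AutoGen Studio's model configuration will work with the same base URL and model name.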

3. Verify the Model Service: Confirm vLLM Is Ready

Before configuring anything in AutoGen Studio, make sure the underlying model service is actually running. This is not an optional step — it is the precondition that keeps every later configuration step from failing.

Open a terminal and inspect the vLLM startup log with:

cat /root/workspace/llm.log

You should see output resembling the following fragment:

INFO 01-26 14:22:33 [config.py:629] Using device: cuda
INFO 01-26 14:22:33 [config.py:630] Using dtype: bfloat16
INFO 01-26 14:22:33 [config.py:631] Using kv cache dtype: auto
INFO 01-26 14:22:33 [config.py:632] Using quantization: None
INFO 01-26 14:22:33 [config.py:633] Using tensor parallel size: 1
INFO 01-26 14:22:33 [config.py:634] Using pipeline parallel size: 1
INFO 01-26 14:22:33 [config.py:635] Using distributed executor backend: ray
INFO 01-26 14:22:33 [config.py:636] Using max model len: 32768
INFO 01-26 14:22:33 [config.py:637] Using enable prefix caching: False
INFO 01-26 14:22:33 [config.py:638] Using disable custom all reduce: False
INFO 01-26 14:22:33 [config.py:639] Using tokenizer pool size: 0
INFO 01-26 14:22:33 [config.py:640] Using tokenizer pool type: None
INFO 01-26 14:22:33 [config.py:641] Using tokenizer pool extra config: None
INFO 01-26 14:22:33 [config.py:642] Using enable lora: False
INFO 01-26 14:22:33 [config.py:643] Using max loras: 1
INFO 01-26 14:22:33 [config.py:644] Using max lora rank: 16
INFO 01-26 14:22:33 [config.py:645] Using lora extra vocab size: 256
INFO 01-26 14:22:33 [config.py:646] Using long lora scaling factors: None
INFO 01-26 14:22:33 [config.py:647] Using fully sharded loras: False
INFO 01-26 14:22:33 [config.py:648] Using enable prompt adapter: False
INFO 01-26 14:22:33 [config.py:649] Using max prompt adapters: 1
INFO 01-26 14:22:33 [config.py:650] Using max prompt adapter token: 100

The lines that matter most are `Using device: cuda` (the GPU is in use), `Using dtype: bfloat16`, and `Using max model len: 32768` (the 32K context window). If these appear without errors, the service started correctly.
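Rather than eyeballing the whole log, you can pull out just the fields that matter with a short script. A minimal sketch, assuming the `Using <key>: <value>` line format shown above (the sample lines are copied from the log excerpt; in practice you would read `/root/workspace/llm.log`):

```python
import re

# Sample lines copied from the vLLM log excerpt above.
sample = """\
INFO 01-26 14:22:33 [config.py:629] Using device: cuda
INFO 01-26 14:22:33 [config.py:630] Using dtype: bfloat16
INFO 01-26 14:22:33 [config.py:636] Using max model len: 32768
"""

def parse_vllm_config(text: str) -> dict:
    """Return a {key: value} dict from 'Using <key>: <value>' log lines."""
    pattern = re.compile(r"Using ([\w ]+?): (\S+)")
    return {m.group(1): m.group(2) for m in pattern.finditer(text)}

cfg = parse_vllm_config(sample)
# In practice:
# cfg = parse_vllm_config(open("/root/workspace/llm.log").read())
print(cfg)
```

A quick check like `cfg["max model len"] == "32768"` confirms the context window matches what the agents will rely on later.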