AutoGen Studio Step by Step: Adding a Tool in Team Builder and Authorizing Qwen3-4B to Call It
1. What AutoGen Studio Is: A Low-Code Tool for Building AI Agent Teams
AutoGen Studio is not a development environment where you write everything from scratch; it is a low-code interface designed for getting AI applications running quickly. It turns multi-agent collaboration flows that would otherwise need a lot of engineering scaffolding into something you assemble by clicking through a UI and filling in parameters.
You can think of it as an "AI agent assembly shop": you do not hand-write Agent classes, wrestle with message-routing logic, or repeatedly debug LLM call chains. You simply define roles (say, a product manager, an engineer, and a tester), configure the model behind each one, equip them with tools that do real work (checking the weather, searching the web, running Python code), and let them sit around one table and "hold a meeting" until the task moves forward on its own.
Under the hood it is built on AutoGen AgentChat, Microsoft's open-source multi-agent framework that has been widely validated in industry. AutoGen Studio adds one crucial layer on top: it folds the error-prone, repetitive plumbing (API calls, state management, conversation-history visualization, tool registration, and permission control) into a graphical interface. For developers, this means going back from "writing framework code" to "thinking about the business problem"; for product or business people with no technical background, even a first session can get a three-agent collaboration with tool calling running within 30 minutes.
More importantly, it is not a toy demo platform. It supports connecting to a local vLLM service, is compatible with OpenAI-format APIs, allows custom Tool Schemas, and provides full Session replay and log export. These capabilities let it fit into day-to-day development work as a lightweight hub for putting AI into production.
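To make "a tool that does real work" concrete before we start clicking: in AutoGen Studio, a tool is essentially a typed Python function with a docstring, from which a tool schema can be derived and shown in Team Builder. The sketch below is a hypothetical example; the name `get_weather`, its parameters, and the canned return value are illustrations, not part of this environment.

```python
# A minimal sketch of a tool as it could be registered in AutoGen Studio's Team Builder.
# The function name, parameters, and return value are hypothetical examples; the tool
# schema is derived from the type hints and the docstring.

def get_weather(city: str, unit: str = "celsius") -> str:
    """Return a short weather summary for the given city.

    Args:
        city: Name of the city to query.
        unit: Temperature unit, either "celsius" or "fahrenheit".
    """
    # A real tool would call a weather API here; a fixed string keeps the
    # example self-contained and runnable.
    return f"The weather in {city} is sunny, 25 degrees {unit}."
```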
2. Qwen3-4B-Instruct-2507 Deployed with the Built-in vLLM: An Out-of-the-Box Backbone with Strong Chinese Reasoning
This walkthrough assumes a pre-provisioned environment: a Qwen3-4B-Instruct-2507 model service has already been deployed in the system via a one-click vLLM setup. The model is the latest 4B-scale instruction-tuned release in the Tongyi Qianwen (Qwen) series. Compared with its predecessor, it shows clear gains in Chinese long-text understanding, multi-step logical reasoning, and recognition of tool-calling intent. It is especially well suited to serve as an agent's "brain": fast responses (first-token latency under 120 ms with vLLM), a stable context window (32K tokens), and accurate instruction following (the Instruct suffix indicates it is tuned specifically for dialogue).
It is not a black-box API hanging in the cloud; it is a service running in a container on your own machine. That means:
- No network dependency; data never leaves your machine;
- Full control over inputs and outputs, which makes debugging and auditing easier;
- Freedom to extend the toolchain without being limited by a third-party platform's feature set;
- A clear upgrade path: later you can switch seamlessly to Qwen3-14B or a mixture-of-experts model.
Everything below builds on this already-running service. We will not train the model, change inference parameters, or touch Docker commands. We focus on one thing only: getting the agents inside AutoGen Studio to actually "use" this local model, and giving them the ability to call external tools.
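For orientation, that connection amounts to an OpenAI-compatible model client pointed at the local vLLM endpoint. The sketch below uses AutoGen's `autogen_ext` client as an illustration of the same settings you will later fill into the model form in AutoGen Studio; the base URL with port 8000, the placeholder API key, and the `model_info` flags are assumptions (vLLM's defaults and typical values for this model), not values read from this environment.

```python
# A minimal sketch, assuming the vLLM OpenAI-compatible server listens on
# localhost:8000 (vLLM's default) and serves the model under this name.
from autogen_ext.models.openai import OpenAIChatCompletionClient

model_client = OpenAIChatCompletionClient(
    model="Qwen3-4B-Instruct-2507",       # must match the model id vLLM reports
    base_url="http://localhost:8000/v1",  # assumed local vLLM endpoint
    api_key="EMPTY",                      # vLLM does not check the key by default
    model_info={                          # assumed capability flags for a non-OpenAI model
        "vision": False,
        "function_calling": True,         # required so agents can route tool calls
        "json_output": True,
        "structured_output": True,
        "family": "unknown",
    },
)
```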
3. Verify the Model Service Status: Confirm vLLM Is Ready
Before configuring anything in AutoGen Studio, make sure the underlying model service is actually running. This is not an optional step; it is the basic precondition that keeps every later configuration step from failing.
Open a terminal and run the following command to inspect the vLLM startup log:
```bash
cat /root/workspace/llm.log
```

You should see output similar to this excerpt:
```
INFO 01-26 14:22:33 [config.py:629] Using device: cuda
INFO 01-26 14:22:33 [config.py:630] Using dtype: bfloat16
INFO 01-26 14:22:33 [config.py:631] Using kv cache dtype: auto
INFO 01-26 14:22:33 [config.py:632] Using quantization: None
INFO 01-26 14:22:33 [config.py:633] Using tensor parallel size: 1
INFO 01-26 14:22:33 [config.py:634] Using pipeline parallel size: 1
INFO 01-26 14:22:33 [config.py:635] Using distributed executor backend: ray
INFO 01-26 14:22:33 [config.py:636] Using max model len: 32768
INFO 01-26 14:22:33 [config.py:637] Using enable prefix caching: False
INFO 01-26 14:22:33 [config.py:638] Using disable custom all reduce: False
INFO 01-26 14:22:33 [config.py:639] Using tokenizer pool size: 0
INFO 01-26 14:22:33 [config.py:640] Using tokenizer pool type: None
INFO 01-26 14:22:33 [config.py:641] Using tokenizer pool extra config: None
INFO 01-26 14:22:33 [config.py:642] Using enable lora: False
INFO 01-26 14:22:33 [config.py:643] Using max loras: 1
INFO 01-26 14:22:33 [config.py:644] Using max lora rank: 16
INFO 01-26 14:22:33 [config.py:645] Using lora extra vocab size: 256
INFO 01-26 14:22:33 [config.py:646] Using long lora scaling factors: None
INFO 01-26 14:22:33 [config.py:647] Using fully sharded loras: False
INFO 01-26 14:22:33 [config.py:648] Using enable prompt adapter: False
INFO 01-26 14:22:33 [config.py:649] Using max prompt adapters: 1
INFO 01-26 14:22:33 [config.py:650] Using max prompt adapter token: 100
...
```
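Beyond reading the log, you can confirm readiness end to end by hitting the service's OpenAI-compatible HTTP endpoint directly. The sketch below assumes vLLM's default port 8000 and a placeholder API key, and uses the official `openai` Python package; if it prints a model id and a short reply, the service is ready for AutoGen Studio.

```python
# A minimal readiness check, assuming the vLLM OpenAI-compatible server is on
# localhost:8000 (the default) and no real API key is required.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# List the models the server exposes; the Qwen3-4B-Instruct-2507 entry should appear.
for model in client.models.list():
    print(model.id)

# Send a one-line chat request to confirm the model actually generates tokens.
reply = client.chat.completions.create(
    model="Qwen3-4B-Instruct-2507",  # use whatever id the listing above printed
    messages=[{"role": "user", "content": "Reply with the single word: ready"}],
    max_tokens=8,
)
print(reply.choices[0].message.content)
```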