GLM-4.7-Flash in Practice: Wrapping the GLM-4.7-Flash API with FastAPI and Adding Authentication Middleware - 平芜编程栈

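1. Why wrap the API yourself? Isn't native vLLM enough?

You may have noticed that the GLM-4.7-Flash service preinstalled in the CSDN Xingtu (星图) image already ships with an OpenAI-compatible endpoint (http://127.0.0.1:8000/v1/chat/completions) and Swagger docs (/docs). It works out of the box, and even the web UI is set up. So why bother wrapping it in another FastAPI layer?

The answer is practical: a production environment is not a development environment.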

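The native vLLM endpoint is standard and efficient, but by default it has no authentication, no rate limiting, no request audit logging, no unified error codes, and no business-level parameter validation or context enrichment. For example:

If your front end talks to the vLLM address directly, the model service is exposed to the public internet with no protection;
When several teams share one GPU node, nobody knows who is flooding it with requests;
If a user sends an overly long prompt or a malicious payload, vLLM accepts it as-is, which can slow down the whole machine;
You may want to return watermarked replies to certain customers, or append a disclaimer automatically. None of that logic fits into vLLM's configuration.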

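FastAPI is not there to "replace" vLLM. It is the controllable, manageable, auditable business gateway in front of it. Think of fitting a high-performance sports car with a steering wheel, an accelerator pedal, a dashboard, and a dashcam: the engine is still vLLM, but control, monitoring, and accountability are back in your hands.

This tutorial skips the theory and walks through the implementation step by step: start a FastAPI service with authentication from scratch, connect it to the local vLLM, support Bearer Token authentication, rate limiting, and request logging, and end up with code you can run directly.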

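2. Environment setup and dependencies

We assume you have already launched the GLM-4.7-Flash image on the CSDN Xingtu platform (including the vLLM inference service and the web UI), and that curl http://127.0.0.1:8000/health confirms vLLM is running.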
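If you would rather script that pre-flight check than run curl by hand, something like the following might do. It is a minimal sketch using only the standard library; the function name vllm_is_healthy and the injectable opener parameter are illustrative additions, not part of the image or of vLLM.

```python
import urllib.error
import urllib.request


def vllm_is_healthy(base_url: str, timeout: float = 3.0,
                    opener=urllib.request.urlopen) -> bool:
    """Return True if GET {base_url}/health answers with HTTP 200.

    `opener` is injectable so the function can be exercised without a
    live server (hypothetical convenience, not a vLLM API).
    """
    try:
        with opener(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or a non-2xx HTTPError
        return False
```

Calling vllm_is_healthy("http://127.0.0.1:8000") before starting the gateway gives a quick yes/no answer that a startup script can act on.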

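Next, create a separate Python environment on the same machine so that it does not conflict with the system or vLLM environment:

    # Create a virtual environment (conda or venv both work)
    python -m venv ./glm47flash-api-env
    source ./glm47flash-api-env/bin/activate      # Linux/Mac
    # .\glm47flash-api-env\Scripts\activate       # Windows

    # Install the core dependencies (the quotes keep shells such as zsh from expanding the brackets)
    pip install --upgrade pip
    pip install fastapi uvicorn "python-jose[cryptography]" passlib python-multipart requests loguru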

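Notes:
fastapi + uvicorn are the web framework and the ASGI server;
python-jose handles JWT generation and verification (the static-key example below does not call it; it is installed for later extension);
passlib helps with password hashing (this example has no password login, but it leaves room to grow);
loguru replaces the standard logging module and is simpler to use;
requests sends HTTP calls to the local vLLM.

There is no need to install PyTorch, transformers, or vLLM. The image already ships them, and we are only building the API proxy layer.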

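3. Auth design and token management

We use the common Bearer Token scheme with pre-issued API keys: lightweight, easy to integrate, and with no session state to manage. JWT (JSON Web Token) is a natural next step, which is why python-jose is installed, but the code in this tutorial checks keys against an allowlist and does not issue or verify JWTs. The flow is:

Operations staff generate a set of long-lived API keys in advance (e.g. sk-glm47flash-prod-xxxxx);
The keys are stored in an in-memory dict or a simple file (Redis or a database is recommended in production);
Clients call the API with Authorization: Bearer sk-glm47flash-prod-xxxxx;
A FastAPI dependency extracts the token and checks that it is in the key pool (expiry is not implemented in this example);
If the check passes, the request goes through; otherwise the gateway returns 401 Unauthorized.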
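The first two steps of that flow (generating keys and storing them) could be sketched as follows. This is an illustration under stated assumptions rather than part of the tutorial's code: generate_api_key, fingerprint, and lookup_team are hypothetical helper names, and storing SHA-256 fingerprints instead of plaintext keys is an extra precaution the tutorial itself does not take.

```python
import hashlib
import hmac
import secrets
from typing import Dict, Optional

KEY_PREFIX = "sk-glm47flash"  # prefix borrowed from the tutorial's example keys


def generate_api_key(env: str) -> str:
    """Create a random key shaped like 'sk-glm47flash-<env>-<random>'."""
    return f"{KEY_PREFIX}-{env}-{secrets.token_urlsafe(24)}"


def fingerprint(key: str) -> str:
    """SHA-256 hex digest, so the key pool never holds plaintext keys."""
    return hashlib.sha256(key.encode("utf-8")).hexdigest()


def lookup_team(key: str, pool: Dict[str, str]) -> Optional[str]:
    """Return the team that owns `key`, or None if the key is unknown.

    `pool` maps fingerprint -> team. compare_digest keeps each comparison
    constant-time.
    """
    candidate = fingerprint(key)
    for stored, team in pool.items():
        if hmac.compare_digest(candidate, stored):
            return team
    return None


# Usage: issue a key once, hand it to the team, keep only the fingerprint.
new_key = generate_api_key("prod")
key_pool = {fingerprint(new_key): "production-team"}
```

With this layout, a leaked copy of the key pool does not reveal usable keys, at the cost of not being able to show a team its key again after issuance.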

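We deliberately skip the full OAuth2 flow and user registration or login, because this is a model-service gateway, not a user management system. The priorities are speed, stability, and traceability.

Create auth.py and define the auth logic:

    # auth.py
    from typing import Optional

    from fastapi import Depends, HTTPException, status
    from fastapi.security import HTTPAuthorizationCredentials, HTTPBearer
    from loguru import logger

    # Simulated API key pool (replace with a database/Redis in production)
    API_KEYS = {
        "sk-glm47flash-prod-2024": "production-team",
        "sk-glm47flash-dev-123": "development-team",
    }

    # auto_error=False lets us return 401 ourselves when the header is missing;
    # depending on the FastAPI version, the built-in behavior may be a 403.
    security = HTTPBearer(auto_error=False)


    def verify_api_key(
        credentials: Optional[HTTPAuthorizationCredentials] = Depends(security),
    ) -> str:
        """
        Check that the Bearer token is on the allowlist.
        Returns the associated team ID, usable later for logging or rate limiting.
        """
        token = credentials.credentials if credentials else None
        if token is None or token not in API_KEYS:
            # Log only a short prefix so full keys never reach the logs
            shown = token[:8] if token else "<missing>"
            logger.warning(f"Invalid API key attempted: {shown}...")
            raise HTTPException(
                status_code=status.HTTP_401_UNAUTHORIZED,
                detail="Invalid or expired API key",
                headers={"WWW-Authenticate": "Bearer"},
            )
        logger.info(f"Valid API key used by: {API_KEYS[token]}")
        return API_KEYS[token]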

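This code does three things:
As a dependency, it runs on every route that declares Depends(verify_api_key) and reads the Authorization: Bearer xxx header;
It checks whether the token is in the predefined dict;
It logs the result and returns the owning team name (groundwork for per-team rate limiting later).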
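One lightweight step between the hard-coded API_KEYS dict and a database is to load the pool from an environment variable. The sketch below assumes a variable named GLM_API_KEYS holding comma-separated key=team pairs; both the name and the format are made up for illustration.

```python
import os
from typing import Dict, Mapping


def load_api_keys(environ: Mapping[str, str] = os.environ,
                  var: str = "GLM_API_KEYS") -> Dict[str, str]:
    """Parse 'key1=team1,key2=team2' from an environment variable.

    The variable name and format are assumptions of this sketch.
    Raises ValueError on a malformed entry, without echoing the entry,
    so a half-typed key does not end up in logs or tracebacks.
    """
    keys: Dict[str, str] = {}
    for entry in environ.get(var, "").split(","):
        entry = entry.strip()
        if not entry:
            continue  # tolerate trailing commas and blank segments
        key, sep, team = entry.partition("=")
        key, team = key.strip(), team.strip()
        if not sep or not key or not team:
            raise ValueError(f"Malformed entry in {var}; expected key=team")
        keys[key] = team
    return keys


# Usage: pass an explicit mapping in tests, or rely on os.environ at runtime.
example = load_api_keys({"GLM_API_KEYS": "sk-glm47flash-dev-123=development-team"})
```

In auth.py, API_KEYS = load_api_keys() would then replace the literal dict, so keys no longer live in source control.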

4. Building the FastAPI service and proxying vLLM

Create main.py, the core of the whole service:

    # main.py
    import time

    import requests
    from fastapi import Depends, FastAPI, HTTPException, Request, status
    from fastapi.concurrency import run_in_threadpool
    from fastapi.middleware.cors import CORSMiddleware
    from fastapi.responses import JSONResponse, StreamingResponse
    from loguru import logger

    from auth import verify_api_key

    app = FastAPI(
        title="GLM-4.7-Flash API Gateway",
        description="FastAPI wrapper for GLM-4.7-Flash vLLM service with auth & logging",
        version="1.0.0",
    )

    # Allow cross-origin calls from a front end (e.g. your React/Vue project).
    # Auth travels in the Authorization header, not in cookies, so credentials stay off;
    # replace "*" with your real origins in production.
    app.add_middleware(
        CORSMiddleware,
        allow_origins=["*"],
        allow_credentials=False,
        allow_methods=["*"],
        allow_headers=["*"],
    )

    # vLLM service address (fixed to port 8000 inside the image)
    VLLM_BASE_URL = "http://127.0.0.1:8000"


    @app.middleware("http")
    async def log_requests(request: Request, call_next):
        """Global request-logging middleware."""
        start_time = time.perf_counter()
        client_host = request.client.host if request.client else "unknown"
        try:
            response = await call_next(request)
        except Exception as e:
            process_time = time.perf_counter() - start_time
            logger.error(
                f"{request.method} {request.url.path} ERROR {process_time:.3f}s "
                f"from {client_host} - {e}"
            )
            raise
        process_time = time.perf_counter() - start_time
        logger.info(
            f"{request.method} {request.url.path} "
            f"{response.status_code} {process_time:.3f}s "
            f"from {client_host}"
        )
        return response


    @app.post("/v1/chat/completions")
    async def chat_completions(
        request: Request,
        team_id: str = Depends(verify_api_key),  # inject the auth result
    ):
        """Proxy vLLM's chat/completions endpoint, adding auth and logging."""
        try:
            # Read the raw JSON body
            body = await request.json()
            # Force the model path (so clients cannot fill in arbitrary values)
            body["model"] = "/root/.cache/huggingface/ZhipuAI/GLM-4.7-Flash"
            is_stream = bool(body.get("stream", False))

            # Attach the team ID to the log record (optional; stored in the record's extra fields)
            logger.bind(team=team_id).info("Forwarding request to vLLM")

            # requests is blocking, so run it in a worker thread to keep the event loop free.
            # stream=True is needed for streaming; otherwise the whole reply is buffered first.
            vllm_response = await run_in_threadpool(
                requests.post,
                f"{VLLM_BASE_URL}/v1/chat/completions",
                json=body,
                timeout=120,  # give the model enough time to respond
                stream=is_stream,
            )

            # Streaming response
            if is_stream:
                return StreamingResponse(
                    vllm_response.iter_content(chunk_size=64),
                    media_type="text/event-stream",
                    status_code=vllm_response.status_code,
                )

            # Plain JSON response
            return JSONResponse(
                content=vllm_response.json(),
                status_code=vllm_response.status_code,
            )

        except HTTPException:
            # Let deliberate HTTP errors (401, 429, ...) pass through unchanged
            raise
        except requests.exceptions.Timeout:
            logger.error("vLLM timeout after 120s")
            raise HTTPException(status_code=504, detail="Model inference timeout")
        except requests.exceptions.ConnectionError:
            logger.error("Cannot connect to vLLM service")
            raise HTTPException(status_code=503, detail="vLLM service unavailable")
        except Exception:
            logger.exception("Unexpected error in chat_completions")
            raise HTTPException(status_code=500, detail="Internal server error")


    @app.get("/health")
    def health_check():
        """Health-check endpoint for K8s or a monitoring system."""
        return {"status": "ok", "timestamp": int(time.time())}


    if __name__ == "__main__":
        import uvicorn

        # workers > 1 requires the app as an import string rather than an object
        uvicorn.run("main:app", host="0.0.0.0", port=8001, workers=2)

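Key points:
🔹@app.middleware("http"): logs the duration, status code, and source IP of every request, which makes troubleshooting easier;
🔹team_id = Depends(verify_api_key): injects the auth result into the route, where it can later drive per-team rate limiting or billing;
🔹body["model"] = ...: overwrites whatever model field the client sent, so other models cannot be called by mistake;
🔹StreamingResponse: passes vLLM's SSE stream through unchanged; the upstream requests.post call must use stream=True for chunks to arrive incrementally;
🔹timeout=120: inference can take a while, so a generous but finite timeout avoids hanging requests;
🔹/health: a standard health check that container orchestrators can probe.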
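Overwriting the model field is one form of request hygiene; the oversized-prompt problem raised in section 1 can be handled in the same place. The following is a rough sketch of a validation step that could run before the request is forwarded. The function name sanitize_body and both limits are assumptions chosen for illustration, and it only handles messages whose content is a plain string.

```python
from typing import Any, Dict

MODEL_PATH = "/root/.cache/huggingface/ZhipuAI/GLM-4.7-Flash"  # same path as in main.py
MAX_PROMPT_CHARS = 8000  # assumed limit; tune for your deployment
MAX_TOKENS_CAP = 1024    # assumed limit; tune for your deployment


def sanitize_body(body: Dict[str, Any]) -> Dict[str, Any]:
    """Validate a chat request and return a cleaned copy.

    Raises ValueError with a client-safe message when the request is
    unacceptable; the route can translate that into an HTTP 400.
    """
    messages = body.get("messages")
    if not isinstance(messages, list) or not messages:
        raise ValueError("'messages' must be a non-empty list")

    total_chars = 0
    for message in messages:
        if not isinstance(message, dict) or not isinstance(message.get("content"), str):
            raise ValueError("each message needs a string 'content'")
        total_chars += len(message["content"])
    if total_chars > MAX_PROMPT_CHARS:
        raise ValueError(f"prompt too long ({total_chars} > {MAX_PROMPT_CHARS} characters)")

    requested = body.get("max_tokens", MAX_TOKENS_CAP)
    if isinstance(requested, bool) or not isinstance(requested, int) or requested < 1:
        raise ValueError("'max_tokens' must be a positive integer")

    cleaned = dict(body)                                  # leave the caller's dict untouched
    cleaned["model"] = MODEL_PATH                         # ignore any client-supplied model
    cleaned["max_tokens"] = min(requested, MAX_TOKENS_CAP)
    return cleaned
```

In chat_completions this would sit right after await request.json(), wrapped in a try/except ValueError that raises HTTPException(status_code=400, detail=str(e)).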

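5. Starting the service and verifying it

5.1 Start the FastAPI gateway

Run in the image's terminal:

    # Make sure the virtual environment is active
    source ./glm47flash-api-env/bin/activate

    # Development: listen on 8001 (avoiding vLLM's 8000) with auto-reload
    uvicorn main:app --host 0.0.0.0 --port 8001 --reload

    # Production: drop --reload and run multiple workers instead
    uvicorn main:app --host 0.0.0.0 --port 8001 --workers 2

--workers 2: two processes for more concurrency;
--reload: auto-reload during development (remove it in production); uvicorn does not combine --reload with --workers, so pick one;
Logs are printed to the terminal in real time (loguru's default sink is stderr). To also write them to a file such as runtime.log, add a sink with logger.add("runtime.log") in main.py.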

5.2 Check that authentication works

Open a new terminal and test with curl:

    # ❌ No token → should be rejected with 401
    # (if HTTPBearer is left at its default auto_error, some FastAPI versions answer 403 instead)
    curl -X POST http://127.0.0.1:8001/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{"messages":[{"role":"user","content":"你好"}]}'

    # ❌ Wrong token → should return 401
    curl -X POST http://127.0.0.1:8001/v1/chat/completions \
      -H "Authorization: Bearer invalid-key" \
      -H "Content-Type: application/json" \
      -d '{"messages":[{"role":"user","content":"你好"}]}'

    # Correct token → should return 200 and a generated reply
    curl -X POST http://127.0.0.1:8001/v1/chat/completions \
      -H "Authorization: Bearer sk-glm47flash-prod-2024" \
      -H "Content-Type: application/json" \
      -d '{
        "messages": [{"role":"user","content":"用中文写一首关于春天的五言绝句"}],
        "temperature": 0.5,
        "max_tokens": 256
      }'

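Watch the FastAPI terminal; the log will look something like this (timestamps and line numbers will differ, and the auth line comes first because the dependency runs before the middleware logs the finished request):

    2024-06-15 14:22:30.647 | INFO     | auth:verify_api_key:28 - Valid API key used by: production-team
    2024-06-15 14:22:31.892 | INFO     | main:log_requests:42 - POST /v1/chat/completions 200 1.245s from 127.0.0.1

This shows that authentication, logging, and proxying all work end to end.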

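5.3 Test streaming responses (front-end friendly)

Simulate a streaming consumer with a Python script:

    # test_stream.py
    import json

    import requests

    url = "http://127.0.0.1:8001/v1/chat/completions"
    headers = {
        "Authorization": "Bearer sk-glm47flash-prod-2024",
        "Content-Type": "application/json",
    }
    data = {
        "model": "/root/.cache/huggingface/ZhipuAI/GLM-4.7-Flash",
        "messages": [{"role": "user", "content": "请逐字解释‘人工智能’四个字的含义"}],
        "stream": True,
    }

    with requests.post(url, headers=headers, json=data, stream=True, timeout=120) as r:
        for line in r.iter_lines():
            if not line:
                continue
            line_str = line.decode("utf-8")
            if not line_str.startswith("data: "):
                continue
            payload = line_str[6:]
            if payload == "[DONE]":  # end-of-stream marker in OpenAI-style SSE
                break
            try:
                chunk = json.loads(payload)
            except json.JSONDecodeError:
                continue  # skip malformed lines instead of hiding every error
            choices = chunk.get("choices") or []
            if choices and choices[0].get("delta", {}).get("content"):
                print(choices[0]["delta"]["content"], end="", flush=True)
    print()

When you run it, the text appears piece by piece like a typewriter, which confirms that the streaming path works end to end.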
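The per-line parsing in that script is the part most worth reusing, for example in a unit test that needs no running server. A possible way to isolate it as a pure function is sketched below; the name extract_delta is an illustrative choice, and the chunk layout it expects is the OpenAI-style one used by the script above.

```python
import json
from typing import Optional


def extract_delta(line: str) -> Optional[str]:
    """Return the text fragment carried by one SSE line, or None.

    None covers every non-content case: comments, blank lines, the
    [DONE] marker, malformed JSON, and chunks without delta content.
    """
    if not line.startswith("data:"):
        return None
    payload = line[len("data:"):].strip()
    if not payload or payload == "[DONE]":
        return None
    try:
        chunk = json.loads(payload)
    except json.JSONDecodeError:
        return None
    if not isinstance(chunk, dict):
        return None
    choices = chunk.get("choices") or []
    if not choices:
        return None
    delta = choices[0].get("delta") or {}
    return delta.get("content")


# Usage: feed it decoded lines from requests' iter_lines()
sample = 'data: {"choices":[{"delta":{"content":"人工"}}]}'
fragment = extract_delta(sample)
```

Keeping the parser free of network code means the consumer loop shrinks to "decode, call extract_delta, print if not None".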

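6. Production hardening: rate limiting and unified error handling

The current version covers the basics, but two things are still missing before production: rate limiting against abuse and standardized errors. A few lines of code close the gap.

6.1 Simple rate limiting by team ID

Edit main.py and add the rate-limiting logic near the top. It uses an in-memory counter, so each worker process keeps its own count: with --workers 2 the effective limit can reach twice RATE_LIMIT_MAX. Switch to Redis in production.

    # Added near the top of main.py
    import time
    from collections import defaultdict, deque

    # In-memory rate limit: at most 30 requests per minute per team
    RATE_LIMIT_WINDOW = 60  # seconds
    RATE_LIMIT_MAX = 30
    _request_counts = defaultdict(deque)  # team_id -> deque of request timestamps


    def check_rate_limit(team_id: str) -> bool:
        now = time.time()
        timestamps = _request_counts[team_id]
        # Drop records that have fallen out of the window
        while timestamps and timestamps[0] < now - RATE_LIMIT_WINDOW:
            timestamps.popleft()
        # Check whether the limit has been reached
        if len(timestamps) >= RATE_LIMIT_MAX:
            return False
        timestamps.append(now)
        return True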

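Then add the check at the start of the chat_completions function. Place it before the try: so the 429 can never be swallowed by the handler's broad except Exception clause and turned into a 500:

    # Insert before the try: block, ahead of reading the body
    if not check_rate_limit(team_id):
        logger.warning(f"Rate limit exceeded for team: {team_id}")
        raise HTTPException(
            status_code=status.HTTP_429_TOO_MANY_REQUESTS,
            detail="Too many requests, please try again later",
        )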
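A client that receives a 429 usually wants to know how long to wait. One way to extend the same sliding-window idea is to have the limiter report the remaining wait time, which the route can send back as a Retry-After header. The class below is a sketch under assumptions: SlidingWindowLimiter and its acquire method are hypothetical names, and the lock only protects threads within one process, so the per-worker caveat still applies.

```python
import threading
import time
from collections import defaultdict, deque
from typing import Callable, Deque, Dict


class SlidingWindowLimiter:
    """Per-team sliding-window limiter that reports the wait time."""

    def __init__(self, max_requests: int = 30, window: float = 60.0,
                 clock: Callable[[], float] = time.monotonic):
        self._max = max_requests
        self._window = window
        self._clock = clock  # injectable so tests can control time
        self._lock = threading.Lock()
        self._hits: Dict[str, Deque[float]] = defaultdict(deque)

    def acquire(self, team_id: str) -> float:
        """Record a request; return 0.0 if allowed, else seconds to wait."""
        now = self._clock()
        with self._lock:
            hits = self._hits[team_id]
            # Forget requests that have left the window
            while hits and hits[0] <= now - self._window:
                hits.popleft()
            if len(hits) >= self._max:
                # The oldest hit leaves the window at hits[0] + window
                return hits[0] + self._window - now
            hits.append(now)
            return 0.0


# Usage with a fake clock: 2 requests per 60 s
fake_now = [0.0]
limiter = SlidingWindowLimiter(max_requests=2, window=60.0, clock=lambda: fake_now[0])
first = limiter.acquire("development-team")
```

In the route, a non-zero result could become HTTPException(status_code=429, detail=..., headers={"Retry-After": str(math.ceil(wait))}), after importing math.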

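6.2 Unified error response format

By default FastAPI returns HTTP errors as {"detail": "..."} and unhandled exceptions as a plain-text 500, which does not match the OpenAI-style error object that clients of this endpoint expect. We register handlers that return one standard shape:

    # Add to main.py
    from starlette.exceptions import HTTPException as StarletteHTTPException


    # Registering for Starlette's base class also covers errors raised by the
    # framework itself (e.g. 404 for unknown routes), not only fastapi.HTTPException.
    @app.exception_handler(StarletteHTTPException)
    async def http_exception_handler(request: Request, exc: StarletteHTTPException):
        return JSONResponse(
            status_code=exc.status_code,
            content={
                "error": {
                    "message": exc.detail,
                    "type": "api_error",
                    "param": None,
                    "code": str(exc.status_code),
                }
            },
            headers=getattr(exc, "headers", None),  # keep e.g. WWW-Authenticate
        )


    @app.exception_handler(Exception)
    async def general_exception_handler(request: Request, exc: Exception):
        logger.exception("Unhandled exception")
        return JSONResponse(
            status_code=500,
            content={
                "error": {
                    "message": "An unexpected error occurred",
                    "type": "server_error",
                    "param": None,
                    "code": "500",
                }
            },
        )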

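Errors raised as HTTPException and unhandled exceptions now come back as structured JSON that the front end can handle in one place:

    {
      "error": {
        "message": "Invalid or expired API key",
        "type": "api_error",
        "param": null,
        "code": "401"
      }
    }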
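On the client side, "handling it in one place" might look like the sketch below, which turns the error payload into a small record and decides whether a retry makes sense. The name parse_gateway_error and the choice of retryable codes (429, 503, 504, all of which the gateway above can emit) are assumptions for illustration.

```python
from typing import Any, Dict, NamedTuple

# Assumption: rate limits and upstream outages/timeouts are worth retrying;
# auth failures (401) and generic server errors (500) are not.
RETRYABLE_CODES = {"429", "503", "504"}


class GatewayError(NamedTuple):
    code: str
    message: str
    retryable: bool


def parse_gateway_error(payload: Dict[str, Any]) -> GatewayError:
    """Convert the gateway's {"error": {...}} body into a GatewayError.

    Raises ValueError if the payload does not have the expected shape.
    """
    err = payload.get("error")
    if not isinstance(err, dict):
        raise ValueError("payload is not a gateway error object")
    code = str(err.get("code", ""))
    message = str(err.get("message", ""))
    return GatewayError(code=code, message=message, retryable=code in RETRYABLE_CODES)


# Usage with the sample payload shown above
parsed = parse_gateway_error({
    "error": {
        "message": "Invalid or expired API key",
        "type": "api_error",
        "param": None,
        "code": "401",
    }
})
```

A caller can then branch on parsed.retryable instead of scattering status-code checks across the code base.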

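7. Summary: the core capabilities of a model-service gateway

Looking back, you have built a GLM-4.7-Flash API gateway with the main features production calls for:

Secure and controlled: Bearer Token checks reject unauthenticated calls and tie every request to a team;
Observable: every request is logged with its duration (to the millisecond), status code, and source IP;
Resilient: health check, upstream timeout, and a catch-all exception handler keep failures contained;
Extensible: the modular design makes it easy to add Redis-backed rate limiting, Prometheus monitoring, or OpenTelemetry tracing later;
Non-invasive: the vLLM source is untouched and the gateway is only a thin proxy, so upgrading vLLM does not affect it.

This is more than "wrapping an API". It lays out a standard path toward AI engineering: the model is the engine, the API is the steering wheel, and the gateway is the driver's license in your hand.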

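Next steps you could take:
➡ Store the API keys in environment variables or a Secret Manager;
➡ Integrate a Prometheus exporter to expose QPS, latency, and error-rate metrics;
➡ Configure different max_tokens and temperature defaults for different teams;
➡ Add an asynchronous queue (such as Celery) for very long requests so they do not block the main thread.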
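As one example, the per-team defaults could start as a plain lookup table merged into the request body before forwarding. Everything in this sketch is hypothetical: the function name apply_team_defaults and the numeric values are placeholders, while the team names match those in auth.py.

```python
from typing import Any, Dict

# Placeholder values; pick numbers that suit each team's workload.
TEAM_DEFAULTS: Dict[str, Dict[str, Any]] = {
    "production-team": {"max_tokens": 512, "temperature": 0.3},
    "development-team": {"max_tokens": 256, "temperature": 0.7},
}
FALLBACK_DEFAULTS: Dict[str, Any] = {"max_tokens": 256, "temperature": 0.5}


def apply_team_defaults(body: Dict[str, Any], team_id: str) -> Dict[str, Any]:
    """Return a copy of `body` with team defaults filled in.

    Values the client sent explicitly win over the defaults.
    """
    defaults = TEAM_DEFAULTS.get(team_id, FALLBACK_DEFAULTS)
    return {**defaults, **body}


# Usage: the client set temperature, so only max_tokens is filled in
merged = apply_team_defaults({"messages": [], "temperature": 0.9}, "production-team")
```

Because verify_api_key already returns the team ID, the route can call apply_team_defaults(body, team_id) without any extra lookup.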

There is no silver bullet in technology, but solid engineering practice is always the most reliable moat.