vLLM（vLLM.ai）K8S生产环境部署Qwen大模型-育师

🏗️ 一、整体架构拓扑（生产级）

✅ 核心原则：
GPU 资源隔离：专用 GPU 节点池 + taint/toleration
零信任网络：服务间 mTLS（Istio）
模型不可变：Docker 镜像封装量化模型
全链路可观测：指标 + 日志 + 链路追踪

🧱 二、详细部署步骤

1. 基础设施准备

Kubernetes 集群要求：

版本：v1.26+
GPU 节点池：
- 实例类型：g5.2xlarge（1×A10）或 p4d.24xlarge（8×A100）
- 标签：node.kubernetes.io/gpu-type=A10
- Taint：dedicated=gpu:NoSchedule
安装组件：
- NVIDIA Device Plugin
- DCGM Exporter（GPU 监控）
- Istio（1.22+，启用 mTLS）

2. 构建 vLLM 生产镜像（含量化模型）

Dockerfile（Qwen-7B-AWQ 示例）：

FROM nvidia/cuda:12.1-runtime-ubuntu22.04# 安装依赖RUNaptupdate&&aptinstall-y python3-pipgitRUN pipinstall--no-cache-dirvllm==0.4.3modelscope==1.14.0# 复制 AWQ 量化模型（由 CI 流水线生成）COPY ./models/qwen/Qwen-7B-Chat-AWQ /models/qwen-7b-chat-awq# 非 root 运行RUNuseradd-m -u1001vllm&&chown-R vllm:vllm /modelsUSER1001EXPOSE8000CMD["python","-m","vllm.entrypoints.openai.api_server",\"--model","/models/qwen-7b-chat-awq",\"--trust-remote-code",\"--dtype","auto",\"--max-model-len","8192",\"--gpu-memory-utilization","0.92",\"--port","8000"]

🔑 关键参数说明：
–trust-remote-code：Qwen/ChatGLM 必须
–gpu-memory-utilization=0.92：避免 OOM
–max-model-len=8192：支持长上下文

构建并推送：

docker build -t harbor.internal/llm/vllm-qwen-7b-awq:v1.0.docker push harbor.internal/llm/vllm-qwen-7b-awq:v1.0

3. Kubernetes 部署（KServe + HPA）

KServe InferenceService YAML：

# vllm-qwen-isvc.yamlapiVersion:serving.kserve.io/v1beta1kind:InferenceServicemetadata:name:qwen-7b-vllmnamespace:llm-prodspec:predictor:minReplicas:3maxReplicas:20scaleMetric:concurrency# 基于并发请求数扩缩容containers:-name:kserve-containerimage:harbor.internal/llm/vllm-qwen-7b-awq:v1.0resources:limits:nvidia.com/gpu:1memory:32Gicpu:"8"requests:nvidia.com/gpu:1memory:16Gicpu:"4"ports:-containerPort:8000livenessProbe:httpGet:{path:/health,port:8000}initialDelaySeconds:120readinessProbe:httpGet:{path:/health,port:8000}initialDelaySeconds:60volumeMounts:-name:model-cachemountPath:/modelsvolumes:-name:model-cachepersistentVolumeClaim:claimName:pvc-nfs-models# NFS 共享存储（多副本共享模型）

GPU 指标 HPA（基于利用率）：

# hpa-gpu.yamlapiVersion:autoscaling/v2kind:HorizontalPodAutoscalermetadata:name:qwen-7b-vllm-hpaspec:scaleTargetRef:apiVersion:serving.kserve.io/v1beta1kind:InferenceServicename:qwen-7b-vllmmetrics:-type:Podspods:metric:name:DCGM_FI_DEV_GPU_UTIL# 来自 DCGM Exportertarget:type:AverageValueaverageValue:"70"# GPU 利用率 70% 触发扩容minReplicas:3maxReplicas:20

💡 HPA 前提：已部署 Prometheus Adapter 将 GPU 指标暴露给 K8s

4. 网络与安全

Istio mTLS + 授权策略：

# peer-authentication.yamlapiVersion:security.istio.io/v1beta1kind:PeerAuthenticationmetadata:name:defaultspec:mtls:mode:STRICT# 强制服务间 mTLS

# authorization-policy.yamlapiVersion:security.istio.io/v1beta1kind:AuthorizationPolicymetadata:name:vllm-accessspec:selector:matchLabels:app:qwen-7b-vllmrules:-from:-source:principals:["cluster.local/ns/llm-prod/sa/api-gateway"]

API Gateway 配置（Kong）：

插件：JWT 认证、限流（100 req/min）、CORS
路由：POST /v1/chat/completions → http://qwen-7b-vllm.llm-prod.svc:8000

5. 可观测性体系

Prometheus 指标（vLLM 自动暴露 /metrics）：

指标	说明
vllm:request_duration_seconds	请求延迟（P99 < 300ms）
vllm:tokens_processed_total	Token 吞吐量
DCGM_FI_DEV_GPU_UTIL	GPU 利用率

Grafana 仪表盘关键 Panel：

实时 QPS（按模型版本）
平均首 Token 延迟（TTFT）
GPU 显存使用趋势
错误率（HTTP 5xx）

结构化日志（JSON）：

{"timestamp":"2025-12-05T10:00:00Z","service":"qwen-7b-vllm","request_id":"req-a1b2c3","user_id":"user_123","prompt_tokens":128,"completion_tokens":64,"total_time_ms":185,"status_code":200}