Inference Deployment (推理部署) on 黄文卓 | DevOps Engineer

https://socake.github.io/tags/%E6%8E%A8%E7%90%86%E9%83%A8%E7%BD%B2/
Recent content in Inference Deployment (推理部署) on 黄文卓 | DevOps Engineer
Generator: Hugo -- gohugo.io
Language: zh-CN
Author: Wenzhuo Huang (17691281867@163.com)
© 2026 Wenzhuo Huang
Last updated: Sat, 14 Mar 2026 16:45:00 +0800

SGLang Structured Generation in Practice: RadixAttention, Constrained Decoding, and Multi-Turn Conversation Optimization
https://socake.github.io/posts/sglang-structured-generation/
Sat, 14 Mar 2026 16:45:00 +0800
Wenzhuo Huang (17691281867@163.com)
SGLang is an underrated LLM inference framework, and RadixAttention pays off substantially in multi-turn conversation and agent scenarios. This post explains SGLang's core mechanisms, frontend DSL, constrained decoding, deployment options, and common pitfalls.

vLLM Multi-Node, Multi-GPU Distributed Inference: Tensor Parallel Tuning and Pitfalls from the Field
https://socake.github.io/posts/vllm-multi-node-distributed/
Tue, 03 Mar 2026 09:30:00 +0800
Wenzhuo Huang (17691281867@163.com)
Starting from a single node with 8 GPUs and working up to multi-node, multi-GPU setups, this post ties vLLM's TP/PP partitioning, Ray launch methods, NCCL tuning, PagedAttention GPU memory accounting, and common failure scenarios into one complete path to production.