<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Inference Deployment on 黄文卓 | DevOps Engineer</title><link>https://socake.github.io/categories/%E6%8E%A8%E7%90%86%E9%83%A8%E7%BD%B2/</link><description>Recent content in Inference Deployment on 黄文卓 | DevOps Engineer</description><generator>Hugo -- gohugo.io</generator><language>zh-CN</language><managingEditor>17691281867@163.com (Wenzhuo Huang)</managingEditor><webMaster>17691281867@163.com (Wenzhuo Huang)</webMaster><copyright>© 2026 Wenzhuo Huang</copyright><lastBuildDate>Sun, 29 Mar 2026 10:45:00 +0800</lastBuildDate><atom:link href="https://socake.github.io/categories/%E6%8E%A8%E7%90%86%E9%83%A8%E7%BD%B2/index.xml" rel="self" type="application/rss+xml"/><item><title>Ray Serve Model Deployment in Practice: Deployments, DAG Orchestration, and Autoscaling</title><link>https://socake.github.io/posts/ray-serve-model-deployment/</link><pubDate>Sun, 29 Mar 2026 10:45:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/ray-serve-model-deployment/</guid><description>Ray Serve is a model-serving framework that many teams overlook. For complex DAGs, heterogeneous resources, and autoscaling, it goes well beyond what plain FastAPI offers. This post explains its core abstractions and how to run it in production.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/ray-serve-model-deployment/featured.jpg"/></item><item><title>SGLang Structured Generation in Practice: RadixAttention, Constrained Decoding, and Multi-Turn Conversation Optimization</title><link>https://socake.github.io/posts/sglang-structured-generation/</link><pubDate>Sat, 14 Mar 2026 16:45:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/sglang-structured-generation/</guid><description>SGLang is an underrated LLM inference framework, and RadixAttention pays off substantially in multi-turn conversation and agent workloads. This post covers SGLang's core mechanisms, frontend DSL, constrained decoding, deployment options, and pitfalls.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/sglang-structured-generation/featured.jpg"/></item><item><title>Triton Inference Server in Production: Model Orchestration, Dynamic Batching, and Mixed Multi-Framework Deployment</title><link>https://socake.github.io/posts/triton-inference-server-production/</link><pubDate>Wed, 11 Mar 2026 10:00:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/triton-inference-server-production/</guid><description>A clear walkthrough of Triton, NVIDIA's inference server, for readers new to it: model repository, backends, dynamic batching, ensembles, BLS, the Python backend, production monitoring, and pitfalls encountered in practice.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/triton-inference-server-production/featured.jpg"/></item><item><title>TensorRT-LLM Inference Acceleration in Practice: From Engine Compilation to Kernel Tuning</title><link>https://socake.github.io/posts/tensorrt-llm-inference/</link><pubDate>Sat, 07 Mar 2026 14:20:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/tensorrt-llm-inference/</guid><description>TensorRT-LLM is a key part of NVIDIA's end-to-end inference stack. This post walks through the engine build process, the plugin mechanism, quantization strategies, in-flight batching, kernel tuning, and production pitfalls.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/tensorrt-llm-inference/featured.jpg"/></item><item><title>vLLM Multi-Node, Multi-GPU Distributed Inference: Tensor Parallel Tuning and Pitfalls</title><link>https://socake.github.io/posts/vllm-multi-node-distributed/</link><pubDate>Tue, 03 Mar 2026 09:30:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/vllm-multi-node-distributed/</guid><description>Starting from a single node with 8 GPUs and moving to multi-node, multi-GPU setups, this post ties vLLM's TP/PP partitioning, Ray-based launch, NCCL tuning, PagedAttention GPU memory accounting, and common failure scenarios into one complete deployment path.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/vllm-multi-node-distributed/featured.jpg"/></item></channel></rss>