VLLM on 黄文卓 | DevOps Engineer

Recent content in VLLM on 黄文卓 | DevOps Engineer
https://socake.github.io/tags/vllm/
Author: Wenzhuo Huang (17691281867@163.com)
© 2026 Wenzhuo Huang
Last updated: Tue, 03 Mar 2026 09:30:00 +0800

vLLM Multi-Node, Multi-GPU Distributed Inference: Tensor Parallel Tuning and Lessons from the Field
https://socake.github.io/posts/vllm-multi-node-distributed/
Tue, 03 Mar 2026 09:30:00 +0800
Starting from a single node with 8 GPUs and moving on to multi-node, multi-GPU setups, this post ties together vLLM's TP/PP partitioning, Ray launch methods, NCCL tuning, PagedAttention GPU memory accounting, and common failure scenarios into one complete path to production.

LLM Production Serving: vLLM Deployment and GPU Inference Optimization in Practice
https://socake.github.io/posts/llm-production-serving-vllm/
Tue, 13 Jan 2026 13:36:00 +0800
After the team moved Ollama into production, requests queued for more than 30 seconds at peak, and users reported that the AI features were unusable. This post documents our full migration to vLLM, including how PagedAttention and Continuous Batching work, and the complete configuration for GPU deployment on Kubernetes.