<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Posts on 黄文卓 | DevOps Engineer</title><link>https://socake.github.io/posts/</link><description>Recent content in Posts on 黄文卓 | DevOps Engineer</description><generator>Hugo -- gohugo.io</generator><language>zh-CN</language><managingEditor>17691281867@163.com (Wenzhuo Huang)</managingEditor><webMaster>17691281867@163.com (Wenzhuo Huang)</webMaster><copyright>© 2026 Wenzhuo Huang</copyright><lastBuildDate>Sat, 18 Apr 2026 14:00:00 +0800</lastBuildDate><atom:link href="https://socake.github.io/posts/index.xml" rel="self" type="application/rss+xml"/><item><title>The Complete Nacos Guide: Configuration Center and Service Discovery from Zero to Production Mastery</title><link>https://socake.github.io/posts/nacos-config-service-discovery-guide/</link><pubDate>Sat, 18 Apr 2026 14:00:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/nacos-config-service-discovery-guide/</guid><description>Nacos serves as both a configuration center and a service registry, and is a cornerstone of the Spring Cloud Alibaba ecosystem. This article walks through Nacos's data model, consistency protocols, long-polling push mechanism, health checks for ephemeral instances, production cluster deployment, multi-language SDK integration, canary releases, access control, troubleshooting of common failures (configuration not taking effect, password drift, cluster split-brain), and its place in the cloud-native era. It is meant as a complete reference from first steps to production operations.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/nacos-config-service-discovery-guide/featured.jpg"/></item><item><title>Multi-Cloud Middleware Cross-Reference and Cross-Environment Isolation in Practice</title><link>https://socake.github.io/posts/multi-cloud-middleware-and-isolation/</link><pubDate>Sat, 18 Apr 2026 13:00:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/multi-cloud-middleware-and-isolation/</guid><description>The easiest mistake in multi-cloud operations is to carry the AWS mindset over to Alibaba Cloud unchanged, then discover during an incident that the service choices were completely mismatched. This article compiles an AWS↔Alibaba Cloud middleware comparison table, along with a mandatory cross-environment isolation checklist and a quick reference of frequently used operations commands. It is a cheat sheet I keep coming back to in my own work.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/multi-cloud-middleware-and-isolation/featured.jpg"/></item><item><title>Headscale Self-Hosted Zero-Trust 
VPN: Connecting Private Networks Across Clouds and Data Centers</title><link>https://socake.github.io/posts/headscale-zero-trust-vpn/</link><pubDate>Sun, 12 Apr 2026 14:00:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/headscale-zero-trust-vpn/</guid><description>From WireGuard protocol fundamentals to a full Headscale deployment, including self-hosted DERP, Subnet Router configuration, K8s integration, and ACL policy design: a complete hands-on guide to replacing a traditional bastion host with a mesh VPN.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/headscale-zero-trust-vpn/featured.jpg"/></item><item><title>Linux Flame Graphs in Practice: From Collection to Root Cause</title><link>https://socake.github.io/posts/linux-flame-graph-practice/</link><pubDate>Sun, 12 Apr 2026 14:00:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/linux-flame-graph-practice/</guid><description>CPU spikes, slow responses, memory leaks: flame graphs can quickly pinpoint all three kinds of problems. This article starts with how to read a flame graph, covers where perf, async-profiler, and py-spy each fit, and finishes by walking through a complete investigation of a real Go service.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/linux-flame-graph-practice/featured.jpg"/></item><item><title>MySQL High Availability in Practice: A Complete MGR + ProxySQL + Orchestrator Deployment</title><link>https://socake.github.io/posts/mysql-ha-mgr-proxysql/</link><pubDate>Sun, 12 Apr 2026 14:00:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/mysql-ha-mgr-proxysql/</guid><description>A detailed walkthrough of building MySQL 8.0 MGR in single-primary mode, handling split-brain and GTID inconsistencies, configuring ProxySQL read/write splitting and health-check scripts, wiring Orchestrator automatic failover to ProxySQL, and integrating mysqld_exporter monitoring.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/mysql-ha-mgr-proxysql/featured.jpg"/></item><item><title>OpenCost in Practice: Kubernetes Cost Visibility and Multi-Team Cost Allocation</title><link>https://socake.github.io/posts/opencost-kubernetes-cost-visibility/</link><pubDate>Sun, 12 Apr 2026 14:00:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo 
Huang)</author><guid>https://socake.github.io/posts/opencost-kubernetes-cost-visibility/</guid><description>Opaque Kubernetes costs are the biggest obstacle to adopting FinOps. This article uses OpenCost to build a complete cost-visibility system, covering deployment and integration, cloud provider pricing, per-team allocation, Grafana dashboards, over-budget alerts, and automated weekly reports, with configurations you can reuse directly.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/opencost-kubernetes-cost-visibility/featured.jpg"/></item><item><title>Argo Workflows in Practice: Batch Processing and ML Pipelines</title><link>https://socake.github.io/posts/argo-workflows-practice/</link><pubDate>Sun, 12 Apr 2026 11:00:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/argo-workflows-practice/</guid><description>Argo Workflows is a Kubernetes-native workflow engine well suited to batch processing and ML pipelines. This article covers how it compares with Airflow and Temporal, the core resource model, three complete examples (DAG data processing, an ML training pipeline, scheduled backups), resource controls (Semaphore/Node Selector), event-driven triggering with Argo Events, Prometheus monitoring, and fixes for common problems.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/argo-workflows-practice/featured.jpg"/></item><item><title>Kubernetes cgroup v2 Migration in Practice</title><link>https://socake.github.io/posts/kubernetes-cgroup-v2-migration/</link><pubDate>Sun, 12 Apr 2026 11:00:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/kubernetes-cgroup-v2-migration/</guid><description>cgroup v2 support has been GA since K8s 1.25, and newer features such as MemoryQoS and PSI are only available on v2. This article gives a complete node migration procedure and solutions to common problems.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/kubernetes-cgroup-v2-migration/featured.jpg"/></item><item><title>USE Method: A Methodology for System Performance Analysis</title><link>https://socake.github.io/posts/use-method-performance-analysis/</link><pubDate>Sun, 12 Apr 2026 11:00:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/use-method-performance-analysis/</guid><description>Random trial and error is the enemy of performance troubleshooting. The USE Method 
uses a three-dimensional framework (utilization/saturation/errors) to bring every system resource into a single analysis model. This article covers the methodology from principles to practice and provides PromQL mappings for K8s environments plus a toolchain quick-reference table.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/use-method-performance-analysis/featured.jpg"/></item><item><title>bpftrace in Practice: The Swiss Army Knife for Production Troubleshooting</title><link>https://socake.github.io/posts/bpftrace-performance-debug/</link><pubDate>Sun, 12 Apr 2026 10:00:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/bpftrace-performance-debug/</guid><description>strace is too heavy, perf is too low-level, and the BCC toolset drags in a pile of dependencies; bpftrace strikes a balance among the three. This article uses four real scenarios to explain how bpftrace works and help you make it an everyday troubleshooting tool.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/bpftrace-performance-debug/featured.jpg"/></item><item><title>FinOps in Practice: Building a Kubernetes Cost Governance System</title><link>https://socake.github.io/posts/finops-kubernetes-cost-governance/</link><pubDate>Sun, 12 Apr 2026 10:00:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/finops-kubernetes-cost-governance/</guid><description>A complete path to Kubernetes FinOps adoption: how to identify zombie resources, configure a cost allocation model, cut node costs with Karpenter, and bring the monthly bill down from $50k to $30k.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/finops-kubernetes-cost-governance/featured.jpg"/></item><item><title>gRPC Microservices in Practice: Protocol, Load Balancing, and Kubernetes Integration</title><link>https://socake.github.io/posts/grpc-microservices-practice/</link><pubDate>Sun, 12 Apr 2026 10:00:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/grpc-microservices-practice/</guid><description>From protocol fundamentals to production on Kubernetes, a systematic look at core gRPC microservice practices: backward-compatible Protobuf design, interceptor chains (logging/rate limiting/OTel), uneven load across long-lived connections (headless Service + round_robin vs Envoy L7), health-check probe configuration, and coexisting with REST via grpc-gateway.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" 
url="https://socake.github.io/posts/grpc-microservices-practice/featured.jpg"/></item><item><title>Kubernetes v1.33 New Features in Depth: GA Feature Overview and Upgrade Guide</title><link>https://socake.github.io/posts/kubernetes-v133-features/</link><pubDate>Sun, 12 Apr 2026 10:00:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/kubernetes-v133-features/</guid><description>Kubernetes v1.33 brings several heavyweight GA features. This article takes a close look at core changes such as In-Place Pod Vertical Scaling, native Sidecar Containers, Pod Scheduling Readiness, and KMS v2 encryption, and provides working configuration examples and production upgrade advice.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/kubernetes-v133-features/featured.jpg"/></item><item><title>PostgreSQL High Availability in Practice: A Complete Patroni + HAProxy + etcd Deployment Guide</title><link>https://socake.github.io/posts/postgresql-ha-patroni/</link><pubDate>Sun, 12 Apr 2026 10:00:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/postgresql-ha-patroni/</guid><description>Explains Patroni's automatic failover mechanism and walks step by step through building a three-node etcd cluster, a full Patroni configuration (including managed pg_hba.conf), HAProxy read/write splitting, and a complete kill-the-primary failover drill.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/postgresql-ha-patroni/featured.jpg"/></item><item><title>Choosing a Service Mesh: An In-Depth Comparison of Istio vs Cilium vs Linkerd</title><link>https://socake.github.io/posts/service-mesh-comparison/</link><pubDate>Sun, 12 Apr 2026 10:00:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/service-mesh-comparison/</guid><description>Istio, Cilium Service Mesh, and Linkerd each have a different focus: Istio is the most feature-complete but the heaviest, Cilium, built on eBPF, performs best, and Linkerd is the lightest and easiest to operate. This article breaks them down across architecture, performance, features, and operations to help architects make a data-backed choice.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/service-mesh-comparison/featured.jpg"/></item><item><title>Migrating from Ingress to Gateway 
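API: A Complete Hands-On Guide</title><link>https://socake.github.io/posts/ingress-to-gateway-api-migration/</link><pubDate>Sun, 12 Apr 2026 10:00:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/ingress-to-gateway-api-migration/</guid><description>Gateway API is the official next-generation standard for Kubernetes traffic ingress, addressing long-standing Ingress problems such as annotation sprawl and lack of portability across implementations. This article takes you through a production migration from scratch.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/ingress-to-gateway-api-migration/featured.jpg"/></item><item><title>Flagger Progressive Delivery in Practice: Canary, Blue/Green, A/B, and Integration with Istio/NGINX/Gateway API</title><link>https://socake.github.io/posts/flagger-progressive-delivery/</link><pubDate>Sat, 11 Apr 2026 10:00:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/flagger-progressive-delivery/</guid><description>Releasing with a plain kubectl apply concentrates all the risk at the moment of release. Flagger uses metric-driven progressive traffic shifting (Canary Analysis) to spread that risk across the whole rollout and rolls back automatically on anomalies. Based on the official documentation, this article covers the full set of Canary CR fields, configuration templates for the three strategies, integration with Istio, NGINX Ingress, and Gateway API, custom metric analysis, the automated rollback mechanism, and how it compares with Argo Rollouts.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/flagger-progressive-delivery/featured.jpg"/></item><item><title>Temporal Distributed Workflow Engine in Practice: Workers, Activities, Retry Semantics, and Production Deployment</title><link>https://socake.github.io/posts/temporal-workflow-engine/</link><pubDate>Wed, 08 Apr 2026 10:00:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/temporal-workflow-engine/</guid><description>Orchestrating long-running business processes has always been painful: state machines, timers, compensation, idempotency, and failure recovery all have to be written by hand. Temporal handles these problems in one place with event sourcing plus deterministic replay. Using the Go SDK as the main thread, this article covers the programming model, Workflow determinism constraints, Activity retries, Signal/Query, child workflows, production cluster deployment, monitoring, and capacity planning, offering patterns you can put straight into practice.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/temporal-workflow-engine/featured.jpg"/></item><item><title>Troubleshooting Log: Terway CRD IPAM IP Leak Leaves Pods 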
Unschedulable</title><link>https://socake.github.io/posts/%E6%95%85%E9%9A%9C%E6%8E%92%E6%9F%A5-terway-ip%E6%B3%84%E6%BC%8F/</link><pubDate>Tue, 07 Apr 2026 09:54:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/%E6%95%85%E9%9A%9C%E6%8E%92%E6%9F%A5-terway-ip%E6%B3%84%E6%BC%8F/</guid><description>A real cascading failure: node disk alert → Pods evicted → Terway IPAM IPs not reclaimed properly → node ENI IPs exhausted → new Pods cannot be scheduled. A complete record of the investigation path, root-cause analysis, and fix.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/%E6%95%85%E9%9A%9C%E6%8E%92%E6%9F%A5-terway-ip%E6%B3%84%E6%BC%8F/featured.jpg"/></item><item><title>AutoGen Multi-Agent Collaboration in Practice: From Group Chat to Production</title><link>https://socake.github.io/posts/autogen-multi-agent-practice/</link><pubDate>Mon, 06 Apr 2026 11:30:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/autogen-multi-agent-practice/</guid><description>AutoGen moves multi-agent collaboration from toy to production. This article explains its core abstractions (Conversable Agent / Group Chat / tool calling) and what has to be handled on the way from demo to production.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/autogen-multi-agent-practice/featured.jpg"/></item><item><title>AI Tools in Practice for Operations Engineers</title><link>https://socake.github.io/posts/%E8%BF%90%E7%BB%B4%E5%B7%A5%E7%A8%8B%E5%B8%88ai%E5%B7%A5%E5%85%B7%E5%AE%9E%E8%B7%B5/</link><pubDate>Fri, 03 Apr 2026 11:20:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/%E8%BF%90%E7%BB%B4%E5%B7%A5%E7%A8%8B%E5%B8%88ai%E5%B7%A5%E5%85%B7%E5%AE%9E%E8%B7%B5/</guid><description>From writing shell scripts and interpreting error messages to assisting with troubleshooting, this article shares where AI tools genuinely help operations engineers, where they do not, prompt techniques, and which scenarios suit each tool.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/%E8%BF%90%E7%BB%B4%E5%B7%A5%E7%A8%8B%E5%B8%88ai%E5%B7%A5%E5%85%B7%E5%AE%9E%E8%B7%B5/featured.jpg"/></item><item><title>LiteLLM 
Gateway in Practice: Unified Multi-Model Access, Rate Limiting, Cost Tracking, and Failover</title><link>https://socake.github.io/posts/litellm-gateway-proxy/</link><pubDate>Thu, 02 Apr 2026 14:00:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/litellm-gateway-proxy/</guid><description>LiteLLM has become a de facto standard for multi-model LLM access. This article explains its Proxy mode deployment, Model Config, Virtual Key, Router Fallback, cost tracking, and pitfalls encountered along the way.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/litellm-gateway-proxy/featured.jpg"/></item><item><title>Tetragon eBPF Runtime Security in Practice: Process/Network/File Policies and a Comparison with Falco</title><link>https://socake.github.io/posts/tetragon-runtime-security/</link><pubDate>Thu, 02 Apr 2026 10:00:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/tetragon-runtime-security/</guid><description>Kubernetes runtime security is a blind spot that traditional EDR struggles to cover. Tetragon uses eBPF to collect process, network, file, and system call events in kernel space, and can block attack actions in place inside the kernel. Starting from the architecture, this article covers TracingPolicy syntax, detection of typical attacks (reverse shells, privilege escalation, sensitive file access), the enforcement mechanism, performance overhead, and how it differs from Falco.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/tetragon-runtime-security/featured.jpg"/></item><item><title>Running Large Models with Ollama on K8s: Operating Local LLMs in Practice</title><link>https://socake.github.io/posts/ollama-kubernetes-llm/</link><pubDate>Mon, 30 Mar 2026 09:08:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/ollama-kubernetes-llm/</guid><description>Deploying Ollama on Kubernetes to run local large models, from GPU scheduling to falling back to CPU inference to real integrations in operations scenarios: a full record of the pitfalls and practices.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/ollama-kubernetes-llm/featured.jpg"/></item><item><title>Ray Serve Model Deployment in Practice: Deployment, DAG Orchestration, and Autoscaling</title><link>https://socake.github.io/posts/ray-serve-model-deployment/</link><pubDate>Sun, 29 Mar 2026 10:45:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo 
Huang)</author><guid>https://socake.github.io/posts/ray-serve-model-deployment/</guid><description>Ray Serve is a model serving framework that many teams overlook. For complex DAGs, heterogeneous resources, and autoscaling, it goes well beyond what plain FastAPI offers. This article explains its core abstractions and production rollout.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/ray-serve-model-deployment/featured.jpg"/></item><item><title>GitHub Copilot for Engineering: More Than Code Completion</title><link>https://socake.github.io/posts/github-copilot-engineering/</link><pubDate>Sat, 28 Mar 2026 12:51:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/github-copilot-engineering/</guid><description>GitHub Copilot is more than Tab completion. This article covers Copilot Chat's /fix, /explain, and /tests commands, workspace context, Copilot for CLI, real-world use in Terraform, Dockerfile, and K8s YAML, and tips for improving completion hit rates.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/github-copilot-engineering/featured.jpg"/></item><item><title>Volcano Batch Scheduling in Practice: Gang Scheduling, Queues, and Preemption for AI Training Clusters</title><link>https://socake.github.io/posts/volcano-gpu-batch-scheduling/</link><pubDate>Wed, 25 Mar 2026 15:30:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/volcano-gpu-batch-scheduling/</guid><description>The default K8s scheduler is a poor fit for AI training. Volcano brings HPC scheduling concepts into K8s: Gang Scheduling, Queue, Fairshare, Preemption, and topology affinity. This article explains how to put it to work in an AI training cluster.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/volcano-gpu-batch-scheduling/featured.jpg"/></item><item><title>An In-Depth Guide to the Cursor AI Coding Assistant</title><link>https://socake.github.io/posts/cursor-ai-editor-guide/</link><pubDate>Wed, 25 Mar 2026 13:07:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/cursor-ai-editor-guide/</guid><description>Cursor is not VSCode with an AI plugin; it redesigns the interaction model for human-AI collaboration. This article breaks down Tab completion, @ context references, Composer, Agent mode, and .cursorrules configuration, and demonstrates a complete workflow by refactoring an operations script.</description><media:content 
xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/cursor-ai-editor-guide/featured.jpg"/></item><item><title>ComfyUI + Stable Diffusion: Workflow-Automated Image Generation</title><link>https://socake.github.io/posts/comfyui-stable-diffusion-workflow/</link><pubDate>Mon, 23 Mar 2026 12:56:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/comfyui-stable-diffusion-workflow/</guid><description>Compares the SDXL, FLUX, and SD3 ecosystems, explains how to choose between ComfyUI and WebUI, then digs into ComfyUI installation, node-graph workflow design, and common node configuration, with a focus on headless API calls and server-side batch generation deployment.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/comfyui-stable-diffusion-workflow/featured.jpg"/></item><item><title>FluxCD vs ArgoCD In-Depth Comparison and Migration in Practice: Architecture, Semantics, Multi-Tenancy, and Selection</title><link>https://socake.github.io/posts/fluxcd-vs-argocd-migration/</link><pubDate>Sun, 22 Mar 2026 10:00:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/fluxcd-vs-argocd-migration/</guid><description>The two mainstream GitOps approaches, FluxCD and ArgoCD, differ significantly in architecture, semantics, operational cost, and extensibility. Drawing on official documentation and production experience, this article compares them item by item across sync model, application abstraction, multi-tenant isolation, Helm support, observability, and extension mechanisms, offers a decision tree for choosing between them, and provides a reusable runbook for migrating from ArgoCD to FluxCD.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/fluxcd-vs-argocd-migration/featured.jpg"/></item><item><title>Unsloth Efficient Fine-Tuning in Practice: Peak Single-GPU QLoRA Performance and Internals</title><link>https://socake.github.io/posts/unsloth-efficient-finetuning/</link><pubDate>Sun, 22 Mar 2026 09:15:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/unsloth-efficient-finetuning/</guid><description>Unsloth uses hand-written Triton kernels to push single-GPU LoRA fine-tuning speed and memory usage to the limit. This article explains how Unsloth works, how to combine it with LLaMA Factory and TRL, and the pitfalls of real-world use.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/unsloth-efficient-finetuning/featured.jpg"/></item><item><title>Linux 
Kernel Network Parameter Tuning in Depth: High-Concurrency Scenarios in Practice</title><link>https://socake.github.io/posts/linux-kernel-network-tuning/</link><pubDate>Fri, 20 Mar 2026 10:00:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/linux-kernel-network-tuning/</guid><description>Under high concurrency, Linux's default kernel parameters often become the system bottleneck. Starting from first principles, this article systematically covers tuning of the TCP backlog, TIME_WAIT, keepalive, memory buffers, conntrack, and NIC queues (RSS/RPS/RFS), and provides a sysctl DaemonSet approach tailored to K8s nodes along with a complete load-test validation workflow.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/linux-kernel-network-tuning/featured.jpg"/></item><item><title>FastGPT Knowledge Base Question Answering: From Deployment to Application</title><link>https://socake.github.io/posts/fastgpt-knowledge-base-practice/</link><pubDate>Fri, 20 Mar 2026 09:44:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/fastgpt-knowledge-base-practice/</guid><description>FastGPT is an open-source platform focused on knowledge base question answering that is quicker to get started with than Dify. This article covers MongoDB + PgVector deployment, knowledge base creation and document import, Flow workflow configuration, similarity threshold tuning, DingTalk integration via the API, and a real-world operations knowledge base case.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/fastgpt-knowledge-base-practice/featured.jpg"/></item><item><title>LLaMA Factory Fine-Tuning Toolchain in Practice: The Full Workflow from Data Preparation to LoRA Merging</title><link>https://socake.github.io/posts/llamafactory-finetuning/</link><pubDate>Wed, 18 Mar 2026 11:20:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/llamafactory-finetuning/</guid><description>LLaMA Factory turns many large-model fine-tuning tricks into engineering practice. This article follows the rhythm of a complete project: data, SFT, LoRA, DPO, merging, evaluation, and common pitfalls.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/llamafactory-finetuning/featured.jpg"/></item><item><title>Container Image Build Optimization: BuildKit, Multi-Stage Builds, and Supply Chain Security</title><link>https://socake.github.io/posts/container-image-build-optimization/</link><pubDate>Wed, 18 Mar 2026 10:00:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo 
Huang)</author><guid>https://socake.github.io/posts/container-image-build-optimization/</guid><description>An in-depth look at every stage of container image build optimization: BuildKit parallel builds and secrets injection, multi-stage Dockerfile templates for Go/Python/Node.js, --mount=type=cache and remote caching, choosing between Distroless and Alpine, analyzing layer contents with dive, and a complete supply chain security loop (syft SBOM + Cosign signing + signature verification via K8s admission control).</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/container-image-build-optimization/featured.jpg"/></item><item><title>ClickHouse Production Operations in Practice: Cluster Deployment, Replicas and Shards, Performance Tuning, and Troubleshooting</title><link>https://socake.github.io/posts/clickhouse-ops-practice/</link><pubDate>Sun, 15 Mar 2026 10:00:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/clickhouse-ops-practice/</guid><description>Behind ClickHouse's high-throughput OLAP capability lies a distinctive operations model: ReplicatedMergeTree, ZooKeeper/Keeper, distributed tables, materialized views, TTL, and choosing within the MergeTree family. Following the path to production, this article goes from cluster planning, replication and sharding, write optimization, query tuning, and materialized views to slow-query investigation, with SQL and operations scripts you can reuse directly.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/clickhouse-ops-practice/featured.jpg"/></item><item><title>SGLang Structured Generation in Practice: RadixAttention, Constrained Decoding, and Multi-Turn Conversation Optimization</title><link>https://socake.github.io/posts/sglang-structured-generation/</link><pubDate>Sat, 14 Mar 2026 16:45:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/sglang-structured-generation/</guid><description>SGLang is an underrated LLM inference framework, and RadixAttention pays off substantially in multi-turn conversation and agent scenarios. This article explains SGLang's core mechanisms, frontend DSL, constrained decoding, deployment options, and pitfalls.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/sglang-structured-generation/featured.jpg"/></item><item><title>Dify Self-Hosted Deployment and Building RAG Applications in Practice</title><link>https://socake.github.io/posts/dify-self-hosted-rag-practice/</link><pubDate>Thu, 12 Mar 2026 13:37:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo 
Huang)</author><guid>https://socake.github.io/posts/dify-self-hosted-rag-practice/</guid><description>Dify is currently the most mature LLM application building platform for self-hosted deployment. This article covers Docker Compose deployment, multi-model provider configuration, knowledge base creation and chunking tuning, building a RAG chat application, workflow orchestration, and API publishing with production monitoring.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/dify-self-hosted-rag-practice/featured.jpg"/></item><item><title>Triton Inference Server in Production: Model Orchestration, Dynamic Batching, and Mixed Multi-Framework Deployment</title><link>https://socake.github.io/posts/triton-inference-server-production/</link><pubDate>Wed, 11 Mar 2026 10:00:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/triton-inference-server-production/</guid><description>Demystifying Triton, NVIDIA's inference server, for newcomers: model repository, backends, dynamic batching, ensembles, BLS, the Python backend, production monitoring, and lessons from real pitfalls.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/triton-inference-server-production/featured.jpg"/></item><item><title>Multimodal Large Models in Practice: Image Understanding and Visual Analysis</title><link>https://socake.github.io/posts/multimodal-llm-vision-practice/</link><pubDate>Mon, 09 Mar 2026 13:37:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/multimodal-llm-vision-practice/</guid><description>Covers a comparison of mainstream multimodal models, how to call image understanding APIs, practical scenarios such as OCR, document understanding, and chart parsing, and a complete operations example: using a multimodal model to automatically analyze Grafana screenshots and generate alert summaries.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/multimodal-llm-vision-practice/featured.jpg"/></item><item><title>The Complete Guide to Prompt Engineering: From Basics to Engineering Practice</title><link>https://socake.github.io/posts/prompt-engineering-guide/</link><pubDate>Mon, 09 Mar 2026 11:37:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/prompt-engineering-guide/</guid><description>Prompt engineering is not black magic but an engineering practice with discoverable rules. From basic techniques to enterprise-grade engineering, this article covers a complete methodology for prompt design, including A/B testing, version management, failure mode analysis, and best practices for managing prompts in production systems.</description><media:content 
xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/prompt-engineering-guide/featured.jpg"/></item><item><title>TensorRT-LLM Inference Acceleration in Practice: From Engine Compilation to Kernel Tuning</title><link>https://socake.github.io/posts/tensorrt-llm-inference/</link><pubDate>Sat, 07 Mar 2026 14:20:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/tensorrt-llm-inference/</guid><description>TensorRT-LLM is a key part of NVIDIA's end-to-end inference stack. This article lays out the engine build process, plugin mechanism, quantization strategies, in-flight batching, kernel tuning, and production pitfalls.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/tensorrt-llm-inference/featured.jpg"/></item><item><title>OpenAI API Engineering in Practice: From Hello World to Production</title><link>https://socake.github.io/posts/openai-api-engineering/</link><pubDate>Tue, 03 Mar 2026 11:41:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/openai-api-engineering/</guid><description>The OpenAI API is where most LLM application developers start, but getting from Hello World to a truly reliable production system involves many engineering details. This article covers Function Calling, Structured Output, the Batch API, and Embeddings in full, along with systematic approaches to rate limits, error handling, and cost control.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/openai-api-engineering/featured.jpg"/></item><item><title>vLLM Multi-Node Multi-GPU Distributed Inference: Tensor Parallel Tuning and Lessons Learned</title><link>https://socake.github.io/posts/vllm-multi-node-distributed/</link><pubDate>Tue, 03 Mar 2026 09:30:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/vllm-multi-node-distributed/</guid><description>Moving from a single 8-GPU machine to multiple nodes, this article ties together vLLM's TP/PP partitioning, Ray launch methods, NCCL tuning, PagedAttention memory accounting, and common failure scenarios into one complete path to production.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/vllm-multi-node-distributed/featured.jpg"/></item><item><title>MCP in Practice: Giving AI Agents 
Access to Operations Tools</title><link>https://socake.github.io/posts/mcp-protocol-devops/</link><pubDate>Fri, 27 Feb 2026 09:52:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/mcp-protocol-devops/</guid><description>The Model Context Protocol lets AI call external tools in a standardized way. This article implements an operations MCP Server in Python that connects to kubectl, Prometheus, and Loki, so the AI can query cluster state directly.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/mcp-protocol-devops/featured.jpg"/></item><item><title>Claude Code CLI Guide: An AI-Powered Terminal Coding Assistant</title><link>https://socake.github.io/posts/claude-code-cli-guide/</link><pubDate>Thu, 26 Feb 2026 12:27:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/claude-code-cli-guide/</guid><description>Claude Code is Anthropic's terminal-based AI coding assistant. Unlike editor plugins, it works directly in the terminal to edit files, run commands, and understand the whole codebase. This article covers installation and configuration, core interaction modes, customization with CLAUDE.md, K8s troubleshooting, and automation script scenarios.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/claude-code-cli-guide/featured.jpg"/></item><item><title>Release Automation in Practice: Comparing semantic-release, release-please, and changesets</title><link>https://socake.github.io/posts/release-automation-changelog/</link><pubDate>Wed, 25 Feb 2026 10:00:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/release-automation-changelog/</guid><description>Maintaining CHANGELOG.md by hand, tagging in git by hand, writing release notes by hand: that is how things were done ten years ago. A modern release should work like this: each time a PR is merged, tooling decides the next version number, generates the changelog, creates the tag, and publishes, all automatically. This article explains how the three options differ and how to choose.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/release-automation-changelog/featured.jpg"/></item><item><title>The Complete Guide to Claude API Development: From First Call to Production Applications</title><link>https://socake.github.io/posts/claude-api-development-guide/</link><pubDate>Tue, 24 Feb 2026 11:26:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo 
Huang)</author><guid>https://socake.github.io/posts/claude-api-development-guide/</guid><description>The Claude API's design philosophy differs somewhat from OpenAI's, but once you understand its patterns you will find it very reliable for long text, code generation, and tool calling. This article covers the full development practice from SDK configuration to Prompt Caching, Tool Use, and Vision, along with error handling and cost control strategies for production.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/claude-api-development-guide/featured.jpg"/></item><item><title>Embedding Model Selection and Optimization in Practice: From BGE to OpenAI Embedding</title><link>https://socake.github.io/posts/embedding-model-selection-guide/</link><pubDate>Sat, 21 Feb 2026 09:30:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/embedding-model-selection-guide/</guid><description>A systematic comparison of mainstream embedding models in 2026, from principles to engineering practice, covering selection decisions, cache design, and batch optimization.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/embedding-model-selection-guide/featured.jpg"/></item><item><title>Renovate Dependency Upgrade Bot: From Zero to Production Configuration</title><link>https://socake.github.io/posts/renovate-bot-dependency-upgrade/</link><pubDate>Thu, 19 Feb 2026 10:00:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/renovate-bot-dependency-upgrade/</guid><description>Dependabot is simple enough but limited in capability, and Snyk focuses on security vulnerabilities. Renovate is the middle-ground choice between the two: it can upgrade everything, group updates, schedule them, auto-merge, and be self-hosted. This article is a complete production configuration guide.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/renovate-bot-dependency-upgrade/featured.jpg"/></item><item><title>LangGraph Workflow Orchestration: Building Stateful AI Applications</title><link>https://socake.github.io/posts/langgraph-workflow-orchestration/</link><pubDate>Sun, 15 Feb 2026 12:44:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/langgraph-workflow-orchestration/</guid><description>Starting from the limitations of LangChain Chains, this article explains LangGraph's state machine model, how to design Graphs, Nodes, and Edges, and the engineering of conditional branches, loops, human-in-the-loop, and Checkpoint persistence, then ties all the concepts together with an operations diagnosis workflow.</description><media:content 
xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/langgraph-workflow-orchestration/featured.jpg"/></item><item><title>Langfuse: An LLM Application Observability Platform in Practice</title><link>https://socake.github.io/posts/langfuse-llm-observability/</link><pubDate>Sat, 14 Feb 2026 11:44:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/langfuse-llm-observability/</guid><description>Explains why LLM applications need observability and how Langfuse provides full coverage from tracing, prompt version management, and evaluation experiments to cost analysis, including Docker self-hosted deployment and a complete Python SDK integration example.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/langfuse-llm-observability/featured.jpg"/></item><item><title>Terragrunt for Terraform Engineering at Scale: From DRY to Stacks</title><link>https://socake.github.io/posts/terragrunt-terraform-at-scale/</link><pubDate>Sat, 14 Feb 2026 10:00:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/terragrunt-terraform-at-scale/</guid><description>Once Terraform grows past 10 states, the pain begins: duplicated provider configuration, scattered variables, no way to reference across states, and tangled dependencies during run-all. Terragrunt is a wrapper around Terraform, and the problem it solves is exactly &amp;lsquo;at scale&amp;rsquo;. This article explains how to use it.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/terragrunt-terraform-at-scale/featured.jpg"/></item><item><title>LangChain from Basics to Practice: An Engineering Framework for Building LLM Applications</title><link>https://socake.github.io/posts/langchain-practical-guide/</link><pubDate>Mon, 09 Feb 2026 11:01:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/langchain-practical-guide/</guid><description>LangChain is the most popular framework for building LLM applications, and also one of the easiest to trip over. From LCEL expressions, ReAct Agents, and LangGraph workflows to production deployment, this article sorts out the parts that are genuinely useful and points out which features are best avoided in real engineering.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/langchain-practical-guide/featured.jpg"/></item><item><title>Pulumi vs Terraform vs OpenTofu: A 2026 IaC 
选型深度对比</title><link>https://socake.github.io/posts/pulumi-vs-terraform/</link><pubDate>Mon, 09 Feb 2026 10:00:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/pulumi-vs-terraform/</guid><description>2023 年之后 IaC 世界变了：HashiCorp 把 Terraform 改成 BSL，Linux Foundation 接管了 OpenTofu。Pulumi 依然在代码式 IaC 的路上坚持。团队选型时面对的不是 Terraform 一家独大，而是三条技术路线的真实对比。本文试图给出一个不偏不倚的答案。</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/pulumi-vs-terraform/featured.jpg"/></item><item><title>RAG 评估体系：RAGAS 指标与幻觉检测实践</title><link>https://socake.github.io/posts/rag-evaluation-ragas/</link><pubDate>Thu, 05 Feb 2026 10:20:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/rag-evaluation-ragas/</guid><description>RAG 系统上线后，&amp;lsquo;感觉回答质量还不错&amp;rsquo;不是一个可持续的评估方式。RAGAS 提供了一套可量化的评估框架，让你能追踪 Faithfulness、Answer Relevancy 等指标随时间的变化，并在每次改动后自动验证系统质量没有退化。</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/rag-evaluation-ragas/featured.jpg"/></item><item><title>Advanced RAG：超越 Naive RAG 的高级检索增强技术</title><link>https://socake.github.io/posts/advanced-rag-techniques/</link><pubDate>Wed, 04 Feb 2026 11:33:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/advanced-rag-techniques/</guid><description>系统拆解 Naive RAG 的三类失败模式，提供混合检索、HyDE、查询改写、Parent-Child 分块等高级技术的完整实现</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/advanced-rag-techniques/featured.jpg"/></item><item><title>Earthly 在 Monorepo 的构建统一：Earthfile + Satellites 实战</title><link>https://socake.github.io/posts/earthly-buildfile-monorepo/</link><pubDate>Tue, 03 Feb 2026 10:00:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo 
Huang)</author><guid>https://socake.github.io/posts/earthly-buildfile-monorepo/</guid><description>Bazel 复杂度太高，Makefile 表达力不够，Dockerfile 只能构建一个镜像——Earthly 填的就是这个缝：像 Dockerfile 一样熟悉，像 Makefile 一样组合，像 Bazel 一样可并发、可缓存、可复用。本文讲清楚它在 Monorepo 里的真实位置。</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/earthly-buildfile-monorepo/featured.jpg"/></item><item><title>大模型赋能运维：LLM 在故障排查和自动化中的实际应用</title><link>https://socake.github.io/posts/aiops-llm-devops/</link><pubDate>Sat, 31 Jan 2026 12:06:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/aiops-llm-devops/</guid><description>LLM 不能替代运维工程师，但确实能把重复性、低价值的工作自动化掉。本文分享我在实际工作中用 Claude 落地的几个场景。</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/aiops-llm-devops/featured.jpg"/></item><item><title>AI Agent 设计模式：从单步到复杂工作流</title><link>https://socake.github.io/posts/ai-agent-design-patterns/</link><pubDate>Thu, 29 Jan 2026 09:17:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/ai-agent-design-patterns/</guid><description>Agent不是更智能的ChatGPT调用，它是一个能自主规划和执行多步骤任务的循环系统。本文拆解ReAct推理循环、Tool调用设计原则、Multi-Agent协作模式、Human-in-the-loop设计，以及告警分析Agent和巡检Agent的实战实现。</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/ai-agent-design-patterns/featured.jpg"/></item><item><title>Nix + devcontainer：彻底终结 works on my machine</title><link>https://socake.github.io/posts/nix-devcontainer-reproducible-env/</link><pubDate>Wed, 28 Jan 2026 10:00:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/nix-devcontainer-reproducible-env/</guid><description>新同事入职第一天配环境要花一天，CI 和本地构建结果不一致，升级 Node 16 到 20 引发连锁故障——这些痛都源于&amp;rsquo;环境不是代码&amp;rsquo;。Nix 把工具链当成代码版本化，和 direnv/devcontainer 配合能做到 &amp;lsquo;git clone 后 10 
秒进入完整可用环境&amp;rsquo;。本文是完整落地教程。</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/nix-devcontainer-reproducible-env/featured.jpg"/></item><item><title>LLM 应用安全：Prompt Injection 防御与 AI Guardrails 实战</title><link>https://socake.github.io/posts/llm-security-guardrails/</link><pubDate>Fri, 23 Jan 2026 11:01:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/llm-security-guardrails/</guid><description>我们的 AI 客服系统曾被一个用户用一句话绕过所有限制，让它泄露了内部知识库的敏感信息。这篇文章系统梳理 LLM 应用的安全威胁模型，以及我们在生产系统中实施的防御层次。</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/llm-security-guardrails/featured.jpg"/></item><item><title>Dagger 实战：用代码而不是 YAML 编写 CI/CD</title><link>https://socake.github.io/posts/dagger-programmable-cicd/</link><pubDate>Wed, 21 Jan 2026 10:00:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/dagger-programmable-cicd/</guid><description>每次迁移 CI 平台（Jenkins → GitLab → GitHub Actions → Tekton），业务流水线都要重写一遍。Dagger 的思路是：把流水线写成可移植的代码（Go/Python/TS），底层引擎负责执行和缓存，CI 平台只是调用方。本文讲清楚它怎么工作、什么时候值得引入。</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/dagger-programmable-cicd/featured.jpg"/></item><item><title>LLM 成本优化实战：从 Token 预算到模型路由</title><link>https://socake.github.io/posts/llm-cost-optimization/</link><pubDate>Mon, 19 Jan 2026 13:03:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/llm-cost-optimization/</guid><description>我们的 AI 功能上线第一个月，LLM API 账单是 $18,000。通过模型路由、Prompt Caching 和 Batch API，第三个月降到了 $3,200。这篇文章记录具体怎么做到的。</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/llm-cost-optimization/featured.jpg"/></item><item><title>LLM Tool Use 完全指南：Function Calling 
设计模式与生产实践</title><link>https://socake.github.io/posts/llm-tool-use-function-calling/</link><pubDate>Sun, 18 Jan 2026 12:36:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/llm-tool-use-function-calling/</guid><description>从工程视角深入 LLM Tool Use：覆盖 OpenAI 与 Claude API 差异、工具 Schema 设计、并发调用、错误恢复，附完整运维助手代码示例</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/llm-tool-use-function-calling/featured.jpg"/></item><item><title>Tekton Pipelines 企业级落地：从 Task 抽象到供应链签名</title><link>https://socake.github.io/posts/tekton-pipelines-production/</link><pubDate>Thu, 15 Jan 2026 10:00:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/tekton-pipelines-production/</guid><description>Jenkins 扛不动 K8s Native 的调度压力，GitLab Runner 又太 monolithic。Tekton 把 &amp;lsquo;CI job&amp;rsquo; 拆成 Task + Pipeline + PipelineRun 三层 CRD，所有执行都是 Pod，天然贴合 K8s。本文讲清楚它在企业里该怎么用——以及怎么避免把它用成 YAML 地狱。</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/tekton-pipelines-production/featured.jpg"/></item><item><title>LLM 微调入门：LoRA 让大模型适配私有场景</title><link>https://socake.github.io/posts/llm-finetuning-lora-practice/</link><pubDate>Wed, 14 Jan 2026 09:56:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/llm-finetuning-lora-practice/</guid><description>什么时候该微调、什么时候该用提示工程？本文给出决策框架，然后用Unsloth+QLoRA实战微调Qwen2.5-7B，覆盖数据格式、训练监控、权重合并、部署到vLLM测试，以及10个真实踩坑记录。</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/llm-finetuning-lora-practice/featured.jpg"/></item><item><title>LLM 生产服务化：vLLM 部署与 GPU 推理优化实战</title><link>https://socake.github.io/posts/llm-production-serving-vllm/</link><pubDate>Tue, 13 Jan 2026 13:36:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo 
Huang)</author><guid>https://socake.github.io/posts/llm-production-serving-vllm/</guid><description>团队把 Ollama 搬上生产后，高峰期请求排队超过 30 秒，用户纷纷反映 AI 功能不可用。这篇文章记录我们迁移到 vLLM 的全过程，包括 PagedAttention、Continuous Batching 原理，以及 Kubernetes GPU 部署的完整配置。</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/llm-production-serving-vllm/featured.jpg"/></item><item><title>2026 大模型全景：主力模型横评与选型指南</title><link>https://socake.github.io/posts/llm-landscape-2025/</link><pubDate>Fri, 09 Jan 2026 13:50:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/llm-landscape-2025/</guid><description>GPT-5.4、Claude Opus 4.6、Gemini 2.5 Pro、Llama 4 Scout、DeepSeek V3.2——2026年4月的大模型格局已经和一年前完全不同。本文从工程师视角梳理当前主力模型的真实规格与适用边界，给出场景化选型矩阵，并讨论开源追平闭源、推理模型标配化、agent workload 崛起这三个2026年的核心判断。</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/llm-landscape-2025/featured.jpg"/></item><item><title>ko 实战：无 Dockerfile 构建 Go 容器镜像的正确姿势</title><link>https://socake.github.io/posts/ko-go-image-build/</link><pubDate>Fri, 09 Jan 2026 10:00:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/ko-go-image-build/</guid><description>同样是构建 Go 镜像，用 Dockerfile + BuildKit 要 2-3 分钟，用 ko 只需要 5-20 秒。差距来自 ko 不走 daemon、不写 tar、直接把 Go 编译产物塞进 OCI manifest。本文讲清楚这套 &amp;lsquo;Dockerfile-less&amp;rsquo; 构建到底怎么落地到生产，以及什么时候不该用它。</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/ko-go-image-build/featured.jpg"/></item><item><title>BuildKit 缓存生产实战：从多阶段到远端 Registry Cache</title><link>https://socake.github.io/posts/buildkit-cache-production/</link><pubDate>Sat, 03 Jan 2026 10:00:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/buildkit-cache-production/</guid><description>BuildKit 的缓存体系看似简单一行 
--cache-to，实际生产里坑极多：mode=max 在多架构下的 manifest 行为、registry 后端每层 0.3s 的验证开销、cache mount 在 --cache-to=registry 下不被导出的限制、GHA 后端 10GB 上限……本文基于真实 CI 流水线的调优记录，给出一套可复制的生产配置。</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/buildkit-cache-production/featured.jpg"/></item><item><title>基于 Error Budget 的 Prometheus 告警设计——燃烧率告警实战</title><link>https://socake.github.io/posts/prometheus-error-budget-alerting/</link><pubDate>Thu, 25 Dec 2025 10:40:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/prometheus-error-budget-alerting/</guid><description>错误率告警有一个致命问题：它不告诉你问题有多紧急。1% 的错误率，持续 2 小时和持续 10 分钟，对 SLO 的威胁完全不同。燃烧率告警从 Error Budget 消耗速度出发，让每一次告警都携带&amp;quot;紧急程度&amp;quot;信息。</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/prometheus-error-budget-alerting/featured.jpg"/></item><item><title>告警带图实战：Grafana Render + 钉钉推送趋势图</title><link>https://socake.github.io/posts/prometheus-alert-with-image/</link><pubDate>Tue, 23 Dec 2025 09:54:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/prometheus-alert-with-image/</guid><description>收到告警只有一行数字，还要登录 Grafana 才能看趋势图——这是告警体验最大的痛点之一。本文介绍如何将 Grafana Image Renderer 与 Alertmanager Webhook 结合，实现告警消息自动附带趋势图的完整方案。</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/prometheus-alert-with-image/featured.jpg"/></item><item><title>Prometheus 进程监控：process-exporter 实战与告警配置</title><link>https://socake.github.io/posts/prometheus-process-monitoring/</link><pubDate>Thu, 18 Dec 2025 11:20:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/prometheus-process-monitoring/</guid><description>K8s 有完善的 Pod 监控体系，但裸机和 VM 上运行的进程如何监控？本文介绍 process-exporter 
的部署与配置实践，覆盖进程组匹配、核心指标、告警规则设计及实际踩坑经验。</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/prometheus-process-monitoring/featured.jpg"/></item><item><title>Kibana 实战：从日志查询到 Dashboard 可视化的完整指南</title><link>https://socake.github.io/posts/kibana-visualization-guide/</link><pubDate>Sat, 13 Dec 2025 09:08:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/kibana-visualization-guide/</guid><description>Kibana 是我们 ELK 体系里使用频率最高的工具。这篇文章把我在实际运维中积累的 Kibana 使用技巧整理成体系，从 Discover 查询到 Dashboard 制作，再到 ILM 管理。</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/kibana-visualization-guide/featured.jpg"/></item><item><title>高级运维/DevOps 工程师面试题精选：系统设计与深度考察</title><link>https://socake.github.io/posts/devops-senior-interview/</link><pubDate>Thu, 11 Dec 2025 12:51:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/devops-senior-interview/</guid><description>高级运维面试考什么？本文整理 5 道系统设计题和 10 道深度技术题，每题给出答题框架。从监控体系设计到 K8s 调度器原理，从生产事故复盘到新技术引入决策，帮你建立完整的回答思路。</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/devops-senior-interview/featured.jpg"/></item><item><title>DevOps/运维工程师面试题精选：K8s、Linux、网络高频考点</title><link>https://socake.github.io/posts/devops-interview-questions/</link><pubDate>Sun, 07 Dec 2025 13:07:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/devops-interview-questions/</guid><description>基于真实面试经验整理的运维/DevOps 面试题，覆盖 K8s 调度、故障排查、Linux 内核、网络协议等方向，附「面试官真正想考的点」，帮你把答案说到位。</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/devops-interview-questions/featured.jpg"/></item><item><title>SLSA 软件供应链等级实施：从 L1 到 L3 
的工程化路径</title><link>https://socake.github.io/posts/supply-chain-slsa-framework/</link><pubDate>Fri, 05 Dec 2025 10:00:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/supply-chain-slsa-framework/</guid><description>一份 SLSA v1.0 框架的实战落地笔记：讲清楚 Build Track 从 L1 到 L3 的具体要求、用 GitHub Actions 官方 generator 和 Tekton Chains 生成 provenance、用 slsa-verifier 和 Kyverno 做验证、以及和前面 Sigstore/Kyverno/Cosign 的整合。</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/supply-chain-slsa-framework/featured.jpg"/></item><item><title>阿里云 SDK 运维自动化：ECS/ACK/RDS 资源管理与巡检脚本</title><link>https://socake.github.io/posts/aliyun-sdk-ops/</link><pubDate>Thu, 04 Dec 2025 12:56:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/aliyun-sdk-ops/</guid><description>用阿里云 Python SDK 实现 ECS 实例查询与监控、ACK 节点状态检查、RDS 慢查询巡检，整合成 HTML 格式巡检报告自动推送钉钉。</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/aliyun-sdk-ops/featured.jpg"/></item><item><title>Kubernetes Operator 开发实战：Go + controller-runtime 完全指南</title><link>https://socake.github.io/posts/kubernetes-operator-development/</link><pubDate>Wed, 03 Dec 2025 14:00:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/kubernetes-operator-development/</guid><description>用 Go + controller-runtime 开发生产级 Kubernetes Operator 的完整实战指南。以 DatabaseCluster Operator 为例，深入讲解 CRD 设计、Reconcile 模式、Status Conditions、Finalizer 防孤儿资源、Leader Election、指标暴露、Webhook 验证，以及 envtest + Kind 测试策略。</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/kubernetes-operator-development/featured.jpg"/></item><item><title>Kubernetes 多租户方案深度对比：vCluster vs Capsule vs HNC</title><link>https://socake.github.io/posts/kubernetes-multitenancy-deep-dive/</link><pubDate>Wed, 03 Dec 2025 
10:00:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/kubernetes-multitenancy-deep-dive/</guid><description>Namespace 级隔离远不够用。本文深入剖析 vCluster、Capsule、HNC 三种主流多租户方案的架构差异，给出完整的部署配置示例、隔离能力横向对比，以及 SaaS 平台、内部平台、开发环境三种场景下的选型建议。</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/kubernetes-multitenancy-deep-dive/featured.jpg"/></item><item><title>基础设施即代码：Terraform 入门与实践</title><link>https://socake.github.io/posts/%E5%9F%BA%E7%A1%80%E8%AE%BE%E6%96%BD%E5%8D%B3%E4%BB%A3%E7%A0%81/</link><pubDate>Sun, 30 Nov 2025 09:44:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/%E5%9F%BA%E7%A1%80%E8%AE%BE%E6%96%BD%E5%8D%B3%E4%BB%A3%E7%A0%81/</guid><description>从 IaC 解决的本质问题出发，系统介绍 Terraform 的核心概念和工作流，重点覆盖 State 管理、模块化最佳实践，以及常见陷阱。</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/%E5%9F%BA%E7%A1%80%E8%AE%BE%E6%96%BD%E5%8D%B3%E4%BB%A3%E7%A0%81/featured.jpg"/></item><item><title>Kyverno 策略即代码实战：从准入到变异到生成的全场景落地</title><link>https://socake.github.io/posts/kyverno-policy-as-code/</link><pubDate>Fri, 28 Nov 2025 10:00:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/kyverno-policy-as-code/</guid><description>一份基于 Kyverno 1.12+ 的生产落地笔记：覆盖 validate/mutate/generate/verifyImages 四种策略类型的实战用法、CEL 和 JMESPath 表达式语法、策略分层治理、PolicyException、性能调优和常见踩坑，并与 OPA Gatekeeper 做对比。</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/kyverno-policy-as-code/featured.jpg"/></item><item><title>零信任网络改造：从公网暴露到 Headscale VPN</title><link>https://socake.github.io/posts/%E9%9B%B6%E4%BF%A1%E4%BB%BB%E7%BD%91%E7%BB%9C%E5%AE%9E%E8%B7%B5/</link><pubDate>Sat, 22 Nov 2025 13:37:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo 
Huang)</author><guid>https://socake.github.io/posts/%E9%9B%B6%E4%BF%A1%E4%BB%BB%E7%BD%91%E7%BB%9C%E5%AE%9E%E8%B7%B5/</guid><description>从发现公网暴露的安全隐患开始，到用 Headscale 自建零信任网络，替代跳板机体系，实现 kubectl 和运维系统的 VPN 接入。</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/%E9%9B%B6%E4%BF%A1%E4%BB%BB%E7%BD%91%E7%BB%9C%E5%AE%9E%E8%B7%B5/featured.jpg"/></item><item><title>Pod Security Standards 生产落地：从 PSP 到 PSA 的迁移实战</title><link>https://socake.github.io/posts/kubernetes-pod-security-standards/</link><pubDate>Fri, 21 Nov 2025 10:00:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/kubernetes-pod-security-standards/</guid><description>一份从 PSP 迁移到 Pod Security Standards 的实战笔记：对比 Baseline 与 Restricted 两套 profile 的实际约束、Pod Security Admission 的三种 mode、如何一次性迁移 200+ 命名空间、和 Kyverno/OPA 互补使用的最佳实践，以及遗留业务 securityContext 改造的典型模式。</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/kubernetes-pod-security-standards/featured.jpg"/></item><item><title>如何设计一个好的告警体系</title><link>https://socake.github.io/posts/%E5%91%8A%E8%AD%A6%E4%BD%93%E7%B3%BB%E8%AE%BE%E8%AE%A1/</link><pubDate>Tue, 18 Nov 2025 13:37:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/%E5%91%8A%E8%AD%A6%E4%BD%93%E7%B3%BB%E8%AE%BE%E8%AE%A1/</guid><description>从真实的告警噪音泛滥经历出发，分享如何用 SLI/SLO 重新设计告警体系，包括告警分级、规则设计原则、路由策略和复盘机制。</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/%E5%91%8A%E8%AD%A6%E4%BD%93%E7%B3%BB%E8%AE%BE%E8%AE%A1/featured.jpg"/></item><item><title>大模型核心概念：工程师需要理解的 LLM 基础</title><link>https://socake.github.io/posts/llm-core-concepts/</link><pubDate>Mon, 17 Nov 2025 11:37:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/llm-core-concepts/</guid><description>同事第一次用 GPT-4 API 
写代码时问我：为什么我发了一段中文，token 消耗比英文多那么多？为什么模型有时候会一本正经地胡说八道？这篇文章把我认为工程师必须理解的 LLM 概念系统整理了一遍，不涉及 Transformer 数学，只讲对你写代码有帮助的部分。</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/llm-core-concepts/featured.jpg"/></item><item><title>密钥自动轮换实战：Vault、AWS Secrets Manager 与 SOPS 的工程化方案</title><link>https://socake.github.io/posts/secret-rotation-automation/</link><pubDate>Fri, 14 Nov 2025 10:00:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/secret-rotation-automation/</guid><description>一份来自生产环境的密钥轮换实战笔记：对比 Vault dynamic secret、AWS Secrets Manager 原生 rotation、SOPS + GitOps 三种方案的适用场景，给出数据库、Kafka SASL、TLS 证书、API key 的完整轮换工作流，并分享 ESO 同步、rotation 风暴、灰度发布等真实踩坑。</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/secret-rotation-automation/featured.jpg"/></item><item><title>RAG 系统设计与实战：检索增强生成完全指南</title><link>https://socake.github.io/posts/rag-system-design-practice/</link><pubDate>Tue, 11 Nov 2025 11:41:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/rag-system-design-practice/</guid><description>RAG（检索增强生成）是目前企业落地 LLM 最主流的方式。本文覆盖 RAG 系统的完整设计：文档处理管线、分块策略、向量检索与关键词混合检索、Rerank 重排序、上下文压缩，以及用 RAGAS 框架评估 RAG 质量，最后分享生产环境踩坑记录。</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/rag-system-design-practice/featured.jpg"/></item><item><title>WebAssembly 在云原生中的应用：从浏览器到 K8s 数据面</title><link>https://socake.github.io/posts/webassembly-cloud-native/</link><pubDate>Sat, 08 Nov 2025 14:00:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/webassembly-cloud-native/</guid><description>WebAssembly 在云原生领域的热度持续上涨，但很多讨论都停留在概念层面。这篇文章试图给出一个务实的视角：Wasm 在哪些云原生场景已经可以生产落地，在哪些场景还需要等待，以及和容器相比的真实差异。</description><media:content xmlns:media="http://search.yahoo.com/mrss/" 
url="https://socake.github.io/posts/webassembly-cloud-native/featured.jpg"/></item><item><title>Istio Ambient Mode 无 Sidecar 服务网格实践</title><link>https://socake.github.io/posts/istio-ambient-mesh-practice/</link><pubDate>Sat, 08 Nov 2025 10:00:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/istio-ambient-mesh-practice/</guid><description>Sidecar 模式已经陪我们走了六七年，但它的问题也越来越难以忽视。Ambient Mode 不是缝缝补补，而是从架构层面重新设计了服务网格的数据面。本文从实际运维视角深入拆解 ztunnel + Waypoint 两层架构，并给出从 Sidecar 迁移到 Ambient 的完整路径。</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/istio-ambient-mesh-practice/featured.jpg"/></item><item><title>用 WireGuard 构建多云 mesh VPN：从点对点到全网互联</title><link>https://socake.github.io/posts/wireguard-mesh-vpn/</link><pubDate>Fri, 07 Nov 2025 10:00:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/wireguard-mesh-vpn/</guid><description>一份从实战出发的 WireGuard mesh VPN 笔记：讲清楚为什么不用 IPSec/OpenVPN、手写配置 vs Netmaker vs Tailscale 的选型对比、AWS 与阿里云跨云 mesh 的真实部署方案、MTU 与 NAT 穿透的踩坑，以及自动化密钥分发与监控方案。</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/wireguard-mesh-vpn/featured.jpg"/></item><item><title>Milvus 向量数据库实战：从部署到生产应用</title><link>https://socake.github.io/posts/milvus-vector-database-practice/</link><pubDate>Thu, 06 Nov 2025 09:52:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/milvus-vector-database-practice/</guid><description>覆盖向量数据库选型对比（Milvus/Qdrant/Weaviate/pgvector）、Milvus Standalone与Cluster部署、Collection Schema设计、HNSW/IVF_FLAT索引调优、混合搜索实战，以及生产环境常见问题处理。</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/milvus-vector-database-practice/featured.jpg"/></item><item><title>Kubernetes GPU 调度实战：AI 
训练与推理基础设施</title><link>https://socake.github.io/posts/kubernetes-gpu-scheduling/</link><pubDate>Wed, 05 Nov 2025 14:00:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/kubernetes-gpu-scheduling/</guid><description>GPU 是 AI 基础设施的核心资源，如何在 Kubernetes 上高效调度和管理 GPU 直接影响训练效率和推理成本。本文从底层驱动安装到上层调度策略，完整覆盖 K8s GPU 基础设施的搭建、监控和优化实践。</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/kubernetes-gpu-scheduling/featured.jpg"/></item><item><title>Python 操作 Elasticsearch：从索引管理到复杂聚合查询</title><link>https://socake.github.io/posts/python-elasticsearch-client/</link><pubDate>Tue, 04 Nov 2025 12:27:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/python-elasticsearch-client/</guid><description>从客户端初始化到批量操作、scroll 查询、聚合统计，一篇文章搞定 Python 操作 Elasticsearch 的高频场景。</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/python-elasticsearch-client/featured.jpg"/></item><item><title>Python 定时任务工程化：APScheduler 与 Celery Beat 实战对比</title><link>https://socake.github.io/posts/python-scheduled-tasks/</link><pubDate>Sat, 01 Nov 2025 11:26:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/python-scheduled-tasks/</guid><description>APScheduler 和 Celery Beat 是 Python 定时任务的两大主流方案。本文从使用场景出发，对比两者的架构差异、适用边界，并介绍 K8s CronJob 作为第三条路的价值，帮你在项目里选对工具。</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/python-scheduled-tasks/featured.jpg"/></item><item><title>Cilium NetworkPolicy 与 L7 过滤生产落地实战</title><link>https://socake.github.io/posts/cilium-network-policy-production/</link><pubDate>Fri, 31 Oct 2025 10:00:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/cilium-network-policy-production/</guid><description>一份基于 Cilium 1.16+ 
的生产落地笔记：讲清楚 Kubernetes NetworkPolicy 的局限、CiliumNetworkPolicy 的扩展能力、L7 HTTP/Kafka/DNS 过滤的真实用法、Hubble 可观测性、策略开发方法论，以及多集群 ClusterMesh 场景下的策略治理。</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/cilium-network-policy-production/featured.jpg"/></item><item><title>CoreDNS 深度排障：K8s DNS 问题完全指南</title><link>https://socake.github.io/posts/coredns-troubleshooting-guide/</link><pubDate>Wed, 29 Oct 2025 09:30:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/coredns-troubleshooting-guide/</guid><description>DNS 问题是 K8s 中最难定位的问题之一，因为它的失败往往是间歇性的、有延迟的，看起来像网络问题，实际上是 DNS 超时。本文记录了我在生产环境排查过的多类 DNS 故障，附详细的抓包分析和调优配置。</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/coredns-troubleshooting-guide/featured.jpg"/></item><item><title>SBOM 生成与 Dependency-Track 漏洞管理实战</title><link>https://socake.github.io/posts/sbom-dependency-track/</link><pubDate>Fri, 24 Oct 2025 10:00:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/sbom-dependency-track/</guid><description>一份基于生产环境的 SBOM 实战指南：讲清楚 CycloneDX 与 SPDX 的格式差异、Syft/cdxgen/Trivy 三款主流生成器的对比，部署 Dependency-Track 4.12 做持续漏洞监测，通过策略违规自动化处置 CVE，并分享 SBOM 消费链路上的真实踩坑。</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/sbom-dependency-track/featured.jpg"/></item><item><title>k6 压测实战：从脚本编写到性能分析</title><link>https://socake.github.io/posts/k6-load-testing-practice/</link><pubDate>Tue, 21 Oct 2025 12:44:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/k6-load-testing-practice/</guid><description>压测不是跑一个脚本看能不能撑住，而是通过有设计的负载模型暴露系统瓶颈。本文记录了我用 k6 做生产级性能测试的完整实践：脚本设计、阈值配置、与 Grafana 集成，以及几个典型性能问题的定位过程。</description><media:content xmlns:media="http://search.yahoo.com/mrss/" 
url="https://socake.github.io/posts/k6-load-testing-practice/featured.jpg"/></item><item><title>TCP/IP 网络排障：抓包与连接问题诊断</title><link>https://socake.github.io/posts/tcp-network-troubleshooting/</link><pubDate>Tue, 21 Oct 2025 11:44:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/tcp-network-troubleshooting/</guid><description>网络问题排查的核心是「眼见为实」，没有抓包的排障都是猜测。本文系统梳理了 tcpdump 的实战用法、TCP 连接状态机分析、conntrack 追踪，以及 Kubernetes 中 NodePort/LoadBalancer 的典型网络故障定位方法。</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/tcp-network-troubleshooting/featured.jpg"/></item><item><title>Sigstore/Cosign 镜像签名实战：从 keyless 签名到准入策略验证</title><link>https://socake.github.io/posts/sigstore-cosign-signing-workflow/</link><pubDate>Fri, 17 Oct 2025 10:00:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/sigstore-cosign-signing-workflow/</guid><description>一份 Sigstore 生产化落地笔记：讲清楚 Fulcio/Rekor/Cosign 三件套的工作原理，演示 GitHub Actions 和 GitLab CI 下的 keyless 签名流水线，对接 Kyverno/Policy Controller 做准入验证，并分享签名验证性能、Rekor 不可用降级、多签策略等真实运维经验。</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/sigstore-cosign-signing-workflow/featured.jpg"/></item><item><title>Vector 日志处理管道：高性能日志采集与转换实践</title><link>https://socake.github.io/posts/vector-log-pipeline/</link><pubDate>Tue, 14 Oct 2025 11:01:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/vector-log-pipeline/</guid><description>从架构对比到 K8s DaemonSet 落地，结合 VRL 实战示例和踩坑经验，讲透 Vector 在日志采集管道中的应用。</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/vector-log-pipeline/featured.jpg"/></item><item><title>Filebeat + Logstash 日志采集管道：大规模日志处理实战</title><link>https://socake.github.io/posts/filebeat-logstash-pipeline/</link><pubDate>Fri, 10 Oct 2025 
10:20:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/filebeat-logstash-pipeline/</guid><description>大流量日志场景下，Fleet 直写 ES 会出现严重写入堆积。本文记录了我们从 Fleet 切换到 Filebeat + Kafka + Logstash 管道的全过程，重点讲 Logstash pipeline 配置和性能调优。</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/filebeat-logstash-pipeline/featured.jpg"/></item><item><title>SPIFFE/SPIRE 工作负载身份实战：零信任网络的身份基石</title><link>https://socake.github.io/posts/spiffe-spire-workload-identity/</link><pubDate>Fri, 10 Oct 2025 10:00:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/spiffe-spire-workload-identity/</guid><description>一份从生产部署出发的 SPIFFE/SPIRE 实战笔记：讲清楚 SVID、节点证明、工作负载证明、信任域联邦这些核心概念，用 Kubernetes + Istio + 非 K8s 工作负载的混合场景展示 SPIRE 如何统一身份，并分享升级、备份、Agent 崩溃等真实运维踩坑。</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/spiffe-spire-workload-identity/featured.jpg"/></item><item><title>ELK 集群监控：用 Prometheus + Grafana 监控 Elasticsearch 健康</title><link>https://socake.github.io/posts/elk-prometheus-monitoring/</link><pubDate>Wed, 08 Oct 2025 11:33:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/elk-prometheus-monitoring/</guid><description>Kibana 内置的 Stack Monitoring 免费功能有限，告警媒介也受商业授权约束。我们最终选择 Prometheus + Grafana 方案监控 ELK 集群，这篇文章记录完整的落地过程和踩坑。</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/elk-prometheus-monitoring/featured.jpg"/></item><item><title>Elasticsearch 备份与恢复：快照管理与跨集群迁移实践</title><link>https://socake.github.io/posts/elasticsearch-backup-restore/</link><pubDate>Fri, 03 Oct 2025 12:06:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/elasticsearch-backup-restore/</guid><description>Snapshot API 配置、S3 IRSA 
认证、定时快照脚本，以及跨集群迁移三种方案的对比与实战踩坑。</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/elasticsearch-backup-restore/featured.jpg"/></item><item><title>Falco 运行时安全实战：从规则开发到生产级调优</title><link>https://socake.github.io/posts/falco-runtime-security-deep/</link><pubDate>Fri, 03 Oct 2025 09:30:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/falco-runtime-security-deep/</guid><description>一份来自生产环境的 Falco 实战笔记：从 eBPF 驱动选型、规则开发方法论、误报治理，到与 Falcosidekick、Loki、SIEM 的告警联动，覆盖 0.40/0.41/0.42 三个版本的关键变更与真实踩坑案例。</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/falco-runtime-security-deep/featured.jpg"/></item><item><title>Elasticsearch 查询实战：从 URI Search 到 DSL 复杂聚合</title><link>https://socake.github.io/posts/elasticsearch-dsl-query/</link><pubDate>Wed, 01 Oct 2025 09:17:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/elasticsearch-dsl-query/</guid><description>ES 查询是每个运维必须掌握的技能。这篇文章从 URI Search 快速上手，到 DSL bool 查询、聚合分析，再到运维常用的 _cat API，配合真实排障场景整理成一篇实战手册。</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/elasticsearch-dsl-query/featured.jpg"/></item><item><title>Prometheus 高基数治理实战：从 8 亿 series 到可控增长</title><link>https://socake.github.io/posts/metric-cardinality-governance/</link><pubDate>Sun, 28 Sep 2025 10:00:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/metric-cardinality-governance/</guid><description>高基数是 Prometheus 生态里最常见的性能杀手。这篇把「为什么发生、怎么发现、怎么治理」讲清楚，并给出一套可推广的组织治理方案。</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/metric-cardinality-governance/featured.jpg"/></item><item><title>Elasticsearch 索引策略：ILM 
生命周期管理与写入性能优化</title><link>https://socake.github.io/posts/elasticsearch-index-optimization/</link><pubDate>Wed, 24 Sep 2025 11:01:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/elasticsearch-index-optimization/</guid><description>ILM 四阶段配置、rollover 策略、bulk 写入调优，以及分片数规划和 mapping 爆炸的避坑指南。</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/elasticsearch-index-optimization/featured.jpg"/></item><item><title>On-Call 轮值管理实战：从告警疲劳到可持续值班</title><link>https://socake.github.io/posts/oncall-rotation-management/</link><pubDate>Wed, 24 Sep 2025 10:00:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/oncall-rotation-management/</guid><description>On-call 不是福利也不是惩罚，是一份职责。把它做成可持续的工程实践，比任何高级监控工具都重要。</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/oncall-rotation-management/featured.jpg"/></item><item><title>Elasticsearch 集群部署实战：ECK 在 K8s 上的生产级配置</title><link>https://socake.github.io/posts/elasticsearch-cluster-deployment/</link><pubDate>Fri, 19 Sep 2025 13:03:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/elasticsearch-cluster-deployment/</guid><description>从集群角色规划到 ECK Operator 落地，结合生产环境踩坑经验，完整讲解 Elasticsearch 在 Kubernetes 上的生产级部署方案。</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/elasticsearch-cluster-deployment/featured.jpg"/></item><item><title>eBPF 可观测性实践：Cilium 网络监控与 Tetragon 安全审计</title><link>https://socake.github.io/posts/ebpf-observability/</link><pubDate>Wed, 17 Sep 2025 12:36:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/ebpf-observability/</guid><description>eBPF 正在重塑云原生可观测性的底层基础。本文记录在 K8s 集群中落地 Cilium + Hubble 网络监控和 Tetragon 
安全审计的实践经验。</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/ebpf-observability/featured.jpg"/></item><item><title>混沌工程实战：Chaos Mesh 在 K8s 中注入故障</title><link>https://socake.github.io/posts/chaos-mesh-practice/</link><pubDate>Sat, 13 Sep 2025 09:56:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/chaos-mesh-practice/</guid><description>混沌工程不是破坏系统，而是在可控环境中提前暴露脆弱点。本文记录了我用 Chaos Mesh 在生产级 K8s 集群中设计并执行混沌演练的完整过程，包括安装、实验配置、Workflow 编排和游戏日流程设计。</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/chaos-mesh-practice/featured.jpg"/></item><item><title>Backstage 开发者门户实战：构建内部开发者平台</title><link>https://socake.github.io/posts/backstage-developer-portal/</link><pubDate>Fri, 12 Sep 2025 10:00:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/backstage-developer-portal/</guid><description>当团队规模超过 50 人，服务数量超过 100 个，「配置漂移」和「信息孤岛」就成了真实痛点。Backstage 是解决这个问题的平台工程利器。本文从部署到定制，完整拆解如何用 Backstage 构建真正能用起来的内部开发者平台。</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/backstage-developer-portal/featured.jpg"/></item><item><title>OPA/Kyverno：K8s 准入控制策略实战</title><link>https://socake.github.io/posts/opa-kyverno-admission-control/</link><pubDate>Thu, 11 Sep 2025 13:36:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/opa-kyverno-admission-control/</guid><description>没有准入控制的 K8s 集群就像一个没有门卫的机房——任何人都能随意进出。本文记录了我在多个生产集群部署 Kyverno 策略的实战经验，涵盖资源限制强制、镜像来源白名单、标签规范、以及与 OPA Gatekeeper 的对比选型思路。</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/opa-kyverno-admission-control/featured.jpg"/></item><item><title>故障响应与 Blameless 
复盘：让每一次事故都变成组织资产</title><link>https://socake.github.io/posts/incident-response-postmortem/</link><pubDate>Wed, 10 Sep 2025 10:00:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/incident-response-postmortem/</guid><description>事故响应不是英雄主义，是一套可重复的流程。把流程、模板、文化讲清楚，让每次事故都能沉淀成组织资产。</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/incident-response-postmortem/featured.jpg"/></item><item><title>供应链安全：Trivy 镜像扫描 + Cosign 签名验证实践</title><link>https://socake.github.io/posts/trivy-cosign-supply-chain/</link><pubDate>Sat, 06 Sep 2025 13:50:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/trivy-cosign-supply-chain/</guid><description>你的镜像安全吗？本文梳理容器供应链的主要攻击面，手把手演示 Trivy 扫描、Cosign 签名、K8s 准入控制三层防护的搭建过程，并给出 GitLab CI 集成示例。</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/trivy-cosign-supply-chain/featured.jpg"/></item><item><title>混沌工程 GameDay 实战指南：从第一次演练到常态化故障注入</title><link>https://socake.github.io/posts/chaos-engineering-gameday/</link><pubDate>Wed, 27 Aug 2025 10:00:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/chaos-engineering-gameday/</guid><description>别把混沌工程理解成随便 kill pod。真正有价值的是一套假设驱动的演练方法论：演练前写下假设，演练中验证，复盘后改进系统和流程。</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/chaos-engineering-gameday/featured.jpg"/></item><item><title>用 Go 写 K8s 运维工具：client-go 实战</title><link>https://socake.github.io/posts/go-kubernetes-client-tools/</link><pubDate>Mon, 25 Aug 2025 09:08:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/go-kubernetes-client-tools/</guid><description>kubectl 能解决 80% 的日常问题，剩下 20% 需要你自己写工具。本文用实际可运行的 Go 代码，展示如何用 client-go 构建批量重启 Deployment、Pod 资源报告、过期 ConfigMap 
清理等运维工具，并用 cobra 封装成 CLI。</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/go-kubernetes-client-tools/featured.jpg"/></item><item><title>AWS EKS 生产实践：网络、安全与多集群管理</title><link>https://socake.github.io/posts/aws-eks-best-practices/</link><pubDate>Fri, 22 Aug 2025 12:51:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/aws-eks-best-practices/</guid><description>管理多套 EKS 集群两年下来，踩了不少坑。本文系统整理网络选型、IAM 权限、节点管理、集群升级、安全加固和成本控制这六个核心话题，每个话题都有具体配置示例和实际遇到的问题。</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/aws-eks-best-practices/featured.jpg"/></item><item><title>DevSecOps 安全左移实践：从代码到生产的全链路安全</title><link>https://socake.github.io/posts/devsecops-practice/</link><pubDate>Wed, 20 Aug 2025 10:30:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/devsecops-practice/</guid><description>安全不是最后一道关卡，而是嵌入每个研发环节的连续过程。本文从代码静态分析、依赖漏洞扫描、镜像安全、K8s 运行时防护到供应链签名，逐层拆解 DevSecOps 的完整实施路径，并给出一个可落地的流水线设计。</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/devsecops-practice/featured.jpg"/></item><item><title>Kubernetes 成本优化实战：系统性降本的四条路径</title><link>https://socake.github.io/posts/k8s-%E6%88%90%E6%9C%AC%E4%BC%98%E5%8C%96%E5%AE%9E%E6%88%98/</link><pubDate>Mon, 18 Aug 2025 13:07:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/k8s-%E6%88%90%E6%9C%AC%E4%BC%98%E5%8C%96%E5%AE%9E%E6%88%98/</guid><description>真实的降本案例：从发现成本异常到分析根因，通过 Karpenter 节点弹性伸缩、资源请求规格治理、大机型收敛等手段，系统性降低 AWS EC2 成本。包含具体配置和执行思路。</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/k8s-%E6%88%90%E6%9C%AC%E4%BC%98%E5%8C%96%E5%AE%9E%E6%88%98/featured.jpg"/></item><item><title>云原生转型实践：从传统运维到 K8s 
的迁移经验</title><link>https://socake.github.io/posts/%E4%BA%91%E5%8E%9F%E7%94%9F%E8%BD%AC%E5%9E%8B%E7%BB%8F%E9%AA%8C/</link><pubDate>Thu, 14 Aug 2025 12:56:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/%E4%BA%91%E5%8E%9F%E7%94%9F%E8%BD%AC%E5%9E%8B%E7%BB%8F%E9%AA%8C/</guid><description>这是一篇个人经验向的文章，记录了从传统虚拟机运维转向 Kubernetes 的全过程：为什么要迁移、迁移中踩了哪些坑、团队如何度过学习曲线，以及回头看哪些事情当时做对了。</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/%E4%BA%91%E5%8E%9F%E7%94%9F%E8%BD%AC%E5%9E%8B%E7%BB%8F%E9%AA%8C/featured.jpg"/></item><item><title>Kiali 服务网格可观测性实战：从拓扑图到告警联动</title><link>https://socake.github.io/posts/kiali-service-mesh-observability/</link><pubDate>Tue, 12 Aug 2025 10:00:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/kiali-service-mesh-observability/</guid><description>Kiali 不只是画拓扑图的工具，它是服务网格的诊断中心。本文把 Kiali 2.x 在生产中的配置、用法、踩坑都写清楚。</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/kiali-service-mesh-observability/featured.jpg"/></item><item><title>平台工程实践：构建 Internal Developer Platform</title><link>https://socake.github.io/posts/platform-engineering-practice/</link><pubDate>Sun, 10 Aug 2025 09:44:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/platform-engineering-practice/</guid><description>平台工程不是给 DevOps 换个名字，而是把基础设施能力产品化——让开发者像用 SaaS 一样消费平台能力。这篇文章记录我们团队从 0 到 MVP 的六个月实践，包括 Backstage 落地、黄金路径设计、以及用 DORA 指标验证平台价值。</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/platform-engineering-practice/featured.jpg"/></item><item><title>SLO/SLI/Error Budget 从理论到落地：SRE 可靠性工程实战</title><link>https://socake.github.io/posts/slo-sli-error-budget-practice/</link><pubDate>Fri, 01 Aug 2025 13:37:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo 
Huang)</author><guid>https://socake.github.io/posts/slo-sli-error-budget-practice/</guid><description>从 SLI 指标选取到 Error Budget 消耗速率告警，系统讲解 SRE 可靠性工程体系的落地实践，包括 Prometheus recording rules 计算 SLI、多窗口 burn rate 告警规则配置、SLO 违规复盘流程，以及与开发团队的协作策略。</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/slo-sli-error-budget-practice/featured.jpg"/></item><item><title>Cilium Hubble 实战：用 eBPF 看透 Kubernetes 网络</title><link>https://socake.github.io/posts/ebpf-network-observability-cilium-hubble/</link><pubDate>Wed, 30 Jul 2025 10:00:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/ebpf-network-observability-cilium-hubble/</guid><description>Cilium Hubble 是 Kubernetes 下最接近交换机镜像端口的东西。本文讲清楚它的架构、关键配置和生产上如何读 flow 定位网络问题。</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/ebpf-network-observability-cilium-hubble/featured.jpg"/></item><item><title>VictoriaMetrics：比 Prometheus 更省资源的监控存储方案</title><link>https://socake.github.io/posts/victoriametrics-prometheus/</link><pubDate>Mon, 28 Jul 2025 13:37:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/victoriametrics-prometheus/</guid><description>Prometheus 撑不住了？本文对比 VictoriaMetrics 与 Prometheus 的核心差异，介绍 remote_write 无缝迁移方案，以及 VM 在资源占用、压缩率、查询性能上的实际提升。</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/victoriametrics-prometheus/featured.jpg"/></item><item><title>Thanos 实战：多 K8s 集群 Prometheus 统一监控与长期存储</title><link>https://socake.github.io/posts/thanos-multi-cluster/</link><pubDate>Sat, 26 Jul 2025 11:37:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/thanos-multi-cluster/</guid><description>记录我们将三套 EKS 集群的独立 Prometheus 迁移到 Thanos 统一监控体系的全过程，重点覆盖选型决策、生产配置和踩坑总结。</description><media:content 
xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/thanos-multi-cluster/featured.jpg"/></item><item><title>OpenTelemetry 落地实践：统一采集 Traces、Metrics、Logs</title><link>https://socake.github.io/posts/opentelemetry-practice/</link><pubDate>Sun, 20 Jul 2025 11:41:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/opentelemetry-practice/</guid><description>从为什么选 OpenTelemetry 讲起，给出 DaemonSet + Gateway 的 Collector 部署架构、关键配置和实际踩坑记录。</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/opentelemetry-practice/featured.jpg"/></item><item><title>Grafana Tempo 大规模分布式追踪实战：从 OTel 接入到 TraceQL 调优</title><link>https://socake.github.io/posts/grafana-tempo-distributed-tracing/</link><pubDate>Wed, 16 Jul 2025 10:00:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/grafana-tempo-distributed-tracing/</guid><description>Tempo 是目前最便宜的分布式追踪后端。本文把架构、接入、TraceQL、tail sampling、成本优化、事故案例都串起来，供团队直接抄作业。</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/grafana-tempo-distributed-tracing/featured.jpg"/></item><item><title>可观测性三支柱实战：Metrics/Logs/Traces 联动</title><link>https://socake.github.io/posts/observability-three-pillars/</link><pubDate>Mon, 14 Jul 2025 09:52:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/observability-three-pillars/</guid><description>监控告诉你系统挂了，可观测性告诉你为什么挂。本文从三支柱的核心差异出发，讲透 Prometheus+Loki+Tempo 的联动排障流程，覆盖 OpenTelemetry 采集标准、Exemplar 原理与配置，以及可观测性建设的优先级策略。</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/observability-three-pillars/featured.jpg"/></item><item><title>DORA 指标与平台工程效能度量：用数据驱动 DevOps 改进</title><link>https://socake.github.io/posts/dora-metrics-platform-engineering/</link><pubDate>Sat, 12 Jul 2025 
12:27:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/dora-metrics-platform-engineering/</guid><description>DORA 四个指标不是考核工具，是诊断工具。从 CI/CD 流水线和 Incident 系统采集数据，找到部署频率低、前置时间长的真实原因，然后用平台工程手段系统性改进。本文给出采集方案、Grafana 看板设计和常见误用陷阱。</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/dora-metrics-platform-engineering/featured.jpg"/></item><item><title>分布式链路追踪实战：Jaeger 与 Tempo 选型对比</title><link>https://socake.github.io/posts/distributed-tracing-jaeger-tempo/</link><pubDate>Thu, 10 Jul 2025 10:00:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/distributed-tracing-jaeger-tempo/</guid><description>系统梳理 Jaeger 与 Tempo 的架构差异与适用场景，结合 OpenTelemetry SDK 插桩、TraceQL 查询、采样策略和 Traces/Metrics/Logs 关联，给出可落地的生产实战方案。</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/distributed-tracing-jaeger-tempo/featured.jpg"/></item><item><title>On-Call 工程实践：从告警响应到 Runbook 设计</title><link>https://socake.github.io/posts/on-call-engineering-practice/</link><pubDate>Tue, 08 Jul 2025 11:26:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/on-call-engineering-practice/</guid><description>好的 On-Call 体系不是让人 24 小时盯着屏幕，而是让每一次叫醒都有价值。从告警质量到 Runbook 设计，从轮班制度到数据驱动改进，这篇文章是我们团队在生产环境打磨 3 年的实践总结。</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/on-call-engineering-practice/featured.jpg"/></item><item><title>SRE 故障管理全生命周期：从响应到复盘</title><link>https://socake.github.io/posts/sre-incident-management/</link><pubDate>Sat, 05 Jul 2025 09:30:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/sre-incident-management/</guid><description>故障处理不只是技术问题，更是协作和信息流问题。这篇文章完整梳理了从故障触发到 Post-Mortem 归档的每个环节，包括 IC 角色的意义、15 分钟定界框架，以及如何让 Post-Mortem 
真正推动改进而不是走过场。</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/sre-incident-management/featured.jpg"/></item><item><title>Pyroscope 持续性能剖析生产实战：给每一行代码一个性能画像</title><link>https://socake.github.io/posts/pyroscope-continuous-profiling/</link><pubDate>Wed, 02 Jul 2025 10:00:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/pyroscope-continuous-profiling/</guid><description>为什么 metrics/logs/traces 之外还需要 profiling，它解决的是什么问题，Pyroscope 的架构是什么，怎样以 2%~5% overhead 把它铺到整个 K8s 集群。</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/pyroscope-continuous-profiling/featured.jpg"/></item><item><title>Crossplane：用 GitOps 方式管理云资源（AWS/阿里云）</title><link>https://socake.github.io/posts/crossplane-gitops-cloud/</link><pubDate>Thu, 26 Jun 2025 12:44:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/crossplane-gitops-cloud/</guid><description>Crossplane 把 AWS RDS、S3、EKS 变成 K8s CRD，用 GitOps 方式持续协调云资源状态。记录从概念到落地的实践过程和踩坑经验。</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/crossplane-gitops-cloud/featured.jpg"/></item><item><title>SRE 核心理念：从运维思维到可靠性工程</title><link>https://socake.github.io/posts/sre-concepts-and-principles/</link><pubDate>Thu, 26 Jun 2025 11:44:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/sre-concepts-and-principles/</guid><description>SRE 不是给运维换了个更好听的名字。它是一套用软件工程思维解决可靠性问题的方法论。本文从 Error Budget 切入，覆盖 SLI/SLO 制定、Toil 识别、On-call 设计、故障复盘文化，以及从传统运维转型 SRE 的实际路径。</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/sre-concepts-and-principles/featured.jpg"/></item><item><title>OpenTofu 实战：开源 Terraform 管理 AWS 
和阿里云基础设施</title><link>https://socake.github.io/posts/opentofu-terraform-practice/</link><pubDate>Wed, 18 Jun 2025 11:01:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/opentofu-terraform-practice/</guid><description>Terraform 改协议了，OpenTofu 是开源的替代。本文介绍 OpenTofu 核心概念，并给出创建 AWS EKS 和阿里云 ACK 的完整配置示例，以及 State 管理、Module 复用和 Atlantis GitOps 集成方案。</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/opentofu-terraform-practice/featured.jpg"/></item><item><title>Grafana Mimir 长期指标存储实战：从单集群 Prometheus 到 10 亿级 series</title><link>https://socake.github.io/posts/grafana-mimir-long-term-metrics/</link><pubDate>Wed, 18 Jun 2025 10:00:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/grafana-mimir-long-term-metrics/</guid><description>从一套 Prometheus HA pair 起步，一路扩到跨三地多活 Mimir，把 series 数从千万推到十亿级。本文把架构、配置、监控、事故按顺序讲清楚。</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/grafana-mimir-long-term-metrics/featured.jpg"/></item><item><title>Kubernetes NetworkPolicy 网络隔离实战</title><link>https://socake.github.io/posts/kubernetes-network-policy/</link><pubDate>Sun, 15 Jun 2025 09:00:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/kubernetes-network-policy/</guid><description>系统讲解 Kubernetes NetworkPolicy 的工作机制与生产实战配置，覆盖 deny-all 基础模板、常见隔离场景、Cilium 扩展、多租户设计、测试验证方法及常见陷阱。</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/kubernetes-network-policy/featured.jpg"/></item><item><title>Helm 工程化实践：从 Chart 设计到多环境管理</title><link>https://socake.github.io/posts/helm-engineering-practice/</link><pubDate>Sat, 14 Jun 2025 10:20:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo 
Huang)</author><guid>https://socake.github.io/posts/helm-engineering-practice/</guid><description>基于生产踩坑经验，系统梳理 Helm Chart 结构设计、_helpers.tpl 复用技巧、多环境 values 管理策略、私有 Harbor 仓库推送流程，以及 --atomic 升级与回滚的正确姿势。</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/helm-engineering-practice/featured.jpg"/></item><item><title>Karpenter 深度解析：下一代 K8s 节点自动扩缩</title><link>https://socake.github.io/posts/karpenter-deep-dive/</link><pubDate>Wed, 11 Jun 2025 11:33:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/karpenter-deep-dive/</guid><description>从 Cluster Autoscaler 迁移到 Karpenter 之后，集群扩容速度和节点利用率都有明显提升。本文详细拆解 Karpenter 的核心机制、关键配置项，以及在多套生产集群运行中踩过的坑。</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/karpenter-deep-dive/featured.jpg"/></item><item><title>Istio Service Mesh 落地实战：从 Sidecar 注入到灰度发布</title><link>https://socake.github.io/posts/istio-service-mesh-practice/</link><pubDate>Fri, 06 Jun 2025 12:06:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/istio-service-mesh-practice/</guid><description>记录 Istio Service Mesh 从零落地的完整过程，包括 sidecar 注入原理、VirtualService 灰度发布流量切分、DestinationRule 熔断与负载均衡配置、PeerAuthentication mTLS 加固，以及用 istioctl analyze 排查常见问题。</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/istio-service-mesh-practice/featured.jpg"/></item><item><title>Loki 架构深度解析：从写入路径到 PB 级日志查询优化</title><link>https://socake.github.io/posts/loki-architecture-deep-dive/</link><pubDate>Thu, 05 Jun 2025 10:00:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/loki-architecture-deep-dive/</guid><description>围绕 Loki 3.x 架构拆解写入、索引、查询三条链路，给出 schema_config、compactor、bloom、TSDB 的可直接复用配置，并复盘两次线上事故带来的调参经验。</description><media:content 
xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/loki-architecture-deep-dive/featured.jpg"/></item><item><title>GitOps 落地实战：ArgoCD + Kustomize 多环境管理</title><link>https://socake.github.io/posts/gitops-argocd/</link><pubDate>Tue, 03 Jun 2025 09:17:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/gitops-argocd/</guid><description>GitOps 不只是「把配置放 Git 里」，真正落地需要解决 overlay 结构设计、ApplicationSet 管理多集群、image updater 自动化，以及 sync wave、resource hook 这些细节。这篇文章记录我们团队从传统 CI/CD 迁移到 GitOps 的实际过程。</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/gitops-argocd/featured.jpg"/></item><item><title>ArgoCD 高级模式：ApplicationSet、Sync Waves 与 GitOps 企业级实践</title><link>https://socake.github.io/posts/argocd-advanced-patterns/</link><pubDate>Tue, 27 May 2025 11:01:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/argocd-advanced-patterns/</guid><description>从 ApplicationSet 的四种 Generator 到 Sync Waves 控制数据库迁移顺序，再到 Image Updater 打通 ECR 自动触发 GitOps 流程，这篇文章覆盖 ArgoCD 在企业级多集群环境下的高级用法和常见陷阱。</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/argocd-advanced-patterns/featured.jpg"/></item><item><title>多集群 Kubernetes 运维：跨集群管理与统一可观测</title><link>https://socake.github.io/posts/multi-cluster-k8s-management/</link><pubDate>Wed, 21 May 2025 13:03:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/multi-cluster-k8s-management/</guid><description>从单集群到多集群，运维复杂度不是线性增加，而是指数级。这篇文章总结了我们管理跨地域、跨环境多套 K8s 集群的实际经验：如何用 ArgoCD ApplicationSet 统一部署、如何用 Thanos 聚合多集群指标、以及一次真实的跨集群迁移过程。</description><media:content xmlns:media="http://search.yahoo.com/mrss/" 
url="https://socake.github.io/posts/multi-cluster-k8s-management/featured.jpg"/></item><item><title>业务上云实战：传统应用容器化迁移的踩坑与经验</title><link>https://socake.github.io/posts/kubernetes-migration-practice/</link><pubDate>Mon, 19 May 2025 12:36:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/kubernetes-migration-practice/</guid><description>把一批跑在虚拟机上的 Java 应用迁移到 Kubernetes，踩过的坑比想象中多。本文记录整个迁移过程的关键决策和教训。</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/kubernetes-migration-practice/featured.jpg"/></item><item><title>Kubernetes 集群升级策略：零停机升级的完整实践指南</title><link>https://socake.github.io/posts/kubernetes-upgrade-strategy/</link><pubDate>Wed, 14 May 2025 09:56:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/kubernetes-upgrade-strategy/</guid><description>K8s 集群升级听起来简单，实际操作中坑很多：API 弃用导致的 Helm 失败、Admission Webhook 拦截升级流量、PDB 配置不当导致服务中断。这篇文章从真实的升级经验出发，给出一套可复用的零停机升级方案。</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/kubernetes-upgrade-strategy/featured.jpg"/></item><item><title>K8s Gateway API：告别 Ingress，拥抱下一代流量路由</title><link>https://socake.github.io/posts/kubernetes-gateway-api/</link><pubDate>Mon, 12 May 2025 13:36:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/kubernetes-gateway-api/</guid><description>Gateway API 已经 GA，是时候认真考虑从 Ingress 迁移了。本文梳理 Gateway API 的设计理念、实际配置示例和迁移注意事项。</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/kubernetes-gateway-api/featured.jpg"/></item><item><title>Kubernetes 存储体系生产实践：PV/PVC/StorageClass 全解</title><link>https://socake.github.io/posts/kubernetes-storage-practice/</link><pubDate>Tue, 06 May 2025 13:50:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo 
Huang)</author><guid>https://socake.github.io/posts/kubernetes-storage-practice/</guid><description>从存储基础概念到生产实战，覆盖 StorageClass 动态供给配置、AWS EBS 和 EFS CSI 驱动安装、StatefulSet 存储管理、PVC 在线扩容操作、跨 AZ 挂载失败排查，以及有状态服务数据迁移方案。</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/kubernetes-storage-practice/featured.jpg"/></item><item><title>从 Nginx Ingress 迁移到 Traefik：为什么换，怎么换</title><link>https://socake.github.io/posts/traefik-vs-nginx-ingress/</link><pubDate>Sun, 27 Apr 2025 12:56:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/traefik-vs-nginx-ingress/</guid><description>从实际痛点出发，讲清楚 Traefik 和 Nginx Ingress 的本质区别，给出可直接参考的迁移路径和配置示例。</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/traefik-vs-nginx-ingress/featured.jpg"/></item><item><title>RabbitMQ 运维实战：集群部署、消费者可靠性与监控体系</title><link>https://socake.github.io/posts/rabbitmq-ops-practice/</link><pubDate>Tue, 22 Apr 2025 14:30:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/rabbitmq-ops-practice/</guid><description>系统梳理 RabbitMQ 运维核心技能：Quorum Queue 集群部署与镜像队列对比、生产配置调优、消费者 prefetch 与死信队列配置、基于 Management API 和 rabbitmq_exporter 的监控体系，以及消息堆积、脑裂等常见故障的处理方案。</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/rabbitmq-ops-practice/featured.jpg"/></item><item><title>Celery 异步任务详解：任务队列、重试策略与分布式部署</title><link>https://socake.github.io/posts/celery-async-tasks/</link><pubDate>Tue, 22 Apr 2025 09:44:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/celery-async-tasks/</guid><description>从 Celery 架构到 K8s 部署，覆盖任务定义、重试策略、队列路由、Beat 定时任务和 Flower 监控，附完整的生产部署配置。</description><media:content xmlns:media="http://search.yahoo.com/mrss/" 
url="https://socake.github.io/posts/celery-async-tasks/featured.jpg"/></item><item><title>ETCD 运维实战：部署、备份恢复与 K8s 集群数据管理</title><link>https://socake.github.io/posts/etcd-ops-practice/</link><pubDate>Sun, 13 Apr 2025 13:37:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/etcd-ops-practice/</guid><description>ETCD 是 Kubernetes 的命脉，所有集群状态都存储在这里。本文从实际运维角度梳理部署、备份、恢复和配置动态更新的完整操作链路，包含多个踩坑经验。</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/etcd-ops-practice/featured.jpg"/></item><item><title>自研 Kubernetes Admission Webhook 开发实战：从零到生产</title><link>https://socake.github.io/posts/kubernetes-admission-webhook-dev/</link><pubDate>Sat, 12 Apr 2025 11:00:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/kubernetes-admission-webhook-dev/</guid><description>Kubernetes 的 admission 体系是一个强大但脆弱的扩展点。webhook 挂了能让集群所有 Pod 创建卡死。写一个能上生产的 webhook 不难，但要让它在面对各种怪异请求、证书轮换、集群升级、大流量突发时都不挂，就是另一回事了。这是一份从零到生产的工程笔记。</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/kubernetes-admission-webhook-dev/featured.jpg"/></item><item><title>数据库运维实践：MySQL 高可用与 PostgreSQL 调优经验</title><link>https://socake.github.io/posts/database-ops-practice/</link><pubDate>Tue, 08 Apr 2025 13:37:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/database-ops-practice/</guid><description>数据库运维不复杂，但细节多、出问题代价大。本文整理了 MySQL 主从复制、慢查询分析、PostgreSQL 连接池这几个高频话题的实战经验，以及一些日常运维 SQL 备忘。</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/database-ops-practice/featured.jpg"/></item><item><title>Kafka 运维实战：消息堆积排查、分区再平衡与监控体系</title><link>https://socake.github.io/posts/kafka-ops-practice/</link><pubDate>Mon, 07 Apr 2025 11:37:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo 
Huang)</author><guid>https://socake.github.io/posts/kafka-ops-practice/</guid><description>系统梳理 Kafka 运维核心技能：消费者延迟监控告警、消息堆积根因分析、分区扩容规划、Rebalance 风暴处理，以及 KEDA 基于 lag 自动扩缩的配置实践。</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/kafka-ops-practice/featured.jpg"/></item><item><title>Cluster API 实战：用声明式的方式管理 Kubernetes 集群的生命周期</title><link>https://socake.github.io/posts/cluster-api-infrastructure/</link><pubDate>Sat, 05 Apr 2025 14:15:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/cluster-api-infrastructure/</guid><description>用 Terraform 建集群是起手式，但集群一旦多起来 Terraform 的代码量和状态管理开始爆炸。Cluster API 把「集群」本身做成了 Kubernetes CRD——你在 Management Cluster 里 kubectl apply 一个 Cluster 对象，就能得到一个新集群。这是 Kubernetes 治理 Kubernetes 的一种优雅解法。</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/cluster-api-infrastructure/featured.jpg"/></item><item><title>MongoDB 运维入门：部署、备份与生产性能调优</title><link>https://socake.github.io/posts/mongodb-ops-practice/</link><pubDate>Mon, 31 Mar 2025 11:41:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/mongodb-ops-practice/</guid><description>MongoDB 运维从选型到调优：何时选 MongoDB、Replica Set 三节点部署、索引设计、mongodump 备份，以及 wiredTiger、连接池、大文档等生产踩坑。</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/mongodb-ops-practice/featured.jpg"/></item><item><title>KubeVirt 生产实战：在 Kubernetes 上跑虚拟机的完整路线</title><link>https://socake.github.io/posts/kubevirt-vm-on-kubernetes/</link><pubDate>Sat, 29 Mar 2025 10:30:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/kubevirt-vm-on-kubernetes/</guid><description>Broadcom 吃掉 VMware 之后，VMware 替代方案成了所有基础设施团队的议题。KubeVirt 1.8 已经是个相当成熟的选择，能在 Kubernetes 里跑真正的 VM——不是轻量容器、不是 microVM，是完整的 Windows/Linux 
VM。这是一年多的实战笔记。</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/kubevirt-vm-on-kubernetes/featured.jpg"/></item><item><title>Alertmanager Webhook 开发：自定义告警处理与 API 集成</title><link>https://socake.github.io/posts/alertmanager-webhook-api/</link><pubDate>Tue, 25 Mar 2025 09:52:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/alertmanager-webhook-api/</guid><description>Alertmanager 内置的通知渠道不支持钉钉、飞书等国内工具，Webhook 是扩展告警通知的标准方式。本文用 Python Flask 实现完整的 Webhook 接收器，涵盖消息格式化、降噪去重、Alertmanager API 集成和 K8s 部署。</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/alertmanager-webhook-api/featured.jpg"/></item><item><title>Descheduler 深度实战：Kubernetes 自动再平衡的正确打开方式</title><link>https://socake.github.io/posts/descheduler-workload-rebalance/</link><pubDate>Sat, 22 Mar 2025 16:00:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/descheduler-workload-rebalance/</guid><description>kube-scheduler 只在 Pod 创建那一刻做决策，之后集群状态变了它就不管了。几个月下来，你的集群会变成 hot node + cold node 混杂、同一个 Deployment 的 Pod 全挤在一个 node、failure-domain 完全失衡。Descheduler 就是把调度决策后置、周期性重新评估的那只手。</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/descheduler-workload-rebalance/featured.jpg"/></item><item><title>Alertmanager 完全指南：路由、抑制、静默与多渠道通知</title><link>https://socake.github.io/posts/alertmanager-routing-config/</link><pubDate>Sat, 22 Mar 2025 12:27:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/alertmanager-routing-config/</guid><description>告警太多和告警太少一样有害。Alertmanager 的路由、抑制、分组机制是控制告警噪声的核心手段，本文从一个真实的多环境告警体系出发，讲清楚每个配置的意图和陷阱。</description><media:content xmlns:media="http://search.yahoo.com/mrss/" 
url="https://socake.github.io/posts/alertmanager-routing-config/featured.jpg"/></item><item><title>Grafana API Automation: Managing Dashboards, Data Sources, and Alerts as Code</title><link>https://socake.github.io/posts/grafana-api-automation/</link><pubDate>Tue, 18 Mar 2025 11:26:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/grafana-api-automation/</guid><description>Managing Grafana Dashboards by clicking through the UI is a nightmare in multi-environment setups. The right approach is to turn Dashboards into code through the API, gaining version control and environment sync. This post provides complete Python tool scripts and the pitfalls hit in practice.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/grafana-api-automation/featured.jpg"/></item><item><title>PostgreSQL Operations in Practice: Configuration Tuning, Connection Pooling, Slow Queries, and High Availability</title><link>https://socake.github.io/posts/postgresql-ops-practice/</link><pubDate>Tue, 18 Mar 2025 10:15:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/postgresql-ops-practice/</guid><description>A systematic walkthrough of core PostgreSQL operations skills: from tuning shared_buffers and WAL parameters to configuring PgBouncer transaction mode; from slow-query analysis with pg_stat_statements to PITR point-in-time recovery; plus complete practices for primary-standby streaming replication, cleaning up bloated tables, and Prometheus monitoring metrics.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/postgresql-ops-practice/featured.jpg"/></item><item><title>Kueue Batch Scheduling in Practice: Making Kubernetes Truly Carry AI/HPC Workloads</title><link>https://socake.github.io/posts/kueue-batch-workload/</link><pubDate>Sat, 15 Mar 2025 09:40:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/kueue-batch-workload/</guid><description>Stuff AI training jobs into Kubernetes and on day one you will find the native scheduler falls far short: no queues, no quota, no gang scheduling, no fair sharing, and messy preemption semantics. Kueue is the official answer from sig-scheduling; it is closer to native Kubernetes than Volcano and more mature than a home-grown controller. These are real production notes.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/kueue-batch-workload/featured.jpg"/></item><item><title>Prometheus Service Discovery Deep Dive: kubernetes_sd_configs 
in Practice</title><link>https://socake.github.io/posts/prometheus-service-discovery/</link><pubDate>Sat, 15 Mar 2025 09:30:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/prometheus-service-discovery/</guid><description>Maintaining Prometheus scrape targets by hand in a K8s environment is unrealistic; kubernetes_sd_configs combined with relabel_configs is the core mechanism for solving this. This post works through the whole system, from principles to practice.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/prometheus-service-discovery/featured.jpg"/></item><item><title>vcluster Virtual Clusters in Practice: Multi-Tenancy a Hundred Times Stronger than Namespaces</title><link>https://socake.github.io/posts/vcluster-virtual-cluster/</link><pubDate>Sat, 08 Mar 2025 15:10:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/vcluster-virtual-cluster/</guid><description>A namespace is not an isolation boundary; it is just a naming convention. ClusterRole, CRD, webhook, and LimitRange all cut across namespaces. Real multi-tenancy requires each tenant to have its own kube-apiserver. vcluster makes that so cheap it is almost free: a complete Kubernetes control plane running inside a single namespace.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/vcluster-virtual-cluster/featured.jpg"/></item><item><title>Elastic Agent + Fleet: Next-Generation Unified Log Collection Management in Practice</title><link>https://socake.github.io/posts/elastic-agent-fleet/</link><pubDate>Thu, 06 Mar 2025 11:44:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/elastic-agent-fleet/</guid><description>Filebeat + Metricbeat + Auditbeat are three agents each minding its own patch, with scattered configuration that is hard to maintain. Elastic Agent unifies them into one All-in-One Agent and pairs with Fleet for centralized management. This post records the whole journey, from deployment to the pitfalls.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/elastic-agent-fleet/featured.jpg"/></item><item><title>EFK Logging System in Practice: A Complete Fluent Bit + Fluentd + Elasticsearch Deployment</title><link>https://socake.github.io/posts/efk-logging-practice/</link><pubDate>Wed, 05 Mar 2025 12:44:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo 
Huang)</author><guid>https://socake.github.io/posts/efk-logging-practice/</guid><description>Explains why a two-tier Fluent Bit + Fluentd architecture is needed, and provides a complete, ready-to-reference ConfigMap configuration and ES index template design.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/efk-logging-practice/featured.jpg"/></item><item><title>Zookeeper Operations in Practice: Cluster Deployment, Tuning, and Troubleshooting</title><link>https://socake.github.io/posts/zookeeper-ops-practice/</link><pubDate>Wed, 05 Mar 2025 11:00:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/zookeeper-ops-practice/</guid><description>A systematic walkthrough of core Zookeeper production operations skills: ZNode types and the Watcher mechanism, the ZAB election algorithm, 3/5-node cluster deployment configuration, JVM and zoo.cfg tuning, hands-on diagnosis with four-letter commands, handling common failures, and its relationship to Kafka KRaft mode and its place in cloud-native scenarios.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/zookeeper-ops-practice/featured.jpg"/></item><item><title>Karmada Multi-Cluster Federation in Practice: Real-World Use of PropagationPolicy, OverridePolicy, and FailOver</title><link>https://socake.github.io/posts/karmada-multi-cluster/</link><pubDate>Sun, 02 Mar 2025 11:20:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/karmada-multi-cluster/</guid><description>If you run two or more Kubernetes clusters, shipping the same application across clusters will sooner or later become routine. Karmada is the most complete multi-cluster federation project among CNCF incubating projects, but its CRD design is fairly restrained; to use it well in production you need to sort out four layers of semantics: resource propagation, per-cluster overrides, scheduling, and failover.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/karmada-multi-cluster/featured.jpg"/></item><item><title>Choosing a Kubernetes Log Collection Solution: From Technical Comparison to Production Rollout</title><link>https://socake.github.io/posts/k8s-logging-solution/</link><pubDate>Tue, 25 Feb 2025 11:01:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/k8s-logging-solution/</guid><description>A record of how our team built a Kubernetes log collection system from scratch, the technical reasons we ultimately chose Fluent Bit + Fluentd + Elasticsearch, and the pitfalls we hit in production.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" 
url="https://socake.github.io/posts/k8s-logging-solution/featured.jpg"/></item><item><title>ExternalDNS Multi-Cloud DNS Sync in Practice: From Route53 to Cloudflare to Alibaba Cloud DNS</title><link>https://socake.github.io/posts/external-dns-multi-provider/</link><pubDate>Sat, 22 Feb 2025 09:45:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/external-dns-multi-provider/</guid><description>Clicking DNS records by hand in the Cloudflare console inevitably breaks down as clusters and services grow. ExternalDNS is a controller that treats Kubernetes resources as the source of truth and the DNS provider as the executor. To use it well, though, you need to understand the respective limits of txtOwnerId, policy, and provider, plus several pitfalls of sharing a zone across clusters.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/external-dns-multi-provider/featured.jpg"/></item><item><title>Secret Management in Practice: HashiCorp Vault + External Secrets Operator</title><link>https://socake.github.io/posts/vault-external-secrets/</link><pubDate>Thu, 20 Feb 2025 10:20:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/vault-external-secrets/</guid><description>base64 is not encryption. Starting from the risk of Secret leaks, this post gives a complete introduction to core Vault concepts, K8s deployment options, ESO integration configuration, and automatic rotation of dynamic database credentials.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/vault-external-secrets/featured.jpg"/></item><item><title>Consul Service Registration and Discovery: From Basics to Production-Grade Health Checks</title><link>https://socake.github.io/posts/consul-service-discovery/</link><pubDate>Tue, 18 Feb 2025 11:33:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/consul-service-discovery/</guid><description>In the microservices era, managing dynamic IPs and service health is unavoidable. Consul offers a complete service discovery solution; this post walks through its core usage and production pitfalls from a hands-on perspective.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/consul-service-discovery/featured.jpg"/></item><item><title>Harbor Registry Production Operations: High Availability, Security Scanning, and CI/CD Integration</title><link>https://socake.github.io/posts/harbor-registry-ops/</link><pubDate>Tue, 18 Feb 2025 
09:30:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/harbor-registry-ops/</guid><description>Starting from Harbor's architecture, a systematic walkthrough of high-availability deployment options, image security scanning policies, cross-region replication configuration, and permission model design in production, along with integration with Jenkins/GitLab CI, plus a troubleshooting handbook and Prometheus monitoring configuration.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/harbor-registry-ops/featured.jpg"/></item><item><title>cert-manager in Production: The Complete Path from Let's Encrypt to Enterprise Internal PKI</title><link>https://socake.github.io/posts/cert-manager-production/</link><pubDate>Sat, 15 Feb 2025 14:30:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/cert-manager-production/</guid><description>cert-manager is standard equipment on almost every Kubernetes cluster, but teams that actually run it in production all hit the same problems: blowing through Let&amp;rsquo;s Encrypt rate limits, wildcard certificate renewals failing, internal services wanting a private CA, and how to issue certificates for Istio / Gateway API. This post turns the pitfalls from a year of running cert-manager on 5 clusters into a hands-on manual.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/cert-manager-production/featured.jpg"/></item><item><title>Ansible Bulk Operations Automation: From Ad-Hoc Commands to Engineered Roles</title><link>https://socake.github.io/posts/ansible-ops-automation/</link><pubDate>Wed, 12 Feb 2025 12:06:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/ansible-ops-automation/</guid><description>Three traits, agentless operation, SSH push, and idempotency, make Ansible a powerful tool for bulk Linux operations. From basic usage to engineering practice with Roles, this post lays out complete approaches and lessons learned for the high-frequency scenarios of day-to-day operations.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/ansible-ops-automation/featured.jpg"/></item><item><title>CI/CD Pipeline Design: Engineering Practice from Code Commit to Automated Deployment</title><link>https://socake.github.io/posts/cicd-pipeline-design/</link><pubDate>Sun, 09 Feb 2025 09:17:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/cicd-pipeline-design/</guid><description>A good CI/CD pipeline does not merely &amp;quot;run&amp;quot;; it is fast, reliable, and has clear boundaries. This post runs from build caching to the GitOps 
division of responsibilities, and from multi-branch strategy to troubleshooting, collecting the engineering practices that come up again and again in real projects.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/cicd-pipeline-design/featured.jpg"/></item><item><title>KEDA Event-Driven Autoscaling in Practice: From the Limits of HPA to Scaling on Real Business Signals</title><link>https://socake.github.io/posts/keda-event-driven-autoscaling/</link><pubDate>Sat, 08 Feb 2025 10:12:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/keda-event-driven-autoscaling/</guid><description>Out of the box, HPA only looks at CPU/memory, but the real scaling signals in production are often Kafka lag, RabbitMQ queue depth, custom Prometheus metrics, or even cron. This post writes up KEDA's architecture, core CRDs, common scaler pitfalls, and operational procedures as a senior engineer's memo: no theory, only the kind of configuration that can rescue you from alerts at 3 a.m.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/keda-event-driven-autoscaling/featured.jpg"/></item><item><title>GitLab CI/CD + Kubernetes: The Full Flow from Code Commit to Production Deployment</title><link>https://socake.github.io/posts/gitlab-ci-kubernetes/</link><pubDate>Sat, 01 Feb 2025 11:01:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/gitlab-ci-kubernetes/</guid><description>From configuring GitLab Runner's Kubernetes executor, to building images with kaniko instead of DinD, to deploying to production by updating the GitOps repository: a record of a complete CI/CD flow proven in a real AWS EKS environment.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/gitlab-ci-kubernetes/featured.jpg"/></item><item><title>Jenkins + Kubernetes: Dynamic Agent Builds and Pipeline Best Practices</title><link>https://socake.github.io/posts/jenkins-kubernetes-cicd/</link><pubDate>Sun, 26 Jan 2025 13:03:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/jenkins-kubernetes-cicd/</guid><description>The resource waste and configuration sprawl of static Jenkins Slaves are fundamentally solved by Kubernetes dynamic Pod Agents. This post records the complete process of migrating Jenkins to K8s in a real production environment.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" 
url="https://socake.github.io/posts/jenkins-kubernetes-cicd/featured.jpg"/></item><item><title>Kubernetes RBAC Security Hardening in Practice: From Least Privilege to NetworkPolicy</title><link>https://socake.github.io/posts/kubernetes-rbac-security/</link><pubDate>Fri, 24 Jan 2025 12:36:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/kubernetes-rbac-security/</guid><description>Starting from real security incidents, a systematic explanation of least-privilege RBAC design in Kubernetes, when to use ClusterRole versus Role, how to analyze RBAC problems with audit logs, and how NetworkPolicy implements network isolation at the namespace and Pod level.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/kubernetes-rbac-security/featured.jpg"/></item><item><title>Doris vs. StarRocks: Notes from a Serious Production Selection</title><link>https://socake.github.io/posts/columnar-warehouse-doris-starrocks/</link><pubDate>Wed, 22 Jan 2025 15:30:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/columnar-warehouse-doris-starrocks/</guid><description>Doris and StarRocks share an origin, look alike, and each has its own leanings. Choosing between them is not a question of &amp;quot;which is better&amp;quot; but of &amp;quot;which fits our scenario better&amp;quot;. This post is an in-depth comparison written after more than a year of operating two OLAP clusters (one Doris, one StarRocks), and I hope it helps you skip months of research and pitfalls.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/columnar-warehouse-doris-starrocks/featured.jpg"/></item><item><title>Engineering Kubernetes YAML: Common Resource Templates and Production Best Practices</title><link>https://socake.github.io/posts/kubernetes-yaml-patterns/</link><pubDate>Sun, 19 Jan 2025 09:56:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/kubernetes-yaml-patterns/</guid><description>Writing good Kubernetes YAML is not just a matter of syntax; it is mostly accumulated engineering experience. This post catalogs common YAML anti-patterns in production and provides complete, usable templates for each type of resource.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/kubernetes-yaml-patterns/featured.jpg"/></item><item><title>Kubernetes Resource Management in Action: QoS, ResourceQuota, VPA 
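as a Systematic Practice</title><link>https://socake.github.io/posts/kubernetes-resource-management/</link><pubDate>Thu, 16 Jan 2025 13:36:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/kubernetes-resource-management/</guid><description>I have seen too many production incidents caused by misconfigured resources: services without limits eating all node memory and triggering OOM evictions, requests set so high that Pods cannot be scheduled, and misconfigured HPAs whose scaling fails. This post goes through the K8s resource management system from start to finish so you can build a complete approach to resource governance.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/kubernetes-resource-management/featured.jpg"/></item><item><title>Kubernetes Networking Deep Dive: The Complete Guide to CNI, kube-proxy, and NetworkPolicy</title><link>https://socake.github.io/posts/kubernetes-networking-deep-dive/</link><pubDate>Fri, 10 Jan 2025 13:50:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/kubernetes-networking-deep-dive/</guid><description>K8s networking is a blind spot for many engineers: ignored while nothing is wrong, and a complete mystery about where to start once something breaks. Through troubleshooting many production network failures, I came to understand every layer of K8s networking in depth. This post goes from the Pod network model to NetworkPolicy in practice, helping you build a complete body of K8s networking knowledge.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/kubernetes-networking-deep-dive/featured.jpg"/></item><item><title>Database Change Management: The Complete Engineering Path from gh-ost to Flyway</title><link>https://socake.github.io/posts/database-change-management/</link><pubDate>Wed, 08 Jan 2025 10:00:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/database-change-management/</guid><description>Many teams treat &amp;quot;database change management&amp;quot; as a few SQL statements plus a ticket, when in fact it is the least engineered area of all. On one side, developers casually write an ALTER that locks up production; on the other, DBAs watch the progress bar by hand and pray nothing goes wrong. This post splits the DB change management best practices I have distilled into three layers, tooling, process, and organization, each with solutions you can put into practice directly.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/database-change-management/featured.jpg"/></item><item><title>Vitess in Practice: The Road to Scaling MySQL Horizontally to Petabytes</title><link>https://socake.github.io/posts/vitess-mysql-sharding/</link><pubDate>Tue, 24 Dec 2024 14:00:00 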
+0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/vitess-mysql-sharding/</guid><description>When a single MySQL database can no longer cope and you do not want to switch to TiDB or PG, Vitess becomes the last option. It keeps MySQL compatibility, uses vtgate as the sharding proxy, and uses VReplication for online resharding. It sounds great, but the Vitess learning curve is startlingly steep. This post is my comprehensive notes after researching Vitess for several months and getting a 4-shard cluster running in staging.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/vitess-mysql-sharding/featured.jpg"/></item><item><title>Technical Growth for Operations Engineers: Planning the Path from Executor to Architect</title><link>https://socake.github.io/posts/devops-career-growth/</link><pubDate>Sun, 22 Dec 2024 09:52:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/devops-career-growth/</guid><description>An operations engineer's growth is not a pile of tools but a leap in levels of thinking. This post records my observations and reflections on that path: which moments truly move people up a level, and which habits of thought keep them running in place.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/devops-career-growth/featured.jpg"/></item><item><title>Troubleshooting Methodology: From Symptom to Root Cause</title><link>https://socake.github.io/posts/%E6%95%85%E9%9A%9C%E6%8E%92%E6%9F%A5%E6%96%B9%E6%B3%95%E8%AE%BA/</link><pubDate>Tue, 17 Dec 2024 12:27:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/%E6%95%85%E9%9A%9C%E6%8E%92%E6%9F%A5%E6%96%B9%E6%B3%95%E8%AE%BA/</guid><description>Good troubleshooting relies on method, not intuition. This post summarizes the troubleshooting framework I distilled from many production incidents: from building a timeline, to prioritizing hypotheses, to recognizing and avoiding cognitive traps.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/%E6%95%85%E9%9A%9C%E6%8E%92%E6%9F%A5%E6%96%B9%E6%B3%95%E8%AE%BA/featured.jpg"/></item><item><title>Rook-Ceph on Kubernetes Operations in Practice: From Deployment to Failure Recovery</title><link>https://socake.github.io/posts/ceph-rook-kubernetes/</link><pubDate>Fri, 13 Dec 2024 11:00:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/ceph-rook-kubernetes/</guid><description>When you need to provide block, file, and object storage on Kubernetes, Rook-Ceph 
has almost no real alternative. But its complexity is also the highest of any K8s storage solution. This post distills two years of operating a bare-metal Rook-Ceph production cluster, including postmortems from the several times I pulled the cluster back from the brink.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/ceph-rook-kubernetes/featured.jpg"/></item><item><title>SRE Lessons from Practice: The Mindset Shift from Ops to SRE</title><link>https://socake.github.io/posts/sre%E5%AE%9E%E8%B7%B5%E5%BF%83%E5%BE%97/</link><pubDate>Wed, 11 Dec 2024 11:26:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/sre%E5%AE%9E%E8%B7%B5%E5%BF%83%E5%BE%97/</guid><description>SRE is not ops with a new title; it is a methodology for solving reliability problems with a software engineering mindset. This post records the shifts that struck me most along the way.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/sre%E5%AE%9E%E8%B7%B5%E5%BF%83%E5%BE%97/featured.jpg"/></item><item><title>Building Observability: From Prometheus Collection to Grafana Alert Integration</title><link>https://socake.github.io/posts/prometheus-grafana/</link><pubDate>Fri, 06 Dec 2024 09:30:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/prometheus-grafana/</guid><description>Observability is not about installing a few monitoring tools; it is about being able to locate the root cause quickly when the system misbehaves. From collection architecture to PromQL to alert routing, this post covers problems we actually hit in production, such as cardinality explosions and alert noise.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/prometheus-grafana/featured.jpg"/></item><item><title>MinIO Distributed Object Storage in Production: From Erasure Code to Multi-Tenancy</title><link>https://socake.github.io/posts/minio-distributed-storage/</link><pubDate>Mon, 02 Dec 2024 10:00:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/minio-distributed-storage/</guid><description>Self-hosted object storage used to be a chore, until MinIO took S3 API + Erasure Code + simple deployment to the extreme. This post is my operations notes from three production MinIO clusters, covering the full chain from hardware selection to firefighting failures. It also discusses what community edition users should do after MinIO adjusted its commercialization strategy in 2024.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/minio-distributed-storage/featured.jpg"/></item><item><title>Python Meets 
Prometheus: Automating Queries for Monitoring Data and Alert Status</title><link>https://socake.github.io/posts/python-prometheus-monitoring/</link><pubDate>Mon, 25 Nov 2024 11:44:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/python-prometheus-monitoring/</guid><description>Call the Prometheus HTTP API directly from Python to implement service liveness checks and daily availability reports, then hook into DingTalk to push a cluster health summary automatically every day.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/python-prometheus-monitoring/featured.jpg"/></item><item><title>Python Async Programming in Practice: Using asyncio in AI Applications</title><link>https://socake.github.io/posts/python-async-programming/</link><pubDate>Fri, 22 Nov 2024 12:44:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/python-async-programming/</guid><description>AI applications are inherently I/O-bound: waiting for LLM responses, waiting for vector database retrieval, waiting for multiple tool calls to return. Synchronous code is a performance killer here. This post goes from event loop principles to practical AI application patterns, focusing on concurrent calls with asyncio.gather, handling SSE streaming output, and troubleshooting common pitfalls.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/python-async-programming/featured.jpg"/></item><item><title>MongoDB Sharded Clusters in Practice: The Full Chain from Shard Key Design to Chunk Balancing</title><link>https://socake.github.io/posts/mongodb-sharding-practice/</link><pubDate>Wed, 20 Nov 2024 15:00:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/mongodb-sharding-practice/</guid><description>Many teams treat MongoDB sharding as &amp;quot;set a shard key and you are done&amp;quot;, only to find six months after launch that 80% of the data sits on one shard, the balancer moves tens of GB a day yet never catches up, and some collection has jumbo chunks that cannot be split. This post pulls together my experience from several MongoDB sharded clusters, in the hope that you take fewer detours before sharding.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/mongodb-sharding-practice/featured.jpg"/></item><item><title>Python Operations Automation: Engineering Practice from Scripts to Complete Tools</title><link>https://socake.github.io/posts/python-devops-automation/</link><pubDate>Tue, 12 Nov 2024 11:01:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo 
Huang)</author><guid>https://socake.github.io/posts/python-devops-automation/</guid><description>A systematic walkthrough of engineering methods for Python operations automation: operating AWS resources with boto3, using the Kubernetes Python SDK, choosing between the Click and Typer CLI frameworks, bulk database operations scripts, DingTalk Webhook integration, and practical experience with type annotations and error handling.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/python-devops-automation/featured.jpg"/></item><item><title>Redis Cluster Scaling and Data Migration in Practice: From SETSLOT to Atomic Slot Migration</title><link>https://socake.github.io/posts/redis-cluster-migration/</link><pubDate>Fri, 08 Nov 2024 10:30:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/redis-cluster-migration/</guid><description>Many teams treat Redis Cluster as an &amp;quot;out-of-the-box&amp;quot; distributed Redis, until they need to scale or migrate data and discover that the SETSLOT protocol has a dozen or more states, client redirection during migration either fails to take effect or turns into a storm, a stuck migrate cannot be interrupted, and big keys drag the migration down outright. This post goes through all the scaling, migration, and firefighting I have done on several Clusters at the hundred-billion scale.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/redis-cluster-migration/featured.jpg"/></item><item><title>Redis Operations in Practice: Persistence Configuration, Cluster Mode, and Production Monitoring</title><link>https://socake.github.io/posts/redis-ops-practice/</link><pubDate>Wed, 06 Nov 2024 10:20:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/redis-ops-practice/</guid><description>Redis operations look simple, but you only learn how deep the water is once something goes wrong in production. This post covers the core operations topics: persistence, clustering, monitoring, and failure handling.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/redis-ops-practice/featured.jpg"/></item><item><title>MySQL Backup and Restore in Practice: A Complete Solution from mysqldump to XtraBackup</title><link>https://socake.github.io/posts/mysql-backup-restore/</link><pubDate>Fri, 01 Nov 2024 11:33:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/mysql-backup-restore/</guid><description>From mysqldump to XtraBackup, and from full backups to binlog-based point-in-time recovery, this post covers the complete body of knowledge on MySQL backup and restore, including production pitfalls and an automated verification scheme.</description><media:content 
xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/mysql-backup-restore/featured.jpg"/></item><item><title>Taming PostgreSQL Bloat: Tuning autovacuum to What You Actually Need</title><link>https://socake.github.io/posts/postgresql-vacuum-bloat-tuning/</link><pubDate>Tue, 29 Oct 2024 09:30:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/postgresql-vacuum-bloat-tuning/</guid><description>Most PostgreSQL DBAs' understanding of autovacuum stops at &amp;quot;it runs by itself&amp;quot;, but once bloat sets in they discover that the default parameters are nowhere near enough for modern hardware, dozens of autovacuum_* parameters each govern their own corner, and when something goes wrong they have no idea where to look. This post pulls together my experience taming bloat on several PG clusters, from MVCC principles to parameter tuning, and from monitoring to emergency response.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/postgresql-vacuum-bloat-tuning/featured.jpg"/></item><item><title>The Complete Nginx Operations Guide: Reverse Proxy, Load Balancing, HTTPS, and Rate Limiting</title><link>https://socake.github.io/posts/nginx-ops-complete/</link><pubDate>Thu, 24 Oct 2024 12:06:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/nginx-ops-complete/</guid><description>You know how to install Nginx, but do you really know how to use it? Starting from the configuration structure, this post fully covers reverse proxying, load balancing strategies, Let&amp;rsquo;s Encrypt certificates, rate limiting configuration, log analysis, and performance tuning, with troubleshooting for common 502/SSL failures.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/nginx-ops-complete/featured.jpg"/></item><item><title>Kubernetes from Scratch: A Beginner's Guide from an Engineer's Perspective</title><link>https://socake.github.io/posts/kubernetes-beginner-guide/</link><pubDate>Sun, 20 Oct 2024 09:17:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/kubernetes-beginner-guide/</guid><description>Docker Compose can run multiple containers, so why do we still need Kubernetes? Starting from that question, this post uses analogies to explain core concepts such as Pod/Deployment/Service/Ingress, and gives the most commonly used kubectl commands along with a complete introductory deployment example.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/kubernetes-beginner-guide/featured.jpg"/></item><item><title>MySQL Deep Tuning: From Buffer Pool 
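to Lock Waits, a Production Handbook</title><link>https://socake.github.io/posts/mysql-performance-tuning-deep-dive/</link><pubDate>Fri, 18 Oct 2024 14:30:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/mysql-performance-tuning-deep-dive/</guid><description>Have you ever had this experience: you follow online tutorials to set innodb_buffer_pool_size to 75%, turn off the query cache, enable innodb_file_per_table, and then tell yourself &amp;quot;that is MySQL tuning done&amp;quot;? Real tuning is a continuous process of observing, hypothesizing, verifying, and rolling back. This post pulls together the parameter-tuning experience I have accumulated over the past few years maintaining more than a dozen MySQL instances, and every item can be traced to specific metrics and business impact.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/mysql-performance-tuning-deep-dive/featured.jpg"/></item><item><title>Git Workflow in Practice: Branching Strategies and Team Collaboration Conventions</title><link>https://socake.github.io/posts/git-workflow-practice/</link><pubDate>Thu, 10 Oct 2024 11:01:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/git-workflow-practice/</guid><description>After five years of using Git, my biggest takeaway is that workflow problems are at heart team collaboration problems, not tool problems. This post compares three strategies, Git Flow, GitHub Flow, and Trunk-Based, and covers high-frequency topics such as branch naming, Commit Messages, the philosophy of rebase, handling large refactoring branches, and conflict resolution.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/git-workflow-practice/featured.jpg"/></item><item><title>TiDB in Production: End-to-End Experience from Placement Rules to TiKV Tuning</title><link>https://socake.github.io/posts/tidb-production-practice/</link><pubDate>Sat, 05 Oct 2024 10:00:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/tidb-production-practice/</guid><description>Getting TiDB running as a &amp;quot;distributed MySQL&amp;quot; is not hard; the truly hard part is keeping TiKV from jittering under highly concurrent writes, keeping PD scheduling from hurting the business, and keeping cross-datacenter replicas alive under RPO=0. This post lays out the pitfalls I hit, the parameters I tuned, and the SOPs I set on several TiDB clusters over the past two years. It is not a tutorial but a battle manual you can copy directly.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/tidb-production-practice/featured.jpg"/></item><item><title>Shell Scripting in Practice: Bash 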
Operations Automation from Basics to Engineering</title><link>https://socake.github.io/posts/shell-script-automation/</link><pubDate>Wed, 02 Oct 2024 13:03:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/shell-script-automation/</guid><description>Shell scripts are an SRE's number-one productivity tool. Starting from the essentials of syntax, this post covers common operations patterns such as bulk operations, log rotation, and health checks, moves on to getopts, trap signal handling, and approaches to script engineering, and closes with classic pitfalls such as quoting hell and variable scope.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/shell-script-automation/featured.jpg"/></item><item><title>Docker Compose Local Development Workflow: Best Practices for Multi-Service Environments</title><link>https://socake.github.io/posts/docker-compose-dev-workflow/</link><pubDate>Fri, 27 Sep 2024 12:36:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/docker-compose-dev-workflow/</guid><description>Use Docker Compose to build a complete local environment with a database, cache, and message queue, with healthcheck to ensure startup order, bind mounts for hot reload, and the override pattern to separate development and production configuration. This post covers all the key details and common pitfalls.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/docker-compose-dev-workflow/featured.jpg"/></item><item><title>Docker Best Practices: From Dockerfile to Production Deployment</title><link>https://socake.github.io/posts/docker-best-practices/</link><pubDate>Sat, 21 Sep 2024 09:56:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/docker-best-practices/</guid><description>Multi-stage builds, gaps in .dockerignore, running as non-root, build cache optimization, and entrypoint/cmd signal handling: problems actually hit in production, taken apart one by one with concrete Dockerfile examples.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/docker-best-practices/featured.jpg"/></item><item><title>Linux System Administration Essentials: System-Level Knowledge Every DevOps Engineer Needs</title><link>https://socake.github.io/posts/linux-system-admin-devops/</link><pubDate>Mon, 16 Sep 2024 13:36:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/linux-system-admin-devops/</guid><description>After years in 
DevOps, I am increasingly convinced that Linux system-level knowledge is the foundation of all troubleshooting. When a Kubernetes Pod is inexplicably killed, a Java service suddenly stops responding, or soaring disk IO freezes the whole machine, the diagnosis ultimately comes down to the system layer. This post systematically reviews the system administration skills I use most in production.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/linux-system-admin-devops/featured.jpg"/></item><item><title>Linux Performance Tuning in Practice: A Systematic Method for Diagnosing CPU, Memory, and IO Bottlenecks</title><link>https://socake.github.io/posts/linux-performance-tuning/</link><pubDate>Sun, 08 Sep 2024 13:50:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/linux-performance-tuning/</guid><description>From toolchain selection to hands-on diagnosis, a complete methodology for Linux performance tuning: analyzing CPU context switches and soft interrupts, reading OOM logs, choosing an IO scheduler, handling TCP TIME_WAIT, and the particular effects of cgroup limits in container environments.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/linux-performance-tuning/featured.jpg"/></item><item><title>About Me</title><link>https://socake.github.io/posts/authors/</link><pubDate>Sun, 08 Sep 2024 13:50:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/authors/</guid><description/><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/authors/featured.png"/></item></channel></rss>