<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>SRE on 黄文卓 | DevOps Engineer</title><link>https://socake.github.io/tags/sre/</link><description>Recent content in SRE on 黄文卓 | DevOps Engineer</description><generator>Hugo -- gohugo.io</generator><language>zh-CN</language><managingEditor>17691281867@163.com (Wenzhuo Huang)</managingEditor><webMaster>17691281867@163.com (Wenzhuo Huang)</webMaster><copyright>© 2026 Wenzhuo Huang</copyright><lastBuildDate>Thu, 30 Apr 2026 17:00:00 +0800</lastBuildDate><atom:link href="https://socake.github.io/tags/sre/index.xml" rel="self" type="application/rss+xml"/><item><title>Playbook: Consolidating Multi-Cloud Alerting in Practice, from a 200-Rule Free-for-All to Unified Governance</title><link>https://socake.github.io/playbook/multi-cloud-alerting-consolidation/</link><pubDate>Thu, 30 Apr 2026 17:00:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/playbook/multi-cloud-alerting-consolidation/</guid><description>The most common alerting problem is not having no alerts; it is having two or even three alerting systems running in parallel, with crossed channels, overlapping rules, and silences scattered everywhere. This post lays out a complete path from that chaos to unified governance, including the full YAML, scripts, and configuration, ready to copy and deploy 1:1.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/playbook/multi-cloud-alerting-consolidation/featured.jpg"/></item><item><title>USE Method: A Methodology for System Performance Analysis</title><link>https://socake.github.io/posts/use-method-performance-analysis/</link><pubDate>Sun, 12 Apr 2026 11:00:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/use-method-performance-analysis/</guid><description>Random trial and error is the enemy of performance troubleshooting. The USE Method uses a three-dimensional framework (utilization, saturation, errors) to bring every system resource into one analysis scheme. This post covers the methodology from principles to practice and provides PromQL mappings and a toolchain cheat sheet for K8s environments.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/use-method-performance-analysis/featured.jpg"/></item><item><title>Error Budget-Based Prometheus 
Alert Design: Burn-Rate Alerting in Practice</title><link>https://socake.github.io/posts/prometheus-error-budget-alerting/</link><pubDate>Thu, 25 Dec 2025 10:40:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/prometheus-error-budget-alerting/</guid><description>Error-rate alerts have a fatal flaw: they do not tell you how urgent the problem is. A 1% error rate sustained for 2 hours threatens the SLO very differently from one sustained for 10 minutes. Burn-rate alerts start from how fast the Error Budget is being consumed, so every alert carries &amp;quot;urgency&amp;quot; information.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/prometheus-error-budget-alerting/featured.jpg"/></item><item><title>Selected Interview Questions for Senior Ops/DevOps Engineers: System Design and Deep Dives</title><link>https://socake.github.io/posts/devops-senior-interview/</link><pubDate>Thu, 11 Dec 2025 12:51:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/devops-senior-interview/</guid><description>What do senior ops interviews test? This post collects 5 system design questions and 10 in-depth technical questions, each with an answer framework. From monitoring system design to K8s scheduler internals, and from production incident postmortems to decisions on adopting new technology, it helps you build a complete line of reasoning for your answers.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/devops-senior-interview/featured.jpg"/></item><item><title>How to Design a Good Alerting System</title><link>https://socake.github.io/posts/%E5%91%8A%E8%AD%A6%E4%BD%93%E7%B3%BB%E8%AE%BE%E8%AE%A1/</link><pubDate>Tue, 18 Nov 2025 13:37:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/%E5%91%8A%E8%AD%A6%E4%BD%93%E7%B3%BB%E8%AE%BE%E8%AE%A1/</guid><description>Starting from a real experience of runaway alert noise, this post shares how to redesign an alerting system around SLI/SLO, covering alert severity levels, rule design principles, routing strategy, and the review process.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/%E5%91%8A%E8%AD%A6%E4%BD%93%E7%B3%BB%E8%AE%BE%E8%AE%A1/featured.jpg"/></item><item><title>On-Call Rotation Management in Practice: From Alert Fatigue to Sustainable On-Call</title><link>https://socake.github.io/posts/oncall-rotation-management/</link><pubDate>Wed, 24 Sep 2025 10:00:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo 
Huang)</author><guid>https://socake.github.io/posts/oncall-rotation-management/</guid><description>On-call is neither a perk nor a punishment; it is a responsibility. Turning it into a sustainable engineering practice matters more than any advanced monitoring tool.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/oncall-rotation-management/featured.jpg"/></item><item><title>Chaos Engineering in Practice: Injecting Faults in K8s with Chaos Mesh</title><link>https://socake.github.io/posts/chaos-mesh-practice/</link><pubDate>Sat, 13 Sep 2025 09:56:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/chaos-mesh-practice/</guid><description>Chaos engineering is not about breaking systems; it is about exposing weak points early in a controlled environment. This post documents the full process of designing and running chaos drills with Chaos Mesh on a production-grade K8s cluster, including installation, experiment configuration, Workflow orchestration, and GameDay process design.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/chaos-mesh-practice/featured.jpg"/></item><item><title>Incident Response and Blameless Postmortems: Turning Every Incident into an Organizational Asset</title><link>https://socake.github.io/posts/incident-response-postmortem/</link><pubDate>Wed, 10 Sep 2025 10:00:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/incident-response-postmortem/</guid><description>Incident response is not heroics; it is a repeatable process. This post spells out the process, templates, and culture so that every incident can be distilled into an organizational asset.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/incident-response-postmortem/featured.jpg"/></item><item><title>A Practical Guide to Chaos Engineering GameDays: From the First Drill to Routine Fault Injection</title><link>https://socake.github.io/posts/chaos-engineering-gameday/</link><pubDate>Wed, 27 Aug 2025 10:00:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/chaos-engineering-gameday/</guid><description>Do not mistake chaos engineering for killing pods at random. The real value lies in a hypothesis-driven drill methodology: write down the hypothesis before the drill, verify it during the drill, and improve systems and processes after the review.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/chaos-engineering-gameday/featured.jpg"/></item><item><title>SLO/SLI/Error Budget from Theory to Practice: SRE 
Reliability Engineering in Action</title><link>https://socake.github.io/posts/slo-sli-error-budget-practice/</link><pubDate>Fri, 01 Aug 2025 13:37:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/slo-sli-error-budget-practice/</guid><description>From choosing SLIs to alerting on Error Budget burn rate, this post systematically explains how to put SRE reliability engineering into practice, including computing SLIs with Prometheus recording rules, configuring multi-window burn rate alert rules, the SLO violation review process, and strategies for collaborating with development teams.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/slo-sli-error-budget-practice/featured.jpg"/></item><item><title>DORA Metrics and Measuring Platform Engineering Effectiveness: Driving DevOps Improvement with Data</title><link>https://socake.github.io/posts/dora-metrics-platform-engineering/</link><pubDate>Sat, 12 Jul 2025 12:27:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/dora-metrics-platform-engineering/</guid><description>The four DORA metrics are a diagnostic tool, not a performance review tool. Collect data from CI/CD pipelines and the incident system, find the real causes of low deployment frequency and long lead time, then improve systematically through platform engineering. This post provides a collection scheme, a Grafana dashboard design, and common misuse pitfalls.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/dora-metrics-platform-engineering/featured.jpg"/></item><item><title>On-Call Engineering Practice: From Alert Response to Runbook Design</title><link>https://socake.github.io/posts/on-call-engineering-practice/</link><pubDate>Tue, 08 Jul 2025 11:26:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/on-call-engineering-practice/</guid><description>A good on-call system does not keep people staring at screens 24 hours a day; it makes every wake-up worthwhile. From alert quality to Runbook design, and from rotation policy to data-driven improvement, this post summarizes 3 years of our team's practice in production.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/on-call-engineering-practice/featured.jpg"/></item><item><title>The Full Lifecycle of SRE Incident Management: From Response to Postmortem</title><link>https://socake.github.io/posts/sre-incident-management/</link><pubDate>Sat, 05 Jul 2025 09:30:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo 
Huang)</author><guid>https://socake.github.io/posts/sre-incident-management/</guid><description>Incident handling is not just a technical problem; it is even more a problem of collaboration and information flow. This post walks through every stage from incident trigger to Post-Mortem archiving, including the purpose of the IC role, a 15-minute scoping framework, and how to make the Post-Mortem actually drive improvement rather than become a formality.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/sre-incident-management/featured.jpg"/></item><item><title>Core SRE Concepts: From an Ops Mindset to Reliability Engineering</title><link>https://socake.github.io/posts/sre-concepts-and-principles/</link><pubDate>Thu, 26 Jun 2025 11:44:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/sre-concepts-and-principles/</guid><description>SRE is not just a nicer name for ops. It is a methodology that applies software engineering thinking to reliability problems. Starting from the Error Budget, this post covers defining SLI/SLO, identifying Toil, On-call design, postmortem culture, and a practical path from traditional ops to SRE.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/sre-concepts-and-principles/featured.jpg"/></item><item><title>Multi-Cluster Kubernetes Operations: Cross-Cluster Management and Unified Observability</title><link>https://socake.github.io/posts/multi-cluster-k8s-management/</link><pubDate>Wed, 21 May 2025 13:03:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/multi-cluster-k8s-management/</guid><description>Going from one cluster to many, operational complexity grows exponentially rather than linearly. This post summarizes our hands-on experience managing multiple K8s clusters across regions and environments: unifying deployment with ArgoCD ApplicationSet, aggregating multi-cluster metrics with Thanos, and a real cross-cluster migration.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/multi-cluster-k8s-management/featured.jpg"/></item><item><title>Kubernetes Cluster Upgrade Strategy: A Complete Practical Guide to Zero-Downtime Upgrades</title><link>https://socake.github.io/posts/kubernetes-upgrade-strategy/</link><pubDate>Wed, 14 May 2025 09:56:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/kubernetes-upgrade-strategy/</guid><description>Upgrading a K8s cluster sounds simple, but in practice there are many pitfalls: Helm failures caused by API deprecations, Admission Webhooks blocking upgrade traffic, and service outages caused by misconfigured PDBs. Drawing on real upgrade experience, this post offers a reusable zero-downtime upgrade plan.</description><media:content 
xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/kubernetes-upgrade-strategy/featured.jpg"/></item><item><title>The Complete Guide to Alertmanager: Routing, Inhibition, Silences, and Multi-Channel Notifications</title><link>https://socake.github.io/posts/alertmanager-routing-config/</link><pubDate>Sat, 22 Mar 2025 12:27:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/alertmanager-routing-config/</guid><description>Too many alerts are as harmful as too few. Alertmanager's routing, inhibition, and grouping mechanisms are the core means of controlling alert noise. Starting from a real multi-environment alerting setup, this post explains the intent and pitfalls of each configuration option.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/alertmanager-routing-config/featured.jpg"/></item><item><title>Technical Growth for Ops Engineers: Planning the Path from Executor to Architect</title><link>https://socake.github.io/posts/devops-career-growth/</link><pubDate>Sun, 22 Dec 2024 09:52:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/devops-career-growth/</guid><description>An ops engineer's growth is not a pile of tools but a leap in levels of understanding. This post records my observations and reflections on this path: which moments truly move people forward, and which habits of thought keep them stuck in place.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/devops-career-growth/featured.jpg"/></item><item><title>A Troubleshooting Methodology: From Symptom to Root Cause</title><link>https://socake.github.io/posts/%E6%95%85%E9%9A%9C%E6%8E%92%E6%9F%A5%E6%96%B9%E6%B3%95%E8%AE%BA/</link><pubDate>Tue, 17 Dec 2024 12:27:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/%E6%95%85%E9%9A%9C%E6%8E%92%E6%9F%A5%E6%96%B9%E6%B3%95%E8%AE%BA/</guid><description>Good troubleshooting relies on method, not intuition. This post summarizes the troubleshooting framework I distilled from many production incidents: from building a timeline to prioritizing hypotheses, and on to recognizing and avoiding cognitive traps.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/%E6%95%85%E9%9A%9C%E6%8E%92%E6%9F%A5%E6%96%B9%E6%B3%95%E8%AE%BA/featured.jpg"/></item><item><title>Lessons from SRE Practice: The Mindset Shift from Ops to SRE</title><link>https://socake.github.io/posts/sre%E5%AE%9E%E8%B7%B5%E5%BF%83%E5%BE%97/</link><pubDate>Wed, 11 Dec 2024 11:26:00 
+0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/sre%E5%AE%9E%E8%B7%B5%E5%BF%83%E5%BE%97/</guid><description>SRE is not ops with a new title; it is a methodology that applies software engineering thinking to reliability problems. This post records the shifts that struck me most in practice.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/sre%E5%AE%9E%E8%B7%B5%E5%BF%83%E5%BE%97/featured.jpg"/></item><item><title>Building Observability: From Prometheus Collection to Grafana Alerting Integration</title><link>https://socake.github.io/posts/prometheus-grafana/</link><pubDate>Fri, 06 Dec 2024 09:30:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/prometheus-grafana/</guid><description>Observability is not about installing a few monitoring tools; it is about being able to locate the root cause quickly when something goes wrong. From collection architecture to PromQL to alert routing, this post covers problems we actually hit in production, such as cardinality explosion and alert noise.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/prometheus-grafana/featured.jpg"/></item></channel></rss>