<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Alerts on Wenzhuo Huang | DevOps Engineer</title><link>https://socake.github.io/tags/%E5%91%8A%E8%AD%A6/</link><description>Recent content in Alerts on Wenzhuo Huang | DevOps Engineer</description><generator>Hugo -- gohugo.io</generator><language>zh-CN</language><managingEditor>17691281867@163.com (Wenzhuo Huang)</managingEditor><webMaster>17691281867@163.com (Wenzhuo Huang)</webMaster><copyright>© 2026 Wenzhuo Huang</copyright><lastBuildDate>Thu, 30 Apr 2026 17:00:00 +0800</lastBuildDate><atom:link href="https://socake.github.io/tags/%E5%91%8A%E8%AD%A6/index.xml" rel="self" type="application/rss+xml"/><item><title>Playbook: Consolidating Multi-Cloud Alerting in Practice, from 200 Tangled Rules to Unified Governance</title><link>https://socake.github.io/playbook/multi-cloud-alerting-consolidation/</link><pubDate>Thu, 30 Apr 2026 17:00:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/playbook/multi-cloud-alerting-consolidation/</guid><description>The most common state of alerting is not having no alerts, but having two or even three alerting systems running in parallel, with crossed channels, overlapping rules, and silences scattered everywhere. This article lays out the complete path from that chaos to unified governance, including the full YAML, scripts, and configuration that can be copied 1:1 and deployed directly.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/playbook/multi-cloud-alerting-consolidation/featured.jpg"/></item><item><title>Prometheus Alert Design Based on Error Budget: Burn Rate Alerting in Practice</title><link>https://socake.github.io/posts/prometheus-error-budget-alerting/</link><pubDate>Thu, 25 Dec 2025 10:40:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/prometheus-error-budget-alerting/</guid><description>Error-rate alerts have a fatal flaw: they do not tell you how urgent the problem is. A 1% error rate that lasts 2 hours and one that lasts 10 minutes pose completely different threats to the SLO. Burn rate alerts start from how fast the Error Budget is being consumed, so every alert carries &amp;quot;urgency&amp;quot; information.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/prometheus-error-budget-alerting/featured.jpg"/></item><item><title>Alerts with Charts in Practice: Grafana Render + 
DingTalk Trend Chart Push</title><link>https://socake.github.io/posts/prometheus-alert-with-image/</link><pubDate>Tue, 23 Dec 2025 09:54:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/prometheus-alert-with-image/</guid><description>Receiving an alert that is just one line of numbers, then having to log in to Grafana to see the trend chart, is one of the biggest pain points of the alerting experience. This article shows how to combine Grafana Image Renderer with an Alertmanager Webhook into a complete solution in which alert messages automatically include a trend chart.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/prometheus-alert-with-image/featured.jpg"/></item><item><title>How to Design a Good Alerting System</title><link>https://socake.github.io/posts/%E5%91%8A%E8%AD%A6%E4%BD%93%E7%B3%BB%E8%AE%BE%E8%AE%A1/</link><pubDate>Tue, 18 Nov 2025 13:37:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/%E5%91%8A%E8%AD%A6%E4%BD%93%E7%B3%BB%E8%AE%BE%E8%AE%A1/</guid><description>Starting from a real experience of being flooded with alert noise, this post shares how to redesign an alerting system around SLI/SLO, covering alert severity levels, rule design principles, routing strategy, and the postmortem process.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/%E5%91%8A%E8%AD%A6%E4%BD%93%E7%B3%BB%E8%AE%BE%E8%AE%A1/featured.jpg"/></item><item><title>On-Call Rotation Management in Practice: From Alert Fatigue to Sustainable On-Call</title><link>https://socake.github.io/posts/oncall-rotation-management/</link><pubDate>Wed, 24 Sep 2025 10:00:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/oncall-rotation-management/</guid><description>On-call is neither a perk nor a punishment; it is a responsibility. Turning it into a sustainable engineering practice matters more than any advanced monitoring tool.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/oncall-rotation-management/featured.jpg"/></item><item><title>On-Call Engineering Practice: From Alert Response to Runbook Design</title><link>https://socake.github.io/posts/on-call-engineering-practice/</link><pubDate>Tue, 08 Jul 2025 11:26:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/on-call-engineering-practice/</guid><description>A good On-Call system does not keep people staring at a screen 24 
hours a day; it makes every wake-up worth it. From alert quality to Runbook design, from rotation policy to data-driven improvement, this article summarizes the practice our team has refined over 3 years in production.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/on-call-engineering-practice/featured.jpg"/></item><item><title>Alertmanager Webhook Development: Custom Alert Handling and API Integration</title><link>https://socake.github.io/posts/alertmanager-webhook-api/</link><pubDate>Tue, 25 Mar 2025 09:52:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/alertmanager-webhook-api/</guid><description>Alertmanager's built-in notification channels do not support Chinese tools such as DingTalk and Feishu, and Webhook is the standard way to extend alert notifications. This article implements a complete Webhook receiver in Python Flask, covering message formatting, noise reduction and deduplication, Alertmanager API integration, and K8s deployment.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/alertmanager-webhook-api/featured.jpg"/></item><item><title>The Complete Guide to Alertmanager: Routing, Inhibition, Silences, and Multi-Channel Notifications</title><link>https://socake.github.io/posts/alertmanager-routing-config/</link><pubDate>Sat, 22 Mar 2025 12:27:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/alertmanager-routing-config/</guid><description>Too many alerts are just as harmful as too few. Alertmanager's routing, inhibition, and grouping mechanisms are the core means of controlling alert noise. Starting from a real multi-environment alerting setup, this article explains the intent and the pitfalls of each setting.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/alertmanager-routing-config/featured.jpg"/></item></channel></rss>