Prometheus on 黄文卓 | DevOps Engineer

Recent content in Prometheus on 黄文卓 | DevOps Engineer
https://socake.github.io/tags/prometheus/
Author: Wenzhuo Huang (17691281867@163.com)
Language: zh-CN
© 2026 Wenzhuo Huang
Last updated: Thu, 30 Apr 2026 17:00:00 +0800

Playbook: Consolidating Multi-Cloud Alerting in Practice, from 200 Tangled Rules to Unified Governance
https://socake.github.io/playbook/multi-cloud-alerting-consolidation/
Thu, 30 Apr 2026 17:00:00 +0800
The most common alerting problem is not a lack of alerts but two or even three alerting systems running in parallel, with overlapping channels, duplicated rules, and silences scattered everywhere. This article lays out the full path from that chaos to unified governance, including the complete YAML, scripts, and configuration, ready to copy and deploy 1:1.

OpenCost in Practice: Kubernetes Cost Visibility and Multi-Team Cost Allocation
https://socake.github.io/posts/opencost-kubernetes-cost-visibility/
Sun, 12 Apr 2026 14:00:00 +0800
Opaque Kubernetes costs are the biggest obstacle to adopting FinOps. This article builds a complete cost-visibility setup with OpenCost, covering deployment and integration, cloud provider pricing, per-team allocation, Grafana dashboards, over-budget alerts, and automated weekly report delivery, with configuration that can be reused directly.

USE Method: A Methodology for System Performance Analysis
https://socake.github.io/posts/use-method-performance-analysis/
Sun, 12 Apr 2026 11:00:00 +0800
Random trial and error is the enemy of performance troubleshooting. The USE Method applies a three-dimensional framework (utilization, saturation, errors) to bring every system resource into one analysis scheme. This article covers the methodology from principles to practice and provides PromQL mappings for K8s environments plus a toolchain cheat sheet.

Error Budget-Based Prometheus Alert Design: Burn Rate Alerting in Practice
https://socake.github.io/posts/prometheus-error-budget-alerting/
Thu, 25 Dec 2025 10:40:00 +0800
Error-rate alerts have a fatal flaw: they do not tell you how urgent the problem is. A 1% error rate sustained for 2 hours and one sustained for 10 minutes pose completely different threats to an SLO. Burn rate alerts start from how fast the Error Budget is being consumed, so every alert carries urgency information.

Alerts with Images in Practice: Grafana Render + Pushing Trend Charts to DingTalk
https://socake.github.io/posts/prometheus-alert-with-image/
Tue, 23 Dec 2025 09:54:00 +0800
Receiving an alert that is just one line of numbers, then having to log in to Grafana to see the trend, is one of the biggest pain points of the alerting experience. This article describes how to combine Grafana Image Renderer with an Alertmanager webhook so that alert messages automatically include a trend chart.

Prometheus Process Monitoring: process-exporter in Practice and Alert Configuration
https://socake.github.io/posts/prometheus-process-monitoring/
Thu, 18 Dec 2025 11:20:00 +0800
K8s has a mature Pod monitoring stack, but how do you monitor processes running on bare metal and VMs? This article covers deploying and configuring process-exporter, including process group matching, core metrics, alert rule design, and pitfalls encountered in practice.

Building an Observability Stack with Prometheus + Grafana + Loki
https://socake.github.io/docs/kubernetes/%E5%8F%AF%E8%A7%82%E6%B5%8B%E6%80%A7%E5%BB%BA%E8%AE%BE/
Mon, 08 Dec 2025 15:00:00 +0800
Notes from building a unified observability platform across multiple K8s clusters, including Prometheus scrape configuration, alert rule design, Grafana dashboard organization, and a Loki deployment for cross-cluster log aggregation.

k6 Load Testing in Practice: From Writing Scripts to Performance Analysis
https://socake.github.io/posts/k6-load-testing-practice/
Tue, 21 Oct 2025 12:44:00 +0800
Load testing is not running a script to see whether the system holds up; it is using a deliberately designed load model to expose bottlenecks. This article records my full practice of production-grade performance testing with k6: script design, threshold configuration, Grafana integration, and how several typical performance problems were tracked down.

ELK Cluster Monitoring: Monitoring Elasticsearch Health with Prometheus + Grafana
https://socake.github.io/posts/elk-prometheus-monitoring/
Wed, 08 Oct 2025 11:33:00 +0800
The free features of Kibana's built-in Stack Monitoring are limited, and its alert channels are also restricted by commercial licensing. We ultimately chose Prometheus + Grafana to monitor our ELK cluster; this article records the full rollout and the pitfalls.

Prometheus High-Cardinality Governance in Practice: From 800 Million Series to Controlled Growth
https://socake.github.io/posts/metric-cardinality-governance/
Sun, 28 Sep 2025 10:00:00 +0800
High cardinality is the most common performance killer in the Prometheus ecosystem. This article explains why it happens, how to detect it, and how to govern it, and offers an organizational governance approach that can be rolled out broadly.

SLO/SLI/Error Budget from Theory to Practice: SRE Reliability Engineering in Action
https://socake.github.io/posts/slo-sli-error-budget-practice/
Fri, 01 Aug 2025 13:37:00 +0800
From selecting SLI metrics to Error Budget burn rate alerts, a systematic walkthrough of putting SRE reliability engineering into practice, including computing SLIs with Prometheus recording rules, configuring multi-window burn rate alert rules, the SLO violation review process, and strategies for collaborating with development teams.

VictoriaMetrics: A Monitoring Storage Option That Uses Fewer Resources Than Prometheus
https://socake.github.io/posts/victoriametrics-prometheus/
Mon, 28 Jul 2025 13:37:00 +0800
Is Prometheus no longer keeping up? This article compares the core differences between VictoriaMetrics and Prometheus, describes a seamless migration via remote_write, and reports the actual improvements VM delivers in resource usage, compression ratio, and query performance.

Thanos in Practice: Unified Monitoring and Long-Term Storage for Prometheus Across Multiple K8s Clusters
https://socake.github.io/posts/thanos-multi-cluster/
Sat, 26 Jul 2025 11:37:00 +0800
A record of migrating the standalone Prometheus instances of three EKS clusters to a unified Thanos monitoring setup, focusing on selection decisions, production configuration, and lessons learned.

The Three Pillars of Observability in Practice: Linking Metrics, Logs, and Traces
https://socake.github.io/posts/observability-three-pillars/
Mon, 14 Jul 2025 09:52:00 +0800
Monitoring tells you the system is down; observability tells you why. Starting from the core differences among the three pillars, this article walks through a combined Prometheus + Loki + Tempo troubleshooting workflow, covering the OpenTelemetry collection standard, how exemplars work and how to configure them, and how to prioritize observability work.

Grafana Mimir Long-Term Metrics Storage in Practice: From Single-Cluster Prometheus to Billions of Series
https://socake.github.io/posts/grafana-mimir-long-term-metrics/
Wed, 18 Jun 2025 10:00:00 +0800
Starting from a single Prometheus HA pair and scaling all the way to a multi-active Mimir deployment across three regions, with the series count growing from tens of millions to the billion level. This article covers architecture, configuration, monitoring, and incidents in order.

Alertmanager Webhook Development: Custom Alert Handling and API Integration
https://socake.github.io/posts/alertmanager-webhook-api/
Tue, 25 Mar 2025 09:52:00 +0800
Alertmanager's built-in notification channels do not support DingTalk, Feishu, and other tools widely used in China; webhooks are the standard way to extend alert notifications. This article implements a complete webhook receiver with Python Flask, covering message formatting, noise reduction and deduplication, Alertmanager API integration, and K8s deployment.

The Complete Alertmanager Guide: Routing, Inhibition, Silences, and Multi-Channel Notifications
https://socake.github.io/posts/alertmanager-routing-config/
Sat, 22 Mar 2025 12:27:00 +0800
Too many alerts are as harmful as too few. Alertmanager's routing, inhibition, and grouping mechanisms are the core tools for controlling alert noise. Starting from a real multi-environment alerting setup, this article explains the intent and pitfalls of each setting.

Prometheus Service Discovery in Depth: kubernetes_sd_configs in Practice
https://socake.github.io/posts/prometheus-service-discovery/
Sat, 15 Mar 2025 09:30:00 +0800
Manually maintaining Prometheus scrape targets in a K8s environment is impractical; kubernetes_sd_configs combined with relabel_configs is the core mechanism for solving this. This article covers the whole system from principles to practice.

Building Observability: From Prometheus Scraping to Grafana Alerting Integration
https://socake.github.io/posts/prometheus-grafana/
Fri, 06 Dec 2024 09:30:00 +0800
Observability is not installing a few monitoring tools; it is being able to pinpoint the root cause quickly when something breaks. This article runs from scrape architecture through PromQL to alert routing, covering problems we actually hit in production such as cardinality explosions and alert noise.

Integrating Python with Prometheus: Automating Queries for Monitoring Data and Alert Status
https://socake.github.io/posts/python-prometheus-monitoring/
Mon, 25 Nov 2024 11:44:00 +0800
Calling the Prometheus HTTP API directly from Python to implement service liveness checks and daily availability reports, then hooking into DingTalk to push a daily cluster health summary automatically.