<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Observability on 黄文卓 | DevOps Engineer</title><link>https://socake.github.io/tags/%E5%8F%AF%E8%A7%82%E6%B5%8B%E6%80%A7/</link><description>Recent content in Observability on 黄文卓 | DevOps Engineer</description><generator>Hugo -- gohugo.io</generator><language>zh-CN</language><managingEditor>17691281867@163.com (Wenzhuo Huang)</managingEditor><webMaster>17691281867@163.com (Wenzhuo Huang)</webMaster><copyright>© 2026 Wenzhuo Huang</copyright><lastBuildDate>Sun, 12 Apr 2026 10:00:00 +0800</lastBuildDate><atom:link href="https://socake.github.io/tags/%E5%8F%AF%E8%A7%82%E6%B5%8B%E6%80%A7/index.xml" rel="self" type="application/rss+xml"/><item><title>bpftrace in Practice: A Swiss Army Knife for Troubleshooting Production Issues</title><link>https://socake.github.io/posts/bpftrace-performance-debug/</link><pubDate>Sun, 12 Apr 2026 10:00:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/bpftrace-performance-debug/</guid><description>strace is too heavy, perf is too low-level, and the BCC toolset requires installing a pile of dependencies; bpftrace is the balance point among the three. This article uses four real-world scenarios to explain how bpftrace works and help you turn it into an everyday troubleshooting tool.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/bpftrace-performance-debug/featured.jpg"/></item><item><title>Langfuse: An LLM Application Observability Platform in Practice</title><link>https://socake.github.io/posts/langfuse-llm-observability/</link><pubDate>Sat, 14 Feb 2026 11:44:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/langfuse-llm-observability/</guid><description>Explains why LLM applications need observability and how to use Langfuse to cover tracing, prompt version management, evaluation experiments, and cost analysis end to end, including a self-hosted Docker deployment and a complete Python SDK integration example.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/langfuse-llm-observability/featured.jpg"/></item><item><title>Alerts with Images in Practice: Grafana Render + DingTalk Trend Chart Push</title><link>https://socake.github.io/posts/prometheus-alert-with-image/</link><pubDate>Tue, 
23 Dec 2025 09:54:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/prometheus-alert-with-image/</guid><description>Receiving an alert that is just one line of numbers and then having to log in to Grafana to see the trend chart is one of the biggest pain points of the alerting experience. This article shows how to combine Grafana Image Renderer with an Alertmanager webhook into a complete solution in which alert messages automatically include a trend chart.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/prometheus-alert-with-image/featured.jpg"/></item><item><title>Prometheus Process Monitoring: process-exporter in Practice and Alert Configuration</title><link>https://socake.github.io/posts/prometheus-process-monitoring/</link><pubDate>Thu, 18 Dec 2025 11:20:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/prometheus-process-monitoring/</guid><description>K8s has a mature Pod monitoring stack, but how do you monitor processes running on bare metal and VMs? This article covers deploying and configuring process-exporter, including process group matching, core metrics, alert rule design, and pitfalls encountered in practice.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/prometheus-process-monitoring/featured.jpg"/></item><item><title>Building an Observability Stack with Prometheus + Grafana + Loki</title><link>https://socake.github.io/docs/kubernetes/%E5%8F%AF%E8%A7%82%E6%B5%8B%E6%80%A7%E5%BB%BA%E8%AE%BE/</link><pubDate>Mon, 08 Dec 2025 15:00:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/docs/kubernetes/%E5%8F%AF%E8%A7%82%E6%B5%8B%E6%80%A7%E5%BB%BA%E8%AE%BE/</guid><description>Notes from building a unified observability platform across multiple K8s clusters, covering Prometheus scrape configuration, alert rule design, Grafana dashboard organization, and a Loki deployment for cross-cluster log aggregation.</description></item><item><title>How to Design a Good Alerting System</title><link>https://socake.github.io/posts/%E5%91%8A%E8%AD%A6%E4%BD%93%E7%B3%BB%E8%AE%BE%E8%AE%A1/</link><pubDate>Tue, 18 Nov 2025 13:37:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/%E5%91%8A%E8%AD%A6%E4%BD%93%E7%B3%BB%E8%AE%BE%E8%AE%A1/</guid><description>Starting from a real experience of runaway alert noise, shares how to redesign an alerting system around SLIs and SLOs, including alert severity levels, rule design principles, routing strategy, and a postmortem process.</description><media:content 
xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/%E5%91%8A%E8%AD%A6%E4%BD%93%E7%B3%BB%E8%AE%BE%E8%AE%A1/featured.jpg"/></item><item><title>Prometheus High-Cardinality Governance in Practice: From 800 Million Series to Controlled Growth</title><link>https://socake.github.io/posts/metric-cardinality-governance/</link><pubDate>Sun, 28 Sep 2025 10:00:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/metric-cardinality-governance/</guid><description>High cardinality is the most common performance killer in the Prometheus ecosystem. This post explains why it happens, how to detect it, and how to govern it, and offers an organizational governance approach that can be rolled out broadly.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/metric-cardinality-governance/featured.jpg"/></item><item><title>eBPF Observability in Practice: Cilium Network Monitoring and Tetragon Security Auditing</title><link>https://socake.github.io/posts/ebpf-observability/</link><pubDate>Wed, 17 Sep 2025 12:36:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/ebpf-observability/</guid><description>eBPF is reshaping the foundations of cloud-native observability. This article documents hands-on experience rolling out Cilium + Hubble network monitoring and Tetragon security auditing in K8s clusters.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/ebpf-observability/featured.jpg"/></item><item><title>Kiali Service Mesh Observability in Practice: From Topology Graphs to Alert Integration</title><link>https://socake.github.io/posts/kiali-service-mesh-observability/</link><pubDate>Tue, 12 Aug 2025 10:00:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/kiali-service-mesh-observability/</guid><description>Kiali is not just a tool for drawing topology graphs; it is the diagnostic hub of a service mesh. This article covers the configuration, usage, and pitfalls of Kiali 2.x in production.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/kiali-service-mesh-observability/featured.jpg"/></item><item><title>SLO/SLI/Error Budget from Theory to Practice: SRE Reliability Engineering in the Field</title><link>https://socake.github.io/posts/slo-sli-error-budget-practice/</link><pubDate>Fri, 01 Aug 2025 13:37:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo 
Huang)</author><guid>https://socake.github.io/posts/slo-sli-error-budget-practice/</guid><description>From selecting SLI metrics to error budget burn rate alerts, a systematic walkthrough of putting SRE reliability engineering into practice, including computing SLIs with Prometheus recording rules, configuring multi-window burn rate alert rules, the SLO violation postmortem process, and collaboration strategies with development teams.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/slo-sli-error-budget-practice/featured.jpg"/></item><item><title>Cilium Hubble in Practice: Seeing Through Kubernetes Networking with eBPF</title><link>https://socake.github.io/posts/ebpf-network-observability-cilium-hubble/</link><pubDate>Wed, 30 Jul 2025 10:00:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/ebpf-network-observability-cilium-hubble/</guid><description>Cilium Hubble is the closest thing to a switch mirror port on Kubernetes. This article explains its architecture, key configuration, and how to read flows in production to locate network problems.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/ebpf-network-observability-cilium-hubble/featured.jpg"/></item><item><title>VictoriaMetrics: A Monitoring Storage Solution That Uses Fewer Resources Than Prometheus</title><link>https://socake.github.io/posts/victoriametrics-prometheus/</link><pubDate>Mon, 28 Jul 2025 13:37:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/victoriametrics-prometheus/</guid><description>Is Prometheus buckling under load? This article compares the core differences between VictoriaMetrics and Prometheus, describes a seamless migration via remote_write, and reports the actual improvements VM delivers in resource usage, compression ratio, and query performance.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/victoriametrics-prometheus/featured.jpg"/></item><item><title>Thanos in Practice: Unified Monitoring and Long-Term Storage for Prometheus Across Multiple K8s Clusters</title><link>https://socake.github.io/posts/thanos-multi-cluster/</link><pubDate>Sat, 26 Jul 2025 11:37:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/thanos-multi-cluster/</guid><description>Documents the full process of migrating the standalone Prometheus instances of our three EKS clusters to a unified Thanos monitoring stack, focusing on the selection decision, production configuration, and lessons learned.</description><media:content 
xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/thanos-multi-cluster/featured.jpg"/></item><item><title>OpenTelemetry in Practice: Unified Collection of Traces, Metrics, and Logs</title><link>https://socake.github.io/posts/opentelemetry-practice/</link><pubDate>Sun, 20 Jul 2025 11:41:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/opentelemetry-practice/</guid><description>Starting with why we chose OpenTelemetry, presents a DaemonSet + Gateway Collector deployment architecture, key configuration, and pitfalls encountered in practice.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/opentelemetry-practice/featured.jpg"/></item><item><title>Grafana Tempo Large-Scale Distributed Tracing in Practice: From OTel Integration to TraceQL Tuning</title><link>https://socake.github.io/posts/grafana-tempo-distributed-tracing/</link><pubDate>Wed, 16 Jul 2025 10:00:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/grafana-tempo-distributed-tracing/</guid><description>Tempo is currently the cheapest distributed tracing backend. This article ties together architecture, integration, TraceQL, tail sampling, cost optimization, and incident case studies so that teams can reuse it directly.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/grafana-tempo-distributed-tracing/featured.jpg"/></item><item><title>The Three Pillars of Observability in Practice: Correlating Metrics/Logs/Traces</title><link>https://socake.github.io/posts/observability-three-pillars/</link><pubDate>Mon, 14 Jul 2025 09:52:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/observability-three-pillars/</guid><description>Monitoring tells you the system is down; observability tells you why. Starting from the core differences among the three pillars, this article walks through a troubleshooting workflow that correlates Prometheus, Loki, and Tempo, covering OpenTelemetry collection standards, how exemplars work and how to configure them, and how to prioritize observability work.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/observability-three-pillars/featured.jpg"/></item><item><title>Distributed Tracing in Practice: Comparing Jaeger and Tempo</title><link>https://socake.github.io/posts/distributed-tracing-jaeger-tempo/</link><pubDate>Thu, 10 Jul 2025 10:00:00 
+0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/distributed-tracing-jaeger-tempo/</guid><description>A systematic review of the architectural differences and suitable use cases of Jaeger and Tempo, combining OpenTelemetry SDK instrumentation, TraceQL queries, sampling strategies, and Traces/Metrics/Logs correlation into a production-ready approach.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/distributed-tracing-jaeger-tempo/featured.jpg"/></item><item><title>Pyroscope Continuous Profiling in Production: A Performance Profile for Every Line of Code</title><link>https://socake.github.io/posts/pyroscope-continuous-profiling/</link><pubDate>Wed, 02 Jul 2025 10:00:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/pyroscope-continuous-profiling/</guid><description>Why profiling is needed beyond metrics, logs, and traces, what problem it solves, what Pyroscope's architecture looks like, and how to roll it out across an entire K8s cluster at 2%~5% overhead.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/pyroscope-continuous-profiling/featured.jpg"/></item><item><title>Grafana Mimir Long-Term Metrics Storage in Practice: From Single-Cluster Prometheus to 1 Billion Series</title><link>https://socake.github.io/posts/grafana-mimir-long-term-metrics/</link><pubDate>Wed, 18 Jun 2025 10:00:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/grafana-mimir-long-term-metrics/</guid><description>Starting from a single Prometheus HA pair and scaling all the way to a multi-active Mimir deployment across three regions, growing the series count from tens of millions to the billion level. This article covers architecture, configuration, monitoring, and incidents in order.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/grafana-mimir-long-term-metrics/featured.jpg"/></item><item><title>Loki Architecture Deep Dive: From the Write Path to PB-Scale Log Query Optimization</title><link>https://socake.github.io/posts/loki-architecture-deep-dive/</link><pubDate>Thu, 05 Jun 2025 10:00:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/loki-architecture-deep-dive/</guid><description>Breaks down the write, index, and query paths of the Loki 3.x architecture, provides directly reusable configurations for schema_config, compactor, bloom, and TSDB, 
and reviews the tuning lessons learned from two production incidents.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/loki-architecture-deep-dive/featured.jpg"/></item><item><title>Kafka Operations in Practice: Troubleshooting Message Backlogs, Partition Rebalancing, and Monitoring</title><link>https://socake.github.io/posts/kafka-ops-practice/</link><pubDate>Mon, 07 Apr 2025 11:37:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/kafka-ops-practice/</guid><description>A systematic review of core Kafka operations skills: consumer lag monitoring and alerting, root cause analysis of message backlogs, partition expansion planning, handling rebalance storms, and configuring KEDA to autoscale based on lag.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/kafka-ops-practice/featured.jpg"/></item><item><title>Grafana API Automation: Managing Dashboards, Data Sources, and Alerts with Code</title><link>https://socake.github.io/posts/grafana-api-automation/</link><pubDate>Tue, 18 Mar 2025 11:26:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/grafana-api-automation/</guid><description>Managing Grafana dashboards by clicking through the UI is a nightmare in multi-environment setups. Turning dashboards into code via the API, with version control and environment synchronization, is the right approach. This article provides complete Python tooling scripts and pitfalls from practice.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/grafana-api-automation/featured.jpg"/></item><item><title>Prometheus Service Discovery Deep Dive: kubernetes_sd_configs in Practice</title><link>https://socake.github.io/posts/prometheus-service-discovery/</link><pubDate>Sat, 15 Mar 2025 09:30:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/prometheus-service-discovery/</guid><description>Manually maintaining Prometheus scrape targets in a K8s environment is unrealistic; kubernetes_sd_configs combined with relabel_configs is the core mechanism for solving this. This article explains the whole system from principles to practice.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/prometheus-service-discovery/featured.jpg"/></item><item><title>Kubernetes 
Log Collection Solution Selection: From Technical Comparison to Production Rollout</title><link>https://socake.github.io/posts/k8s-logging-solution/</link><pubDate>Tue, 25 Feb 2025 11:01:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/k8s-logging-solution/</guid><description>Documents our team's full journey of building a Kubernetes log collection system from scratch, the technical rationale for ultimately choosing Fluent Bit + Fluentd + Elasticsearch, and the pitfalls we hit in production.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/k8s-logging-solution/featured.jpg"/></item><item><title>Building Observability: From Prometheus Collection to Grafana Alert Integration</title><link>https://socake.github.io/posts/prometheus-grafana/</link><pubDate>Fri, 06 Dec 2024 09:30:00 +0800</pubDate><author>17691281867@163.com (Wenzhuo Huang)</author><guid>https://socake.github.io/posts/prometheus-grafana/</guid><description>Observability is not about installing a few monitoring tools; it is about being able to quickly locate the root cause when something goes wrong. This article runs from collection architecture through PromQL to alert routing, covering problems we actually encountered in production such as cardinality explosions and alert noise.</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="https://socake.github.io/posts/prometheus-grafana/featured.jpg"/></item></channel></rss>