跳过正文
OpenCost 实战:Kubernetes 成本可见性与多团队费用分摊

OpenCost 实战:Kubernetes 成本可见性与多团队费用分摊

·1780 字·9 分钟·
目录

在 Kubernetes 中,一个 kubectl apply 就能消耗几百美元的云资源,但账单却只显示一个 EC2 或 ECS 集群的总费用。谁在用、用了多少、用在哪里——这三个问题在多团队共享集群的场景下几乎无法从云账单直接回答。

这就是 Kubernetes 成本不透明的根因,也是 OpenCost 要解决的问题。


成本不透明的根因分析
#

共享节点的代价
#

传统 VM 时代,一个 VM 对应一个账单条目,归属清晰。Kubernetes 上,多个 Namespace 的 Pod 共享同一批节点,节点成本无法直接归因到某个团队或服务。

典型问题场景:

  • 资源超申请(Over-provisioning):team-a 的服务 requests 了 8 核,实际用了 2 核,节点上 30% 的算力被空占
  • 共享基础组件:Ingress Controller、Prometheus、日志采集器的成本属于"公共基础设施",应该按比例分摊给各团队,但传统方案做不到
  • Spot 实例混用:on-demand 和 spot 节点混用,不同实例类型的单价差异巨大,简单平均会严重失真

为什么需要专门的工具
#

云厂商账单(AWS Cost Explorer、阿里云费用中心)只能到实例/资源组级别,无法下钻到 Pod、Namespace、Label 维度。自己写脚本计算成本模型需要:采集资源使用量、查询实例价格 API、处理 spot 价格波动、处理 PVC 存储计费——工程量不小且难以准确。


OpenCost vs Kubecost:选型边界
#

能力OpenCost(开源)Kubecost 付费版
实时成本查询(Namespace/Pod/Label)
多云价格接入
Grafana Dashboard
多集群统一视图
成本预算与告警 UI❌(需自建)
自定义分摊规则 UI❌(需自建)
SAML/SSO
网络成本细分有限
支持社区商业

结论:单集群、技术团队自运维、愿意写 Prometheus 规则——OpenCost 完全够用。多集群统一管理、需要非技术管理者查看成本 UI、有合规报表需求——考虑 Kubecost。


部署 OpenCost
#

前置依赖
#

OpenCost 需要 Prometheus 作为存储后端。如果已有 kube-prometheus-stack,可以直接复用;没有的话一并安装。

# 添加 Helm 仓库
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo add opencost https://opencost.github.io/opencost-helm-chart
helm repo update

安装 kube-prometheus-stack(如未安装)
#

helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set prometheus.prometheusSpec.retention=30d \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.storageClassName=gp3 \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=100Gi

安装 OpenCost
#

创建 values.yaml

# opencost-values.yaml
opencost:
  exporter:
    # 云厂商:aws, azure, gcp, alibabacloud
    cloudProviderApiKey: ""   # 不需要,用 IRSA 或 IAM Role
    defaultClusterId: "prod-us-west-2"

  prometheus:
    internal:
      enabled: false          # 使用外部 Prometheus
    external:
      enabled: true
      url: "http://kube-prometheus-stack-prometheus.monitoring.svc:9090"

  ui:
    enabled: true
    ingress:
      enabled: true
      ingressClassName: nginx
      hosts:
      - host: opencost.internal.example.com
        paths:
        - /

  # AWS 成本配置(通过 IRSA)
  serviceAccount:
    annotations:
      eks.amazonaws.com/role-arn: "arn:aws:iam::123456789:role/opencost-cost-exporter"

# 资源配置
resources:
  requests:
    cpu: 100m
    memory: 256Mi
  limits:
    cpu: 500m
    memory: 1Gi
helm install opencost opencost/opencost \
  --namespace opencost \
  --create-namespace \
  -f opencost-values.yaml

验证部署:

kubectl get pods -n opencost
# NAME                        READY   STATUS    RESTARTS   AGE
# opencost-7d8c9b5f6-xxxxx    2/2     Running   0          2m

# 访问 API 验证
kubectl port-forward -n opencost svc/opencost 9003:9003 &
curl http://localhost:9003/allocation/compute?window=1d | jq '.data[0] | keys'

AWS 云厂商价格接入
#

OpenCost 通过 AWS Cost and Usage Report(CUR)或 Price List API 获取实例价格,确保成本计算准确反映实际账单。

IAM 权限配置(IRSA)
#

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ec2:DescribeInstances",
        "ec2:DescribeSpotPriceHistory",
        "pricing:GetProducts"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::my-cur-bucket",
        "arn:aws:s3:::my-cur-bucket/*"
      ]
    }
  ]
}

云厂商配置文件
#

在 Kubernetes Secret 中存储价格配置:

apiVersion: v1
kind: Secret
metadata:
  name: opencost-cloud-config
  namespace: opencost
stringData:
  cloud-integration.json: |
    {
      "aws": {
        "athenaBucketName": "s3://my-cur-bucket/opencost-athena-results",
        "athenaRegion": "us-east-1",
        "athenaDatabase": "athenacurcfn",
        "athenaTable": "cur_report",
        "projectID": "123456789012",
        "serviceKeyName": "",
        "serviceKeySecret": "",
        "spotDataRegion": "us-west-2",
        "spotDataBucket": "s3://my-spot-data-bucket",
        "spotDataPrefix": "spot-data-feed",
        "awsAccountId": "123456789012"
      }
    }

Spot 实例价格同步
#

Spot 价格实时变动,OpenCost 通过 Spot Data Feed 获取历史价格:

# 在 AWS 控制台启用 Spot Instance Data Feed,指向 S3 bucket
# OpenCost 会自动读取并计算加权平均价格

# 验证 Spot 价格是否正确加载
curl http://localhost:9003/spotFeed | jq '.[] | select(.node_type == "m5.xlarge")'

核心成本模型解析
#

OpenCost 的计费模型基于**资源请求量(Requests)**而非实际使用量,这个设计选择非常重要:

占用了资源就应该付费,不管有没有用到——这更公平地反映了 Pod 对集群容量的占用。

各资源计费方式
#

CPU

每 vCPU 小时价格 = 节点实例价格 / 节点 vCPU 数
Pod CPU 成本 = CPU Requests (cores) × 时长(小时) × 每 vCPU 小时价格

内存

每 GiB 小时价格 = 节点实例价格 × 内存权重比例 / 节点内存(GiB)
# 默认内存权重:约占实例价格的 40%(CPU:Memory ≈ 6:4)
Pod 内存成本 = Memory Requests (GiB) × 时长(小时) × 每 GiB 小时价格

存储(PVC)

PVC 成本 = 存储大小(GiB) × 时长(小时) × StorageClass 单价
# AWS gp3: $0.08/GiB/月 → $0.000111/GiB/小时

网络出流量

网络成本 = 出流量(GB) × 出流量单价
# AWS us-east-1 → Internet: $0.09/GB

查看成本分解
#

# 查看过去 24 小时各 Namespace 成本
curl "http://localhost:9003/allocation/compute?window=24h&aggregate=namespace" | \
  jq '.data[] | to_entries[] | {namespace: .key, cost: .value.totalCost}'

# 输出示例
# {"namespace": "team-a", "cost": 12.34}
# {"namespace": "team-b", "cost": 8.76}
# {"namespace": "monitoring", "cost": 25.10}

OpenCost API 使用详解
#

OpenCost 提供了强大的 HTTP API,可以按任意维度聚合查询成本。

按 Label 查询(实现跨 Namespace 的服务成本)
#

# 查询 app=payment-service 的过去 7 天成本,按 deployment 分组
curl "http://localhost:9003/allocation/compute?window=7d&aggregate=label:app&filter=label%5Bapp%5D%3A%22payment-service%22" | \
  jq '.data[] | .["payment-service"] | {totalCost, cpuCost, ramCost, pvCost}'

按 Deployment 查询
#

# 查询 production namespace 下所有 deployment 成本,按日汇总
curl "http://localhost:9003/allocation/compute?window=7d&aggregate=deployment&namespace=production&accumulate=false" | \
  jq '.data[] | to_entries[] | {deployment: .key, daily_cost: .value.totalCost}'

成本趋势查询(用于 Grafana)
#

# 过去 30 天,按天汇总,按团队 label 分组
curl "http://localhost:9003/allocation/compute?window=30d&step=1d&aggregate=label:team&accumulate=false" | \
  jq '[.data[] | to_entries[] | {date: .key, team: .value.name, cost: .value.totalCost}]'

自定义分摊规则:共享资源成本处理
#

监控栈(Prometheus、Grafana)、Ingress Controller、日志采集等基础组件的成本属于全局共享,需要合理分摊给各业务团队。

分摊策略设计
#

共享组件成本分摊 = 各团队按"CPU Request 占比"分摊

team_share = team_cpu_requests / total_business_cpu_requests × shared_cost

在 OpenCost 的 API 层面,通过 shareSplit 参数实现:

# 将 monitoring namespace 的成本按比例分摊到业务 namespace
curl "http://localhost:9003/allocation/compute?window=7d&aggregate=namespace&shareSplit=weighted&shareNamespaces=monitoring,logging,ingress-nginx"

Prometheus Recording Rule 实现自定义分摊
#

对于更复杂的分摊逻辑,通过 Prometheus Recording Rule 预计算:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: opencost-custom-allocation
  namespace: monitoring
spec:
  groups:
  - name: cost-allocation
    interval: 1h
    rules:
    # 计算每个 namespace 的 CPU request 占比
    - record: namespace:cpu_request_ratio:ratio
      expr: |
        sum(
          kube_pod_container_resource_requests{resource="cpu", namespace!~"monitoring|logging|kube-system"}
        ) by (namespace)
        /
        sum(
          kube_pod_container_resource_requests{resource="cpu", namespace!~"monitoring|logging|kube-system"}
        )

    # 计算共享成本的分摊额(单位:美元/小时)
    - record: namespace:shared_cost_allocation:usd_per_hour
      expr: |
        namespace:cpu_request_ratio:ratio
        *
        # 共享组件总成本(需要从 OpenCost metrics 获取)
        sum(
          opencost_allocation_total_cost{namespace=~"monitoring|logging|kube-system"}
        )

Grafana Dashboard 搭建
#

导入 OpenCost 官方 Dashboard
#

# OpenCost 官方 Dashboard ID: 20568 (Grafana.com)
# 或通过 configmap 部署
kubectl apply -f https://raw.githubusercontent.com/opencost/opencost/main/grafana/dashboards/opencost.json

自定义核心看板
#

看板一:实时成本趋势(按团队)

{
  "title": "Team Cost Trend - 30 Days",
  "type": "timeseries",
  "targets": [
    {
      "expr": "sum by (namespace) (rate(opencost_allocation_total_cost[1h]) * 3600)",
      "legendFormat": "{{namespace}}"
    }
  ],
  "fieldConfig": {
    "defaults": {
      "unit": "currencyUSD",
      "custom": {
        "fillOpacity": 20
      }
    }
  }
}

看板二:Top 10 高消费服务

Grafana 面板配置(使用 OpenCost API 作为数据源):

# 在 Grafana 中添加 OpenCost 作为 JSON API 数据源
# URL: http://opencost.opencost.svc:9003
# 查询:/allocation/compute?window=7d&aggregate=deployment&accumulate=true

# 然后使用 Table 面板展示,按 totalCost 降序排列

通过 Prometheus 指标实现 Top 10 Panel:

# Top 10 高消费 Deployment(过去 24 小时)
topk(10,
  sum by (deployment, namespace) (
    increase(opencost_allocation_total_cost[24h])
  )
)

看板三:按团队汇总(月度)

# 本月累计成本(按 team label 汇总)
sum by (label_team) (
  increase(opencost_allocation_total_cost[${__range}])
)
* on(label_team) group_left
  label_replace(vector(1), "label_team", "$1", "", "(.*)")

Grafana Dashboard YAML(GitOps 管理)
#

apiVersion: v1
kind: ConfigMap
metadata:
  name: opencost-dashboard
  namespace: monitoring
  labels:
    grafana_dashboard: "1"    # grafana-sidecar 自动加载
data:
  opencost-team-cost.json: |
    {
      "title": "Kubernetes Cost by Team",
      "uid": "k8s-cost-by-team",
      "tags": ["cost", "finops"],
      "timezone": "Asia/Shanghai",
      "panels": [
        {
          "id": 1,
          "title": "Monthly Cost by Team",
          "type": "barchart",
          "gridPos": {"h": 8, "w": 12, "x": 0, "y": 0},
          "targets": [{
            "expr": "sum by (label_team) (increase(opencost_allocation_total_cost[30d]))",
            "legendFormat": "{{label_team}}"
          }]
        },
        {
          "id": 2,
          "title": "Daily Cost Trend",
          "type": "timeseries",
          "gridPos": {"h": 8, "w": 12, "x": 12, "y": 0},
          "targets": [{
            "expr": "sum by (label_team) (rate(opencost_allocation_total_cost[1h]) * 24)",
            "legendFormat": "{{label_team}}"
          }]
        }
      ]
    }

成本异常告警:AlertManager 规则
#

设计原则
#

告警不应该基于绝对值(成本超过 X 美元),而应该基于环比异常(今天比昨天同期高 Y%),否则月初和月末的成本自然不同会产生大量误报。

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: cost-alerts
  namespace: monitoring
spec:
  groups:
  - name: kubernetes-cost
    rules:
    # 告警1:某 namespace 日成本环比暴涨(超过前 7 天均值的 50%)
    - alert: NamespaceCostSpike
      expr: |
        (
          sum by (namespace) (increase(opencost_allocation_total_cost[24h]))
          /
          sum by (namespace) (
            increase(opencost_allocation_total_cost[24h] offset 1d) +
            increase(opencost_allocation_total_cost[24h] offset 2d) +
            increase(opencost_allocation_total_cost[24h] offset 3d) +
            increase(opencost_allocation_total_cost[24h] offset 4d) +
            increase(opencost_allocation_total_cost[24h] offset 5d) +
            increase(opencost_allocation_total_cost[24h] offset 6d) +
            increase(opencost_allocation_total_cost[24h] offset 7d)
          ) * 7
        ) > 1.5
      for: 1h
      labels:
        severity: warning
        team: "{{ $labels.namespace }}"
      annotations:
        summary: "Namespace {{ $labels.namespace }} 成本异常"
        description: "{{ $labels.namespace }} 过去 24 小时成本是过去 7 天均值的 {{ $value | humanize }}x,请检查是否有资源泄漏。"

    # 告警2:月度预算超支预警(当月累计成本超过预算的 80%)
    - alert: MonthlyBudgetWarning
      expr: |
        # 当月累计成本(从月初开始)
        sum by (namespace) (
          increase(opencost_allocation_total_cost[${days_in_current_month}d])
        )
        >
        # 预算阈值(通过 ConfigMap 或 label 配置,这里用硬编码示例)
        on(namespace) group_left
        kube_namespace_labels{label_monthly_budget!=""} * 0
        + 800   # team-a 月预算 $1000,80% = $800
      for: 30m
      labels:
        severity: warning
      annotations:
        summary: "{{ $labels.namespace }} 月度预算即将超支"
        description: "当月已消费 ${{ $value | printf \"%.2f\" }},已超过月度预算的 80%。"

    # 告警3:单个 Pod 资源极度浪费(requests >> usage)
    - alert: PodResourceWaste
      expr: |
        (
          sum by (pod, namespace) (
            kube_pod_container_resource_requests{resource="cpu"}
          )
          -
          sum by (pod, namespace) (
            rate(container_cpu_usage_seconds_total[1h])
          )
        )
        /
        sum by (pod, namespace) (
          kube_pod_container_resource_requests{resource="cpu"}
        )
        > 0.8   # CPU 浪费超过 80%
      for: 2h
      labels:
        severity: info
      annotations:
        summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} 资源严重浪费"
        description: "CPU 请求量中 {{ $value | humanizePercentage }} 未被使用,建议降低 resource requests。"

AlertManager 路由配置
#

# alertmanager-config
route:
  receiver: default
  routes:
  - matchers:
    - alertname =~ "NamespaceCostSpike|MonthlyBudgetWarning"
    receiver: cost-alert-dingtalk
    group_wait: 10m
    group_interval: 4h    # 同类告警 4 小时聚合一次,避免刷屏
    repeat_interval: 24h

receivers:
- name: cost-alert-dingtalk
  webhook_configs:
  - url: "http://dingtalk-webhook.monitoring.svc:8060/dingtalk/cost-alert/send"
    send_resolved: true

与钉钉集成:每周自动成本报告
#

方案架构
#

CronJob (每周一 9:00)
  → 调用 OpenCost API 获取上周成本数据
  → 计算环比变化、Top 10 服务
  → 格式化为 Markdown
  → 推送钉钉群机器人

成本报告脚本
#

#!/usr/bin/env python3
# weekly_cost_report.py

import requests
import json
from datetime import datetime, timedelta
from typing import Dict, List

OPENCOST_API = "http://opencost.opencost.svc:9003"
DINGTALK_WEBHOOK = "https://oapi.dingtalk.com/robot/send?access_token=YOUR_TOKEN"

def get_cost_data(window: str, aggregate: str) -> Dict:
    resp = requests.get(
        f"{OPENCOST_API}/allocation/compute",
        params={
            "window": window,
            "aggregate": aggregate,
            "accumulate": "true"
        },
        timeout=30
    )
    resp.raise_for_status()
    return resp.json()

def format_cost_report() -> str:
    # 获取上周和上上周数据
    this_week = get_cost_data("lastweek", "namespace")
    prev_week = get_cost_data("week:-2", "namespace")

    this_week_data = this_week.get("data", [{}])[0]
    prev_week_data = prev_week.get("data", [{}])[0]

    # 计算总成本和环比
    total_this = sum(v.get("totalCost", 0) for v in this_week_data.values())
    total_prev = sum(v.get("totalCost", 0) for v in prev_week_data.values())
    change_pct = (total_this - total_prev) / total_prev * 100 if total_prev > 0 else 0

    # Top 10 高消费 Namespace
    sorted_ns = sorted(
        this_week_data.items(),
        key=lambda x: x[1].get("totalCost", 0),
        reverse=True
    )[:10]

    # 构建报告
    trend_emoji = "📈" if change_pct > 5 else ("📉" if change_pct < -5 else "➡️")

    lines = [
        f"## 📊 Kubernetes 成本周报",
        f"**统计周期**:上周({get_last_week_range()}\n",
        f"### 汇总",
        f"- 本周总成本:**${total_this:.2f}**",
        f"- 环比上周:{trend_emoji} **{change_pct:+.1f}%**(上周 ${total_prev:.2f}\n",
        f"### Top 10 高消费服务",
        "| Namespace | 本周成本 | CPU | 内存 | 存储 |",
        "|-----------|---------|-----|------|------|"
    ]

    for ns, data in sorted_ns:
        cpu_cost = data.get("cpuCost", 0)
        ram_cost = data.get("ramCost", 0)
        pv_cost = data.get("pvCost", 0)
        total = data.get("totalCost", 0)
        lines.append(
            f"| `{ns}` | **${total:.2f}** | ${cpu_cost:.2f} | ${ram_cost:.2f} | ${pv_cost:.2f} |"
        )

    # 检查异常(环比增幅超过 30% 的 namespace)
    anomalies = []
    for ns, data in this_week_data.items():
        prev_cost = prev_week_data.get(ns, {}).get("totalCost", 0)
        this_cost = data.get("totalCost", 0)
        if prev_cost > 0 and (this_cost - prev_cost) / prev_cost > 0.3:
            anomalies.append((ns, this_cost, prev_cost))

    if anomalies:
        lines.append(f"\n### ⚠️ 成本异常(环比增幅 >30%)")
        for ns, this_cost, prev_cost in sorted(anomalies, key=lambda x: -x[1]):
            pct = (this_cost - prev_cost) / prev_cost * 100
            lines.append(f"- `{ns}`:${this_cost:.2f}(+{pct:.0f}%,上周 ${prev_cost:.2f})")

    lines.append(f"\n> 详细数据:http://opencost.internal.example.com")

    return "\n".join(lines)

def get_last_week_range() -> str:
    today = datetime.now()
    last_monday = today - timedelta(days=today.weekday() + 7)
    last_sunday = last_monday + timedelta(days=6)
    return f"{last_monday.strftime('%m/%d')} - {last_sunday.strftime('%m/%d')}"

def send_to_dingtalk(content: str):
    payload = {
        "msgtype": "markdown",
        "markdown": {
            "title": "Kubernetes 成本周报",
            "text": content
        }
    }
    resp = requests.post(DINGTALK_WEBHOOK, json=payload, timeout=10)
    resp.raise_for_status()
    result = resp.json()
    if result.get("errcode") != 0:
        raise Exception(f"钉钉推送失败: {result}")

if __name__ == "__main__":
    report = format_cost_report()
    send_to_dingtalk(report)
    print("成本报告推送成功")

CronJob 部署
#

apiVersion: batch/v1
kind: CronJob
metadata:
  name: weekly-cost-report
  namespace: monitoring
spec:
  schedule: "0 9 * * 1"    # 每周一 09:00
  timeZone: "Asia/Shanghai"
  concurrencyPolicy: Forbid
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 3
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: cost-reporter
          containers:
          - name: reporter
            image: python:3.12-slim
            command:
            - /bin/sh
            - -c
            - |
              pip install requests -q && python /scripts/weekly_cost_report.py
            volumeMounts:
            - name: scripts
              mountPath: /scripts
            env:
            - name: DINGTALK_WEBHOOK
              valueFrom:
                secretKeyRef:
                  name: dingtalk-secrets
                  key: cost-report-webhook
          volumes:
          - name: scripts
            configMap:
              name: cost-report-scripts
          restartPolicy: OnFailure

生产落地经验
#

初始化阶段常见问题
#

问题一:Spot 实例价格显示为 0

原因:Spot Data Feed 未配置或 S3 权限不足。临时解法:

# 在 opencost configmap 中强制指定 spot 折扣率
kubectl patch configmap -n opencost opencost-conf --type=merge -p '
{
  "data": {
    "default-spot-cpu-discount": "0.7",
    "default-spot-ram-discount": "0.7"
  }
}'

问题二:PVC 成本为 0

原因:StorageClass 未配置价格。在 cloud-integration.json 中添加:

{
  "aws": {
    "storageClassPricing": {
      "gp3": {"storageGB": 0.08, "iopsPerGB": 0.005},
      "gp2": {"storageGB": 0.10},
      "io1": {"storageGB": 0.125, "iopsPerGB": 0.065}
    }
  }
}

问题三:成本数据延迟

OpenCost 默认每分钟抓取 Prometheus 数据,但实例价格缓存可能有 1 小时延迟。对于需要实时成本的场景,可以降低缓存刷新间隔:

# opencost deployment env
- name: CLOUD_PROVIDER_REFRESH_MINUTES
  value: "15"

成本优化闭环
#

数据可见性只是第一步,关键是建立优化闭环:

  1. 每周报告 → 识别高成本、高浪费服务
  2. VPA 推荐 → 对浪费严重的服务自动推荐合理的 resource requests
  3. 开发团队确认 → 走 PR 审批调整 requests
  4. 持续监控 → 跟踪调整效果,形成 FinOps 文化

一个实际的效果参考:某 16 节点生产集群,通过 3 个月的 OpenCost 驱动优化,将集群平均资源利用率从 22% 提升到 48%,对应节省了约 30% 的节点成本(约 $3,200/月)。


OpenCost 本身并不复杂,真正有用的是它把"成本"这件事从财务的月底账单拉到了工程师每天看的 Grafana 里。一旦每个团队能看到自己的成本曲线、每次调 requests 和扩副本对账单的直接影响,FinOps 才不是 PPT。

Wenzhuo Huang
作者
Wenzhuo Huang
搞运维的工程师,写代码的运维人。专注 Kubernetes、AWS、GitOps 与基础设施可靠性。这个博客既是我的技术笔记本,也是我踩过的坑的受害者档案。

相关文章

FinOps 实践:Kubernetes 成本治理体系建设

·1316 字·7 分钟
一套完整的 Kubernetes FinOps 落地路径:如何识别僵尸资源、配置成本分摊模型、利用 Karpenter 降低节点成本,以及如何将月账单从 $50k 压到 $30k。

Kubernetes 成本优化实战:系统性降本的四条路径

·1066 字·6 分钟
真实的降本案例:从发现成本异常到分析根因,通过 Karpenter 节点弹性伸缩、资源请求规格治理、大机型收敛等手段,系统性降低 AWS EC2 成本。包含具体配置和执行思路。

告警带图实战:Grafana Render + 钉钉推送趋势图

·1160 字·6 分钟
收到告警只有一行数字,还要登录 Grafana 才能看趋势图——这是告警体验最大的痛点之一。本文介绍如何将 Grafana Image Renderer 与 Alertmanager Webhook 结合,实现告警消息自动附带趋势图的完整方案。