OpenCost 实战：Kubernetes 成本可见性与多团队费用分摊

在 Kubernetes 中，一个 kubectl apply 就能消耗几百美元的云资源，但账单却只显示一个 EC2 或 ECS 集群的总费用。谁在用、用了多少、用在哪里——这三个问题在多团队共享集群的场景下几乎无法从云账单直接回答。

这就是 Kubernetes 成本不透明的根因，也是 OpenCost 要解决的问题。

成本不透明的根因分析
#

共享节点的代价
#

传统 VM 时代，一个 VM 对应一个账单条目，归属清晰。Kubernetes 上，多个 Namespace 的 Pod 共享同一批节点，节点成本无法直接归因到某个团队或服务。

典型问题场景：

资源超申请（Over-provisioning）：team-a 的服务 requests 了 8 核，实际用了 2 核，节点上 30% 的算力被空占
共享基础组件：Ingress Controller、Prometheus、日志采集器的成本属于"公共基础设施"，应该按比例分摊给各团队，但传统方案做不到
Spot 实例混用：on-demand 和 spot 节点混用，不同实例类型的单价差异巨大，简单平均会严重失真

为什么需要专门的工具
#

云厂商账单（AWS Cost Explorer、阿里云费用中心）只能到实例/资源组级别，无法下钻到 Pod、Namespace、Label 维度。自己写脚本计算成本模型需要：采集资源使用量、查询实例价格 API、处理 spot 价格波动、处理 PVC 存储计费——工程量不小且难以准确。

OpenCost vs Kubecost：选型边界
#

能力	OpenCost（开源）	Kubecost 付费版
实时成本查询（Namespace/Pod/Label）	✅	✅
多云价格接入	✅	✅
Grafana Dashboard	✅	✅
多集群统一视图	❌	✅
成本预算与告警 UI	❌（需自建）	✅
自定义分摊规则 UI	❌（需自建）	✅
SAML/SSO	❌	✅
网络成本细分	有限	✅
支持	社区	商业

结论：单集群、技术团队自运维、愿意写 Prometheus 规则——OpenCost 完全够用。多集群统一管理、需要非技术管理者查看成本 UI、有合规报表需求——考虑 Kubecost。

部署 OpenCost
#

前置依赖
#

OpenCost 需要 Prometheus 作为存储后端。如果已有 kube-prometheus-stack，可以直接复用；没有的话一并安装。

# 添加 Helm 仓库
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo add opencost https://opencost.github.io/opencost-helm-chart
helm repo update

安装 kube-prometheus-stack（如未安装）
#

helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set prometheus.prometheusSpec.retention=30d \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.storageClassName=gp3 \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=100Gi

安装 OpenCost
#

创建 values.yaml：

# opencost-values.yaml
opencost:
  exporter:
    # 云厂商：aws, azure, gcp, alibabacloud
    cloudProviderApiKey: ""   # 不需要，用 IRSA 或 IAM Role
    defaultClusterId: "prod-us-west-2"

  prometheus:
    internal:
      enabled: false          # 使用外部 Prometheus
    external:
      enabled: true
      url: "http://kube-prometheus-stack-prometheus.monitoring.svc:9090"

  ui:
    enabled: true
    ingress:
      enabled: true
      ingressClassName: nginx
      hosts:
      - host: opencost.internal.example.com
        paths:
        - /

  # AWS 成本配置（通过 IRSA）
  serviceAccount:
    annotations:
      eks.amazonaws.com/role-arn: "arn:aws:iam::123456789:role/opencost-cost-exporter"

# 资源配置
resources:
  requests:
    cpu: 100m
    memory: 256Mi
  limits:
    cpu: 500m
    memory: 1Gi

helm install opencost opencost/opencost \
  --namespace opencost \
  --create-namespace \
  -f opencost-values.yaml

验证部署：

kubectl get pods -n opencost
# NAME                        READY   STATUS    RESTARTS   AGE
# opencost-7d8c9b5f6-xxxxx    2/2     Running   0          2m

# 访问 API 验证
kubectl port-forward -n opencost svc/opencost 9003:9003 &
curl http://localhost:9003/allocation/compute?window=1d | jq '.data[0] | keys'

AWS 云厂商价格接入
#

OpenCost 通过 AWS Cost and Usage Report（CUR）或 Price List API 获取实例价格，确保成本计算准确反映实际账单。

IAM 权限配置（IRSA）
#

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ec2:DescribeInstances",
        "ec2:DescribeSpotPriceHistory",
        "pricing:GetProducts"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::my-cur-bucket",
        "arn:aws:s3:::my-cur-bucket/*"
      ]
    }
  ]
}

云厂商配置文件
#

在 Kubernetes Secret 中存储价格配置：

apiVersion: v1
kind: Secret
metadata:
  name: opencost-cloud-config
  namespace: opencost
stringData:
  cloud-integration.json: |
    {
      "aws": {
        "athenaBucketName": "s3://my-cur-bucket/opencost-athena-results",
        "athenaRegion": "us-east-1",
        "athenaDatabase": "athenacurcfn",
        "athenaTable": "cur_report",
        "projectID": "123456789012",
        "serviceKeyName": "",
        "serviceKeySecret": "",
        "spotDataRegion": "us-west-2",
        "spotDataBucket": "s3://my-spot-data-bucket",
        "spotDataPrefix": "spot-data-feed",
        "awsAccountId": "123456789012"
      }
    }

Spot 实例价格同步
#

Spot 价格实时变动，OpenCost 通过 Spot Data Feed 获取历史价格：

# 在 AWS 控制台启用 Spot Instance Data Feed，指向 S3 bucket
# OpenCost 会自动读取并计算加权平均价格

# 验证 Spot 价格是否正确加载
curl http://localhost:9003/spotFeed | jq '.[] | select(.node_type == "m5.xlarge")'

核心成本模型解析
#

OpenCost 的计费模型基于**资源请求量（Requests）**而非实际使用量，这个设计选择非常重要：

占用了资源就应该付费，不管有没有用到——这更公平地反映了 Pod 对集群容量的占用。

各资源计费方式
#

CPU：

每 vCPU 小时价格 = 节点实例价格 / 节点 vCPU 数
Pod CPU 成本 = CPU Requests (cores) × 时长(小时) × 每 vCPU 小时价格

内存：

每 GiB 小时价格 = 节点实例价格 × 内存权重比例 / 节点内存(GiB)
# 默认内存权重：约占实例价格的 40%（CPU:Memory ≈ 6:4）
Pod 内存成本 = Memory Requests (GiB) × 时长(小时) × 每 GiB 小时价格

存储（PVC）：

PVC 成本 = 存储大小(GiB) × 时长(小时) × StorageClass 单价
# AWS gp3: $0.08/GiB/月 → $0.000111/GiB/小时

网络出流量：

网络成本 = 出流量(GB) × 出流量单价
# AWS us-east-1 → Internet: $0.09/GB

查看成本分解
#

# 查看过去 24 小时各 Namespace 成本
curl "http://localhost:9003/allocation/compute?window=24h&aggregate=namespace" | \
  jq '.data[] | to_entries[] | {namespace: .key, cost: .value.totalCost}'

# 输出示例
# {"namespace": "team-a", "cost": 12.34}
# {"namespace": "team-b", "cost": 8.76}
# {"namespace": "monitoring", "cost": 25.10}

OpenCost API 使用详解
#

OpenCost 提供了强大的 HTTP API，可以按任意维度聚合查询成本。

按 Label 查询（实现跨 Namespace 的服务成本）
#

# 查询 app=payment-service 的过去 7 天成本，按 deployment 分组
curl "http://localhost:9003/allocation/compute?window=7d&aggregate=label:app&filter=label%5Bapp%5D%3A%22payment-service%22" | \
  jq '.data[] | .["payment-service"] | {totalCost, cpuCost, ramCost, pvCost}'

按 Deployment 查询
#

# 查询 production namespace 下所有 deployment 成本，按日汇总
curl "http://localhost:9003/allocation/compute?window=7d&aggregate=deployment&namespace=production&accumulate=false" | \
  jq '.data[] | to_entries[] | {deployment: .key, daily_cost: .value.totalCost}'

成本趋势查询（用于 Grafana）
#

# 过去 30 天，按天汇总，按团队 label 分组
curl "http://localhost:9003/allocation/compute?window=30d&step=1d&aggregate=label:team&accumulate=false" | \
  jq '[.data[] | to_entries[] | {date: .key, team: .value.name, cost: .value.totalCost}]'

自定义分摊规则：共享资源成本处理
#

监控栈（Prometheus、Grafana）、Ingress Controller、日志采集等基础组件的成本属于全局共享，需要合理分摊给各业务团队。

分摊策略设计
#

共享组件成本分摊 = 各团队按"CPU Request 占比"分摊

team_share = team_cpu_requests / total_business_cpu_requests × shared_cost

在 OpenCost 的 API 层面，通过 shareSplit 参数实现：

# 将 monitoring namespace 的成本按比例分摊到业务 namespace
curl "http://localhost:9003/allocation/compute?window=7d&aggregate=namespace&shareSplit=weighted&shareNamespaces=monitoring,logging,ingress-nginx"

Prometheus Recording Rule 实现自定义分摊
#

对于更复杂的分摊逻辑，通过 Prometheus Recording Rule 预计算：

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: opencost-custom-allocation
  namespace: monitoring
spec:
  groups:
  - name: cost-allocation
    interval: 1h
    rules:
    # 计算每个 namespace 的 CPU request 占比
    - record: namespace:cpu_request_ratio:ratio
      expr: |
        sum(
          kube_pod_container_resource_requests{resource="cpu", namespace!~"monitoring|logging|kube-system"}
        ) by (namespace)
        /
        sum(
          kube_pod_container_resource_requests{resource="cpu", namespace!~"monitoring|logging|kube-system"}
        )

    # 计算共享成本的分摊额（单位：美元/小时）
    - record: namespace:shared_cost_allocation:usd_per_hour
      expr: |
        namespace:cpu_request_ratio:ratio
        *
        # 共享组件总成本（需要从 OpenCost metrics 获取）
        sum(
          opencost_allocation_total_cost{namespace=~"monitoring|logging|kube-system"}
        )

Grafana Dashboard 搭建
#

导入 OpenCost 官方 Dashboard
#

# OpenCost 官方 Dashboard ID: 20568 (Grafana.com)
# 或通过 configmap 部署
kubectl apply -f https://raw.githubusercontent.com/opencost/opencost/main/grafana/dashboards/opencost.json

自定义核心看板
#

看板一：实时成本趋势（按团队）

{
  "title": "Team Cost Trend - 30 Days",
  "type": "timeseries",
  "targets": [
    {
      "expr": "sum by (namespace) (rate(opencost_allocation_total_cost[1h]) * 3600)",
      "legendFormat": "{{namespace}}"
    }
  ],
  "fieldConfig": {
    "defaults": {
      "unit": "currencyUSD",
      "custom": {
        "fillOpacity": 20
      }
    }
  }
}

看板二：Top 10 高消费服务

Grafana 面板配置（使用 OpenCost API 作为数据源）：

# 在 Grafana 中添加 OpenCost 作为 JSON API 数据源
# URL: http://opencost.opencost.svc:9003
# 查询：/allocation/compute?window=7d&aggregate=deployment&accumulate=true

# 然后使用 Table 面板展示，按 totalCost 降序排列

通过 Prometheus 指标实现 Top 10 Panel：

# Top 10 高消费 Deployment（过去 24 小时）
topk(10,
  sum by (deployment, namespace) (
    increase(opencost_allocation_total_cost[24h])
  )
)

看板三：按团队汇总（月度）

# 本月累计成本（按 team label 汇总）
sum by (label_team) (
  increase(opencost_allocation_total_cost[${__range}])
)
* on(label_team) group_left
  label_replace(vector(1), "label_team", "$1", "", "(.*)")

Grafana Dashboard YAML（GitOps 管理）
#

apiVersion: v1
kind: ConfigMap
metadata:
  name: opencost-dashboard
  namespace: monitoring
  labels:
    grafana_dashboard: "1"    # grafana-sidecar 自动加载
data:
  opencost-team-cost.json: |
    {
      "title": "Kubernetes Cost by Team",
      "uid": "k8s-cost-by-team",
      "tags": ["cost", "finops"],
      "timezone": "Asia/Shanghai",
      "panels": [
        {
          "id": 1,
          "title": "Monthly Cost by Team",
          "type": "barchart",
          "gridPos": {"h": 8, "w": 12, "x": 0, "y": 0},
          "targets": [{
            "expr": "sum by (label_team) (increase(opencost_allocation_total_cost[30d]))",
            "legendFormat": "{{label_team}}"
          }]
        },
        {
          "id": 2,
          "title": "Daily Cost Trend",
          "type": "timeseries",
          "gridPos": {"h": 8, "w": 12, "x": 12, "y": 0},
          "targets": [{
            "expr": "sum by (label_team) (rate(opencost_allocation_total_cost[1h]) * 24)",
            "legendFormat": "{{label_team}}"
          }]
        }
      ]
    }

成本异常告警：AlertManager 规则
#

设计原则
#

告警不应该基于绝对值（成本超过 X 美元），而应该基于环比异常（今天比昨天同期高 Y%），否则月初和月末的成本自然不同会产生大量误报。

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: cost-alerts
  namespace: monitoring
spec:
  groups:
  - name: kubernetes-cost
    rules:
    # 告警1：某 namespace 日成本环比暴涨（超过前 7 天均值的 50%）
    - alert: NamespaceCostSpike
      expr: |
        (
          sum by (namespace) (increase(opencost_allocation_total_cost[24h]))
          /
          sum by (namespace) (
            increase(opencost_allocation_total_cost[24h] offset 1d) +
            increase(opencost_allocation_total_cost[24h] offset 2d) +
            increase(opencost_allocation_total_cost[24h] offset 3d) +
            increase(opencost_allocation_total_cost[24h] offset 4d) +
            increase(opencost_allocation_total_cost[24h] offset 5d) +
            increase(opencost_allocation_total_cost[24h] offset 6d) +
            increase(opencost_allocation_total_cost[24h] offset 7d)
          ) * 7
        ) > 1.5
      for: 1h
      labels:
        severity: warning
        team: "{{ $labels.namespace }}"
      annotations:
        summary: "Namespace {{ $labels.namespace }} 成本异常"
        description: "{{ $labels.namespace }} 过去 24 小时成本是过去 7 天均值的 {{ $value | humanize }}x，请检查是否有资源泄漏。"

    # 告警2：月度预算超支预警（当月累计成本超过预算的 80%）
    - alert: MonthlyBudgetWarning
      expr: |
        # 当月累计成本（从月初开始）
        sum by (namespace) (
          increase(opencost_allocation_total_cost[${days_in_current_month}d])
        )
        >
        # 预算阈值（通过 ConfigMap 或 label 配置，这里用硬编码示例）
        on(namespace) group_left
        kube_namespace_labels{label_monthly_budget!=""} * 0
        + 800   # team-a 月预算 $1000，80% = $800
      for: 30m
      labels:
        severity: warning
      annotations:
        summary: "{{ $labels.namespace }} 月度预算即将超支"
        description: "当月已消费 ${{ $value | printf \"%.2f\" }}，已超过月度预算的 80%。"

    # 告警3：单个 Pod 资源极度浪费（requests >> usage）
    - alert: PodResourceWaste
      expr: |
        (
          sum by (pod, namespace) (
            kube_pod_container_resource_requests{resource="cpu"}
          )
          -
          sum by (pod, namespace) (
            rate(container_cpu_usage_seconds_total[1h])
          )
        )
        /
        sum by (pod, namespace) (
          kube_pod_container_resource_requests{resource="cpu"}
        )
        > 0.8   # CPU 浪费超过 80%
      for: 2h
      labels:
        severity: info
      annotations:
        summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} 资源严重浪费"
        description: "CPU 请求量中 {{ $value | humanizePercentage }} 未被使用，建议降低 resource requests。"

AlertManager 路由配置
#

# alertmanager-config
route:
  receiver: default
  routes:
  - matchers:
    - alertname =~ "NamespaceCostSpike|MonthlyBudgetWarning"
    receiver: cost-alert-dingtalk
    group_wait: 10m
    group_interval: 4h    # 同类告警 4 小时聚合一次，避免刷屏
    repeat_interval: 24h

receivers:
- name: cost-alert-dingtalk
  webhook_configs:
  - url: "http://dingtalk-webhook.monitoring.svc:8060/dingtalk/cost-alert/send"
    send_resolved: true

与钉钉集成：每周自动成本报告
#

方案架构
#

CronJob (每周一 9:00)
  → 调用 OpenCost API 获取上周成本数据
  → 计算环比变化、Top 10 服务
  → 格式化为 Markdown
  → 推送钉钉群机器人

成本报告脚本
#

#!/usr/bin/env python3
# weekly_cost_report.py

import requests
import json
from datetime import datetime, timedelta
from typing import Dict, List

OPENCOST_API = "http://opencost.opencost.svc:9003"
DINGTALK_WEBHOOK = "https://oapi.dingtalk.com/robot/send?access_token=YOUR_TOKEN"

def get_cost_data(window: str, aggregate: str) -> Dict:
    resp = requests.get(
        f"{OPENCOST_API}/allocation/compute",
        params={
            "window": window,
            "aggregate": aggregate,
            "accumulate": "true"
        },
        timeout=30
    )
    resp.raise_for_status()
    return resp.json()

def format_cost_report() -> str:
    # 获取上周和上上周数据
    this_week = get_cost_data("lastweek", "namespace")
    prev_week = get_cost_data("week:-2", "namespace")

    this_week_data = this_week.get("data", [{}])[0]
    prev_week_data = prev_week.get("data", [{}])[0]

    # 计算总成本和环比
    total_this = sum(v.get("totalCost", 0) for v in this_week_data.values())
    total_prev = sum(v.get("totalCost", 0) for v in prev_week_data.values())
    change_pct = (total_this - total_prev) / total_prev * 100 if total_prev > 0 else 0

    # Top 10 高消费 Namespace
    sorted_ns = sorted(
        this_week_data.items(),
        key=lambda x: x[1].get("totalCost", 0),
        reverse=True
    )[:10]

    # 构建报告
    trend_emoji = "📈" if change_pct > 5 else ("📉" if change_pct < -5 else "➡️")

    lines = [
        f"## 📊 Kubernetes 成本周报",
        f"**统计周期**：上周（{get_last_week_range()}）\n",
        f"### 汇总",
        f"- 本周总成本：**${total_this:.2f}**",
        f"- 环比上周：{trend_emoji} **{change_pct:+.1f}%**（上周 ${total_prev:.2f}）\n",
        f"### Top 10 高消费服务",
        "| Namespace | 本周成本 | CPU | 内存 | 存储 |",
        "|-----------|---------|-----|------|------|"
    ]

    for ns, data in sorted_ns:
        cpu_cost = data.get("cpuCost", 0)
        ram_cost = data.get("ramCost", 0)
        pv_cost = data.get("pvCost", 0)
        total = data.get("totalCost", 0)
        lines.append(
            f"| `{ns}` | **${total:.2f}** | ${cpu_cost:.2f} | ${ram_cost:.2f} | ${pv_cost:.2f} |"
        )

    # 检查异常（环比增幅超过 30% 的 namespace）
    anomalies = []
    for ns, data in this_week_data.items():
        prev_cost = prev_week_data.get(ns, {}).get("totalCost", 0)
        this_cost = data.get("totalCost", 0)
        if prev_cost > 0 and (this_cost - prev_cost) / prev_cost > 0.3:
            anomalies.append((ns, this_cost, prev_cost))

    if anomalies:
        lines.append(f"\n### ⚠️ 成本异常（环比增幅 >30%）")
        for ns, this_cost, prev_cost in sorted(anomalies, key=lambda x: -x[1]):
            pct = (this_cost - prev_cost) / prev_cost * 100
            lines.append(f"- `{ns}`：${this_cost:.2f}（+{pct:.0f}%，上周 ${prev_cost:.2f}）")

    lines.append(f"\n> 详细数据：http://opencost.internal.example.com")

    return "\n".join(lines)

def get_last_week_range() -> str:
    today = datetime.now()
    last_monday = today - timedelta(days=today.weekday() + 7)
    last_sunday = last_monday + timedelta(days=6)
    return f"{last_monday.strftime('%m/%d')} - {last_sunday.strftime('%m/%d')}"

def send_to_dingtalk(content: str):
    payload = {
        "msgtype": "markdown",
        "markdown": {
            "title": "Kubernetes 成本周报",
            "text": content
        }
    }
    resp = requests.post(DINGTALK_WEBHOOK, json=payload, timeout=10)
    resp.raise_for_status()
    result = resp.json()
    if result.get("errcode") != 0:
        raise Exception(f"钉钉推送失败: {result}")

if __name__ == "__main__":
    report = format_cost_report()
    send_to_dingtalk(report)
    print("成本报告推送成功")

CronJob 部署
#

apiVersion: batch/v1
kind: CronJob
metadata:
  name: weekly-cost-report
  namespace: monitoring
spec:
  schedule: "0 9 * * 1"    # 每周一 09:00
  timeZone: "Asia/Shanghai"
  concurrencyPolicy: Forbid
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 3
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: cost-reporter
          containers:
          - name: reporter
            image: python:3.12-slim
            command:
            - /bin/sh
            - -c
            - |
              pip install requests -q && python /scripts/weekly_cost_report.py
            volumeMounts:
            - name: scripts
              mountPath: /scripts
            env:
            - name: DINGTALK_WEBHOOK
              valueFrom:
                secretKeyRef:
                  name: dingtalk-secrets
                  key: cost-report-webhook
          volumes:
          - name: scripts
            configMap:
              name: cost-report-scripts
          restartPolicy: OnFailure

生产落地经验
#

初始化阶段常见问题
#

问题一：Spot 实例价格显示为 0

原因：Spot Data Feed 未配置或 S3 权限不足。临时解法：

# 在 opencost configmap 中强制指定 spot 折扣率
kubectl patch configmap -n opencost opencost-conf --type=merge -p '
{
  "data": {
    "default-spot-cpu-discount": "0.7",
    "default-spot-ram-discount": "0.7"
  }
}'

问题二：PVC 成本为 0

原因：StorageClass 未配置价格。在 cloud-integration.json 中添加：

{
  "aws": {
    "storageClassPricing": {
      "gp3": {"storageGB": 0.08, "iopsPerGB": 0.005},
      "gp2": {"storageGB": 0.10},
      "io1": {"storageGB": 0.125, "iopsPerGB": 0.065}
    }
  }
}

问题三：成本数据延迟

OpenCost 默认每分钟抓取 Prometheus 数据，但实例价格缓存可能有 1 小时延迟。对于需要实时成本的场景，可以降低缓存刷新间隔：

# opencost deployment env
- name: CLOUD_PROVIDER_REFRESH_MINUTES
  value: "15"

成本优化闭环
#

数据可见性只是第一步，关键是建立优化闭环：

每周报告 → 识别高成本、高浪费服务
VPA 推荐 → 对浪费严重的服务自动推荐合理的 resource requests
开发团队确认 → 走 PR 审批调整 requests
持续监控 → 跟踪调整效果，形成 FinOps 文化

一个实际的效果参考：某 16 节点生产集群，通过 3 个月的 OpenCost 驱动优化，将集群平均资源利用率从 22% 提升到 48%，对应节省了约 30% 的节点成本（约 $3,200/月）。

OpenCost 本身并不复杂，真正有用的是它把"成本"这件事从财务的月底账单拉到了工程师每天看的 Grafana 里。一旦每个团队能看到自己的成本曲线、每次调 requests 和扩副本对账单的直接影响，FinOps 才不是 PPT。

作者

Wenzhuo Huang

搞运维的工程师，写代码的运维人。专注 Kubernetes、AWS、GitOps 与基础设施可靠性。这个博客既是我的技术笔记本，也是我踩过的坑的受害者档案。

成本不透明的根因分析#

共享节点的代价#

为什么需要专门的工具#

OpenCost vs Kubecost：选型边界#

部署 OpenCost#

前置依赖#

安装 kube-prometheus-stack（如未安装）#

安装 OpenCost#

AWS 云厂商价格接入#

IAM 权限配置（IRSA）#

云厂商配置文件#

Spot 实例价格同步#

核心成本模型解析#

各资源计费方式#

查看成本分解#

OpenCost API 使用详解#

按 Label 查询（实现跨 Namespace 的服务成本）#

按 Deployment 查询#

成本趋势查询（用于 Grafana）#

自定义分摊规则：共享资源成本处理#

分摊策略设计#

Prometheus Recording Rule 实现自定义分摊#

Grafana Dashboard 搭建#

导入 OpenCost 官方 Dashboard#

自定义核心看板#

Grafana Dashboard YAML（GitOps 管理）#

成本异常告警：AlertManager 规则#

设计原则#

AlertManager 路由配置#

与钉钉集成：每周自动成本报告#

方案架构#

成本报告脚本#

CronJob 部署#

生产落地经验#

初始化阶段常见问题#

成本优化闭环#

相关文章