Kubernetes Resource Management: requests, limits, QoS, and Quotas

requests vs. limits: the core difference

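requests: the basis for scheduling (affects node selection)
  → For CPU and memory, kube-scheduler looks only at requests when deciding which node a Pod lands on
  → Remaining schedulable resources = node allocatable - sum of the requests of all Pods on the node
    (node allocatable = node capacity - system/kubelet reservations - eviction threshold)

limits: the runtime ceiling (affects actual usage)
  → kubelet enforces actual usage through cgroups
  → CPU over the limit: the container is throttled (not killed)
  → Memory over the limit: the process is OOMKilled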

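resources:
  requests:
    cpu: "500m"       # reserve 0.5 core at scheduling time (0.5 CPU = 500 milliCPU)
    memory: "512Mi"   # reserve 512Mi of memory at scheduling time
  limits:
    cpu: "2"          # use at most 2 cores at runtime
    memory: "1Gi"     # use at most 1Gi at runtime; exceeding it triggers an OOMKill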

# Inspect node resources (allocated vs. total)
kubectl describe node <node-name> | grep -A15 "Allocated resources"

# Sample output:
# Allocated resources:
#   Resource           Requests     Limits
#   --------           --------     ------
#   cpu                6280m (78%)  12200m (152%)    ← limits may be overcommitted; total requests cannot exceed 100%
#   memory             12Gi (75%)   18Gi (112%)
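The scheduling arithmetic above can be sketched in a few lines of Python. This is only an illustration, not the scheduler's implementation: the helper names (cpu_to_millicores, memory_to_bytes, fits) are invented for this example, and the 8000m allocatable figure is an assumption inferred from the 6280m (78%) line in the sample output.

```python
from decimal import Decimal

# Suffixes accepted for memory quantities (binary first, so "Mi" wins over "M").
_MEM_SUFFIXES = {
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
}


def cpu_to_millicores(quantity):
    """Convert a CPU quantity such as "500m" or "2" to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(Decimal(quantity) * 1000)


def memory_to_bytes(quantity):
    """Convert a memory quantity such as "512Mi" or "1Gi" to bytes."""
    for suffix, factor in _MEM_SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(Decimal(quantity[: -len(suffix)]) * factor)
    return int(Decimal(quantity))


def fits(allocatable_m, already_requested_m, pod_cpu_request):
    """True if the Pod's CPU request fits into the remaining schedulable CPU."""
    return already_requested_m + cpu_to_millicores(pod_cpu_request) <= allocatable_m


# Assumed node: 8000m allocatable, 6280m already requested (from the sample output).
print(fits(8000, 6280, "500m"))  # True: 6780m <= 8000m
print(fits(8000, 6280, "2"))     # False: 8280m > 8000m
```

Only requests enter this check. The 12200m of limits on the same node never appears in it, which is why limits can exceed 100%.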

CPU throttling mechanism (CFS quota)

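Linux CFS (Completely Fair Scheduler) enforces CPU limits through cpu.cfs_quota_us and cpu.cfs_period_us (cgroup v1 names; cgroup v2 exposes the same pair through cpu.max):

period = 100ms (default)
quota  = limits.cpu × period

Example: limits.cpu = "2"
quota = 2 × 100ms = 200ms
Meaning: within every 100ms period, the container may use at most 200ms of CPU time, summed across all of its threads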
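The quota formula is easy to check numerically. The sketch below is an illustration under stated assumptions: the function names are invented here, and the second helper assumes every thread is fully busy and the node has at least as many free cores as the container has threads.

```python
def cfs_quota_us(cpu_limit_millicores, period_us=100_000):
    """CFS quota in microseconds for a CPU limit given in millicores."""
    return cpu_limit_millicores * period_us // 1000


def wall_time_before_throttle_us(quota_us, busy_threads, period_us=100_000):
    """Wall-clock time per period before the quota runs out.

    Assumes all threads run flat out on separate cores (a simplification).
    """
    return min(period_us, quota_us // busy_threads)


quota = cfs_quota_us(2000)                     # limits.cpu = "2"
print(quota)                                   # 200000 (200ms per 100ms period)
print(wall_time_before_throttle_us(quota, 8))  # 25000: 8 busy threads stall for the other 75ms
```

This is why a heavily multi-threaded service can show latency spikes from throttling even when its average CPU usage looks well below the limit.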

Performance impact of CPU throttling

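# Check whether a container is being throttled (run on the node that hosts it)
# Locate the container's cgroup path (cgroup v1 layout shown; the exact path depends on the cgroup driver and version)
cat /sys/fs/cgroup/cpu/kubepods/pod<pod-uid>/<container-id>/cpu.stat
# Fields to watch:
# nr_periods: total number of enforcement periods
# nr_throttled: number of periods in which the container was throttled
# throttled_time: total time spent throttled (nanoseconds)

# throttle ratio = nr_throttled / nr_periods × 100%
# Production guideline: if the throttle ratio exceeds 5%, raise limits or optimize the code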

# View CPU throttling through Prometheus (requires cAdvisor metrics)
# Metrics: container_cpu_cfs_throttled_periods_total / container_cpu_cfs_periods_total
rate(container_cpu_cfs_throttled_periods_total{namespace="production"}[5m])
/ rate(container_cpu_cfs_periods_total{namespace="production"}[5m])
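For a quick check on a node without Prometheus, the throttle ratio can be computed straight from the contents of cpu.stat. The sketch below assumes the file has already been read into a string; the function name and the sample numbers are made up for illustration.

```python
def throttle_ratio(cpu_stat_text):
    """Return nr_throttled / nr_periods from the text of a cpu.stat file."""
    stats = {}
    for line in cpu_stat_text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    periods = stats.get("nr_periods", 0)
    if periods == 0:
        return 0.0  # no enforcement periods recorded (for example, no CPU limit set)
    return stats.get("nr_throttled", 0) / periods


# Hypothetical cpu.stat contents, not taken from a real node.
SAMPLE = """nr_periods 1200
nr_throttled 90
throttled_time 4500000000
"""

ratio = throttle_ratio(SAMPLE)
print(f"{ratio:.1%}")  # 7.5%, above the 5% guideline
```

The counters are cumulative since the cgroup was created, so comparing two readings taken a few minutes apart gives a better picture of current behaviour than a single snapshot.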

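Common pitfall: limits.cpu is set very high while requests.cpu is set very low. The scheduler only counts requests, so it packs too many Pods onto the node. When they burst at the same time the node is overloaded: every Pod competes for CPU and slows down, and the ones that do reach their limits are throttled as well.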

Memory OOMKill mechanism

When memory usage exceeds limits, the Linux OOM killer kills the process outright and the container is restarted according to the Pod's restartPolicy:

# View OOMKill history
kubectl describe pod <pod-name> -n production | grep -A5 "OOMKilled"
# State: Terminated
#   Reason: OOMKilled
#   Exit Code: 137

# View kernel OOM logs (run on the node)
dmesg | grep -i "out of memory"
dmesg | grep -i oom | tail -20

# View why the container last restarted
kubectl get pod <pod-name> -n production -o jsonpath='{.status.containerStatuses[0].lastState}'

# Monitor OOMKill events in Prometheus (metric from kube-state-metrics)
kube_pod_container_status_last_terminated_reason{reason="OOMKilled"}

# Analyze memory usage per container (requires metrics-server)
kubectl top pod <pod-name> -n production --containers

# Rank Pods by current memory usage (kubectl top shows point-in-time values, not history)
kubectl top pod -n production --sort-by=memory | head -20

The three QoS classes

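QoS class	Criteria	Resource guarantee	Eviction priority
Guaranteed	Every container sets both CPU and memory with requests = limits	Strongest	Evicted last
Burstable	At least one container sets a request or limit, but the Pod does not qualify as Guaranteed	Medium	Evicted in between
BestEffort	No container sets any requests or limits	Weakest	Evicted first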

Criteria in detail

# Guaranteed: every container must satisfy both cpu requests = limits and memory requests = limits
spec:
  containers:
    - name: app
      resources:
        requests:
          cpu: "1"
          memory: "512Mi"
        limits:
          cpu: "1"          # must equal requests
          memory: "512Mi"   # must equal requests
    - name: sidecar
      resources:
        requests:
          cpu: "100m"
          memory: "64Mi"
        limits:
          cpu: "100m"       # applies to every container
          memory: "64Mi"
# QoS Class: Guaranteed

# Burstable: at least one container sets a request or limit, but the Pod does not qualify as Guaranteed
spec:
  containers:
    - name: app
      resources:
        requests:
          cpu: "500m"
          memory: "256Mi"
        limits:
          cpu: "2"          # limits > requests → Burstable
          memory: "1Gi"
# QoS Class: Burstable

# BestEffort: no resource requests or limits at all (not recommended in production)
spec:
  containers:
    - name: app
      # no resources field
# QoS Class: BestEffort

# Check a Pod's QoS class
kubectl get pod <pod-name> -n production -o jsonpath='{.status.qosClass}'

# Check in bulk
kubectl get pods -n production -o custom-columns='NAME:.metadata.name,QOS:.status.qosClass'
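The classification rules can be written down as a short function. This is a sketch of the rules as described above, not the kubelet's code: qos_class is a name invented for this example, it only looks at cpu and memory, and it compares quantities as plain strings, so it assumes requests and limits use the same notation ("1" and "1000m" would wrongly count as different).

```python
def qos_class(containers):
    """Classify a Pod from a list of container dicts shaped like the Pod spec."""
    any_set = False
    guaranteed = True
    for container in containers:
        resources = container.get("resources", {})
        requests = resources.get("requests", {})
        limits = resources.get("limits", {})
        for name in ("cpu", "memory"):
            req, lim = requests.get(name), limits.get(name)
            if req is not None or lim is not None:
                any_set = True
            # A missing request defaults to the limit, so only a missing limit
            # or an explicit mismatch rules out Guaranteed.
            if lim is None or (req is not None and req != lim):
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


app = {"name": "app", "resources": {
    "requests": {"cpu": "500m", "memory": "256Mi"},
    "limits": {"cpu": "2", "memory": "1Gi"},
}}
print(qos_class([app]))              # Burstable
print(qos_class([{"name": "app"}]))  # BestEffort
```

One consequence worth noticing: a single sidecar without limits is enough to drop an otherwise Guaranteed Pod to Burstable.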

How QoS affects eviction and CPU contention

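Eviction order under node memory pressure:
1. BestEffort Pods (evicted first)
2. Burstable Pods whose usage exceeds their requests
3. Guaranteed Pods, and Burstable Pods that stay within their requests (evicted last; Guaranteed containers run with oom_score_adj = -997)

Under CPU pressure (throttling rather than eviction):
- Guaranteed Pods receive CPU time in proportion to their requests (exclusive cores only with the static CPU Manager policy and integer CPU requests)
- BestEffort Pods get almost no time slices when CPU is tight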
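The memory-pressure ordering can be expressed as a sort key. The sketch below follows the ranking described in the Kubernetes node-pressure eviction documentation (usage above requests first, then Pod priority, then how far usage is above requests); the PodUsage type, the function name, and all numbers are invented for this example.

```python
from dataclasses import dataclass


@dataclass
class PodUsage:
    name: str
    memory_request_mib: int  # 0 for BestEffort Pods
    memory_usage_mib: int
    priority: int = 0


def eviction_order(pods):
    """Return Pod names in the order they would be considered for eviction."""
    def key(pod):
        overage = pod.memory_usage_mib - pod.memory_request_mib
        exceeds = overage > 0
        # Over-request Pods first, then lower priority, then larger overage.
        return (not exceeds, pod.priority, -overage)
    return [pod.name for pod in sorted(pods, key=key)]


pods = [
    PodUsage("guaranteed", memory_request_mib=512, memory_usage_mib=200),
    PodUsage("burstable-under", memory_request_mib=512, memory_usage_mib=300),
    PodUsage("burstable-over", memory_request_mib=256, memory_usage_mib=400),
    PodUsage("besteffort", memory_request_mib=0, memory_usage_mib=300),
]
print(eviction_order(pods))
# ['besteffort', 'burstable-over', 'burstable-under', 'guaranteed']
```

QoS class is not an input here. The familiar BestEffort, Burstable, Guaranteed order falls out of comparing usage with requests, which is also why a Burstable Pod that stays under its requests is treated about as well as a Guaranteed one.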

LimitRange: per-namespace defaults

apiVersion: v1
kind: LimitRange
metadata:
  name: production-limits
  namespace: production
spec:
  limits:
    # Container-level limits
    - type: Container
      default:                    # default limits when none are set
        cpu: "500m"
        memory: "512Mi"
      defaultRequest:             # default requests when none are set
        cpu: "100m"
        memory: "128Mi"
      max:                        # largest value a container may set
        cpu: "8"
        memory: "16Gi"
      min:                        # smallest value a container may set
        cpu: "10m"
        memory: "32Mi"
      maxLimitRequestRatio:       # maximum limits/requests ratio (prevents excessive overcommit)
        cpu: "10"
        memory: "4"

    # Pod-level limits (sum across all containers)
    - type: Pod
      max:
        cpu: "16"
        memory: "32Gi"

    # PVC size limits
    - type: PersistentVolumeClaim
      max:
        storage: "100Gi"
      min:
        storage: "1Gi"

# Inspect the LimitRange
kubectl describe limitrange production-limits -n production

# Verify: a Pod created without resources gets the defaults injected automatically
kubectl run test-pod --image=nginx -n production
kubectl describe pod test-pod -n production | grep -A10 "Limits"
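How the Container rules in this LimitRange interact can be sketched as follows. This is a simplified model, not the admission plugin itself: the admit function is invented for this example, quantities are plain integers (millicores and MiB), and it relies on the documented behaviour that a container with an explicit limit but no request gets request = limit rather than the defaultRequest.

```python
# Container rules from the LimitRange above; cpu in millicores, memory in MiB.
LIMIT_RANGE = {
    "cpu": {"default": 500, "default_request": 100, "min": 10, "max": 8000, "max_ratio": 10},
    "memory": {"default": 512, "default_request": 128, "min": 32, "max": 16384, "max_ratio": 4},
}


def admit(requests, limits):
    """Fill in defaults, then validate. Returns (requests, limits) or raises ValueError."""
    requests, limits = dict(requests), dict(limits)
    for res, rule in LIMIT_RANGE.items():
        if res in limits and res not in requests:
            requests[res] = limits[res]  # request follows an explicit limit
        limits.setdefault(res, rule["default"])
        requests.setdefault(res, rule["default_request"])
        req, lim = requests[res], limits[res]
        if req > lim:
            raise ValueError(f"{res}: request {req} exceeds limit {lim}")
        if req < rule["min"] or lim > rule["max"]:
            raise ValueError(f"{res}: outside the allowed min/max range")
        if lim / req > rule["max_ratio"]:
            raise ValueError(f"{res}: limit/request ratio above {rule['max_ratio']}")
    return requests, limits


print(admit({}, {}))
# ({'cpu': 100, 'memory': 128}, {'cpu': 500, 'memory': 512})
```

A container asking for requests of 100m CPU with a 2-core limit would be rejected by this model, because the ratio of 20 is above the configured maximum of 10.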

ResourceQuota: namespace-wide totals

apiVersion: v1
kind: ResourceQuota
metadata:
  name: production-quota
  namespace: production
spec:
  hard:
    # Compute resources
    requests.cpu: "50"
    requests.memory: "100Gi"
    limits.cpu: "100"
    limits.memory: "200Gi"

    # Storage resources
    requests.storage: "500Gi"
    persistentvolumeclaims: "20"
    ebs-gp3.storageclass.storage.k8s.io/requests.storage: "200Gi"  # quota for a specific StorageClass

    # Object counts
    pods: "100"
    services: "30"
    services.loadbalancers: "5"
    services.nodeports: "0"     # forbid NodePort
    secrets: "50"
    configmaps: "50"
    replicationcontrollers: "0"
    count/deployments.apps: "20"   # non-core resources use the count/<resource>.<group> form

    # Per-QoS limits are not expressed as a resource name such as requests.cpu.Guaranteed;
    # use a separate ResourceQuota with spec.scopes (BestEffort / NotBestEffort) instead

# View quota usage
kubectl describe resourcequota production-quota -n production

# Sample output:
# Name:            production-quota
# Namespace:       production
# Resource         Used    Hard
# --------         ---     ----
# limits.cpu       8500m   100
# limits.memory    17Gi    200Gi
# pods             12      100
# requests.cpu     4250m   50
# requests.memory  8Gi     100Gi
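The admission check behind a ResourceQuota is simple addition. The sketch below uses the Used and Hard figures from the sample output, with CPU converted to millicores; the quota_violations function is invented for this example and covers only three of the tracked resources.

```python
# Figures from the sample output above; CPU values in millicores.
HARD = {"requests.cpu": 50_000, "limits.cpu": 100_000, "pods": 100}
USED = {"requests.cpu": 4_250, "limits.cpu": 8_500, "pods": 12}


def quota_violations(new_usage, used=USED, hard=HARD):
    """Return the quota entries that would be exceeded by adding new_usage."""
    return sorted(
        name
        for name, delta in new_usage.items()
        if name in hard and used.get(name, 0) + delta > hard[name]
    )


# One more Pod shaped like the earlier example (500m request, 2-core limit).
print(quota_violations({"requests.cpu": 500, "limits.cpu": 2_000, "pods": 1}))  # []

# A workload requesting 46 cores in total would not fit.
print(quota_violations({"requests.cpu": 46_000, "pods": 1}))  # ['requests.cpu']
```

Once a namespace has a quota on requests.cpu or limits.cpu, Pods that omit those fields are rejected unless a LimitRange supplies defaults, which is why the two objects are usually deployed together.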

Best practices for setting resources

How to set requests sensibly

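# Method 1: look at historical P95 usage (Prometheus)
# CPU P95 in cores: P95 of the 5-minute usage rate over the last 7 days
# (container_cpu_usage_seconds_total is a counter, not a histogram, so histogram_quantile does not apply)
quantile_over_time(0.95,
  rate(container_cpu_usage_seconds_total{namespace="production",container="my-app"}[5m])[7d:5m]
)

# Memory P95
quantile_over_time(0.95,
  container_memory_working_set_bytes{namespace="production",container="my-app"}[7d]
)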

# Method 2: run VPA in Off mode to get recommendations
kubectl apply -f - <<EOF
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-app-vpa-advisor
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  updatePolicy:
    updateMode: "Off"   # recommend only; never modify Pods automatically
EOF

# Check the recommendation after 7 days
kubectl describe vpa my-app-vpa-advisor -n production
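If the usage samples have been exported from monitoring, the percentiles behind these recommendations can be reproduced offline. The sketch below uses the nearest-rank method, which is an assumption of this example and may differ slightly from the interpolation Prometheus applies; the sample values are invented.

```python
def percentile(samples, pct):
    """Nearest-rank percentile; pct is an integer from 1 to 100."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = (pct * len(ordered) + 99) // 100  # ceil(pct * n / 100) in integer math
    return ordered[max(rank, 1) - 1]


# Hypothetical CPU usage samples in millicores.
cpu_samples = [120, 150, 180, 200, 210, 230, 250, 300, 420, 900]

print(percentile(cpu_samples, 50))  # 210
print(percentile(cpu_samples, 90))  # 420
print(percentile(cpu_samples, 95))  # 900
```

The gap between P50 and P95 in a series like this one is the burst range that limits have to absorb, so a single high percentile taken on its own says little about what requests should be.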

Recommended resource configuration strategies

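Workload type	requests	limits	Target QoS
Core services (API/DB)	P95-P99 usage	Equal to requests	Guaranteed
General business services	P50 usage	2-3× requests	Burstable
Batch jobs	Actual need	Actual need × 1.2	Burstable
Dev/test	Minimum needed to run	Moderately higher	BestEffort acceptable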

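# Recommended configuration for a production API service
resources:
  requests:
    cpu: "500m"       # set from P50 monitoring data
    memory: "512Mi"
  limits:
    cpu: "2"          # allow bursts
    memory: "1Gi"     # keep the memory limit close to requests where possible, to reduce OOM risk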

Eviction mechanism

kubelet triggers eviction when node resources run short:

# kubelet eviction threshold configuration (/etc/kubernetes/kubelet-config.yaml)
evictionHard:
  memory.available: "200Mi"     # evict when available memory drops below 200Mi
  nodefs.available: "10%"       # node filesystem free space below 10%
  nodefs.inodesFree: "5%"
  imagefs.available: "15%"

evictionSoft:
  memory.available: "500Mi"     # soft threshold; evicts only after it has been crossed for 2 minutes
evictionSoftGracePeriod:
  memory.available: "2m"

evictionMinimumReclaim:         # minimum amount to reclaim once eviction starts
  memory.available: "500Mi"
  nodefs.available: "1Gi"

# Check node conditions and eviction events
kubectl describe node <node-name> | grep -A5 "Conditions"
kubectl get events --field-selector reason=Evicted -n production

# List evicted Pods
kubectl get pods -n production --field-selector=status.phase=Failed | grep Evicted

# Clean up evicted Pods (note: this deletes every Failed Pod in the namespace, not only evicted ones)
kubectl get pods -n production --field-selector=status.phase=Failed \
  -o name | xargs kubectl delete -n production
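The difference between the hard and soft memory thresholds configured above can be modelled in a few lines. This is a simplified sketch, not kubelet code: the function name and the sample readings are invented, and the constants mirror the 200Mi, 500Mi and 2m values from the configuration.

```python
HARD_MIB = 200       # evictionHard memory.available
SOFT_MIB = 500       # evictionSoft memory.available
SOFT_GRACE_S = 120   # evictionSoftGracePeriod memory.available ("2m")


def should_evict(samples):
    """samples: list of (timestamp_seconds, available_mib), oldest first."""
    below_since = None
    for ts, available in samples:
        if available < HARD_MIB:
            return True  # hard threshold: act immediately
        if available < SOFT_MIB:
            if below_since is None:
                below_since = ts
            if ts - below_since >= SOFT_GRACE_S:
                return True  # soft threshold held for the full grace period
        else:
            below_since = None  # recovered, so the grace timer resets
    return False


print(should_evict([(0, 800), (30, 450), (60, 600), (90, 450), (150, 450)]))  # False
print(should_evict([(0, 450), (60, 450), (120, 450)]))                        # True
print(should_evict([(0, 800), (10, 150)]))                                    # True
```

The first series dips below 500Mi twice but recovers in between, so the grace timer restarts and nothing is evicted.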

PriorityClass: scheduling and eviction priority

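# Define priority classes (a larger value means higher priority)
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: critical-service
value: 1000000          # user-defined classes may use values up to 1000000000; anything higher is reserved for system classes
globalDefault: false
preemptionPolicy: PreemptLowerPriority  # may preempt lower-priority Pods
description: "Core business services; last in line for preemption and eviction"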

---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: batch-job
value: 100
globalDefault: false
preemptionPolicy: Never    # never preempts; waits for resources
description: "Batch jobs"

# Reference the PriorityClass from a Pod
spec:
  priorityClassName: critical-service
  containers:
    - name: app
      image: my-app:v1.0.0

# List all PriorityClasses in the cluster
kubectl get priorityclass

# Built-in priority classes (do not create classes with these names):
# system-cluster-critical: 2000000000 (CoreDNS, etc.)
# system-node-critical: 2000001000 (kube-proxy, etc.)
kubectl get priorityclass | grep system
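The effect of value and preemptionPolicy on a pending Pod can be sketched as below. This is a deliberately reduced model: the real scheduler chooses victims per node and also weighs factors such as PodDisruptionBudgets and graceful termination, and the function name and the "cron" Pod are invented for this example.

```python
def preemption_candidates(pending_priority, preemption_policy, running):
    """Return running Pods that a pending Pod may preempt, lowest priority first.

    running maps Pod name to its priority value.
    """
    if preemption_policy == "Never":
        return []  # the Pod waits in the queue instead of preempting
    lower = [(prio, name) for name, prio in running.items() if prio < pending_priority]
    return [name for prio, name in sorted(lower)]


running = {"api": 1_000_000, "report": 100, "cron": 0}

# A critical-service Pod (value 1000000, PreemptLowerPriority).
print(preemption_candidates(1_000_000, "PreemptLowerPriority", running))  # ['cron', 'report']

# A batch-job Pod (value 100, Never).
print(preemption_candidates(100, "Never", running))  # []
```

Pods with the same priority as the pending Pod never appear as candidates, so giving every workload the same high-priority class removes the benefit entirely.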

Troubleshooting command cheat sheet

# Check node resource pressure
kubectl describe nodes | grep -A10 "Conditions" | grep -E "MemoryPressure|DiskPressure|PIDPressure"

# Top resource consumers
kubectl top nodes
kubectl top pods -n production --sort-by=cpu | head -20
kubectl top pods -n production --sort-by=memory | head -20

# ResourceQuota usage across all namespaces
kubectl get resourcequota -A

# Find Pods with no resources set (BestEffort)
kubectl get pods -A -o json | jq '.items[] | select(.status.qosClass=="BestEffort") | {ns:.metadata.namespace, name:.metadata.name}'

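# Find containers whose memory usage exceeds 80% of their requests (requires Prometheus with cAdvisor and kube-state-metrics)
# sum by (namespace, pod, container) (container_memory_working_set_bytes{container!=""})
#   / sum by (namespace, pod, container) (kube_pod_container_resource_requests{resource="memory"}) > 0.8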

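Author

Wenzhuo Huang

An engineer who does ops, and an ops person who writes code. Focused on Kubernetes, AWS, GitOps, and infrastructure reliability. This blog is both my technical notebook and a casualty record of the pitfalls I have run into.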
