跳过正文
PostgreSQL 高可用实战:Patroni + HAProxy + etcd 完整部署指南

PostgreSQL 高可用实战:Patroni + HAProxy + etcd 完整部署指南

·1814 字·9 分钟·
目录

生产环境的 PostgreSQL 单点是最大的风险敞口。我们在把核心业务从 RDS 迁移到自建集群的过程中,选择了 Patroni 作为 HA 框架。Patroni 是目前社区最成熟的 PostgreSQL 高可用方案,Zalando、GitLab、Crunchy Data 都在生产大规模使用。这篇文章记录从零搭建的完整过程,踩过的坑都会标出来。

整体架构
#

                        ┌─────────────────┐
                        │    应用层         │
                        │  App / ORM       │
                        └────────┬────────┘
                                 │
                    ┌────────────┴────────────┐
                    │         HAProxy          │
                    │  :5000 (读写/主节点)      │
                    │  :5001 (只读/从节点)      │
                    └──────┬──────────┬───────┘
                           │          │
              ┌────────────┘          └────────────┐
              │                                     │
   ┌──────────▼──────────┐             ┌───────────▼──────────┐
   │  pg-node1 (Leader)  │             │  pg-node2 (Replica)  │
   │  Patroni + PG 16    │◄────WAL────►│  Patroni + PG 16     │
   │  192.168.1.101      │             │  192.168.1.102        │
   └─────────────────────┘             └──────────────────────┘
                                                    │
                                        ┌───────────▼──────────┐
                                        │  pg-node3 (Replica)  │
                                        │  Patroni + PG 16     │
                                        │  192.168.1.103        │
                                        └──────────────────────┘

   ┌──────────────────────────────────────────────────────────┐
   │              etcd 集群(3节点)                            │
   │  etcd1: 192.168.1.101   etcd2: 192.168.1.102            │
   │  etcd3: 192.168.1.103                                    │
   └──────────────────────────────────────────────────────────┘

节点规划

主机名IP角色
pg-node1192.168.1.101Patroni + etcd + PostgreSQL
pg-node2192.168.1.102Patroni + etcd + PostgreSQL
pg-node3192.168.1.103Patroni + etcd + PostgreSQL
haproxy192.168.1.100HAProxy(可与 PG 节点合并)

etcd 与 Patroni 节点复用,节省机器资源。生产环境建议 etcd 独立部署,避免 PostgreSQL IO 压力影响 etcd 的 fsync 延迟导致误判主节点宕机。


第一步:etcd 集群搭建
#

三节点都要执行:

# Ubuntu 22.04
apt-get update && apt-get install -y etcd-server etcd-client

# 或者手动下载指定版本
ETCD_VER=v3.5.12
curl -L https://github.com/etcd-io/etcd/releases/download/${ETCD_VER}/etcd-${ETCD_VER}-linux-amd64.tar.gz \
  -o /tmp/etcd.tar.gz
tar xzf /tmp/etcd.tar.gz -C /usr/local/bin --strip-components=1 \
  etcd-${ETCD_VER}-linux-amd64/etcd \
  etcd-${ETCD_VER}-linux-amd64/etcdctl

etcd1(192.168.1.101)的配置文件 /etc/etcd/etcd.conf

name: etcd1
data-dir: /var/lib/etcd
listen-client-urls: http://192.168.1.101:2379,http://127.0.0.1:2379
advertise-client-urls: http://192.168.1.101:2379
listen-peer-urls: http://192.168.1.101:2380
initial-advertise-peer-urls: http://192.168.1.101:2380
initial-cluster: etcd1=http://192.168.1.101:2380,etcd2=http://192.168.1.102:2380,etcd3=http://192.168.1.103:2380
initial-cluster-token: pg-etcd-cluster-prod
initial-cluster-state: new
# 心跳与选举超时,默认值对大多数场景够用
heartbeat-interval: 100
election-timeout: 1000
# 快照
snapshot-count: 10000
max-snapshots: 5
# 日志
logger: zap
log-level: warn

etcd2/etcd3 只改 name 和三处 IP 地址即可。

# systemd service
cat > /etc/systemd/system/etcd.service << 'EOF'
[Unit]
Description=etcd key-value store
After=network.target

[Service]
Type=notify
ExecStart=/usr/local/bin/etcd --config-file /etc/etcd/etcd.conf
Restart=always
RestartSec=5
LimitNOFILE=65536

[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reload
systemctl enable etcd
systemctl start etcd

验证 etcd 集群健康:

etcdctl --endpoints=http://192.168.1.101:2379,http://192.168.1.102:2379,http://192.168.1.103:2379 \
  endpoint health

# 输出类似:
# http://192.168.1.101:2379 is healthy: successfully committed proposal: took = 2.1ms
# http://192.168.1.102:2379 is healthy: successfully committed proposal: took = 1.8ms
# http://192.168.1.103:2379 is healthy: successfully committed proposal: took = 2.4ms

etcdctl --endpoints=http://192.168.1.101:2379 \
  endpoint status --write-out=table

第二步:安装 PostgreSQL 16
#

三节点都要执行:

# 添加 PGDG 源
apt-get install -y curl ca-certificates
install -d /usr/share/postgresql-common/pgdg
curl -o /usr/share/postgresql-common/pgdg/apt.postgresql.org.asc \
  --fail https://www.postgresql.org/media/keys/ACCC4CF8.asc
sh -c 'echo "deb [signed-by=/usr/share/postgresql-common/pgdg/apt.postgresql.org.asc] \
  https://apt.postgresql.org/pub/repos/apt $(lsb_release -cs)-pgdg main" \
  > /etc/apt/sources.list.d/pgdg.list'
apt-get update
apt-get install -y postgresql-16 postgresql-client-16

# 停止并禁用默认的 postgresql 服务(由 Patroni 接管启停)
systemctl stop postgresql
systemctl disable postgresql

# 清空默认数据目录(Patroni 会自己初始化)
rm -rf /var/lib/postgresql/16/main

第三步:安装 Patroni
#

apt-get install -y python3-pip python3-dev libpq-dev gcc

# 安装 Patroni + etcd 支持
pip3 install patroni[etcd] psycopg2-binary

# 或者用 pipx 隔离环境(推荐)
pipx install 'patroni[etcd]'

第四步:Patroni 配置文件
#

这是整个方案的核心,每个节点配置文件有差异,以下是 pg-node1/etc/patroni/patroni.yml

scope: pg-cluster          # 集群名称,所有节点必须一致
namespace: /db/            # etcd 中的 key 前缀
name: pg-node1             # 本节点名称,每个节点唯一

restapi:
  listen: 192.168.1.101:8008       # Patroni REST API 监听地址
  connect_address: 192.168.1.101:8008
  # 生产建议加认证
  # authentication:
  #   username: patroni
  #   password: strongpassword

etcd3:
  hosts: 192.168.1.101:2379,192.168.1.102:2379,192.168.1.103:2379
  # 可选:开启 TLS
  # protocol: https
  # cacert: /etc/ssl/etcd/ca.crt

bootstrap:
  # DCS 中不存在集群时的初始化配置
  dcs:
    ttl: 30                        # leader key 的 TTL(秒),超时触发重新选举
    loop_wait: 10                  # Patroni 主循环间隔(秒)
    retry_timeout: 10              # 操作 DCS 的超时时间
    maximum_lag_on_failover: 1048576  # 允许故障转移的最大 WAL 滞后(1MB)
    # 同步复制:至少 1 个同步副本
    synchronous_mode: false
    # synchronous_mode_strict: false
    postgresql:
      use_pg_rewind: true          # 开启 pg_rewind,允许老主降级后追上新主
      use_slots: true              # 使用复制槽,防止 WAL 被清除
      parameters:
        wal_level: replica
        hot_standby: "on"
        wal_keep_size: 1024        # MB,保留 WAL 段
        max_wal_senders: 10
        max_replication_slots: 10
        wal_log_hints: "on"        # pg_rewind 需要
        archive_mode: "on"
        archive_command: 'test ! -f /var/lib/postgresql/wal_archive/%f && cp %p /var/lib/postgresql/wal_archive/%f'
        shared_buffers: 4GB
        effective_cache_size: 12GB
        maintenance_work_mem: 512MB
        checkpoint_completion_target: 0.9
        wal_buffers: 64MB
        default_statistics_target: 100
        random_page_cost: 1.1
        effective_io_concurrency: 200
        work_mem: 16MB
        min_wal_size: 1GB
        max_wal_size: 4GB
        max_worker_processes: 8
        max_parallel_workers_per_gather: 4
        max_parallel_workers: 8
        max_parallel_maintenance_workers: 4
        log_destination: stderr
        logging_collector: "on"
        log_directory: /var/log/postgresql
        log_filename: postgresql-%Y-%m-%d_%H%M%S.log
        log_min_duration_statement: 1000    # 慢查询阈值 1s
        log_checkpoints: "on"
        log_connections: "off"
        log_disconnections: "off"
        log_lock_waits: "on"
        log_temp_files: 0
        log_autovacuum_min_duration: 0
        track_activity_query_size: 4096
        shared_preload_libraries: pg_stat_statements
        pg_stat_statements.max: 10000
        pg_stat_statements.track: all

  # 初始化时执行的 SQL
  initdb:
    - encoding: UTF8
    - data-checksums       # 开启数据校验,生产必须
    - locale: en_US.UTF-8

  # Patroni 托管 pg_hba.conf,不要手动修改该文件
  pg_hba:
    - host replication replicator 192.168.1.0/24 md5
    - host all all 0.0.0.0/0 md5
    - local all all peer

  # 初始化后执行的 SQL(创建复制用户)
  post_init: /etc/patroni/post_init.sh

postgresql:
  listen: 192.168.1.101:5432       # 每个节点改为本机 IP
  connect_address: 192.168.1.101:5432
  data_dir: /var/lib/postgresql/16/main
  bin_dir: /usr/lib/postgresql/16/bin
  config_dir: /var/lib/postgresql/16/main
  pgpass: /tmp/pgpass0

  authentication:
    replication:
      username: replicator
      password: "ReplStr0ngPass!"
    superuser:
      username: postgres
      password: "PGSuperStr0ng!"
    rewind:
      username: rewind_user
      password: "RewindStr0ng!"

  # 额外的 recovery 参数(PG 12+ 写入 postgresql.conf)
  recovery_conf:
    restore_command: 'cp /var/lib/postgresql/wal_archive/%f %p'

tags:
  nofailover: false        # 设为 true 则此节点不参与选主
  noloadbalance: false     # 设为 true 则 HAProxy 不向此节点路由读请求
  clonefrom: false         # 设为 true 则优先从此节点克隆新副本
  nosync: false

pg-node2 / pg-node3 只需改三处:namerestapi.listenrestapi.connect_addresspostgresql.listenpostgresql.connect_address

创建 post_init.sh:

cat > /etc/patroni/post_init.sh << 'EOF'
#!/bin/bash
psql -U postgres << SQL
CREATE USER replicator REPLICATION LOGIN ENCRYPTED PASSWORD 'ReplStr0ngPass!';
CREATE USER rewind_user LOGIN ENCRYPTED PASSWORD 'RewindStr0ng!';
GRANT EXECUTE ON function pg_catalog.pg_ls_dir(text, boolean, boolean) TO rewind_user;
GRANT EXECUTE ON function pg_catalog.pg_stat_file(text, boolean) TO rewind_user;
GRANT EXECUTE ON function pg_catalog.pg_read_binary_file(text) TO rewind_user;
GRANT EXECUTE ON function pg_catalog.pg_read_binary_file(text, bigint, bigint, boolean) TO rewind_user;
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;
SQL
EOF
chmod +x /etc/patroni/post_init.sh

创建 systemd service:

cat > /etc/systemd/system/patroni.service << 'EOF'
[Unit]
Description=Patroni - High Availability PostgreSQL Cluster
After=syslog.target network.target etcd.service
Requires=etcd.service

[Service]
Type=simple
User=postgres
Group=postgres
ExecStart=/usr/local/bin/patroni /etc/patroni/patroni.yml
ExecReload=/bin/kill -s HUP $MAINPID
KillMode=process
TimeoutSec=30
Restart=on-failure
RestartSec=5
# ulimits
LimitNOFILE=1048576

[Install]
WantedBy=multi-user.target
EOF

# 创建日志目录
mkdir -p /var/log/postgresql
chown postgres:postgres /var/log/postgresql

# 创建 WAL 归档目录
mkdir -p /var/lib/postgresql/wal_archive
chown postgres:postgres /var/lib/postgresql/wal_archive

# 启动(先启动 pg-node1)
systemctl daemon-reload
systemctl enable patroni
systemctl start patroni
systemctl status patroni

在 pg-node1 成功初始化后,再依次启动 pg-node2、pg-node3,Patroni 会自动通过 pg_basebackup 克隆主节点数据。


第五步:HAProxy 配置
#

HAProxy 实现两个虚拟端口:

  • 5000:写端口,只转发到当前 Leader(Patroni REST API 返回 HTTP 200 表示主节点,HTTP 503 表示从节点)
  • 5001:读端口,转发到所有 Replica
apt-get install -y haproxy

/etc/haproxy/haproxy.cfg

global
    maxconn 100000
    log /dev/log local0
    log /dev/log local1 notice
    chroot /var/lib/haproxy
    stats socket /run/haproxy/admin.sock mode 660 level admin expose-fd listeners
    stats timeout 30s
    user haproxy
    group haproxy
    daemon

defaults
    log global
    mode tcp
    option tcplog
    option dontlognull
    timeout connect 5s
    timeout client 30s
    timeout server 30s
    timeout check 5s

#
# 统计页面
#
listen stats
    bind *:7000
    mode http
    stats enable
    stats uri /haproxy
    stats refresh 10s
    stats show-legends
    stats auth admin:haproxy_admin_pass

#
# 主节点(读写)端口 5000
# Patroni 主节点 REST API 返回 HTTP 200
#
listen pg_primary
    bind *:5000
    option httpchk GET /primary
    http-check expect status 200
    default-server inter 3s fall 3 rise 2 on-marked-down shutdown-sessions
    server pg-node1 192.168.1.101:5432 check port 8008
    server pg-node2 192.168.1.102:5432 check port 8008
    server pg-node3 192.168.1.103:5432 check port 8008

#
# 从节点(只读)端口 5001
# Patroni 从节点 REST API 返回 HTTP 200(/replica 接口)
#
listen pg_replicas
    bind *:5001
    balance roundrobin
    option httpchk GET /replica
    http-check expect status 200
    default-server inter 3s fall 3 rise 2 on-marked-down shutdown-sessions
    server pg-node1 192.168.1.101:5432 check port 8008
    server pg-node2 192.168.1.102:5432 check port 8008
    server pg-node3 192.168.1.103:5432 check port 8008

注意/primary 接口只在 Leader 节点返回 200,/replica 接口只在 Replica 节点返回 200。HAProxy 的健康检查会自动将当前主节点从只读池中排除。

haproxy -c -f /etc/haproxy/haproxy.cfg   # 验证配置语法
systemctl enable haproxy
systemctl start haproxy

连接验证:

psql -h 192.168.1.100 -p 5000 -U postgres -c "SELECT pg_is_in_recovery();"
# 应返回 f(false,即主节点)

psql -h 192.168.1.100 -p 5001 -U postgres -c "SELECT pg_is_in_recovery();"
# 应返回 t(true,即从节点)

第六步:patronictl 常用运维命令
#

# 设置 PATRONICTL_CONFIG_FILE 环境变量,简化命令
export PATRONICTL_CONFIG_FILE=/etc/patroni/patroni.yml

# 查看集群状态
patronictl -c /etc/patroni/patroni.yml list

# 输出示例:
# + Cluster: pg-cluster (7234567890123456789) +---------+----+-----------+
# | Member   | Host              | Role    | State   | TL | Lag in MB |
# +----------+-------------------+---------+---------+----+-----------+
# | pg-node1 | 192.168.1.101:5432 | Leader  | running |  1 |           |
# | pg-node2 | 192.168.1.102:5432 | Replica | running |  1 |         0 |
# | pg-node3 | 192.168.1.103:5432 | Replica | running |  1 |         0 |
# +----------+-------------------+---------+---------+----+-----------+

# 手动 Switchover(计划内切换,有确认提示)
patronictl -c /etc/patroni/patroni.yml switchover pg-cluster \
  --master pg-node1 --candidate pg-node2 --scheduled now

# 强制 Failover(紧急切换,不等当前主节点响应)
patronictl -c /etc/patroni/patroni.yml failover pg-cluster \
  --master pg-node1 --candidate pg-node2 --force

# 重启某个节点(等待,不强制)
patronictl -c /etc/patroni/patroni.yml restart pg-cluster pg-node2

# 重新加载配置(patroni.yml 改动后)
patronictl -c /etc/patroni/patroni.yml reload pg-cluster

# 暂停自动故障转移(维护窗口必用)
patronictl -c /etc/patroni/patroni.yml pause pg-cluster
# 恢复
patronictl -c /etc/patroni/patroni.yml resume pg-cluster

# 编辑 DCS 中的集群配置(等效于修改 bootstrap.dcs 段)
patronictl -c /etc/patroni/patroni.yml edit-config pg-cluster

# 查看历史时间线
patronictl -c /etc/patroni/patroni.yml history pg-cluster

# 删除某个成员的 DCS 注册(成员彻底下线时)
patronictl -c /etc/patroni/patroni.yml remove pg-cluster

第七步:故障切换演练
#

演练场景:Kill 主节点,观察自动选主

# 确认当前主节点
patronictl -c /etc/patroni/patroni.yml list
# pg-node1 是 Leader

# 终端1:持续监控
watch -n 1 'patronictl -c /etc/patroni/patroni.yml list'

# 终端2:模拟主节点宕机(在 pg-node1 上执行)
systemctl stop patroni

# 观察切换过程(约 30s,即 TTL 时间):
# 1. pg-node1 状态变为 stopped
# 2. etcd 中 leader key TTL 超时(30s)
# 3. pg-node2 或 pg-node3 竞争 leader key
# 4. 获胜节点执行 promote,成为新 Leader
# 5. HAProxy 健康检查感知变化,流量切换(约 3-9s 后)

HAProxy 侧验证:

# 持续测试写端口是否恢复
while true; do
  psql -h 192.168.1.100 -p 5000 -U postgres -c "SELECT now(), pg_is_in_recovery();" 2>&1
  sleep 2
done

pg_rewind 恢复老主节点:

当 pg-node1 重新上线时,因为它的时间线已经落后于新主,不能直接加入集群。Patroni 配置了 use_pg_rewind: true 后会自动处理,但需要确认 wal_log_hints=on 已生效:

# 重新启动 pg-node1 的 Patroni
systemctl start patroni

# Patroni 会自动:
# 1. 检测到时间线不匹配
# 2. 执行 pg_rewind 从新主节点同步差异 WAL
# 3. 以 Replica 身份重新加入集群

# 如果 pg_rewind 失败,手动克隆:
systemctl stop patroni
rm -rf /var/lib/postgresql/16/main/*
pg_basebackup -h 192.168.1.102 -U replicator -D /var/lib/postgresql/16/main \
  -P -Xs -R -C -S pg-node1-slot
chown -R postgres:postgres /var/lib/postgresql/16/main
systemctl start patroni

第八步:监控集成
#

Prometheus patroni_exporter
#

# 每个 Patroni 节点安装 patroni_exporter
pip3 install patroni[zookeeper,etcd,consul,kubernetes]

# 也可以用独立的 patroni exporter
# https://github.com/woblerr/patroni_exporter
wget https://github.com/woblerr/patroni_exporter/releases/download/v0.8.0/patroni_exporter_linux_amd64
chmod +x patroni_exporter_linux_amd64
mv patroni_exporter_linux_amd64 /usr/local/bin/patroni_exporter

实际上 Patroni 内置了 /metrics 接口,直接在 Prometheus 中刮取即可:

# prometheus.yml
scrape_configs:
  - job_name: 'patroni'
    static_configs:
      - targets:
          - '192.168.1.101:8008'
          - '192.168.1.102:8008'
          - '192.168.1.103:8008'
    metrics_path: /metrics

关键 Prometheus 告警规则:

# patroni-alerts.yml
groups:
  - name: patroni
    rules:
      - alert: PatroniClusterUnhealthy
        expr: patroni_cluster_unlocked == 1
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Patroni 集群无 Leader({{ $labels.scope }})"

      - alert: PatroniReplicaLagging
        expr: patroni_replica_lag_in_megabytes > 100
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "副本 {{ $labels.patroni_member }} 延迟 {{ $value }}MB"

      - alert: PatroniMemberDown
        expr: patroni_patroni_info == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Patroni 节点 {{ $labels.instance }} 离线"

      - alert: PatroniFailoverDetected
        expr: changes(patroni_master[5m]) > 0
        for: 0m
        labels:
          severity: warning
        annotations:
          summary: "集群 {{ $labels.scope }} 发生了主节点切换"

postgres_exporter 补充指标
#

# 安装 postgres_exporter
wget https://github.com/prometheus-community/postgres_exporter/releases/download/v0.15.0/postgres_exporter-0.15.0.linux-amd64.tar.gz
tar xzf postgres_exporter-0.15.0.linux-amd64.tar.gz
mv postgres_exporter-0.15.0.linux-amd64/postgres_exporter /usr/local/bin/

# 配置数据源
export DATA_SOURCE_NAME="postgresql://postgres:PGSuperStr0ng!@localhost:5432/postgres?sslmode=disable"
postgres_exporter --web.listen-address=":9187" &

在 Kubernetes 上:CloudNativePG Operator
#

如果数据库运行在 K8s 上,CloudNativePG(CNPG) 是 Patroni 的云原生替代方案,由原 Zalando postgres-operator 核心团队开发。

# 安装 CNPG Operator
kubectl apply -f \
  https://raw.githubusercontent.com/cloudnative-pg/cloudnative-pg/release-1.22/releases/cnpg-1.22.0.yaml

Cluster 资源定义(三节点,内置 HAProxy 等效功能):

apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: pg-cluster
  namespace: database
spec:
  instances: 3
  imageName: ghcr.io/cloudnative-pg/postgresql:16.2

  postgresql:
    parameters:
      shared_buffers: "4GB"
      work_mem: "16MB"
      max_connections: "200"
      wal_level: "replica"
      max_wal_senders: "10"
      shared_preload_libraries: "pg_stat_statements"
      pg_stat_statements.max: "10000"
      log_min_duration_statement: "1000"
    pg_hba:
      - host all all 10.0.0.0/8 md5
      - host replication replicator 10.0.0.0/8 md5

  bootstrap:
    initdb:
      database: appdb
      owner: appuser
      secret:
        name: pg-user-secret
      encoding: UTF8
      dataChecksums: true

  storage:
    size: 100Gi
    storageClass: gp3

  walStorage:
    size: 20Gi
    storageClass: gp3

  backup:
    retentionPolicy: "30d"
    barmanObjectStore:
      destinationPath: s3://your-bucket/pg-cluster
      s3Credentials:
        accessKeyId:
          name: aws-creds
          key: ACCESS_KEY_ID
        secretAccessKey:
          name: aws-creds
          key: ACCESS_SECRET_KEY
      wal:
        compression: gzip
        maxParallel: 8

  resources:
    requests:
      memory: "8Gi"
      cpu: "2"
    limits:
      memory: "16Gi"
      cpu: "4"

  # 自动故障转移配置
  failoverDelay: 0
  switchoverDelay: 3600

  # 监控
  monitoring:
    enablePodMonitor: true

CNPG 会自动创建三个 Service:

  • pg-cluster-rw:指向 Leader(应用写连接)
  • pg-cluster-ro:指向 Replica(负载均衡只读)
  • pg-cluster-r:指向所有节点

常见问题
#

1. etcd key TTL 超时时间应该设多少?

ttl: 30 意味着主节点宕机后最长 30 秒内完成切换。对于多数业务可接受,如需更快切换可设 15,但太小会导致网络抖动引发误切。

2. maximum_lag_on_failover 的作用

如果所有 Replica 的 WAL 滞后都超过这个值(默认 1MB),Patroni 会拒绝自动故障转移,避免数据丢失,需要 DBA 手动介入。

3. 两个节点都认为自己是 Leader(脑裂)

Patroni 通过 etcd 的 CAS(Compare-And-Swap)操作确保同一时刻只有一个 Leader 持有 key,从协议层面杜绝脑裂。但 etcd 本身崩溃时,Patroni 会进入 pause 模式,保持现状不切换。

4. pg_rewind 权限问题

rewind_user 需要特定函数的 EXECUTE 权限(已在 post_init.sh 中授权)。PostgreSQL 15+ 可以直接 GRANT pg_rewind TO rewind_user;

5. 生产建议

  • etcd 使用 SSD,fsync 延迟直接影响 TTL 判断准确性
  • synchronous_mode: true 可开启同步复制,RPO=0 但写延迟升高
  • 维护窗口操作前务必先 patronictl pause,避免意外故障转移
  • 定期测试 switchover,确保切换流程熟练
Wenzhuo Huang
作者
Wenzhuo Huang
搞运维的工程师,写代码的运维人。专注 Kubernetes、AWS、GitOps 与基础设施可靠性。这个博客既是我的技术笔记本,也是我踩过的坑的受害者档案。

相关文章

PostgreSQL 运维实战:配置调优、连接池、慢查询与高可用

·1918 字·10 分钟
系统梳理 PostgreSQL 运维核心技能:从 shared_buffers、WAL 参数调优,到 PgBouncer 事务模式配置;从 pg_stat_statements 慢查询分析到 PITR 时间点恢复;以及主从流复制、膨胀表清理和 Prometheus 监控指标的完整实践。