0%

pg_repmgr

repmgr教程

简介

repmgr是一个开源工具套件,用于管理PostgreSQL服务器集群中的复制和故障转移。它使用工具来增强PostgreSQL的内置热备份功能,以设置备用服务器,监控复制以及执行管理任务,例如故障转移或手动切换操作。

前期准备

两台装有 postgresql 9.5 的服务器。

1. 安装

1.1 下载源码包https://repmgr.org/download/repmgr-5.0.0.tar.gz
1.2 解压

1
tar -zxvf repmgr-5.0.0.tar.gz

1.3 进入目录并编译
安装编译所需库

  • flex
  • libedit-dev
  • libkrb5-dev
  • libpam0g-dev
  • libreadline-dev
  • libselinux1-dev
  • libssl-dev
  • libxml2-dev
  • libxslt1-dev
    1
    apt install flex libedit-dev libkrb5-dev libpam0g-dev libreadline-dev libselinux1-dev libssl-dev libxml2-dev libxslt1-dev
    1
    ./configure && make install

    2. 配置

    可复制 sample, sample 位于解压的根目录 repmgr.conf.sample。
    1
    cp repmgr.conf.sample /etc/postgresql/repmgr.conf
    修改以下关键信息
    1
    2
    3
    4
    5
    6
    node_id=1
    node_name='138'
    conninfo='host=db1 user=postgres dbname=postgres password=d connect_timeout=2'
    data_directory='/var/lib/postgresql/9.5/main'
    pg_bindir='/usr/lib/postgresql/9.5/bin'
    repmgr_bindir='/usr/lib/postgresql/9.5/bin'
    1
    2
    3
    4
    5
    6
    node_id=2
    node_name='139'
    conninfo='host=db2 user=postgres dbname=postgres password=d connect_timeout=2'
    data_directory='/var/lib/postgresql/9.5/main'
    pg_bindir='/usr/lib/postgresql/9.5/bin'
    repmgr_bindir='/usr/lib/postgresql/9.5/bin'

    3. 从库复制主库

    主库中创建extension,psql 连接数据库后输入
    1
    create extension repmgr
    创建流复制角色
    1
    CREATE ROLE repl login replication encrypted password 'd'

3.1 修改主库配置
postgresql.conf, 和一般的流复制配置无异

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
listen_addresses = '*' # 监听来自任何地址的请求
archive_mode = on # 开启归档
archive_command = 'cp %p /var/lib/postgresql/pg_archive/%f' # 归档 logfile segment 命令 %p 代表需要归档的文件路径,%f 归档文件名,此目录需要属于postgres角色,一面权限不足。
wal_level = hot_standby # 归档开启热备用 minimal, archive, hot_standby, or logical
max_wal_senders = 10 # 最大wal发送者数量。通常和从库数量一致。
wal_keep_segments = 256 # 保留的最大日志文件段。一个16MB。
wal_sender_timeout = 60s # 发送超时时间
max_connections = 100 # 这个设置要注意下,从库的max_connections必须要大于主库的。(不知道)。最大连接数。
wal_log_hints = on # 不然会导致主服务器挂掉之后恢复出问题。pg_rewind
# 从库配置
hot_standby = on # 在恢复期间是否允许查询
max_standby_archive_delay = 30s # 读取 archive WAL时,取消查询时的最大延迟,-1 无限
#max_standby_streaming_delay = 30s # 读取streaming wal 取消查询的最大延迟,-1 无限
wal_receiver_status_interval = 10s # 发送回复的间隔
hot_standby_feedback = off # 从 standby 状态发送信息以防止冲突
wal_receiver_timeout = 60s # 接收 master 等待的时间,默认 milliseconds
wal_retrieve_retry_interval = 5s # 尝试失败后重新检索 WAL 的间隔

pg_hba.conf

1
2
3
4
# local      DATABASE  USER  METHOD  [OPTIONS]
# host DATABASE USER ADDRESS METHOD [OPTIONS]
host replication repl 0.0.0.0/0 md5
host replication postgres 0.0.0.0/0 md5

3.2 主库注册到 repmgr,并查看

1
2
3
4
5
repmgr -f /etc/postgresql/repmgr.conf primary register
INFO: connecting to primary database...
NOTICE: attempting to install extension "repmgr"
NOTICE: "repmgr" extension successfully installed
NOTICE: primary node record (ID: 1) registered
1
2
3
4
repmgr cluster show -f /etc/postgresql/repmgr.conf
ID | Name | Role | Status | Upstream | Location | Priority | Connection string
----+------+---------+-----------+----------+----------+----------+---------------------------------------------------------------------
1 | 138 | primary | * running | | default | 100 | host=db1 user=postgres dbname=postgres password=d connect_timeout=2

3.3 从库配置
将密码连接信息存储在.pgpass文件

1
vim ~/.pgpass

输入以下信息,格式为 host:port:DBName:username:password

1
2
db1:5432:postgres:postgres:d
db2:5432:postgres:postgres:d

然后设置权限chmod 600 ~/.pgpass
clone 执行,执行前需要清空 pgdata 目录

1
repmgr -h db1 -U postgres -d postgres -f /etc/postgresql/repmgr.conf standby clone

你会发现目录生成了,且 recovery.conf 也生成了。
启动从库

1
pg_ctl -D . start

然后注册从库

1
2
3
4
5
6
repmgr standby register -f /etc/postgresql/repmgr.conf
INFO: connecting to local node "139" (ID: 2)
INFO: connecting to primary database
WARNING: --upstream-node-id not supplied, assuming upstream node is primary (node ID 1)
INFO: standby registration complete
NOTICE: standby node "139" (ID: 2) successfully registered

查看节点

1
2
3
4
5
repmgr cluster show -f /etc/postgresql/repmgr.conf
ID | Name | Role | Status | Upstream | Location | Priority | Connection string
----+------+---------+-----------+----------+----------+----------+---------------------------------------------------------------------
1 | 138 | primary | * running | | default | 100 | host=db1 user=postgres dbname=postgres password=d connect_timeout=2
2 | 139 | standby | running | 138 | default | 100 | host=db2 user=postgres dbname=postgres password=d connect_timeout=2

4.配置自动failover(如果使用了 pgpool failover,则不要使用 repmgr 的failover)

  1. 因为自动 failover 基于repmgrd,首先需要包含其关联库,postgresql.conf,配置下面一行,需要重启
    1
    shared_preload_libraries ='repmgr'
    主从都需要重启:
    1
    pg_ctl -D . restart
  2. 配置repmgr.conf,配置在上面配置的基础上添加(主从都需要)
    1
    2
    3
    4
    failover=automatic # 开启自动故障转移
    promote_command='/usr/lib/postgresql/9.5/bin/repmgr standby promote -f /etc/postgresql/repmgr.conf --log-to-file'
    follow_command='/usr/lib/postgresql/9.5/bin/repmgr standby follow -f /etc/postgresql/repmgr.conf --log-to-file --upstream-node-id=%n'
    #上面这两个命令必须绝对路径
  3. 主备均要启动 repmgrd 服务
    配置如果改了,则需要 kill 掉,然后重新启动.
    1
    repmgrd -f /etc/postgresql/repmgr.conf
  4. 测试
    1. 主库挂掉
      1
      pg_ctl -D . stop
  5. 2 副库则看到
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    23
    24
    25
    26
    [2020-02-13 20:14:49] [INFO] checking state of node 1, 6 of 6 attempts
    [2020-02-13 20:14:49] [WARNING] unable to ping "user=postgres password=d connect_timeout=2 dbname=postgres host=db1 fallback_application_name=repmgr"
    [2020-02-13 20:14:49] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
    [2020-02-13 20:14:49] [WARNING] unable to reconnect to node 1 after 6 attempts
    [2020-02-13 20:14:49] [INFO] 0 active sibling nodes registered
    [2020-02-13 20:14:49] [INFO] primary and this node have the same location ("default")
    [2020-02-13 20:14:49] [INFO] no other sibling nodes - we win by default
    [2020-02-13 20:14:49] [NOTICE] this node is the only available candidate and will now promote itself
    [2020-02-13 20:14:49] [INFO] promote_command is:
    "/usr/lib/postgresql/9.5/bin/repmgr standby promote -f /etc/postgresql/repmgr.conf --log-to-file"
    [2020-02-13 20:14:49] [NOTICE] promoting standby to primary
    [2020-02-13 20:14:49] [DETAIL] promoting server "139" (ID: 2) using "/usr/lib/postgresql/9.5/bin/pg_ctl -w -D '/var/lib/postgresql/9.5/main' promote"
    LOG: received promote request
    [2020-02-13 20:14:49] [NOTICE] LOG: redo done at 0/5000028
    waiting up to 60 seconds (parameter "promote_check_timeout") for promotion to complete
    LOG: last completed transaction was at log time 2020-02-13 04:13:29.417036-08
    LOG: selected new timeline ID: 2
    LOG: archive recovery complete
    LOG: MultiXact member wraparound protections are now enabled
    LOG: database system is ready to accept connections
    LOG: autovacuum launcher started
    [2020-02-13 20:14:50] [NOTICE] STANDBY PROMOTE successful
    [2020-02-13 20:14:50] [DETAIL] server "139" (ID: 2) was successfully promoted to primary
    [2020-02-13 20:14:50] [INFO] 0 followers to notify
    [2020-02-13 20:14:50] [INFO] switching to primary monitoring mode
    [2020-02-13 20:14:50] [NOTICE] monitoring cluster primary "139" (ID: 2)
    可以看到副库已经变成了主库
    1
    2
    3
    4
    5
    6
    7
    8
    postgres@ubuntu:~/9.5/main$ repmgr cluster show -f /etc/postgresql/repmgr.conf
    ID | Name | Role | Status | Upstream | Location | Priority | Connection string
    ----+------+---------+-----------+----------+----------+----------+---------------------------------------------------------------------
    1 | 138 | primary | - failed | | default | 100 | host=db1 user=postgres dbname=postgres password=d connect_timeout=2
    2 | 139 | primary | * running | | default | 100 | host=db2 user=postgres dbname=postgres password=d connect_timeout=2

    WARNING: following issues were detected
    - unable to connect to node "138" (ID: 1)
  6. 3 恢复
    也就是将主库设置为新的从库
    创建 recovery.conf 文件。
    内部配置和备用库基本一样
    1
    2
    3
    standby_mode = 'on' # 热备份mode
    primary_conninfo = 'user=repl password=d host=192.168.56.139 port=5432 sslmode=prefer sslcompression=1 krbsrvname=postgres' # 连接信息
    recovery_target_timeline = 'latest' # 复制最新数据
    启动
    1
    pg_ctl start
    可以看到
    1
    2
    3
    4
    5
    postgres@ubuntu:~/9.5/main$ repmgr cluster show -f /etc/postgresql/repmgr.conf
    ID | Name | Role | Status | Upstream | Location | Priority | Connection string
    ----+------+---------+-----------+----------+----------+----------+-------------------------------------------------------------------- -
    1 | 138 | standby | running | | default | 100 | host=db1 user=postgres dbname=postgres password=d connect_timeout=2
    2 | 139 | primary | * running | | default | 100 | host=db2 user=postgres dbname=postgres password=d connect_timeout=2
    成功切换

问题

  1. apt 安装后并配置数据库会导致启动数据库时could not access file “repmgr”: No such file or directory
    It means the repmgr extension code is not installed in the PostgreSQL application directory. This typically happens when using PostgreSQL packages provided by a third-party vendor, which often have different filesystem layouts.

    Either use PostgreSQL packages provided by the community or 2ndQuadrant; if this is not possible, contact your vendor for assistance.
    简单就是说用的可能时第三方的包,还是从官网下载包并编译安装吧。

  2. 提示未输入密码

    1
    2
    3
    4
    5
    6
    ERROR: connection to database failed
    DETAIL:
    fe_sendauth: no password supplied

    ERROR: unable to establish necessary replication connections
    HINT: check replication permissions on the source server

    检查 pg_hba.conf,以及 ~/.pgpass 是否输入正确的用户密码。没找到问题,然后重新初始化数据库部署了一次,就好了。

    常用命令

    参考:

  3. https://repmgr.org/docs/repmgr.html

  4. http://www.cslingjun.com/2019/11/27/PostgreSQL%E9%AB%98%E5%8F%AF%E7%94%A8%E9%9B%86%E7%BE%A4repmgr/