repmgr教程

简介

repmgr是一个开源工具套件，用于管理PostgreSQL服务器集群中的复制和故障转移。它使用工具来增强PostgreSQL的内置热备份功能，以设置备用服务器，监控复制以及执行管理任务，例如故障转移或手动切换操作。

前期准备

两台装有 postgresql 9.5 的服务器。

1. 安装

1.1 下载源码包https://repmgr.org/download/repmgr-5.0.0.tar.gz
1.2 解压

1	tar -zxvf repmgr-5.0.0.tar.gz

1.3 进入目录并编译
安装编译所需库

flex
libedit-dev
libkrb5-dev
libpam0g-dev
libreadline-dev
libselinux1-dev
libssl-dev
libxml2-dev

libxslt1-dev

1	apt install flex libedit-dev libkrb5-dev libpam0g-dev libreadline-dev libselinux1-dev libssl-dev libxml2-dev libxslt1-dev

1	./configure && make install

2. 配置

可复制 sample, sample 位于解压的根目录 repmgr.conf.sample。

1	cp repmgr.conf.sample /etc/postgresql/repmgr.conf

修改以下关键信息
主

node_id=1
node_name='138'	
conninfo='host=db1 user=postgres dbname=postgres password=d connect_timeout=2'	
data_directory='/var/lib/postgresql/9.5/main'
pg_bindir='/usr/lib/postgresql/9.5/bin'		
repmgr_bindir='/usr/lib/postgresql/9.5/bin'

从

node_id=2
node_name='139'	
conninfo='host=db2 user=postgres dbname=postgres password=d connect_timeout=2'	
data_directory='/var/lib/postgresql/9.5/main'
pg_bindir='/usr/lib/postgresql/9.5/bin'		
repmgr_bindir='/usr/lib/postgresql/9.5/bin'

3. 从库复制主库

主库中创建extension,psql 连接数据库后输入

1	create extension repmgr

创建流复制角色

1	CREATE ROLE repl login replication encrypted password 'd'

3.1 修改主库配置
postgresql.conf, 和一般的流复制配置无异

listen_addresses = '*' # 监听来自任何地址的请求
archive_mode = on # 开启归档
archive_command = 'cp %p /var/lib/postgresql/pg_archive/%f' # 归档 logfile segment 命令 %p 代表需要归档的文件路径，%f 归档文件名，此目录需要属于postgres角色，一面权限不足。
wal_level = hot_standby  # 归档开启热备用   minimal, archive, hot_standby, or logical
max_wal_senders = 10 # 最大wal发送者数量。通常和从库数量一致。
wal_keep_segments = 256 # 保留的最大日志文件段。一个16MB。
wal_sender_timeout = 60s # 发送超时时间
max_connections = 100 # 这个设置要注意下，从库的max_connections必须要大于主库的。（不知道）。最大连接数。
wal_log_hints = on  # 不然会导致主服务器挂掉之后恢复出问题。pg_rewind
# 从库配置
hot_standby = on  # 在恢复期间是否允许查询
max_standby_archive_delay = 30s # 读取 archive WAL时，取消查询时的最大延迟，-1 无限
#max_standby_streaming_delay = 30s # 读取streaming wal 取消查询的最大延迟，-1 无限
wal_receiver_status_interval = 10s # 发送回复的间隔
hot_standby_feedback = off  # 从 standby 状态发送信息以防止冲突
wal_receiver_timeout = 60s # 接收 master 等待的时间，默认 milliseconds
wal_retrieve_retry_interval = 5s # 尝试失败后重新检索 WAL 的间隔

pg_hba.conf

# local      DATABASE  USER  METHOD  [OPTIONS]
# host       DATABASE  USER  ADDRESS  METHOD  [OPTIONS]
host    replication     repl        0.0.0.0/0            md5
host    replication     postgres        0.0.0.0/0            md5

3.2 主库注册到 repmgr,并查看

repmgr -f /etc/postgresql/repmgr.conf primary register
INFO: connecting to primary database...
NOTICE: attempting to install extension "repmgr"
NOTICE: "repmgr" extension successfully installed
NOTICE: primary node record (ID: 1) registered

repmgr cluster show -f /etc/postgresql/repmgr.conf
 ID | Name | Role    | Status    | Upstream | Location | Priority | Connection string
----+------+---------+-----------+----------+----------+----------+---------------------------------------------------------------------
 1  | 138  | primary | * running |          | default  | 100      | host=db1 user=postgres dbname=postgres password=d connect_timeout=2

3.3 从库配置
将密码连接信息存储在.pgpass文件

1	vim ~/.pgpass

输入以下信息，格式为 host:port:DBName:username:password

1 2	db1:5432:postgres:postgres:d db2:5432:postgres:postgres:d

然后设置权限chmod 600 ~/.pgpass
clone 执行，执行前需要清空 pgdata 目录

1	repmgr -h db1 -U postgres -d postgres -f /etc/postgresql/repmgr.conf standby clone

你会发现目录生成了，且 recovery.conf 也生成了。
启动从库

1	pg_ctl -D . start

然后注册从库

repmgr standby register -f /etc/postgresql/repmgr.conf
INFO: connecting to local node "139" (ID: 2)
INFO: connecting to primary database
WARNING: --upstream-node-id not supplied, assuming upstream node is primary (node ID 1)
INFO: standby registration complete
NOTICE: standby node "139" (ID: 2) successfully registered

查看节点

repmgr cluster show -f /etc/postgresql/repmgr.conf
 ID | Name | Role    | Status    | Upstream | Location | Priority | Connection string
----+------+---------+-----------+----------+----------+----------+---------------------------------------------------------------------
 1  | 138  | primary | * running |          | default  | 100      | host=db1 user=postgres dbname=postgres password=d connect_timeout=2
 2  | 139  | standby |   running | 138      | default  | 100      | host=db2 user=postgres dbname=postgres password=d connect_timeout=2

4.配置自动failover(如果使用了 pgpool failover，则不要使用 repmgr 的failover)

因为自动 failover 基于repmgrd，首先需要包含其关联库,postgresql.conf,配置下面一行,需要重启
1
shared_preload_libraries ='repmgr'
主从都需要重启:
1
pg_ctl -D . restart

配置repmgr.conf,配置在上面配置的基础上添加（主从都需要）

failover=automatic # 开启自动故障转移
promote_command='/usr/lib/postgresql/9.5/bin/repmgr standby promote -f /etc/postgresql/repmgr.conf --log-to-file'
follow_command='/usr/lib/postgresql/9.5/bin/repmgr standby follow -f /etc/postgresql/repmgr.conf --log-to-file --upstream-node-id=%n'
#上面这两个命令必须绝对路径

主备均要启动 repmgrd 服务
配置如果改了，则需要 kill 掉，然后重新启动.
1
repmgrd -f /etc/postgresql/repmgr.conf
测试
1. 主库挂掉
  1
  pg_ctl -D . stop

2 副库则看到

[2020-02-13 20:14:49] [INFO] checking state of node 1, 6 of 6 attempts
[2020-02-13 20:14:49] [WARNING] unable to ping "user=postgres password=d connect_timeout=2 dbname=postgres host=db1 fallback_application_name=repmgr"
[2020-02-13 20:14:49] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2020-02-13 20:14:49] [WARNING] unable to reconnect to node 1 after 6 attempts
[2020-02-13 20:14:49] [INFO] 0 active sibling nodes registered
[2020-02-13 20:14:49] [INFO] primary and this node have the same location ("default")
[2020-02-13 20:14:49] [INFO] no other sibling nodes - we win by default
[2020-02-13 20:14:49] [NOTICE] this node is the only available candidate and will now promote itself
[2020-02-13 20:14:49] [INFO] promote_command is:
"/usr/lib/postgresql/9.5/bin/repmgr standby promote -f /etc/postgresql/repmgr.conf --log-to-file"
[2020-02-13 20:14:49] [NOTICE] promoting standby to primary
[2020-02-13 20:14:49] [DETAIL] promoting server "139" (ID: 2) using "/usr/lib/postgresql/9.5/bin/pg_ctl  -w -D '/var/lib/postgresql/9.5/main' promote"
LOG:  received promote request
[2020-02-13 20:14:49] [NOTICE] LOG:  redo done at 0/5000028
waiting up to 60 seconds (parameter "promote_check_timeout") for promotion to complete
LOG:  last completed transaction was at log time 2020-02-13 04:13:29.417036-08
LOG:  selected new timeline ID: 2
LOG:  archive recovery complete
LOG:  MultiXact member wraparound protections are now enabled
LOG:  database system is ready to accept connections
LOG:  autovacuum launcher started
[2020-02-13 20:14:50] [NOTICE] STANDBY PROMOTE successful
[2020-02-13 20:14:50] [DETAIL] server "139" (ID: 2) was successfully promoted to primary
[2020-02-13 20:14:50] [INFO] 0 followers to notify
[2020-02-13 20:14:50] [INFO] switching to primary monitoring mode
[2020-02-13 20:14:50] [NOTICE] monitoring cluster primary "139" (ID: 2)

可以看到副库已经变成了主库

postgres@ubuntu:~/9.5/main$ repmgr cluster show -f /etc/postgresql/repmgr.conf
 ID | Name | Role    | Status    | Upstream | Location | Priority | Connection string
----+------+---------+-----------+----------+----------+----------+---------------------------------------------------------------------
 1  | 138  | primary | - failed  |          | default  | 100      | host=db1 user=postgres dbname=postgres password=d connect_timeout=2
 2  | 139  | primary | * running |          | default  | 100      | host=db2 user=postgres dbname=postgres password=d connect_timeout=2

WARNING: following issues were detected
  - unable to connect to node "138" (ID: 1)

3 恢复
也就是将主库设置为新的从库
创建 recovery.conf 文件。
内部配置和备用库基本一样

1
2
3

standby_mode = 'on' # 热备份mode
primary_conninfo = 'user=repl password=d host=192.168.56.139 port=5432 sslmode=prefer sslcompression=1 krbsrvname=postgres' #  连接信息
recovery_target_timeline = 'latest' # 复制最新数据

启动

1	pg_ctl start

可以看到

postgres@ubuntu:~/9.5/main$ repmgr cluster show -f /etc/postgresql/repmgr.conf
 ID | Name | Role    | Status    | Upstream | Location | Priority | Connection string
----+------+---------+-----------+----------+----------+----------+--------------------------------------------------------------------                      -
 1  | 138  | standby |   running |          | default  | 100      | host=db1 user=postgres dbname=postgres password=d connect_timeout=2
 2  | 139  | primary | * running |          | default  | 100      | host=db2 user=postgres dbname=postgres password=d connect_timeout=2

成功切换

问题

apt 安装后并配置数据库会导致启动数据库时could not access file “repmgr”: No such file or directory
It means the repmgr extension code is not installed in the PostgreSQL application directory. This typically happens when using PostgreSQL packages provided by a third-party vendor, which often have different filesystem layouts.

Either use PostgreSQL packages provided by the community or 2ndQuadrant; if this is not possible, contact your vendor for assistance.
简单就是说用的可能时第三方的包，还是从官网下载包并编译安装吧。

提示未输入密码

ERROR: connection to database failed
DETAIL:
fe_sendauth: no password supplied

ERROR: unable to establish necessary replication connections
HINT: check replication permissions on the source server

检查 pg_hba.conf，以及 ~/.pgpass 是否输入正确的用户密码。没找到问题，然后重新初始化数据库部署了一次，就好了。

常用命令

参考:

https://repmgr.org/docs/repmgr.html
http://www.cslingjun.com/2019/11/27/PostgreSQL%E9%AB%98%E5%8F%AF%E7%94%A8%E9%9B%86%E7%BE%A4repmgr/

冬暖夏凉

pg_repmgr