User Tools

Site Tools


postgresql_failback_streaming_replication

Bibliographie

Abstract

There is 2 ways to fix a posgresql slave

First way : rsync (quicker than the other, but less safe)

second way : pg_backup

Check the version of the master and the slave in production

su - postgres
pg_lsclusters

rsync methodology

for this script we have the postgresql datadir in /data but it could be in /var/lib please check before doing something wrong

On master and Slave :

apt-get install rsync 

put this script replication_pgsql.sh in the $homedir of the postgres user master

#!/bin/sh
datadir='/data/postgresql/9.6/main/'
slave='pgsql-stats-prod-ro-mk-001'
time /usr/bin/rsync \
     --rsh="ssh -2 -l postgres" \
     --verbose \
     --progress \
     --stats \
     --archive \
     --delete \
     /data/postgresql/9.6/main/ $slave:/data/postgresql/9.6/main/

time /usr/bin/rsync \
     --rsh="ssh -2 -l postgres" \
     --verbose \
     --progress \
     --stats \
     --archive \
     --delete \
     /data/postgresql/walshipping/9.6/main/ $slave:/data/postgresql/walshipping/9.6/main/

We need that the listen directive to be on stars : postgresql.conf

listen_addresses = '*'

wal logs need to be active on both master and slave

wal_level = hot_standby
max_wal_senders = 5
wal_keep_segments = 32

Alternative postgresql 9.6 and > :

wal_level = replica

Slave need to be allowed to replicate on the master : pg_hba.conf

host replication postgres 10.191.1.0/24 trust

Reload on the master to take in count the new pg_hba.conf

pg_ctlcluster 9.6 main reload

Postgresql on the slave need to be off

/etc/init.d/postgresql stop

On the master

su - postgres
psql
SELECT pg_start_backup('checkpoint1');

on the slave we clean the datadir WARNING TOU CAN LOOSE DATA IF YOU FAILED HERE

rm -rf /data/postgresql/9.6/main/*

On the Master

su - postgres
sh replication_pgsql.sh

When the rsync is finish you can come back on the psql master shell and say :

SELECT pg_stop_backup();

On slave you should be able to launch the postgresql

/etc/init.t/postgresql start

If all is ok you should see a streaming process son of postgresql master process on slave

postgres 25526  0.0  0.1 420428 90528 ?        S    07:40   0:00 /usr/lib/postgresql/9.6/bin/postgres -D /data/postgresql/9.6/main -c config_file=/etc/postgre
postgres 25527 10.3  0.2 423116 153848 ?       Ss   07:40  54:46  \_ postgres: 9.6/main: startup process   recovering 0000000300008D1B00000056
postgres 25544  0.3  0.2 421604 140280 ?       Ss   07:40   1:52  \_ postgres: 9.6/main: checkpointer process   
postgres 25545  1.0  0.2 420568 137724 ?       Ss   07:40   5:34  \_ postgres: 9.6/main: writer process   
postgres 27059  0.0  0.0 143056  3844 ?        Ss   10:57   0:05  \_ postgres: 9.6/main: stats collector process   
postgres 27060  1.3  0.0 425248 11064 ?        Ss   10:57   4:24  \_ postgres: 9.6/main: wal receiver process   streaming 8D1B/562937C0

pg_backup

On slave you have to clean the datadir (see how to do that in rsync procedure)

rm -rf /data/postgresql/9.6/main/*

Postgresql >= 9.6 and after on the slave in postgres user :

pg_basebackup -h hostname_of_the_master -D /data/postgresql/9.6/main/ -R -P -U postgres --xlog-method=stream

Postgresql >= 10 and after on the slave in postgres user :

pg_basebackup -h hostname_of_the_master -D /data/postgresql/9.6/main/ -R -P -U postgres --wal-method=stream

on the master force a checkpoint for the backup to launch

su - postgres
psql
CHECKPOINT;

The backup of the master data will be deployed on the slave (should take a lot of time) Be sure to have enough space on the slave to host all the datadir space !

When the pg_backup restored is finish you'll get a very long restoration on the slave :

 postgres: 9.6/main: startup process   recovering 0000000300008D1B00000056

Once the recovering is finish you'll get a streaming replication from the slave :

postgres 25526  0.0  0.1 420428 90528 ?        S    07:40   0:00 /usr/lib/postgresql/9.6/bin/postgres -D /data/postgresql/9.6/main -c config_file=/etc/postgre
postgres 25527  9.9  0.2 423116 153848 ?       Ss   07:40  55:00  \_ postgres: 9.6/main: startup process   recovering 0000000300008D1B00000087
postgres 25544  0.3  0.2 421604 140280 ?       Ss   07:40   1:54  \_ postgres: 9.6/main: checkpointer process   
postgres 25545  1.0  0.2 420568 137724 ?       Ss   07:40   5:51  \_ postgres: 9.6/main: writer process   
postgres 27059  0.0  0.0 143056  3844 ?        Ss   10:57   0:05  \_ postgres: 9.6/main: stats collector process   
postgres 27060  1.2  0.0 425248 11064 ?        Ds   10:57   4:37  \_ postgres: 9.6/main: wal receiver process   streaming 8D1B/87ECCEF8
postgres 29494  0.0  0.0 421716 13680 ?        Ss   16:33   0:00  \_ postgres: 9.6/main: postgres postgres [local] idle

debug and check all is ok

On master :

postgres=# **SELECT * from pg_stat_replication;**
  pid  | usesysid | usename  | application_name | client_addr | client_hostname | client_port |         backend_start         | backend_xmin |  state  | sent_location | write_location | flush_location | replay_location | sync_priority | sync_state
-------+----------+----------+------------------+-------------+-----------------+-------------+-------------------------------+--------------+---------+---------------+----------------+----------------+-----------------+---------------+------------
 20985 |       10 | postgres | walreceiver      | 10.24.0.xx  |                 |       57756 | 2019-04-17 10:57:22.725532+02 |              | catchup | 8D18/47A00000 | 8D18/474E0000  | 8D18/47000000  | 8D18/46FFFA60   |             0 | async
(1 row)

check the binary position on master and diff with the slave :

postgres=# SELECT pg_current_xlog_location();
 pg_current_xlog_location 
--------------------------
 8D1B/6EC44000
(1 row)

On slave :

 postgres=# **SELECT pg_last_xlog_receive_location();**
 pg_last_xlog_receive_location 
-------------------------------
 8D1B/654D0A70
(1 row)
to summerize
SELECT pg_current_xlog_location(); --- on master
SELECT pg_last_xlog_receive_location(); --- on slave
SELECT pg_last_xlog_replay_location(); --- on slave

Monitoring small easy script

#!/bin/sh
# si la replication a plus de 600s entre deux checkpoint alors elle est cassé
timeout=600
control='/usr/lib/postgresql/9.xx/bin/pg_controldata'
datadir='/var/lib/postgresql/9.xx/main/'
last_check=`$control $datadir | grep 'Time of latest checkpoint' | awk '{print $5,$6,$7,$8,$9}' | date +%s`
now=`date +%s`
if [ `expr $now - $last_check` -gt $timeout ]; then
echo Replication cassé
else
echo replication ok
fi

to know timelag in pgsql replication

SELECT now() - pg_last_xact_replay_timestamp() AS time_lag;
postgresql_failback_streaming_replication.txt · Last modified: 2019/04/17 16:54 by bragon