Story of recovering a PostgreSQL 9.2 database

Story of recovering a PostgreSQL 9.2 database

Background Story

It was a 2 node PostgreSQL 9.2 streaming replication setup. The master node crashed without possibility of recovery. Thus, there was a need to restore the standby database to resume the production services. The standby database also reboot when master node crashed. The standby node was performing database backup via pg_start_backup when it rebooted.

I will simulate the recovery below.


postgresql.conf

wal_level = hot_standby
archive_mode = on
archive_command = 'cp %p /var/lib/pgsql/9.2/data/pg_archive/%f'

recovery.conf

standby_mode = 'on'
primary_conninfo = 'host=10.1.0.3 port=5432 user=postgres application_name=testdb keepalives_idle=60 keepalives_interval=5 keepalives_count=5'
restore_command = 'cp /var/lib/pgsql/9.2/data/pg_archive/%f %p'
recovery_target_timeline = 'latest'

Start PostgreSQL backup on the database

postgres=# select pg_start_backup('test1');
 pg_start_backup
-----------------
 0/40000B8
(1 row)

Abort PostgreSQL database and start it again. (Notice we are not able to login the psql)

-bash-4.1$ /usr/pgsql-9.2/bin/pg_ctl -D /var/lib/pgsql/9.2/data -m immediate stop
waiting for server to shut down... done
server stopped
-bash-4.1$ /usr/pgsql-9.2/bin/pg_ctl -D /var/lib/pgsql/9.2/data -l logfile start
server starting
-bash-4.1$ psql
psql: FATAL: the database system is starting up

Notice that the startup process is waiting for a WAL file.

-bash-4.1$ ps -aef | grep postgres
root 2405 2283 0 18:19 pts/1 00:00:00 su - postgres
postgres 2407 2405 0 18:19 pts/1 00:00:00 -bash
postgres 2921 1 0 19:21 pts/1 00:00:00 /usr/pgsql-9.2/bin/postgres -D /var/lib/pgsql/9.2/data
postgres 2923 2921 0 19:21 ? 00:00:00 postgres: logger process
postgres 2924 2921 0 19:21 ? 00:00:00 postgres: startup process waiting for 000000010000000000000004
postgres 2925 2921 0 19:21 ? 00:00:00 postgres: checkpointer process
postgres 2926 2921 0 19:21 ? 00:00:00 postgres: writer process

Error logs in /var/log/messages

FATAL: the database system is starting up
FATAL: the database system is starting up
FATAL: could not connect to the primary server: could not connect to server: Connection timed out
 Is the server running on host "10.1.0.3" and accepting
 TCP/IP connections on port 5432?

cp: cannot stat `/var/lib/pgsql/9.2/data/pg_archive/000000010000000000000004': No such file or directory
cp: cannot stat `/var/lib/pgsql/9.2/data/pg_archive/000000010000000000000004': No such file or directory
cp: cannot stat `/var/lib/pgsql/9.2/data/pg_archive/00000002.history': No such file or directory
LOG: received smart shutdown request
FATAL: terminating walreceiver process due to administrator command
LOG: shutting down
LOG: database system is shut down
LOG: database system was shut down in recovery at 2016-08-03 19:17:49 SGT
cp: cannot stat `/var/lib/pgsql/9.2/data/pg_archive/00000002.history': No such file or directory
LOG: entering standby mode
LOG: restored log file "000000010000000000000004" from archive
LOG: redo starts at 0/4000020
LOG: record with zero length at 0/40000B8
LOG: record with zero length at 0/40000B8
cp: cannot stat `/var/lib/pgsql/9.2/data/pg_archive/00000002.history': No such file or directory
FATAL: the database system is starting up

The WAL file that it was waiting for existed in pg_xlog but not in pg_archive.

-bash-4.1$ ls /var/lib/pgsql/9.2/data/pg_archive
000000010000000000000002 000000010000000000000003

-bash-4.1$ ls /var/lib/pgsql/9.2/data/pg_xlog
000000010000000000000002 000000010000000000000003 000000010000000000000004

Solution: Remove recovery.conf and restart PostgreSQL. Noticed the we didn’t have to use pg_stop_backup to remove the backup label.

-bash-4.1$ mv recovery.conf recovery.done
-bash-4.1$ /usr/pgsql-9.2/bin/pg_ctl -D /var/lib/pgsql/9.2/data -l logfile start
server starting
-bash-4.1$ psql
psql (9.2.17)
Type "help" for help.

postgres=# select pg_stop_backup();
ERROR: a backup is not in progress

Verification (Formatted for readability)

-bash-4.1$ /usr/pgsql-9.2/bin/pg_controldata
Database cluster state: in production
pg_control last modified: Wed 03 Aug 2016 07:22:27 PM SGT
Latest checkpoint location: 0/40000B8
Prior checkpoint location: 0/4000058
Latest checkpoint's REDO location: 0/40000B8
Minimum recovery ending location: 0/0
Backup start location: 0/4000020
Backup end location: 0/0
End-of-backup record required: no
Current wal_level setting: hot_standby

Summary

There is a few learning points from the story.

  • When a database crashed while it is running pg_start_backup() mode, after it reboots, it is not required to run pg_stop_backup().
  • The reason why it was waiting for the WAL file that was in pg_xlog is due to recovery.conf. The restore command in recovery.conf was pointing to pg_archive to restore the WAL file.
  • It was a crash recovery, we didn’t have to use recovery.conf to restore the database. recovery.conf is only used when we are restoring the database from backup

Regards,
Wei Shan

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s