Story of recovering a PostgreSQL 9.2 database
It was a 2 node PostgreSQL 9.2 streaming replication setup. The master node crashed without possibility of recovery. Thus, there was a need to restore the standby database to resume the production services. The standby database also reboot when master node crashed. The standby node was performing database backup via pg_start_backup when it rebooted.
I will simulate the recovery below.
wal_level = hot_standby archive_mode = on archive_command = 'cp %p /var/lib/pgsql/9.2/data/pg_archive/%f'
standby_mode = 'on' primary_conninfo = 'host=10.1.0.3 port=5432 user=postgres application_name=testdb keepalives_idle=60 keepalives_interval=5 keepalives_count=5' restore_command = 'cp /var/lib/pgsql/9.2/data/pg_archive/%f %p' recovery_target_timeline = 'latest'
Start PostgreSQL backup on the database
postgres=# select pg_start_backup('test1'); pg_start_backup ----------------- 0/40000B8 (1 row)
Abort PostgreSQL database and start it again. (Notice we are not able to login the psql)
-bash-4.1$ /usr/pgsql-9.2/bin/pg_ctl -D /var/lib/pgsql/9.2/data -m immediate stop waiting for server to shut down... done server stopped -bash-4.1$ /usr/pgsql-9.2/bin/pg_ctl -D /var/lib/pgsql/9.2/data -l logfile start server starting -bash-4.1$ psql psql: FATAL: the database system is starting up
Notice that the startup process is waiting for a WAL file.
-bash-4.1$ ps -aef | grep postgres root 2405 2283 0 18:19 pts/1 00:00:00 su - postgres postgres 2407 2405 0 18:19 pts/1 00:00:00 -bash postgres 2921 1 0 19:21 pts/1 00:00:00 /usr/pgsql-9.2/bin/postgres -D /var/lib/pgsql/9.2/data postgres 2923 2921 0 19:21 ? 00:00:00 postgres: logger process postgres 2924 2921 0 19:21 ? 00:00:00 postgres: startup process waiting for 000000010000000000000004 postgres 2925 2921 0 19:21 ? 00:00:00 postgres: checkpointer process postgres 2926 2921 0 19:21 ? 00:00:00 postgres: writer process
Error logs in /var/log/messages
FATAL: the database system is starting up FATAL: the database system is starting up FATAL: could not connect to the primary server: could not connect to server: Connection timed out Is the server running on host "10.1.0.3" and accepting TCP/IP connections on port 5432? cp: cannot stat `/var/lib/pgsql/9.2/data/pg_archive/000000010000000000000004': No such file or directory cp: cannot stat `/var/lib/pgsql/9.2/data/pg_archive/000000010000000000000004': No such file or directory cp: cannot stat `/var/lib/pgsql/9.2/data/pg_archive/00000002.history': No such file or directory LOG: received smart shutdown request FATAL: terminating walreceiver process due to administrator command LOG: shutting down LOG: database system is shut down LOG: database system was shut down in recovery at 2016-08-03 19:17:49 SGT cp: cannot stat `/var/lib/pgsql/9.2/data/pg_archive/00000002.history': No such file or directory LOG: entering standby mode LOG: restored log file "000000010000000000000004" from archive LOG: redo starts at 0/4000020 LOG: record with zero length at 0/40000B8 LOG: record with zero length at 0/40000B8 cp: cannot stat `/var/lib/pgsql/9.2/data/pg_archive/00000002.history': No such file or directory FATAL: the database system is starting up
The WAL file that it was waiting for existed in pg_xlog but not in pg_archive.
-bash-4.1$ ls /var/lib/pgsql/9.2/data/pg_archive 000000010000000000000002 000000010000000000000003 -bash-4.1$ ls /var/lib/pgsql/9.2/data/pg_xlog 000000010000000000000002 000000010000000000000003 000000010000000000000004
Solution: Remove recovery.conf and restart PostgreSQL. Noticed the we didn’t have to use pg_stop_backup to remove the backup label.
-bash-4.1$ mv recovery.conf recovery.done -bash-4.1$ /usr/pgsql-9.2/bin/pg_ctl -D /var/lib/pgsql/9.2/data -l logfile start server starting -bash-4.1$ psql psql (9.2.17) Type "help" for help. postgres=# select pg_stop_backup(); ERROR: a backup is not in progress
Verification (Formatted for readability)
-bash-4.1$ /usr/pgsql-9.2/bin/pg_controldata Database cluster state: in production pg_control last modified: Wed 03 Aug 2016 07:22:27 PM SGT Latest checkpoint location: 0/40000B8 Prior checkpoint location: 0/4000058 Latest checkpoint's REDO location: 0/40000B8 Minimum recovery ending location: 0/0 Backup start location: 0/4000020 Backup end location: 0/0 End-of-backup record required: no Current wal_level setting: hot_standby
There is a few learning points from the story.
- When a database crashed while it is running pg_start_backup() mode, after it reboots, it is not required to run pg_stop_backup().
- The reason why it was waiting for the WAL file that was in pg_xlog is due to recovery.conf. The restore command in recovery.conf was pointing to pg_archive to restore the WAL file.
- It was a crash recovery, we didn’t have to use recovery.conf to restore the database. recovery.conf is only used when we are restoring the database from backup