Archive for category PostgreSQL

PostgreSQL 9.6 on RHEL 7.2 – FATAL: semctl(SETVAL) failed: Invalid argument

The error below started occurring after we upgraded to PostgreSQL 9.6 on RHEL 7.2. It was strange because all we did was have a user log in, check the status of pgsql, and log out.

 Dec 7 22:50:01 hostname systemd: Created slice user-1004.slice.
 Dec 7 22:50:01 hostname systemd: Starting user-1004.slice.
 Dec 7 22:50:01 hostname systemd: Started Session 2243 of user postgres.
 Dec 7 22:50:01 hostname systemd: Starting Session 2243 of user postgres.
 Dec 7 22:50:01 hostname systemd: Created slice user-0.slice.
 Dec 7 22:50:01 hostname systemd: Starting user-0.slice.
 Dec 7 22:50:01 hostname systemd: Started Session 2244 of user root.
 Dec 7 22:50:01 hostname systemd: Starting Session 2244 of user root.
 Dec 7 22:50:01 hostname postgres[12064]: [432-1] [unknown]: [local]: 58481389.2f20: LOG: connection received: host=[local]
 Dec 7 22:50:01 hostname postgres[12064]: [433-1] postgres: [local]: 58481389.2f20: FATAL: semctl(153976863, 2, SETVAL, 0) failed: Invalid argument
 Dec 7 22:50:01 hostname postgres[21799]: [432-1] : : 58444abd.5527: LOG: server process (PID 12064) exited with exit code 1
 Dec 7 22:50:01 hostname postgres[21799]: [433-1] : : 58444abd.5527: LOG: terminating any other active server processes
 Dec 7 22:50:01 hostname postgres[11514]: [434-1] repmgr: <IP_ADDRESS>(41171): 58480edb.2cfa: WARNING: terminating connection because of crash of another server process
 Dec 7 22:50:01 hostname postgres[11514]: [434-2] repmgr: <IP_ADDRESS>(41171): 58480edb.2cfa: DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
 Dec 7 22:50:01 hostname postgres[11514]: [434-3] repmgr: <IP_ADDRESS>(41171): 58480edb.2cfa: HINT: In a moment you should be able to reconnect to the database and repeat your command.
 Dec 7 22:50:01 hostname postgres[12074]: [434-1] [unknown]: ::1(48077): 58481389.2f2a: LOG: connection received: host=::1 port=48077
 Dec 7 22:50:01 hostname postgres[12074]: [435-1] repmgr: ::1(48077): 58481389.2f2a: FATAL: the database system is in recovery mode
 Dec 7 22:50:01 hostname Backup: [INFO]DB check start.
 Dec 7 22:50:01 hostname postgres[21799]: [434-1] : : 58444abd.5527: LOG: all server processes terminated; reinitializing
 Dec 7 22:50:01 hostname postgres[21799]: [435-1] : : 58444abd.5527: LOG: could not remove shared memory segment "/PostgreSQL.1721076088": No such file or directory
 Dec 7 22:50:01 hostname postgres[21799]: [436-1] : : 58444abd.5527: LOG: semctl(152961024, 0, IPC_RMID, ...) failed: Invalid argument
 Dec 7 22:50:01 hostname postgres[21799]: [437-1] : : 58444abd.5527: LOG: semctl(152993793, 0, IPC_RMID, ...) failed: Invalid argument

After researching the issue, we found that it is a known bug :).

Resolution

1) Set RemoveIPC=no in /etc/systemd/logind.conf

2) Reboot the server or restart systemd-logind as follows:
 # systemctl daemon-reload
 # systemctl restart systemd-logind
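
The logind.conf edit itself can be scripted. Below is a minimal sketch that works on a local copy of the file so it is safe to try anywhere; an admin would apply the same edit to /etc/systemd/logind.conf itself before reloading systemd as shown above. The sample config contents are an assumption for illustration.

```shell
# A minimal sketch of the logind.conf edit, done on a local copy here;
# apply the same edit to /etc/systemd/logind.conf on a real system.
conf=./logind.conf
printf '[Login]\n#RemoveIPC=yes\n' > "$conf"   # sample existing config

# Replace an existing RemoveIPC line (commented out or not), or append one.
if grep -q '^#\{0,1\}RemoveIPC=' "$conf"; then
    sed -i 's/^#\{0,1\}RemoveIPC=.*/RemoveIPC=no/' "$conf"
else
    echo 'RemoveIPC=no' >> "$conf"
fi

grep RemoveIPC "$conf"   # now prints: RemoveIPC=no
```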

Reference:

  1. https://bugzilla.redhat.com/show_bug.cgi?id=1287618
  2. https://bugzilla.redhat.com/show_bug.cgi?id=1264533

Note:

  • Please note that I have seen this RHEL 7 issue with the Oracle database too. It’s a RHEL 7 bug, so it should apply to most databases that use System V IPC calls.

Hope this helps.

Regards,
Wei Shan

Story of recovering a PostgreSQL 9.2 database

Background Story

It was a 2-node PostgreSQL 9.2 streaming replication setup. The master node crashed with no possibility of recovery, so we needed to restore the standby database to resume production services. The standby node also rebooted when the master crashed, and at the time it was in the middle of a backup started via pg_start_backup.

I will simulate the recovery below.


postgresql.conf

wal_level = hot_standby
archive_mode = on
archive_command = 'cp %p /var/lib/pgsql/9.2/data/pg_archive/%f'

recovery.conf

standby_mode = 'on'
primary_conninfo = 'host=10.1.0.3 port=5432 user=postgres application_name=testdb keepalives_idle=60 keepalives_interval=5 keepalives_count=5'
restore_command = 'cp /var/lib/pgsql/9.2/data/pg_archive/%f %p'
recovery_target_timeline = 'latest'

Start PostgreSQL backup on the database

postgres=# select pg_start_backup('test1');
 pg_start_backup
-----------------
 0/40000B8
(1 row)

Abort the PostgreSQL database and start it again. (Notice that we are not able to log in via psql.)

-bash-4.1$ /usr/pgsql-9.2/bin/pg_ctl -D /var/lib/pgsql/9.2/data -m immediate stop
waiting for server to shut down... done
server stopped
-bash-4.1$ /usr/pgsql-9.2/bin/pg_ctl -D /var/lib/pgsql/9.2/data -l logfile start
server starting
-bash-4.1$ psql
psql: FATAL: the database system is starting up

Notice that the startup process is waiting for a WAL file.

-bash-4.1$ ps -aef | grep postgres
root 2405 2283 0 18:19 pts/1 00:00:00 su - postgres
postgres 2407 2405 0 18:19 pts/1 00:00:00 -bash
postgres 2921 1 0 19:21 pts/1 00:00:00 /usr/pgsql-9.2/bin/postgres -D /var/lib/pgsql/9.2/data
postgres 2923 2921 0 19:21 ? 00:00:00 postgres: logger process
postgres 2924 2921 0 19:21 ? 00:00:00 postgres: startup process waiting for 000000010000000000000004
postgres 2925 2921 0 19:21 ? 00:00:00 postgres: checkpointer process
postgres 2926 2921 0 19:21 ? 00:00:00 postgres: writer process

Error logs in /var/log/messages

FATAL: the database system is starting up
FATAL: the database system is starting up
FATAL: could not connect to the primary server: could not connect to server: Connection timed out
 Is the server running on host "10.1.0.3" and accepting
 TCP/IP connections on port 5432?

cp: cannot stat `/var/lib/pgsql/9.2/data/pg_archive/000000010000000000000004': No such file or directory
cp: cannot stat `/var/lib/pgsql/9.2/data/pg_archive/000000010000000000000004': No such file or directory
cp: cannot stat `/var/lib/pgsql/9.2/data/pg_archive/00000002.history': No such file or directory
LOG: received smart shutdown request
FATAL: terminating walreceiver process due to administrator command
LOG: shutting down
LOG: database system is shut down
LOG: database system was shut down in recovery at 2016-08-03 19:17:49 SGT
cp: cannot stat `/var/lib/pgsql/9.2/data/pg_archive/00000002.history': No such file or directory
LOG: entering standby mode
LOG: restored log file "000000010000000000000004" from archive
LOG: redo starts at 0/4000020
LOG: record with zero length at 0/40000B8
LOG: record with zero length at 0/40000B8
cp: cannot stat `/var/lib/pgsql/9.2/data/pg_archive/00000002.history': No such file or directory
FATAL: the database system is starting up

The WAL file that it was waiting for existed in pg_xlog but not in pg_archive.

-bash-4.1$ ls /var/lib/pgsql/9.2/data/pg_archive
000000010000000000000002 000000010000000000000003

-bash-4.1$ ls /var/lib/pgsql/9.2/data/pg_xlog
000000010000000000000002 000000010000000000000003 000000010000000000000004
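
In hindsight, another way to unblock the startup process (instead of removing recovery.conf, as done below) would have been to copy the missing WAL segment from pg_xlog into pg_archive so that restore_command could find it. A sketch of that idea, simulated here with local directories standing in for $PGDATA/pg_xlog and $PGDATA/pg_archive:

```shell
# Hypothetical alternative fix: copy any WAL segment that exists in pg_xlog
# but not in pg_archive, so restore_command can find it. Simulated with
# empty stand-in files instead of real 16MB WAL segments.
mkdir -p pg_xlog pg_archive
touch pg_xlog/000000010000000000000002 \
      pg_xlog/000000010000000000000003 \
      pg_xlog/000000010000000000000004
touch pg_archive/000000010000000000000002 pg_archive/000000010000000000000003

for wal in pg_xlog/*; do
    f=$(basename "$wal")
    [ -e "pg_archive/$f" ] || cp "$wal" "pg_archive/$f"
done

ls pg_archive   # now also contains 000000010000000000000004
```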

Solution: Remove recovery.conf and restart PostgreSQL. Notice that we didn’t have to use pg_stop_backup to remove the backup label.

-bash-4.1$ mv recovery.conf recovery.done
-bash-4.1$ /usr/pgsql-9.2/bin/pg_ctl -D /var/lib/pgsql/9.2/data -l logfile start
server starting
-bash-4.1$ psql
psql (9.2.17)
Type "help" for help.

postgres=# select pg_stop_backup();
ERROR: a backup is not in progress

Verification (Formatted for readability)

-bash-4.1$ /usr/pgsql-9.2/bin/pg_controldata
Database cluster state: in production
pg_control last modified: Wed 03 Aug 2016 07:22:27 PM SGT
Latest checkpoint location: 0/40000B8
Prior checkpoint location: 0/4000058
Latest checkpoint's REDO location: 0/40000B8
Minimum recovery ending location: 0/0
Backup start location: 0/4000020
Backup end location: 0/0
End-of-backup record required: no
Current wal_level setting: hot_standby

Summary

There are a few learning points from this story.

  • When a database crashes while in pg_start_backup() mode, it is not necessary to run pg_stop_backup() after it restarts.
  • The startup process was waiting for a WAL file that already existed in pg_xlog because of recovery.conf: its restore_command pointed at pg_archive, which did not contain that file.
  • Since this was a crash recovery, we didn’t need recovery.conf to restore the database; recovery.conf is only used when restoring the database from a backup.

Regards,
Wei Shan

How to generate authentication file for PgBouncer

PgBouncer is a lightweight connection pooler that prevents the MAX_CONNECTION reached error in PostgreSQL. Basically, it does a very good job of what it says: it maintains a pool of connections for your applications, which reduces the overhead of setting up and tearing down a connection for every incoming client request. [link]

By default, it uses a plain-text password file for authenticating users to the PostgreSQL database. This will obviously raise a red flag during a security audit! Let’s see what we can do to prevent that red flag 🙂

First, we have to install pgbouncer and psycopg2.

yum install pgbouncer.x86_64 python-psycopg2.x86_64

Now that we have PgBouncer installed, we have to configure it.

# cd /etc
# chown -R postgres. pgbouncer/
# su - postgres
$ cd /etc/pgbouncer/
$ ls
mkauth.py mkauth.pyc mkauth.pyo pgbouncer.ini

Now, let’s generate that password file 🙂

$ ./mkauth.py "/etc/pgbouncer/password.txt" "dbname='postgres' user='postgres' host='localhost'"
$ ls
mkauth.py mkauth.pyc mkauth.pyo password.txt pgbouncer.ini

Let’s see if we can view the password!

$ cat password.txt
"postgres" "" ""
"repmgr" "md58ea99ab1ec3bd8d8a6162df6c8e1ddcd" ""
"testuser" "md5d70a0452418aeb8fb4030eae69ca2856" ""

mkauth.py reads the pg_shadow table to generate the password file. Obviously, the hashes are only MD5 (with the username acting as the salt), which is not as secure as SHA-256, but it is still a lot more secure than keeping the passwords in plain text, isn’t it?
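
For context, PostgreSQL’s md5 verifier is the literal string md5 followed by md5(password concatenated with username). A quick sketch of the same computation from the shell — the username and password below are invented for illustration:

```shell
# Reconstruct a PostgreSQL md5 password verifier: "md5" + md5(password || username).
# The credentials here are made up for illustration.
user=testuser
pass=secret
hash="md5$(printf '%s%s' "$pass" "$user" | md5sum | awk '{print $1}')"
echo "$hash"   # 35 characters: the literal "md5" plus 32 hex digits
```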

Regards,
Wei Shan

WARNING – Granting of psql/sqlplus executables to sudoers in Linux/Unix

I discovered this behaviour accidentally during recent project work. I needed psql access to a RHEL 6 system, but the company enforces a minimum-access policy (well done!), so my account was granted sudo permission for only the psql executable. I needed to perform DBA tasks on this system.

Let me show you the issue.

[weishan]$ sudo su -
[sudo] password for weishan:
User user is not allowed to run '/bin/su -' as root on this host. Sorry.

[weishan]$ tail -n 10 /var/log/messages
tail: cannot open `messages' for reading: Permission denied

[weishan]$ sudo /home/postgres/pgsql/bin/psql -h localhost -p 5432 -U postgres postgres
[sudo] password for weishan:
psql (9.2.4)
Type "help" for help.

Looking at the output above, the sudoer was configured correctly! Now, let’s see..

psql> \!
[root@gsdb03s log]# tail -n 5 /var/log/messages
May 10 15:34:47  snmpd[22564]: Connection from UDP: [ip]:63484->[ip]
May 10 15:34:47 snmpd[22564]: Connection from UDP: [ip]:63484-> [ip]
May 10 15:34:47 snmpd[22564]: Connection from UDP: [ip]:63484-> [ip]
May 10 15:34:47 snmpd[22564]: Connection from UDP: [ip]:63484-> [ip]

From the above, we can see that it is technically possible for unintended DBAs to obtain root access even if you did not grant it to them explicitly. Thus, we should be mindful when granting the psql or sqlplus executable via sudo.

I did not test this with sqlplus, but the behaviour should be the same, since both psql and sqlplus allow the user to “temporarily” drop out of the console to OS-level access.

Moral of the story:

  1. Do not configure sudo to let users run psql/sqlplus as the root user directly.
  2. Instead, configure sudo to only allow the user to “sudo su -” to the postgres user.
  3. Configure logging for sudo access.
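
To illustrate those three points, a sudoers fragment along these lines might look like the sketch below. The username, paths, and log file are assumptions carried over from this example, and sudoers should only ever be edited with visudo:

```shell
# /etc/sudoers.d/weishan -- hypothetical sketch, edit with visudo only

# Risky: psql runs as root, and \! then hands out a root shell
# weishan ALL=(root) /home/postgres/pgsql/bin/psql

# Safer: let the user become the postgres OS user instead
weishan ALL=(root) /bin/su - postgres

# Keep an audit trail of every sudo invocation
Defaults logfile=/var/log/sudo.log
```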

Regards,
Wei Shan

Playing Around with PostgreSQL date/time Datatype

To check current timestamp. (Notice it’s GMT+8 in the results)

postgres=# select now();
 now
-------------------------------
 2016-02-29 14:19:36.046499+08
(1 row)

To check current timestamp 1 hour ago.

postgres=# select (now() - interval '1 hour');
 ?column?
-------------------------------
 2016-02-29 13:20:47.030513+08
(1 row)

To check current timestamp 1 day ago.

postgres=# select (now() - interval '1 day');
 ?column?
-------------------------------
 2016-02-28 14:20:53.820484+08

To check current timestamp in epoch time. (Unix timestamp in seconds)

postgres=# select extract(epoch from NOW());
 date_part
------------------
 1456726955.54147
(1 row)

To check current timestamp in epoch time. (Unix timestamp in milliseconds)

postgres=# select extract(epoch from NOW())*1000;
 ?column?
-----------------
 1456727010006.5
(1 row)

Bonus! By default, dividing two integers returns an integer. You need to cast one operand to float to get fractional results.

postgres=# select 1/2;
 ?column?
----------
 0
(1 row)
postgres=# select 1/2::float;
 ?column?
----------
 0.5
(1 row)
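
The same behaviour exists outside the database. A quick shell comparison, purely for illustration:

```shell
# Unix timestamps from the shell, for comparison with extract(epoch from now())
now_s=$(date +%s)
now_ms=$((now_s * 1000))
echo "$now_s $now_ms"

# Integer vs floating-point division, mirroring SELECT 1/2 vs SELECT 1/2::float
echo $((1 / 2))               # shell arithmetic truncates: prints 0
awk 'BEGIN { print 1 / 2 }'   # floating point: prints 0.5
```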

I had quite some “fun” playing with all these date/time datatypes recently.

Regards,
Wei Shan

PostgreSQL 9.x – pg_basebackup: could not get transaction log end position from server: FATAL: archive member “core.56229” too large for tar format

While I was trying to re-deploy a PostgreSQL streaming replication slave from the master node, I hit the following error:

[root@node2~]# su - postgres
-bash-4.1$ /usr/pgsql-9.2/bin/pg_basebackup -D /data/var/lib/pgsql/9.2/data -h node1 --xlog -P -v
transaction log start point: 70/CD000020
166537602/235436024 kB (100%), 2/2 tablespaces
pg_basebackup: could not get transaction log end position from server: FATAL: archive member "core.56229" too large for tar format

From node1, I found the following files in the $PGDATA:

4.2G ./core.53808
4.2G ./core.55595
4.2G ./core.55698
4.2G ./core.56229

By default, PostgreSQL does not create data files larger than 1 GB, so I figured these were not data files. A quick search confirmed that they were core dumps left behind by previous PostgreSQL crashes.

Once I removed the files from the $PGDATA directory, the pg_basebackup command executed successfully. The interesting thing I learnt is that pg_basebackup uses tar behind the scenes to package and send the files over to the slave database.
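
To spot this earlier next time, oversized core files can be found before running pg_basebackup. A sketch, simulated here with a scratch directory and a sparse file standing in for $PGDATA and a real core dump:

```shell
# Core dumps are named core.<pid>; find any lurking at the top of the data dir.
mkdir -p scratch_pgdata
truncate -s 5G scratch_pgdata/core.56229   # sparse stand-in for a real dump

find scratch_pgdata -maxdepth 1 -name 'core.*' -size +1G
# a real clean-up would then be:
# find "$PGDATA" -maxdepth 1 -name 'core.*' -size +1G -delete
```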

Regards,
Wei Shan

Pacemaker/Corosync/PostgreSQL Issue

The moment I sat down at my desk, there were tons of tickets complaining that the database was down. It was a 2-node PostgreSQL HA cluster running the following stack.

  • RedHat Linux 6.x
  • Pacemaker/Corosync
  • PostgreSQL 9.2.x
  • Master-Slave synchronous streaming replication between the 2 PostgreSQL nodes

I ran the crm_mon command immediately and found that something was wrong with PostgreSQL; the Pacemaker/Corosync HA wasn’t working as intended and the database was down. Both nodes were online, but node1 was only a slave and PostgreSQL wasn’t running on node2 at all. I started going through the logs.

On Node1

Sep 21 04:10:19 node1 postgres[32318]: [5-1] FATAL: could not connect to the primary server: could not connect to server: No route to host
Sep 21 04:10:19 node1 postgres[32318]: [5-2] #011#011Is the server running on host "<IP address>" and accepting
Sep 21 04:10:19 node1 postgres[32318]: [5-3] #011#011TCP/IP connections on port 5432?
Sep 21 04:10:19 node1 postgres[32318]: [5-4] #011
Sep 21 04:10:22 node1 lrmd[3185]: notice: operation_finished: pgsql_monitor_7000:32319 [ 2015/09/21_04:10:22 INFO: Master does not exist. ]
Sep 21 04:10:22 node1 lrmd[3185]: notice: operation_finished: pgsql_monitor_7000:32319 [ 2015/09/21_04:10:22 WARNING: My data is out-of-date. status=DISCONNECT ]

On Node2

Sep 21 15:28:41 node2 pengine[3194]: notice: LogActions: Start pgsql:1#011(node2)
Sep 21 15:28:41 node2 pengine[3194]: notice: process_pe_message: Calculated Transition 1120: /var/lib/pacemaker/pengine/pe-input-234.bz2
Sep 21 15:28:42 node2 postgres[61043]: [1-1] LOG: database system was shut down in recovery at 2015-09-19 23:36:56 SGT
Sep 21 15:28:42 node2 postgres[61044]: [1-1] FATAL: the database system is starting up
Sep 21 15:28:42 node2 postgres[61043]: [2-1] LOG: entering standby mode
Sep 21 15:28:42 node2 postgres[61043]: [3-1] LOG: could not read from log file 59, segment 114, offset 0: No such file or directory
Sep 21 15:28:42 node2 postgres[61043]: [4-1] LOG: invalid primary checkpoint record
Sep 21 15:28:42 node2 postgres[61043]: [5-1] LOG: could not read from log file 59, segment 114, offset 0: No such file or directory
Sep 21 15:28:42 node2 postgres[61043]: [6-1] LOG: invalid secondary checkpoint record
Sep 21 15:28:42 node2 postgres[61043]: [7-1] PANIC: could not locate a valid checkpoint record

Explanation

  • Node1 thought it was the slave and hence kept trying to connect to the master database on node2.
  • Node1 couldn’t reach that IP address because it was managed by Pacemaker/Corosync, and since PostgreSQL was down, the “IP address” resource wasn’t running!
  • Node2 couldn’t start the database because it was corrupted.

What I needed to do was promote the non-corrupted database (node1) to master and re-create the slave on node2!

Resolution

Execute on node2

Stop the pacemaker/corosync services on node2

root# crm node standby node2
root# /etc/init.d/pacemaker stop
root# /etc/init.d/cman stop

Execute on node1

Promote node1 to master at both the pacemaker/corosync and database level. You don’t have to reboot the server, but I did because I wanted a “clean” state on node1.

root# crm resource migrate master-group node1
postgres# pg_ctl promote -D $PGDATA
root# init 6

Execute on node2

Re-seed node2 as a slave from node1, then start the pacemaker/corosync services on node2 once the database is re-seeded.

root# rm -fr $PGDATA
postgres# pg_basebackup -D /var/lib/pgsql/9.2/data -h 192.168.17.221 --xlog -P -v
root# /etc/init.d/cman start
root# /etc/init.d/pacemaker start
root# crm node online node2

Cheers,
Wei Shan

PostgreSQL 9.2 – Configuring plsh extension

This extension allows you to run shell commands from within the database, which is useful for moving or renaming files at the Unix level. The GitHub repository is over here. Below is a quick guide on how I got it up and running.

Create an auxiliary directory to hold the files.

# mkdir /tmp/plsh
# cd /tmp/plsh

Download and extract the tar file

# wget https://github.com/petere/plsh/archive/1.20130823.tar.gz
# tar xvf 1.20130823.tar.gz
# cd plsh-1.20130823

Install gcc and postgresql92-devel in order to compile plsh.

# yum install postgresql92-devel.x86_64
# yum install gcc
# make PG_CONFIG=/usr/pgsql-9.2/bin/pg_config
# make install PG_CONFIG=/usr/pgsql-9.2/bin/pg_config

Create the plsh extension. (This needs to be done in each database that requires plsh.)

# psql weishan
weishan=# create extension plsh;

Verify that the plsh extension has been installed correctly.

weishan=# \dx
 List of installed extensions
 Name | Version | Schema | Description
---------+---------+------------+------------------------------
 plpgsql | 1.0 | pg_catalog | PL/pgSQL procedural language
 plsh | 2 | public | PL/sh procedural language
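
With the extension in place, a function body can be written directly in shell, starting with a shebang line, per the convention in the plsh README. The function below is a made-up example to show the shape:

```sql
-- Hypothetical example: run a shell command from SQL via plsh
CREATE FUNCTION shell_date() RETURNS text AS '
#!/bin/sh
date +%Y-%m-%d
' LANGUAGE plsh;

SELECT shell_date();
```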

Cheers,
Wei Shan
