PostgreSQL 9.2 – Configuring plsh extension

This extension allows you to run shell commands from within the database, which is useful for tasks like moving or renaming files at the Unix level. The GitHub repository is at https://github.com/petere/plsh. Below is a quick guide on how I got it up and running.

Create an auxiliary directory to hold the files.

# mkdir /tmp/plsh
# cd /tmp/plsh

Download and extract the tarball.

# wget https://github.com/petere/plsh/archive/1.20130823.tar.gz
# tar xvf 1.20130823.tar.gz
# cd plsh-1.20130823

Install gcc and postgresql92-devel in order to compile plsh, then build and install it.

# yum install postgresql92-devel.x86_64
# yum install gcc
# make PG_CONFIG=/usr/pgsql-9.2/bin/pg_config
# make install PG_CONFIG=/usr/pgsql-9.2/bin/pg_config
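
To double-check where make install placed the extension files (a quick sanity check; the extension directory lives under pg_config's share directory):

# /usr/pgsql-9.2/bin/pg_config --sharedir
# ls "$(/usr/pgsql-9.2/bin/pg_config --sharedir)"/extension/plsh*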

Create the plsh extension. (This needs to be done in each database that requires plsh.)

# psql weishan
weishan=# create extension plsh;

Verify that the plsh extension has been installed correctly.

weishan=# \dx
                 List of installed extensions
  Name   | Version |   Schema   |         Description
---------+---------+------------+------------------------------
 plpgsql | 1.0     | pg_catalog | PL/pgSQL procedural language
 plsh    | 2       | public     | PL/sh procedural language
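
As a quick smoke test, a PL/sh function body is just a shell script starting with a shebang, and the function arguments arrive as positional parameters. (move_file and the paths below are hypothetical; adjust them to your environment.)

CREATE FUNCTION move_file(src text, dst text) RETURNS text AS $$
#!/bin/sh
# $1 and $2 are the function's arguments
mv "$1" "$2" && echo "moved $1 to $2"
$$ LANGUAGE plsh;

weishan=# SELECT move_file('/tmp/plsh/old.csv', '/tmp/plsh/new.csv');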

Cheers,
Wei Shan

PostgreSQL 9.4 – Configuring Replication Slots

PostgreSQL 9.4 introduces a very useful feature called replication slots. Previously, we had to configure a high value for the wal_keep_segments parameter in postgresql.conf, estimated from the rate of change on the master database. Alternatively, we could configure archive_command to copy WAL to another location for the slave to pick up. Both methods require more capacity than strictly necessary to store the WAL archives. This new feature is just plain cool :)

You can read more about it in the PostgreSQL documentation.
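
For contrast, the pre-9.4 workarounds looked something like this in postgresql.conf (the values below are illustrative, not recommendations):

wal_keep_segments = 512                # retain ~8 GB of WAL at 16 MB/segment, sized by guesswork
archive_command = 'cp %p /archive/%f'  # or ship every WAL file somewhere the slave can reach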

Step 1: Modify MAX_REPLICATION_SLOTS parameter in postgresql.conf on master

1.1) max_replication_slots = 0 => max_replication_slots = 3
1.2) pg_ctl restart

Step 2: Create the replication slot on master

postgres=# SELECT * FROM pg_create_physical_replication_slot('postgres_slave');
   slot_name    | xlog_position
----------------+---------------
 postgres_slave |

Step 3: Configure PRIMARY_SLOT_NAME in recovery.conf on slave

# echo "primary_slot_name = 'postgres_slave' >> recovery.conf
# pg_ctl restart

Step 4: Verify configuration on master

postgres=# SELECT * FROM pg_replication_slots;
   slot_name    | plugin | slot_type | datoid | database | active | xmin | catalog_xmin | restart_lsn
----------------+--------+-----------+--------+----------+--------+------+--------------+-------------
 postgres_slave |        | physical  |        |          | t      |      |              | 1/50002B0
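
One caveat: a slot that no standby is consuming retains WAL forever, so remember to drop it if the standby is decommissioned:

postgres=# SELECT pg_drop_replication_slot('postgres_slave');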

Cheers,
Wei Shan

PostgreSQL – Autovacuum Daemon not working

I have configured the Autovacuum process. However, it does not seem to be working as intended.

Autovacuum Configuration

postgres=# show autovacuum;
 autovacuum
------------
 on
postgres=# show track_counts;
 track_counts
--------------
 on

According to the documentation, only two parameters are required to enable the autovacuum process: autovacuum and track_counts.

Verifying that Autovacuum is not working as intended
Step 1:

# grep autovacuum /var/log/messages

Step 2:

# ps -aef | grep autovacuum
postgres 4923 4883 0 Jul20 ? 00:00:15 postgres: autovacuum launcher process
root 52995 5313 0 10:12 pts/0 00:00:00 grep autovacuum

Step 3:

select schemaname, relname, last_vacuum, last_autovacuum, last_analyze, last_autoanalyze, vacuum_count, autovacuum_count, analyze_count, autoanalyze_count from pg_stat_user_tables;

All the values were 0 for the tables in every single database.

There’s something weird here. Autovacuum is configured correctly and the process is running. However, the autovacuum job did not happen even once.

Troubleshooting Process

Step 1: Verify autovacuum settings.

psql> SELECT name, setting || unit AS setting FROM pg_settings WHERE category = 'Autovacuum';
 name | setting
---------------------------------+-----------
 autovacuum |
 autovacuum_analyze_scale_factor |
 autovacuum_analyze_threshold | 50
 autovacuum_freeze_max_age | 200000000
 autovacuum_max_workers | 3
 autovacuum_naptime | 60s
 autovacuum_vacuum_cost_delay | 20ms
 autovacuum_vacuum_cost_limit | -1
 autovacuum_vacuum_scale_factor |
 autovacuum_vacuum_threshold | 50
(10 rows)

The blank rows appear because those settings have no unit, so setting || unit evaluates to NULL; setting || coalesce(unit, '') would show them. More importantly, the threshold for VACUUM and ANALYZE was 50 rows. Could it be that the DML activity on the databases had been under 50 rows so far?
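
For reference, autovacuum vacuums a table once its dead tuples exceed autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor * number of rows. A rough way to check which tables currently qualify (a sketch, assuming the default scale factor of 0.2):

SELECT s.relname, s.n_dead_tup, c.reltuples::bigint AS live_rows,
       s.n_dead_tup > (50 + 0.2 * c.reltuples) AS qualifies
FROM pg_stat_user_tables s
JOIN pg_class c ON c.oid = s.relid;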

Step 2: Create test table for deletion

psql> create table test as select * from emp;
SELECT 12495
psql> select count(*) from test;
 count
-------
 12495
(1 row)

Step 3: Verify autovacuum statistics on test table

psql> select schemaname,relname,last_vacuum,last_autovacuum,last_analyze,last_autoanalyze,vacuum_count,autovacuum_count,analyze_count,autoanalyze_count from pg_stat_user_tables where relname='test';
-[ RECORD 1 ]-----+------------------------------
schemaname | public
relname | test
last_vacuum |
last_autovacuum |
last_analyze |
last_autoanalyze | 2015-07-22 12:02:23.931391+08
vacuum_count | 0
autovacuum_count | 0
analyze_count | 0
autoanalyze_count | 1

Step 4: Delete 5000 rows from test table

psql> delete from test where fingerprint in (select fingerprint from test order by updatetime limit 5000);
DELETE 5000

Step 5: Verify that the autovacuum statistics have changed.

psql> select schemaname,relname,last_vacuum,last_autovacuum,last_analyze,last_autoanalyze,vacuum_count,autovacuum_count,analyze_count,autoanalyze_count from pg_stat_user_tables where relname='test';
-[ RECORD 1 ]-----+------------------------------
schemaname | public
relname | test
last_vacuum |
last_autovacuum | 2015-07-22 12:06:23.329375+08
last_analyze |
last_autoanalyze | 2015-07-22 12:06:24.109004+08
vacuum_count | 0
autovacuum_count | 1
analyze_count | 0
autoanalyze_count | 2

Summary

The autovacuum process is definitely working. The reason it was not scheduling earlier was the low activity on the databases. This is proven by the autoanalyze_count counter, which increased from 1 to 2 after 5000 records were deleted.

Cheers,
Wei Shan

PostgreSQL – Missing chunk 0 for toast value in pg_toast

What to do when you see similar errors in your pg_log?

STATEMENT:  SELECT id, authz_id, org_id, name, environment, last_updated_by, created_at, updated_at, serialized_object FROM nodes WHERE (org_id = $1 AND name = $2) LIMIT 1
ERROR:  unexpected chunk number 1 (expected 0) for toast value 5268264 in pg_toast_16444
STATEMENT:  SELECT id, authz_id, org_id, name, environment, last_updated_by, created_at, updated_at, serialized_object FROM nodes WHERE (org_id = $1 AND name = $2) LIMIT 1
ERROR:  missing chunk number 4 for toast value 5268263 in pg_toast_16444

The errors above are associated with data corruption in your TOAST tables. These were the steps I took to resolve the issue. First, identify the affected table: the number in the TOAST table's name (pg_toast_16444) is the OID of the table that owns it, so casting it to regclass gives the table name.

select 16444::regclass;
 regclass
----------
 nodes

Now we know that the problematic TOAST table is associated with the nodes table (I know it's pretty obvious from the logs!).
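
Going the other way, you can find the TOAST table that belongs to a given table through pg_class:

select reltoastrelid::regclass from pg_class where relname = 'nodes';
      reltoastrelid
-------------------------
 pg_toast.pg_toast_16444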

Step 1: Reindex the table

REINDEX table pg_toast.pg_toast_16444;
REINDEX table nodes;

Step 2: Vacuum and analyze the table

VACUUM ANALYZE nodes;

Step 3: If it doesn’t work, identify the corrupted rows and delete them (This will incur data loss!).

for ((i=0; i<"Number_of_rows_in_nodes"; i++ )); do psql -U "Username" "Database Name" -c "SELECT * FROM nodes LIMIT 1 offset $i" >/dev/null || echo $i; done

Step 3a: The mini script above lists the offsets of the rows with data corruption. In this case, row 11.

ERROR:  missing chunk number 0 for toast value 5268251 in pg_toast_16444
11

Step 3b: Select the primary key of row 11.

psql> select id from nodes limit 1 offset 11;

id
----------------------------------
00000000000068f26b60aa5ef533cc85
(1 row)

Step 3c: Delete that row

psql> delete from nodes where id = '00000000000068f26b60aa5ef533cc85';

Step 4: Verify that the issue has been resolved (no error messages).

psql> VACUUM FULL nodes;

Regards,
Wei Shan

Conflict Management in Oracle Active DataGuard 11gR2

In the PostgreSQL documentation, I found the following:

The primary and standby servers are in many ways loosely connected. Actions on the primary will have an effect on the standby. An example of the problem situation is an administrator on the primary server running DROP TABLE on a table that is currently being queried on the standby server. Clearly the standby query cannot continue if the DROP TABLE is applied on the standby. If this situation occurred on the primary, the DROP TABLE would wait until the other query had finished. But when DROP TABLE is run on the primary, the primary doesn’t have information about what queries are running on the standby, so it will not wait for any such standby queries. The WAL change records come through to the standby while the standby query is still running, causing a conflict. The standby server must either delay application of the WAL records (and everything after them, too) or else cancel the conflicting query so that the DROP TABLE can be applied.
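
For reference, PostgreSQL exposes this trade-off through settings on the standby (a sketch; the values are illustrative):

# postgresql.conf on the standby
max_standby_streaming_delay = 30s   # how long to delay WAL apply in favour of running queries
hot_standby_feedback = on           # reports standby queries to the primary; helps with vacuum conflicts, though not with DDL like DROP TABLE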

The test below is to find out if this applies to Oracle as well. I could not find any information for Oracle Active DataGuard.

1. Execute on primary

SQL> select count(*) from test;
  COUNT(*)
----------
   2052000

2. Execute on secondary

SQL> select * from (select * from (select * from test));

3. Execute on primary

SQL> truncate table test;

4. Ensure that changes have been replicated to standby

Primary:
SQL> alter system switch logfile;
SQL> SELECT THREAD#, MAX(SEQUENCE#) AS "LAST_APPLIED_LOG" FROM V$LOG_HISTORY GROUP BY THREAD#;

   THREAD# LAST_APPLIED_LOG
---------- ----------------
         1               71

Standby:
SQL> SELECT SEQUENCE#, APPLIED FROM V$ARCHIVED_LOG ORDER BY SEQUENCE#;

 SEQUENCE# APPLIED
---------- ---------
        71 YES

After about 3 minutes, the standby will hit the following error:

ERROR:
 ORA-08103: object no longer exists

Using the oerr utility shows the following:

[oracle@primary ~]$ oerr ora 08103
08103, 00000, "object no longer exists"
// *Cause: The object has been deleted by another user since the operation
// began, or a prior incomplete recovery restored the database to
// a point in time during the deletion of the object.
// *Action: Delete the object if this is the result of an incomplete
// recovery.

I know the above error is not too intuitive… but hey, at least it's a different error from ORA-01555: Snapshot Too Old!

Summary

Interestingly, I also tested the same scenario with the DROP and DELETE commands. DROP behaves exactly like TRUNCATE. However, DELETE is very different: the standby was allowed to complete the SELECT query. I believe the reason is the UNDO tablespace on the primary. DROP and TRUNCATE do not generate UNDO information for the database to roll back, but DELETE generates UNDO on the primary, which is replicated via the REDO stream to the UNDO tablespace on the standby. That is how the standby maintains read consistency.
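
One way to observe this on the primary (a sketch; run it from another session while the DML is still uncommitted; TRUNCATE and DROP auto-commit, so they will not show up here):

SQL> SELECT s.sid, t.used_ublk AS undo_blocks FROM v$transaction t JOIN v$session s ON s.taddr = t.addr;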

However, the above is just my educated guess. If you have a different opinion, do share it with me. I’m definitely all ears!

Cheers,
Wei Shan

Oracle 11gR2 DataGuard – ORA-16826 Error

I was working with Oracle Active DataGuard recently when I hit the ORA-16826 error. This is what I did to resolve it. When I ran show configuration in DG broker, I saw the following error.

DGMGRL> show configuration;

Configuration - test_ADG
  Protection Mode: MaxAvailability
  Databases:
    ORCL      - Primary database
    ORCL_STBY - Physical standby database
      Warning: ORA-16826: apply service state is inconsistent with the DelayMins property

Fast-Start Failover: DISABLED

Configuration Status:
WARNING

So I used the oerr utility to find out more about the error.

[oracle@standby trace]$ oerr ora 16826

16826, 0000, "apply service state is inconsistent with the DelayMins property"
// *Cause: This warning was caused by one of the following reasons:
// - The apply service was started without specifying the real-time
// apply option or without the NODELAY option when the DelayMins
// property was set to zero.
// - The apply service was started with the real-time apply option or
// with the NODELAY option when the DelayMins property was set to
// a value greater than zero.
// *Action: Reenable the standby database to allow the broker to restart
// the apply service with the apply options that are consistent
// with the specified value of the DelayMins property.

To verify which database has the error:

DGMGRL> show database verbose 'ORCL_STBY';

Database - ORCL_STBY

Role: PHYSICAL STANDBY
Intended State: APPLY-ON
Transport Lag: 0 seconds
Apply Lag: 0 seconds
Real Time Query: OFF
Instance(s):
ORCL

Properties:
DelayMins = '0'

------Formatted for readability------

Database Status:
WARNING

Check the status on the standby destination. Note the RECOVERY_MODE column: it needs to change from MANAGED to MANAGED REAL TIME APPLY.

SQL> SELECT DEST_NAME, RECOVERY_MODE, DATABASE_MODE, TYPE, STATUS, DESTINATION FROM V$ARCHIVE_DEST_STATUS;

DEST_NAME          RECOVERY_MODE           DATABASE_MODE   TYPE     STATUS DESTINATION
------------------ ----------------------- --------------- -------- ------ -----------
LOG_ARCHIVE_DEST_1 IDLE                    OPEN            LOCAL    VALID  /archive
LOG_ARCHIVE_DEST_2 MANAGED                 MOUNTED-STANDBY PHYSICAL VALID  ORCL_STBY

Change to real time apply mode.

SQL> alter database recover managed standby database cancel;

SQL> alter database recover managed standby database using current logfile disconnect;

DEST_NAME          RECOVERY_MODE           DATABASE_MODE   TYPE     STATUS DESTINATION
------------------ ----------------------- --------------- -------- ------ -----------
LOG_ARCHIVE_DEST_1 IDLE                    OPEN            LOCAL    VALID  /archive
LOG_ARCHIVE_DEST_2 MANAGED REAL TIME APPLY MOUNTED-STANDBY PHYSICAL VALID  ORCL_STBY

Check that the DataGuard broker configuration is successful now.

DGMGRL> show configuration;

Configuration - test_ADG
  Protection Mode: MaxAvailability
  Databases:
    ORCL      - Primary database
    ORCL_STBY - Physical standby database

Fast-Start Failover: DISABLED

Configuration Status:
SUCCESS
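
Alternatively, the *Action text suggests simply re-enabling the standby so the broker restarts apply with options consistent with DelayMins:

DGMGRL> disable database 'ORCL_STBY';
DGMGRL> enable database 'ORCL_STBY';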

Cheers,
Wei Shan

PostgreSQL Error – WARNING: canceling wait for synchronous replication due to user request

I needed to create an index on my PostgreSQL cluster today. It was one master performing synchronous streaming replication to one slave. I did the following:

1. Created a new tablespace on the master to store all the indexes.
postgres=# create tablespace tablespace_index location '/var/lib/pgsql/9.2/data/tablespace';

2. Created the actual index itself.
CREATE INDEX CONCURRENTLY certificatedata_idx2 ON CertificateData (username) tablespace tablespace_index;

The index creation hung. I checked pg_stat_activity and saw that the query was still active. It took me about 2 hours before I decided to stop the index creation. When I hit Ctrl-C, I saw the following:

WARNING: canceling wait for synchronous replication due to user request
 DETAIL: The transaction has already committed locally, but might not have been replicated to the standby.
 ^CCancel request sent
 ERROR: canceling statement due to user request

The error speaks for itself: there is an issue with the replication to the slave database. Now, this was weird. I decided to look at the log files, and found out that replication had stopped because I did not create the tablespace directory (/var/lib/pgsql/9.2/data/tablespace) on the slave! The replication process could not create the tablespace, and the slave node crashed. After creating the directory, the index creation took less than 20 minutes on a 7GB table!
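
Lesson learned: before running long DDL under synchronous replication, check on the master that streaming is actually healthy. A quick sketch:

postgres=# select application_name, state, sync_state from pg_stat_replication;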

Thankfully, all this was done in the staging environment.

Regards,
Wei Shan

What to do when you join a company as a new DBA

When we take up a new job and join a company as a new DBA, what should we do immediately? Unless your manager has specific tasks or expectations for you, this is what I recommend: start documenting and looking through all the databases you will be managing. This should include important information like the following:

1. Database architecture diagram (Is this RAC, DataGuard or GoldenGate?)

2. Database parameters (AMM or ASMM, archive logging, DB block size, DBID? See the snippet after this list.)

3. Understand the application using the database (OLTP, DWH, Hybrid?)

4. Read and understand the existing documentation. (DR, failover, backup procedures)
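
For item 2, a simple starting point on an Oracle database is to snapshot every parameter that has been changed from its default (a sketch; adapt the idea for other engines):

SQL> SELECT name, value FROM v$parameter WHERE isdefault = 'FALSE' ORDER BY name;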

The reason is simple: understand the environment you will be managing thoroughly. Every database environment is different. By documenting the environment, you will come to understand it fully, and that will help when you are troubleshooting issues.

Regards,
Wei Shan
