PostgreSQL 9.2 – Configuring plsh extension
This extension lets you run shell commands from within the database, which is useful for tasks like moving or renaming files at the Unix level. The GitHub repository is over here. Below is a quick guide on how I got it up and running.
Create an auxiliary directory to hold the files.
# mkdir /tmp/plsh
# cd /tmp/plsh
Download and extract the tar file.
# wget https://github.com/petere/plsh/archive/1.20130823.tar.gz
# tar xvf 1.20130823.tar.gz
# cd plsh-1.20130823
Install gcc and postgresql92-devel in order to compile plsh.
# yum install postgresql92-devel.x86_64
# yum install gcc
# make PG_CONFIG=/usr/pgsql-9.2/bin/pg_config
# make install PG_CONFIG=/usr/pgsql-9.2/bin/pg_config
Create the plsh extension. (This needs to be done on each database that requires plsh.)
# psql weishan
weishan=# create extension plsh;
Verify that the plsh extension has been installed correctly.
weishan=# \dx
                 List of installed extensions
  Name   | Version |   Schema   |         Description
---------+---------+------------+------------------------------
 plpgsql | 1.0     | pg_catalog | PL/pgSQL procedural language
 plsh    | 2       | public     | PL/sh procedural language
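With the extension in place, shell access is exposed through functions written in the plsh language. A minimal sketch of the file-renaming use case mentioned above (the function name and paths are my own illustration, not from the original setup):

```sql
-- plsh passes function arguments as positional shell parameters ($1, $2, ...).
CREATE FUNCTION rename_file(src text, dst text) RETURNS text AS $$
#!/bin/sh
mv "$1" "$2" && echo "renamed"
$$ LANGUAGE plsh;

-- Hypothetical paths for illustration.
SELECT rename_file('/tmp/old_name.csv', '/tmp/new_name.csv');
```

Keep in mind that the shell runs as the postgres OS user, so it can only touch files that user can access.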
PostgreSQL 9.4 – Configuring Replication Slots
PostgreSQL 9.4 introduces a very useful feature called Replication Slots. Previously, we had to configure a high value for the WAL_KEEP_SEGMENTS parameter in postgresql.conf, estimated from the delta change on the master database. Alternatively, we could configure ARCHIVE_COMMAND to move the WAL to another location for the slave to pick up. Both methods require more capacity than strictly necessary to store the WAL archives. This new feature is just plain cool :)
You can read more about it over here.
Step 1: Modify MAX_REPLICATION_SLOTS parameter in postgresql.conf on master
1.1) max_replication_slots = 0 => max_replication_slots = 3
1.2) pg_ctl restart
Step 2: Create the replication slot on master
postgres=# SELECT * FROM pg_create_physical_replication_slot('postgres_slave');
   slot_name    | xlog_position
----------------+---------------
 postgres_slave |
Step 3: Configure PRIMARY_SLOT_NAME in recovery.conf on slave
# echo "primary_slot_name = 'postgres_slave'" >> recovery.conf
# pg_ctl restart
Step 4: Verify configuration on master
postgres=# SELECT * FROM pg_replication_slots;
   slot_name    | plugin | slot_type | datoid | database | active | xmin | catalog_xmin | restart_lsn
----------------+--------+-----------+--------+----------+--------+------+--------------+-------------
 postgres_slave |        | physical  |        |          | t      |      |              | 1/50002B0
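One caveat worth noting: a slot makes the master retain WAL until the standby has consumed it, so a slot whose standby is gone for good will hold WAL forever and can eventually fill the disk. A sketch of the housekeeping, reusing the slot name from the example above:

```sql
-- Check whether the slot is in use and how far back it is holding WAL:
SELECT slot_name, active, restart_lsn FROM pg_replication_slots;

-- Drop the slot once the standby is permanently decommissioned:
SELECT pg_drop_replication_slot('postgres_slave');
```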
PostgreSQL – Autovacuum Daemon not working
I have configured the Autovacuum process. However, it does not seem to be working as intended.
postgres=# show autovacuum;
 autovacuum
------------
 on

postgres=# show track_counts;
 track_counts
--------------
 on
According to the documentation, only these 2 parameters, autovacuum and track_counts, are required to enable the autovacuum process.
Verifying that Autovacuum is not working as intended
# grep autovacuum /var/log/messages
# ps -aef | grep autovacuum
postgres  4923  4883  0 Jul20 ?      00:00:15 postgres: autovacuum launcher process
root     52995  5313  0 10:12 pts/0  00:00:00 grep autovacuum
select schemaname, relname, last_vacuum, last_autovacuum, last_analyze, last_autoanalyze, vacuum_count, autovacuum_count, analyze_count, autoanalyze_count from pg_stat_user_tables;

All the counter values were 0 for the tables in every single database.
There’s something weird here. Autovacuum is configured correctly and the process is running. However, the autovacuum job had not run even once.
Step 1: Verify autovacuum settings.
psql> SELECT name, setting || unit AS setting FROM pg_settings WHERE category = 'Autovacuum';
              name               |  setting
---------------------------------+-----------
 autovacuum                      |
 autovacuum_analyze_scale_factor |
 autovacuum_analyze_threshold    | 50
 autovacuum_freeze_max_age       | 200000000
 autovacuum_max_workers          | 3
 autovacuum_naptime              | 60s
 autovacuum_vacuum_cost_delay    | 20ms
 autovacuum_vacuum_cost_limit    | -1
 autovacuum_vacuum_scale_factor  |
 autovacuum_vacuum_threshold     | 50
(10 rows)
The base threshold for vacuum and analyze was 50 rows. Could it be that the DML activity on the databases had simply been too low to cross the threshold so far?
Step 2: Create test table for deletion
psql> create table test as select * from emp;
SELECT 12495
psql> select count(*) from test;
 count
-------
 12495
(1 row)
Step 3: Verify autovacuum statistics on test table
psql> select schemaname, relname, last_vacuum, last_autovacuum, last_analyze, last_autoanalyze, vacuum_count, autovacuum_count, analyze_count, autoanalyze_count from pg_stat_user_tables where relname='test';
-[ RECORD 1 ]-----+------------------------------
schemaname        | public
relname           | test
last_vacuum       |
last_autovacuum   |
last_analyze      |
last_autoanalyze  | 2015-07-22 12:02:23.931391+08
vacuum_count      | 0
autovacuum_count  | 0
analyze_count     | 0
autoanalyze_count | 1
Step 4: Delete 5000 rows from test table
psql> delete from test where fingerprint in (select fingerprint from test order by updatetime limit 5000);
DELETE 5000
Step 5: Verify that the autovacuum statistics have changed.
psql> select schemaname, relname, last_vacuum, last_autovacuum, last_analyze, last_autoanalyze, vacuum_count, autovacuum_count, analyze_count, autoanalyze_count from pg_stat_user_tables where relname='test';
-[ RECORD 1 ]-----+------------------------------
schemaname        | public
relname           | test
last_vacuum       |
last_autovacuum   | 2015-07-22 12:06:23.329375+08
last_analyze      |
last_autoanalyze  | 2015-07-22 12:06:24.109004+08
vacuum_count      | 0
autovacuum_count  | 1
analyze_count     | 0
autoanalyze_count | 2
The autovacuum process is definitely working. The reason it was not scheduling earlier was the low activity on the databases. This is proven by the autoanalyze_count counter, which increased from 1 to 2 (and autovacuum_count from 0 to 1) after 5000 records were deleted.
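This matches the documented trigger formula: autovacuum vacuums a table once its dead tuples exceed autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor * reltuples. A quick sketch of the arithmetic, assuming the default scale factor of 0.2 (the scale factor column is blank in the pg_settings output above, most likely because the query concatenates its NULL unit):

```shell
rows=12495                              # reltuples of the test table
threshold=$((50 + rows * 20 / 100))     # 50 + 0.2 * reltuples, in integer math
echo "vacuum triggers after $threshold dead tuples"
# Deleting 5000 rows exceeds this threshold, so autovacuum fired; the earlier
# light DML never crossed it, which is why the counters stayed at zero.
```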
PostgreSQL – Missing chunk 0 for toast value in pg_toast
What to do when you see similar errors in your pg_log?
STATEMENT: SELECT id, authz_id, org_id, name, environment, last_updated_by, created_at, updated_at, serialized_object FROM nodes WHERE (org_id = $1 AND name = $2) LIMIT 1
ERROR: unexpected chunk number 1 (expected 0) for toast value 5268264 in pg_toast_16444
STATEMENT: SELECT id, authz_id, org_id, name, environment, last_updated_by, created_at, updated_at, serialized_object FROM nodes WHERE (org_id = $1 AND name = $2) LIMIT 1
ERROR: missing chunk number 4 for toast value 5268263 in pg_toast_16444
The errors above are associated with data corruption in your TOAST tables. These were the steps I took to resolve the issue. First, identify the affected table.
The number in the TOAST table name (pg_toast_16444) is the OID of the owning table, so casting it to regclass resolves the name:

select 16444::regclass;
 regclass
----------
 nodes
Now we know that the problematic TOAST table is associated with the nodes table (I know it’s pretty obvious from the logs!).
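For cases where the TOAST table name does not make the owner obvious, the owning table can also be looked up in the catalog (a standard pg_class query, not from the original post):

```sql
-- reltoastrelid links a table to its TOAST relation.
SELECT relname FROM pg_class
WHERE reltoastrelid = 'pg_toast.pg_toast_16444'::regclass;
```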
Step 1: Reindex the table
REINDEX table pg_toast.pg_toast_16444;
REINDEX table nodes;
Step 2: Vacuum and analyze the table
VACUUM ANALYZE nodes;
Step 3: If it doesn’t work, identify the corrupted rows and delete them (This will incur data loss!).
for ((i=0; i<"Number_of_rows_in_nodes"; i++)); do
  psql -U "Username" "Database Name" -c "SELECT * FROM nodes LIMIT 1 OFFSET $i" >/dev/null || echo $i
done
Step 3a: The mini script above will list the rows that have data corruption. In this case, row 11.
ERROR: missing chunk number 0 for toast value 5268251 in pg_toast_16444
11
Step 3b: Select the primary key of row 11.
psql> select id from nodes limit 1 offset 11;
                id
----------------------------------
 00000000000068f26b60aa5ef533cc85
(1 row)
Step 3c: Delete that row
psql> delete from nodes where id = '00000000000068f26b60aa5ef533cc85';
Step 4: Verify that the issue has been resolved (no error messages).
psql> VACUUM FULL nodes;
Conflict Management in Oracle Active DataGuard 11gR2
In PostgreSQL, I found the following;
The primary and standby servers are in many ways loosely connected. Actions on the primary will have an effect on the standby. An example of the problem situation is an administrator on the primary server running DROP TABLE on a table that is currently being queried on the standby server. Clearly the standby query cannot continue if the DROP TABLE is applied on the standby. If this situation occurred on the primary, the DROP TABLE would wait until the other query had finished. But when DROP TABLE is run on the primary, the primary doesn’t have information about what queries are running on the standby, so it will not wait for any such standby queries. The WAL change records come through to the standby while the standby query is still running, causing a conflict. The standby server must either delay application of the WAL records (and everything after them, too) or else cancel the conflicting query so that the DROP TABLE can be applied.
The test below is to find out if this applies to Oracle as well. I could not find any information for Oracle Active DataGuard.
1. Execute on primary
SQL> select count(*) from test;

  COUNT(*)
----------
   2052000
2. Execute on secondary
SQL> select * from (select * from (select * from test));
3. Execute on primary
SQL> truncate table test;
4. Ensure that changes have been replicated to standby
Primary
SQL> alter system switch logfile;
SQL> SELECT THREAD#, MAX(SEQUENCE#) AS "LAST_APPLIED_LOG" FROM V$LOG_HISTORY GROUP BY THREAD#;

   THREAD# LAST_APPLIED_LOG
---------- ----------------
         1               71
Standby
SQL> SELECT SEQUENCE#, APPLIED FROM V$ARCHIVED_LOG ORDER BY SEQUENCE#;

 SEQUENCE# APPLIED
---------- ---------
        71 YES
After about 3 minutes, the standby will hit the following error:
ERROR: ORA-08103: object no longer exists
Using oerr utility will show us the following;
[oracle@primary ~]$ oerr ora 08103
08103, 00000, "object no longer exists"
// *Cause:  The object has been deleted by another user since the operation
//          began, or a prior incomplete recovery restored the database to
//          a point in time during the deletion of the object.
// *Action: Delete the object if this is the result of an incomplete
//          recovery.
I know the above error is not too intuitive… but hey, at least it’s a different error from ORA-01555 Snapshot Too Old
Interestingly, I also tested the same scenario with the DROP and DELETE commands. DROP behaves exactly the same as TRUNCATE. However, DELETE is very different: the standby was allowed to complete the SELECT query. I believe the reason lies in the UNDO tablespace on the primary. DROP and TRUNCATE do not generate UNDO information for the database to roll back. However, DELETE generates UNDO information on the primary, which is replicated via the redo stream to the UNDO tablespace on the standby. That is how the standby is able to maintain read consistency.
However, the above is just my educated guess. If you have a different opinion, do share it with me. I’m definitely all ears!
Oracle 11gR2 DataGuard – ORA-16826 Error
I was working with Oracle Active DataGuard recently when I hit the ORA-16826 error. This was what I did to resolve it. When I did a show configuration in DG broker, I saw the following error.
DGMGRL> show configuration;
Configuration - test_ADG
  Protection Mode: MaxAvailability
  Databases:
    ORCL      - Primary database
    ORCL_STBY - Physical standby database
      Warning: ORA-16826: apply service state is inconsistent with the DelayMins property
  Fast-Start Failover: DISABLED
Configuration Status:
WARNING
So I used the oerr utility to find out more about the error.
[oracle@standby trace]$ oerr ora 16826
16826, 0000, "apply service state is inconsistent with the DelayMins property"
// *Cause:  This warning was caused by one of the following reasons:
//          - The apply service was started without specifying the real-time
//            apply option or without the NODELAY option when the DelayMins
//            property was set to zero.
//          - The apply service was started with the real-time apply option or
//            with the NODELAY option when the DelayMins property was set to
//            a value greater than zero.
// *Action: Reenable the standby database to allow the broker to restart
//          the apply service with the apply options that are consistent
//          with the specified value of the DelayMins property.
Next, verify which database has the error.
DGMGRL> show database verbose 'ORCL_STBY';
Database - ORCL_STBY
  Role:            PHYSICAL STANDBY
  Intended State:  APPLY-ON
  Transport Lag:   0 seconds
  Apply Lag:       0 seconds
  Real Time Query: OFF
  Instance(s):
    ORCL
  Properties:
    DelayMins = '0'
    ------Formatted for readability------
Database Status:
WARNING
Check the status on the standby. Note the RECOVERY_MODE column: it needs to go from MANAGED to MANAGED REAL TIME APPLY.
SQL> SELECT DEST_NAME, RECOVERY_MODE, DATABASE_MODE, TYPE, STATUS, DESTINATION FROM V$ARCHIVE_DEST_STATUS WHERE STATUS = 'VALID';
DEST_NAME          RECOVERY_MODE  DATABASE_MODE    TYPE      STATUS  DESTINATION
------------------ -------------- ---------------- --------- ------- -----------
LOG_ARCHIVE_DEST_1 IDLE           OPEN             LOCAL     VALID   /archive
LOG_ARCHIVE_DEST_2 MANAGED        MOUNTED-STANDBY  PHYSICAL  VALID   ORCL_STBY
Change to real time apply mode.
SQL> alter database recover managed standby database cancel;
SQL> alter database recover managed standby database using current logfile disconnect;
DEST_NAME          RECOVERY_MODE           DATABASE_MODE    TYPE      STATUS  DESTINATION
------------------ ----------------------- ---------------- --------- ------- -----------
LOG_ARCHIVE_DEST_1 IDLE                    OPEN             LOCAL     VALID   /archive
LOG_ARCHIVE_DEST_2 MANAGED REAL TIME APPLY MOUNTED-STANDBY  PHYSICAL  VALID   ORCL_STBY
Check that the DataGuard broker configuration is successful now.
DGMGRL> show configuration;
Configuration - test_ADG
  Protection Mode: MaxAvailability
  Databases:
    ORCL      - Primary database
    ORCL_STBY - Physical standby database
  Fast-Start Failover: DISABLED
Configuration Status:
SUCCESS
PostgreSQL Error – WARNING: canceling wait for synchronous replication due to user request
I needed to create an index on my PostgreSQL cluster today. The setup was 1 master performing synchronous streaming replication to 1 slave. I did the following:
1. Created a new tablespace on master to store all the indexes.
psql> create tablespace tablespace_index location '/var/lib/pgsql/9.2/data/tablespace';
2. Create the actual index itself.
CREATE INDEX CONCURRENTLY certificatedata_idx2 ON CertificateData (username) tablespace tablespace_index;
The index creation hung. I checked pg_stat_activity and saw that the query was still active. After about 2 hours I decided to stop the index creation. When I hit Ctrl-C, I saw the following:
WARNING: canceling wait for synchronous replication due to user request
DETAIL: The transaction has already committed locally, but might not have been replicated to the standby.
^CCancel request sent
ERROR: canceling statement due to user request
The warning speaks for itself: there is an issue with the replication to the slave database. Now, this was weird. I decided to look at the log files and found that replication had stopped because I had not created the tablespace directory (/var/lib/pgsql/9.2/data/tablespace) on the slave! The replication process could not create the tablespace, and the slave node crashed. After creating the directory, the index creation took less than 20 minutes on a 7GB table!
Thankfully, all this was done in the staging environment.
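The takeaway: CREATE TABLESPACE is replayed on the standby via WAL, so the directory must already exist there before the command runs on the master. A sketch of the preparation step, using a demo path under /tmp so it is runnable anywhere; on a real standby, the path would be the same one used on the master (/var/lib/pgsql/9.2/data/tablespace) and owned by the postgres user:

```shell
# Run on the standby BEFORE issuing CREATE TABLESPACE on the master.
TS_DIR="${TS_DIR:-/tmp/tablespace_demo}"   # stand-in for the real path
mkdir -p "$TS_DIR"
chmod 700 "$TS_DIR"                        # restrict access to the owner
ls -ld "$TS_DIR"
```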
What to do when you join a company as a new DBA
When we take up a new job and join a company as a new DBA, what should we do immediately? Unless your manager has specific tasks or expectations for you, this is what I recommend: start documenting or looking through all the databases you will be managing. This should include important database information like the following:
1. Database architecture diagram (Is this RAC, DataGuard or GoldenGate?)
2. Database parameters (AMM or ASMM, Archive logging, DB block size, DBID?)
3. Understand the application using the database (OLTP, DWH, Hybrid?)
4. Read and understand the existing documentation available (DR, failover, backup procedures)
The reason is simple: understand the environment you will be managing thoroughly. Every database environment is different. Documenting the environment helps you understand it fully, and that understanding pays off when you are troubleshooting issues.