PostgreSQL – Missing chunk 0 for toast value in pg_toast

PostgreSQL – Missing chunk 0 for toast value in pg_toast

What to do when you see similar errors in your pg_log?

STATEMENT:  SELECT id, authz_id, org_id, name, environment, last_updated_by, created_at, updated_at, serialized_object FROM nodes WHERE (org_id = $1 AND name = $2) LIMIT 1
ERROR:  unexpected chunk number 1 (expected 0) for toast value 5268264 in pg_toast_16444
STATEMENT:  SELECT id, authz_id, org_id, name, environment, last_updated_by, created_at, updated_at, serialized_object FROM nodes WHERE (org_id = $1 AND name = $2) LIMIT 1
ERROR:  missing chunk number 4 for toast value 5268263 in pg_toast_16444

The error above are associated with data corruption in your TOAST tables. These were the steps I took to resolve the issue. Firstly, identify the table with issues.

select pg_toast_16444::regclass;
regclass
--------
nodes

Now we know that the TOAST table having issue is associated with nodes tables (I know it’s pretty obvious in the logs!).

Step 1: Reindex the table

REINDEX table pg_toast.pg_toast_16444;
REINDEX table nodes;

Step 2: Vacuum and analyze the table

VACUUM ANALYZE nodes;

Step 3: If it doesn’t work, identify the corrupted rows and delete them (This will incur data loss!).

for ((i=0; i<"Number_of_rows_in_nodes"; i++ )); do psql -U "Username" "Database Name" -c "SELECT * FROM nodes LIMIT 1 offset $i" >/dev/null || echo $i; done

Step 3a: The mini script above will list the rows which are having data corruption. In this case, row 11.

ERROR:  missing chunk number 0 for toast value 5268251 in pg_toast_16444
11

Step 3b: Select the primary key of row 11.

psql> select id from nodes limit 1 offset 11;

id
----------------------------------
00000000000068f26b60aa5ef533cc85
(1 row)

Step 3c: Delete that row

psql> delete from nodes where id ='00000000000068f26b60aa5ef533cc85'

Step 4: Verify if issue has been resolved (No error mesages)

psql> VACUUM FULL nodes;

Regards,
Wei Shan

Leave a comment

Conflict Management in Oracle Active DataGuard 11gR2

Conflict Management in Oracle Active DataGuard 11gR2

In PostgreSQL, I found the following;

The primary and standby servers are in many ways loosely connected. Actions on the primary will have an effect on the standby. An example of the problem situation is an administrator on the primary server running DROP TABLE on a table that is currently being queried on the standby server. Clearly the standby query cannot continue if the DROP TABLE is applied on the standby. If this situation occurred on the primary, the DROP TABLE would wait until the other query had finished. But when DROP TABLE is run on the primary, the primary doesn’t have information about what queries are running on the standby, so it will not wait for any such standby queries. The WAL change records come through to the standby while the standby query is still running, causing a conflict. The standby server must either delay application of the WAL records (and everything after them, too) or else cancel the conflicting query so that the DROP TABLE can be applied.

The test below is to find out if this applies to Oracle as well. I could not find any information for Oracle Active DataGuard.

1. Execute on primary

SQL> select count(*) from test;
COUNT(*)
----------
 2052000

2. Execute on secondary

SQL> select * from (select * from (select * from test));

3. Execute on primary

SQL> truncate table test;

4. Ensure that changes have been replicated to standby

Primary
SQL> alter system switch logfile;
SQL> SELECT THREAD#, MAX(SEQUENCE#) AS "LAST_APPLIED_LOG" FROM V$LOG_HISTORY GROUP BY THREAD#;
<THREAD# LAST_APPLIED_LOG
---------- ----------------
 1 71
Standby
 SQL> SELECT SEQUENCE#,APPLIED FROM V$ARCHIVED_LOG ORDER BY SEQUENCE#
SEQUENCE# APPLIED
 ---------- ---------
 71 YES

After about 3 minutes, the standby will hit the following error;

ERROR:
 ORA-08103: object no longer exists

Using oerr utility will show us the following;

[oracle@primary ~]$ oerr ora 08103
08103, 00000, "object no longer exists"
// *Cause: The object has been deleted by another user since the operation
// began, or a prior incomplete recovery restored the database to
// a point in time during the deletion of the object.
// *Action: Delete the object if this is the result of an incomplete
// recovery.

I know the above error is not too intuitive… but hey, at least it’s a different error from ORA-01555 Snapshot Too Old

Summary

Interestingly, I also tested the same scenario with DROP and DELETE command. DROP command behaves exactly the same as TRUNCATE command. However, DELETE command is very different. The standby was allowed to complete the SELECT query. I believe the reason is because of UNDO tablespace is the primary. DROP and TRUNCATE does not generate UNDO information for the database to rollback. However, DELETE will generate UNDO information in the primary which is REPLICATED via REDO steam to the UNDO tablespace in standby. That is how it is able to maintain the read consistency in standby.

However, the above is just my educated guess. If you have a different opinion, do share it with me. I’m definitely all ears!

Cheers,
Wei Shan

Leave a comment

Oracle 11gR2 DataGuard – ORA-16826 Error

Oracle 11gR2 DataGuard – ORA-16826 Error

I was working with Oracle Active DataGuard recently when I hit the ORA-16826 error. This was what I did to resolve it. When I did a show configuraiton in DG broker, I saw the following error.

DGMGRL> show configuration;

Configuration - test_ADG
Protection Mode: MaxAvailability
 Databases:
 ORCL - Primary database
 ORCL_STBY - Physical standby database
 Warning: ORA-16826: apply service state is inconsistent with the DelayMinsproperty
Fast-Start Failover: DISABLED
Configuration Status:
 WARNING

So I used the oerr utility to find out more about the error.

[oracle@standby trace]$ oerr ora 16826

16826, 0000, "apply service state is inconsistent with the DelayMins property"
// *Cause: This warning was caused by one of the following reasons:
// - The apply service was started without specifying the real-time
// apply option or without the NODELAY option when the DelayMins
// property was set to zero.
// - The apply service was started with the real-time apply option or
// with the NODELAY option when the DelayMins property was set to
// a value greater than zero.
// *Action: Reenable the standby database to allow the broker to restart
// the apply service with the apply options that are consistent
// with the specified value of the DelayMins property.

To verify which is the database having the error.

DGMGRL> show database verbose ‘ORCL_STBY';

Database - ORCL_STBY

Role: PHYSICAL STANDBY
Intended State: APPLY-ON
Transport Lag: 0 seconds
Apply Lag: 0 seconds
Real Time Query: OFF
Instance(s):
ORCL

Properties:
DelayMins = '0'

------Formatted for readability------

Database Status:
WARNING

Check the status on the standby. Note the bold columns. From managed to managed real time apply.

SELECT LOCAL.THREAD#, LOCAL.SEQUENCE# FROM (SELECT THREAD#, SEQUENCE# FROM V$ARCHIVED_LOG WHERE DEST_ID=1) LOCAL WHERE LOCAL.SEQUENCE# NOT IN (SELECT SEQUENCE# FROM V$ARCHIVED_LOG WHERE DEST_ID=2 AND THREAD# = LOCAL.THREAD#);

DEST_NAME RECOVERY_MODE DATABASE_MODE TYPE STATUS DESTINATION
------------------------------ ----------------------- --------------- -------------- ------
LOG_ARCHIVE_DEST_1 IDLE OPEN LOCAL VALID /archive
LOG_ARCHIVE_DEST_2 MANAGED MOUNTED-STANDBY PHYSICAL VALID ORCL_STBY

Change to real time apply mode.

SQL> alter database recover managed standby database cancel;

SQL> alter database recover managed standby database using current logfile disconnect;

DEST_NAME RECOVERY_MODE DATABASE_MODE TYPE STATUS DESTINATION
------------------------------ ----------------------- --------------- -------------- --------- ----------------------------------------
LOG_ARCHIVE_DEST_1 IDLE OPEN LOCAL VALID /archive
LOG_ARCHIVE_DEST_2 MANAGED REAL TIME APPLY MOUNTED-STANDBY PHYSICAL VALID ORCL_STBY

Check that the DataGuard broker configuration is successful now.

DGMGRL> show configuration;

Configuration - test_ADG
Protection Mode: MaxAvailability
 Databases:
 ORCL - Primary database
 ORCL_STBY - Physical standby database
Fast-Start Failover: DISABLED
Configuration Status:
SUCCESS

Cheers,
Wei Shan

Leave a comment

PostgreSQL Error – WARNING: canceling wait for synchronous replication due to user request

PostgreSQL Error – WARNING: canceling wait for synchronous replication due to user request

I needed to create index on my PostgreSQL cluster today. It was 1 master performing synchronous streaming replication to 1 slave. I did the following:

1. Created a new tablespace on master to store all the indexes.
#pgsql create tablespace tablespace_index location ‘/var/lib/pgsql/9.2/data/tablespace';

2. Create the actual index itself.
CREATE INDEX CONCURRENTLY certificatedata_idx2 ON CertificateData (username) tablespace tablespace_index;

The index creation hangs. I checked the pg_stat_activity and saw the the query was still active. It took me about 2 hours before I decide to stop the index creation. When I did a Ctrl-C, I saw the following:

WARNING: canceling wait for synchronous replication due to user request
 DETAIL: The transaction has already committed locally, but might not have been replicated to the standby.
 ^CCancel request sent
 ERROR: canceling statement due to user request

The error speaks for itself. It means you have an issue with the replication to the slave database. Now, this is weird. I decided to look at the log files. And I found out that the replication has stopped because I did not create the tablespace directory (/var/lib/pgsql/9.2/data/tablespace) on the slave! Thus, the replication process could not create an tablespace and the slave node crashes. After creating the directory, the index creation took only less than 20 minutes on a 7GB table!

Thankfully, all these was done in the staging environment.

Regards,
Wei Shan

Leave a comment

What to do when you join a company as a new DBA

What to do when you join a company as a new DBA

When we take up a new job and join a company as a new DBA, what should we do immediately? Unless your manager has specific task or expectations for you, this is what I recommend. Start documenting or looking through all the database you will be managing. This should include important database information like the following:

1. Database architecture diagram (Is this RAC, DataGuard or GoldenGate?)

2. Database parameters (AMM or ASMM, Archive logging, DB block size, DBID?)

3. Understand the application using the database (OLTP, DWH, Hybrid?)

4. Read and understand existing documentations available. (DR, Failover, Backup procedures)

The reason is simple, understand the environment you will be managing throughly. Every database environment is different. By documenting the environment, you will be able to understand it fully. It will help when you are troubleshooting issues.

Regards,
Wei Shan

Leave a comment

PostgreSQL – How to kill database sessions

PostgreSQL – How to kill database sessions

From session #1

[root@postgresql ~]# su - weishan
[weishan@postgresql ~]$ psql -d testdb
psql (9.3.6)
Type "help" for help.
testdb=# SELECT datname as database, pid as pid, usename as username, application_name as application, client_addr as client_address, query FROM pg_stat_activity;
database | pid | username | application | client_address | query 
----------+-------+----------+-------------+----------------+-----------------------------------------
 testdb | 11834 | weishan | psql | | SELECT datname as database, +
 | | | | | pid as pid, +
 | | | | | usename as username, +
 | | | | | application_name as application,+
 | | | | | client_addr as client_address, +
 | | | | | query +
 | | | | | FROM pg_stat_activity;
(1 row)

We can see the username logged in is “weishan” and the SQL statement being executed as well.

From session #2

[root@postgresql ~]# su - postgres
-bash-4.1$ psql -d testdb
psql (9.3.6)
Type "help" for help.
testdb=# SELECT datname as database, pid as pid, usename as username, application_name as application, client_addr as client_address, query FROM pg_stat_activity;
database | pid | username | application | client_address | query 
----------+-------+----------+-------------+----------------+--------------------------------------------------------------
 testdb | 11834 | weishan | psql | | SELECT datname as database, +
 | | | | | pid as pid, +
 | | | | | usename as username, +
 | | | | | application_name as application, +
 | | | | | client_addr as client_address, +
 | | | | | query +
 | | | | | FROM pg_stat_activity;
 testdb | 12067 | postgres | psql | | SELECT datname as database, pid as pid, usename as username,+
 | | | | | application_name as application, +
 | | | | | client_addr as client_address, query +
 | | | | | FROM pg_stat_activity;
(2 rows)

We can now see 2 sessions in the database, session #1 and session #2.

To kill the session, use PG_TERMINATE_BACKEND function.

testdb=# select pg_terminate_backend(pid) from pg_stat_activity where usename='weishan';
 pg_terminate_backend 
----------------------
 t
(1 row)

From session #1

testdb=# SELECT datname as database, pid as pid, usename as username, application_name as application, client_addr as client_address, query FROM pg_stat_activity;
server closed the connection unexpectedly
 This probably means the server terminated abnormally
 before or while processing the request.
The connection to the server was lost. Attempting reset: Succeeded.

We can see that session #1 has been disconnect and auto-connection was being made.

Cheers,
Wei Shan

Leave a comment

How did I became an (Oracle) Database Administrator

How did I became an (Oracle) Database Administrator

Every now and then, I get questions asking me how did I get into the field of database administration. Getting a role as a DBA is tricky. If you look at the requirements for DBA, they usually require more than 3 years of experience. There are very few companies looking for fresh graduates for DBA role.

After completing my 2 years of mandatory National Service (NS), I began to look for an IT job. 2 years of NS reduced my IT knowledge to non-existence. I was out of touch with the IT industry. I couldn’t even tell you what version was .NET framework at that time. I attended an interview (System Engineer) with a web-hosting company in Singapore. It was a horrible experience.

Interviewer: So, why do you want to join the IT industry?
Me: Because I love technology and I think it’s rather interesting.

He started asking me technical questions about the position and I could barely answer his questions. At the end of the interview, he made this remark.

I’m going to be frank with you because I do not want to waste your time. I will not hire you. You have no talent in IT. Since you mentioned that you are interested in technology, why don’t you go and try to sell handphones. It will probably suit you better.

I was devastated. I felt so insulted and shameful of myself. I was never insulted like this before. I went home and cried my lungs out. I didn’t had to mood to attend any interviews after that. I was seriously contemplating about what to do with my career. Then, my brother introduced me a 2 month free Oracle database course (MDeC KDI programme) in Malaysia, KL.

I worked my ass off for 2 months straight and got my first Oracle certification (OCA 11G DBA). When I got back to Singapore, I applied for many Oracle DBA positions and finally I got an interview from Seagate. It took me about 3 months of job hunting. The interview went like this.

He asked me 2 technical questions.

1. Purpose of Redo log
2. Purpose of Undo log

Interview: Why do you want to join us?
Me: Because I want to learn. The job scope is exactly what I am looking for. The salary does not matter to me, you can pay me $800 and I will still want the job.

And yes, I got hired in the end. :)

Summary
Today, I am still very grateful to my hiring manager at that time (I still drink coffee with him once in a while). He told me, the only reason why he hired me was because of my attitude, he knew that I was eager to prove myself and that was why I will be willing to put in everything I got to make it work. My advice to all fellow IT professionals, keep that learning attitude and your chance will come. Don’t give up and don’t get discouraged by the nay-sayers.

We have a saying in Chinese.

机会是留给有准备的人.

Chance is for those who are prepared.

2 Comments

Preparing ASM disks for Oracle RAC Installation (Solaris SPARC)

Preparing ASM disks for Oracle RAC Installation (Solaris SPARC)

Environment:
1. Solaris SPARC 64bit 11.2
2. Oracle RAC 12.1.0.2
3. Storage using HDS VSP

While I was doing my first (Solaris SPARC) RAC installation for a customer, I was slightly stumped. My typical installations were on Linux, with or without ASMLib. It was easy and very well documented. However, I was not so experienced in RAC installation on a SPARC!

I came on-site and the Solaris engineer told me that the LUNs (27 of them!) have been zoned to the system. When I do a “ls /dev/rdsk/”, I see plenty of disks!

c1d0s0 c1d11s2 c1d13s5 c1d16s0 c1d18s3 c1d1s6 c1d22s1 c1d24s4 c1d26s7 c1d3s3 c1d5s6 c1d8s1
c1d0s1 c1d11s3 c1d13s6 c1d16s1 c1d18s4 c1d2 c1d22s2 c1d24s5 c1d27s0 c1d3s4 c1d5s7 c1d8s2

I have formatted the above for readability.

Apparently, Solaris disks have 8 slices, from s0 to s7.

According to MOS[Doc ID 368840.1]:

Partition # 2 is a standard partition and it SHOULD NOT be changed as it reflects the total capacity of the disk. All other partitions are available for the configuration. All this information for the partitions and other data is contained in a special structure stored within the first 512 bytes of the disk, on the first cylinders. This structure is called VTOC (Volume Table of Contents). Actions like re-writting the VTOC will cause loosing the disk. Thus, it would be best to modify partition 0, starting at cylinder.

So, how do I actually format the disk correctly?

#format
Searching for disks...done

AVAILABLE DISK SELECTIONS:
0. c1d1 <HITACHI-OPEN-V -SUN-7303-100.00GB>
/virtual-devices@100/channel-devices@200/disk@1
1. c1d2 <HITACHI-OPEN-V -SUN-7303-50.00GB>
/virtual-devices@100/channel-devices@200/disk@2
2. c1d3 <HITACHI-OPEN-V -SUN-7303 cyl 1363 alt 2 hd 15 sec 512>
/virtual-devices@100/channel-devices@200/disk@3
3. c1d4 <HITACHI-OPEN-V -SUN-7303 cyl 1363 alt 2 hd 15 sec 512>
/virtual-devices@100/channel-devices@200/disk@4

Specify disk (enter its number): Specify disk (enter its number): 3

partition> 0
Part Tag Flag Cylinders Size Blocks
0 unassigned wm 0 - 1362 4.99GB (1363/0/0) 10467840

Enter partition id tag[unassigned]:
Enter partition permission flags[wm]:
Enter new starting cyl[0]: 3
Enter partition size[10444800b, 1360c, 1362e, 5100.00mb, 4.98gb]: $
partition> p
Current partition table (unnamed):
Total disk cylinders available: 1363 + 2 (reserved cylinders)

Part Tag Flag Cylinders Size Blocks
0 unassigned wm 3 - 1362 4.98GB (1360/0/0) 10444800
1 unassigned wu 0 0 (0/0/0) 0
2 backup wu 0 - 1362 4.99GB (1363/0/0) 10467840
3 unassigned wm 0 0 (0/0/0) 0
4 unassigned wm 0 0 (0/0/0) 0
5 unassigned wm 0 0 (0/0/0) 0
6 unassigned wm 0 0 (0/0/0) 0
7 unassigned wm 0 0 (0/0/0) 0

Ta-Da! Now, you should see c1d4s0 as /dev/rdsk/c1d4s0 and be able to view it as a candidate disk!

Regards,
Wei Shan

Leave a comment

Follow

Get every new post delivered to your Inbox.