Food for thought: Troubleshooting issues

Today, I found out that I made an extremely silly mistake.

This was a very simple project: a 2-node Oracle RAC setup connected to an IBM v7000 storage system. The deployment was smooth. I configured the ASM disks using udev and device-mapper. However, after the system was handed over to the customer, they began encountering reboots and the following error.

qla2xxx [0000:1b:00.0]-801c:1: Abort command issued nexus=1:1:0 -- 1 2002.

These were the steps I took to resolve the issue.

  1. Logged service tickets with IBM, Red Hat and QLogic.
  2. Flashed the QLogic HBA firmware.
  3. Patched the QLogic HBA driver.

However, the issue was still not resolved. The root cause was in fact that udev was configured incorrectly! According to IBM documentation, the SCSI timeout was supposed to be set to 120s. This mistake made me angry with myself. These are the lessons I took away from it:

  1. Check and verify all configurations again. (This was the part I did not perform. I was VERY confident that I had configured it correctly.)
  2. Do not assume and bet on the root cause so quickly. (I was too quick to assume the root cause was a firmware/driver issue.)
  3. Do not be swayed by the customer no matter how experienced they are. Have your own understanding and judgement!
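
The first lesson, verifying the configuration instead of trusting memory, can be partly automated. Below is a minimal sketch, not the check I actually ran. It assumes the disks show up as `sd*` devices and that the kernel exposes each disk's SCSI timeout in `/sys/block/<device>/device/timeout`. The function name `find_wrong_timeouts` and the constant `EXPECTED_TIMEOUT` are made up for illustration. The 120s value is the one from the IBM documentation mentioned above.

```python
import glob
import os

# Expected SCSI timeout in seconds (the value cited from IBM documentation).
EXPECTED_TIMEOUT = 120


def find_wrong_timeouts(sysfs_block="/sys/block", expected=EXPECTED_TIMEOUT):
    """Return {device: timeout} for sd* disks whose timeout differs from expected.

    sysfs_block is a parameter so the check can be pointed at a test directory.
    """
    wrong = {}
    pattern = os.path.join(sysfs_block, "sd*", "device", "timeout")
    for path in sorted(glob.glob(pattern)):
        # Path looks like <sysfs_block>/sdb/device/timeout; pick out "sdb".
        device = path.split(os.sep)[-3]
        with open(path) as f:
            value = int(f.read().strip())
        if value != expected:
            wrong[device] = value
    return wrong


if __name__ == "__main__":
    mismatches = find_wrong_timeouts()
    for dev, val in mismatches.items():
        print(f"{dev}: timeout is {val}s, expected {EXPECTED_TIMEOUT}s")
    if not mismatches:
        print("All sd* devices report the expected timeout.")
```

Running something like this on both RAC nodes right after deployment, and again after a reboot, would show whether the udev rule really took effect.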

 

Regards,
Wei Shan
