Today, I found out that I made an extremely silly mistake.
This was a very simple project: a 2-node Oracle RAC setup connected to an IBM v7000 storage system. The deployment was smooth. I configured the ASM disks using udev and device-mapper. However, after the system was handed over to the customer, they began encountering reboots and the following error.
qla2xxx [0000:1b:00.0]-801c:1: Abort command issued nexus=1:1:0 -- 1 2002.
These were the steps I took to resolve the issue.
- Logging service tickets with IBM, Red Hat and QLogic
- Flashing of QLogic HBA firmware
- Patching of QLogic HBA driver
However, the issue was still not resolved. The root cause was in fact that udev was configured incorrectly! The SCSI timeout was supposed to be set to 120s according to IBM documentation. This was a mistake that made me angry with myself. These are the lessons I took from it:
- Check and verify all configurations again. (This was the part I did not perform. I was VERY confident that I configured it correctly)
- Do not assume and bet on the root cause so quickly. (I was too quick to assume the root cause was a firmware/driver issue)
- Do not be swayed by the customer no matter how experienced they are. Have your own understanding and judgement!
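For the "check and verify" step, a small script can do the comparison instead of relying on memory. The following is only a rough sketch, not the check I actually ran: it assumes the disks show up as sd* devices and that the kernel exposes the SCSI timeout at /sys/block/<device>/device/timeout. The function name find_wrong_timeouts and the constant EXPECTED_TIMEOUT are made up for this example, and multipath (dm-*) devices are not covered.

```python
from pathlib import Path

# 120 seconds is the value the IBM documentation called for in this setup.
EXPECTED_TIMEOUT = 120


def find_wrong_timeouts(sys_block=Path("/sys/block"), expected=EXPECTED_TIMEOUT):
    """Return {device_name: current_timeout} for sd* devices that differ from expected."""
    wrong = {}
    # Assumption: SCSI disks are named sd* and expose device/timeout in sysfs.
    for timeout_file in sorted(Path(sys_block).glob("sd*/device/timeout")):
        try:
            current = int(timeout_file.read_text().strip())
        except (OSError, ValueError):
            continue  # unreadable file or unexpected content: skip this device
        if current != expected:
            # timeout_file is <sys_block>/<name>/device/timeout
            wrong[timeout_file.parent.parent.name] = current
    return wrong


if __name__ == "__main__":
    for name, value in find_wrong_timeouts().items():
        print(f"{name}: timeout is {value}s, expected {EXPECTED_TIMEOUT}s")
```

Running this on both RAC nodes after applying the udev rules, and again after a reboot, would have shown whether the rule really took effect on every path.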