Wednesday, October 31, 2007

High I/O Wait in AIX

Someone in our department who monitors the AIX machines claimed that a particular machine was running at 70% I/O wait during a period when part of the application was running slower than normal. The assertion was that our system had an I/O bottleneck.

Since the application runs on an AIX platform, I turned to sar to investigate. The sar data showed a very low percentage of idle CPU and, surprisingly, the percentage of time the system was waiting for I/O was about 60-70%. A high wait I/O value from sar does not by itself indicate an I/O bottleneck, but an I/O bottleneck can produce high wait I/O percentages. Looking further at the I/O service times and the average wait queue in iostat showed that the I/O subsystem was performing very poorly.

The storage was a SAN box with only 3 drives in RAID-5. After we added 3 more drives to this box, I/O wait dropped substantially, to below 20%. Our application's performance has improved somewhat since then, but I felt more could be achieved. After some googling, I found a suggestion that I/O wait should ideally be below 10%.
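To make the reasoning above concrete, here is a minimal Python sketch of the check I was doing by eye: high I/O wait alone is not treated as proof, it only counts when the disk service time or queue length is also poor. The names DiskSample and looks_io_bound are made up for illustration, the 10% floor comes from the rule of thumb mentioned above, and the service-time and queue thresholds are assumptions of mine, not values from IBM or from our system.

```python
from dataclasses import dataclass


@dataclass
class DiskSample:
    """One observation combining sar and iostat style figures (hypothetical)."""
    iowait_pct: float      # % of time the CPU was idle waiting on I/O
    avg_service_ms: float  # average I/O service time in milliseconds
    avg_queue_len: float   # average number of requests waiting


def looks_io_bound(sample, iowait_floor=10.0, max_service_ms=20.0, max_queue_len=2.0):
    """Return True only when high I/O wait is backed by poor disk metrics.

    All three thresholds are illustrative assumptions and should be tuned
    for the storage actually in use.
    """
    if sample.iowait_pct < iowait_floor:
        # Low I/O wait: nothing to investigate on the disk side.
        return False
    # High I/O wait alone is not enough; require slow service or a long queue.
    return (sample.avg_service_ms > max_service_ms
            or sample.avg_queue_len > max_queue_len)


# Hypothetical readings, not measurements from our machine.
print(looks_io_bound(DiskSample(iowait_pct=70, avg_service_ms=45, avg_queue_len=8)))
print(looks_io_bound(DiskSample(iowait_pct=70, avg_service_ms=5, avg_queue_len=0.5)))
```

The second call is the interesting case: the same 70% I/O wait, but healthy disks, which points away from the storage and toward the application or memory.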

I remembered reading in IBM’s book, AIX Performance Tuning, that "a high % iowait indicates that the system has an application problem, a memory shortage, or an inefficient I/O subsystem configuration". The machine has enough memory (8 GB), and the I/O subsystem is now OK. Thus, the remaining problem must be in our application. We found many PL/SQL functions used in queries, which can lead to performance degradation: even a simple query generated a lot of I/O because it called PL/SQL functions repeatedly. There were also other application problems, such as missing indexes and not using bind variables. Our developers are now working hard to fix them.
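The cost of calling a function from inside a query can be shown with a small sketch. This is only an analogy: it uses Python's built-in sqlite3 module rather than Oracle and PL/SQL, and the orders table and the slow_lookup function are invented for the example. It counts how often a user-defined function runs when it appears in a WHERE clause, and it passes the comparison value as a bind variable instead of pasting it into the SQL text.

```python
import sqlite3


def count_function_calls(row_count):
    """Run one query and return (function call count, matching row count)."""
    calls = {"n": 0}

    def slow_lookup(value):
        # Stand-in for a PL/SQL function; in a real system each call
        # might run its own query and trigger more I/O.
        calls["n"] += 1
        return value * 2

    conn = sqlite3.connect(":memory:")
    conn.create_function("slow_lookup", 1, slow_lookup)
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount INTEGER)")
    conn.executemany(
        "INSERT INTO orders (amount) VALUES (?)",
        [(i,) for i in range(row_count)],
    )
    # The threshold is passed as a bind variable (?), not concatenated
    # into the statement text.
    rows = conn.execute(
        "SELECT id FROM orders WHERE slow_lookup(amount) > ?", (10,)
    ).fetchall()
    conn.close()
    return calls["n"], len(rows)


calls, matched = count_function_calls(100)
print("function calls:", calls, "rows matched:", matched)
```

The function runs for every row the query has to examine, so a query that looks simple can do a surprising amount of work. In our case the usual remedies would be to replace the function with a join or a precomputed column where possible, though which one applies depends on the query.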