Tracking down performance issues in large distributed systems is inherently complicated. Why is the application slow? Where is the bottleneck? In my experience, one of the more insidious culprits is known as high IO wait. A place where, in the words of Dr. Seuss, everyone is just waiting.
The first indication of a high IO wait issue is normally system load average. The load average is computed based on CPU utilization, and includes the number of processes using or waiting to use the CPU, and, importantly on Linux, process that are in uninterruptible sleep. The load average can be interpreted on a basic level as being a CPU core at full utilization has a system load average of one. So, for a quad-core machine, a system load average of 4 would mean that the machine had adequate resources to handle the work it needed to do, but just barely. On the same quad-core system, a load average of eight would mean that if the server had eight cores instead of four, it would have been able to handle the work, but it is now overloaded. Maybe.