In the rest of this article we will outline a series of steps that you may use to help you
solve a crash or lockup problem.
Identify the problem:
Classify the performance issue: crash, lockup, or slow-down.
Why? Because each scenario has specific symptoms that will give you specific hints as to
the cause of the problem.
Repeat the problem: If the problem only happens in the
production environment, make every effort to reproduce it in a controlled
environment, such as the Quality Assurance (QA) or the development system. Why? Because being able to repeat
the problem lets you understand and test it. You can try different
cause-and-effect scenarios that will eventually lead to the actual cause of the problem, and
you avoid using the production machine for troubleshooting.
Log key systems information: Tracking the vital signs of a system,
such as its memory usage, CPU utilization, and disk I/O performance, is always helpful
in finding out how efficiently your system operates. Collect this information before and
after the problem occurs, as this may give you some clues. You may need to turn
logging on for a while before the problem occurs, so watch out for the system
log getting too big. Don't forget to time stamp any information that you log.
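As a rough illustration of this step, the Python sketch below writes time-stamped samples to a log file that is rotated when it reaches a size limit, so it cannot grow without bound. It uses only the standard library, so load average and disk usage stand in for the fuller set of vital signs; memory, CPU utilization, and disk I/O usually need a platform tool or a third-party package such as psutil. The function names, the logger name, and the size limits are assumptions made for this example.

```python
import logging
import os
import shutil
import time
from logging.handlers import RotatingFileHandler


def make_logger(path, max_bytes=1_000_000, backups=3):
    """Return a logger whose file is rotated so it cannot grow without bound."""
    logger = logging.getLogger("vitals")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    handler = RotatingFileHandler(path, maxBytes=max_bytes, backupCount=backups)
    # %(asctime)s puts a time stamp on every record.
    handler.setFormatter(logging.Formatter("%(asctime)s %(message)s"))
    logger.addHandler(handler)
    logger.propagate = False
    return logger


def sample_vitals(disk_path="."):
    """Collect a few vital signs using only the standard library."""
    disk = shutil.disk_usage(disk_path)
    sample = {"disk_used_pct": round(100 * disk.used / disk.total, 1)}
    # os.getloadavg() exists only on Unix-like systems and may fail there too.
    if hasattr(os, "getloadavg"):
        try:
            sample["load_1min"] = round(os.getloadavg()[0], 2)
        except OSError:
            pass
    return sample


def log_vitals(logger, samples=3, interval_s=1.0):
    """Write `samples` time-stamped records, pausing `interval_s` between them."""
    for _ in range(samples):
        vitals = sample_vitals()
        logger.info(" ".join(f"{key}={value}" for key, value in vitals.items()))
        time.sleep(interval_s)
```

Calling `log_vitals(make_logger("vitals.log"))` would append three time-stamped lines to `vitals.log`. In practice you would run the sampler on a schedule so that it starts well before the problem is expected to occur.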
Chart or graph all your data: Use a spreadsheet package to
chart the data you collect in order to understand the behavior of your system.
Finding a visual pattern in the occurrence of your performance problem is
a key starting point.
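One possible way to prepare logged data for charting is sketched below in Python. It counts incidents by hour of day, writes the counts to a CSV file that a spreadsheet package can open and chart, and prints a rough text chart for a quick look. The function names and the sample time stamps are made up for illustration.

```python
import csv
from collections import Counter
from datetime import datetime


def incidents_per_hour(timestamps):
    """Count incidents by hour of day (0-23) from ISO 8601 time stamps."""
    return Counter(datetime.fromisoformat(ts).hour for ts in timestamps)


def write_hourly_csv(counts, path):
    """Write one row per hour so a spreadsheet can chart the counts directly."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["hour", "incidents"])
        for hour in range(24):
            writer.writerow([hour, counts.get(hour, 0)])


def text_chart(counts):
    """Render a rough bar chart as text, one line per hour that had incidents."""
    return "\n".join(
        f"{hour:02d}:00 {'#' * n}" for hour, n in sorted(counts.items())
    )


# Hypothetical incident times, invented for this example.
sample = [
    "2024-01-08T00:05:00",
    "2024-01-08T00:17:00",
    "2024-01-08T14:30:00",
    "2024-01-09T00:02:00",
]
print(text_chart(incidents_per_hour(sample)))
# 00:00 ###
# 14:00 #
```

With this invented data the cluster just after midnight stands out at once, which is the kind of visual pattern that gives you a starting point.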
Add custom trace statements: If you cannot repeat the problem,
start inserting your own debugging information:
Make sure timing and other relevant diagnostic information are included.
Do active tracing of key information at all times.
Take full snapshots of the system information when possible.
Be aware that active logging will take disk space and full-time logging will degrade
performance.
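A custom trace statement can be as simple as the Python sketch below, which logs when a step begins and ends, how long it took, and whether it failed. The `traced` helper, the logger name, and the `load_orders` step are hypothetical names chosen for this example; the time stamp on each line comes from the logging format.

```python
import logging
import time
from contextlib import contextmanager

trace_log = logging.getLogger("trace")  # hypothetical logger name


@contextmanager
def traced(step, **details):
    """Log when a step starts and ends, how long it took, and whether it failed."""
    extra = " ".join(f"{key}={value!r}" for key, value in details.items())
    start = time.perf_counter()
    trace_log.info("BEGIN %s %s", step, extra)
    try:
        yield
    except Exception:
        # Record the failure with its traceback, then re-raise unchanged.
        trace_log.exception("FAIL %s after %.3fs", step, time.perf_counter() - start)
        raise
    else:
        trace_log.info("END %s after %.3fs", step, time.perf_counter() - start)


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO, format="%(asctime)s %(name)s %(message)s")
    # Hypothetical step; the sleep stands in for real work.
    with traced("load_orders", customer_id=42):
        time.sleep(0.05)
```

Because the trace goes through the standard logging module, you can raise the logger's level to turn tracing off once the problem is found, which limits the disk space and overhead it costs.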
Look back at the latest system changes: If the problem occurred
suddenly, backtrack your steps to see if that gives you any clues. Sometimes unforeseen
or undocumented simple changes could cause problems. For example, an unplanned
backup starting at midnight on a server machine dedicated to a specific application
process may well cause a poor response time as the server's CPU would be hogged by
the backup program.
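When changes are undocumented, file modification times can offer a first clue. The Python sketch below lists files under a directory that changed within a recent time window, newest first. The function name and the 24-hour default are assumptions for this example, and modification times only hint at what changed; they do not replace a proper change record.

```python
import os
import time


def recently_modified(root, hours=24.0, now=None):
    """Return (path, mtime) pairs under root changed in the last `hours`, newest first."""
    now = time.time() if now is None else now
    cutoff = now - hours * 3600
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                mtime = os.path.getmtime(path)
            except OSError:
                continue  # The file vanished or is unreadable; skip it.
            if mtime >= cutoff:
                found.append((path, mtime))
    return sorted(found, key=lambda item: item[1], reverse=True)
```

For example, `recently_modified("/etc", hours=48)` would list configuration files on a Unix-like system that changed in the last two days, assuming you have permission to read that directory.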
Start with the software changes first: Begin your
troubleshooting, if possible, with the software changes that could make the most impact, as
software is for the most part easier to change than hardware (especially
if you have to order and install hardware).
Include all parties involved: In an all-out systems approach you will need to include
the network team, the systems operations staff (systems administrators and database administrators (DBAs)), QA, and
the development teams. This has to happen in the early stages of troubleshooting the
performance problem, even if those parties do not have an active role at the start.
Write down the strategic plan:
Look at the entire system first and then focus on the component that is most likely to
give you the biggest bang for the buck when you optimize it.
Avoid guesswork and make sure you look at the impact of any changes you make on
other parts of the system.
Write down a list of solution options and mark them as short-term or long-term and
see which one fits your case better.
If you come across a situation that you cannot troubleshoot in the context of the
entire application, extract the key components and make a simple application. This
way you will have an easier time understanding the problem and resolving it.
Write down a backup plan in case your first set of solutions doesn't solve the
performance problem. (For example, a complete system rebuild in case some parts of your
system were corrupted.)
Keep an accurate and detailed log of all the changes and all the steps taken. This
document becomes invaluable when you try multiple changes, as you want to minimize your
test scenarios and avoid repeating any previous step.
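The change log from the last item can be a plain document, but if you prefer something you can query, a minimal Python sketch might look like the following. It appends one time-stamped JSON line per change and lets you check whether a change was already tried. The file layout, field names, and function names are assumptions for this example.

```python
import json
from datetime import datetime, timezone


def record_change(path, change, result):
    """Append one time-stamped entry describing a change and what it did."""
    entry = {
        "time": datetime.now(timezone.utc).isoformat(timespec="seconds"),
        "change": change,
        "result": result,
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")
    return entry


def already_tried(path, change):
    """Return True if the same change description is already in the log."""
    try:
        with open(path, encoding="utf-8") as f:
            return any(
                json.loads(line)["change"] == change for line in f if line.strip()
            )
    except FileNotFoundError:
        return False  # No log yet, so nothing has been tried.
```

Checking `already_tried` before each test run is one way to avoid repeating a previous step when you are working through many changes.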
Performance is a system-wide issue that spans the application
software, the network, and the underlying computing hardware. A solid
performance strategy and a planned effort coordinated with the operations staff, QA,
development, and network domain experts are key elements in resolving
system performance issues and creating an optimized, well-tuned web application.
Since software changes are often the easiest to make, a good approach would be to
start with software code or software architectural changes first and then later proceed
to bigger and costlier changes to the network or the computing hardware systems.