Analyzing a Web-Based Performance Problem

by Arash Barirani and Jeffrey Blake

May 10, 2004

Summary

Have performance problems? This article outlines a methodology and a plan of attack for solving performance problems in a web-based system.

Resolving a web application performance problem is as much an art as it is a technical challenge. Although performance problems can have many causes, the outcome is always the same: unscalable, slow-to-respond application software. The best way to resolve such problems is an all-out systems approach in which the application software, the network, and the underlying computing hardware are all considered and evaluated. In this article we describe examples of performance problems and several approaches to solving them.

Makeup of a Performance Problem

A performance problem could be caused by any number of things: a poorly designed architecture, an underpowered CPU, limited network bandwidth, or a combination of several factors. For example, a higher than expected load can easily overwhelm a system's resources. However, a higher volume is not always required to uncover performance problems. Poorly designed software that does not handle resource allocation and contention properly can easily cause deadlocks that eventually lead to nefarious performance problems even at a normal load.

Regardless of what causes a performance problem in a web-based application, the first step in resolving such a problem is to create a performance plan document—even if it is a short one. When you put such a document together, you should identify and involve all the domain experts relevant to the web-based application at hand.

Performance Strategy: Total Systems Approach

Performance troubleshooting of a web-based system, much like mind-body healing, requires a holistic approach. A web-based system is much more than just an application server, a few thousand lines of code, a database, and a firewall. It is more than the sum of its parts. It is an interconnected whole. Therefore, the most effective way to solve a performance problem is to take a total systems approach, a process in which all parts of the application domain are examined from the perspective of performance.

Why is it important, as a manager or a lead engineer, to consider all aspects of an application? The foundation of any web system includes software, hardware, and the network. A short leg on any one of these three foundations can cause performance problems. A system's performance depends as much on a well-tuned, well-configured network as it does on fast computing hardware or a well-designed software architecture. Making your SQL statements more efficient will not solve a performance problem caused by a lack of network bandwidth. Faster computers will not solve a performance problem caused by a poorly configured routing table. Replacing your current network with faster fiber optics will not solve a performance problem caused by a lack of memory or an underpowered CPU.

For example, during one project we came across a performance problem during deployment. After initial investigations it was decided to add more CPUs and memory to the server machines. After an initial spike in performance, the system slowly degraded and we were back in the same place. After further review and testing we determined that the performance problem was caused by a combination of an unscalable architecture and a poorly configured network routing table. The good news was that we had solved the problem. The bad news was that we had spent over $10,000 on hardware that had not solved it. Furthermore, we had spent a good amount of time chasing the wrong solution. Sometimes only a combination of software design, hardware, and network changes will solve a performance problem.

In another project we discovered that the underlying architecture of the system we were developing was not scalable: too few users were able to log on to the system. The project was using MQSeries [1] as the input queue for incoming requests. In our initial investigation, we discovered that the request input module was using only a single input queue. In our first attempt, we tried to solve the problem by changing the software design to a multi-threaded, multi-process architecture that used a number of input queues. To our surprise, however, we discovered during a production run that because many more users were now able to log in, the system itself would run out of key resources such as CPU and memory. It was as if we had upgraded a pickup truck with a more powerful engine so that it could move a heavier payload, only to find that it now required a larger frame, a sturdier chassis, and wider tires. Once our architecture was able to handle larger volumes, it needed a faster machine to handle the larger load. Consequently, the redesign into a multi-process environment had to be combined with faster hardware before the performance problem could be solved.

How to Execute a Total Systems Approach

In an ideal world, you would catch and resolve all performance problems by load testing prior to deployment. In the real world, however, the test and production environments are not always identical. Even when they are identical, a load test is only an approximation of the real load the application will face in production. Therefore, performance bottlenecks and slowdowns often show up right after a new project is deployed. The key is being able to resolve such performance problems efficiently, without too much guesswork, and thereby avoid losses in deployment time and potential business revenue. So it is important to plan ahead for performance problems.

If you are dealing with a new project, schedule time throughout the project development for design reviews that focus on potential deadlock scenarios. While coding the application, think about inserting key timing and trace information that may come in handy later when debugging for performance issues. In addition, put together and involve a team of domain experts that will be key in helping you resolve a future performance problem.
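
As a minimal sketch of the kind of timing and trace information described above, the following Python context manager logs how long a named block of work took. The names `trace_timing`, `perf_trace`, and `handle_request` are hypothetical and chosen only for illustration; a real application would wrap its own request handlers or database calls.

```python
import logging
import time
from contextlib import contextmanager

# Hypothetical logger name; any application logger would do.
logger = logging.getLogger("perf_trace")


@contextmanager
def trace_timing(label, log=logger):
    """Log the elapsed wall-clock time of the enclosed block."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Runs even if the block raises, so slow failures are traced too.
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        log.info("%s took %.1f ms", label, elapsed_ms)


def handle_request(payload):
    """Hypothetical request handler, used only to show the tracing call."""
    with trace_timing("handle_request"):
        return payload.upper()
```

Configuring the logger with a format that includes `%(asctime)s` would time stamp each trace line, which makes it easier to line the traces up with other system logs later.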

Should a performance problem surface, make sure you have a plan. As mentioned previously, it is important that the plan be an overall performance resolution strategy in which the effects of software design changes are evaluated in conjunction with the system's network and the underlying computing hardware. Although fancy tools and monitoring agents can be a great help when faced with a performance problem, the central part of resolving a web-based performance problem is a sharp-minded project manager or an experienced lead systems engineer who has a good knowledge of the systems domain and is armed with the plan.

A Sample Strategy Guide: A Step-by-Step Approach

In the rest of this article we will outline a series of steps that you may use to help you solve a crash or lockup problem.

Identify the problem:
- Classify the performance issue: crash, lockup, or slow-down. Why? Because each scenario has specific symptoms that will give you specific hints as to the cause of the problem.
- Repeat the problem: If the problem only happens in the production environment, make every effort to reproduce it in a controlled environment, such as the Quality Assurance (QA) or the development system. Why? Because being able to repeat the problem enables you to understand and test it. You can try different cause-and-effect scenarios that will eventually lead to the actual cause of the problem, and you avoid using the production machine for troubleshooting.
- Log key systems information: Tracking the vital signs of a system, such as its memory usage, CPU utilization, and disk I/O performance, is always helpful in finding out how efficiently your system operates. Collect this information before and after the problem occurs, as this may give you some clues. You may need to turn logging on for a while before the problem occurs, so watch out for the system log getting too big. Don't forget to time stamp any information that you log.
- Chart or graph all your data: Use a spreadsheet package to display the data you collect in order to understand the behavior of your system. Finding a visual pattern in the occurrence of your performance problem is a key starting point.
- Add custom trace statements: If you cannot repeat the problem, you must start inserting your own debugging information:
  - Make sure timing and other relevant diagnostic information are included.
  - Do active tracing of key information at all times.
  - Take full snapshots of the system information when possible.
  - Be aware that active logging will take disk space and full-time logging will degrade system performance.
- Look back at the latest system changes: If the problem occurred suddenly, retrace your steps to see if that gives you any clues. Sometimes unforeseen or undocumented simple changes can cause problems. For example, an unplanned backup starting at midnight on a server machine dedicated to a specific application process may well cause poor response times, as the server's CPU would be hogged by the backup program.
- Start with the software changes first: Begin your troubleshooting, if possible, with the software changes that could make the most impact, as software is for the most part easier to change than hardware (especially if you have to order and install the hardware).
Include all parties involved: In an all-out systems approach you will need to include the network team, systems operations (systems administrators and database administrators (DBAs)), QA, and the development teams. This has to happen in the early stages of troubleshooting, even if those parties do not have an active role at the start.
Write down the strategic plan:
- Look at the entire system first and then point to the component that is most likely to give you the biggest bang for the buck when you optimize it.
- Avoid guesswork, and make sure you look at the impact of any changes you make on other parts of the system.
- Write down a list of solution options, mark them as short-term or long-term, and see which one fits your case better.
- If you come across a situation that you cannot troubleshoot in the context of the entire application, extract the key components and build a simple application from them. This way you will have an easier time understanding the problem and an easier time resolving it.
- Write down a backup plan in case your first set of solutions doesn't solve the performance problem (for example, a complete system rebuild in case some parts of your system were corrupted).
Keep an accurate and detailed log of all the changes and all the steps taken. This document becomes invaluable when you try multiple changes, as you want to minimize your test scenarios and avoid repeating any previous step.
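
The "log key systems information" and "chart or graph all your data" steps above can be sketched as a small sampler that appends time-stamped readings to a CSV file, which a spreadsheet package can then open and chart. This is only an illustration under stated assumptions: the function names `disk_used_pct`, `default_probes`, and `log_vital_signs` are hypothetical, and the probes are limited to what the Python standard library offers (disk usage everywhere, load average on Unix-like systems). Memory and disk I/O probes are left out because they need platform-specific tools.

```python
import csv
import os
import shutil
import time
from datetime import datetime, timezone


def disk_used_pct(path="/"):
    """Percentage of the disk holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return round(100.0 * usage.used / usage.total, 1) if usage.total else 0.0


def default_probes():
    """Return name -> callable probes built only from the standard library."""
    probes = {"disk_used_pct": disk_used_pct}
    # os.getloadavg is only available on Unix-like systems.
    if hasattr(os, "getloadavg"):
        probes["load_1min"] = lambda: os.getloadavg()[0]
    return probes


def log_vital_signs(csv_path, probes, samples, interval_s):
    """Append `samples` time-stamped rows of probe readings to a CSV file."""
    names = sorted(probes)
    write_header = not os.path.exists(csv_path) or os.path.getsize(csv_path) == 0
    with open(csv_path, "a", newline="") as f:
        writer = csv.writer(f)
        if write_header:
            writer.writerow(["timestamp"] + names)
        for i in range(samples):
            # Time stamp every row so readings can be matched to the problem.
            stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
            writer.writerow([stamp] + [probes[n]() for n in names])
            f.flush()  # keep the file current in case the system locks up
            if i < samples - 1:
                time.sleep(interval_s)
```

A call such as `log_vital_signs("vitals.csv", default_probes(), samples=60, interval_s=5)` would record five minutes of readings. Bounding the number of samples is one simple way to keep the log from growing too big.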

Conclusion

Performance is a system-wide issue that spans the application software, the network, and the underlying computing hardware. A solid performance strategy and a planned effort coordinated with the operations staff, QA, development, and network domain experts are key elements in resolving systems performance issues and creating an optimized, well-tuned web application. Since software changes are often the easiest to make, a good approach is to start with changes to the software code or architecture and only later proceed to bigger and costlier changes to the network or the computing hardware.

About the authors

Jeffrey Blake is a developer and consultant who helps clients solve performance problems of web-based applications.