A recent paper by a team from Amazon.com, UC Berkeley, and Stanford discusses systems management at Amazon.com, including techniques and tools used to identify and fix system problems and to reduce downtime.
A recent paper, Advanced Tools for Operators at Amazon.com [Note: PDF download], describes how system problems, such as downtime of a service or a component, are identified and fixed at Amazon.com's sprawling datacenter operations.
The paper was delivered at the First Workshop on Hot Topics in Autonomic Computing (HotAC) in Dublin, Ireland. Its co-authors include David Patterson, a co-inventor of RISC and RAID, and Michael I. Jordan, whose statistical research is well known in the computer science community. It gives interesting insights into Amazon.com's highly dynamic network operations. The following are quotes from the paper:
"Just in the Monitoring team, the online documentation repository registered hundreds of changes per month during a four-month period last year, and the deployment system registered an average of over a hundred code changes per month being rolled into the production system (excluding any testing or debugging-related
"Most software developers are also problem resolvers and are part of an on-call rotation. During his rotation of a few consecutive days, the developer/resolver becomes the primary resolver for services owned by his team."
"Fewer than a dozen operators monitor the health of the whole web site 24×7. Their task is to monitor for sev1 [high-severity] failures, perform a rapid troubleshooting, and immediately page the affected services’ primary resolvers, who are expected to respond in 15 minutes. The primaries then perform extensive troubleshooting and recovery actions."
"A few tens of software teams are responsible for designing, implementing, deploying, and maintaining one or more of the several services that collectively comprise the site’s functionality, both customer-facing services and services that support the site’s infrastructure."
"Members of each team usually know which other services they depend on; i.e., over time they learn the local neighborhood around their service in the dependency graph. But nobody remembers dependencies for all services in the company... As a result, the operators and resolvers don’t always see 'the big picture.'"
"Amazon.com collects a few million metrics from all their datacenters. These metrics include about 100 hardware metrics from every host (CPU and memory utilization, I/O, swap space, and network interfaces) and application-level metrics such as latency, availability, and error rate of each service."
Amazon.com categorizes system problems into two levels of severity:
Sev1 problems affect customers directly and need to be resolved immediately; they rarely recur. Sev2 problems do not immediately affect the behavior of the website, but could turn into sev1 problems if not resolved quickly.
Sev2s typically affect only a single system component and tend to recur, so operators learn to recognize and fix them. However, they are about 100 times more frequent than sev1 problems, in part because of the rapid churn of the site software.
The paper mentions three problems the system's maintainers face. The following, again, are quotes from the paper:
"Complex dependencies among system components can cause failures to propagate to other components, triggering multiple alarms and complicating root-cause determination."
"No individual understands all the dependencies among different parts of the system."
"The whole system is heavily instrumented at a fine grain, but even though many problems can ultimately be characterized in terms of the behavior of a dozen or so metrics, the total amount of information collected ... can be overwhelming."
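The first of these problems, alarms spreading along service dependencies, can be illustrated with a small sketch. It is not taken from the paper: the service names, the `depends_on` map, and the `likely_root_causes` helper are all hypothetical, and the heuristic (treat an alarming service as a likely root cause only if none of its own dependencies is alarming) is one simple assumption about how such alarms might be narrowed down.

```python
from typing import Dict, Set


def likely_root_causes(depends_on: Dict[str, Set[str]], alarming: Set[str]) -> Set[str]:
    """Return alarming services that have no alarming dependency.

    Hypothetical heuristic: if a service alarms and one of the services it
    depends on also alarms, the failure probably propagated from below.
    """
    return {
        svc
        for svc in alarming
        if not (depends_on.get(svc, set()) & alarming)
    }


# Made-up dependency graph: each service maps to the services it calls.
depends_on = {
    "checkout": {"payments", "inventory"},
    "payments": {"database"},
    "inventory": {"database"},
    "database": set(),
}

# Three alarms fire at once, but only one service has no alarming dependency.
alarms = {"checkout", "payments", "database"}
print(likely_root_causes(depends_on, alarms))
```

In this made-up example the three alarms reduce to a single candidate, which is the kind of narrowing that becomes hard when nobody knows the full dependency graph.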
The paper then introduces two tools that help manage the system in the face of those obstacles. The first one visualizes trouble spots in a wiki-like environment, enabling quicker problem identification.
The second tool turns Amazon's own prowess in predicting user interests into a useful problem-solving technique: by analyzing the actions system administrators took to resolve past problems, the system can suggest possible resolutions to operators when similar problems occur.
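To make the idea of suggesting resolutions from past problems concrete, here is a minimal sketch. It is an assumption for illustration, not the algorithm from the paper: it represents each past incident as a set of symptom labels plus the action that resolved it, and ranks past incidents by Jaccard similarity to the new symptoms. The symptom names, actions, and the `suggest_resolutions` function are all hypothetical.

```python
from typing import List, Set, Tuple

# A past incident: (observed symptoms, action that resolved it).
Incident = Tuple[Set[str], str]


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity: size of the intersection over size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def suggest_resolutions(past: List[Incident], symptoms: Set[str], top_n: int = 2) -> List[str]:
    """Return the actions of the most similar past incidents, best match first."""
    ranked = sorted(past, key=lambda inc: jaccard(inc[0], symptoms), reverse=True)
    # Drop incidents that share no symptom at all with the new problem.
    return [action for syms, action in ranked[:top_n] if jaccard(syms, symptoms) > 0]


# Made-up incident history.
past = [
    ({"high_latency", "db_connection_errors"}, "restart database connection pool"),
    ({"disk_full", "write_errors"}, "rotate and purge logs"),
    ({"high_latency", "cpu_saturation"}, "add hosts to the service fleet"),
]

new_symptoms = {"high_latency", "db_connection_errors", "error_rate_up"}
print(suggest_resolutions(past, new_symptoms))
```

A real system would need richer incident descriptions and some way to weigh how well an action worked in the past; the sketch only shows the "similar problem, similar fix" lookup.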
To what extent does your organization share Amazon's challenges in detecting and fixing system problems? What tools do you use to ease that task?
For single-system monitoring, you can use, among other products, MessAdmin, a non-intrusive notification system and session administration tool for J2EE Web applications that gives detailed statistics and information on any Web application. It installs as a plug-in to any Java EE web application and requires no code modification. The tool answers questions about your users such as: What are they doing? How much CPU did they use? How much bandwidth? How much memory (HttpSession size)? Check it out!