The Artima Developer Community

Java Community News
Managing Amazon.com's Systems

1 reply on 1 page. Most recent reply: Sep 20, 2006 5:43 AM by bug not

Frank Sommers

Posts: 2642
Nickname: fsommers
Registered: Jan, 2002

Managing Amazon.com's Systems Posted: Sep 19, 2006 12:44 PM
Summary
A recent paper by a team from Amazon.com, UC Berkeley, and Stanford discusses systems management at Amazon.com, including the techniques and tools used to identify and fix system problems and to reduce downtime.

A recent paper, Advanced Tools for Operators at Amazon.com [Note: PDF download], describes how system problems, such as downtime of a service or a component, are identified and fixed at Amazon.com's sprawling datacenter operations.

The paper was delivered at the First Workshop on Hot Topics in Autonomic Computing (HotAC) in Dublin, Ireland. Its co-authors include David Patterson, a co-inventor of RISC and RAID, and Michael I. Jordan, whose statistical research is well known in the computer science community. It gives interesting insights into Amazon.com's highly dynamic network operations. The following are quotes from the paper:

  • "Just in the Monitoring team, the online documentation repository registered hundreds of changes per month during a four-month period last year, and the deployment system registered an average of over a hundred code changes per month being rolled into the production system (excluding any testing or debugging-related deployments)."
  • "Most software developers are also problem resolvers and are part of an on-call rotation. During his rotation of a few consecutive days, the developer/resolver becomes the primary resolver for services owned by his team."
  • "Fewer than a dozen operators monitor the health of the whole web site 24×7. Their task is to monitor for sev1 [high-severity] failures, perform a rapid troubleshooting, and immediately page the affected services’ primary resolvers, who are expected to respond in 15 minutes. The primaries then perform extensive troubleshooting and recovery actions."
  • "A few tens of software teams are responsible for designing, implementing, deploying, and maintaining one or more of the several services that collectively comprise the site’s functionality, both customer-facing services and services that support the site’s infrastructure."
  • "Members of each team usually know which other services they depend on; i.e., over time they learn the local neighborhood around their service in the dependency graph. But nobody remembers dependencies for all services in the company... As a result, the operators and resolvers don’t always see 'the big picture.'"
  • "Amazon.com collects a few million metrics from all their datacenters. These metrics include about 100 hardware metrics from every host (CPU and memory utilization, I/O, swap space, and network interfaces) and application-level metrics such as latency, availability, and error rate of each service."

Amazon.com categorizes system problems into two levels of severity:

Sev1 problems affect customers directly, need to be resolved immediately, and rarely recur. Sev2 problems do not immediately affect the behavior of the website, but could turn into sev1 problems if not resolved quickly.

Sev2s typically affect only a single system component and tend to recur, so the operators learn to recognize and fix them. However, they are about 100 times more frequent than sev1 problems, in part because of rapid churn of the site software.

The paper mentions three problems the system's maintainers face. The following, again, are quotes from the paper:

  • "Complex dependencies among system components can cause failures to propagate to other components, triggering multiple alarms and complicating root-cause determination."
  • "No individual understands all the dependencies among different parts of the system."
  • "The whole system is heavily instrumented at a fine grain, but even though many problems can ultimately be characterized in terms of the behavior of a dozen or so metrics, the total amount of information collected ... can be overwhelming."
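The first of these problems, failures propagating along dependencies and triggering several alarms at once, can be illustrated with a small sketch. This is not the method described in the paper; it is a minimal heuristic under stated assumptions: the dependency graph is fully known (which the paper says is not the case in practice), and an alarming service is only a candidate root cause if none of its direct dependencies is also alarming. The class name, method name, and service names are all hypothetical.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class RootCauseSketch {

    /**
     * Returns the alarming services that do not depend directly on any other
     * alarming service. Alarms on the remaining services may simply be
     * failures propagated from one of their dependencies.
     *
     * dependsOn maps a service name to the services it calls (hypothetical data).
     */
    public static Set<String> candidateRootCauses(
            Map<String, Set<String>> dependsOn, Set<String> alarming) {
        Set<String> candidates = new TreeSet<>(); // sorted for stable output
        for (String service : alarming) {
            Set<String> deps =
                    dependsOn.getOrDefault(service, Collections.<String>emptySet());
            boolean upstreamAlarm = false;
            for (String dep : deps) {
                if (!dep.equals(service) && alarming.contains(dep)) {
                    upstreamAlarm = true;
                    break;
                }
            }
            if (!upstreamAlarm) {
                candidates.add(service);
            }
        }
        return candidates;
    }

    public static void main(String[] args) {
        // Hypothetical services: checkout -> cart -> inventory.
        // Services without an entry are treated as having no dependencies.
        Map<String, Set<String>> dependsOn = Map.of(
                "checkout", Set.of("cart", "payments"),
                "cart", Set.of("inventory"));
        Set<String> alarming = Set.of("checkout", "cart", "inventory");
        System.out.println(candidateRootCauses(dependsOn, alarming)); // prints [inventory]
    }
}
```

Three alarms fire in this example, but only one service has no alarming dependency of its own, so it is the first place a resolver would look. A real system would also have to cope with incomplete dependency data and with independent failures that happen at the same time.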

The paper then introduces two tools that help manage the system in the face of those obstacles. The first one visualizes trouble spots in a wiki-like environment, enabling quicker problem identification.

The second tool turns Amazon's own prowess in predicting user interests into a useful problem-solving technique: By analyzing the actions system administrators took to resolve past problems, the system can suggest possible resolutions to operators when similar problems occur.
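The idea behind the second tool, suggesting fixes that worked for similar past problems, can be sketched as a nearest-neighbor lookup. This post does not say which algorithm the paper uses, so everything below is an assumption made for illustration: each past problem is reduced to a small vector of metric values, similarity is plain Euclidean distance, and the class, field, and resolution names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class ResolutionSuggester {

    /** A resolved problem: a hypothetical metric snapshot plus the action that fixed it. */
    public static final class PastIncident {
        final double[] metrics;
        final String resolution;

        public PastIncident(double[] metrics, String resolution) {
            this.metrics = metrics.clone();
            this.resolution = resolution;
        }
    }

    /** Euclidean distance between two metric snapshots of equal length. */
    static double distance(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("metric snapshots differ in length");
        }
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /** Returns the resolutions of the k past incidents closest to the current snapshot, nearest first. */
    public static List<String> suggest(List<PastIncident> history, double[] current, int k) {
        List<PastIncident> sorted = new ArrayList<>(history);
        sorted.sort(Comparator.comparingDouble(
                (PastIncident p) -> distance(p.metrics, current)));
        int limit = Math.max(0, Math.min(k, sorted.size()));
        List<String> suggestions = new ArrayList<>();
        for (PastIncident p : sorted.subList(0, limit)) {
            suggestions.add(p.resolution);
        }
        return suggestions;
    }

    public static void main(String[] args) {
        // Hypothetical snapshots: {CPU utilization, disk utilization, error rate}.
        List<PastIncident> history = List.of(
                new PastIncident(new double[] {0.95, 0.20, 0.01}, "restart the service"),
                new PastIncident(new double[] {0.30, 0.90, 0.02}, "free disk space"),
                new PastIncident(new double[] {0.25, 0.15, 0.40}, "roll back the last deployment"));
        double[] current = {0.28, 0.18, 0.35};
        System.out.println(suggest(history, current, 2));
    }
}
```

With this made-up history, the new problem most closely resembles the incident that was fixed by a rollback, so that resolution is suggested first. In practice the metrics would need to be normalized, and a production recommender would likely weigh how often each action actually succeeded.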

To what extent does your organization share Amazon's challenges in detecting and fixing system problems? What tools do you use to ease that task?


bug not

Posts: 41
Nickname: bugmenot
Registered: Jul, 2004

Re: Managing Amazon.com's Systems Posted: Sep 20, 2006 5:43 AM
This certainly is monitoring on a large scale! :)

For single-system monitoring, you can use, among other products, MessAdmin, a non-intrusive notification system and session administration tool for J2EE Web applications that gives detailed statistics and information on any Web application. It installs as a plug-in to any Java EE WebApp and requires no code modification.
Some of the information this tool gives about your users: What are they doing? How much CPU did they use? How much bandwidth? How much memory (HttpSession size)?
Check it out!


Copyright © 1996-2019 Artima, Inc. All Rights Reserved. - Privacy Policy - Terms of Use