This morning Artima was down around eight hours because of a 15-minute disk replacement job. In this blog post I recount the struggle I've had with backups over the past half year.
This morning Artima was offline from approximately 3AM to 11AM EST. This outage was caused by a problem that occurred when replacing a disk intended for backups.
I've been having a lot of problems with disks the past year. Last August, a disk went bad on the Artima server, and I found out that although I had been paying for daily backups, apparently I hadn't been paying for the ability to restore from backup. Needless to say I was very disappointed about that. I did have my own offsite "backup" backups of critical data, though not as fresh as the official backups should have been. And fortunately the disk that went bad wasn't completely out to lunch, so I had my ISP mount it on a different server and was able for the most part to restore anything that was still missing. But it was a painful, time-consuming process. The site had been offline for more than 24 hours before it reappeared, and it took about three days to get everything back in place.
Earlier, a salesperson from my ISP had attempted to sell me on a different backup system, which would involve installing a backup disk in the server, running Bacula to back up files to the extra disk, and charging me more money each month. Since disks are quite reliable these days and since I already had a backup regime in place, I declined. What the salesman neglected to tell me was that the existing backup system wouldn't allow me to reliably restore files. So after the catastrophic disk failure in August, the same salesman offered to provide the Bacula system for free. I accepted, and we scheduled some more downtime.
Artima currently runs on a server service. We didn't buy the hardware and co-locate it. Instead we pay a monthly fee and the ISP, INETU, provides the hardware and support technicians. If there's a hardware problem, they take care of it. When one of the technicians took the site down to install the backup drive, they apparently didn't get the memo explaining exactly why the disk was being installed, because they spent extra time trying to get the server to boot with the new disk in place. The new drive needed some time to recover journals, and several reboot iterations to get the configuration correct so the new disk would stop trying to boot the OS. They sent me an email indicating where the new disk's old /var directories were now mounted. I took a look and it was some other customer's data. I guessed they had recycled a disk from a server previously used by a customer who had left. I doubt that customer would be pleased to learn that I had free access to their data. But in this case, it didn't matter because I didn't look any longer than it took me to figure out it was someone else's data. I informed the support people that they should format that disk, adding, "Please be careful to format only the new disk intended for backups."
Before long, Bacula was happily making good (or so I was assured) backups of Artima's data. But in early October I received an email from tech support indicating the backup drive was showing signs of failure. So we scheduled another 15 minutes of downtime to replace the backup disk. This went fairly smoothly, except that on reboot the OS demanded that fsck (UNIX File System Check) be run on the OS disk, and that required the root password. The tech support folks don't have the root password (they have a password for a different account that has root privileges), and so the technician booted the system from a CD-ROM and ran fsck that way. I was nervous about the window of time in which I didn't have good backups, but replacing the disk was the way to solve that problem.
Four days later, however, I received another email indicating this second disk was showing signs of failure. So we scheduled another 15 minutes of downtime to install backup disk version 3. This went smoothly. I felt relatively secure that I had good backups for Artima's data until yesterday, when I received an email indicating that this third backup disk was showing signs of failure. So we scheduled another 15 minutes of downtime, for 3AM last night, to install yet another disk to hold backups (the fourth disk in six months).
Unfortunately, on reboot the OS demanded that fsck be run again, but this time the technician didn't solve the problem by booting from a CD-ROM. Instead they called the number they had on file for me to ask for the root password. Unfortunately that number merely takes voicemails and then emails them to me. So I definitely made the mistake yesterday of not making sure they had a number at which they could reach me if there were a problem. As a result, the server was down for the six hours between 3AM EST when they started the operation and 6AM PST, when my alarm went off and I checked the site.
I called and they told me they had given up on running fsck on the OS disk, because they didn't have the root password, and were attempting to restore the OS drive from the backup drive that was acting up. They had put in a new drive for the OS drive and installed a fresh Linux OS there, and put back the old backup drive that actually had backup data on it. They said they were in the process of running fsck on the 120 GB backup drive, which they expected to take a long time. I asked them to put the original OS drive on a different system and try running fsck on it, because I couldn't believe I would be so unlucky as to have two disks go bad at the same time. That worked, and within a half hour the site was struggling to get back on its feet.
Frank Sommers and I have been working hard on building a new architecture for the new Artima, and we try to deploy the latest software every Sunday morning. Occasionally we let bugs slip out the door, but I have found that most of the problems we encounter in keeping Artima up and running are not software bugs. Usually they are problems with configuring the software. We use several different software products and APIs, and each of them provides "convenient" configuration files and utilities, which unfortunately also provide convenient places for bugs to lurk that don't show up in testing. This morning, however, the problem was that the tech support people were unable either to contact me when they encountered a problem or to find a quick solution to it themselves. So I apologize for the downtime, and will make sure this kind of problem doesn't happen again.
Bill Venners is president of Artima, Inc., publisher of Artima Developer (www.artima.com). He is author of the book, Inside the Java Virtual Machine, a programmer-oriented survey of the Java platform's architecture and internals. His popular columns in JavaWorld magazine covered Java internals, object-oriented design, and Jini. Active in the Jini Community since its inception, Bill led the Jini Community's ServiceUI project, whose ServiceUI API became the de facto standard way to associate user interfaces to Jini services. Bill is also the lead developer and designer of ScalaTest, an open source testing tool for Scala and Java developers, and coauthor with Martin Odersky and Lex Spoon of the book, Programming in Scala.