This morning Artima was down around eight hours because of a 15-minute disk replacement job. In this blog post I recount the struggle I've had with backups over the past half year.
This morning Artima was offline from approximately 3AM to 11AM EST. This outage was caused by a problem that occurred when replacing a disk intended for backups.
I've been having a lot of problems with disks the past year. Last August, a disk went bad on the Artima server, and I found out that although I had been paying for daily backups, apparently I hadn't been paying for the ability to restore from backup. Needless to say I was very disappointed about that. I did have my own offsite "backup" backups of critical data, though not as fresh as the official backups should have been. And fortunately the disk that went bad wasn't completely out to lunch, so I had my ISP mount it on a different server and was able for the most part to restore anything that was still missing. But it was a painful, time-consuming process. The site had been offline for more than 24 hours before it reappeared, and it took about three days to get everything back in place.
Earlier, a salesperson from my ISP had attempted to sell me on a different backup system, which would involve installing a backup disk in the server, running Bacula to back up files to the extra disk, and charging me more money each month. Since disks are quite reliable these days and since I already had a backup regime in place, I declined. What the salesman neglected to tell me was that the existing backup system wouldn't allow me to reliably restore files. So after the catastrophic disk failure in August, the same salesman offered to provide the Bacula system for free. I accepted, and we scheduled some more downtime.
Artima currently runs on a managed server service. We didn't buy the hardware and co-locate it. Instead we pay a monthly fee and the ISP, INETU, provides the hardware and support technicians. If there's a hardware problem, they take care of it. When one of the technicians took the site down to install the backup drive, they apparently didn't get the memo explaining exactly why the disk was being installed, because they spent extra time trying to get the server to boot with the new disk in place. The new drive needed some time to recover its journals, and it took several reboots to get the configuration right so the server would stop trying to boot the OS from the new disk. They sent me an email indicating where the new disk's old /boot, /, /usr, and /var directories were now mounted. I took a look and it was some other customer's data. I guessed they had recycled a disk from a server previously used by a customer who had left. I doubt that customer would be pleased to learn that I had free access to their data. But in this case, it didn't matter because I didn't look any longer than it took me to figure out it was someone else's data. I informed the support people that they should format that disk, adding, "Please be careful to format only the new disk intended for backups."
Before long, Bacula was happily making good—I was assured—backups of Artima's data. But in early October I received an email from tech support indicating the backup drive was showing signs of failure. So we scheduled another 15 minutes of downtime to replace the backup disk. This went fairly smoothly, except that on reboot the OS demanded that fsck (the UNIX file system check) be run on the OS disk, and that required the root password. The tech support folks don't have the root password (they have a password for a different account that has root privileges), and so the technician booted the system from a CD-ROM and ran fsck that way. I was nervous about the window of time in which I didn't have good backups, but replacing the disk was the way to solve that problem.
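For readers unfamiliar with Bacula, a setup like this is driven by the Bacula Director's configuration file. Here's a minimal, hypothetical sketch of a Job and FileSet that back up a server's data to a locally attached disk. The resource names, paths, and schedule here are illustrative assumptions, not Artima's actual configuration:

```
# bacula-dir.conf fragment (hypothetical names and paths)
Job {
  Name = "ArtimaBackup"
  Type = Backup
  Level = Incremental
  Client = artima-fd            # the file daemon on the server
  FileSet = "ArtimaFiles"
  Schedule = "NightlyCycle"
  Storage = LocalDisk           # the extra backup disk
  Pool = Default
  Messages = Standard
}

FileSet {
  Name = "ArtimaFiles"
  Include {
    Options {
      signature = MD5           # checksum each file so restores can be verified
    }
    File = /etc
    File = /var
    File = /home
  }
}
```

The point of a scheme like this is that backups land on a disk Bacula itself manages, so restores can actually be tested—exactly the guarantee the old backup regime turned out not to provide.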
Four days later, however, I received another email indicating this second disk was showing signs of failure. So we scheduled another 15 minutes of downtime to install backup disk version 3. This went smoothly. I felt relatively secure that I had good backups for Artima's data until yesterday, when I received an email indicating that this third backup disk was showing signs of failure. So we scheduled another 15 minutes of downtime, for 3AM last night, to install yet another disk to hold backups (the fourth disk in six months).
Unfortunately, on reboot the OS demanded that fsck be run again, but this time the technician didn't solve the problem by booting from a CD-ROM. Instead they called the number they had on file for me to ask for the root password. Unfortunately that number merely takes voicemails and then emails them to me. So I definitely made the mistake yesterday of not making sure they had a number at which they could reach me if there were a problem. As a result, the server was down for the six hours between 3AM EST, when they started the operation, and 6AM PST, when my alarm went off and I checked the site.
I called and they told me they had given up on running fsck on the OS disk, because they didn't have the root password, and were attempting to restore the OS drive from the backup drive that was acting up. They had put in a new drive for the OS drive and installed a fresh Linux OS there, and put back the old backup drive that actually had backup data on it. They said they were in the process of running fsck on the 120 GB backup drive, which they expected to take a long time. I asked them to put the original OS drive on a different system and try running fsck on it, because I couldn't believe I would be so unlucky as to have two disks go bad at the same time. That worked, and within a half hour the site was struggling to get back on its feet.
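For the record, the recovery path I asked for amounts to attaching the suspect drive to a healthy machine and checking it there, where no root-password prompt blocks the way. A rough sketch of the procedure—device names are hypothetical, and fsck should only ever be run on unmounted partitions:

```
# On a working system, with the suspect drive attached as /dev/sdb
# (the actual device name varies; check dmesg or fdisk -l to find it)

fdisk -l /dev/sdb         # confirm the partition layout first

# Check the root partition; -y answers "yes" to all repair prompts
fsck -y /dev/sdb1

# If the check succeeds, mount it read-only and inspect the data
mkdir -p /mnt/recovery
mount -o ro /dev/sdb1 /mnt/recovery
ls /mnt/recovery
```

Mounting read-only is the conservative choice here: it lets you verify the data survived without risking further writes to a disk you don't yet trust.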
Frank Sommers and I have been working hard on building a new architecture for the new Artima, and we try to deploy the latest software every Sunday morning. Occasionally we let bugs slip out the door, but I have found that most of the problems we encounter in keeping Artima up and running are not software bugs. Usually they are problems with configuring the software. We use several different software products and APIs, and each of them provides "convenient" configuration files and utilities, which unfortunately also provide convenient places for bugs to lurk that don't show up in testing. This morning, however, the problem was that the tech support people were unable either to contact me when they encountered a problem or to find a quick solution to it themselves. So I apologize for the downtime, and will make sure this kind of problem doesn't happen again.
Bill, I don't know if you want to consider moving to a different provider, but when I read your article I couldn't help shaking my head and thinking "the boys that look after our servers provide 1000 times better support than that".
I don't know how much you are paying at INetU, but you should seriously drop a line to the guys at Contegix (http://www.contegix.com/). I can't rave enough about how happy we have been since moving our servers over to Contegix (in fact, you'll probably see a quote from me on their front page).
Sorry to hear about your woes. Can't help feeling that the technicians looking after the hardware could do with some proper training rather than bodging their way through. On the other hand, it's a real world out there; we have just the same kind of problems from time to time.
Incidentally, the last three articles (on databases, Tangosol and unit testing) haven't yet appeared on the Articles Forum even though two of them have replies.
Rather than go to the front page I now use the Articles Forum and the Weblogs Forum as my entry points into Artima since they display all the discussion threads, marking the ones changed since my last visit.
> Sorry to hear about your woes. Can't help feeling that the technicians looking after the hardware could do with some proper training rather than bodging their way through. On the other hand, it's a real world out there; we have just the same kind of problems from time to time.

Other than the trouble with backups, the tech support people have been very competent and helpful. I've had my server there for years, so for the most part service was fine. I got the feeling last summer the tech support people had long known the old backup regime was unreliable. They were probably the ones who suggested switching customers to Bacula, but that required buying a new disk for every server, and management must have decided to try and get customers to pay for that.
> Incidentally, the last three articles (on databases, Tangosol and unit testing) haven't yet appeared on the Articles Forum even though two of them have replies.
>
> Rather than go to the front page I now use the Articles Forum and the Weblogs Forum as my entry points into Artima since they display all the discussion threads, marking the ones changed since my last visit.

That's because we posted them directly to Spotlight, i.e., directly onto the home page. We've been focused a lot on writing software the last few months, so lately Articles and Weblogs have been the main place new stuff has gotten into Spotlight, but we're planning on posting two or three items of news each morning from now on. It occurs to me, however, that we should be posting in the news groups where appropriate (like Java Community News for the Tangosol news item) and approving it to Spotlight. Regardless, if you want to watch these posts you'll need to watch the home page again.