I received a nice little notice from PG&E a couple of weeks ago saying that they would be doing some routine maintenance in my neighborhood and would need to shut the power off for several hours to complete it. The letter was a nice heads-up that the power would probably be off between 10 and 2 on the upcoming Saturday. This wasn’t a huge problem. I’ve got a rack of servers in the garage serving several properties that I’m working on, but since they are either not generating revenue yet or still in an early beta stage, a little downtime wouldn’t be the end of the world.
I made a promise that I wouldn’t sink money into colo space until the properties were generating enough revenue to pay for it. The bandwidth to my garage is sufficient, and the power up until now had been very reliable. Also, this way I would be pushed to do more with constrained resources and wouldn’t be careless with expenses.
Saturday morning I woke up to the sound of the puppy running around the house and my wife cursing about not being able to get a load of laundry done. It turns out that PG&E had jumped the gun and the power went off at 8am instead of 10am. My plan of carefully shutting down all of the servers went out the window, and there wasn’t much I could do about it at that point. So I spent the day out running errands, waiting for the power to come back up.
Later that afternoon the power was restored and I quickly went about the task of bringing everything back online. The first problem to crop up was that my backup server with the large disk shorted out immediately, and nothing was going to bring it back. One server down.
The rest of the servers came back online without too much of a hitch. They all booted properly and seemed to be functioning normally. Upon closer inspection, though, I found that the MySQL database had been corrupted and that I would have to restore from the nightly backup (thankfully this was also copied to a server other than the backup server). I would lose 12 hours of data, but everything else would be restored. The only problem was that the database was 7GB by now and would take a while to process.
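For anyone weighing the same trade-off, the arithmetic behind "how much data do I lose and how long will the restore take" can be sketched roughly like this. It is only an illustration: the file names, timestamps, and the restore rate are made-up assumptions, not values from my actual setup.

```python
from datetime import datetime, timedelta


def pick_latest_backup(backups, failure_time):
    """Return (name, data_loss_window) for the newest backup taken
    at or before failure_time.

    `backups` maps a hypothetical dump file name to the time it was taken.
    """
    usable = [(taken, name) for name, taken in backups.items()
              if taken <= failure_time]
    if not usable:
        raise ValueError("no backup predates the failure")
    taken, name = max(usable)  # the newest usable dump wins
    return name, failure_time - taken


def estimate_restore_hours(dump_size_gb, gb_per_hour):
    """Rough restore time, assuming a constant (assumed) import rate."""
    return dump_size_gb / gb_per_hour


if __name__ == "__main__":
    # Hypothetical nightly dumps taken at 8pm; power cut at 8am the next day.
    backups = {
        "nightly-thu.sql": datetime(2024, 1, 4, 20, 0),
        "nightly-fri.sql": datetime(2024, 1, 5, 20, 0),
    }
    name, lost = pick_latest_backup(backups, datetime(2024, 1, 6, 8, 0))
    print(name, "loses", lost)
    # 3.5 GB/hour is an assumed rate, picked only to illustrate the math.
    print(estimate_restore_hours(7, 3.5), "hours to restore")
```

The point is simply that the gap between the last good dump and the failure is the data you lose, and the dump size divided by whatever import rate your hardware manages is how long you wait.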
Two hours later, the database was back online and all of the sites were back up. Everything seemed to be fine. I just needed to find some new hardware for the backup server and get those cron jobs running again.
The next morning, however, I started to see some issues on my gateway server, which handles web, email, CVS, and DNS. There were disk errors all over the logs and that horrible clicking noise coming from the hard drive. The machine quickly went down and wouldn’t reboot to let me copy the files off, and with the backup server also gone I had no other copy. One trip to Fry’s to pick up a new hard drive and 6 hours of going through the configs again to get all of the services running, and I was back in business. It turns out that mounting a bad disk under a rescue boot will allow you to get a lot off of it. I also took this opportunity to get the backup server online with new hardware.
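The trick to pulling files off a dying disk is to keep going when individual reads fail instead of aborting on the first error. Here is a minimal sketch of that idea in Python. It is not what I actually ran, the function name and paths are hypothetical, and it assumes the bad disk is already mounted (ideally read-only) from the rescue environment.

```python
import os
import shutil


def salvage_tree(src_root, dst_root):
    """Copy everything readable under src_root into dst_root.

    Files or directories that raise an OS-level error are recorded and
    skipped, so one bad sector does not stop the rest of the copy.
    Returns (copied_paths, failed_paths).
    """
    copied, failed = [], []

    def note_walk_error(err):
        # Called by os.walk when a directory itself cannot be listed.
        failed.append(err.filename)

    for dirpath, _dirnames, filenames in os.walk(src_root, onerror=note_walk_error):
        rel = os.path.relpath(dirpath, src_root)
        target_dir = os.path.normpath(os.path.join(dst_root, rel))
        os.makedirs(target_dir, exist_ok=True)
        for name in filenames:
            src = os.path.join(dirpath, name)
            try:
                shutil.copy2(src, os.path.join(target_dir, name))
                copied.append(src)
            except OSError:
                # Unreadable file: note it and move on to the next one.
                failed.append(src)
    return copied, failed


if __name__ == "__main__":
    # Hypothetical mount points for the failing disk and the new drive.
    ok, bad = salvage_tree("/mnt/rescue", "/mnt/newdisk/salvaged")
    print(len(ok), "files copied,", len(bad), "paths skipped")
```

The list of skipped paths tells you afterwards which configs you will have to rebuild by hand.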
Everything was smooth for a couple of days, but then the DB server went down for no apparent reason. It turns out that one of the 3 fans on the power supply had failed, and the server would rather shut down hard than run with an iffy power supply. It took 20 minutes to copy the DB to another server and reconfigure everything to point at the new location, and everything was back up and running, if at a slightly slower speed.
I put in a call to my CDW rep (who also happens to be my brother) and learned that the part I needed to replace is no longer made, and that while HP might be able to get it to me, it would probably take a few months and a special order (read: costly). So I turned to eBay and quickly found a reseller in Ohio who was willing to ship me the part that day for $14. Sold.
Last night I installed the new power supply, moved the database back onto the DB server, and hopefully ended the saga of the early power shutoff. I’m working very hard to get some cash flow coming in so that I can move these servers out of my garage and into a proper colo and not have a repeat of all this excitement. It’s always interesting to see what a bad shutdown can do to a server that’s used to running continuously.