So the other week I was in the datacenter doing a routine check on one of our servers, and I noticed that Chimera (our MySQL server) had a blinking orange light on it and was making this *beep beep beep* noise. Great! Either the RAID array has lost sync, or one of the hard drives has failed.
Well, it doesn’t really matter which right now, because I’m at the datacenter again preparing to fix the problem.
Thankfully, we use RAID. RAID stands for Redundant Array of Inexpensive Disks (some claim the I stands for Independent). In particular, this server runs RAID 1, or mirroring. That means it has two hard drives, and every write goes to both in real time, so both drives contain the exact same data.
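To make the mirroring idea concrete, here is a toy sketch in Python. It is not how a real RAID controller works internally; the `Raid1Mirror` class and its method names are made up for illustration, and each "disk" is just an in-memory list. It only shows the two properties that matter here: writes land on both disks, and a replacement disk can be rebuilt by copying from the survivor.

```python
class Raid1Mirror:
    """Toy model of a two-disk RAID 1 mirror (hypothetical, illustration only)."""

    def __init__(self, num_blocks):
        # Each "disk" is a list of blocks; None stands in for a failed disk.
        self.disks = [[b""] * num_blocks, [b""] * num_blocks]

    def write(self, block, data):
        # Every write goes to all healthy disks, so they stay identical.
        for disk in self.disks:
            if disk is not None:
                disk[block] = data

    def read(self, block):
        # Any healthy disk can serve a read.
        for disk in self.disks:
            if disk is not None:
                return disk[block]
        raise IOError("all disks in the mirror have failed")

    def fail(self, index):
        # Simulate a dead drive.
        self.disks[index] = None

    def rebuild(self, index):
        # Copy every block from the surviving disk onto the replacement.
        survivor = next((d for d in self.disks if d is not None), None)
        if survivor is None:
            raise IOError("no surviving disk to rebuild from")
        self.disks[index] = list(survivor)


# Small demo: lose a disk, keep serving reads, then rebuild the mirror.
array = Raid1Mirror(num_blocks=4)
array.write(0, b"customer data")
array.fail(0)                      # drive 0 dies
still_there = array.read(0)        # served from drive 1
array.rebuild(0)                   # replacement drive is re-synced
```

In this sketch a single failed disk never loses data, but losing both does, which is why the backup still matters.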
This is a good thing, because if the hard drive is actually bad, I just call up Dell, they send me a new hard drive, and I rebuild the array from the good disk. But hopefully the array just lost its sync, and I only need to rebuild it. Either way, I have to take our database server offline to do this, which means thousands of databases for our shared hosting customers are going to be offline for a wee bit.
I feel really bad about this, because it means that they will have an interruption of service, but then again, they are only paying $10/mo to share a server cluster with hundreds of other customers. This kind of stuff comes with the territory. Our major dedicated server customers, on the other hand, have their own database servers, redundant load balancers, etc. Then again, they are paying hundreds of dollars per month for the assurance that they will not have an outage.
This is a little off topic, but as I’m sitting in the datacenter here backing up the server before this process (in case we need to do a bare-bones re-install), a centipede-like bug just ran across the floor from under the rack. I think I need to tell the DC techs that there are bugs in the datacenter.
Anyways, the main point of this little tirade is that any reputable web host will have redundancy built into their system for cases like this. Modern hardware and computer equipment is not infallible. Rather, it is quite fallible and will break or fail at some point. The key is to know what you are going to do when it happens, and to plan ahead so the impact is as small as possible.