This Monday we recovered a server from a potentially disastrous hard drive failure.
We always provision servers with at least two drives configured in RAID, ensuring that all data is written to at least two separate physical disks and protecting against data loss if a drive fails. This server was configured with RAID 1 (also known as ‘mirrored disks’), a system where all data is written to two hard drives instead of one. Both disks contain a complete, identical copy of all the data – hence the term “redundancy”. As a result, data can be read from either disk, and if one disk fails, the other can continue functioning with a complete copy of the data. Once the failed disk has been replaced, the server copies all the existing data to it in the background; once this process is complete, it goes back to its normal operation of writing all data to both disks and reading from either.
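The mirroring and rebuild behaviour described above can be sketched in a few lines of Python. This is a deliberately simplified toy model (the class, block layout, and data are invented for illustration) – a real RAID controller works at the block-device level in firmware, not like this:

```python
# Toy illustration of RAID 1 semantics -- an invented model for
# illustration only, not how a real hardware RAID controller works.

class Raid1Mirror:
    def __init__(self):
        # Each "disk" is just a dict of block number -> data.
        self.disks = [{}, {}]

    def write(self, block, data):
        # Every write goes to BOTH disks, so each holds a full copy.
        for disk in self.disks:
            disk[block] = data

    def read(self, block, preferred=0):
        # Reads can be satisfied from either disk.
        return self.disks[preferred].get(block)

    def replace_disk(self, index):
        # Swap in a blank disk, then "rebuild" the array by copying
        # every block across from the surviving mirror.
        survivor = self.disks[1 - index]
        self.disks[index] = dict(survivor)

array = Raid1Mirror()
array.write(0, b"boot sector")
array.write(1, b"user data")

array.replace_disk(0)               # disk 0 fails and is replaced
print(array.read(0, preferred=0))   # rebuilt copy: b"boot sector"
```

The key property is in `write`: because every write lands on both disks, either disk alone is enough to rebuild a replacement.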
The above is a rough description of how the system is intended to work – it’s a very common method of protecting data on servers, and the process is well understood and documented. However practice doesn’t always match theory, and in the case in question, something slightly different actually happened.
The client informed us at around 8:30am that they couldn’t access one of their servers. We talked the client through rebooting the server and checking for error messages on the console, and found that it was showing “Boot Failure” and failing to load Windows. At this point it was clear that the issue was relatively serious, so we immediately attended onsite, arriving at the client’s site by 9:45am.
Checking the server, we discovered that the RAID controller was reporting that both of the disks making up the main RAID array had been removed and reconnected at some point over the weekend. The office was closed all weekend and no-one had been in, so the obvious culprit was a hardware problem. As the server was now reporting both disks as “online”, and a physical examination showed both disks spinning up normally, we decided to scan the disks with ChkDsk for any errors the hardware problem might have caused, and then boot the server normally.
Alarm bells started going off when ChkDsk appeared to be making far more corrections to the disk than we were expecting – we’d expect a few files to have been damaged when the failure occurred, but we were seeing hundreds of files being removed from the disk’s index as corrupt. At this point we stopped the ChkDsk process, as it looked as though it could be causing more damage than it was repairing. We removed both disks from the server and checked them on another computer.
What we found was that one of the disks had actually failed, but still appeared to be operating correctly. Because the server could not tell that the disk had failed, it was mixing bad data from this disk with good data from the other. While ChkDsk was trying to correct the corruption, it was actually, unknowingly, deleting data from the good disk! This is quite unusual – normally the RAID controller should identify the failed drive and stop using it, but in this case the failure was such that the drive wasn’t detected as failed.
Once we understood what was happening, we replaced the failed disk with a new blank disk from stock, marked it as replaced, and allowed the server to duplicate the data from the ‘good’ disk (known as ‘rebuilding the array’). Once this was complete, it was possible to boot the server normally and repair the damage ChkDsk had caused by deleting files (a non-trivial process in itself – details may follow in a later blog post…).
The total time between the issue first being raised and the server being back online and users working as normal was around 5.5 hours, and all data protected by RAID within 12 hours.
There are a couple of points of interest arising from this incident:
- Although these systems are designed to be very fault-tolerant, and can in theory be operated by non-technical staff, it’s critical that the technicians carrying out the work can recognise when the system isn’t behaving as expected. If we hadn’t stopped the ChkDsk process when we did, it would have diligently worked through the entire disk, deleting all of the files which were “inconsistent” – which in this instance would have been all of them. The recommended process would have ‘bricked’ the server, requiring a server rebuild and a full restore from backup.
- Multiple backups are essential. Relying on RAID alone to protect the data could have been catastrophic – although we didn’t need to restore from backup in the end, it was a close call, and the job was far less stressful knowing that if the recovery we were attempting failed, alternative options were available.