Redundancy
Most RAID systems, but not all, have redundancy. This means that if one storage device in the RAID fails, the failed storage device can be replaced and the RAID will rebuild itself without loss of data. RAID 1 gets its redundancy from keeping a second copy of the data. RAID 5 gets its redundancy from storing parity. There are many different ways parity can be calculated; a common method is to count the number of 1-bits and set the parity based on the count being odd or even. When a storage device fails, the 1-bits are counted on the remaining devices and compared with the parity value. This tells the RAID whether the missing bit on the failed drive was a one or a zero.
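To make the counting method concrete, here is a minimal Python sketch of even parity on single bits. The function names and the three-device example are my own assumptions for illustration; a real RAID controller works on whole blocks in hardware or firmware, not one bit at a time.

```python
def parity_bit(bits):
    """Return the even-parity bit: 1 if the number of 1-bits is odd, otherwise 0."""
    return sum(bits) % 2


def recover_missing_bit(surviving_bits, parity):
    """Work out the bit that was stored on the failed device.

    Count the 1-bits on the surviving devices and compare with the parity value.
    If they agree, the missing bit was 0; if they disagree, it was 1.
    """
    return (sum(surviving_bits) + parity) % 2


# One bit from each of three hypothetical data devices.
data_bits = [1, 0, 1]
p = parity_bit(data_bits)

# Pretend the middle device has failed, so only two data bits remain.
surviving = [data_bits[0], data_bits[2]]
recovered = recover_missing_bit(surviving, p)

print("parity:", p)
print("recovered bit matches original:", recovered == data_bits[1])
```

The same comparison works no matter which single device is lost, which is why parity can protect against one failure but not two.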
Not all RAID levels have redundancy; for example, RAID 0 has none. If you lose a storage device in a RAID 0, you lose all the data in the RAID.
If you have a storage device fail in a RAID, you should replace it quickly. Before you do so, make sure you check that cabling is not the problem. If a cable comes loose from a storage device, the RAID may think the drive has failed. This can cause the RAID to go out of sync. When this occurs, your RAID may need to rebuild itself in order to get back in sync.
When replacing a storage device, make sure you don’t unplug the wrong one. If you do that, you risk the RAID controller thinking the storage device has failed and you may lose all the data in the RAID. Let’s have a look at how to replace a storage device in a RAID system.
RAID Failure Demonstration
For this demonstration, I have set up a computer with two hard disks. These two hard disks have been set up to use RAID 1. RAID 1 is also called a mirror, because each drive contains identical copies of the data.
I have configured the RAID in the computer’s setup. Thus, this RAID is using motherboard RAID, not a dedicated RAID card. Regardless of which solution you use, the principles of how to set it up and use it remain the same. Dedicated RAID cards generally perform better, have more features, and may also have a better interface.
To see what RAIDs I have set up, I will select the option, “Select array”. Even though there are two hard disks in the RAID, the RAID will only show one. This is because a RAID is essentially one group consisting of a number of different storage devices.
The level of RAID is shown as RAID 1. There are two hard disks having identical data; thus, I have redundancy for a single drive failure, but the cost is that my available storage is reduced by half.
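The storage cost of redundancy can be worked out with simple arithmetic. The sketch below is a hypothetical helper covering only the three RAID levels mentioned in this video, and it assumes every disk in the array is the same size. The 500 GB figure is an example value, not the size of the disks in the demonstration.

```python
def usable_capacity_gb(level, disk_count, disk_size_gb):
    """Return usable capacity in GB for RAID 0, 1, or 5 built from equal-size disks."""
    if level == 0:
        if disk_count < 2:
            raise ValueError("RAID 0 needs at least two disks")
        return disk_count * disk_size_gb      # striping only, no redundancy
    if level == 1:
        if disk_count < 2:
            raise ValueError("RAID 1 needs at least two disks")
        return disk_size_gb                   # every disk holds the same copy
    if level == 5:
        if disk_count < 3:
            raise ValueError("RAID 5 needs at least three disks")
        return (disk_count - 1) * disk_size_gb  # one disk's worth is used for parity
    raise ValueError("this sketch only covers RAID 0, 1 and 5")


# A two-disk mirror: 1000 GB of raw storage, but only 500 GB is usable.
print(usable_capacity_gb(1, 2, 500))   # 500
# Three disks in RAID 5: 1500 GB raw, 1000 GB usable.
print(usable_capacity_gb(5, 3, 500))   # 1000
```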
To look at what storage devices are in the RAID, I will look at the option, “View Associated Physical Disks”. This interface is not very intuitive, meaning it is not easy to understand how it operates if you have not used it before.
Both hard disks are working and functional, although the interface shows them as disabled. This discrepancy occurs because the interface is designed to display the properties of individual storage devices, and this does not reflect their current operational status. To view the properties of each disk, you must manually select and enable them within the interface. If this is confusing, think of it as selecting the device rather than enabling it. Yes, the interface is confusing. Therefore, I will proceed by enabling the first hard disk, which is essentially selecting it, then view its specific properties through the option, “View Physical Disk Properties”.
This will show the hard disk properties, including if it is currently in use. To see the other hard disk properties, I will disable the first hard disk and enable the second one. With the second hard disk enabled, when I select the option “View Physical Disk Properties”, the properties for the second one will appear. As I said, the interface is not easy to understand. Enabling the device is effectively selecting it.
Even though RAID is included on the motherboard, it is effectively an add-on to the computer’s setup. Thus, the RAID itself may be a dedicated chip on the motherboard, a part of the chipset functionality, or implemented in software. Regardless of how it is implemented, the main takeaway is that the computer’s setup is communicating with an add-on to get results. Thus, sometimes the interface between the two may work in unexpected ways or not be very intuitive. I generally find that when you use an interface that was not included as part of the computer’s setup, it is a little easier to work with, but when you get something for free included with your computer, you get what you pay for.
I have installed Windows 11 on this RAID. The RAID itself will appear in Windows as a single storage device. Effectively, Windows does not know the difference between a single physical storage device and a RAID.
To demonstrate what happens when a drive fails, I will now use a specialized tool to fail one of the hard disks. Do not try this at home. You will notice that even though one of the hard disks has failed, Windows continues to operate as normal. In some cases, you may have some performance degradation as the RAID attempts to access the failed hard disk. A RAID, once it has worked out the hard disk has failed, will mark it as bad and stop using it.
You will notice in this case, in the Event Viewer of the computer, not a single event has been created letting you know there is a problem. Since RAID works at the hardware level, unless the RAID manufacturer added some additional notification, you won’t know there is a problem. Usually, to do this, additional software is provided by the manufacturer.
I will now shut down the computer and start it back up. You will notice when the computer starts up, I did not get a message indicating there was a problem with the RAID. In most cases, you will get some sort of warning message or something telling you there was a problem; however, on this computer I am not getting anything. Whenever I set up a RAID, before I start putting data on it, I will physically pull the cables out of a hard disk to simulate a failure. That way I know if/how I am going to be alerted when there is a problem. Once I have worked this out, I plug the hard disk back in and recreate the RAID from scratch. If your RAID contains live data, I would not recommend doing this. Instead, refer to the manual from the manufacturer to find out how you will be alerted when there is a problem.
I will now enter the computer’s setup to have a look at what is happening with the RAID. To do this, I will select “Settings” then select the “Advanced” option. In the advanced menu, I next need to go down and select the last option for RAID.
I will next select, “Array Management” and finally, “Manage Array Properties”. You will notice that the array status is reported as “Critical”. One hard disk has failed, so if another hard disk were to fail, we would lose all our data.
To get an idea of what has occurred, I will select the option, “View Associated Physical Disks”. The second disk is now listed as a removed physical disk. Essentially, the hard disk was so badly damaged the computer is not even detecting it.
To fix this problem, I would replace the hard disk. The data from the existing hard disk will be copied to the new one. I could demonstrate this on this computer; however, as we have seen, we are not getting a lot of information. I will change to a different computer, so we can get a better understanding of what is happening. On this computer, I have plugged in four hard disks. With three of them, I have created a RAID 5 and the last one I have left as a spare.
In this demonstration, I have installed Windows onto the RAID 5 array. On my Windows computer, I have installed the management software. If your RAID comes with management software, this will often have more features and be easier to use than trying to configure it through the initial startup setup utility.
You can see the RAID 5 configuration in the software. This software has other features, such as viewing performance and configuring other settings. You can also see the unused hard disk, which I can also mark as a spare. Doing this will erase the data on the drive. Having marked it as a spare means that if a hard disk were to fail in the RAID, the RAID would be automatically rebuilt using the spare.
I will now simulate a hard disk failure. However, this time I will do it by simply unplugging both the power and the data cables from the hard disk, even though only one needs to be unplugged for the hard disk to be unusable.
You will notice that the hard disk failure has been detected and the RAID is rebuilding itself using the spare hard disk. This process is called syncing. This demonstrates how important it is to make sure that your storage devices are plugged in firmly and won’t accidentally fall out. As soon as a storage device fails, the RAID becomes out of sync and must rebuild itself. Even if I plugged the hard disk back in again, the rebuild must still be performed. This can be quite time consuming on large storage devices. While the RAID is being rebuilt, in most cases, another hard disk failure will cause you to lose the RAID and all the data on it.
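To get a feel for how long a rebuild can take, here is a rough back-of-the-envelope estimate in Python. The function name, the 4 TB disk size, and the 100 MB/s sustained rebuild rate are all assumptions for illustration; real rebuild speeds vary with the controller, the drives, and how busy the RAID is while it rebuilds.

```python
def estimated_rebuild_hours(disk_size_tb, rebuild_rate_mb_per_s):
    """Estimate rebuild time, assuming the whole disk is rewritten at a steady rate."""
    disk_size_mb = disk_size_tb * 1_000_000   # decimal units: 1 TB = 1,000,000 MB
    seconds = disk_size_mb / rebuild_rate_mb_per_s
    return seconds / 3600


# Hypothetical example: a 4 TB replacement disk rebuilt at 100 MB/s.
hours = estimated_rebuild_hours(4, 100)
print(f"about {hours:.1f} hours")   # about 11.1 hours
```

For that whole period the RAID has no redundancy, which is why a loose cable that triggers a rebuild is worth avoiding.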
RAID Recovery Software
For the A+ exam, if you get a RAID question it will most likely be about understanding the basics of different RAID levels or about replacing a failed storage device in a RAID with a spare. For other kinds of failure, for example, a hardware failure, you probably won’t get a question on that.
If you have a hardware failure, you can try to use the storage devices in another device with the same hardware. If this is not available, there are a number of RAID recovery tools available. For example, there is this freely available RAID tool.
Using tools like these, you can reconstruct the RAID. In this example, I have connected two hard disks from the RAID 5 I used in the previous example. These are the hard disks at the top and the next one down. This demonstration had three hard disks in the set with one spare. Thus, since I am only using the first two, I am missing one. Therefore, the data will need to be rebuilt using the parity information.
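The idea of rebuilding a missing RAID 5 member from the other two can be sketched in Python using XOR, which is the block-level version of odd/even parity. This is a toy model under my own assumptions: the function names, the four-byte blocks, and the order in which parity rotates across the disks are all hypothetical. Real controllers use their own stripe sizes and layouts, and this is not how any particular recovery tool is implemented.

```python
def xor_blocks(a, b):
    """XOR two equal-length blocks byte by byte."""
    return bytes(x ^ y for x, y in zip(a, b))


def build_raid5(data_blocks):
    """Spread pairs of data blocks over three disks, rotating the parity block.

    Expects an even number of equal-size blocks (two data blocks per stripe).
    """
    disks = [[], [], []]
    for stripe, i in enumerate(range(0, len(data_blocks), 2)):
        pair = [data_blocks[i], data_blocks[i + 1]]
        parity = xor_blocks(pair[0], pair[1])
        parity_disk = stripe % 3              # assumed rotation order
        for disk in range(3):
            disks[disk].append(parity if disk == parity_disk else pair.pop(0))
    return disks


def rebuild_missing(disk_a, disk_b):
    """Recreate the third disk: each missing block is the XOR of the other two."""
    return [xor_blocks(a, b) for a, b in zip(disk_a, disk_b)]


data = [b"AAAA", b"BBBB", b"CCCC", b"DDDD", b"EEEE", b"FFFF"]
disks = build_raid5(data)

# Only the first two disks are available; rebuild the third from them.
rebuilt = rebuild_missing(disks[0], disks[1])
print(rebuilt == disks[2])   # True
```

Notice that the rebuild does not need to know whether a missing block held data or parity; XOR of the two surviving blocks gives the right answer either way.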
I will now select the two hard disks and press the button, “Start RAID 5”. The process takes a while, so I will pause the video and return when it is complete. In this case, the RAID was detected as RAID 5 containing three hard disks. Pressing the “o.k.” button will start the recovery process; I won’t worry about doing it. The point to understand is that if you have trouble with your RAID, there are options available for you to recover it.
End Screen
That concludes this video on RAID failures. I hope this video has been informative. Until the next video, I would like to thank you for watching.
References
“The Official CompTIA A+ Core Study Guide (Exam 220-1101)” pages 105 to 107
“Mike Meyers All in One A+ Certification Exam Guide 220-1101 & 220-1102” pages 364 to 365
“License CC BY 4.0” https://creativecommons.org/licenses/by/4.0/
Credits
Trainer: Austin Mason https://ITFreeTraining.com
Voice Talent: HP Lewis http://hplewis.com
Quality Assurance: Brett Batson https://www.pbb-proofreading.uk