Hard Disk/Solid-State Drive Failures
In the world of computing, you are most likely going to be using storage in the form of hard disks and Solid-State Drives. These devices fail as time goes on. This video will look at how to recognize some of the signs a storage device is failing and what to do when one does fail.
Working Hard Disk
To start this video, I will look at hard disks, and later in the video I will look at Solid-State Drives. Before you can start to understand what to look for in a failed hard disk, I will first look at what a working hard disk sounds like.
This hard disk is brand new and has never been used before. This will allow us to determine what a normal functioning hard disk should sound like.
Since hard disks don’t make a lot of noise, it is difficult to record the sounds they make. To get a better audio recording, I will move the hard disk into a specialized recording box. Once in the recording box I will connect the hard disk to power.
Have a listen to the sound of the hard disk when I connect it to power.
When you first switch on a hard disk, you should hear a whirring noise, which is the motor spinning up. I’ll play the sound again for you.
If you put your hand on the hard disk when you first switch it on, you may be able to feel some vibrations from the motor. After the hard disk has spun up, you should hear the sound of the hard disk head moving. It is a difficult sound to describe. I would describe it as a light buzzing noise. I’ll play the sound again so you can hear.
This sound is normal and should be pretty quiet and random in nature. Essentially, it is the sound of the hard disk head moving. When the hard disk is first switched on, it will do some self-checks which involve moving the hard disk head around, and this is what you are hearing.
To get a better understanding of what is occurring, I will look at what happens when reading and writing data to the hard disk.
Reading/Writing Data
For this demonstration, I will remove the cover from the hard disk so we can see what is happening inside. Hard disks are easily damaged. A fingerprint, a hair or even a speck of dust can stop the hard drive reading the platter. Thus, if you remove the cover from a hard disk, you are exposing the platters to these things. If you remove the cover from a hard disk, don’t expect the hard disk to ever work again.
I will switch on the hard disk and, as before, you will hear a light buzzing noise as the hard disk starts up and the hard disk head moves over the platter. I will now read some random data from the hard disk. Have a listen to what it sounds like.
It may be difficult to hear. If you ever go into a server room with a lot of hard disks, you will get used to hearing this light buzzing noise. It is the sound of the hard disk doing what it is designed to do. Now that we know what a hard disk sounds like when it is working as expected, let’s have a look at what happens when a hard disk is not working the way it is supposed to.
Failed Hard Disk
This next hard disk is damaged. Let’s do some troubleshooting and try and work out what has happened to it. When I switch it on, have a listen to the sounds it makes.
This is an interesting one, because your first thought is probably that the clicking noise is a problem with the drive head. Have another listen and see what is missing rather than what you hear.
You will notice that the sound of the motor of the hard disk is missing. So, I suspect the motor in this hard disk has failed. Let’s open it up and have a look.
Keep in mind that when I open a hard disk like this to have a look at what is going on, I am doing this for educational purposes. I don’t care about any of the data on it. If you are trying to recover a hard disk, don’t open it; take it to a data recovery specialist to get the data back. They are expensive, but are the best shot at getting your data back. Opening the hard disk and having a look around reduces the chances of getting anything off the hard disk, even if you pay for professional data recovery. Later in the video, I will show you how easy it is to damage a hard disk.
You will notice that when I switch the hard disk on, the platters remain still. The ticking noise was the sound of the motor attempting to spin; however, it has not been able to spin the platters. To confirm this is the case, I will get my screwdriver and give the platter a little spin to ensure it is not stuck; once again, this is all for educational purposes.
You will notice the platter spins, so the drive is not stuck. If I were to take this hard disk to a data recovery specialist, they would most likely get an identical hard disk as a donor, pull the platters and circuit board out and put them in the donor drive. Then, once done, they would copy all the data off the hard disk and give you the data. It sounds simple, but it is not. One speck of dust and the platters may not work. The platters also need to be correctly aligned in the donor drive, which is a hard enough task by itself. A professional recovery specialist will train on hundreds if not a thousand hard disks before they will be allowed to work on a customer’s hard disk. It is not an easy task.
You may think a failed hard disk like this won’t work at all, but this is not correct. Notice that in disk management the hard disk has appeared as an unknown disk. Notice, that when I attempt to initialize the disk Windows will try to do it. It will even ask questions such as which partition table you want to use.
Once I press O.K. Windows will make an attempt to initialize the hard disk but will display an error, “The request failed due to a fatal device hardware error”. If you get an error like this and you can hear the hard disk is not spinning up, you know the problem is most likely a failed motor or stuck spindle. If you can’t hear the hard disk because you can’t get your ear close enough or there is too much background noise, you can always put your hand on the external case of the hard disk when you switch it on and you should be able to feel the vibrations of it starting up. Some hard disks vibrate more than others on start up, so sometimes it is hard to tell.
Click/Click
One of the worst sounds you will hear in IT is the click, click noise of a failed hard disk. It is also known as the click of death, because your hard disk will often need to be replaced if this occurs. This hard disk has the click of death. When I switch it on, you will notice that it will start up normally and then there will be a repetitive clicking noise.
There are a number of different causes for this clicking noise. The first is when it occurs as soon as you switch the hard disk on. If this is the case, it’s bad. To understand why, consider what the hard disk is trying to achieve when it gets in a clicking loop like this. Essentially, the hard disk is moving the head of the hard disk in order to read something on the platter. When first switched on, there is some configuration data that needs to be read in order to use the hard disk. The hard disk is attempting to read this information but is not able to.
When a hard disk is not able to read or write data, it will reset the head and try again. This is why you see the head of the hard disk move backwards and forwards. Since the hard disk is having trouble reading or writing, it is possible the head is attempting to access the wrong part of the platter, so it resets the head and tries again, just in case the head is slightly in the wrong position. In some cases, it may be the platter is damaged, in other cases the head can’t move to the required location due to something blocking the head.
In this case, since the hard disk has just been switched on, there is configuration information that needs to be read. If this configuration information cannot be read, the hard disk won’t be able to start up. This is why it is really bad if you hear this noise when you first switch the hard disk on.
This will also occur if there is a bad sector on the hard disk. If the hard disk can’t access the sector on the platter, it will keep resetting and trying again. Eventually it will try to move the sector to some reserve sectors on the hard disk. All hard disks have some extra sectors that are designed to be used if you have a bad sector. If you hear your hard disk making this clicking noise, I would recommend copying your data off that hard disk and throwing the hard disk away. There are also other noises that indicate your hard disk may need replacing.
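To make the retry-and-reallocate idea a little more concrete, here is a minimal sketch in Python. It is a simplified model and not real drive firmware: the function name, the retry limit and the spare sector numbers are all assumptions made up for illustration.

```python
MAX_RETRIES = 3  # assumed retry limit; real firmware chooses its own


def read_with_retry(sector, bad_sectors, remap, spares):
    """Return the physical sector that ends up serving a read of `sector`."""
    physical = remap.get(sector, sector)
    for _ in range(MAX_RETRIES):
        if physical not in bad_sectors:
            return physical  # read succeeded
        # Read failed: the head would reset here and try again (the click).
    if not spares:
        raise IOError(f"sector {sector} unreadable and no spare sectors left")
    # All retries failed: point the logical sector at a spare sector.
    remap[sector] = spares.pop(0)
    return remap[sector]


remap = {}
spares = [1000, 1001]   # hypothetical reserve sectors
bad = {42}              # hypothetical bad sector

print(read_with_retry(7, bad, remap, spares))    # healthy sector
print(read_with_retry(42, bad, remap, spares))   # bad sector gets remapped
print(remap)
```

Once sector 42 has been pointed at a spare, later reads go straight to the spare and the retries stop. In this model the clicking would only continue for as long as the drive keeps hitting sectors it cannot read.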
Other Hard Disk Replacement Sounds
Once you understand what a hard disk should sound like, if you hear any other noises the hard disk most likely needs to be replaced. Besides the click of death, the other sound you may hear is a high-pitched squealing noise. This generally indicates some kind of wear in the hard disk or mechanical failure.
Another sound that you may hear is a grinding noise. If you hear something like this, it generally means something in the hard disk is either worn out or wearing quicker than it should be. Hard disks are mechanical in nature and do wear out. Sometimes the hard disk may make some noises letting you know that it is failing. Next, I will look at Solid-State Drives.
Solid-State Drives
Solid-State Drives are different to hard disk drives in that they make no noise. This makes it difficult to know that the drive is working and if there are problems with it. The simplest way to test a Solid-State Drive is to simply plug it in and see if it is detected.
If it is not detected, shut down the computer, check the cables and try again. It is always best to shut the computer down, as sometimes hot-plugging a storage device will not work. In some cases, you may want to use a USB docking station to test storage devices. If you do this, make sure that you use a good quality one, as low-quality docking stations may not work with all storage devices. Bear in mind that some tools won’t work, or won’t work as well, if a USB connection is used rather than a direct connection.
For hard disks and Solid-State Drives, once they die you can’t service them. A data recovery specialist may be able to recover the data, but once that is done, they will most likely recommend that you throw the drive away.
If you do decide to open a Solid-State Drive, there is something you can look for. Keep in mind that once you open a Solid-State Drive, just like a hard disk it will most likely void any warranty.
This example Solid-State Drive has a burnt-out component, but it is very difficult to see. Sometimes, when a component burns it will mark the case which is easier to see than the component itself. In this example, it is easy to see the burn spot as it is black on a gray case.
In this case, the burn mark is blending in with the circuit board making it hard to see. Having a look at the case gives me an idea of where to look. You can see the burn is only a small component on the circuit board and very difficult to see.
We now know why this Solid-State Drive does not work. Unfortunately, this component is too small to replace, so the only option is to buy a new Solid-State Drive.
If a Solid-State Drive makes no noise, how do we know that it is about to fail? Let’s now look at some ways that can help us determine if our storage devices are about to fail before they either burn out or start making strange noises.
S.M.A.R.T
Since storage devices fail, it would be useful to be able to predict when they are going to fail. That way, you can replace the storage device before it fails. To attempt to do this, the S.M.A.R.T. standard was created in 1994. S.M.A.R.T. stands for Self-Monitoring, Analysis and Reporting Technology. It was designed to report statistics and also to detect and report imminent hardware failures.
Although S.M.A.R.T. is a good technology and useful in a lot of situations, it was developed without any specific rules on how it should be implemented. Thus, vendors are free to implement it how they wish. There are some industry standards the manufacturers tend to follow, but the vendor is free to implement S.M.A.R.T. however they wish.
Because of this, vendors can partially implement some industry standards and still advertise their storage devices as S.M.A.R.T. compatible. Thus, just because you used S.M.A.R.T. on one storage device, don’t assume that it works on another storage device the same way.
To get a better idea of how S.M.A.R.T. can be used to detect problems with storage devices, I will next have a look at some S.M.A.R.T. data from a number of different storage devices.
CrystalDiskInfo (http://CrystalMark.info)
There is a lot of software that is able to read S.M.A.R.T. data. In this case, we are going to use some freely available software called Crystal Disk Info. I have already installed Crystal Disk Info. In this example, there are four storage devices connected to this computer: two hard disks, one M.2 Solid-State Drive and one SATA Solid-State Drive. This gives us a good range of storage devices, so you can get an idea of what S.M.A.R.T. data you may come across. Nowadays, all Solid-State Drives should have S.M.A.R.T. With hard disks, traditionally the high-end models have S.M.A.R.T., but low-end models may not.
The first hard disk I will look at is a Seagate two terabyte hard disk. I could have selected newer hardware but have chosen devices based on the S.M.A.R.T. data they provide.
Most S.M.A.R.T. software will give you a quick indication of the health of the storage device. In this case, the software is reporting the storage device’s health status as good. S.M.A.R.T. software can often be customized. For example, it may be configured to report the storage device as not healthy if the device is over a certain temperature or if the motor in the device is not spinning as fast as it should be. This way, the storage device can hopefully be replaced before it fails. S.M.A.R.T. does help with this, but it is not guaranteed that it will detect a storage device failing before it does so.
Tools like this one will often attempt to read data from the drive and display helpful information. The data it retrieves can be inconsistent between different storage devices. You can see at the top, the rotation speed of the hard disk.
I would always question data like this. In this case, is it the speed the hard disk is rated at, or is it the speed the hard disk is currently spinning at? In this case, it is the rated speed, so will always be reported as this even if the motor on the hard disk is spinning slower than it should be.
Under this is the power on count. This is essentially the number of times the storage device has been switched on or powered up. This should be accurate. Under this is the power on hours, which is the total hours the storage device has been running since it was manufactured.
To get a better understanding of how long this has been, hover the cursor over the value and it will give you the figure in years and days.
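The conversion the tool is doing here is simple arithmetic. As a rough sketch in Python, assuming a year is counted as 365 days (the tool may round differently) and using a made-up figure for the hours:

```python
def hours_to_years_days(power_on_hours):
    """Convert power-on hours to whole years and days (365-day years)."""
    total_days = power_on_hours // 24
    return total_days // 365, total_days % 365


years, days = hours_to_years_days(20000)  # 20000 is a made-up example value
print(f"{years} years, {days} days")
```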
The data shown here is obtained and processed from drive information and S.M.A.R.T. data. The point to consider is, the software is doing its best to process this data and display it correctly. However, this is based on the manufacturer of the storage device providing data correctly in the first place and in a way the software can understand.
Below this you can see the raw S.M.A.R.T. data. The vendor is free to add their own data in here or use industry standards. The vendor can also report data but not actually update it. So, I would not always be quick to believe everything it reports.
Notice the value “Reallocated Sectors Count”. This keeps a record of how many sectors have been reallocated. Old hard disks used to mark a sector as bad, and then you could no longer use it. In modern hard disks, when a bad sector is detected, the hard disk will during idle time attempt to move the bad sector to a group of replacement sectors that the hard disk keeps as spares. The operating system won’t be aware that this is occurring, it will access the sector as normal not knowing that it has been moved.
If you see this value going up, it means you have bad sectors on the hard disk and you should consider replacing it. I will now press the right arrow to move onto the next hard disk.
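Since it is the trend that matters rather than a single reading, one approach is to note the value down from time to time and compare. Here is a small Python sketch of that idea. The readings are made up, and the rule of comparing the latest reading with the earliest is just one assumption about how you might do it.

```python
def is_increasing(readings):
    """Return True if the latest reading is higher than the earliest one."""
    return len(readings) >= 2 and readings[-1] > readings[0]


# Hypothetical weekly readings of "Reallocated Sectors Count".
stable = [8, 8, 8, 8]
growing = [8, 8, 12, 20]

print(is_increasing(stable))
print(is_increasing(growing))
```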
This hard disk is an enterprise storage VelociRaptor. This was used in an enterprise storage solution, so it has seen some high-volume use. To get an idea of how much the hard disk has been used, you can look at the S.M.A.R.T. data. In this case, notice the value for “Total Host Writes”. The value is in hexadecimal. When I convert it to decimal, it is about 3.8 billion writes. So, you can see it has had a lot of use. Although generally not possible, if you can get the S.M.A.R.T. data before you purchase a second-hand hard disk, it will give you an idea of how much life may be left.
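If you want to do this kind of conversion yourself, Python can turn a hexadecimal string into an ordinary number. The raw value below is a hypothetical one chosen to land near 3.8 billion; it is not the actual value from the drive in the video.

```python
raw_value = "0000E0000000"  # hypothetical raw value, not the one from the video

writes = int(raw_value, 16)  # interpret the string as base 16
print(f"{writes:,}")
```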
I will move onto the next storage device which is an M.2 Solid-State Drive. Unlike hard disks, Solid-State Drives wear out the more they are written to. Thus, it is important to measure this, unlike hard disks where expected lifespan is measured by factors such as usage, hours and operating environment.
For this reason, you can see at the top the number of host reads and writes. This gives you an indication of how much the Solid-State Drive has been used. However, the one that will give you the best indication of wear is “Total NAND Writes”. A Solid-State Drive cannot overwrite data in place; it must first erase a whole block, which is the smallest unit it can erase. If the drive is near capacity all the time, it will have to move data around in order to write new data. This is kind of like trying to put items in a storage room. If the storage room is nearly full, you will need to move items around to get new items in. If the storage room has a lot of space, you won’t need to move anything around to put new items in and thus the process is a lot faster.
Therefore, looking at the amount of data written, or a similar statistic, is not always the best indicator. A better indicator is to look at a measure of how many NAND writes have been performed. Since the problem is that the blocks on the Solid-State Drive wear out after a certain number of writes, looking at the NAND writes will give you an indication of how many times the blocks on the Solid-State Drive have been updated. This gives you an idea of the life expectancy of the drive. This measure is more important for a Solid-State Drive than how many hours it has been powered on.
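The gap between what the computer asked the drive to write and what the drive actually wrote to its NAND is commonly called write amplification. A quick Python sketch shows the calculation. The figures are invented for illustration and do not come from the drives in this video.

```python
def write_amplification(host_gb_written, nand_gb_written):
    """Ratio of data written to flash versus data sent by the host."""
    if host_gb_written <= 0:
        raise ValueError("host writes must be greater than zero")
    return nand_gb_written / host_gb_written


# Hypothetical figures for two drives that received the same host writes.
print(write_amplification(1000, 1200))  # drive with plenty of free space
print(write_amplification(1000, 3500))  # drive kept close to full
```

Both drives would show the same host writes, but in this made-up example the second one has worn its blocks almost three times as much, which is why the NAND figure is the one to watch.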
One of the other good measures is to look at the number of bad blocks reported. Even a brand-new Solid-State Drive may have some bad blocks. Faults in the manufacturing process can result in bad blocks, so don’t be concerned if you have some. Usually, the manufacturer will hide these initial bad blocks during manufacture, so you won’t be aware of them. However, regardless of what this figure is, if this figure starts going up, this is a concern.
Just like a hard disk, a Solid-State Drive will have a number of spare blocks that can be used to replace a bad block. The number of these spare blocks is under “Available Reserve Space”. It is just the nature of Solid-State Drives that blocks over time wear out and will need to be replaced; however, when this count gets to zero this is a real concern. What this means is, bad blocks can no longer be replaced with a spare block. If this count starts getting low, I would consider replacing the Solid-State Drive.
You can also look at other measures such as “Media Wearout Indicator”. On most Solid-State Drives this will be a percentage, with 100% being the best. Lower values mean the Solid-State Drive is closer to needing to be replaced.
This Solid-State Drive demonstrates how inconsistent S.M.A.R.T. can be. This figure is not a percentage, and it is unclear looking at it what it actually represents. Thus, when you start working with S.M.A.R.T. data, don’t trust what you are reading until you can verify what is being reported and what it means.
The last storage device that I will look at is a SATA Solid-State Drive. For this Solid-State Drive, notice at the top is the total number of reads and writes. However, the important figure for trying to determine the age of the Solid-State Drive is the NAND writes. In the case of this drive it is zero. You will notice that, for this particular Solid-State Drive, a lot of the reported S.M.A.R.T. data is zero. So, you can see there is a big difference in what is reported by different storage devices when it comes to S.M.A.R.T. Some devices report more data than others, and there can be inconsistencies between devices in how data is reported.
We can see for this Solid-State Drive the amount of data read and written is low. We can also see the power on time is 33 days. So, this drive has not seen much use. You can also see that the total bad block value is zero.
Given that this value and so many others are zero, I would be concerned these values are not being updated. You can see that “Available Reserve Space” appears to be set correctly. So, it is possible that this Solid-State Drive does not have any bad blocks. Personally, I would be looking at the available reserve space to get an idea if there are any bad blocks. If this figure starts going down, we know some bad blocks have been found and replaced. If the “Available Reserve Space” goes down and “Total Bad Block” stays at zero, we know that bad blocks are not being reported correctly.
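The cross-check I just described can be written down as a few lines of Python. This is only a sketch: the dictionary keys and the sample values are assumptions, and a real tool would have to read these attributes from the drive.

```python
def check_bad_block_reporting(before, after):
    """Compare two snapshots of (hypothetical) S.M.A.R.T. attribute values."""
    reserve_dropped = after["available_reserve"] < before["available_reserve"]
    bad_blocks_rose = after["total_bad_blocks"] > before["total_bad_blocks"]
    if reserve_dropped and not bad_blocks_rose:
        return "reserve used but no bad blocks reported: reporting looks unreliable"
    if reserve_dropped:
        return "bad blocks were found and replaced"
    return "no bad blocks replaced"


last_month = {"available_reserve": 100, "total_bad_blocks": 0}
this_month = {"available_reserve": 97, "total_bad_blocks": 0}
print(check_bad_block_reporting(last_month, this_month))
```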
You will also notice that the “Media Wearout Indicator” is zero. We know since this is a relatively unused Solid-State Drive and has life left in it, this figure is not correct. There is a figure that can help us determine how much this Solid-State Drive has been used.
Notice that there is a “NAND GB Written” and a “NAND GB Written (SLC)”. NAND GB Written is zero and thus probably what the software is reading to report the NAND writes being zero. However, “NAND GB Written (SLC)” appears to be updating. Thus, if you look at “NAND GB Written (SLC)” this will give us a figure for how much data has been written to NAND and thus an idea of how much life is left in the drive.
You can see that S.M.A.R.T. is a really good thing to have, however the data reported depends on the vendor and is subject to inconsistency. So, if you are going to use it, I would verify the data where you can to make sure that it is accurate. You don’t want to, in this case, replace a perfectly good Solid-State Drive because the “Media Wearout Indicator” is zero. Also, you would not want to be in a position where you would replace the Solid-State Drive after a certain amount of NAND writes, but since the value isn’t being updated it never gets replaced.
Since S.M.A.R.T. data can be inconsistent, let’s look at another way to get information about the storage device.
Western Digital (WD) Drive Tool
Another option is to use the software provided by the manufacturer. For this demonstration, I will use the WD Dashboard software provided by Western Digital. Sometimes you will find the manufacturer will provide a few different software applications, often with some older software for legacy drives. However, if the storage device is really old, don’t expect there to be software to support it.
This computer has a number of storage devices installed; however, only one of them is from Western Digital. Thus, only the Western Digital drive will be displayed. Some software may display all the storage devices installed, but keep in mind these tools are designed to operate with the manufacturer’s own products.
You will notice here, this Solid-State Drive is running an older version of firmware. You will find that the firmware in storage devices does not get updated very often. Some very old Solid-State Drives did not support the Trim function. This option was added later with new firmware. I would recommend updating the firmware on the storage device, since updates also include error fixes. The last thing you want is your storage devices having errors in the firmware. Keep in mind, if you update the firmware, you do risk losing the data on the storage device, so I would make a backup first.
Notice that the capacity of the Solid-State Drive is listed. The free and used space is as expected, but there are also figures for unallocated and other. The documentation for the WD Dashboard does not say what this is, but it is reasonable to assume that it is either reserve blocks or blocks that did not pass testing when manufactured.
Below this is the temperature of the Solid-State Drive. If the Solid-State Drive is getting too hot, it will reduce its life expectancy. This is particularly important in large data centers where you have a lot of storage devices.
The more Solid-State Drives are written to, the quicker the blocks start wearing out. The life remaining will give you an idea of how much longer the Solid-State Drive may continue to operate. In this case, the drive has 87% life left. Although the Solid-State Drive can fail before this figure gets to zero, it gives you an idea of when you should consider replacing it. Some Solid-State Drives, when they reach the end of their life remaining, will stop working, while others will continue to operate. Keep this in mind, a failure may not stop the Solid-State Drive, it may be the firmware deciding to gracefully stop the drive from operating.
Next, I will select the performance tab. This will give you some information about the performance of the Solid-State Drive. There may also be a setting to optimize the drive to get better performance. You will notice there is an option for Trim.
Trim is a feature that allows the operating system to flag blocks on the Solid-State Drive that are no longer required. These usually occur because files have been removed. If this process is not performed, when the Solid-State Drive performs maintenance, it may move blocks around more often than they need to be. This increases the number of writes to the drive and thus decreases its lifespan.
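To see why Trim reduces the number of writes, here is a toy model in Python. It is a big simplification: a real drive tracks block addresses rather than file names, and the block layout below is made up. The idea is that before a block can be erased, any pages the drive still believes are in use must be copied somewhere else.

```python
def pages_copied_during_gc(block, trimmed):
    """Count pages the drive must copy out of a block before erasing it.

    `block` maps page number -> file name; `trimmed` is the set of file
    names the operating system has flagged as no longer required.
    """
    return sum(1 for owner in block.values() if owner not in trimmed)


# Hypothetical block holding pages from three files; b.txt was deleted.
block = {0: "a.txt", 1: "b.txt", 2: "b.txt", 3: "c.txt"}

print(pages_copied_during_gc(block, trimmed=set()))       # no Trim
print(pages_copied_during_gc(block, trimmed={"b.txt"}))   # with Trim
```

Without Trim the drive copies all four pages, including the two that belong to a file that no longer exists. With Trim it only copies the two that are still needed.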
Trim will increase the life of your Solid-State Drive; however, if it is run too often, it will have the opposite effect and reduce the life of the Solid-State Drive.
You will notice Trim is enabled and the tool is reporting that Trim has never been run. Windows can trigger Trim, or it can be controlled using this tool. Since Windows does this itself, I would not enable Trim in this tool. It does not really matter if Windows or this tool is managing when Trim is run, as long as it is run by one of them. However, don’t configure it in both places as this will cause Trim to run too often, reducing the life of your drive.
To set the Windows Trim settings, I will open the properties of the Solid-State Drive. This can be done by right clicking on the drive in Windows Explorer and selecting properties. From the properties, select the tab “Tools”. Under tools, press the button “Optimize”.
You will notice that Windows is reporting that Trim was run six days ago, so Trim is being run by Windows, and therefore we don’t need to enable it in this tool. I will now exit out of here, go back to the tool and select “Write-Cache Settings”.
This will allow the caching options to be set. Changing these options will improve the performance of the Solid-State Drive, but it increases the risk of data loss if there is a power outage.
There is an option to create a bootable USB drive. This will create a bootable USB drive with the firmware on it. Once created, the computer needs to be booted from the USB, and the firmware on the Solid-State Drive will be updated. The firmware can also be updated from Windows, but there is a little bit more risk that something will go wrong. If something does go wrong during a firmware update, there is a good chance the Solid-State Drive won’t be functional anymore.
There is an option “Update Firmware” which will update the firmware using the tool. When I press this button, a window will appear reminding me that firmware updates can fail and thus the data should be backed up first. I have backed up the data already. So, I will proceed with the firmware update.
In this case, the firmware update does not take too long to complete. During this process, make sure the computer has power, because if power is disconnected during the process the firmware update will most likely fail.
Once complete, I will press rescan and the tool will check for changes. This is everything I want to look at on this computer using this tool. I will now close it and reopen it on a different computer to compare a Solid-State Drive and a hard disk.
The Solid-State Drive that I will look at is the same Solid-State Drive that I looked at earlier in the S.M.A.R.T. demonstration. So, I will be able to compare using that information.
You will notice that the life remaining is reported as 100%. When I compare that with the data reported by S.M.A.R.T., you will notice this number is a large hexadecimal value. The tool has calculated the life remaining, however, it is not clear at this stage how this hexadecimal number translates to 100%.
I will next select the “Tools” tab. From the tools tab, I will select S.M.A.R.T. from the left-hand side. From here I can run a short or extended S.M.A.R.T. test. I will select the short S.M.A.R.T. test. Once I press the continue button the test will start. Even though it is a short test, it will take a few minutes to run, so I will pause the video and return shortly. Once the test is complete, you will notice that it reports that “No problems were detected”.
Different tools will have different tests that they can perform. The tests will give you an indication if your storage device is working correctly.
There is also the option for an extended test. This kind of test will perform a very thorough test of the device; however, it will take a long time to complete. Since this test takes a long time to complete, it is generally best to run the short test first. If the short test fails, there is no real point running the extended test.
At the bottom is the option to open the S.M.A.R.T. data. I will select that and compare it with what the other tool told me.
Notice that the value for “Media Wearout Indicator” is 0.65%. So, this gives us a clue that the hexadecimal value is representing a real value, that is a value with decimal places, rather than an integer with no decimal places. You can start to see how difficult it can be to interpret S.M.A.R.T. data.
I will next exit out of here and select the hard disk that is connected to this computer. Most of the interface and options are much the same. You will notice that life remaining has changed to “Drive Health”. With hard disks, writing data to the hard disk does not wear the hard disk out like writing data to a Solid-State Drive. Thus, you can’t really measure life remaining as you would with a Solid-State Drive. Instead, information is read from the hard disk to try and determine how healthy the drive is.
I will next select the tools tab and have a look at the options there. The first option is “Erase Drive”. There are two options, the quick and full. I will select the quick option and proceed with erasing the hard disk when asked.
The process does not take too long to complete. The quick erase is like a quick format. That is, it won’t erase all the data on the hard disk but will delete key structures to make it look like the data has been removed. The data on the hard disk can potentially be recovered using recovery software. Unlike quick format, quick erase will remove the partitions from the hard disk, and it will appear in Windows as it did when it left the factory.
Full erase will do the same as quick erase but will also overwrite every single sector on the hard disk, making any data recovery impossible. You would normally do this if you were disposing of the hard disk. For example, you had finished using the hard disk and you wanted to donate it to another party, but wanted to make sure that there was no data on it. There is also third-party software which will perform the same process. I would not be too concerned about using third-party software to fully erase a hard disk, but with a Solid-State Drive I would be.
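A small Python model shows why a quick erase leaves the data recoverable, and how that compares with overwriting every sector. The disk here is just a block of memory, and the assumption that the partition table lives in the first 512-byte sector is a simplification for the example.

```python
SECTOR_SIZE = 512


def quick_erase(disk):
    """Wipe only the first sector, where this model keeps its partition table."""
    disk[:SECTOR_SIZE] = bytes(SECTOR_SIZE)


def full_erase(disk):
    """Overwrite every byte on the disk with zeros."""
    disk[:] = bytes(len(disk))


def make_disk():
    disk = bytearray(4 * SECTOR_SIZE)
    disk[:4] = b"PART"                              # stand-in partition table
    disk[SECTOR_SIZE:SECTOR_SIZE + 6] = b"secret"   # stand-in file data
    return disk


quick = make_disk()
quick_erase(quick)
print(b"secret" in quick)   # data still present after a quick erase

full = make_disk()
full_erase(full)
print(b"secret" in full)    # data gone after overwriting every sector
```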
The reason is, there is a slight difference between how this option works for a hard disk and for a Solid-State Drive. To see the difference, I will select the Solid-State Drive.
Here, you will notice that the option has changed to “Sanitize”. Let’s have a close look at how a Solid-State Drive works in order to understand the difference between erasing and sanitizing.
Solid-State Translation Layer
Solid-State Drives store data in blocks. However, when a file is broken into blocks it may not be stored in order in these blocks. In order to record where the data is stored, a translation layer is used. This translation layer provides a mapping of where the blocks of data are stored.
To understand this process better, let’s consider that you want to store some important data on a Solid-State Drive. The file is broken up into blocks. To make things simpler for this example, it is assumed the operating system is going to store the data in contiguous chunks. Thus, for this example it will appear in the translation layer in order. If the drive is fragmented, the file may be spread out in multiple chunks, but for this example we will keep it simple.
If the data was stored on a hard disk, it would be stored in the same order. The only exception to this is if there is a bad sector, then the data would be reallocated to a reserve sector. Solid-State Drives change the way data is stored compared with hard disks, since the data could be stored anywhere on the drive.
Now that we are aware of this, let’s consider that we are storing the blocks. You will notice that blocks may not be stored in order, and the translation layer will store which block the data is stored in. Also, when maintenance like garbage collection occurs on the Solid-State Drive, these blocks could potentially be moved around, thus the translation layer is required to keep track of where these blocks are.
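To make the idea of the translation layer a little more concrete, here is a small Python sketch. It is only a toy model and not how any real drive's firmware works. The class name `TranslationLayer` and its methods are made up for this example, and the caller picks the physical block so that the out-of-order placement is easy to see.

```python
class TranslationLayer:
    """Toy model of an SSD translation layer (illustrative only)."""

    def __init__(self, physical_blocks):
        # Physical storage: None means the block is free.
        self.physical = [None] * physical_blocks
        # Logical block number -> physical block number.
        self.mapping = {}

    def write(self, logical, data, physical):
        # Real firmware chooses the physical block itself; here the caller
        # picks it so the out-of-order placement is easy to see.
        if self.physical[physical] is not None:
            raise ValueError("physical block already in use")
        self.physical[physical] = data
        self.mapping[logical] = physical

    def read(self, logical):
        return self.physical[self.mapping[logical]]

    def move(self, logical, new_physical):
        # Stand-in for maintenance such as garbage collection: the data
        # moves, and only the mapping has to change for reads to keep working.
        if self.physical[new_physical] is not None:
            raise ValueError("physical block already in use")
        old = self.mapping[logical]
        self.physical[new_physical] = self.physical[old]
        self.physical[old] = None
        self.mapping[logical] = new_physical


ftl = TranslationLayer(physical_blocks=8)
# A three-block file stored out of order on the drive.
ftl.write(0, "A", physical=5)
ftl.write(1, "B", physical=2)
ftl.write(2, "C", physical=7)
ftl.move(1, new_physical=0)

file_contents = "".join(ftl.read(n) for n in range(3))
print(file_contents)  # the file still reads back in logical order
print(ftl.mapping)
```

Even though block 1 was moved, the file reads back in the right order because the mapping was updated at the same time.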
Now, let’s consider what occurs if an attempt is made to erase the data on the Solid-State Drive. Solid-State Drives wear out the more you write to them, so if the drive can avoid writing, its life can be extended. So, some Solid-State Drives won’t erase the physical data. Instead, they will just update the translation layer. For example, the drive may set the block number in the translation layer to -1 to indicate that the data has been erased, or simply set a flag that the block is all zeros.
You can see that the data is still on the Solid-State Drive and thus erasing the data may just update the translation layer and not actually erase the data. Not all Solid-State Drives do this. But you can see the advantage of doing it this way. Updating the translation layer, marking the block as zeros, is a lot faster than changing the data on the Solid-State Drive and also reduces wear on the drive.
Now that we understand how this works, let’s now consider how the sanitize function works. To do this, let’s consider that we have the same data that has been written to the Solid-State Drive. Sanitize is different in that it updates the translation layer and also erases the data on the Solid-State Drive.
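The difference between erasing and sanitizing can be sketched in a few lines of Python. Again, this is a toy model under my own assumptions. `ToyDrive`, `erase` and `sanitize` are hypothetical names, and a real drive's sanitize command does a lot more than clear a list.

```python
ERASED = -1  # marker stored in the mapping, like the -1 example in the text


class ToyDrive:
    """Toy Solid-State Drive with a translation layer (illustrative only)."""

    def __init__(self, physical_blocks):
        self.physical = [None] * physical_blocks
        self.mapping = {}

    def write(self, logical, data, physical):
        self.physical[physical] = data
        self.mapping[logical] = physical

    def read(self, logical):
        physical = self.mapping[logical]
        if physical == ERASED:
            return None  # the drive reports the block as empty
        return self.physical[physical]

    def erase(self):
        # Only the translation layer is updated; the stored data is untouched.
        for logical in self.mapping:
            self.mapping[logical] = ERASED

    def sanitize(self):
        # Update the translation layer and also clear the physical blocks.
        self.erase()
        self.physical = [None] * len(self.physical)


def demo(wipe):
    drive = ToyDrive(physical_blocks=4)
    drive.write(0, "secret-1", physical=3)
    drive.write(1, "secret-2", physical=1)
    wipe(drive)
    visible = [drive.read(0), drive.read(1)]
    leftover = [block for block in drive.physical if block is not None]
    return visible, leftover


erase_result = demo(ToyDrive.erase)
sanitize_result = demo(ToyDrive.sanitize)
print(erase_result)
print(sanitize_result)
```

In both cases the drive reports the blocks as empty, but only after the sanitize is there nothing left in the physical blocks.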
Thus, if you are giving away a Solid-State Drive, I would recommend sanitizing the drive before you do so. This will make sure that any data on the Solid-State Drive is erased. Even doing a full format on a Solid-State Drive may not remove all the data on the drive as maintenance may move data around while the full format is being conducted. If the software does not have a sanitize option, the next best way to achieve this is to use software that writes random data to the Solid-State Drive to ensure all the blocks on the drive are overwritten.
I will next have a look at what you can do if your storage device develops errors.
Scan Disk/Check Disk (Chkdsk)
Your storage device may have developed some errors that can be corrected. Even with good hardware, at times errors can occur on the storage device. The most common cause of this is the computer losing power while it is operating, either by turning it off without giving it a chance to shut down or the power suddenly being cut.
If these errors occur within a file system on the storage device, Scan Disk and Check Disk may be able to fix these problems. To run Scan Disk, first open Windows Explorer. You next need to select the drive that you want to scan. If the drive contains the current running operating system, you will notice that there is a Windows icon next to the drive.
To run Scan Disk, right click on the drive and select properties. In properties, select the tools tab. Once in the tools tab, press the button “Check”.
In this case, there are no errors on the drive and as the drive is a Solid-State Drive the check will not take too long to complete. In some cases, some disk errors can’t be corrected while the file system is in use. In order to complete Scan Disk, you will need to close any applications that are using that file system. In the case of the operating system drive, it is not possible to stop using the drive since the operating system can’t simply be stopped to check it. When a check is required but can’t be performed because the disk is in use, the scan can be scheduled to run on the next restart of the computer.
Windows will automatically scan the drive when the computer is not shut down correctly or a drive error is detected. It may also scan the drive after a certain amount of time.
I will next have a look at running Check Disk from the command prompt. To do this, I will change to my command prompt.
The next step is to run the command chkdsk. After chkdsk, I will enter in the drive letter, in this case D drive. Without any more parameters, Windows will check the disk in read only mode; in this mode Windows will not correct any errors it finds. To correct these errors, I will add the switch /f. In this case, I will also add | more, so that the information does not scroll off the screen.
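If you wanted to script this step, a minimal Python sketch could look like the following. The function names `build_chkdsk_command` and `run_chkdsk` are my own. The only parts taken from the command itself are `chkdsk`, the drive letter and the `/f` switch. Capturing the output in Python takes the place of `| more`. Actually running it only works on Windows and normally needs an administrator command prompt.

```python
import string
import subprocess


def build_chkdsk_command(drive_letter, fix=False):
    """Return the chkdsk argument list for a drive letter such as 'D'."""
    letter = drive_letter.strip().rstrip(":").upper()
    if len(letter) != 1 or letter not in string.ascii_uppercase:
        raise ValueError(f"not a drive letter: {drive_letter!r}")
    command = ["chkdsk", f"{letter}:"]
    if fix:
        command.append("/f")  # without /f, chkdsk runs in read-only mode
    return command


def run_chkdsk(drive_letter, fix=False):
    # Windows only, and normally needs an elevated (administrator) prompt.
    result = subprocess.run(build_chkdsk_command(drive_letter, fix),
                            capture_output=True, text=True)
    return result.returncode, result.stdout


read_only = build_chkdsk_command("d")
with_fix = build_chkdsk_command("D:", fix=True)
print(" ".join(read_only))
print(" ".join(with_fix))
```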
There is some information given by chkdsk, but in this case no errors have been detected or corrected. You will notice that there is also a message saying that no problems were found.
The last thing chkdsk will report on is some statistics about the file system. In the old days, one of the important statistics to look at was the bad sector count. With modern storage devices, when a bad sector is found, the storage device will stop using it and use a reserve sector to replace it. Windows, however, still has the ability to mark sectors as bad. Essentially, what occurs is the file system will mark the sector bad and stop using it. Windows does not have the ability to redirect the sector to a reserve sector; instead, it will simply stop using it.
For this reason, bad sectors won’t often be reported by Windows as the storage device will transparently reallocate sectors and Windows won’t know. Thus, I would not rely on Windows to detect bad sectors.
This hard disk is failing and has a number of bad sectors. To see how many, I will switch Crystal Disk Info to read the S.M.A.R.T. data. You will notice that the health status of this hard disk is “Caution”. When I cursor over “Caution”, notice a message saying that the reallocated sector count is 1920. I can confirm this by looking at the raw S.M.A.R.T. data.
This hard disk has 1920 bad sectors but Windows reports zero. This is not a fault of Windows, it just demonstrates how good hard disks are at detecting bad sectors and moving the data to reserve sectors. This presents some problems.
From a performance point of view, the reserve sectors are in the center of the platter and thus when a sector is reallocated, the hard disk head must move to this area to access it. Once it reads the reserve sector it needs to move the head back. These extra seeks slow down performance of the hard disk.
The second and bigger problem is, if the number of your reallocated sectors is going up, as in the case of this hard disk, there is something wrong. It is best to copy the data off the hard disk and replace it.
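Since the trend matters more than any single reading, here is a small Python sketch of tracking the reallocated sector count over time. It assumes the raw values are shown as hexadecimal text, which is how tools often display raw S.M.A.R.T. data. Only the last reading matches the 1920 seen on this hard disk. The earlier readings and the function names are made up for the example.

```python
def parse_raw_value(raw_hex):
    """Convert a raw S.M.A.R.T. value shown as hexadecimal text to an integer."""
    return int(raw_hex, 16)


def is_rising(readings):
    """True if any reading is higher than the reading before it."""
    return any(later > earlier
               for earlier, later in zip(readings, readings[1:]))


# Hypothetical weekly readings of the raw reallocated sector count.
weekly_raw = ["000000000700", "000000000740", "000000000780"]
counts = [parse_raw_value(value) for value in weekly_raw]
print(counts)
print(is_rising(counts))
```

A count that stays the same for months is much less worrying than one that goes up every time you look at it.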
Let’s have a look at some indicators that you are having problems with your storage.
Stop Error (BSoD)
One of the indications you may have storage problems is if you have random stop errors. Errors like this can happen for a number of different reasons. It may not be because of your storage. When a stop error occurs, a stop code will be given. In this case the stop error code is “CRITICAL_PROCESS_DIED”.
When looking at stop errors, in some cases the stop code will help you work out what is causing the problem. In the case of storage device errors, the stop code won’t be storage related, it will be because data has become corrupted or could not be accessed and caused something else to crash. Thus, don’t expect stop errors to tell you directly that storage is causing the problem. However, if your computer is having a lot of stop errors, something is causing them and it could be storage related.
The next question is what do you do next to determine whether it is storage causing the problem?
Event Viewer
If you are getting stop errors, or other storage related problems such as poor performance, I would look at the Event Viewer to see if you can get more information about what is causing the problem. For this particular video, I will be looking at events that are related to storage problems. However, keep in mind that if you are getting stop errors that are not disk related, the event viewer may give you some information to help you determine what the stop error is.
To start with, I will open Event Viewer. Once Event Viewer is open, I will expand “Windows Logs” and then open “System”. System, as the name suggests, will show you events that are related to the running of the operating system.
For this hard disk, I have artificially caused it to have some problems. This way, I can cause it to create some events in the Event Viewer.
Storage device problems may be intermittent in nature. That is, problems will occur randomly and then disappear. It is difficult, if not impossible, to predict when these errors are going to occur. If you have a quick scroll through the Event Viewer, it is not uncommon for a lot of disk errors to be generated at once. For example, perhaps that hard disk is having trouble reading a certain part of the platter. When it attempts to read this part it generates a lot of errors, but when it is reading the rest of the hard disk it does not generate any errors. You can see why at times you may get a lot of errors and at other times get none.
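To illustrate how errors tend to arrive in bursts, the following Python sketch groups event times into bursts separated by quiet periods. The timestamps are hypothetical, as is the one minute gap I have chosen. In practice you would export the times from the System log.

```python
from datetime import datetime, timedelta


def group_into_bursts(timestamps, gap=timedelta(minutes=1)):
    """Group event times; a new burst starts after a quiet gap."""
    bursts = []
    for moment in sorted(timestamps):
        if bursts and moment - bursts[-1][-1] <= gap:
            bursts[-1].append(moment)
        else:
            bursts.append([moment])
    return bursts


# Hypothetical times of disk errors exported from the System log.
events = [
    datetime(2024, 1, 5, 9, 0, 1),
    datetime(2024, 1, 5, 9, 0, 3),
    datetime(2024, 1, 5, 9, 0, 4),
    datetime(2024, 1, 5, 14, 30, 0),
    datetime(2024, 1, 5, 14, 30, 20),
]
burst_sizes = [len(burst) for burst in group_into_bursts(events)]
print(burst_sizes)
```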
I will now select one of the errors and have a look at what it can tell me. You will notice the message says the IO operation was retried. In Windows there is a storage device driver that is used with high-performance buses. When a request times out, this event may be generated. Essentially, this is saying a request was sent through and timed out in the device driver. This generally indicates some kind of hardware problem with the device. An error like this doesn’t indicate there is any data lost; however, it is concerning because it indicates there is a problem with Windows communicating with the storage device. Commands are essentially being sent to the storage device but are timing out.
I will scroll down some more and see if I can find some different Event IDs. Further down there is a different event. This error is of serious concern. Windows virtual memory swaps physical memory to and from the hard disks. Although slower than physical memory, it allows the amount of memory in the computer to be increased. Even if you have plenty of memory, Windows will still use a small amount of virtual memory.
This error is of serious concern because it means an error occurred when using virtual memory. If the data being transferred between physical memory and the hard disk becomes corrupt, this can cause the computer to crash. You can see that, if memory gets corrupted, a stop error may occur that has nothing to do with the storage device.
The next error I will look at is an NTFS error, not a disk error. This error is called delayed write failed. Delayed write errors are serious errors. This means that there was a delay attempting to write data. The caches filled up and data was lost before it could be written.
Delayed write errors can occur for a number of different reasons. They are often seen in networking connections where there is congestion on the network. Since there is congestion, data can’t get through, and it is lost.
The same can also happen on physical storage devices. The problem is often not congestion, the problem is the storage device is not able to write data from its buffers to the device quickly enough. For example, in the case of a hard disk, problems writing to the platter may result in the hard disk having to retry the write multiple times. These problems can also occur due to cabling issues, where the data is not able to get through the cable to the storage device fast enough. Regardless of the reasons, it means data was lost, which is of course a concern.
You could ask the question, why does Windows not keep trying? Consider the data is in a queue. If Windows can’t write the data, the other data in the queue is delayed. At some stage, Windows gives up trying to write the data and moves on to the other data in the queue. Otherwise, Windows would get stuck in a loop.
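The idea of giving up after a number of attempts can be sketched in Python as follows. This is not how Windows actually implements its write queue. The function `drain_queue`, the limit of three attempts and the failing device are all assumptions made to show the principle.

```python
from collections import deque


def drain_queue(requests, write, max_attempts=3):
    """Try each queued write a limited number of times, then move on."""
    queue = deque(requests)
    written, lost = [], []
    while queue:
        request = queue.popleft()
        for _ in range(max_attempts):
            if write(request):
                written.append(request)
                break
        else:
            # Every attempt failed: give up so the rest of the queue
            # is not held up forever. This is the data that is lost.
            lost.append(request)
    return written, lost


# Hypothetical device that can never write block 2.
def flaky_write(request):
    return request != "block-2"


written, lost = drain_queue(["block-1", "block-2", "block-3"], flaky_write)
print(written)
print(lost)
```

Without the attempt limit, block 3 would never be written because the loop would be stuck on block 2.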
In this case, the delayed write error is trying to write to the MFT. NTFS records where files are stored using the Master File Table or MFT. This essentially is the index of where data is stored on the drive. Failing to write data to the MFT is a serious issue.
I will next find another NTFS error to have a look at.
This error is also of concern. Flushing data is the process of clearing the buffers out and writing them to the storage device. The transaction log on NTFS keeps a record of all updates to the file system. This way, if power is lost halfway through an update, Windows can look at the transaction log and roll back the changes preventing the hard disk becoming corrupted. Thus, you can see that errors like these are of a serious nature.
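To show the roll back idea in its simplest form, here is a Python sketch of a toy transaction log. The real NTFS log is far more involved than this. The class `ToyJournal` and the file names are invented, and the sketch only shows how recording old values lets an unfinished update be undone.

```python
class ToyJournal:
    """Toy transaction log: record old values so updates can be undone."""

    def __init__(self, data):
        self.data = dict(data)
        self.log = []  # one (key, had_value, old_value) entry per update

    def update(self, key, value):
        # Log the old value first, then change the data.
        self.log.append((key, key in self.data, self.data.get(key)))
        self.data[key] = value

    def commit(self):
        # The update finished, so the undo records are no longer needed.
        self.log.clear()

    def roll_back(self):
        # Undo uncommitted updates, newest first.
        while self.log:
            key, had_value, old_value = self.log.pop()
            if had_value:
                self.data[key] = old_value
            else:
                del self.data[key]


fs = ToyJournal({"report.txt": "cluster 10"})
fs.update("report.txt", "cluster 42")
fs.update("photo.jpg", "cluster 43")
# Power is lost here, before commit() is called.
fs.roll_back()
print(fs.data)
```

If the log itself can’t be written to the storage device, there is nothing to roll back from, which is why a failure to flush the transaction log is a concern.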
I will now have a look at another event. This event concerns the Advanced Host Controller Interface or AHCI. AHCI communicates with devices using connections like SATA. This event means a reset command has been sent to the storage device, resetting the interface on the storage device.
This can happen for a number of reasons. It is not really an issue to have one rarely, but if you start getting them regularly, it is worth looking into what is causing them. Resetting the device means the interface has reset which disrupts data transfer to the device. If this happens too often, it will affect the performance of the device. The question is, why is this occurring? This could be software or hardware problems and even connection problems. It potentially could even result in a loss of data.
These are all the events that I will look at. If you are having problems, have a look for disk or file system errors such as NTFS errors. If you are not sure what the event means, perform an internet search and this will tell you more about the event. You will get the occasional error, but if you open the Event Viewer and you see a lot of them, something has seriously gone wrong.
Disappearing/Failed Storage Device
In some cases, your hard disk may disappear when you are using it or maybe it does not work at all. For example, when you boot your computer, you may get a message “No Boot Device Present”. At other times the computer may boot without a problem. This is a typical example of a disappearing storage device. When this occurs, there are a few things to check for.
One of the first things to check for is the data and power cabling. If your storage device is randomly getting disconnected this will prevent it being detected by the operating system. This can cause data loss and corruption. If the cables are connected correctly, it may be a problem with the cable and it needs to be replaced.
Once you have checked the cabling, check to see if the storage device is being detected in the BIOS or the computer’s setup. Unless your computer supports hot-swapping and it is enabled, the storage needs to be detectable when the computer is switched on. If you are using hot-swapping and the storage device is not being detected, I would shut down the computer and start it up again to make sure there is not a problem with hot-swapping.
In some cases, the storage device may be detected in the BIOS but not by Windows. When this occurs, update the device driver and also the chipset device drivers. This is most likely the cause. If this fails, you can also try updating the BIOS firmware and the storage device’s firmware.
If the storage device does not seem to be working, try it on another computer. You don’t want to throw a storage device out because of a failed connection on the motherboard. You can also try the storage device in a docking station. However, make sure it is a good docking station that is working. Lower-quality docking stations may sometimes have problems with some storage devices. Once again, you don’t want to throw a good storage device away because you thought it was broken. Often, before I throw a storage device out, I will test the connection I am using with a known working storage device to ensure the connection is working.
In the case of hard disks, you can check the hard disk to make sure the motor is spinning up. It is not a bad idea to test it with just the power connection. The hard disk, when connected to power without the data connection plug, should spin up. If it does not spin up, you know there is a problem with the hard disk.
Unfortunately, you won’t be able to test a Solid-State Drive like this. They don’t make any noise when plugged in and don’t have any lights on them to indicate they are working. The only way to tell if they are working is if the computer detects it.
If the storage device is damaged, it may shut down. If the storage device is doing this, I recommend copying the data off as soon as possible while it is still working. Once you have the data off the storage device, replace it.
Return Merchandise Authorization (RMA)
In some cases, replacing the storage device may be a simple matter of returning it to where you purchased it. In other cases, you may need to go through a procedure for returning the item, referred to as RMA.
Sometimes, before you return it, you first need to get what is called an RMA number. RMA stands for Return Merchandise Authorization. This is the first step in the process. Getting the RMA number gives you the authority to start the process proper. Depending on where you purchased the item, they may not accept it until you have an RMA number. In other cases, the RMA number is created when you return the item and is given to you then. The point is, this RMA number is what tracks that item throughout the whole process.
Once the RMA process has started, the item may be repaired, replaced or a credit given. You may be wondering why there is such a process in the first place. Let’s consider that you have a very large company that has a lot of IT devices. The supplier of these IT devices and the contracts they have with large companies are very important as they make a lot of money from them. Thus, you can understand that sometimes when they report something broken, in order to provide good service, they may just replace the item. The company may need this replacement quickly, since downtime to them is important and costly.
Later on, if it turns out the item was not broken, it can be returned to the company or used elsewhere. When you have a lot of IT devices, a few devices here or there that were returned but then found to be working is not a big deal within the overall size of the contract. They will just issue credit or debits later on to balance the books, so to speak, or perhaps the contract covers a couple of these, so there is nothing to be concerned about. At the end of the day, large contracts are all about keeping the customer happy.
If you find that your storage device is broken, consider how long the storage device warranty is. The storage device warranty can vary and can be as much as five years in some situations. This warranty will be referred to as a limited warranty. This means it will be limited as to what cover can be provided. For example, the warranty may cover a replacement hard disk but not recovery of the data on that hard disk.
Depending on the warranty, within the first year or two, you may be able to return it directly to the shop. After this, you may need to return it to the manufacturer. If this is the case, you will often need to pay for shipping. So, it is often worthwhile sending a batch of broken devices off at once if you are able to. If you don’t have enough, talk with some friends to see if they have any they want to include. The manufacturer should replace the hard disk with the same device, but sometimes if they run out of that model, they will replace it with a newer model, potentially with more storage, but don’t expect too much.
In some cases, the RMA may be called Return Authorization, Return Goods Authorization or Return to Vendor. They are all essentially talking about the same thing, so don’t worry too much about what it is being called.
In some cases, the storage device fails and you may not be able to get the data off before it does so. In fact, you should expect that the storage device will fail at some time because they don’t last forever.
Backup
If you value your data, you should back it up. The OS and data can be very big, so if you can’t back up all of it, have a look at what you can’t afford to lose. USB sticks hold a lot of data nowadays. More than enough to hold your irreplaceable photos or your important documents. To protect yourself from your house being burnt down, you can always leave the USB stick at a friend’s house or somewhere else.
The Operating System should come with a backup facility. These backups are generally not too bad, but if your OS does not have one or you want a better one, there are plenty of third party solutions.
For business, there are tape libraries available. Tapes are either manually loaded into a tape drive or loaded automatically. In the old days, it was more common to manually load tapes to perform the backup. As time went by, storage became larger and larger so this became less common. Thus, automatic loading of tape drives has become commonplace. However, storage is getting so large nowadays, it is hard for even automatic libraries to keep up.
In some cases, cloud-based backup is an option. Cloud-based backup has the advantage that it is stored off site, meaning that if your site burns down, you have an off-site copy. I personally like cloud backups; however, keep in mind that they are only as good as the data going into them. If a virus gets into your computer and corrupts your data, this replicates to the cloud storage. If you are not aware this has occurred, you may lose your data. For example, say a virus corrupts your important family photos and the cloud storage only keeps backups for 30 days. If you don’t notice the issue within 30 days, your data will be lost. Thus, even with cloud backups, I still like to, every so often, make a copy of my most important data, so it does not get lost.
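The 30 day example can be worked through with a short Python sketch. It assumes a backup is taken every day and that anything older than the retention period is deleted. Real cloud services vary, and the function name and dates are made up.

```python
from datetime import date, timedelta


def clean_copy_available(corrupted_on, noticed_on, retention_days=30):
    """True if a backup from before the corruption is still retained."""
    oldest_kept = noticed_on - timedelta(days=retention_days)
    # Backups taken before corrupted_on are clean; one survives only if
    # the corruption is more recent than the oldest backup still kept.
    return corrupted_on > oldest_kept


corrupted = date(2024, 3, 1)
noticed_early = clean_copy_available(corrupted, date(2024, 3, 20))
noticed_late = clean_copy_available(corrupted, date(2024, 4, 15))
print(noticed_early)
print(noticed_late)
```

Noticing the problem after 19 days leaves a clean copy to restore from. Noticing it after 45 days does not.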
The advantage with cloud storage and backups is that it is expandable. You need more space, you simply need to pay for it. Cloud storage is also getting cheaper and therefore becoming a viable option. So, something to consider.
Data Recovery
If you don’t have a backup and you can’t access your data, the last thing that you can try is data recovery. There are a lot of data recovery places around, so it should not be too hard to find one. If you do decide to use one, firstly stop using the storage device. Any additional use risks damaging the device making data recovery harder.
Even if the storage device no longer works, it may still be possible to have data recovered. This includes fire damage. In the case of hard disks, if the platters are damaged, it may still be possible to recover data depending on what kind of damage there is.
Many data recovery procedures involve opening the device and removing components. For example, removing platters or chips off the board. Many of these procedures are a one-shot process. In other words, once you start there is no going back. Thus, don’t try it yourself if you are considering data recovery. Data recovery is a highly specialized area, so if you pull a hard disk apart and try it yourself, don’t be surprised if the data recovery company won’t even look at the hard disk when you take it to them.
Data recovery is very expensive. A lot of their work is recovering data from RAIDs, as companies have the money to pay for the service, but if you just have a hard disk with the family photos on it, they will attempt a recovery for a price.
Some data recovery companies will offer a free assessment. They should be able to, in a day or two, let you know if the data can be recovered. Some places will also only charge if they recover the data. If you are considering data recovery, it may be worth your time to have a look around and see what each place offers.
Backup Your Data
In this video I have had a look at a lot of things that can go wrong and what you can do about them. The most important point you should take from this video is to back up your data. Storage devices fail over time and the last thing you want is your data to magically disappear, so remember, back up your data!
End Screen
That concludes this video from ITFreeTraining on storage device failures. I hope you found this video informative. After you back up your data, I hope to see you in the next video, but until then I would like to thank you for watching.
Credits
Trainer: Austin Mason http://ITFreeTraining.com
Voice Talent: HP Lewis http://hplewis.com
Quality Assurance: Brett Batson http://www.pbb-proofreading.uk