April 12

How to determine if a SATA drive is failing.

When is a good time to check whether a hard drive is failing? Well, when your console is full of I/O and seek errors, I’d say that is a pretty good time! Hah.

Google's study, Failure Trends in a Large Disk Drive Population, notes that manufacturer, model and vintage all play a role in failure rates, but it does not publish failure statistics broken down by model or manufacturer. Most drives in the study ran at 45°C or less.

From the SMART data, scan errors, reallocations, offline reallocations and probational counts had a significant correlation with failure probability, whereas seek errors, calibration retries and spin retries had little significance.

Soooo…. you want to look at the Raw_Read_Error_Rate, Seek_Error_Rate and Reallocated_Sector_Ct information from smartctl.

[root@SOMESERVER ~]# smartctl --all /dev/sdb | grep Error
Error logging capability:        (0x01)	Error logging supported.
  1 Raw_Read_Error_Rate     0x000f   117   100   006    Pre-fail  Always       -       166491825
  7 Seek_Error_Rate         0x000f   090   060   030    Pre-fail  Always       -       999290467
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0000   100   253   000    Old_age   Offline      -       0
SMART Error Log Version: 1
[root@SOMESERVER ~]# smartctl --all /dev/sdb | grep Reallocated_Sector_Ct
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0

With regard to Reallocated_Sector_Ct, the normalized values (current=100, worst=100) indicate the drive is in perfect condition (higher is better, and judging by the overall report, 100 appears to be “best”). The threshold value (36) just indicates how low the normalized value would have to drop before the manufacturer would consider the drive to be in a “Pre-fail” condition.
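If you'd rather not eyeball the columns, here is a rough sketch that pulls the normalized value, worst value and threshold for one attribute out of smartctl's output and prints how much headroom is left. The function name smart_margin is made up for this post, and the field positions assume the usual ten-column attribute table (ID#, ATTRIBUTE_NAME, FLAG, VALUE, WORST, THRESH, TYPE, UPDATED, WHEN_FAILED, RAW_VALUE), so check it against your own drive's report before trusting it.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of smartmontools): reads smartctl output on
# stdin and reports how far one attribute's normalized value sits above its
# threshold. Assumes the standard ten-column attribute table.
smart_margin() {
    local attr="$1"
    awk -v attr="$attr" '
        # Match the row whose ATTRIBUTE_NAME (field 2) is the one requested.
        $2 == attr {
            value  = $4 + 0   # current normalized value
            worst  = $5 + 0   # worst normalized value seen so far
            thresh = $6 + 0   # manufacturer threshold ("036" becomes 36)
            # An attribute counts as failed once value <= threshold.
            status = (value <= thresh) ? "FAILING" : "ok"
            printf "%s value=%d worst=%d thresh=%d margin=%d raw=%s %s\n",
                   attr, value, worst, thresh, value - thresh, $10, status
        }
    '
}

# Example (needs root and a real drive, so it is left commented out):
#   smartctl --all /dev/sdb | smart_margin Reallocated_Sector_Ct
```

Fed the Reallocated_Sector_Ct line from the report above, this prints "Reallocated_Sector_Ct value=100 worst=100 thresh=36 margin=64 raw=0 ok".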

If you run “smartctl --all /dev/sdb | grep Error” again and notice that Raw_Read_Error_Rate and Seek_Error_Rate keep incrementing AND Reallocated_Sector_Ct is greater than 0, it's pretty safe to say that you have a ticking time-bomb on your hands. You should consider replacing those drives as soon as possible.
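That "run it again and compare" check can be sketched as a small script that works on two saved reports. The function names smart_raw and drive_looks_bad and the /tmp file names are made up for illustration, and it assumes the raw values are plain integers in the tenth column. One caveat: on some vendors' drives the raw read and seek error values are packed counters that climb during normal operation, so treat this as a rough heuristic rather than a verdict.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of smartmontools).

# Print the RAW_VALUE column (field 10) of one attribute from a saved report.
#   $1 = file holding `smartctl --all` output, $2 = attribute name
smart_raw() {
    awk -v attr="$2" '$2 == attr { print $10; exit }' "$1"
}

# Return 0 (true) when, between an earlier and a later report, both
# Raw_Read_Error_Rate and Seek_Error_Rate grew AND the later report shows
# Reallocated_Sector_Ct above zero. Anything else returns non-zero.
drive_looks_bad() {
    local before="$1" after="$2"
    local attr old new realloc
    for attr in Raw_Read_Error_Rate Seek_Error_Rate; do
        old=$(smart_raw "$before" "$attr")
        new=$(smart_raw "$after" "$attr")
        # A missing attribute is treated as "not incrementing".
        [ -n "$old" ] && [ -n "$new" ] || return 1
        [ "$new" -gt "$old" ] || return 1
    done
    realloc=$(smart_raw "$after" Reallocated_Sector_Ct)
    [ -n "$realloc" ] && [ "$realloc" -gt 0 ]
}

# Example (needs root and a real drive, so it is left commented out):
#   smartctl --all /dev/sdb > /tmp/sdb.before
#   ... wait a while ...
#   smartctl --all /dev/sdb > /tmp/sdb.after
#   drive_looks_bad /tmp/sdb.before /tmp/sdb.after && echo "replace /dev/sdb"
```

Saving the reports to files keeps the comparison honest, since both snapshots are parsed the same way and you can look at them again later.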

January 28

Environment Monitoring

Set up an APC AP9319 Environment monitoring unit today. It has temperature and humidity probes to do basic environmental monitoring of the area surrounding a rack. I’ve got it rigged up to a light beacon to flash an orange strobe and send an email if the alert criteria are met. See AP9319

According to APC, the AP9319 has since been replaced by their uber-expensive NetBotz Rack Monitor 200. It seems to offer more than you would actually need for individual zone monitoring.