SYSADMIN smartmontools
December 2004 www.linux-magazine.com

Monitoring Hard Disks with smartmontools
Crash Prevention

Hard disks don’t always die as suddenly as you might think. The right tools can help you detect hard disk issues before they become critical.

BY GABRIELE POHL

Hard disks have moving parts: platters that rotate at 5400, 7200, or more revolutions per minute, and heads that are subject to extreme acceleration and deceleration. Because they have moving parts, hard disks are subject to wear and tear. Manufacturers typically estimate a mean failure time for their products, but this is a purely statistical value, and they provide no guarantee that the disk will not die on you within the first month. And Murphy’s law says that your hard disk is most likely to die when you least expect it, that is, when you do not have a recent backup or when spare parts are hard to come by.

If you are lucky, your computer might output a message like the following when you boot it:

SMART Failure Predicted on Primary Master: Maxtor 34098H4
Warning! Immediately back-up your data and replace your hard disk drive. A Failure may be imminent.

The message is clear. Your hard disk looks likely to fail, and you should back up your data as soon as possible. This is a friendly warning from the SMART error detection feature in your computer’s BIOS.

SMART Helper

SMART is the acronym for Self-Monitoring, Analysis and Reporting Technology, a system that all modern ATA and SCSI hard disks, as well as SCSI tape drives, should have. Besides logging measured values and errors, SMART has device testing features. Of course, it is a good thing to know about an impending disaster well in advance, and this is exactly what the smartmontools package [1] does for you. It accesses the SMART feature provided by your hard disks and runs a daemon called smartd to provide automated checks.

The package is available for current versions of Linux, FreeBSD, NetBSD, Solaris, Darwin, and even Microsoft Windows. You can download the Linux version as a binary RPM or as source code from [1], and an installation guide is available from the same address. Smartmontools supports versions 3 through 7 of the ATA/ATAPI standard. Its predecessor was a software package called Smartsuite, whose development was discontinued back in September 2001.

Listing 1 shows you an interactive approach to displaying basic information for a device using smartctl. The failing hard disk in our example is attached to the IDE bus as the primary master; it is (or should be) accessible as /dev/hda. These commands all require root privileges, since non-privileged users do not have access to device files.

Listing 1: Device Information
# smartctl -i /dev/hda
smartctl version 5.32 Copyright (C) 2002-4 Bruce Allen
=== START OF INFORMATION SECTION ===
Device Model:     Maxtor 34098H4
Serial Number:    L4101EJC
Firmware Version: YAH814Y0
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   6
ATA Standard is:  ATA/ATAPI-6 T13 1410D revision 0
Local Time is:    Tue Aug 10 12:17:11 2004 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

Listing 1 tells you that the drive is a 34098H4 model by Maxtor; its serial number is L4101EJC; it has firmware version YAH814Y0; and it complies with version 6 of the ATA/ATAPI standard. The serial number can be very important when it comes to claiming a replacement drive within the warranty period.

SMART support is enabled, as you can see in the last line of the report, which reflects the BIOS setting. If the last line tells you that SMART support is disabled, you need to enter smartctl -s on <device> to enable the SMART support feature.

And How Are We Feeling Today?

The smartctl -H flag queries the health status of a device. As you can see in Listing 2, this is FAILED for our hard disk, and that explains the warning displayed on booting the machine. The hard disk’s embedded SMART logic has already reallocated 637 defective sectors on the disk. This is an automatic feature that remaps damaged sectors to a small pool of reserved sectors. Unfortunately, the supply of reserved sectors is now exhausted. If the hard disk had checked out okay, the result would have been PASSED.

Listing 2: Querying the health status
# smartctl -H /dev/hda
smartctl version 5.32 Copyright (C) 2002-4 Bruce Allen
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.
Failed Attributes:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   001   001   063    Pre-fail  Always   FAILING_NOW 637

The ATA standard defines a whole range of attributes that describe a device’s technical properties [3], and manufacturers can add proprietary extensions. Manufacturers may differ in the way they store the data, particularly for attribute #9, which records a device’s operating hours. To avoid complications, smartmontools has a database with the attributes for a number of models. If your drive model is not in the latest version of the smartmontools database, you can check the smartmontools-database [2] mailing list for an update. The smartmontools homepage tells you how to do so.

The -A option tells smartctl to output the values of a device’s attributes (see Listing 3). You can then compare the latest values with the manufacturer’s threshold values. The VALUE column has the current value, WORST has the worst measured value so far, and THRESH has the factory threshold value. A current value equal to, or below, the threshold value indicates an error condition for the attribute in question. In our example (Listing 2), the threshold value for the Reallocated_Sector_Ct attribute is 63. The current value of 1 is way below that value, and this is what causes the FAILING_NOW message in the WHEN_FAILED column.

Listing 3: Querying Attributes
# smartctl -A /dev/hda
smartctl version 5.32 Copyright (C) 2002-4 Bruce Allen
=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000a   253   252   000    Old_age   Always   -           41
  3 Spin_Up_Time            0x0027   222   222   063    Pre-fail  Always   -           4458
  4 Start_Stop_Count        0x0032   253   253   000    Old_age   Always   -           35
  5 Reallocated_Sector_Ct   0x0033   001   001   063    Pre-fail  Always   FAILING_NOW 637
  6 Read_Channel_Margin     0x0001   253   253   100    Pre-fail  Offline  -           0
  7 Seek_Error_Rate         0x000a   253   252   000    Old_age   Always   -           0
  8 Seek_Time_Performance   0x0027   252   246   187    Pre-fail  Always   -           38203
  9 Power_On_Minutes        0x0032   253   253   000    Old_age   Always   -           16h+46m
[...]

The attribute type is also important in evaluating the health state of a device. The attribute type is displayed in the TYPE column in Listing 3, for example. Old_age attributes are characteristic of a normal aging process. Attributes with a Pre-fail tag are more serious: if one of them drops to its threshold, a failure is imminent, and you should look to replace the disk in the near future.

Online and Offline Tests

SMART-enabled hard disks can collect values for specified attributes and store them without affecting performance. There are a few examples of this in Listing 3: the attributes with the Always label in the UPDATED column. The smartmontools manpage refers to this approach as an online test.

You can run smartctl with the -c flag to discover which SMART methods your device supports and how much time the tests take (see Figure 1). The same command indicates the test progress in the Self-test execution status section.

Figure 1: SMART capability report for a disk.

The so-called offline test can affect disk performance, although you might not notice this in real-life conditions, as the test is interrupted if the operating system needs regular access to the disk. To launch the offline test, type smartctl -t offline <device>.

To start periodic offline testing, usually every four hours, you specify the -o on option, assuming that the manufacturer supports this feature. You can check your device’s SMART capabilities to find out (see Figure 1). The Offline data collection capabilities should have an entry for Auto Offline data collection on/off support. No matter whether you run the offline test explicitly or automatically, it will update the attributes labeled as Offline in the UPDATED column (Listing 3).

[...]ical address of the first defective block in the LBA column. As you can see in line #2 of Listing 4, this test was aborted by my typing smartctl -X /dev/hdb.

Besides the self-test logfile, there is also an error log that gives you details on the last five errors that have occurred. The smartctl -l error option displays the error log.

Automated Monitoring with smartd

All of the smartctl commands we have looked at thus far need to be launched explicitly by the admin user, and that [...]

smartd can be configured via entries in /etc/smartd.conf. If this file does not exist, or if it contains an entry for DEVICESCAN, smartd will attempt to detect the ATA and SCSI devices in your system when launched, that is, it will look for /dev/hda through /dev/hdt and /dev/sda through /dev/sdz. It then enables SMART for any devices it finds and launches into the monitoring process.

Fine Tuning with smartd.conf

Entries below the DEVICESCAN keyword specify the kind of monitoring to be per-