SYSADMIN smartmontools

Monitoring Hard Disks with smartmontools

Crash Prevention

Hard disks don’t always die as suddenly as you might think. The right tools can help you detect hard disk issues before they become critical.

BY GABRIELE POHL

Hard disks have moving parts: disks that rotate at 5400, 7200, or more revolutions per minute, and heads that are subject to extreme acceleration and deceleration. Because they have moving parts, hard disks are subject to wear and tear. Manufacturers typically estimate a mean failure time for their products, but this is a purely statistical value, and they provide no guarantee that the disk will not die on you within the first month. And Murphy’s law says that your hard disk is more likely to die when you are not expecting it to do so; that is, you will probably lose it when you do not have a recent backup or when spare parts are hard to come by.

If you are lucky, your computer might output a message like the following when you boot it:

    SMART Failure Predicted on Primary Master: Maxtor 34098H4
    Warning! Immediately back-up your data and replace your hard disk drive.
    A Failure may be imminent.

The message is clear. Your hard disk looks likely to fail, and you should back up your data as soon as possible. This is a friendly warning from your computer BIOS’s SMART error detection feature.

SMART Helper

SMART is the acronym for Self-Monitoring, Analysis and Reporting Technology, a system that all modern ATA and SCSI hard disks, as well as SCSI tape drives, should have. Besides logging measured values and errors, SMART has device testing features. Of course it is a good thing to know about an impending disaster well in advance. And this is exactly what the smartmontools package [1] does for you. It accesses the SMART feature provided by your hard disks and runs a daemon called smartd to provide automated checks.

The package is available for current versions of Linux, FreeBSD, NetBSD, Solaris, Darwin, and even Windows. You can download the Linux version as a binary RPM or as source code from [1], and an installation guide is available from the same address. Smartmontools supports versions 3 through 7 of the ATA/ATAPI standard. Its predecessor was a software package called Smartsuite. Development of Smartsuite was discontinued back in September 2001.

Listing 1 shows you an interactive approach to displaying basic information for a device using smartctl. The failing hard disk in our example is attached to the IDE bus as the primary master; it is (or should be) accessible as /dev/hda. These commands all require root privileges, since non-privileged users do not have access to device files.

Listing 1: Device Information

    # smartctl -i /dev/hda
    smartctl version 5.32 Copyright (C) 2002-4 Bruce Allen
    === START OF INFORMATION SECTION ===
    Device Model:     Maxtor 34098H4
    Serial Number:    L4101EJC
    Firmware Version: YAH814Y0
    Device is:        In smartctl database [for details use: -P show]
    ATA Version is:   6
    ATA Standard is:  ATA/ATAPI-6 T13 1410D revision 0
    Local Time is:    Tue Aug 10 12:17:11 2004 CEST
    SMART support is: Available - device has SMART capability.
    SMART support is: Enabled

Listing 1 tells you that the drive is a 34098H4 model by Maxtor; its serial number is L4101EJC; it has version YAH814Y0 of the firmware; and it complies with version 6 of the ATA/ATAPI standard. The serial number can be very important when it comes to claiming a replacement drive within the warranty period.

SMART support is enabled, as you can see in the last line of the report, which reflects the BIOS setting. If the last line tells you that SMART support is disabled, you need to enter smartctl -s on, followed by the device name, to enable the SMART support feature.

And How Are We Feeling Today?

Setting the smartctl -H flag queries the health status of a device. As you can see in Listing 2, this is FAILED for our hard disk, and that explains the warning displayed on booting the machine.
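A script can make the same health check without anybody reading the report. What follows is only a minimal sketch: it assumes the report contains a line of the form "SMART overall-health self-assessment test result: ..." as in the example output in this article, and the function name health_status is our own invention, not part of smartmontools. Other device types or smartctl versions may word the report differently.

```shell
#!/bin/sh
# health_status: read a "smartctl -H" report on stdin and print the verdict
# (for example PASSED or FAILED). Hypothetical helper, not part of smartmontools.
# Assumption: the report has a line "... self-assessment test result: <VERDICT>".
health_status() {
    awk -F': *' '
        /self-assessment test result/ {
            verdict = $2
            gsub(/[^A-Za-z]/, "", verdict)   # drop the trailing "!" and any spaces
            print verdict
            found = 1
            exit
        }
        # No verdict line at all: say so instead of printing nothing.
        END { if (!found) print "UNKNOWN" }
    '
}
```

On a real system you would feed the function live data as root, for example with smartctl -H /dev/hda | health_status, and treat anything other than PASSED as a reason to back up at once.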

62 December 2004 www.linux-magazine.com smartmontools SYSADMIN

The hard disk’s embedded SMART logic has already reallocated 637 defective sectors on the disk. This is an automatic feature that remaps damaged sectors to a small number of reserved sectors. Unfortunately, the supply of reserved sectors is now exhausted. If the hard disk had checked out okay, the result would have been PASSED.

Listing 2: Querying the health status

    # smartctl -H /dev/hda
    smartctl version 5.32 Copyright (C) 2002-4 Bruce Allen
    === START OF READ SMART DATA SECTION ===
    SMART overall-health self-assessment test result: FAILED!
    Drive failure expected in less than 24 hours. SAVE ALL DATA.
    Failed Attributes:
    ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
      5 Reallocated_Sector_Ct   0x0033   001   001   063    Pre-fail  Always   FAILING_NOW 637

The ATA standard defines a whole set of attributes that describe a device’s technical properties [3], and manufacturers can add proprietary extensions. There may be some differences in the way the data are stored; this applies in particular to attribute #9, which specifies a device’s operating hours. To avoid complications, smartmontools has a database with the attributes for a number of models. If your drive model is not in the latest version of the smartmontools database, you can check the smartmontools-database [2] mailing list for an update. The smartmontools homepage tells you how to do so.

The -A option tells smartctl to output the values of a device’s attributes (see Listing 3). You can then compare the latest values with the manufacturer’s threshold values. The VALUE column has the current value, WORST has the worst measured value so far, and THRESH has the factory threshold value. A current value equal to, or below, the threshold value indicates an error condition for the attribute in question. In our example (Listing 2), the threshold value for the Reallocated_Sector_Ct attribute is 63. The current value of 1 is way below that value, and this is what causes the FAILING_NOW message in the WHEN_FAILED column.

The attribute type is also important in evaluating the health state of a device. It is displayed in the TYPE column, in Listing 3 for example. Old_age attributes are characteristic of a normal aging process. Attributes with a Pre-fail tag are more serious: when one of them drops to or below its threshold, failure is imminent. In this case, you should look to replace the disk in the near future.

Online and Offline Tests

SMART-enabled hard disks can collect values for specified attributes and store them without affecting performance. There are a few examples of this in Listing 3: the attributes with the Always label in the UPDATED column. The smartmontools manpage refers to this approach as an online test.

You can run smartctl with the -c flag to discover which SMART methods your device supports and how much time the tests take (see Figure 1). The same command indicates the test progress in the Self-test execution status section.

Figure 1: SMART capability report for a disk.

Listing 3: Querying Attributes

    # smartctl -A /dev/hda
    smartctl version 5.32 Copyright (C) 2002-4 Bruce Allen
    === START OF READ SMART DATA SECTION ===
    SMART Attributes Data Structure revision number: 16
    Vendor Specific SMART Attributes with Thresholds:
    ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
      1 Raw_Read_Error_Rate     0x000a   253   252   000    Old_age   Always       -       41
      3 Spin_Up_Time            0x0027   222   222   063    Pre-fail  Always       -       4458
      4 Start_Stop_Count        0x0032   253   253   000    Old_age   Always       -       35
      5 Reallocated_Sector_Ct   0x0033   001   001   063    Pre-fail  Always   FAILING_NOW 637
      6 Read_Channel_Margin     0x0001   253   253   100    Pre-fail  Offline      -       0
      7 Seek_Error_Rate         0x000a   253   252   000    Old_age   Always       -       0
      8 Seek_Time_Performance   0x0027   252   246   187    Pre-fail  Always       -       38203
      9 Power_On_Minutes        0x0032   253   253   000    Old_age   Always       -       16h+46m
    [...]
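The comparison of VALUE against THRESH can be left to a small filter. This is a sketch only, assuming the ten-column table layout shown in Listing 3. The name failing_attributes is hypothetical, and the TYPE printed with each hit lets you tell Pre-fail from Old_age attributes.

```shell
#!/bin/sh
# failing_attributes: read "smartctl -A" output on stdin and print every
# attribute whose normalized VALUE is at or below its THRESH.
# Hypothetical helper; assumes the column order of Listing 3:
#   ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
failing_attributes() {
    awk '
        # Attribute rows start with a numeric ID and carry a hex FLAG in column 3;
        # banner lines, the table header, and "[...]" do not match.
        $1 ~ /^[0-9]+$/ && $3 ~ /^0x/ {
            value  = $4 + 0     # "+ 0" turns the string "001" into the number 1
            thresh = $6 + 0
            if (value <= thresh)
                printf "%s %s value=%d thresh=%d\n", $2, $7, value, thresh
        }
    '
}
```

For the data in Listing 3, the only attribute reported is Reallocated_Sector_Ct (Pre-fail, value 1, threshold 63). With a live drive, the input would come from smartctl -A /dev/hda, run as root.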


The so-called offline test can affect the disk performance, although you might not notice this in real-life conditions, as the test is interrupted if the operating system needs regular access to the disk. To launch the offline test, type smartctl -t offline followed by the device name.

To start periodic offline testing (usually every four hours), you specify the -o on option, assuming that the manufacturer supports this feature. You can check your device’s SMART capabilities to find out (see Figure 1). The Offline data collection capabilities section should have an entry for Auto Offline data collection on/off support. No matter whether you run the offline test explicitly or automatically, it will update the attributes labeled as Offline in the UPDATED column (Listing 3).

While online and offline tests only collect data, the self-diagnosis test checks the electrical and mechanical characteristics of the device and the read throughput. You can run the self-diagnosis feature while working on a machine; again, the test is interrupted if the operating system needs regular access to the device. If you prefer to disable the device for the duration of the test, you can specify the -C option to do so. This so-called captive mode should only be used for media that are not mounted.

A short self-diagnosis (-t short) takes a few minutes; a detailed self-diagnosis can take an hour or more depending on the size of the disk you are testing. To launch the extended test, run smartctl with the -t long option.

The device stores the test results in a self-test logfile. You can tell smartctl to display the logfile (see Listing 4) by specifying the -l selftest option. Details of recent tests are displayed at the top of the list. The test automatically stops if it detects an error. The Remaining column then tells you how much of the test was not processed as a percentage. If the test detects a hard disk error, it stores the logical address of the first defective block in the LBA_of_first_error column. As you can see in line #2 of Listing 4, this test was aborted by my typing smartctl -X /dev/hdb.

Listing 4: Querying the Self-Test Logfile

    # smartctl -l selftest /dev/hdb
    smartctl version 5.32 Copyright (C) 2002-4 Bruce Allen
    === START OF READ SMART DATA SECTION ===
    SMART Self-test log structure revision number 1
    Num  Test_Description    Status                   Remaining  LifeTime(hours)  LBA_of_first_error
    # 1  Extended offline    Completed without error        00%              453  -
    # 2  Extended offline    Aborted by host                90%              451  -
    # 3  Short offline       Completed without error        00%              451  -

Besides the self-test logfile, there is also an error log that gives you details on the last five errors that have occurred. The smartctl -l error option displays the error log.

Automated Monitoring with smartd

All of the smartctl commands we have looked at thus far need to be launched explicitly by the admin user, and that makes them unsuitable for long-term monitoring tasks. Enter smartd, a daemon that automatically queries SMART attributes and performs self-tests as specified by the admin. smartd logs status and error messages in the syslog and optionally sends email warnings if critical errors are found.

It makes sense to launch smartd when you boot your machine and stop the daemon when you shut down. To support this, smartmontools includes a smartd shell script in addition to the /usr/sbin/smartctl and /usr/sbin/smartd executables. The script is typically stored below /etc/rc.d/init.d. You can run chkconfig --add smartd to set links at the required runlevels. Suse Linux users may need to move the smartd shell script from /etc/rc.d/init.d to /etc/init.d before doing so!

To launch smartd with specific options, you need to modify the smartd_opts variable in the script to reflect your requirements. For example, the --interval N option specifies the interval, in seconds, at which smartd checks the devices. The default is 1800 seconds, that is, 30 minutes. You can run smartd --help to display your options, and the manpage (man smartd) has more details.

smartd can be configured via entries in /etc/smartd.conf. If this file does not exist, or if it contains an entry for DEVICESCAN, smartd will attempt to detect the ATA and SCSI devices in your system when launched; that is, it will look for /dev/hda through /dev/hdt and /dev/sda through /dev/sdz. It then enables SMART for any devices it finds and launches into the monitoring process.

Fine Tuning with smartd.conf

Directives that follow the DEVICESCAN keyword specify the kind of monitoring to be performed for each device. If you do not add specific directives, the defaults apply for each device type (ata, scsi, 3ware). Run man smartd.conf for details of the defaults.

It makes sense not to monitor attributes that are subject to frequent changes, such as the temperature (attributes #194 and #231) or the operating hours (attribute #9); this ensures that your syslog stays readable. To stop smartd from tracking these attributes, add -I 194 -I 231 -I 9 to the entry in the smartd.conf file. For an email alert, add -m followed by your email address.

Mitigating the Danger of Failures

Of course you cannot prevent system failures simply by using smartd, as a disk can fail without any warning. But even then, hard disk monitoring can still be useful. It is not always easy to detect a failure. In a worst case scenario, the failure might go unnoticed until your next regular hard disk check, and that would mean even more loss. That is all the more reason to use smartmontools.

INFO

[1] smartmontools homepage: http://smartmontools.sourceforge.net/
[2] smartmontools-database mailing list homepage: https://lists.sourceforge.net/lists/listinfo/smartmontools-database
[3] Guide to SMART attributes: http://freepgs.com/smart/attributes.php

THE AUTHOR

Gabriele Pohl is an Oracle DBA, Linux administrator, IT coach, and consultant (http://www.dipohl.com/). Thanks to Steffen Grunewald and Bruce Allen for their support!
