Windows Hardware Error Architecture Predictive Failure Analysis

May 5, 2010

Abstract

This paper provides information about the predictive failure analysis (PFA) that Windows Hardware Error Architecture (WHEA) performs for the Windows® family of operating systems. It provides guidelines for system administrators to understand and manage this feature.

This information applies to the following operating systems:
Windows Server® 2008 R2
Windows 7

References and resources discussed here are listed at the end of this paper.

The current version of this paper is maintained on the Web at:
http://www.microsoft.com/whdc/system/pnppwr/whea/WHEA_PFA.mspx

Disclaimer: This document is provided “as-is”. Information and views expressed in this document, including URL and other Internet Web site references, may change without notice. You bear the risk of using it.

This document does not provide you with any legal rights to any intellectual property in any Microsoft product. You may copy and use this document for your internal, reference purposes.

© 2010 Microsoft Corporation. All rights reserved.

Document History

Date           Change
May 5, 2010    First publication

Introduction

To comply with the Windows® Logo Program requirements for Windows Server® 2008 R2, server hardware must be equipped with error-correction code (ECC) memory, which can automatically correct certain types of memory errors. In systems that have ECC memory, information about the corrected errors and their frequency is available to the operating system and can be used to predict impending failures, including uncorrectable failures that can be catastrophic and lead to unplanned downtime.

The Windows Hardware Error Architecture (WHEA) implementation for Windows Server 2008 R2 includes a new predictive failure analysis (PFA) feature that uses this information to predict and manage memory errors. This paper describes the implementation of the PFA feature, the settings that control the behavior of the PFA feature, and how a system administrator can manage the PFA policy and the list of bad memory pages.

For more information about WHEA, see “WHEA” on the MSDN® website. See also the “WHEA Platform Design Guide,” which is available to Microsoft partners under NDA.

WHEA Predictive Failure Analysis

A system that is equipped with ECC memory notifies Windows through the WHEA infrastructure that a corrected memory error occurred. The PFA feature of WHEA maintains a list of the physical memory pages in which each corrected ECC memory error occurred. If the number of corrected errors for a particular physical memory page exceeds a specified threshold within a specified time interval, the WHEA PFA feature predicts that the memory page is likely to experience more errors, some of which could be uncorrected errors.

Note: A PFA for memory can be performed by the platform instead of by the WHEA PFA feature, as described in the next section of this paper.

When a memory page is predicted to fail, WHEA requests the Windows memory manager to take the memory page offline and discontinue its use. Preventing further use of a memory page eliminates the risk of a more catastrophic failure if an uncorrectable error occurs in that memory page.

Depending on the current status and use of the memory page when WHEA requests to take it offline, the memory manager might or might not be able to take the memory page offline immediately. WHEA generates an Event Tracing for Windows (ETW) event that indicates whether the memory page was immediately taken offline. If the memory manager cannot take a memory page offline immediately, it takes that page offline at the first possible opportunity.

In addition, WHEA stores the list of physical memory pages that are predicted to fail in the Boot Configuration Data (BCD) system store. This list includes all memory pages that were predicted to fail, regardless of whether the memory manager successfully took the pages offline. Windows will not use any memory pages in this list during any subsequent session of the operating system until the list is cleared from the BCD.
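The threshold-within-an-interval rule described above can be illustrated with a small Python sketch. This is not the WHEA implementation. The class and method names are hypothetical, the default numbers are borrowed from Table 1 later in this paper, and the limit on the number of monitored pages is left out for brevity.

```python
from dataclasses import dataclass, field


@dataclass
class PageRecord:
    count: int = 0            # corrected errors seen while the page is monitored
    last_error: float = 0.0   # time, in seconds, of the most recent error


@dataclass
class PfaTracker:
    """Illustrative model only; names and structure are assumptions."""
    threshold: int = 16        # errors allowed before the page is flagged
    timeout_s: float = 86400   # 0 means a monitored page never times out
    pages: dict = field(default_factory=dict)      # PFN -> PageRecord
    predicted: list = field(default_factory=list)  # PFNs predicted to fail

    def report_corrected_error(self, pfn: int, now: float) -> bool:
        """Record one corrected error; return True if the page is flagged."""
        rec = self.pages.get(pfn)
        # Stop monitoring a page whose latest error is older than the timeout.
        if (rec is not None and self.timeout_s > 0
                and now - rec.last_error > self.timeout_s):
            rec = None
        if rec is None:
            # Monitoring starts with the first corrected error on the page.
            rec = PageRecord()
            self.pages[pfn] = rec
        rec.count += 1
        rec.last_error = now
        if rec.count > self.threshold:
            # The count exceeds the threshold: predict failure, stop monitoring.
            del self.pages[pfn]
            if pfn not in self.predicted:
                self.predicted.append(pfn)
            return True
        return False
```

In this model, a caller would pass each corrected-error notification to report_corrected_error and, when it returns True, ask the memory manager to take the page offline.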

For more specific information about how WHEA performs a PFA, see “PFA Performed by WHEA” and “How WHEA Performs PFA on ECC Memory” on the MSDN website.

Platform Predictive Failure Analysis

System vendors can provide platform-specific hardware error driver (PSHED) plug-ins with their hardware to participate in the error handling flow. A PSHED plug-in can be used to perform a PFA for memory instead of using the WHEA PFA feature. In this situation, the PSHED plug-in monitors all memory errors and determines when a particular memory page should be taken offline. The PSHED plug-in indicates to WHEA when a memory page should be taken offline, and WHEA in turn makes a request to the memory manager to take the memory page offline, as described earlier in this paper.

Note: The settings that control the WHEA PFA policy do not apply when a PFA is implemented in a PSHED plug-in. In this situation, system vendors must provide their own mechanisms to specify the settings that control the PFA policy.

For more information about how a PSHED plug-in performs a PFA, see “PFA Performed by a PSHED Plug-In” on the MSDN website. For more information about how to implement a PSHED plug-in, see “Platform-Specific Hardware Error Driver Plug-Ins” on the MSDN website. See also the “WHEA Platform Design Guide,” which is available to Microsoft partners under NDA.

Predictive Failure Analysis Administration

The following sections discuss how to administer the WHEA PFA feature.

WHEA PFA Policy

The policy for WHEA PFA is configured through registry values. WHEA reads these values when a system starts. If you change these settings, you must restart the system for the new values to take effect.

Note: The registry values that are described later in this paper should be used only by WHEA PFA. If a PSHED plug-in performs PFA and uses the registry to store its configuration settings, the plug-in must use registry values that are different from those that WHEA PFA uses.

The WHEA PFA registry keys and values are not created by default. If you change the default WHEA PFA behavior, you must create these registry keys and the necessary registry values.

The WHEA PFA configuration settings are located in the following registry key:

    HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\WHEA\Policy

Table 1 describes each WHEA PFA registry value. All these values are of type REG_DWORD.

Table 1. WHEA PFA Registry Values

DisableOffline
A Boolean value that specifies whether WHEA will make a request to the memory manager to take a memory page offline whenever a PFA predicts that the memory page will fail.
Note: The DisableOffline value applies to memory pages that are predicted to fail because of a PFA that either WHEA or a PSHED plug-in performed.
A value of 0 means that WHEA will request the memory manager to take the memory page offline. Any other value means that WHEA will not request the memory manager to take the memory page offline.
The default value for this setting is 0.

MemPersistOffline
A Boolean value that specifies whether the memory pages that WHEA requested to be taken offline persist in the BCD system store. If the memory pages are saved in the BCD system store, Windows will not use those memory pages during any subsequent session of the operating system.
Note: The MemPersistOffline value applies to memory pages that are requested to be taken offline because of a PFA that either WHEA or a PSHED plug-in performed. It also applies to memory pages that are taken offline because of an uncorrected memory error.
A value of 0 means that WHEA will not save the memory pages in the BCD system store. Any other value means that WHEA will save the memory pages in the BCD system store.
The default value for this setting is 1 for Windows Server and 0 for Windows client.

MemPfaDisable
A Boolean value that specifies whether the WHEA PFA feature for corrected memory errors is disabled.
A value of 0 means that the WHEA PFA feature is enabled. Any other value means that the WHEA PFA feature is disabled.
The default value for this setting is 0.

MemPfaPageCount
A value that specifies the maximum number of memory pages that WHEA PFA monitors for corrected errors.
This value can be between 1 and 65536. The default value for this setting is 64. If this value is set to a number that is outside the allowable range, the default value is used.

MemPfaThreshold
A value that specifies the maximum number of corrected memory errors that are allowed on a memory page before WHEA PFA predicts that the memory page will fail again.
This value can be between 1 and 65536. The default value for this setting is 16. If this value is set to a number that is outside the allowable range, the default value is used.

MemPfaTimeout
A value, in units of seconds, that specifies how long WHEA PFA monitors a memory page. WHEA PFA begins to monitor a memory page when the first corrected error is reported for that memory page. WHEA PFA stops monitoring a memory page when the last reported corrected error for that memory page is older than the MemPfaTimeout value or when the number of corrected memory errors for the page exceeds the MemPfaThreshold value.
This value can be between 0 and 604800 (7 days). A value of zero indicates that the monitored memory pages will never time out. The default value for this setting is 86400 (24 hours). If this value is set to a number that is outside the allowable range, the default value is used.

The following two legacy registry values are supported for application compatibility reasons:

SingleBitEccErrorThreshold
This value corresponds to the MemPfaThreshold value.

MaxCorrectedMCEOutstanding
This value corresponds to the MemPfaPageCount value.

Whenever possible, you should use the registry values in Table 1.
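As a planning aid, the range and default rules in Table 1 can be modeled in a few lines of Python. This is a sketch of the documented rules, not of how WHEA reads the registry. The function names effective_policy and reg_add_command are hypothetical, the legacy values are ignored because this paper does not state their precedence, and the generated command uses the standard reg add syntax and is returned as text rather than executed.

```python
# (minimum, maximum, default) for the range-checked values in Table 1.
RANGES = {
    "MemPfaPageCount": (1, 65536, 64),
    "MemPfaThreshold": (1, 65536, 16),
    "MemPfaTimeout": (0, 604800, 86400),
}

POLICY_KEY = r"HKLM\SYSTEM\CurrentControlSet\Control\WHEA\Policy"


def effective_policy(raw: dict, is_server: bool = True) -> dict:
    """Return the values that Table 1 implies for the given registry data.

    raw maps value names to integers; missing names take their defaults.
    """
    policy = {}
    for name, (low, high, default) in RANGES.items():
        value = raw.get(name, default)
        # An out-of-range setting falls back to the default value.
        policy[name] = value if low <= value <= high else default
    bool_defaults = {
        "DisableOffline": 0,
        "MemPfaDisable": 0,
        # Persistence defaults to 1 on Windows Server and 0 on Windows client.
        "MemPersistOffline": 1 if is_server else 0,
    }
    for name, default in bool_defaults.items():
        # For the Boolean values, 0 is one state and any other value the other.
        policy[name] = 1 if raw.get(name, default) != 0 else 0
    return policy


def reg_add_command(name: str, value: int) -> str:
    """Build (but do not run) a reg add command for one REG_DWORD value."""
    return f'reg add "{POLICY_KEY}" /v {name} /t REG_DWORD /d {value} /f'
```

For example, effective_policy({"MemPfaThreshold": 0}) reports a threshold of 16, because 0 is outside the allowable range. A command built by reg_add_command would have to be run from an elevated command prompt, and the system restarted, before a new value takes effect.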

Persistence of PFA Results

If persistence is enabled, whenever PFA predicts that a memory page is likely to fail based on the current PFA policy settings, the page frame number (PFN) for the memory page is stored in a list in the {badmemory} object in the BCD system store. The list that is stored in this object contains the PFNs for all memory pages that the PFA has predicted are likely to fail. Whenever the Windows operating system is started, it does not use the memory pages in this list.

Note: There is no industry standard for mapping a physical memory PFN to a specific physical memory module. Thus, WHEA cannot provide information about which memory modules are failing.

There is no automated mechanism for clearing this list from the BCD system store. When the failing system memory is replaced, a system administrator must clear this list manually by using the BCDEdit command-line tool as described in the next section of this paper or through the BCD WMI interface. If the list is not cleared, Windows will continue to not use the memory pages in the list, even if the failing memory modules have been replaced.

For more information about the BCD system store, see “Boot Configuration Data in Windows Vista” on the WHDC website.
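Although a PFN cannot be mapped to a memory module in a standard way, it can be converted to the range of physical addresses that the page covers, which may help when comparing against platform-specific diagnostics. The following Python sketch assumes 4-KB pages (a page shift of 12), which is an assumption about the platform and not something this paper states. The function name is hypothetical.

```python
PAGE_SHIFT = 12  # assumption: 4-KB pages; other page sizes need another shift


def pfn_to_address_range(pfn: int) -> tuple:
    """Return the first and last physical byte address covered by a PFN."""
    start = pfn << PAGE_SHIFT
    end = start + (1 << PAGE_SHIFT) - 1
    return start, end


first, last = pfn_to_address_range(0x100f)
print(f"PFN 0x100f covers {first:#x} through {last:#x}")
```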

Managing the PFA Memory List

A system administrator can manage the list of memory pages that are saved in the BCD system store by using the BCDEdit command-line tool. As previously mentioned, a system administrator must manually clear this list when the failing system memory is replaced. The administrator can view and delete the list of PFA pages by using the following procedures.

To access the BCD system store, BCDEdit must be run from a command window with elevated privileges.

To open a command window with elevated privileges
1. Click Start, point to All Programs, and then click Accessories.
2. Right-click Command Prompt and select Run as administrator.
3. Click Yes in the User Account Control dialog box, if it appears.

To view the current list of PFNs in the BCD system store, run the following command:

    C:\Windows\system32>bcdedit /enum {badmemory}

    RAM Defects
    -----------
    identifier              {badmemory}

In the preceding example, the {badmemory} object in the BCD system store does not include a badmemorylist value, which indicates that no memory pages are predicted to fail.

If the {badmemory} object contains a badmemorylist value, this value contains the list of PFNs for the memory pages that PFA predicted would fail, as shown in the following example:

    C:\Windows\system32>bcdedit /enum {badmemory}

    RAM Defects
    -----------
    identifier              {badmemory}
    badmemorylist           0xffe38 0x100f

To clear the list of PFNs in the BCD system store, run the following command:

    C:\Windows\system32>bcdedit /deletevalue {badmemory} badmemorylist

Note: Improper changes to the BCD system store can prevent Windows from starting. Therefore, you must review the commands and their results carefully before you restart Windows.

For more information about the BCDEdit command-line tool, see “BCDEdit Commands for Boot Environment” on the WHDC website.
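An administrator who wants to record the bad-page list before clearing it could capture the output of bcdedit /enum {badmemory} and extract the PFNs with a short script. The following Python sketch only parses text that has already been captured and does not run BCDEdit. The function name is hypothetical, and the parser assumes the output layout shown in the examples earlier in this section, with tolerance for values that continue on indented lines.

```python
import re

HEX_VALUE = re.compile(r"0x[0-9a-fA-F]+")


def parse_bad_memory_list(output: str) -> list:
    """Return the PFNs listed under badmemorylist, in the order shown."""
    pfns = []
    in_list = False
    for line in output.splitlines():
        tokens = line.split()
        if not tokens:
            in_list = False          # a blank line ends the element
            continue
        if tokens[0].lower() == "badmemorylist":
            in_list = True
            tokens = tokens[1:]      # values may follow on the same line
        elif not line[0].isspace():
            in_list = False          # a different element name starts here
        if in_list:
            pfns.extend(int(t, 16) for t in tokens if HEX_VALUE.fullmatch(t))
    return pfns


captured = """RAM Defects
-----------
identifier              {badmemory}
badmemorylist           0xffe38 0x100f
"""
print([hex(pfn) for pfn in parse_bad_memory_list(captured)])
```

An empty result corresponds to a {badmemory} object that has no badmemorylist value.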

Resources

BCDEdit Commands for Boot Environment
http://www.microsoft.com/whdc/system/platform/firmware/bcdedit_reff.mspx

Boot Configuration Data in Windows Vista
http://www.microsoft.com/whdc/system/platform/firmware/bcd.mspx

How WHEA Performs PFA on ECC Memory
http://msdn.microsoft.com/en-us/library/ff559390.aspx

PFA Performed by a PSHED Plug-In
http://msdn.microsoft.com/en-us/library/ff559445.aspx

PFA Performed by WHEA
http://msdn.microsoft.com/en-us/library/ff559450.aspx

Platform-Specific Hardware Error Driver Plug-Ins
http://msdn.microsoft.com/en-us/library/ff559456.aspx

WHEA – Windows Hardware Error Architecture
http://www.microsoft.com/whdc/system/pnppwr/WHEA/default.mspx

WHEA Documentation in the WDK
http://msdn.microsoft.com/en-us/library/ff559509.aspx

WHEA Platform Design Guide
Available to Microsoft partners under NDA.
