Enable PCI Express Advanced Error Reporting in the Kernel

Enable PCI Express Advanced Error Reporting in the Kernel

Enable PCI Express Advanced Error Reporting in the Kernel Yanmin Zhang and T. Long Nguyen Intel Corporation [email protected], [email protected] Abstract Express, it is far easier for individual developers to get such a machine and add error recovery code into specific PCI Express is a high-performance, general-purpose I/O device drivers. Interconnect. It introduces AER (Advanced Error Re- porting) concepts, which provide significantly higher re- liability at a lower cost than the previous PCI and PCI-X 2 PCI Express Advanced Error Reporting standards. The AER driver of the Linux kernel provides Driver a clean, generic, and architecture-independent solution. As long as a platform supports PCI Express, the AER 2.1 PCI Express Advanced Error Reporting Topol- driver shall gather and manage all occurred PCI Express ogy errors and incorporate with PCI Express device drivers to perform error-recovery actions. To understand the PCI Express Advanced Error Report- This paper is targeted toward kernel developers inter- ing Driver architecture, it helps to begin with the ba- ested in the details of enabling PCI Express device sics of PCI Express Port topology. Figure 1 illustrates drivers, and it provides insight into the scope of imple- two types of PCI Express Port devices: the Root Port menting the PCI Express AER driver and the AER con- and the Switch Port. The Root Port originates a PCI formation usage model. Express Link from a PCI Express Root Complex. The Switch Port, which has its secondary bus representing 1 Introduction switch internal routing logic, is called the Switch Up- stream Port. The Switch Port which is bridging from Current machines need higher reliability than before and switch internal routing buses to the bus representing need to recover from failure quickly. As one of failure the downstream PCI Express Link is called the Switch causes, peripheral devices might run into errors, or go Downstream Port. Each PCI Express Port device can crazy completely. If one device is crazy, device driver be implemented to support up to four distinct services: might get bad information and cause a kernel panic: the native hot plug (HP), power management event (PME), system might crash unexpectedly. advanced error reporting (AER), virtual channels (VC). As a matter of fact, IBM engineers (Linas Vepstas and The AER driver development is based on the service others) created a framework to support PCI error re- driver framework of the PCI Express Port Bus Driver covery procedures in-kernel because IBM Power4 and design model [3]. As illustrated in Figure 2, the PCI Power5-based pSeries provide specific PCI device er- Express AER driver serves as a Root Port AER service ror recovery functions in platforms [4]. However, this driver attached to the PCI Express Port Bus driver. model lacks the ability to support platform indepen- dence and is not easy for individual developers to get a Power machine for testing these functions. The PCI 2.2 PCI Express Advanced Error Reporting Driver Express introduces the AER, which is a world standard. Architecture The PCI Express AER driver is developed to support the PCI Express AER. First, any platform which supports PCI Express error signaling can occur on the PCI Ex- the PCI Express could use the PCI Express AER driver press link itself or on behalf of transactions initiated on to process device errors and handle error recovery ac- the link. PCI Express defines the AER capability, which cordingly. Second, as lots of platforms support the PCI is implemented with the PCI Express AER Extended • 297 • 298 • Enable PCI Express Advanced Error Reporting in the Kernel Root Port Root Complex CPU Root Root Interrupt Port Port Switch Upstream Root Complex Port Up Port Switch Root Downstream Port PCI Express Switch Port Down Down Port Port Up Port Switch Figure 1: PCI Express Port Topology Down Down Port Port Error Message PBD End Point Root Complex PMErs AERrs HPrs VCrs Root Root HPrs VCrs Port Port PMErs AERrs Figure 3: PCI Express Error Reporting procedures Claim When errors happen, the PCI Express AER driver could AER Port Service Driver provide such infrastructure with three basic functions: Figure 2: AER Root Port Service Driver • Gathers the comprehensive error information if er- rors occurred. • Performs error recovery actions. Capability Structure, to allow a PCI Express compo- nent (agent) to send an error reporting message to the • Reports error to the users. Root Port. The Root Port, a host receiver of all error messages associated with its hierarchy, decodes an er- ror message into an error type and an agent ID and then 2.2.1 PCI Express Error Introduction logs these into its PCI Express AER Extended Capabil- ity Structure. Depending on whether an error reporting Traditional PCI devices provide simple error reporting message is enabled in the Root Error Command Reg- approaches, PERR# and SERR#. PERR# is parity error, ister, the Root Port device generates an interrupt if an while SERR# is system error. All non-PERR# errors are error is detected. The PCI Express AER service driver SERR#. PCI uses two independent signal lines to rep- is implemented to service AER interrupts generated by resent PERR# and SERR#, which are platform chipset- the Root Ports. Figure 3 illustrates the error report pro- specific. As for how software is notified about the errors, cedures. it totally depends on the specific platforms. Once the PCI Express AER service driver is loaded, it To support traditional error handling, PCI Express pro- claims all AERrs service devices in a system device hi- vides baseline error reporting, which defines the basic erarchy, as shown in Figure 2. For each AERrs service error reporting mechanism. All PCI Express devices device, the advanced error reporting service driver con- have to implement this baseline capability and must map figures its service device to generate an interrupt when required PCI Express error support to the PCI-related an error is detected [3]. error registers, which include enabling error reporting 2007 Linux Symposium, Volume Two • 299 and setting status bits that can be read by PCI-compliant trol Protocol Errors, Completion Time-out Errors, Com- software. But the baseline error reporting doesn’t define pleter Abort Errors, Unexpected Completion Errors, Re- how platforms notify system software about the errors. ceiver Overflow Errors, Malformed TLPs, ECRC Er- rors, and Unsupported Request Errors. When an un- PCI Express errors consist of two types, correctable er- correctable error occurs, the corresponding bit within rors and uncorrectable errors. Correctable errors include the Advanced Uncorrectable Error Status register is set those error conditions where the PCI Express protocol automatically by hardware and is cleared by software can recover without any loss of information. A cor- when writing a “1” to the bit position. Advanced error rectable error, if one occurs, can be corrected by the handling permits software to select the severity of each hardware without requiring any software intervention. error within the Advanced Uncorrectable Error Severity Although the hardware has an ability to correct and re- register. This gives software the opportunity to treat er- duce the correctable errors, correctable errors may have rors as fatal or non-fatal, according to the severity asso- impacts on system performance. ciated with a given application. Software could use the Uncorrectable errors are those error conditions that im- Advanced Uncorrectable Mask register to mask specific pact functionality of the interface. To provide more errors. robust error handling to system software, PCI Express further classifies uncorrectable errors as fatal and non- fatal. Fatal errors might cause corresponding PCI Ex- 2.2.2 PCI Express AER Driver Designed To Handle press links and hardware to become unreliable. System PCI Express Errors software needs to reset the links and corresponding de- vices in a hierarchy where a fatal error occurred. Non- Before kernel 2.6.18, the Linux kernel had no root port fatal errors wouldn’t cause PCI Express link to become AER service driver. Usually, the BIOS provides basic unreliable, but might cause transaction failure. System error mechanism, but it couldn’t coordinate correspond- software needs to coordinate with a device agent, which ing devices to get more detailed error information and generates a non-fatal error, to retry any failed transac- perform recovery actions. As a result, the AER driver tions. has been developed to support PCI Express AER en- abling for the Linux kernel. PCI Express AER provides more reliable error report- ing infrastructure. Besides the baseline error reporting, PCI Express AER defines more fine-grained error types and provides log capability. Devices have a header log 2.2.2.1 AER Initialization Procedures register to capture the header for the TLP corresponding When a machine is booting, the system allocates in- to a detected error. terrupt vector(s) for every PCI Express root port. To service the PCI Express AER interrupt at a PCI Express Correctable errors consist of receiver errors, bad TLP, root port, the PCI Express AER driver registers its in- bad DLLP, REPLAY_NUM rollover, and replay timer terrupt service handler with Linux kernel. Once a PCI time-out. When a correctable error occurs, the corre- Express root port receives an error reported from the sponding bit within the advanced correctable error status downstream device, that PCI Express root port sends an register is set. These bits are automatically set by hard- interrupt to the CPU, from which the Linux kernel will ware and are cleared by software when writing a “1” call the PCI Express AER interrupt service handler. to the bit position. In addition, through the Advanced Correctable Error Mask Register (which has the similar Most of AER processing work should be done under bitmap like advanced correctable error status register), a process context.

View Full Text

Details

  • File Type
    pdf
  • Upload Time
    -
  • Content Languages
    English
  • Upload User
    Anonymous/Not logged-in
  • File Pages
    10 Page
  • File Size
    -

Download

Channel Download Status
Express Download Enable

Copyright

We respect the copyrights and intellectual property rights of all users. All uploaded documents are either original works of the uploader or authorized works of the rightful owners.

  • Not to be reproduced or distributed without explicit permission.
  • Not used for commercial purposes outside of approved use cases.
  • Not used to infringe on the rights of the original creators.
  • If you believe any content infringes your copyright, please contact us immediately.

Support

For help with questions, suggestions, or problems, please contact us