HP Superdome: Mainframe-Class Availability at One-Eighth the Total
Total Page:16
File Type:pdf, Size:1020Kb
HP Superdome Mainframe-class availability at one-eighth the total cost of ownership (TCO) Business white paper Table of contents Executive summary ............................................................. 3 Introduction................................................................... 3 HP Superdome: mainframe availability................................................ 4 Engineered for superior RAS ....................................................... 6 One-eighth the TCO of mainframes over three years ...................................... 9 Conclusion .................................................................. 10 Appendix ................................................................... 11 Executive summary: For years, mainframe computers period. This paper provides a detailed analysis of were considered almost the sole practical alternative the availability HP Superdome systems deliver when for the world’s most mission-critical computing compared to a recent IBM z10 mainframe—with environments. They have long claimed the industry’s potentially even more dramatic differences when best reliability, availability, and serviceability (RAS). comparing older mainframe models. It also touches Today, with the HP Superdome, organizations have on the superb reliability and serviceability of an alternative that features RAS and availability levels HP Superdome systems and provides a detailed equivalent to mainframes, at approximately one-eighth business case comparing the TCO of the Superdome the total cost of ownership (TCO) over a three-year to a comparable mainframe system. The high availability The downside with mainframes, however, is their of the Superdome Introduction high cost—both upfront and on an ongoing basis. A baseline IBM z10 Model E40 with 40 active is possible in part The costs and risks associated with planned CPUs, for instance, has a baseline cost of $22.4 and unplanned downtime continue to escalate. because of highly million, including initial purchase outlays and annual According to the U.S. National Archives and Records reliable cell board hardware and software maintenance over one year. Administration, 93% of companies that lost the use technology. One With shrinking budgets and the need to do more of their data center for 10 days or more filed for with fewer IT resources, it is easy to see why so many measure of reliability bankruptcy within one year of the disaster. Half of organizations are looking for a computing alternative is mean time between all businesses that found themselves without data that offers RAS on par with mainframe systems—at a failures (MTBF). management for this same time period filed for fraction of the cost. Using actual field bankruptcy immediately. replacement rates, HP Considering the dire consequences of downtime, many engineers calculated organizations—especially those in industries such the Superdome cell as financial services, government, communications, board MTBF to be transportation, and retail—have traditionally relied 57 years. on mainframe systems to support their mission-critical computing needs. Mainframes have traditionally been known for being reliable and highly available. For instance, industry statistics for mainframes tout a mean time between failures (MTBF) of decades and near-perfect uptime. Mainframes also boast relatively easy serviceability. 3 While all HP Superdome servers offer superb HP Superdome: availability, engineers chose two specific configurations for modeling purposes. Engineers mainframe availability ran two models, the first predicting the availability of a single HP Integrity Superdome sx2000 system To counter the claim that no type of computer other with two hard partitions running HP Serviceguard, than a mainframe can deliver mainframe-class the high-availability clustering solution from HP, and availability, HP engineers conducted an in-depth study Oracle Real Application Clusters (RAC). The second modeling the availability of HP Superdome systems. configuration featured two HP Superdome sx2000 The models are based on actual field data gathered systems, each with two hard partitions, running over 10 years of Superdome use in some of the HP Serviceguard and Oracle RAC. (See the appendix world’s most demanding computing environments. The for the full set of assumptions.) availability calculations were derived using continuous- parameter Markov chains. Markov chains are used The high availability of the Superdome is possible in to model the different states of degradation of each part because of highly reliable cell board technology. configuration outlined in this document. Failure and One measure of reliability is mean time between recovery rates between states are specified based on failures (MTBF). Using actual field replacement rates, field failure data, field service data, and subsystem HP engineers calculated the Superdome cell board configurations. Once steady-state probabilities were MTBF to be 57 years. computed, the system availability was calculated as the sum of the probabilities for the system states in which the hardware was operational. 4 Configuration #1: single-server, 2-node Superdome cluster Partition 1 Serviceguard high availability cluster Partition 2 Superdome sx2000 Server Configuration #2: two-server, 4-node Superdome cluster Partition 1 Partition 1 Serviceguard high availability cluster Partition 2 Partition 2 Superdome sx2000 Superdome sx2000 Server Server Table 1. The results put the Configuration Steady-state Number of nines Average annual availability of HP Superdome servers solidly in line with availability unplanned downtime* mainframe systems Single-server, 2-node 99.9988% Five nines – 6.3 minutes Superdome cluster running HP Serviceguard Two-server, 4-node 99.9996% Five nines + 2.1 minutes Superdome cluster running HP Serviceguard *Due to hardware or operating system outages 5 General Mills: • Electrical isolation of hard partitions: Unique in a case in point Engineered for superior the industry, electrical isolation of hard partitions enables true server consolidation. More than two Customers in RAS generations of hard partitioning experience at HP real-world IT are the customer’s assurance that a server divided environments From redundant cell board components to double into hard partitions is an excellent approximation of chip spare memory technology and hot swap I/O, the report exceptional an array of smaller boxes, yet without all the system Superdome has been engineered since its inception management and cost-of-ownership headaches. availability and for resiliency to prevent both planned and unplanned uptime in line with downtime. Over the years, these features have • Dynamic processor resiliency: Dynamic processor what the theoretical helped Superdome servers operate with near-perfect resiliency (DPR) is the system’s ability to de-allocate availability. (online) those CPUs that are exhibiting an models predict. unacceptable number of correctable errors. CPUs Leading consumer Key RAS features available on every HP Superdome are de-allocated if the number of corrected cache packaged goods include: errors reaches a specific configurable threshold. company General • Intel® Itanium® 2 processors: with a robust error- DPR is currently available on HP-UX and, to a limited ® Mills switched from correction scheme and a world-class machine extent, with the Windows operating system. If a mainframe to check architecture, these processors are specifically HP Instant Capacity (iCAP) software is installed on Superdome for its designed to enable mainframe-class availability in the server, then an iCAP processor with no user downtime replaces the de-allocated processor. mission-critical SAP enterprise environments. Intel Itanium 2 processors, in combination with the advanced RAS capabilities R/3 environment. • Double chip spare: An enhanced feature in of sx2000-based systems, support a level of HP systems is double chip-sparing technology. Application availability that was previously only possible with The hardware can permanently detect and correct availability stays at high-end, proprietary platforms. an error in any given dynamic random-access mainframe rates—at • Intel Cache Safe Technology: As cache sizes become memory chip (DRAM) and also detect and correct a fraction of the larger, the likelihood of cache failures increases. an additional memory error in any other memory cost of operating a Intel Itanium processors (starting with the dual-core location in the same code word. This detection and mainframe. Intel Itanium 2 “Montecito” processor) offset this correction is done by firmware recognizing when increased likelihood of failures using Intel Cache the first DRAM has failed, and “erasing” its bits from Safe Technology (code-named “Pellston”). When the error-checking and correcting (ECC) correction there is a correctable error in the L3 cache, Cache calculations. This feature enables the ECC logic to Safe Technology tests the cache line and corrects correct for a second DRAM failure in the same ECC the error. If the cache line is found to be defective, code word. it is disabled. 6 • Dynamic memory resiliency: It is not uncommon to • Advanced I/O error recovery: The PCI error-handling have single bits in a DRAM “go bad” (fail) during feature enables an HP-UX 11i system to avoid a the life of a computer system. It is advantageous to machine check abort (MCA) or a high priority map these locations out of main memory, because machine check (HPMC) if a PCI error (for example, a a persistent single-bit