IDC TECHNOLOGY SPOTLIGHT

An Approach for Designing HPC Systems with Better Balance and Performance
April 2016
By Steve Conway; Earl C. Joseph, Ph.D.; and Bob Sorensen
Sponsored by Intel Corporation

The market for high-performance computing (HPC) systems has been one of the fastest-growing IT markets. The global HPC systems market more than doubled from 2001 to 2014 — growing from $4.8 billion to $10.2 billion — and IDC predicts that it will reach $15.2 billion in 2019. (Add software, storage, and services and the 2019 forecast expands to $31.3 billion.) An almost insatiable appetite for higher application performance has fueled spending growth among existing users while attracting thousands of new adopters, including small and medium-sized businesses (SMBs) and large commercial firms with big data analytics needs that enterprise technology alone can't handle well.

Demand for HPC is robust and growing, but to meet this demand fully, the HPC community needs to address a daunting set of requirements: application compatibility, larger system sizes (soon to include exascale), mixes of compute-intensive and data-intensive workloads, newer environments (cloud computing), improved energy efficiency and reliability/resiliency, better command and control of all this functionality via the software stack, deeper memory hierarchies, and alleviation of the data movement (I/O) bottleneck via better interconnect fabrics.

In recent years, reference architectures — flexible master blueprints — have arisen to help developers create HPC systems that are integrated, performant, and responsive to these complex requirements. This IDC Technology Spotlight discusses this trend and uses the Intel Scalable System Framework (SSF) to illustrate progress to date and where things are headed.

Sustained, Balanced Performance: The Holy Grail for HPC Users

It's no accident that HPC stands for high-performance computing. Since the start of the supercomputer era in the 1960s, a fast-growing contingent of scientists, design engineers, and advanced data analysts has turned to HPC systems to run their problems at the highest available speed and resolution. Especially during the past 15 years, the peak performances of HPC systems — their hypothetical speed limits — have skyrocketed. On the November 1999 list of the world's most powerful supercomputers (www.top500.org), the number 1 system boasted 9,632 cores and a peak performance of 3.2 teraflops (TF). Fast-forward 16 years to the November 2015 list, and the number 1 supercomputer featured 3.1 million cores and a peak performance of 54.9 petaflops (PF). That is a 324-fold increase in core count and a 17,156-fold leap in peak performance (see Table 1).

Sustained performance — actual speed on end-user applications — has also made impressive advances, but with varied success. "Embarrassingly parallel" codes still manage to exploit substantial fractions (>20%) of even the largest supercomputers, but many applications do not fit that description.

TABLE 1
Growth in System Size: Top Supercomputer on Top500 List

                             November 1999    November 2015    Gain Factor (X)
Peak performance (TF)        3.2              54,900           17,156
Core count                   9,632            3,120,000        324
Peak performance/core (GF)   0.33             17.6             53

Source: IDC, 2015

IDC's 2015 worldwide HPC end-user study found that 51.8% of all codes were running on one node or less and that 10.9% of codes ran on just a single core. Only 9.1% of applications were being run on more than 1,000 cores, and just 2.5% were scaling to 10,000 or more cores.
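The gain factors in Table 1 follow directly from the raw Top500 figures. The short Python sketch below simply reproduces that arithmetic; the input values are taken from Table 1, and the variable names are illustrative only.

    # Reproduce the Table 1 gain factors from the raw Top500 figures.
    nov_1999 = {"peak_tf": 3.2, "cores": 9_632}           # November 1999 No. 1 system
    nov_2015 = {"peak_tf": 54_900.0, "cores": 3_120_000}  # November 2015 No. 1 system

    peak_gain = nov_2015["peak_tf"] / nov_1999["peak_tf"]   # ~17,156x
    core_gain = nov_2015["cores"] / nov_1999["cores"]       # ~324x

    # Per-core peak, converted from teraflops to gigaflops (1 TF = 1,000 GF).
    gf_per_core_1999 = nov_1999["peak_tf"] * 1_000 / nov_1999["cores"]  # ~0.33 GF
    gf_per_core_2015 = nov_2015["peak_tf"] * 1_000 / nov_2015["cores"]  # ~17.6 GF
    per_core_gain = gf_per_core_2015 / gf_per_core_1999                 # ~53x

    print(f"Peak performance gain: {peak_gain:,.0f}x")
    print(f"Core count gain:       {core_gain:,.0f}x")
    print(f"Per-core peak gain:    {per_core_gain:,.0f}x")

As the per-core row makes clear, most of the headline performance gain came from multiplying cores, not from each core becoming thousands of times faster.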
Some of the responsibility for limited sustained performance rests with the application codes themselves. Extreme examples include codes that were written decades ago to run on single-processor, central-memory vector supercomputers. Many of these codes have been modified but were never fundamentally rewritten to exploit today's highly parallel HPC systems efficiently. Boosting sustained performance for these HPC codes and many others depends on more than just exposing enough parallelism to exploit more processor cores on CPUs or accelerators/coprocessors. These codes also need strong I/O capabilities to keep the processors supplied with data, along with a constellation of other requirements (see the following section).

Addressing System Imbalance and Growing Complexity

Moving performance on end-user applications forward will require the HPC community to address two key challenges: the imbalanced architectures of today's HPC systems and the complex, growing set of requirements users have for these systems.

Not just for exascale users. Although much of the discussion in the HPC community lately has focused on advancing capabilities to prepare for exascale computing in the 2020–2024 era and beyond, improvements in system balance and in managing system complexity stand to benefit the entire HPC community, from systems with a few nodes up to the largest supercomputers.

Processors Are Not the Performance Problem

The Top500 figures noted in the preceding section demonstrate that processors are not the problem. System peak performance is usually just the peak performance of one processor times the number of processors in the system, and the figures show that the peak performance of processors has advanced strongly in the past 15 years. Today, the processors — whether CPUs or accelerators/coprocessors — are nearly always the fastest elements in any HPC system, by far. The system's sustained performance on a range of user applications typically depends much more on the ability of the rest of the system, especially the memory subsystem and interconnect, to keep pace with the processors (a brief sketch of this peak-versus-sustained arithmetic appears at the end of this section).

The global HPC user community is well aware of this system-level imbalance problem (the "memory wall" and the "I/O wall") and has urged vendors to alleviate this worsening bottleneck by improving the capabilities of the nonprocessor parts of the system. That matters because system-level imbalance not only constrains the performance of user applications but also throttles organizational productivity and the value (return on investment [ROI]) obtained from increasingly substantial investments in HPC resources. The rapid growth in high-performance data analysis — big data needing strong memory and I/O capabilities for data-intensive simulation and analytics — is exacerbating the imbalance issue.

Advancing Interconnect Performance

An important strategy for improving system balance is to enable interconnect fabrics to move data with higher bandwidth and lower latency, especially to keep the processors reasonably busy. With that goal in mind, interconnect-related R&D has heated up in recent years — and not only at large interconnect vendors such as Mellanox and Intel. Extoll (Germany) and Numascale (Norway) already market their interconnect products, and Atos (Bull) is developing an interconnect fabric for its HPC systems. In addition, IDC expects Cray to offer a special variant of the Intel Omni-Path interconnect, derived from the technical collaboration between the two companies.
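As a rough illustration of the peak-versus-sustained relationship described above, the sketch below computes system peak as the per-processor peak multiplied by the processor count and then derives an application's efficiency as sustained speed divided by peak. The processor count, per-processor peak, and sustained figure are hypothetical values chosen only to make the arithmetic concrete; they do not describe any specific system.

    # Hypothetical system, for illustration only.
    num_processors = 10_000          # assumed processor count
    peak_per_processor_tf = 1.0      # assumed peak per processor, in teraflops

    # System peak is roughly the per-processor peak times the processor count.
    system_peak_tf = num_processors * peak_per_processor_tf   # 10,000 TF (10 PF)

    # Sustained performance is what an application actually achieves on the system.
    sustained_tf = 1_500.0           # assumed measured application speed, in TF

    efficiency = sustained_tf / system_peak_tf
    print(f"System peak:      {system_peak_tf:,.0f} TF")
    print(f"Sustained on app: {sustained_tf:,.0f} TF ({efficiency:.0%} of peak)")

In this hypothetical case the application reaches 15% of peak; whether a real code lands above or below that mark depends largely on how well the memory subsystem and interconnect keep the processors fed.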
Breaking Down the Memory Wall

Recognizing the significant impact an advanced memory architecture would have on HPC applications, Intel has been working for some time on technology innovation to reduce the lag time between processors and data. While Intel has not released a wide range of details specific to its new memory technologies, two significant milestones have been disclosed that promise a tremendous performance improvement for HPC applications.

In July 2015, Intel and Micron unveiled 3D XPoint technology, a nonvolatile memory that will benefit both compute-intensive applications and data-intensive applications requiring fast access to large sets of data. This is the first new memory category since the introduction of NAND flash in 1989. Manufacturer-supplied proof points suggest that 3D XPoint technology offers nonvolatile memory speeds up to 1,000 times faster than those of NAND, currently the most popular nonvolatile memory in the marketplace.

In August 2015, Intel announced Intel Optane Technology, which combines the 3D XPoint nonvolatile memory media with the company's advanced system memory controller, interface hardware, and software IP as the foundation for a range of future products, including a new line of Intel DIMMs designed for next-generation platforms.

Addressing Complexity via the Software Stack

Although imbalance (extreme compute centrism) is arguably the biggest performance problem facing HPC systems today, it is not the only major problem. Another important issue for buyers and users, as noted previously, is managing the complex, growing set of requirements for operating these systems. These mounting requirements have made it more challenging to present users with an HPC resource that is comprehensive, coherent, and highly performant on their applications. The system management requirements fall into the following main categories:

- Heterogeneous workloads (floating point–based simulation, integer-based analytics)
- The movement from synchronous applications to asynchronous workflows
- Rapid growth in average system sizes and component counts
- Heterogeneous processing elements (CPUs, coprocessors/accelerators)
- Heterogeneous environments (on-premise datacenters, public/hybrid clouds)
- Reliability/resiliency at scale
- Power efficiency and power awareness ("power steering")
- Cybersecurity

The HPC community is also dealing with the emerging trend toward deeper, more heterogeneous memory hierarchies (e.g., solid state drives [SSDs], on-package memory, NVRAM, and burst buffers).