High-Performance Throughput Computing

Throughput computing, achieved through multithreading and multicore technology, can lead to performance improvements that are 10 to 30× those of conventional processors and systems. However, such systems should also offer good single-thread performance. Here, the authors show that hardware scouting increases the performance of an already robust core by up to 40 percent for commercial benchmarks.

Shailender Chaudhry
Paul Caprioli
Sherman Yip
Marc Tremblay
Sun Microsystems

This article was based on a keynote speech by Marc Tremblay at the 2004 International Symposium on Computer Architecture in Munich, with some updated information.

Microprocessors took over the entire computer industry in the 1970s and 1980s, leading to the phrase “the attack of the killer micros” as a popular topic on the newsgroup comp.arch for many years. We believe that the disruption offered by throughput computing is similar in scope. Over the next several years, we expect chip multithreading (CMT) processors to take over all computing segments, from laptops and desktops to servers and supercomputers, effectively creating “the attack of throughput computing.” Most microprocessor companies have now announced plans and/or products with multiple cores and/or multiple threads per core.

Sun has been developing CMT technology for almost a decade through the Microprocessor Architecture for Java Computing (MAJC, pronounced “magic”) program1,2 and a variety of SPARC processors.3-7 IBM, through the Power4 and Power5,8,9 and Intel, through hyperthreading, are shipping products in this space. AMD has announced plans to ship dual-core chips in late 2005. What is misunderstood, though, is how you can build a processor that has several multithreaded cores and still delivers high-end single-thread performance. The combination of high-end single-thread performance and a high degree of multicore, multithreading (for example, tens of threads) is what we call high-performance throughput computing, the topic of this article.

A variety of papers have shown how workloads running on servers10 and desktops11 can greatly benefit from thread-level parallelism. It is no surprise that large symmetric multiprocessors (SMPs) were so successful in the 1990s in delivering very high throughput in terms of transactions per second, for instance, while running large, scalable, multithreaded applications. Successive generations of SMPs delivered better individual processors, better interconnects (certainly in terms of bandwidth, not necessarily in terms of latency), and better scalability. Gradually, applications and operating systems have matched the hardware for scalability.


A decade later, architects designing CMT processors face a dilemma: Many applications and some operating systems already scale to tens of threads. This leads to the question, “Should you populate a processor die with many very simple cores or should you use fewer but more powerful cores?” We simulated (and built, or are building) two very different cores and two different threading models and discuss how they benefit from multithreading and from growing the number of cores on a chip. We show that if multithreading is the prime feature and the rest of the core (that is, the rest of the pipeline and the memory subsystem) is architected around it, as opposed to simply adding threading to an existing core, these two different cores benefit greatly from multithreading (a 1.8 to 3.2× increase in throughput) and from multicores (a 3.1 to 3.7× increase). Server workloads dictate that high throughput be the primary goal. On the other hand, the impact of critical sections, Amdahl’s law, response-time requirements, and pipeline efficiency force us to try to design a high-performance single-thread pipeline, although not at the expense of throughput.

Unfortunately, as we discuss in a later section, two big levers that industry traditionally used— and traditional out-of-order execution—no longer work very well in the context of an aggressive 65-nm process and memory latencies now ranging in the hundreds of cycles.

Clearly, the industry needs new techniques. This article describes hardware scouting, in which the processor launches a hardware thread (invisible to software) which runs in front of the head thread. The goal here is to bring all interesting data and instructions (and control state) into the on-chip caches. Scouting heavily leverages some of the existing threaded hardware to boost single-thread performance. The control speculation accuracy, the scout’s depth, the memory-level parallelism (MLP) impact, and the overall effect on performance and cache sizes form the core of this article. We will show that this microarchitecture technique will, for some configurations, improve performance by 40 percent on TPC-C, double the effective L2 cache size for SPECint2000, and make a 1-Mbyte L2 cache behave as a 64-Mbyte one for SPECfp2000.

Researchers have proposed several other techniques to reduce the impact of ever-increasing memory latencies, as the “Scaling the Memory Wall” sidebar explains.

Throughput computing
Systems designed for throughput computing emphasize the overall work performed over a fixed time period as opposed to focusing on a metric describing how fast a single core or a thread executes a benchmark. The work is the aggregate amount of computation performed by all functional units, all threads, all cores, all chips, all coprocessors (such as network and cryptography accelerators), and all the network interface cards in a system.

The scope of throughput computing is broad, especially if the microarchitecture and operating system collaborate to enable a thread to handle a full process. In this way, a 32-thread CMT can appear to be an SMP of 32 virtual cores, from the viewpoint of the operating system and applications. At one end of the spectrum, 32 different processes could run on a 32-way CMT. At the other end, a single application, scaling to 32 threads—like many large commercial applications running on today’s SMPs—can run on a CMT. Any point in between is also possible. For instance, you could run four application servers, each scaling to eight threads. Obviously, there are sweet spots, and it is the task of the load-balancing software and operating system to appropriately leverage the hardware.

Throughput computing relies on the fact that server processors run large data sets and/or a large number of distinct jobs, resulting in memory footprints that stress all levels of the memory hierarchy. This stress results in much higher miss rates for memory operations than those typical of SPECint2000, for instance. A compounding effect is the miss cost. In about 2007, servers will already be driving over 1 Tbyte of data provided by hundreds of dual inline memory modules. This makes the memory physically far from processors (requiring long board traces, heavy loading, backplane connectors, and so on), resulting in large cache-miss penalties.



Scaling the Memory Wall

Tolerating memory latency has of course been a goal for architects and researchers for a long time. Various designs have employed caches to tolerate memory latency by using the temporal and spatial locality that most programs exhibit. To improve the latency tolerance of caches, other designs have employed nonblocking caches that support load hits under a miss and multiple misses.

The use of software prefetching has proven effective when the compiler can statically predict which memory reference will cause a miss and disambiguate pointers.1,2 This is effective for a small subset of applications. In particular, however, it is not very effective in commercial database applications, a prime target for servers. Notice that software prefetching can also hurt performance if it

• generates extraneous load misses by speculatively hoisting multiple loads above branches based on static analysis and
• increases instruction bandwidth.

Hardware prefetching alleviates the instruction cache pressure, using dynamic information to predict misses and discover access patterns. Its drawback (and the reason why system vendors often turn it off) is that it is hard to map a single hardware algorithm to different applications, which would require a variety of algorithms. We have observed severe cache pollution when applying a generic hardware prefetch to codes with significantly different data access patterns.

Thread-based prefetching techniques on a CMT processor use idle threads or cores to prefetch data for the main computation.3-5

Finally, researchers have proposed several forms of run-ahead execution; these techniques have various levels of complexity.6-8 In general, these techniques run program instructions after long-latency instructions. They do not actually retire instructions but prefetch data into the caches. These prefetches are accurate because they are part of the instruction stream and are based on runtime information. Run-ahead execution can effectively cover L1 latency. Covering the longer latencies of remote caches and memory came at a hardware cost and some potential slowdown. New techniques such as out-of-order commit9 are starting to appear. It remains to be seen how they will work within the context of throughput computing, but it looks promising.

Hardware scouting also runs instructions after a load miss. It does so by using mostly the same structures needed to support multithreading in a processor core. An earlier article gives a preview of how to accomplish this in a modern processor.10 A novel aspect of scouting is that it executes instructions at a faster rate than normal execution to both ensure the timeliness of prefetches and to warm up the fetch prediction hardware and caches. Invoking and exiting out of scouting is a zero-cycle operation to avoid slowdowns from unsuccessful scouting events. This is an important implementation aspect that dictates the organization of important pipeline stages (this has been our experience). We conclude that even though architects can no longer mostly rely on faster transistors and faster wires to provide an easy performance scaling, certain ways of using the transistor budget compensate for the lack of scaling; these reach high levels of performance, especially in the context of throughput computing.

References
1. C.-K. Luk and T.C. Mowry, “Compiler-Based Prefetching for Recursive Data Structures,” Proc. 7th Int’l Conf. Architectural Support for Programming Languages and Operating Systems (ASPLOS 96), ACM Press, 1996, pp. 222-233.
2. T.C. Mowry, M.S. Lam, and A. Gupta, “Design and Evaluation of a Compiler Algorithm for Prefetching,” Proc. 5th Int’l Conf. Architectural Support for Programming Languages and Operating Systems (ASPLOS 92), ACM Press, 1992, pp. 62-73.
3. S. Chaudhry and M. Tremblay, “Method and Apparatus for Using an Assist Processor to Pre-fetch Data Values for a Primary Processor,” US Patent 6,415,356.
4. J.D. Collins et al., “Dynamic Speculative Precomputation,” Proc. 34th Ann. ACM/IEEE Int’l Symp. Microarchitecture (Micro-34), IEEE CS Press, 2001, pp. 306-317.
5. C. Zilles and G. Sohi, “Execution-Based Prediction Using Speculative Slices,” Proc. 28th Ann. Int’l Symp. Computer Architecture (ISCA 01), IEEE CS Press, 2001, pp. 2-13.
6. J. Dundas and T. Mudge, “Improving Data Cache Performance by Pre-Executing Instructions Under a Cache Miss,” Proc. 1997 Int’l Conf. Supercomputing (SC 97), IEEE CS Press, 1997, pp. 68-78.
7. R. Balasubramonian, S. Dwarkadas, and D.H. Albonesi, “Dynamically Allocating Processor Resources Between Nearby and Distant ILP,” Proc. 28th Ann. Int’l Symp. Computer Architecture (ISCA 01), IEEE CS Press, 2001, pp. 26-37.
8. O. Mutlu et al., “Runahead Execution: An Alternative to Very Large Instruction Windows for Out-of-order Processors,” Proc. 9th Int’l Symp. High Performance Computer Architecture (HPCA 03), IEEE CS Press, 2003, pp. 129-140.
9. A. Cristal, D. Ortega, J. Llosa, and M. Valero, “Out-of-order Commit Processors,” Proc. 10th Int’l Symp. High Performance Computer Architecture (HPCA 04), IEEE CS Press, 2004, pp. 48-59.
10. J. Niccolai, “Sun Adds to its UltraSPARC Road Map,” Computerworld, 12 Feb. 2004; http://www.computerworld.com/hardwaretopics/hardware/story/0,10801,90145,00.html.


As a result, cores are often idle, waiting for data or instructions coming from distant memory locations. Depending on the processor’s latency tolerance, “distant memory” can be the first few levels of caches, external caches, other processors’ caches, and/or main memory. As we will describe later, even very aggressive out-of-order processors will be idle, waiting for memory data, for hundreds of cycles.

Figure 1. Systems with multithreaded cores hide stall cycles by switching alternate threads and processes into the pipeline.

CMT allows the processor to switch on-chip threads into the pipeline after it discovers events causing long stalls; Figure 1 shows this strategy. CMT uses the fact that some of the outer structures (those further from the core) are often architected for full pipelining and yet are idle most of the time. Examples of such structures include memory controllers, I/O controllers, and networking accelerators. CMT processors share these structures among cores and threads, resulting in much better utilization. Alternatively, we find that adding full pipelining to nonpipelined structures, such as memory controllers, cyclic redundancy checkers, or large MMUs, adds slightly to the area and yet allows full sharing at very little performance cost. Figure 2 shows how cores and threads overlap spatially and temporally to achieve a large amount of work (high throughput).

Figure 2. A 32-way CMT combines multithreaded cores into an on-chip SMP to deliver high throughput.

Throughput computing is as much about the consolidation of on-chip resources as it is about consolidating assets at the server granularity. This strategy accommodates data sharing among processors in a large SMP on-chip or across a few CMT chips with much shorter latencies. For horizontally scaled applications, this permits consolidation of multiple copies of the operating system and the applications into one copy of the operating system and a single physical copy of the application (with multiple virtual copies). This consolidation can result in large savings in terms of memory, often the dominant cost in these platforms.
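To make the switching policy of Figure 1 concrete, the sketch below models a hypothetical vertical-threaded core: one thread issues per cycle, and whenever the active thread starts a long-latency miss the core rotates to another ready thread. The miss latency, thread count, and instruction mix are invented for illustration; this is not a model of any actual Sun pipeline.

```cpp
// Toy model of Figure 1: a vertical-threaded core that issues from one thread
// per cycle and rotates to another ready thread when the current thread starts
// a long-latency miss. All parameters are illustrative assumptions.
#include <array>
#include <cstdio>
#include <vector>

constexpr long kMissLatency = 100;  // assumed memory-miss latency in cycles

struct Thread {
    std::vector<bool> ops;  // per-instruction flag: true = load that misses
    size_t pc = 0;          // next instruction to issue
    long ready_at = 0;      // cycle at which the thread may issue again
    bool done() const { return pc >= ops.size(); }
};

int main() {
    std::array<Thread, 4> threads;
    for (auto& t : threads)  // compute, miss, compute, compute, miss, compute
        t.ops = {false, true, false, false, true, false};

    long cycle = 0, issued = 0;
    size_t current = 0, remaining = threads.size();
    while (remaining > 0) {
        for (size_t k = 0; k < threads.size(); ++k) {
            size_t idx = (current + k) % threads.size();
            Thread& t = threads[idx];
            if (t.done() || t.ready_at > cycle) continue;  // stalled or finished
            bool miss = t.ops[t.pc++];
            ++issued;
            if (miss) t.ready_at = cycle + kMissLatency;   // this thread now stalls
            if (t.done()) --remaining;
            // Stay on this thread unless it just stalled or finished.
            current = (miss || t.done()) ? (idx + 1) % threads.size() : idx;
            break;
        }
        ++cycle;  // one cycle elapses whether or not anything issued
    }
    std::printf("total cycles = %ld, issue-slot utilization = %.2f\n",
                cycle, static_cast<double>(issued) / static_cast<double>(cycle));
    return 0;
}
```

With a single thread the same instruction stream would leave the pipeline idle for the full latency of every miss; with four threads most of those stall cycles overlap, which is the effect Figure 1 illustrates.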

Typical CMT

Figure 3. Block diagram of a typical CMT.



In Figure 3, we show a block diagram for a typical CMT processor, one that we have been using for simulation and analysis over the years. We will use it throughout this article. It is fairly generic and yet maps nicely onto the Niagara processor, for instance.6 In this generic processor, a few threads (four for Niagara) share a pipeline and form a core. For a T-threaded core, we assume T copies of the register file and T copies of the internal processor registers. We assume that other inner structures, such as the level-one (L1) instruction cache, data cache, and translation look-aside buffers, are shared within a core but not across cores.

This model views the memory subsystem as a single logical structure. In our diagram, a global switch connects cores and the shared logical memory. The on-chip L2 cache has banks for increased bandwidth. Figure 3 shows the same number of banks as the number of cores, but that need not be the case. Each bank of the L2 cache has attached memory controllers. We use this model to study the impact of threading and multiple cores on overall performance.

Multithreading and multicore
In trying to architect a core for a CMT chip, there is an inherent tension between the complexity of each core and the number of cores that can fit on a die. If you choose to implement a lightweight core, or a core that does not target high instruction-level parallelism (ILP) and high clock rate, single-thread performance will suffer, but high overall throughput is achievable through replicating many of them.

The Niagara core is a good example of a power-efficient multithreaded core. It uses a simple one-issue, in-order pipeline with little speculation, which translates into a small area, low power, and a fairly easy core to multithread. A complex core will require more effort to implement, likely result in a more expensive part, and likely draw more power. On the other hand, it can offer substantial performance gain.

Both simultaneous multithreading (SMT) and vertical threading (VT) enable a core to support multiple threads. Vertical threading is simpler because it has the restriction that, on a given cycle, the processor can only issue instructions from a single thread. Obviously, VT is less able to optimize pipeline resource utilization.

Lightweight cores become more attractive when coupled with multithreading. A core using SMT or VT can take advantage of otherwise idle resources while not significantly impacting other threads. Since lightweight cores have a relatively low resource utilization per thread, such cores can support many threads before per-thread performance starts to degrade. Supporting four threads can scale total performance by as much as 3.2× on a large database benchmark, as Figure 4 shows. The figure also shows that a lightweight core with four threads has greater performance than three single-threaded cores, a very good trade-off.

Figure 4. Impact of cores and threads on lightweight cores.

These large performance improvements might appear surprisingly high when compared with other commercial processors. The key to achieving them is to architect cores from the very beginning to support multithreading (as opposed to adding multithreading to an existing core not explicitly built for multithreading). As an example, in the context of throughput computing, there is little to gain from trying to access the L1 cache in two cycles (versus three). This relaxation permits the removal of complex and power-hungry hardware, which has the benefit of reducing the area and power of a core. However, it also lowers the initial single-thread performance, anticipating that the multithreaded performance will make up for this deficit.


A heavyweight core will improve single-thread performance, but generally reduces the number of cores that will fit onto a chip. Also, compared to a lightweight core, a heavier core will typically have higher power consumption. Nevertheless, the judicious addition of complexity can increase performance per square millimeter as well as performance per watt. Experiments show that an otherwise fairly lightweight core with hardware scouting capabilities (which we discuss later) can outperform a truly lightweight core by 4 to 10×.

Multithreading is also a profitable addition to a heavyweight core. Although a heavier core is better able to optimize per-thread pipeline utilization and is better at extracting ILP than a lighter core, there still remains significant opportunity for an additional thread to utilize otherwise idle resources.

We have observed that a two-thread core achieves a 1.4 to 1.5× improvement in throughput. This boost in performance costs only a relatively small increase in area (about 10 percent). The additional improvement from a four-thread core starts to diminish, yielding another 1.3 to 1.4× improvement. This incremental performance comes at an incremental cost of about 15 percent for the state of the additional two threads and the slightly bigger state machines.

Figure 5. Impact of cores and threads on heavyweight cores.

Multicore technology can help to extract thread-level parallelism in either lightweight or heavyweight core designs. By putting four lightweight cores on a die, we observed up to a 3.6× increase in performance (see Figure 4). Similarly, as Figure 5 shows, four heavyweight cores on a die showed improvements of up to 3.7×. Therefore, almost regardless of the type of core being used, throughput processors have much to gain from replicating cores and organizing them efficiently.

Furthermore, the gains achieved by replicating cores are largely independent of the gains obtained by multithreading. We observed substantial performance improvement by putting several multithreaded cores on a die. For example, a chip using four lightweight cores, each having four threads per core, delivered a combined 10× performance. A heavier multithreaded core was also able to yield a 6.9× improvement. Of course, this scaling is based on the already impressive 4 to 10× advantage of the heavier core.

We generated the data in Figures 4 and 5 to understand scaling, sharing, inflexion points, and so on. To isolate these effects, we assume that

• all threads on a core share the L1 caches,
• all threads on a chip share the TLBs,
• the number of threads or cores does not affect the access times to caches, and
• the processor connects directly to memory through multiple memory controllers (that is, there is no L3 cache).

Single-thread performance
CMT processors built from heavyweight cores can provide good single-thread performance. One choice to consider is a traditional out-of-order core. Such processors have been very popular and fairly successful in running commercial applications because they can execute past load misses and overlap multiple misses. This reduces the effective memory latency, resulting in a high number of instructions per cycle (IPC).

The ability to overlap misses is limited by the window size and the amount of speculation supported.



A small window limits the number of instructions that are potential candidates for overlap. For example, a 32-instruction window limits the search for the next load miss to 31 instructions following the first load miss. Unfortunately, as we discussed previously, the growing gap in the cycle time of dual inline memory modules and processors has resulted in memory accesses that can take hundreds of processor cycles to complete. Thus, to “tolerate” this long latency and overlap multiple load misses, an out-of-order processor needs a very large window and highly speculative execution. This increases the core area substantially; some structures grow quadratically with window size. Growing this window size to 64, 128, or even 256 instructions through traditional methods runs counter to the desire to build a core small enough to fit many of them on a single chip. The data on the UltraSPARC V core, a 128-instruction-deep out-of-order processor, showed that it was about nine times as big as the four-issue in-order core in UltraSPARC I. The power associated with the large associative structures (content-addressable memories) and broadcast networks also make traditional out-of-order cores unattractive for throughput computing.

The amount of speculation that fits on a chip limits the ability to overlap multiple load misses. For example, if a store address is unknown (because it depends on a load miss), and the processor does not support memory disambiguation for possible read-after-write hazards of subsequent loads, then these loads cannot be candidates for miss overlap. Also, atomic operations typically stall the decode stage of the pipeline until all prior instructions have completed. This too prevents subsequent loads from becoming candidates for miss overlap. Lastly, some programs do have independent instructions, but these lie farther along in the instruction stream than a limited hardware structure can handle. In such programs, a processor with even a reasonably large window size of as much as 128 instructions will not find other independent loads that it can overlap to memory.

Moreover, aggressive speculation only adds to the complexity and area. The design must control the aggressiveness using predictors and rewind mechanisms, and it must implement structures to correct any miss-speculation.

Figure 6. Database memory parallelism. The left bar of each set is for a moderately speculative, out-of-order core; the right bar is for a core with an aggressive issue policy.

Figure 7. Window termination conditions. The left bar of each set is for a moderately speculative, out-of-order core; the right bar is for a core with an aggressive issue policy.
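The window-size effect shown in Figure 6 can be approximated with a simple scan over a miss-marked instruction trace. The sketch below is our own rough calculation, not the simulator behind the figure: it counts how many further misses fall within a window of W instructions after each blocking miss, ignoring register dependences and issue restrictions, so it is an optimistic upper bound on the overlap a W-entry window could expose.

```cpp
// Optimistic bound on miss overlap for a W-instruction window: after each
// blocking miss, count further misses among the next W-1 trace entries.
// Illustrative only; real windows are also limited by dependences, stores,
// atomics, and speculation support, as discussed in the text.
#include <algorithm>
#include <cstdio>
#include <vector>

double avg_overlap(const std::vector<bool>& is_miss, size_t window) {
    size_t groups = 0, overlapped = 0, i = 0;
    while (i < is_miss.size()) {
        if (!is_miss[i]) { ++i; continue; }
        ++groups;                                    // a blocking miss starts here
        size_t end = std::min(is_miss.size(), i + window);
        for (size_t j = i + 1; j < end; ++j)         // misses visible in the window
            if (is_miss[j]) ++overlapped;
        i = end;                                     // resume once the miss returns
    }
    return groups ? static_cast<double>(overlapped) / groups : 0.0;
}

int main() {
    // Synthetic trace: a cache miss roughly every ninth instruction.
    std::vector<bool> trace(1000, false);
    for (size_t i = 0; i < trace.size(); i += 9) trace[i] = true;

    for (size_t w : {16u, 32u, 64u, 128u})
        std::printf("window %3zu: %.2f additional misses overlapped on average\n",
                    w, avg_overlap(trace, w));
    return 0;
}
```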


The following figures illustrate the degree of MLP obtained when running a database workload with different core architectures. In Figure 6, the y-axis shows the number of additional misses overlapped to memory from a 2-Mbyte unified L2 cache. On the x-axis, we show different window sizes and different degrees of speculation. The first bar in each pair represents a moderately speculative, out-of-order core; the second represents an aggressive issue policy. In particular, the aggressive policy allows stores to issue out of order with respect to other stores and allows branches to issue out of order with respect to other branches.

Figure 7 identifies the cause that stopped the processor from examining subsequent instructions for issue and thus prevented it from discovering more independent load misses for overlapping. Serialization caused by atomic instructions begins to dominate as the cause.

Hardware scout
To achieve high MLP and thus high single-thread performance, we must issue independent memory requests that can be found fairly far downstream from an initial load miss. Furthermore, our technique for doing this must avoid the area, complexity, and window size restrictions inherent in the traditional out-of-order approach.

A hardware scout processor achieves MLP by creating a resumable hardware checkpoint of program state on encountering a load miss. The core can then continue issuing instructions arbitrarily far along in the instruction stream until the memory subsystem returns the load data. When the data returns, the processor resumes execution from the checkpointed state. Naturally, memory operations that the processor reissues on the second pass have the advantage of having been previously prefetched into the primary caches.

The principal advantage to this approach is that the scouting distance is proportional to the number of cycles elapsed waiting for the load miss data. The size of on-chip structures—such as the size of an out-of-order issue window—does not limit scouting. Moreover, hardware scouting relies only on redeploying the existing pipeline resources, a more area-efficient approach than having an autonomous hardware prefetch engine.

Figure 8. Example of how a hardware-scouting processor handles a load miss.

In the example in Figure 8, we see an initial load to register 7 missing and becoming the cause of a checkpoint. The entire sequence of instructions in the gray area then issues speculatively. This approach uses a separate register file, called the shadow register file, to make speculative computational results available for further address computations while the main register file safely retains the checkpoint state.

Of course, a subsequent load might have an address that depends on the original load that missed. Or, for that matter, it might depend on a use of the original load miss, another load miss, a use of a use of a load that missed, or so on. Since bandwidth is precious, we do not want to issue memory operations having nonsensical addresses to the caches. To accomplish this, each architectural register has an associated not there (NT) bit. When a load misses, the NT bit is set for the destination register. Additionally, the NT bit is set for the destination register of any instruction that has a source register NT bit set. That is, the destination’s NT bit is assigned a logical OR of the source registers’ NT bits.

It is worth noting that the NT bits do not propagate indefinitely. Once a register is redefined and the processor reuses it in the program, it clears the register’s NT bit. This happens automatically because of the OR operation previously described. An instruction whose source registers are available (and is not itself a missing load) will clear the NT bit for the destination register regardless of whether it had been previously set.
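The not-there bookkeeping just described reduces to a one-bit OR per instruction. The following sketch is a simplified software illustration of that rule as we read it from the text, not the Rock hardware: registers are a flat array, the shadow register file is a copy made at the checkpoint, loads whose address registers are present are issued as prefetches, and a destination's NT bit is the OR of its source NT bits, also set if the issued load itself misses.

```cpp
// Sketch of "not there" (NT) bit propagation during hardware scouting, per the
// rule in the text: destination NT = OR of source NT bits; an issued load that
// misses also sets NT; redefining a register from present sources clears it.
// This is an illustrative simplification, not the actual Rock implementation.
#include <array>
#include <bitset>
#include <cstdio>
#include <vector>

constexpr int kRegs = 32;

struct Insn {
    int dst;
    std::vector<int> srcs;  // operand (or address) registers
    bool is_load;           // loads with known addresses are issued as prefetches
    bool misses;            // whether such an issued load misses the caches
};

int main() {
    std::array<long, kRegs> checkpoint{};         // architectural state, preserved
    std::array<long, kRegs> shadow = checkpoint;  // speculative results while scouting
    std::bitset<kRegs> nt;
    nt[7] = true;  // the load into r7 missed and triggered the checkpoint (Figure 8)

    const std::vector<Insn> scout_stream = {
        {8,  {7},      false, false},  // uses r7: NT[8] = NT[7] = 1
        {20, {1},      true,  true},   // address is present: prefetch issued, misses
        {21, {20, 20}, false, false},  // depends on the in-flight load: NT[21] = 1
        {22, {8},      true,  false},  // address register is NT: load not issued
        {8,  {2, 3},   false, false},  // r8 redefined from present sources: NT cleared
    };

    for (const Insn& in : scout_stream) {
        bool src_nt = false;
        for (int s : in.srcs)
            if (nt[s]) src_nt = true;
        if (in.is_load && !src_nt)
            std::printf("scout prefetch issued for load into r%d\n", in.dst);
        nt[in.dst] = src_nt || (in.is_load && in.misses);
        if (!nt[in.dst]) shadow[in.dst] = 1;  // placeholder speculative result
    }

    std::printf("NT bits set after scouting: %zu\n", nt.count());
    // When the data for the original miss returns, execution restarts from
    // `checkpoint`; everything computed into `shadow` is simply discarded.
    return 0;
}
```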



Besides load misses, hardware scouting ameliorates other conditions that are traditionally stalling in nature. For example, a full store buffer is a stall condition for both traditional in-order and out-of-order designs. This becomes, for us, just another reason to take a checkpoint and begin hardware scouting. Since the processor uses store addresses to prefetch cache lines, the scouting will help ensure that the store buffer can drain more quickly for stores that will be reached architecturally once execution resumes from the checkpoint.

Note that the computational results calculated while scouting never commit to architectural state. This permits the freedom to perform optimizations beneficial to the common cases without concern for the correctness of a rare situation. As an example, data from stores that fit in the store buffer bypass correctly to loads, while stores that overflowed the store buffer capacity do not. In effect, we are speculating that loads leading to address computation do not have read-after-write hazards with the overflowed stores.

The processor will encounter branches and other control transfer instructions while scouting. Many of these control dependencies are resolvable, that is, independent of outstanding load misses. Thus, regardless of whether the fetch unit predicted them correctly, the scout thread can proceed down the correct path. The mispredicted control transfer instructions result only in a redirection of the fetch unit, as is normally done.

Upon encountering a control dependency that is not resolvable (the NT bit is set on the condition code register), hardware scouting continues executing along the fetch-unit-predicted path. This usually results in following the correct path. Note that there are no correctness issues associated with executing down the wrong path since instructions only write to the shadow register file, and the processor will eventually restart execution from the checkpoint. In fact, our results show that the more difficult to predict branches are usually associated with small if-else programming constructs that quickly return scout execution to the correct path. The experimental simulation results described in this article do not include this effect because our trace-driven methodology does not contain the wrong-path instructions. Hence, we simply do not model scouting through these control dependencies.

Figure 9. Simulation results for the database workload, showing four classes of branch (a) and jump target (b) prediction.

Figure 10. Simulation results for SPECint2000, showing four classes of branch (a) and jump target (b) prediction.
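The branch policy above amounts to a small decision rule, sketched here in our own terms rather than as the actual fetch-unit logic: a branch whose condition codes are present is resolved outright and redirects fetch if it was mispredicted, while a branch whose condition codes carry the NT bit simply follows the prediction.

```cpp
// How a scouting core can treat a conditional branch, per the text: resolvable
// branches (condition codes present) override the predictor; unresolvable ones
// (condition codes "not there") follow the predicted path. Illustrative sketch
// of the stated policy, not Sun's implementation.
#include <cstdio>

struct BranchOutcome {
    bool taken;           // direction the scout thread follows
    bool redirect_fetch;  // true if the fetch unit must be steered
};

BranchOutcome scout_branch(bool cc_not_there, bool cc_taken, bool predicted_taken) {
    if (cc_not_there)
        return {predicted_taken, false};  // trust the branch predictor
    // Condition codes are available: resolve the branch; a wrong prediction
    // redirects the fetch unit exactly as it would during normal execution.
    return {cc_taken, cc_taken != predicted_taken};
}

int main() {
    BranchOutcome resolved   = scout_branch(false, true, false);  // mispredicted but resolvable
    BranchOutcome unresolved = scout_branch(true,  false, true);  // follows the prediction
    std::printf("resolvable branch:   taken=%d redirect=%d\n",
                resolved.taken, resolved.redirect_fetch);
    std::printf("unresolvable branch: taken=%d redirect=%d\n",
                unresolved.taken, unresolved.redirect_fetch);
    return 0;
}
```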


Atomic operations (lock acquisition) and other memory barriers are a significant performance inhibitor to traditional methods for extracting ILP. In contrast, a scout-capable processor can scout past these barriers and prefetch data close to the requesting processor. Because software locks are typically uncontested, it is valuable to request the lock-protected cache lines. This is especially true because it was likely a remote processor that last used the shared data. Thus, the loads shortly following the barrier are often fertile territory for discovering MLP.

Because control speculation is our primary possibility for miss-speculation, we have some simulation results showing how well the scouting hardware handles control dependencies. For these results, we collected an instruction trace of 1.2 billion instructions from a commercial database workload and 500 instruction trace samples containing 1 million instructions each for both SPECint2000 and SPECfp2000. Using simulations, we classified each branch or jump instruction into one of four categories: resolved, correctly predicted; unresolved, correctly predicted; resolved, mispredicted; and unresolved, mispredicted.

Branch prediction is very good overall, 87 to 98 percent correct. Furthermore, between 37 and 72 percent of branches are resolvable during speculative execution. Branch prediction for the unresolvable branches is 84 to 97 percent accurate. Overall then, the hardware can scout in the correct direction 90 percent of the time for the database workload, as Figure 9a shows; for the SPECint2000 and SPECfp2000 codes the prediction is even better, as Figures 10a and 11a show.

Figure 11. Simulation results for SPECfp2000, showing four classes of branch (a) and jump target (b) prediction.

The jump target predictions illustrated in Figures 9b, 10b, and 11b show that scouting is only 4 percent incorrect in the worst case. Of course, this is no accident. When architecting a hardware scout processor from a clean slate, you must consider the importance of accurate predictions, then size buffers accordingly. Moreover, you must pay attention to how predictions made while scouting interact with predictions made while running the main thread. As an example, it is advisable to make the return address stack part of the checkpoint to properly predict method returns after having scouted.

The histograms in Figure 12 show how deep into the instruction stream scouting can proceed. The height of each bar indicates how many scouting events resulted in executing the number of instructions shown on the x-axis. The scale on the x-axis is from 0 to 1,000 instructions. These histograms reveal three natural groupings in the modeled configuration; they correspond to the latencies of the L2 cache, the L3 cache, and memory. Clearly, the L2 cache satisfies the majority of cache misses, so we created the second graph to better show the clustering at the higher latencies. For the database workload, we see that the scouting depth is more smoothly distributed. Because this multithreaded program has more primary cache misses, queuing delays in the memory subsystem somewhat blur the latency clusterings. The second histogram for the database workload shows the y-axis scaled logarithmically. The last bar is artificially high because it includes all distances greater than 1,000 instructions.

Scouting can progress 500, 600, even 700 instructions or more. Naturally, this relates to the memory latency—the longer it takes to satisfy a load request, the farther execution can proceed. What is not immediately obvious is the fact that dependent instructions (instructions having one or more source registers NT) can issue faster than independent ones. Recall that the only operation required of a dependent instruction is the single-bit ORing of the source NT bits to set the destination NT bit. Thus, these instructions can group without regard to pipeline resources such as arithmetic logic units, memory pipes, and so on.



Figure 12. Histograms showing the incidence of scouting on the database workload, for all latencies plotted linearly (a) and plotted semi-logarithmically (b). Similar SPECint2000 histograms are for all latencies (c) and for just the higher latencies (d). These show the natural three latency groupings.

Additionally, the issue of dependent instructions can ignore all data dependencies. For example, if one source register is NT, there is no point in stalling to wait until a multicycle instruction produces the other source register. In fact, if one of a producer’s sources is NT, even producers and consumers can issue together in the same cycle by forwarding the NT result of the producer to the consumer. The value of relaxing the issue rules for dependent instructions is that the processor encounters load misses more quickly. This helps ensure that the memory system will have responded with the data in a timely fashion. If it requires N cycles to reach a missing load instruction during normal execution, the scouting hardware will reach that same instruction in fewer than N cycles. This allows more time for the data to return to the pipeline.

The key to high performance in modern server processors is the ability to overlap long cache misses. To quantify this effect, we consider the metric of MLP, which we define as the average number of outstanding memory requests over the time period when there is at least one request outstanding.12 Figure 13 shows the overall MLP for our benchmarks as well as the MLP when we consider only loads, stores, or prefetch instructions in isolation. These figures show the relative improvement in MLP as compared to a moderately aggressive, out-of-order processor.
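The MLP definition above translates directly into a short calculation. The sketch below uses an invented three-request trace purely for illustration; it averages the number of outstanding requests over only those cycles during which at least one request is in flight.

```cpp
// MLP as defined in the text: the average number of outstanding memory
// requests over the period when at least one request is outstanding.
// The request trace is invented for illustration.
#include <algorithm>
#include <cstdio>
#include <vector>

struct Request {
    long issue;     // cycle the miss leaves the core
    long complete;  // cycle the data returns
};

double mlp(const std::vector<Request>& reqs) {
    long last = 0;
    for (const Request& r : reqs) last = std::max(last, r.complete);
    long busy_cycles = 0, outstanding_sum = 0;
    for (long c = 0; c < last; ++c) {
        int out = 0;
        for (const Request& r : reqs)
            if (r.issue <= c && c < r.complete) ++out;  // outstanding at cycle c
        if (out > 0) { ++busy_cycles; outstanding_sum += out; }
    }
    return busy_cycles ? static_cast<double>(outstanding_sum) / busy_cycles : 0.0;
}

int main() {
    // Two overlapped misses followed by an isolated one, each 100 cycles long.
    std::vector<Request> trace = {{0, 100}, {10, 110}, {300, 400}};
    std::printf("MLP = %.2f\n", mlp(trace));  // about 1.43 for this trace
    return 0;
}
```

A processor that overlapped nothing would score exactly 1 by this metric; scouting raises the value by issuing additional misses while the first one is still outstanding.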


Clearly, scouting’s effect on MLP is dramatic. Overall, it increases MLP for the database workload by more than 20 percent; for SPECint2000, 32 percent; and for SPECfp2000, 70 percent. Considering loads alone, scouting increases MLP by more than 49 percent. Loads are arguably the most critical to overall performance, but stores also matter because they do accumulate in a finite-size store buffer. Again, we see that scouting increases the MLP of stores by approximately 20 percent. The figures also show prefetch MLP, but the result for the database workload is anomalous in that scouting decreases MLP. Our analysis shows that scouting finds the prefetches early and also ends them early. Furthermore, the number of prefetch instructions in the database trace is low, so the effect is actually small in absolute terms.

Figure 13. Percentage MLP improvement from scouting for the database (a), SPECint2000 (b), and SPECfp2000 (c) workloads.

The cycles spent in scouting benefit the architectural thread in several ways. First, as mentioned, load misses are issued to the memory subsystem and so you can think of scouting as a sophisticated hardware prefetch engine that not only prefetches exactly what the processor will need, but also does so through the reuse of existing execution resources. The processor can use this “prefetch engine” precisely when it is most useful—when the pipelines would otherwise stall. Moreover, the address computation uses the same instructions and register values that would otherwise later result in a load miss.

Secondly, hardware scouting primes other on-chip structures. It warms up the instruction cache, for example, with instructions that the processor will need in the immediate future. Thus, future instruction cache misses overlap with the current data cache miss. Additionally, the branch prediction array and jump target buffer both learn how to correctly predict upcoming control transfers.

For the case of a classic two-bit branch predictor, a designer ought to account for whether the branch is resolved speculatively or architecturally when updating the counter. In particular, the strongly taken backward branch at the end of a loop should not be updated to a not-taken state upon architecturally exiting the loop. (A naïve implementation might update this end-of-loop branch twice, once during scouting and again when reaching it architecturally. Such a double update might first update strongly taken to weakly taken and then to not-taken, an undesirable outcome.)

Scouting also results in higher performance. Figure 14 illustrates the improvement in IPC for various cache sizes. For example, the database workload shows a 40 percent improvement in performance for a 512-Kbyte L2 cache. Or, comparing points having approximately equal performance, a 1-Mbyte cache offers the performance of an 8-Mbyte one. Or, a 4-Mbyte cache offers the performance of a 16-Mbyte cache.

For SPECint2000, scouting yields a 12 percent performance boost for a 512-Kbyte cache, an attractive gain considering the low miss rate of these benchmarks. Similarly, a 1-Mbyte cache in a scouting processor is as effective as a 2-Mbyte cache in a processor incapable of scouting. The SPECfp2000 workload shows some of the most dramatic gains. With a 512-Kbyte cache, scouting increases SPECfp2000 performance by 34 percent. With a 1-Mbyte cache, the scouting processor’s performance exceeds that of a processor without scouting that has a 64-Mbyte L2 cache.
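The double-update hazard for a classic two-bit predictor, described in the parenthetical above, can be made concrete with a few lines. The policy shown (training the counter only on architectural resolution) is one possible answer consistent with the text, not necessarily the one used in Rock; the point is simply that a loop-exit branch seen once while scouting and once architecturally should move the counter a single step.

```cpp
// Two-bit saturating branch predictor whose counter is trained only when a
// branch resolves architecturally, so outcomes seen while scouting do not
// double-update it. The policy is illustrative, not Sun's implementation.
#include <cstdio>

enum Counter { kStrongNotTaken = 0, kWeakNotTaken, kWeakTaken, kStrongTaken };

struct TwoBitPredictor {
    int state = kStrongTaken;  // a well-trained end-of-loop backward branch
    bool predict() const { return state >= kWeakTaken; }
    void update(bool taken, bool architectural) {
        if (!architectural) return;  // ignore outcomes observed during scouting
        if (taken  && state < kStrongTaken)    ++state;
        if (!taken && state > kStrongNotTaken) --state;
    }
};

int main() {
    TwoBitPredictor p;
    // The loop-exit branch is seen not-taken once by the scout thread and once
    // when the main thread actually exits the loop.
    p.update(false, /*architectural=*/false);  // scout pass: no effect
    p.update(false, /*architectural=*/true);   // strongly taken -> weakly taken
    std::printf("next prediction: %s (counter state = %d)\n",
                p.predict() ? "taken" : "not taken", p.state);
    return 0;
}
```

A predictor that trained on both passes would end at not-taken and mispredict the next entry into the loop, which is exactly the undesirable outcome the text warns about.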



CMT processors offer a way to significantly improve the performance of computer systems. The return on investment for multithreading is among the highest in computer microarchitectural techniques. If you design a core from scratch to support multithreading, gains as high as 3× are possible for just a 20 percent increase in area. Few microarchitectural techniques can offer this level of improvement for the same additional area. Likewise, the availability of hundreds of millions to a few billion transistors permits building a network of multithreaded cores that can also offer excellent scalability. Here again, the key is to design the network with CMT in mind. In other words, the many wires and high frequency available on chip—as opposed to on a board or backplane—should dictate what the network will look like. If architected properly, linear scalability for a large set of applications is possible.

Figure 14. Normalized IPC improvement from scouting for the database (a), SPECint2000 (b), and SPECfp2000 (c) workloads. IPC improvement varies with cache size.

Even with throughput performance as the main target, we have shown that the microarchitecture necessary to support threads on a CMT can also achieve high single-thread performance. Hardware scouting, which Sun is implementing on the Rock microprocessor, can increase the single-thread performance of applications by up to 40 percent. Alternatively, scouting is a technique that makes the on-chip caches appear much larger, performance-wise. Finally, it is a performance robustness technique, making up for code tailored for different on-chip cache sizes or even a different number and levels of caches.


We have attempted to show that to fully exploit the promises of multithreading and multicores, you must architect for it from the very beginning. In the future, we will show how going down the path of CMT and hardware scouting leads to a new execution model that does not fit into the normal classification of in-order or out-of-order processors. We will show how to attain even higher single-thread performance by building on top of hardware scouting and combining some of our earlier work on speculative multithreading.2 MICRO

References
1. M. Tremblay, “MAJC: An Architecture for the New Millennium,” Hot Chips 11, 1999; http://www.hotchips.org/archives/hc11/3_Tue/hc99.s8.2.Tremblay.pdf.
2. M. Tremblay et al., “The MAJC Architecture: A Synthesis of Parallelism and Scalability,” IEEE Micro, vol. 20, no. 6, Nov.-Dec. 2000, pp. 12-25.
3. S. Kapil, “Gemini: A Power-Efficient Chip Multi-Threaded UltraSPARC Processor,” Hot Chips 15, 2003; http://www.hotchips.org/archive/hc15/pdf/12.sun.pdf.
4. Q. Jacobson, “UltraSPARC IV Processors,” Microprocessor Forum, In-Stat, 2003.
5. D. Greenley, “Sun UltraSPARC IV+ Processor,” Microprocessor Forum, In-Stat, 2004.
6. P. Kongetira, K. Aingaran, and K. Olukotun, “Niagara: A 32-Way Multithreaded Sparc Processor,” IEEE Micro, vol. 25, no. 2, Mar.-Apr. 2005, pp. 21-29.
8. J. Tendler et al., “Power4 System Microarchitecture,” IBM J. Research and Development, vol. 46, no. 1, Jan. 2002, pp. 5-25.
9. J. Clabes et al., “Design and Implementation of the Power5 Microprocessor,” Proc. 41st Ann. Conf. Design Automation (DAC 04), ACM Press, 2004, pp. 670-672.
10. S. Kunkel et al., “A Performance Methodology for Commercial Servers,” IBM J. Research and Development, vol. 44, no. 6, Nov. 2000, pp. 851-872.
11. K. Flautner et al., “Thread Level Parallelism and Interactive Performance of Desktop Applications,” ACM SIGPLAN Notices, vol. 35, no. 11, Nov. 2003, pp. 129-138.
12. Y. Chou, B. Fahs, and S. Abraham, “Microarchitecture Optimizations for Exploiting Memory-Level Parallelism,” Proc. 31st Ann. Int’l Symp. Computer Architecture (ISCA 04), IEEE Press, 2004, pp. 76-89.

Shailender Chaudhry is a distinguished engineer at Sun Microsystems where he is chief architect of the Rock processor. His experience includes work on the MAJC architecture, picoJava II, microJava, and space-time computing. His research interests include computer systems architecture, algorithms, and protocols. Chaudhry has a BS and an MS in computer engineering from Syracuse University.

Paul Caprioli is a senior staff engineer at Sun Microsystems where he is the Rock core architect and leads the performance modeling group. His research interests include computer architectures and high-performance technical computing. Caprioli has an MS and a PhD in mathematics from Rensselaer Polytechnic Institute.

Sherman Yip is an engineer at Sun Microsystems where he is a member of the Rock architecture team and responsible for performance modeling. His research interests include processor simulation, visualization, and Java performance and design. Yip has a BS in computer science from the University of California at Davis.

Marc Tremblay is a Sun Fellow and distinguished engineer at Sun Microsystems, where he is chief architect of the Scalable Systems Group. His research interests include chip multiprocessing, chip multithreading, speculative multithreading, and assist threading. Tremblay has an MS and a PhD in computer science from UCLA and a BS in physics engineering from Laval University in Canada. He holds 97 US patents in various areas of computer architecture. He is a member of the IEEE and the ACM.

Direct questions and comments about this article to Marc Tremblay, Sun Microsystems Inc., 430 North Mary Ave., Sunnyvale, CA 94085; [email protected].

