SCIENTIFIC PROGRAMMING

Editors: Konstantin Läufer, [email protected], and Konrad Hinsen, [email protected]

Why Modern CPUs Are Starving and What Can Be Done About It

By Francesc Alted

CPUs spend most of their time waiting for data to arrive. Identifying low-level bottlenecks—and learning how to ameliorate them—can save hours of frustration over poor performance in apparently well-written programs.

A well-documented trend shows that CPU speeds are increasing at a faster rate than memory speeds.1,2 Indeed, CPU performance has now outstripped memory performance to the point that current CPUs are starved for data, as memory I/O becomes the performance bottleneck.

This hasn't always been the case. Once upon a time, CPU and memory speeds evolved in parallel. For example, memory clock access in the early 1980s was at approximately 1 MHz, and memory and CPU speeds increased in tandem to reach speeds of 16 MHz by decade's end. By the early 1990s, however, CPU and memory speeds began to drift apart: memory speed increases began to level off, while CPU clock rates continued to skyrocket to 100 MHz and beyond.

It wasn't too long before CPU capabilities began to substantially outstrip memory performance. Consider this: a 100 MHz processor consumes a word from memory every 10 nanoseconds in a single clock tick. This rate is impossible to sustain even with present-day RAM, let alone with the RAM available when 100 MHz processors were state of the art. To address this mismatch, commodity chipmakers introduced the first on-chip cache.

But CPUs didn't stop at 100 MHz; by the start of the new millennium, processor speeds reached unparalleled extremes, hitting the magic 1 GHz figure. As a consequence, a huge abyss opened between the processors and the memory subsystem: CPUs had to wait up to 50 clock ticks for each memory read or write operation.

During the early and middle 2000s, the strong competition between Intel and AMD continued to drive CPU clock cycles faster and faster (up to 4 GHz). Again, the increased impedance mismatch with memory speeds forced vendors to introduce a second-level cache in CPUs. In the past five years, the size of this second-level cache grew rapidly, reaching 12 Mbytes in some instances.

Vendors started to realize that they couldn't keep raising the frequency forever, however, and thus dawned the multicore age. Programmers began scratching their heads, wondering how to take advantage of those shiny new and apparently innovative multicore machines. Today, the arrival of Intel i7 and AMD Phenom makes four-core on-chip CPUs the most common configuration. Of course, more processors means more demand for data, and vendors thus introduced a third-level cache.

So, here we are today: memory latency is still much greater than processor clock step (around 150 times greater or more) and has become an essential bottleneck over the past 20 years. Memory throughput is improving at a better rate than its latency, but it's also lagging behind processors (about 25 times slower). The result is that current CPUs are suffering from serious starvation: they're capable of consuming (much!) more data than the system can possibly deliver.

The Hierarchical Memory Model

Why, exactly, can't we improve memory latency and bandwidth to keep up with CPUs? The main reason is cost: it's prohibitively expensive to manufacture commodity SDRAM that can keep up with a modern processor. To make memory faster, we need motherboards with more wire layers, more complex ancillary logic, and (most importantly) the ability to run at higher frequencies. This additional complexity represents a much higher cost, which few are willing to pay. Moreover, raising the frequency implies pushing more voltage through the circuits. This causes the energy consumption to quickly skyrocket and more heat to be generated, which requires huge coolers in user machines. That's not practical.

To cope with memory limitations, architects introduced a hierarchy of CPU memory caches.3 Such caches are useful because they're closer to the processor (normally in the same die), which improves both latency and bandwidth. The faster they run, however, the smaller they must be, due mainly to energy dissipation problems. In response, the industry implemented several memory layers with different capabilities: lower-level caches (that is, those closer to the CPU) are faster but have reduced capacities and are best suited for performing computations; higher-level caches are slower but have higher capacity and are best suited for storage purposes.


[Figure 1 diagram: three memory hierarchies, (a) through (c), ordered from mechanical disk (highest capacity) down through main memory and cache levels to the CPU (highest speed).]

Figure 1. Evolution of the hierarchical memory model. (a) The primordial (and simplest) model; (b) the most common current implementation, which includes additional cache levels; and (c) a sensible guess at what's coming over the next decade: three levels of cache in the CPU and solid state disks lying between main memory and classical mechanical disks.

Figure 1 shows the evolution of this hierarchical memory model over time. The forthcoming (or should I say the present?) hierarchical model includes a minimum of six memory levels. Taking advantage of such a deep hierarchy isn't trivial at all, and programmers must grasp this fact if they want their code to run at an acceptable speed.

Techniques to Fight Data Starvation

Unlike the good old days when the processor was the main bottleneck, memory organization has now become the key factor in optimization. Although learning assembly language to get direct processor access is (relatively) easy, understanding how the hierarchical memory model works—and adapting your data structures accordingly—requires considerable knowledge and experience. Until we have languages that facilitate the development of programs that are aware of the memory hierarchy (for an example in progress, see the Sequoia project at www.stanford.edu/group/sequoia), programmers must learn how to deal with this problem at a fairly low level.4

There are some common techniques to deal with the CPU data-starvation problem in current hierarchical memory models. Most of them exploit the principles of temporal and spatial locality. In temporal locality, the target dataset is reused several times over a short period. The first time the dataset is accessed, the system must bring it to cache from slow memory; the next time, however, the processor will fetch it directly (and much more quickly) from the cache.

In spatial locality, the dataset is accessed sequentially from memory. In this case, circuits are designed to fetch memory elements that are clumped together much faster than if they're dispersed. In addition, specialized circuitry (even in current commodity hardware) offers prefetching—that is, it can look at memory-access patterns and predict when a certain chunk of data will be used and start to transfer it to cache before the CPU has actually asked for it. The net result is that the CPU can retrieve data much faster when spatial locality is properly used.

Programmers should exploit the optimizations inherent in temporal and spatial locality as much as possible. One generally useful technique that leverages these principles is the blocking technique (see Figure 2). When properly applied, the blocking technique guarantees that both spatial and temporal localities are exploited for maximum benefit.

Although the blocking technique is relatively simple in principle, it's less straightforward to implement in practice. For example, should the basic block fit in cache level one, two, or three? Or would it be better to fit it in main memory—which can be useful when computing large, disk-based datasets? Choosing from among these different possibilities is difficult, and there's no substitute for experimentation and empirical analysis.
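To make the blocking idea concrete, here is a small Python/NumPy sketch (mine, not code from this article) that evaluates an expression over a large array in chunks; the expression, the array sizes, and the 8,192-element block length are illustrative assumptions that you would tune empirically, as noted above.

    import numpy as np

    def blocked_evaluate(a, b, out, block_len=8192):
        """Compute out = 2*a + 3*b one cache-sized block at a time."""
        n = len(a)
        for start in range(0, n, block_len):
            stop = min(start + block_len, n)
            # Each chunk is read contiguously (spatial locality), and the
            # small temporaries for 2*a and 3*b stay in cache while they
            # are combined and written out (temporal locality).
            out[start:stop] = 2.0 * a[start:stop] + 3.0 * b[start:stop]

    if __name__ == "__main__":
        n = 10_000_000
        a, b, out = np.random.rand(n), np.random.rand(n), np.empty(n)
        blocked_evaluate(a, b, out)
        # The naive version streams three full-size temporaries through
        # main memory instead of reusing small blocks from cache.
        assert np.allclose(out, 2.0 * a + 3.0 * b)

With 8-byte floats, a block of 8,192 elements is 64 Kbytes per operand, which would typically target a level-1 or level-2 cache; larger blocks trade cache residency for less loop overhead.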
In general, it's always wise to use libraries that already leverage the blocking technique (and others) for achieving high performance; examples include Lapack (www.netlib.org/) and Numexpr (http://code.google.com/p/numexpr). Numexpr is a virtual machine written in Python and C that lets you evaluate potentially complex arithmetic expressions over arbitrarily large arrays. Using the blocking technique in combination with a specialized, just-in-time (JIT) compiler offers a good balance between cache and branch prediction, allowing optimal performance for many vector operations. However, for expressions involving matrices, Lapack is a better fit and can be used in almost any language. Also, compression is another field where the blocking technique can bring important advantages (see the sidebar, "Compression and Data Access").
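For comparison, this is roughly how the same expression would be handed to Numexpr, which performs the chunked evaluation (and compiles the expression) internally; the expression and sizes are again only illustrative.

    import numpy as np
    import numexpr as ne

    a = np.random.rand(10_000_000)
    b = np.random.rand(10_000_000)

    # Numexpr parses the string once, then evaluates it over cache-sized
    # blocks of a and b, avoiding full-size intermediate arrays.
    result = ne.evaluate("2.0*a + 3.0*b")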



Compression and Data Access

Over the past 10 years, it's been standard practice to use compression to accelerate the reading and writing of large datasets to and from disks. Optimizations based on compression leverage the fact that it's generally faster to read and write a small (compressed) dataset than a larger (uncompressed) one, even when accounting for (de)compression time. So, given the gap between processor and memory speed, can compression also accelerate data transfer from memory to the processor?

The new Blocking, Shuffling, and Compression (Blosc) library project uses compression to improve memory-access speed. Blosc is a lossless compressor for binary data that is optimized for speed rather than high compression ratios. It uses the blocking technique (which I describe in the main text) to reduce activity on the memory bus as much as possible. In addition, the shuffle algorithm maximizes the compression ratio of data stored in small blocks.

As preliminary benchmarks show, for highly compressible datasets, Blosc can effectively transmit compressed data from memory to CPU faster than it can transmit uncompressed data (see www.pytables.org/docs/StarvingCPUs.pdf). However, for datasets that compress poorly, transfer speeds still lag behind those of uncompressed datasets. As the gap between CPU and memory speed continues to widen, I expect Blosc to improve memory-to-CPU data transmission rates over an increasing range of datasets. You can find more information about Blosc at http://blosc.pytables.org.
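As a concrete illustration of the sidebar, the sketch below round-trips a NumPy buffer through Blosc using the python-blosc bindings; the package name and the typesize/clevel parameters are my assumptions about those bindings rather than anything stated in the sidebar, and the dataset is deliberately chosen to be highly compressible.

    import numpy as np
    import blosc  # python-blosc bindings for the Blosc library (assumed)

    # A smooth ramp of float64 values compresses very well after shuffling.
    data = np.linspace(0.0, 1.0, 10_000_000)
    raw = data.tobytes()

    # typesize=8 tells the shuffle filter the element width, so it can
    # regroup bytes of equal significance before compressing each block.
    packed = blosc.compress(raw, typesize=8, clevel=5)
    print(f"{len(raw)} bytes -> {len(packed)} bytes")

    restored = np.frombuffer(blosc.decompress(packed), dtype=data.dtype)
    assert np.array_equal(restored, data)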

[Figure 2 diagram: blocks of datasets A and B are staged through the CPU cache, operated on (C = A <operation> B), and the resulting blocks are written back to dataset C in memory.]

Figure 2. The blocking technique. The blocking approach exploits both spatial and temporal localities for maximum benefit by retrieving a contiguous block that fits into the CPU cache at each memory access, then operates upon it and reuses it as much as possible before writing the block back to memory.

Enter the Multicore Age

Ironically, even as the memory subsystem has increasingly lagged behind the processor over the past two decades, vendors have started to integrate several cores in the same CPU die, further exacerbating the problem. At this point, almost every new computer comes with several high-speed cores that share a single memory bus. Of course, this only starves the CPUs even more as several cores fight for scarce memory resources. To make this situation worse, you must synchronize the caches of different cores to ensure coherent access to memory, effectively increasing memory operations' latency even further.

To address this problem, the industry introduced the nonuniform memory access (NUMA) architecture in the 1990s. In NUMA, different memory banks are dedicated to different processors (typically in different physical sockets), thereby avoiding the performance hit that results when several processors attempt to address the same memory at the same time. Here, processors can access local memory quickly and remote memory more slowly. This can dramatically improve memory throughput as long as the data is localized to specific processes (and thus processors). However, NUMA makes the cost of moving data from one processor to another significantly more expensive. Its benefits are therefore limited to particular workloads—most notably on servers where the data is often associated with certain task groups. NUMA is less useful for accelerating parallel processes (unless memory access is optimized). Given this, multicore/NUMA programmers should realize that software techniques for improving memory access are even more important in this environment than in single-core processing.
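The practical rule here is to keep each worker's data on the memory bank next to the cores that use it. The Python sketch below (Linux only) illustrates one way to do that; the socket-to-core mapping is invented for illustration (check yours with lscpu), and real deployments would more likely rely on numactl or a NUMA-aware allocator.

    import os
    import numpy as np
    from multiprocessing import Process

    # Hypothetical mapping of CPU sockets to core IDs; adjust to your machine.
    SOCKET_CORES = {0: {0, 1, 2, 3}, 1: {4, 5, 6, 7}}

    def worker(socket_id, n):
        # Pin this process to one socket's cores (Linux only). With the
        # kernel's default first-touch policy, the pages initialized below
        # are then allocated on that socket's local memory bank.
        os.sched_setaffinity(0, SOCKET_CORES[socket_id])
        data = np.zeros(n)
        data += socket_id + 1   # first touch and all later accesses stay local
        print(f"socket {socket_id}: sum = {data.sum():.0f}")

    if __name__ == "__main__":
        procs = [Process(target=worker, args=(s, 5_000_000)) for s in SOCKET_CORES]
        for p in procs:
            p.start()
        for p in procs:
            p.join()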


CPUs and GPUs: Bound to Collaborate

Multicores aren't the latest challenge to memory access. In the last few years, Nvidia suddenly realized that its graphics cards (also called graphics processing units, or GPUs) were in effect small parallel supercomputers. The company therefore took the opportunity to improve the GPUs' existing shaders—used primarily to calculate rendering effects—and convert them into "stream" processors. They also created the Compute Unified Device Architecture (CUDA),5 a parallel programming model for GPUs that has proven to be so popular that a new standard, OpenCL (www.khronos.org/opencl), quickly emerged to help any heterogeneous mix of CPUs and GPUs in a system work together to solve problems faster.

You can seamlessly integrate the current generation of GPUs to run tens of thousands of threads simultaneously. Regarding the starving cores problem, you might wonder whether GPUs have any practical utility whatsoever given the memory bottleneck, or whether they're mostly marketing hype.

Fortunately, GPUs are radically different beasts than CPUs and traditional motherboards. One of the critical differences is that GPUs access memory over much better bandwidth (up to 10 times faster in some cases). This is because a GPU is designed to access memory in parallel over independent paths. One problem with this design, however, is that to take advantage of the tremendous bandwidth, all the available memory has to run at very high frequencies—effectively limiting the total amount of high-speed memory (up to 1 Gbyte on most current cards). In contrast, current commodity motherboards are designed to address a much greater amount of memory (up to 16 Gbytes or more).

You can compensate for the fact that GPUs access less memory (albeit at much faster speeds) by using new programming paradigms—such as OpenCL—that effectively combine GPUs with CPUs (which access much more memory at lower speeds). Of course, many problems remain to be solved before this combination is truly effective, from implementing specific hardware to enhance the CPU-to-GPU communication latency and bandwidth to developing new software that allows programmers to take advantage of this new computational model. Only time will tell whether this approach becomes mainstream, but both hardware and software developers are investing considerable effort in exploring it (see www.nvidia.com/object/cuda_home.html#).
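To make the CPU-plus-GPU division of labor concrete, here is a minimal sketch using the PyOpenCL bindings (my choice of wrapper; the article names only the OpenCL standard itself): the host keeps the large NumPy arrays in main memory, while the device runs a trivial element-wise kernel on buffers copied into its smaller but faster memory.

    import numpy as np
    import pyopencl as cl  # PyOpenCL bindings (assumed; not named in the article)

    a = np.random.rand(1_000_000).astype(np.float32)
    b = np.random.rand(1_000_000).astype(np.float32)

    ctx = cl.create_some_context()       # picks a GPU or CPU device
    queue = cl.CommandQueue(ctx)
    mf = cl.mem_flags

    # Copy operands into device memory (small but high-bandwidth on a GPU).
    a_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a)
    b_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=b)
    out_buf = cl.Buffer(ctx, mf.WRITE_ONLY, a.nbytes)

    program = cl.Program(ctx, """
    __kernel void lincomb(__global const float *a,
                          __global const float *b,
                          __global float *out) {
        int gid = get_global_id(0);
        out[gid] = 2.0f * a[gid] + 3.0f * b[gid];
    }
    """).build()

    # One work item per element; the device schedules thousands of them
    # concurrently to hide its own memory latency.
    program.lincomb(queue, a.shape, None, a_buf, b_buf, out_buf)

    out = np.empty_like(a)
    cl.enqueue_copy(queue, out, out_buf)  # bring the result back to host RAM
    assert np.allclose(out, 2.0 * a + 3.0 * b, rtol=1e-4)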
The gap between CPU and memory speeds is enormous and will continue to widen for the foreseeable future. And, over time, an increasing number of applications will be limited by memory access. The good news is that chip manufacturers and software developers are creating novel solutions to CPU starvation.

But vendors can't do this work alone. If computational scientists want to squeeze the most performance from their systems, they'll need more than just better hardware and more powerful compilers or profilers—they'll need a new way to look at their machines. In the new world order, data arrangement, not the code itself, will be central to program design.

Acknowledgments
I'm grateful to my friend, Scott Prater, for turning my writing into something that a reader might actually want to look at.

References
1. K. Dowd, "Memory Reference Optimizations," High Performance Computing, O'Reilly & Associates, 1993, pp. 213–233.
2. C. Dunphy, "RAM-o-Rama—Your Field Guide to Memory Types," Maximum PC, Mar. 2000, pp. 67–69.
3. R. van der Pas, Memory Hierarchy in Cache-Based Systems, tech. report, Sun Microsystems, Nov. 2002; www.sun.com/blueprints/1102/817-0742.pdf.
4. U. Drepper, What Every Programmer Should Know About Memory, tech. report, RedHat Inc., Nov. 2007; http://people.redhat.com/drepper/cpumemory.pdf.
5. T. Halfhill, "Parallel Processing with CUDA: Nvidia's High-Performance Computing Platform Uses Massive Multithreading," Microprocessor Report, vol. 22, no. 1, 2008; www.mdronline.com/mpr/h/2008/0128/220401.html.

Francesc Alted is a freelance developer and consultant in the high-performance databases field. He is also creator of the PyTables project (www.pytables.org), where he puts into practice lessons learned during years of fighting with [...]. Contact him at [email protected].
