Technical White Paper www.novell.com DATA CENTER

Optimizing Linux* for Dual-Core AMD Opteron™ Processors

Table of Contents:

SUSE Linux Enterprise and the AMD Opteron Platform
Dual-processor, Dual-core and Multi-core Definitions
Dual-core Background
AMD Specific
NUMA Background
NUMA Memory Access and Allocation in Linux Basics
What Are Characteristics of Workloads that Would Be Helped by Using Dual-core CPUs?
What Class of Parallel Workloads Can Specifically Take Advantage of Dual-core Systems?
How Can I Tell If My Workload or Application Would Benefit?
What Tools Exist to Tell Me Whether My Application Would Benefit?
How Can Applications Be Tuned for Dual-core?
What Tuning Is Possible to Control Performance on a System Using Dual-core CPUs?
How Does the Linux 2.6 CPU Scheduler Handle Dual-core CPUs?
What User Mechanisms, Commands or System Calls (e.g., taskset, numactl, etc.) Are Available to Allow Application Control on a System with Dual-core CPUs?
How Does Dual-core Interact with AMD PowerNow! Technology?
How Does Dual-core Interact with I/O and Async I/O?
Should Hyper-threading Be Disabled on Dual-core?
Conclusion
Glossary

SUSE® Linux Enterprise and the AMD Opteron™ Platform

Dual-core technologies are the latest evolution in processor design. Novell® enterprise products are ready to take advantage of state-of-the-art AMD64 dual-core technology. Read this white paper to learn how to optimize your Linux applications with SUSE® Linux on the AMD Opteron™ platform.

Dual-processor, Dual-core and Multi-core Definitions

Dual-processor (DP) systems are those that contain two separate physical computer processors in the same chassis. In dual-processor systems, the two processors can either be located on the same motherboard or on separate boards.

Native dual-core technology refers to a CPU (central processing unit) that includes two complete execution cores per physical processor. It combines two processors and their caches and cache controllers onto a single integrated circuit (silicon chip). It is basically two processors residing side by side on the same die.

A dual-core processor has many advantages, especially for those looking to boost their system's multitasking computing power. Since each core has its own cache, the operating system has sufficient resources to handle intensive tasks in parallel, which noticeably improves multitasking. Dual-core technology is quite often referred to as multi-core technology, yet multi-core can also mean more than two separate processors (e.g., quad-core).

Dual-core Background

As said before, a dual-core processor is comprised of two independent execution cores. From a software perspective, a dual-core processor appears much the same as a traditional multiprocessor system with two single-core chips (symmetric multiprocessing [SMP]), and all the basic multiprocessor optimization techniques apply. For the most part, single-core SMP-optimized applications will also be optimized for dual-core.

Dual-core differs from single-core SMP because memory access occurs per chip. This means that dual-core systems will have lower memory latency but also lower memory bandwidth. Each core on a dual-core AMD Opteron processor will have a 128 KB L1 cache (256 KB/chip) and a 1 MB L2 cache (2 MB/chip), but future versions may also provide a shared L3 cache, further improving performance. As the technology matures, operating systems might adopt scheduling algorithms that can take advantage of these differences to improve performance for each multiple-processor model.

Complete optimization for the dual-core processor requires both the operating system and the applications running on the computer to support a technology called thread-level parallelism (TLP). TLP is the part of the OS or application that runs multiple threads simultaneously, where threads refer to the parts of a program that can execute independently of other parts.

AMD Specific

Things are never quite as simple as they seem. As a first-order approximation, a system with two cores on one die will behave similarly to a system with two separate processors. However, to get the most performance out of the hardware, software might have to be aware of the location of each core.


AMD Opteron processors with Direct Connect Architecture, being a Non-Uniform Memory Access (NUMA) system architecture, have the concept of a node, and of local and remote memory. Each AMD Opteron processor is a node, and has local memory that it can access with the greatest speed. The local memory of one node is termed remote memory when accessed from another node. NUMA architectures allow software to transparently access local and remote memory; however, performance will be greatest when remote memory accesses are minimized.

The Direct Connect Architecture of the AMD Opteron processor implements an independent memory controller on each node (chip), whether it be single- or dual-core. This means that both cores of a dual-core chip share memory resources and have the same local memory, while two single-core AMD Opteron processors each have their own memory resources.

NUMA Background

NUMA technology is a scalable memory architecture for multiprocessor systems that has long been used by the most powerful servers and supercomputers. A NUMA API for Linux (www.novell.com/collateral/4621437/4621437.pdf) has a good introduction to NUMA technology and terminology. That paper also details the interfaces and methods used to fine-tune NUMA memory control on Linux, and will be referenced in regard to this subject later in this document.

NUMA Memory Access and Allocation in Linux Basics

Memory management of NUMA systems on Linux comprises a number of policies. The default policy for program heap memory (also called anonymous memory: e.g., malloc() memory) is so-called first touch. First touch means memory is allocated from the local memory of the CPU that first touched that memory.

A thread which calls malloc() to allocate memory will not actually cause physical memory pages to be allocated, only reserved. Physical memory is only allocated when the program reads from or writes to that newly malloc()-ed data. This may occur from a different thread or CPU than the one that called malloc()!

Once memory is allocated to a node, it will not move, even if the program moves to another CPU and all subsequent accesses to that memory are "remote". For this reason it can be important for performance that thread and process migration between nodes be minimized, and that memory placement be carefully considered and tuned.

What Are Characteristics of Workloads that Would Be Helped by Using Dual-core CPUs?

For a workload to benefit from using dual-core CPUs, it must first and foremost be able to run in parallel (be parallelizable). This is a fundamental requirement of any multiprocessor system.

Any workload can basically be divided into a serial component and a parallel component, such that the parallel component may be executed and run by any number of processors while the serial component can only be run on a single processor. A workload is said to be able to run in parallel (parallelizable) if the parallel component dominates. A workload that can only run on one processor has a zero-sized parallel component and is called serial.

A workload for a given system might be parallelized in one of two ways: either it is running an application that is specifically written to take advantage of a multiprocessor system, or it is running multiple independent or semi-independent applications that may or may not be individually parallelized.

What Class of Parallel Workloads Can Specifically Take Advantage of Dual-core Systems?

Dual-core systems increase some specific types of computing resources in a system versus single-core chips. Workloads where the computing resource causing a bottleneck is one of those increased by dual-core chips will benefit from dual-core systems.

Remember that both cores of a Dual-Core AMD Opteron processor share the same memory controller. Dual-core chips will double the basic CPU processing capability of the system, and also double the cache size. They do not introduce more memory subsystem resources (memory bandwidth). Thus, the main applications that will benefit from dual-core processors are those that are compute bound and have spare memory bandwidth available for the second core.

How Can I Tell If My Workload or Application Would Benefit?

As outlined above, a good first-order estimation of the performance characteristics of moving to dual-core chips is simply doubling the number of single-core chips. If your workload is already taking good advantage of multiple processors, it will likely benefit from dual-core chips. However, if the workload is very memory intensive, this test won't be very indicative.

If your application or workload has inherent limits in its parallelization (for example, if only four processes are run at once or if there is a hot lock hindering scalability), then adding more cores may not help even if your application is not at all memory intensive.

What Tools Exist to Tell Me Whether My Application Would Benefit?

On Linux, some basic monitoring programs like vmstat and top can be used to see whether your system is compute bound. If idle plus iowait combined (or just idle, if there is no iowait field) is approaching 0 for some or most of the time, then the system could be compute bound.

These programs can also indicate the number of runnable (executable) processes in the system. For multiple cores to be of use, there need to be processes available to run on them. The caveat here is that some algorithms and applications tune themselves automatically to utilize the available CPUs. This is common in, but not limited to, high-performance computing (HPC) applications.


In the vmstat output for such a test, the first field (the run queue) indicates there were an average of four processes running, and the system is completely compute bound. This gives a preliminary indication that such a workload could utilize up to four cores (e.g., two dual-core processors), but will not benefit from more cores.

Using OProfile (http://oprofile.sourceforge.net/news/) to monitor memory controller activity can give an indication of whether or not an application is memory bound. The performance counters CPU_CLK_UNHALTED and MEM_PAGE_ACCESS can be used together to assess the number of memory accesses per clock cycle that the system is not idle.

OProfile yields the following output when profiling a kernel compile on a two-way AMD Opteron processor-based system with single-core CPUs:

Note that MEM_PAGE_ACCESS events are in units of 1,000, while CPU_CLK_UNHALTED events are in units of 10,000, and the numbers are cumulative for all cores.

After doing some calculations, we can derive that this workload is causing one memory access per 119 work cycles. We know that compilation with gcc tends not to be memory bound, with lots of work being done from the cache, so we estimate that 119 is indicative of a workload that is not memory bound on this platform.

Next, a memory-intensive workload, STREAM (www.cs.virginia.edu/stream/), is profiled:

STREAM is memory bound by design, and we can see it causing one memory access per 14 cycles. That is almost 10x more memory intensive than the kernel compile. We can estimate that 14 is indicative of a memory-bound application on this platform.

How well does dual-core performance correlate with the numbers we have?

Figure 1. Relative Dual-core Performance

So we can see that moving to two dual-core CPUs has improved kernel compilation times by a factor of 1.91, which is as good as moving to four single-core CPUs. On the other hand, STREAM performance was basically unchanged when moving to dual-core CPUs, in contrast to a very significant gain with the move to four single-core CPUs (due to their greater memory resources).

The numbers '119' and '14' will vary from system to system, so a good strategy is to get two baseline results from known workloads such as STREAM and gcc, against which you can compare your workload.


How Can Applications Be Tuned for Dual-core?

Many SMP-capable applications should see a performance improvement when moving from single- to dual-core processors, without any specific tuning.

The most basic thing to keep in mind is to ensure good data access patterns and thereby reduce the memory controller load. This kind of optimization is usually applicable to all software.

Specifically, tuning an application for dual-core systems requires an understanding of the performance differences involved with running tasks on the same core, the other core of the same chip, or a different physical chip.

Tasks that are independent of one another in terms of what memory they operate on, and in their level of synchronization or communication, are ideal candidates to place on different physical chips. This will maximize utilization of memory resources.

Tasks that are closely coupled, operating on the same memory or doing a lot of synchronization or communication, should be placed together, either on each core of the same chip or both on the same core.

Putting two or more closely coupled tasks on the same core should be carefully weighed against the possible negative impact of allowing other cores to go idle. If there are two pairs of such tasks running on a four-core system, the best performance will usually be with each of the four tasks running on its own core. In the case of four pairs of tasks on the same four-core system, having each pair of tasks together on its own core will usually be best.

Sub-problems which weren't worth parallelizing on multiprocessor AMD Opteron processor-based systems, due to the cost of synchronization or remote memory accesses, may be worth parallelizing if threads are bound to the cores of a single physical chip, due to the lower latency of local memory access from both cores on a chip.

Explicitly binding processes or threads to CPUs and memory to nodes might be useful in getting the full potential from a dual-core system.

What Tuning Is Possible to Control Performance on a System Using Dual-core CPUs?

In Linux, the CPU scheduler behavior itself is not tunable at runtime. The philosophy is that performance should be good enough in all cases that specific tuning isn't needed. That said, there are cases where the scheduler won't make optimal choices. These can be handled by explicit binding of tasks to CPUs.

Explicit CPU or memory binding may help, although it reduces the effectiveness of the CPU scheduler's multiprocessor balancer. It would be easy to reduce performance unless the system behavior is very well understood.

How Does the Linux 2.6 CPU Scheduler Handle Dual-core CPUs?

Exact behavior differs depending on the version of Linux used. Generally, behavior is better in more recent kernels.

There are a number of "behavioral goals" shared by all multiprocessor systems, including those with dual-core chips (even a system with one dual-core chip is a multiprocessor):

Utilize the Maximum Amount of CPU Resources Available

In a dual-core system it is possible for one core to be completely idle while the other core has two processes running on it. (Actually, only one is running at any given time, with the other "runnable", i.e., executable.)

Linux's scheduler attempts to keep all CPUs busy by moving runnable processes to idle CPUs and keeping the number of running tasks on each core in reasonable balance.

Minimize the Amount of Process Movement Between CPUs

When a process runs on a CPU, it builds up a footprint in cache of its most frequently used memory. If a task has to be moved to another CPU, it loses the benefit of this cache and must refill from memory.

On NUMA systems like AMD Opteron processor-based systems with more than one physical chip, moving a process to another node means that all the local memory it had allocated while running on the original node is now effectively remote memory.

So, moving processes between CPUs is always costly, but the cost ranges from hugely expensive when moving between NUMA nodes to fairly small when moving between the cores of a dual-core chip.

Minimizing process movement directly conflicts with the goal of maximizing CPU utilization, so a balance must be struck. The Linux scheduler is able to recognize the topology of a system (cores, chips, nodes, etc.) and allows tasks to move more freely between the cores of a chip than it does between nodes, thus maximizing CPU utilization without moving tasks away from their local memory.

What User Mechanisms, Commands or System Calls (e.g., taskset, numactl, etc.) Are Available to Allow Application Control on a System with Dual-core CPUs?

CPU Binding

Processes can be bound to any subset of the CPUs in the system by modifying their CPU affinity mask. A child process inherits the CPU affinity of its parent. In this way, one application (all its processes and threads) can be made to run on a specific set of CPUs, and another application in the same system can run on another (possibly intersecting) set of CPUs.

The command 'taskset' controls which CPUs an application may run on. taskset can modify the CPU affinity of a currently running process, or run a new command with a given CPU affinity. For full details, view the man page.

taskset example:

'taskset -c 0-3,7 startserver'

This invocation will run the command 'startserver' bound to the given set of CPUs (0, 1, 2, 3, 7).

The 'sched_setaffinity' and 'sched_getaffinity' system calls are the programmatic interface to set task CPU affinities. They are the interfaces taskset uses, and can perform any of the same functions. The man page has full API details.

Problem with CPU Affinity

The main problem of CPU binding with affinities is that it can cause Linux's multiprocessor CPU scheduler to become inefficient and possibly stop working completely.

This problem will become noticeable if there are many processes bound to a small set of CPUs, causing those CPUs to be much more highly loaded than other CPUs in the system. This situation will cause load balancing between the less loaded CPUs to stop functioning or function less efficiently.

This problem is not an issue if all processes in the system are carefully bound; however, CPU affinity is not a good tool for more ad hoc administration. For example, for assigning all low-priority batch tasks to a single CPU, 'cpusets' can be a better option.


cpusets

cpusets allow a machine to be partitioned into subsets. These sets may overlap, and in that case they suffer from the same problem as CPU affinities. However, when the cpusets do not overlap, they can be switched to "exclusive" mode, which effectively causes the scheduler to behave as though they are two entirely different systems. As such, scheduling behavior is very good; however, an idle cpuset will remain idle even while others are very loaded.

cpusets can also be used to bind groups of tasks to specific memory nodes. cpuset documentation can be found in Documentation/cpusets.txt in the Linux kernel source.

Memory Binding

Linux supports explicit control and binding of tasks to NUMA memory nodes in a similar manner to CPU binding. A NUMA API for Linux (www.novell.com/collateral/4621437/4621437.pdf), referenced earlier, provides a good introduction to memory binding and control facilities in Linux.

How Does Dual-core Interact with AMD PowerNow! Technology?

In a dual-core system, both cores of each chip will always run at the same frequency. This means the frequency will only be lowered when both cores are idle. If power saving is a priority on a multiprocessor system, then both cores of a chip should be highly utilized before any load is moved to another chip.

The Linux kernel scheduler currently will not schedule tasks with power saving as a goal. In fact, Linux tries to make use of the AMD Opteron processor's memory resources where possible, which can be in direct conflict with a power-saving scheduling strategy. For example, when a new process is forked and calls exec(), Linux will try to move the process to an idle chip even if the other core of the current chip is idle. Such a strategy is usually the right one for best performance, but not for power saving.

Note: AMD Opteron processor-based systems with one dual-core chip are not affected by this trade-off.

Explicit CPU affinities may be required in order to achieve the best power usage for a given workload. Generally, this won't be a concern for servers and workstations; however, for specialized applications such as embedded systems or clusters, power usage can be a factor.

As described above, it is difficult for applications to control CPU affinities in the presence of changing demand or CPU usage patterns. Careful analysis of the workload and CPU usage patterns will be required, and such a task may simply be infeasible.

How Does Dual-core Interact with I/O and Async I/O?

I/O and async I/O are typically very kernel-intensive activities. All recent versions of the Linux kernel (including SUSE Linux Enterprise Server 9 from Novell, and the upcoming SUSE Linux Enterprise Server 10) are generally very well parallelized and NUMA aware. As such, the I/O capability of a system is likely to be improved with dual-core systems, though exact behavior is always workload dependent.

Use of NUMA and multiprocessor tuning techniques in the application, as described above, will often assist the kernel in taking advantage of dual-core chips (and multiprocessor and NUMA systems in general).

Should Hyper-threading Be Disabled on Dual-core?

The question of whether hyper-threading should be disabled on dual-core is difficult to answer.

p. 9 It will be very dependent on the workload and ownership (TCO), and maximum processor the CPU and system architectures in ques- performance. The AMD Opteron processor tion. The POWER5 chip is dual-core and has with Direct Connect Architecture enables one multi-threading (similar to hyperthreading). platform to meet the needs of multi-tasking IBM often submits the commercial perform- environments, providing platform longevity. ance results of its POWER5-based systems with their version of multi-threading turned on. Dual or multi-core processors can immedi- ately benefit businesses and general con- On chips with hyperthreading, perform- sumers by providing the capability to run ance gains are often reported on workloads multimedia and security applications with where there is a significant floating-point increased performance. With multi-core element. processors, a new era of true multitasking emerges. AMD ™ 64 X2 Dual-Core Similarly, for single-core systems, the best processors will take computing to a new advice is to test both hyperthreading enabled level by enabling people to simultaneously and disabled to see which performs better burn a CD, check e-mail, edit digital photos for your specific workload. and run virus protection software—all with increased performance. Multi-core proces- Conclusion sors also will enable the growth of the digital home: a centralized PC will be able to serve The introduction of dual- or multi-core multiple rooms and people in the home. technology will change commercial and consumer computing while offering new Software professionals regularly push the opportunities for software developers. limits of current processor capacity. 
Multi-core The evolution to multi-core processors is processors will solve many of the challenges an exciting technological advancement that currently facing software designers by deliv- will play a central role in driving relevant ering significant performance increases at advancements, providing greater security, a time when they need it most. Multi-core resource utilization and value for businesses processors, in combination with new compiler and consumers. The client and consumer optimizers, will reduce compiling times by markets will have access to superior perform- as much as 50 percent, giving developers a ance and efficiency compared to single-core critical advantage in meeting time-to-market processors, as the next generation of software demands. Software vendors also can use applications are developed to take advantage more multi-threaded design methods for of multi-core processor technology. delivering enhanced features. With the advent of multi-core processors and adoption of Corporate IT systems currently optimized for multi-core computer platforms by businesses SMP multi-threaded applications should see and consumers, software vendors will have a significant performance increases by using much larger marketplace in which to distrib- dual- or multi-core processors. This logical ute new and improved applications. performance boost will take place within cur- rent, available hardware and socket designs, Novell has optimized its SUSE Linux enter- enabling corporate IT managers to add more prise products for AMD64 dual-core technol- sophisticated system layers, like virtualization ogy. Your customers will be able to extract and security, without significant disruption the most in performance and reliability when to legacy systems. Other key benefits are they run your applications on SUSE Linux simplified manageability, lower total cost of and the AMD Opteron platform.


Glossary

Key Terms to Understanding Dual-core:

Cache
A cache is a pool of entries or collection of data duplicating original values stored elsewhere or computed earlier.

Chip
The package that contains one or more cores.

Compute bound
If the performance of an application is limited by a system's CPU resources, it is called compute bound.

Core
The basic unit of execution.

CPU
CPU is the abbreviation of "central processing unit," and is pronounced as separate letters. The CPU is the brains of the computer. Sometimes referred to simply as the processor or central processor, the CPU is where most calculations take place.

Dual-core
Dual-core refers to a CPU that includes two complete execution cores per physical processor.

Integrated circuit
Another name for a chip, an integrated circuit (IC) is a small electronic device made out of a semiconductor material.

Memory bound
If performance is limited by a system's memory bandwidth, it is called memory bound. (Memory bound often refers to applications with a working set too big to fit in physical memory, but that is not the case here.)

Node
A node is a basic unit used to build data structures, such as linked lists and trees. In this paper, however, "node" refers to a NUMA node: a processor chip together with its local memory.

NUMA
Non-Uniform Memory Access (NUMA) is a computer memory design used in multiprocessors, where the memory access time depends on the memory location relative to a processor. Under NUMA, a processor can access its own local memory faster than non-local memory, i.e., memory that is local to another processor or shared between processors.

Processor
"Processor" is the shortened term for microprocessor or central processing unit (CPU). The term can reference a chip or a core (e.g., Linux calls each core a CPU).

SMP
Symmetric multiprocessing (SMP) is the processing of program code by multiple processors that share a common operating system and memory subsystem.

SUSE Linux Enterprise Server 9 from Novell features the most advanced Linux technology available and can support the services, applications and databases that drive your business.

www.novell.com

Contact your local Novell office, or call Novell at:

1 888 321 4272 U.S./Canada
1 801 861 4272 Worldwide
1 801 861 8473 Facsimile

Novell, Inc. 404 Wyman Street Waltham, MA 02451 USA

462-002016-001 | 05/06 | © 2006 Novell, Inc. All rights reserved. This material may be distributed only subject to the terms and conditions set forth in the Open Publication License, v1.0, 1999 or later. Novell, the Novell logo, the N logo and SUSE are registered trademarks of Novell, Inc. in the United States and other countries. AMD, Athlon and Opteron are trademarks of Advanced Micro Devices, Inc.

*Linux is a registered trademark of Linus Torvalds. All other third-party trademarks are the property of their respective owners.

Novell would like to thank AMD for their contributions to and their support of the creation of this document.