

Maximizing System x and ThinkServer Performance with a Balanced Memory Configuration

Last Update: October 2017

Introduces three balanced memory guidelines for Intel Xeon processors

Compares the performance of balanced and unbalanced memory configurations

Explains memory interleaving and its importance

Provides tips on how to balance memory

Dan Colglazier
Joseph Jakubowski
Tristian “Truth” Brown

Abstract

Configuring a server with balanced memory is important for maximizing its memory bandwidth and overall performance. Lenovo servers running Intel Xeon processors have 4 or 8 memory channels per processor and support up to three DIMMs per channel, so it is important to understand what is considered a balanced configuration and what is not.

In this paper, we introduce three balanced memory guidelines that will help you select a balanced memory configuration. Balanced and unbalanced memory configurations are presented along with their relative measured memory bandwidths to show the effect of unbalanced memory. Suggestions are also provided on how to produce balanced memory configurations.

This paper is for System x and ThinkServer customers and for business partners and sellers wishing to understand how to maximize the performance of Lenovo servers.

At Lenovo Press, we bring together experts to produce technical publications around topics of importance to you, providing information and best practices for using Lenovo products and solutions to solve IT challenges.

See a list of our most recent publications at the Lenovo Press web site: http://lenovopress.com

Do you have the latest version? We update our papers from time to time, so check whether you have the latest version of this document by clicking the Check for Updates button on the front page of the PDF. Pressing this button will take you to a web page that will tell you if you are reading the latest version of the document and give you a link to the latest if needed. While you’re there, you can also sign up to get notified via email whenever we make an update.

Contents

Introduction ...... 3
Memory interleaving ...... 3
Balanced memory configurations ...... 4
About the STREAM test ...... 5
Memory topology ...... 5
Applying the balanced memory guidelines - E5 processors ...... 6
Applying the balanced memory guidelines - E7 processors ...... 11
Maximizing memory bandwidth ...... 15
Cluster on Die option impact on memory bandwidth ...... 15
Summary ...... 16
Change history ...... 16
Authors ...... 17
Notices ...... 18
Trademarks ...... 19

Introduction

The memory subsystem is a key component of the Intel Xeon architecture of Lenovo® servers, which can affect overall server performance. When properly configured, the memory subsystem can deliver extremely high memory bandwidth and low memory access latency. When the memory subsystem is incorrectly configured, however, the memory bandwidth available to the server can become limited and overall server performance can be reduced.

This brief explains the concept of balanced memory configurations that yield the highest possible memory bandwidth from the Intel Xeon architecture. Examples of balanced and unbalanced memory configurations are shown to illustrate their effect on memory subsystem performance.

This paper is applicable to servers based on Intel Xeon E5 v4 and upcoming E7 v4 processors as well as the previous generation Xeon E5 v3 and E7 v3 processor families.

Memory interleaving

Access to the information stored on memory DIMMs is controlled by the memory controller(s) within the Intel Xeon processor. Depending on the specific processor, one or two memory controllers are present in each processor. Each memory controller is attached to memory channels that are connected to the physical slot connectors that interface to memory DIMMs.

Figure 1 on page 3 illustrates how an Intel E5 processor’s memory controllers are connected to DIMM slots.


Figure 1 Layout of the Intel Xeon E5 memory controllers and DIMM slots

The Xeon processor optimizes memory accesses by creating interleave sets across the memory controllers and/or memory channels. For example, if identical DIMMs are populated on both memory channels attached to a memory controller, the memory controller creates a 2-way interleave set across both DIMMs. Figure 2 on page 4 shows two such interleave sets.


Figure 2 Two-way interleave sets (four identical DIMMs)

Interleaving enables higher memory bandwidth by spreading contiguous memory accesses across both memory channels rather than sending all memory accesses to one memory channel or the other.
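To make the interleaving idea concrete, the following short Python sketch (illustrative only, not code from Lenovo or Intel) models a round-robin interleave at cache-line granularity. Real memory controllers use more elaborate address-mapping schemes, so treat this purely as a mental model of why sequential accesses engage every channel.

# Illustrative model: map consecutive cache-line addresses onto the
# channels of a 2-way interleave set in round-robin fashion.
CACHE_LINE = 64  # bytes per cache line

def channel_for(address, num_channels=2):
    # Round-robin assignment at cache-line granularity.
    return (address // CACHE_LINE) % num_channels

# Eight consecutive cache lines alternate between the two channels, so a
# sequential stream draws bandwidth from both channels at once.
for line in range(8):
    addr = line * CACHE_LINE
    print(f"address {addr:#06x} -> channel {channel_for(addr)}")

With four channels interleaved (two per controller, as described below), the same model assigns every fourth cache line to each channel: channel_for(addr, num_channels=4).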

If DIMMs with different memory capacities are populated on the memory channels attached to a memory controller, or if different numbers of identical-capacity DIMMs are populated on the memory channels, the memory controller has to create multiple interleave sets. Managing multiple interleave sets creates overhead for the memory controller, which can reduce memory bandwidth.

Figure 2 illustrates a 2-way memory interleave set on an Intel E5 processor which results from populating identical memory DIMMs on each memory channel. These 2-way memory interleave sets can be further interleaved across the memory controllers. Consecutive addresses would alternate between the memory controllers with every fourth address going to each memory channel.

Within a memory channel, a second level of interleaving, called rank interleaving, can occur. A memory rank is a block of data created from the memory chips on a memory DIMM. A memory rank is typically 64 bits wide; if ECC is supported, an additional 8 bits are added for a total of 72 bits. A DIMM may contain multiple memory ranks: one, two, and four ranks per DIMM are common. Memory rank interleaving is optimal when each DIMM on a memory channel has the same number of memory ranks, for example, when each DIMM on a memory channel is a 2-rank DIMM.

When installing memory DIMMs into your server, follow the DIMM installation sequence for your particular server. The configurations shown in this paper did not always follow the installation sequences for the servers on which they were implemented and measured, as a number of these configurations were put together just for demonstration purposes.

Balanced memory configurations

Balanced memory configurations enable optimal interleaving across all attached memory channels so memory bandwidth is maximized. Bandwidth can be optimized if both memory controllers on the same physical processor socket are identically configured. System level memory performance can be further optimized if each physical processor socket has the same physical memory capacity.

As a result, the basic guidelines for a balanced memory subsystem are as follows:
1. All populated memory channels should have the same total memory capacity and the same total number of ranks.
2. All memory controllers on a processor socket should have the same configuration of memory DIMMs.
3. All processor sockets on the same physical server should have the same configuration of memory DIMMs.

Tip: We refer to the three guidelines above throughout this paper as balanced memory guidelines 1, 2 and 3.
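These guidelines lend themselves to a mechanical check. The sketch below is a hypothetical helper, not a Lenovo tool: it represents a layout as nested lists (sockets, then controllers, then channels), with each DIMM given as a (capacity_gb, ranks) tuple, and tests the three guidelines in order.

# Hypothetical checker: test a DIMM layout against the three balanced
# memory guidelines. Layout: sockets -> controllers -> channels ->
# list of (capacity_gb, ranks) DIMM tuples.
def is_balanced(sockets):
    # Guideline 3: every socket has the same DIMM configuration.
    if any(s != sockets[0] for s in sockets):
        return False
    for controllers in sockets:
        # Guideline 2: every controller on the socket has the same layout.
        if any(c != controllers[0] for c in controllers):
            return False
        # Guideline 1: every populated channel has the same total capacity
        # and the same total number of ranks.
        populated = [ch for ctrl in controllers for ch in ctrl if ch]
        totals = [(sum(cap for cap, _ in ch), sum(rk for _, rk in ch))
                  for ch in populated]
        if totals and any(t != totals[0] for t in totals):
            return False
    return True

D = (16, 2)  # one 16 GB dual-rank DIMM
print(is_balanced([[[[D], [D]], [[D], [D]]]]))     # 1:1:1:1 -> True
print(is_balanced([[[[D, D], [D]], [[D], [D]]]]))  # 2:1:1:1 -> False

The 1:1:1:1 and 2:1:1:1 examples here correspond to the configurations examined in the sections that follow.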

About the STREAM test

STREAM Triad is a simple, synthetic benchmark designed to measure sustainable memory bandwidth. Its intent is to measure the best memory bandwidth available. STREAM Triad will be used to measure the sustained memory bandwidth of various memory configurations to see the effect of unbalanced memory configurations on memory bandwidth.

For more information about STREAM Triad, see: http://www.cs.virginia.edu/stream/
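As a rough illustration of what Triad measures, here is a toy NumPy version of the kernel a(i) = b(i) + q*c(i). It is only a sketch: the official benchmark is a carefully tuned C program, and numbers from this Python version are not comparable to published STREAM results.

# Toy STREAM-Triad-style measurement (illustrative only).
import time
import numpy as np

N = 20_000_000                  # array length; far larger than any cache
b = np.random.rand(N)
c = np.random.rand(N)
q = 3.0

start = time.perf_counter()
a = b + q * c                   # the Triad kernel
elapsed = time.perf_counter() - start

# Triad touches three arrays of 8-byte doubles: read b, read c, write a.
gbytes = 3 * N * 8 / 1e9
print(f"Triad bandwidth ~ {gbytes / elapsed:.1f} GB/s")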

Memory topology

In this paper we are testing with a server that has an E5-2600 v3 processor with two memory controllers; however, the results apply equally to servers with E5-2600 v4 processors.

Low-end Xeon E5 processors: Intel Xeon E5 v3 processors with fewer than 10 cores and E5 v4 processors with fewer than 12 cores have only one memory controller in the processor package, with all four memory channels connected to that one memory controller. As a result, balanced memory guideline 2 does not apply to processors with only one memory controller.

To illustrate various memory topologies for a processor with two memory controllers, different memory configurations will be designated as A:B:C:D, where each letter indicates the number of DIMMs populated on each memory channel:
• A refers to Memory Channel 0 on Memory Controller 0
• B refers to Memory Channel 1 on Memory Controller 0
• C refers to Memory Channel 0 on Memory Controller 1
• D refers to Memory Channel 1 on Memory Controller 1

This designation is shown in Figure 3 on page 6.


Figure 3 Channel designation

As an example, a 3:2:1:0 memory configuration has:
• 3 DIMMs on Memory Channel 0 on Memory Controller 0
• 2 DIMMs on Memory Channel 1 on Memory Controller 0
• 1 DIMM on Memory Channel 0 on Memory Controller 1
• 0 DIMMs on Memory Channel 1 on Memory Controller 1

Applying the balanced memory guidelines - E5 processors

We will start with the assumption that balanced memory guideline 3 described in “Balanced memory configurations” on page 4 is followed, so that all processor sockets on the same physical server have the same configuration of memory DIMMs. Therefore, we only have to look at one processor socket for each memory configuration.

Configuration of 4 DIMMs - balanced

We will start with one DIMM in each memory channel, which yields the 1:1:1:1 memory configuration shown in Figure 4.


Figure 4 One memory DIMM in each channel (STREAM Triad relative memory bandwidth = 100)

This is a balanced memory configuration as it follows balanced memory guideline 1 with all memory channels having 16 GB of memory capacity and two ranks total. It also follows balanced memory guideline 2 with both memory controllers having the same configuration of DIMMs.

Interleaving can be done across the memory controllers and across both pairs of memory channels. The 1:1:1:1 memory configuration provides a maximum amount of memory bandwidth with a relative STREAM Triad score of 100.

Configuration of 5 DIMMs - unbalanced

Next we will add one DIMM to see its effect on memory bandwidth. This 2:1:1:1 memory configuration is shown in Figure 5.


Figure 5 2:1:1:1 memory configuration (STREAM Triad relative memory bandwidth = 26)

This is not a balanced memory configuration: it breaks balanced memory guideline 1, because one memory channel has a different memory capacity and number of ranks than the others, and balanced memory guideline 2, because the memory controllers have different DIMM configurations.

While a 2-way interleave set is created on Memory Controller 1, Memory Controller 0 has to create two separate interleave sets for the three DIMMs on its two memory channels:
• A 2-way interleave set is created using one DIMM from each memory channel.
• A 1-way interleave set is created using the remaining DIMM on Memory Controller 0 Channel 0.

Interleaving across the memory controllers is not done for this configuration due to the unbalanced memory between them. The inability to interleave across the memory controllers, combined with the interleaving overhead on Memory Channel 0, reduces the relative memory bandwidth of the 2:1:1:1 memory configuration to 26, that is, 26% of the 1:1:1:1 memory configuration.

Configurations of 6 DIMMs - 3 unbalanced and 1 balanced

There are numerous ways to populate six DIMMs in the twelve DIMM slots. We will look at four different ways to see the effect on memory bandwidth.

The first is the 2:1:1:2 memory configuration shown in Figure 6.


Figure 6 2:1:1:2 memory configuration (STREAM Triad relative memory bandwidth = 52)

This is an unbalanced memory configuration: it does not follow balanced memory guideline 1, because two memory channels have 32 GB of memory and four total ranks while the other two have only 16 GB of memory and two ranks. Balanced memory guideline 2 is followed, as both memory controllers have the same configuration of memory DIMMs, allowing interleaving to be done across the memory controllers.

Each memory controller has to create two separate interleave sets for the three DIMMs on its two memory channels, as described above for Memory Controller 0 in the 2:1:1:1 memory configuration. Even though balanced memory guideline 2 is followed, with each memory controller having the same configuration of DIMMs, the interleaving overhead of multiple sets on the channels reduces the relative memory bandwidth of the 2:1:1:2 memory configuration to only 52.

Another way to populate memory with six DIMMs is in a 3:1:1:1 configuration as shown in Figure 7 on page 8.


Figure 7 3:1:1:1 memory configuration (STREAM Triad relative memory bandwidth = 20)

This unbalanced memory configuration does not follow balanced memory guideline 1 as one channel has more memory capacity and ranks than the others. It also does not follow balanced memory guideline 2 as the memory controllers do not have the same DIMM configurations.

Interleaving cannot be done across the memory controllers. Memory Controller 1 can interleave its memory channels as they follow balanced memory guideline 1. Memory Controller 0 will need to create a 2-way interleave set across its memory channels and two 1-way interleave sets for the two other DIMMs on its Memory Channel 0.

Since most of this configuration’s memory is on Memory Controller 0, this interleaving overhead is very detrimental. The result is a relative memory bandwidth of 20.

One way to fulfill balanced memory guideline 2 with six DIMMs is with a 3:0:3:0 memory configuration shown in Figure 8.


Figure 8 3:0:3:0 memory configuration (STREAM Triad relative memory bandwidth = 39)

This memory configuration also follows balanced memory guideline 1 and is a balanced memory configuration. Even so, all memory channels should be populated to achieve maximum memory bandwidth.

Interleaving is done across the memory controllers. Each memory controller will create three 1-way interleave sets on Memory Channel 0.

Even though balanced memory guideline 2 is met, this memory configuration produces a relative memory bandwidth of 39 mostly due to no contribution from either Memory Channel 1.

The final way to arrange six DIMMs is to put all of them on one memory controller in the 3:3:0:0 memory configuration shown in Figure 9 on page 9.


Figure 9 3:3:0:0 memory configuration (STREAM Triad relative memory bandwidth = 38)

Like the 3:0:3:0 memory configuration, this configuration satisfies balanced memory guideline 1, because its two populated memory channels have the same capacity and rank count. However, it does not follow balanced memory guideline 2, as all the memory is on one memory controller. Having one memory controller entirely unpopulated removes its contribution to memory bandwidth.

Interleaving can occur between the channels on Memory Controller 0. Even so, the 3:3:0:0 memory configuration has a relative memory bandwidth of 38.

Configurations of 8 DIMMs - balanced

Using eight DIMMs, it is easy to create the balanced 2:2:2:2 memory configuration as shown in Figure 10.


Figure 10 2:2:2:2 memory configuration (STREAM Triad relative memory bandwidth = 100)

This memory configuration fulfills balanced memory guideline 1 with all channels populated equally and balanced memory guideline 2 with each memory controller having the same DIMM configuration.

This memory configuration can interleave between both the memory controllers and the memory channels. The 2:2:2:2 memory configuration provides the maximum relative memory bandwidth of 100.

Configurations of 8 DIMMs - unbalanced

The same eight memory DIMMs can also be arranged in the unbalanced memory configuration of 3:1:1:3 as shown in Figure 11 on page 10.


Figure 11 3:1:1:3 memory configuration (STREAM Triad relative memory bandwidth = 53)

This configuration does not follow balanced memory guideline 1 as half the memory channels have 48 GB of memory and 6 total ranks, while the other memory channels have only 16 GB of memory and 2 ranks.

This memory configuration can be interleaved across the memory controllers but not efficiently across the memory channels. Each memory controller will have one 2-way interleave set and two 1-way interleave sets. This limits its relative memory bandwidth to 53.

Configurations of 3 DIMMs - unbalanced

One other memory configuration to look at is having only three memory DIMMs in the 1:1:1:0 memory configuration shown in Figure 12.


Figure 12 1:1:1:0 memory configuration (STREAM Triad relative memory bandwidth = 60)

This unbalanced memory configuration does not follow balanced memory guideline 1 with one unpopulated memory channel nor balanced memory guideline 2 with two DIMMs on one memory controller and only one on the other.

This memory configuration cannot be interleaved efficiently across the memory controllers nor at all across the memory channels of Memory Controller 1. It can be interleaved across the memory channels of Memory Controller 0.

It also suffers from lacking any contribution to memory bandwidth from Memory Channel 1 on Memory Controller 1 as it is unpopulated. The result is a relative memory bandwidth of 60.

Applying the balanced memory guidelines - E7 processors

The Intel Xeon E7 processor uses a slightly different memory architecture to enable support of more DIMMs.

Each processor has two integrated memory controllers, and each memory controller has two Scalable Memory Interconnect generation 2 (SMI2) links that are connected to two scalable memory buffers. Each memory buffer has two memory channels, and each channel supports three DIMMs, for a total of 24 DIMMs per processor. The use of memory buffers doubles the number of memory channels and DIMMs supported per processor.
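A quick arithmetic check of the slot count stated above:

# E7 topology: 2 memory controllers per processor, 2 memory buffers per
# controller, 2 channels per buffer, 3 DIMM slots per channel.
controllers, buffers, channels, slots = 2, 2, 2, 3
print(controllers * buffers * channels * slots)  # 24 DIMM slots per processor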

The E7 memory architecture is shown in Figure 13 on page 11.


Figure 13 Intel Xeon E7 processor memory DIMM layout

Memory configurations will be designated as A:B:C:D:E:F:G:H, where each letter indicates the number of DIMMs populated on each memory channel:
• A refers to Memory Channel 0 on Memory Buffer 0 on Memory Controller 0
• B refers to Memory Channel 1 on Memory Buffer 0 on Memory Controller 0
• C refers to Memory Channel 0 on Memory Buffer 1 on Memory Controller 0
• D refers to Memory Channel 1 on Memory Buffer 1 on Memory Controller 0
• E refers to Memory Channel 0 on Memory Buffer 2 on Memory Controller 1
• F refers to Memory Channel 1 on Memory Buffer 2 on Memory Controller 1
• G refers to Memory Channel 0 on Memory Buffer 3 on Memory Controller 1
• H refers to Memory Channel 1 on Memory Buffer 3 on Memory Controller 1

This designation is shown in Figure 14 on page 11.


Figure 14 Channel designation for E7 processors

The memory buffers provide another level that can be interleaved. Balanced memory guideline 2 now applies to memory buffers just as it does to memory controllers: all memory buffers on a processor socket should have the same configuration of DIMMs to provide for efficient interleaving across them.

Configuration of 8 DIMMs - balanced

The first memory configuration we will look at is a balanced 1:1:1:1:1:1:1:1 memory configuration with one DIMM on each memory channel as shown in Figure 15. This memory configuration follows all the balanced memory guidelines and can be interleaved efficiently across the memory controllers, memory buffers, and memory channels. It provides the maximum relative memory bandwidth of 100.


Figure 15 1:1:1:1:1:1:1:1 memory configuration (STREAM Triad relative memory bandwidth = 100)

Configuration of 9 DIMMs - unbalanced

Adding one additional memory DIMM produces the unbalanced 2:1:1:1:1:1:1:1 memory configuration shown in Figure 16 on page 12.


Figure 16 2:1:1:1:1:1:1:1 memory configuration (STREAM Triad relative memory bandwidth = 16)

This configuration does not follow balanced memory guideline 1 with one memory channel having twice the memory capacity and ranks of all the others. It also does not follow balanced memory guideline 2 with different memory DIMM configurations on the memory controllers, but it is balanced across the memory buffers on Memory Controller 1.

Efficient interleaving occurs between Memory Buffer 2 and Memory Buffer 3 as they both have the same configurations of DIMMs attached to them.

It is difficult to create a balanced configuration with an odd number of DIMMs. The result is a very poor relative memory bandwidth of 16.

Configuration of 10 DIMMs - unbalanced

Adding one more DIMM produces the unbalanced 2:1:1:1:2:1:1:1 memory configuration shown in Figure 17.


Figure 17 2:1:1:1:2:1:1:1 memory configuration (STREAM Triad relative memory bandwidth = 48)

This configuration does follow balanced memory guideline 2 for the memory controllers, with both having the same DIMM configurations, allowing interleaving across the memory controllers. The guideline is not followed for the memory buffers, however. Balanced memory guideline 1 is not followed, as two memory channels have twice the memory capacity and ranks of the other six. Memory Buffers 1 and 3 do follow balanced memory guideline 1, allowing efficient interleaving to be done across their memory channels. The result is a relative memory bandwidth of 48.

Configuration of 12 DIMMs - unbalanced

Adding two more DIMMs produces the unbalanced 2:1:2:1:2:1:2:1 memory configuration shown in Figure 18 on page 13.


Figure 18 2:1:2:1:2:1:2:1 memory configuration (STREAM Triad relative memory bandwidth = 64)

Balanced memory guideline 2 is followed for both memory controllers and all memory buffers with matching DIMM configurations. This allows for interleaving between all of them. Unfortunately, balanced memory guideline 1 is not followed, as half the memory channels have twice the memory capacity and ranks of the other half. The result is less efficient interleaving across the channels of each memory buffer and a relative memory bandwidth of 64.

Configuration of 14 DIMMs - unbalanced

Using 14 DIMMs produces the unbalanced 2:1:2:2:2:1:2:2 memory configuration shown in Figure 19.


Figure 19 2:1:2:2:2:1:2:2 memory configuration (STREAM Triad relative memory bandwidth = 33)

This configuration follows balanced memory guideline 2 with both memory controllers having the same DIMM configurations, which allows for efficient interleaving across the memory controllers. The memory buffers do not follow balanced memory guideline 2, as half have three DIMMs and half have four. Balanced memory guideline 1 is not followed, as not all memory channels have the same memory capacity and ranks. Pairs of memory buffers do have the same DIMM configurations (Memory Buffers 0 and 2, and Memory Buffers 1 and 3), so interleaving can be done efficiently between these memory buffer pairs. The result is a relative memory bandwidth of 33.

Configuration of 16 DIMMs - balanced

Adding two more DIMMs produces the balanced 2:2:2:2:2:2:2:2 memory configuration shown in Figure 20.


Figure 20 2:2:2:2:2:2:2:2 memory configuration (STREAM Triad relative memory bandwidth = 100)

This follows balanced memory guideline 1, as all memory channels have the same memory capacity and total ranks. It also follows balanced memory guideline 2, as both memory controllers and all memory buffers have the same DIMM configurations. This allows for efficient interleaving at all levels and provides a relative memory bandwidth of 100.

Maximizing memory bandwidth

In order to maximize the memory bandwidth of a server, the following rules should be followed:
• Balance the memory across the processor sockets - all processor sockets on the same physical server should have the same configuration of memory DIMMs
• Balance the memory across the memory controllers - all memory controllers on a processor socket should have the same configuration of memory DIMMs
• Balance the memory across the memory channels - all memory channels should be populated and have the same total memory capacity and the same total number of ranks

If a server requires a specific memory capacity, sometimes smaller DIMMs or a mix of DIMM sizes are needed to produce a balanced memory configuration.

For example, you may require 96 GB of memory on a server with one Intel Xeon E5 processor. This can be accomplished using six 16 GB DIMMs. Unfortunately, there is no way to follow all the rules above with six DIMMs on four memory channels. Options in this situation, checked arithmetically in the sketch below, are:
• Use twelve 8 GB DIMMs. These would be arranged as a 3:3:3:3 configuration, which does follow all the rules.
• Use one 16 GB DIMM and one 8 GB DIMM on each channel.
• Install more memory than required to produce a balanced memory configuration. In this case, eight 16 GB DIMMs can be arranged as a 2:2:2:2 balanced memory configuration.
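The capacity arithmetic behind these options can be checked in a few lines of Python (an illustration only, not a configuration tool):

# Ways to reach the required 96 GB on the four channels of one processor.
options = {
    "six 16 GB DIMMs (cannot be balanced on 4 channels)": 6 * 16,
    "twelve 8 GB DIMMs as 3:3:3:3 (balanced)": 12 * 8,
    "one 16 GB + one 8 GB DIMM per channel (balanced)": 4 * (16 + 8),
    "eight 16 GB DIMMs as 2:2:2:2 (balanced, extra capacity)": 8 * 16,
}
for name, capacity_gb in options.items():
    print(f"{name}: {capacity_gb} GB")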

The important thing is to create a balanced memory configuration following all the above rules to maximize memory bandwidth and overall server performance.

Cluster on Die option impact on memory bandwidth

Cluster on Die (CoD) is a parameter that is set in UEFI and effectively splits a processor into two parts so that it acts like two processors. The CoD option is only available on Intel Xeon E5-2600 v3 and E5-2600 v4 processors with two memory controllers, that is, Intel Xeon E5 v3 processors with 10 or more cores and E5 v4 processors with 12 or more cores.

Configurations with the CoD option can improve performance for workloads that are highly Non-Uniform Memory Access (NUMA) optimized. Each half of the processor is treated as a separate NUMA node with ownership of half of the processor cores, a memory controller, and locally attached memory. Such a configuration can improve performance for workloads whose data is found almost entirely within the cluster it is running on.

Figure 21 on page 16 shows an Intel E5 processor with the CoD option producing two clusters.


Figure 21 2:2:1:1 memory configuration with Cluster on Die making two clusters (STREAM Triad relative memory bandwidth = 100)

The CoD option can reduce the impact of not following balanced memory guideline 2. More memory capacity can be attached to one memory controller than the other and still produce maximum memory bandwidth as long as memory is balanced across the memory channels of each memory controller.
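Assuming CoD makes each memory controller its own NUMA node, the balance check only needs to hold within each controller. Here is a minimal sketch of that relaxed check (a hypothetical helper, using the same (capacity_gb, ranks) DIMM tuples as the earlier sketch):

# Sketch: with Cluster on Die, check balance per memory controller only.
def cluster_is_balanced(channels):
    # Every populated channel in the cluster needs the same total capacity
    # and the same total number of ranks.
    totals = [(sum(cap for cap, _ in ch), sum(rk for _, rk in ch))
              for ch in channels if ch]
    return all(t == totals[0] for t in totals)

D = (16, 2)  # one 16 GB dual-rank DIMM
# The 2:2:1:1 configuration of Figure 21: two DIMMs per channel in
# cluster 0, one DIMM per channel in cluster 1.
print(cluster_is_balanced([[D, D], [D, D]]))  # True
print(cluster_is_balanced([[D], [D]]))        # True: balanced per cluster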

Without the CoD option, balanced memory guideline 2 is not followed in the 2:2:1:1 configuration making it an unbalanced memory configuration with reduced memory bandwidth. With the CoD option, the 2:2:1:1 configuration provides maximum memory bandwidth. Even so, it is recommended that all the balanced memory guidelines be followed even with configurations with the CoD option.

Summary

Overall server performance is affected by the memory subsystem, which can provide both high memory bandwidth and low memory access latency when properly configured. Balancing memory across the memory controllers and the memory channels produces memory configurations that can efficiently interleave memory references among their DIMMs, producing the highest possible memory bandwidth. An unbalanced memory configuration can reduce the total memory bandwidth to as low as 16% of that of a balanced memory configuration.

Implementing all three of the balanced memory guidelines described in this paper results in balanced memory configurations producing the best possible memory bandwidth and overall performance.

Change history

October 2017:
• Correction that one of the 6-DIMM configurations is balanced

April 2016:
• Initial publication

Authors

This paper was produced by the following team of specialists:

Dan Colglazier is a Senior Engineer in the Lenovo Data Center Performance Group in Morrisville, NC. He had previously spent over 30 years at IBM, where he started out in the IBM Data Processing Division performance team helping improve the designs of future mainframe storage hierarchies. Assessing the performance of future System x® servers in both IBM and Lenovo has been his main focus. He is currently the leader of the Performance Design Guidance team. Dan holds a Bachelor of Science degree in Computer Engineering and a Master of Science degree in Electrical Engineering both from the University of Illinois.

Joe Jakubowski is a Senior Technical Staff Member in the Lenovo Server Performance Laboratory in Morrisville, NC. Previously, he spent 30 years at IBM. He started his career in the IBM Networking Hardware Division test organization and worked on various Token Ring adapter and test tool development projects. He has spent the past 21 years in the server performance organization focusing primarily on database, virtualization and new technology performance. His current role includes all aspects of server architecture and performance. Joe holds Bachelor of Science degrees in Electrical Engineering and Engineering Operations from North Carolina State University and a Master of Science degree in Telecommunications from Pace University.

Tristian "Truth" Brown is a Hardware Performance Engineer on the Lenovo Server Performance Team in Raleigh, NC. He is responsible for the hardware analysis of high-performance, flash-based storage solutions for System x servers. Truth earned a bachelor's degree in Electrical Engineering from Tennessee State University and a master's degree in Electrical Engineering from North Carolina State University. His focus areas included System-on-Chip (SoC) design and validation.

Thanks to the following people for their contributions to this project:
• Charles Stephan, Lenovo Server Performance Team
• David Watts, Lenovo Press

Notices

Lenovo may not offer the products, services, or features discussed in this document in all countries. Consult your local Lenovo representative for information on the products and services currently available in your area. Any reference to a Lenovo product, program, or service is not intended to state or imply that only that Lenovo product, program, or service may be used. Any functionally equivalent product, program, or service that does not infringe any Lenovo intellectual property right may be used instead. However, it is the user's responsibility to evaluate and verify the operation of any other product, program, or service.

Lenovo may have patents or pending patent applications covering subject matter described in this document. The furnishing of this document does not give you any license to these patents. You can send license inquiries, in writing, to:

Lenovo (United States), Inc. 1009 Think Place - Building One Morrisville, NC 27560 U.S.A. Attention: Lenovo Director of Licensing

LENOVO PROVIDES THIS PUBLICATION “AS IS” WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Some jurisdictions do not allow disclaimer of express or implied warranties in certain transactions, therefore, this statement may not apply to you.

This information could include technical inaccuracies or typographical errors. Changes are periodically made to the information herein; these changes will be incorporated in new editions of the publication. Lenovo may make improvements and/or changes in the product(s) and/or the program(s) described in this publication at any time without notice.

The products described in this document are not intended for use in implantation or other life support applications where malfunction may result in injury or death to persons. The information contained in this document does not affect or change Lenovo product specifications or warranties. Nothing in this document shall operate as an express or implied license or indemnity under the intellectual property rights of Lenovo or third parties. All information contained in this document was obtained in specific environments and is presented as an illustration. The result obtained in other operating environments may vary.

Lenovo may use or distribute any of the information you supply in any way it believes appropriate without incurring any obligation to you.

Any references in this publication to non-Lenovo Web sites are provided for convenience only and do not in any manner serve as an endorsement of those Web sites. The materials at those Web sites are not part of the materials for this Lenovo product, and use of those Web sites is at your own risk.

Any performance data contained herein was determined in a controlled environment. Therefore, the result obtained in other operating environments may vary significantly. Some measurements may have been made on development-level systems and there is no guarantee that these measurements will be the same on generally available systems. Furthermore, some measurements may have been estimated through extrapolation. Actual results may vary. Users of this document should verify the applicable data for their specific environment.

© Copyright Lenovo 2017. All rights reserved.

Note to U.S. Government Users Restricted Rights -- Use, duplication or disclosure restricted by General Services Administration (GSA) ADP Schedule Contract.

This document was created or updated on October 4, 2017.

Send us your comments via the Rate & Provide Feedback form found at http://lenovopress.com/lp0501

Trademarks

Lenovo, the Lenovo logo, and For Those Who Do are trademarks or registered trademarks of Lenovo in the United States, other countries, or both. These and other Lenovo trademarked terms are marked on their first occurrence in this information with the appropriate symbol (® or ™), indicating US registered or common law trademarks owned by Lenovo at the time this information was published. Such trademarks may also be registered or common law trademarks in other countries. A current list of Lenovo trademarks is available on the Web at http://www.lenovo.com/legal/copytrade.html.

The following terms are trademarks of Lenovo in the United States, other countries, or both:

Lenovo® Lenovo(logo)® System x®

The following terms are trademarks of other companies:

Intel, Xeon, and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.

Other company, product, or service names may be trademarks or service marks of others.
