LHCb-PUB-2010-017
Public Note

Optimizing HLT code for run-time efficiency

Reference: LHCb-PUB-2010-017
Created: September 6, 2010
Last modified: November 4, 2010
Issue: 1
Revision: 0
Prepared by: Axel Thuresson (CERN PH and Lund, Sweden), Niko Neufeld (CERN PH)


Abstract

An upgrade of the High Level Trigger (HLT) farm at LHCb will be inevitable due to the increase in luminosity at the LHC. The upgrade will be done in two main ways. The first way is to make the software more efficient and faster. The second way is to increase the number of servers in the farm. This paper concerns both of these ways, divided into three parts. The first part is about NUMA: modern servers are all built with NUMA, so an upgraded HLT farm will use this new architecture, whereas the present HLT farm servers use the UMA architecture. After several tests it turned out that the Intel servers used for testing (having 2 nodes) showed very little penalty when comparing the worst case with the optimal case. The conclusion for the Intel servers is that the NUMA architecture does not affect the existing software negatively. Several tests were done on an AMD server having 8 nodes, and hence a more complicated structure. Non-optimal effects could be observed for this server, and a big difference was found between the worst case and the optimal case. So for the AMD server the NUMA architecture can affect the existing software negatively under certain circumstances. In the second part of the paper a program was made to help programmers find bottlenecks and optimize code while still maintaining correctness. In the third part, the program from part two was used to optimize a bottleneck in the HLT software. This optimization gained around 1-10% in speed, depending on the input data.

Document Status Sheet

1. Document Title: Optimizing HLT code for run-time efficiency
2. Document Reference Number: LHCb-PUB-2010-017

3. Issue  4. Revision  5. Date             6. Reason for change
   Draft  1            September 6, 2010   First version. Pasting from NUMA Battle-reports.

Contents

1 Introduction

2 NUMA
2.1 Introduction
2.2 Linux file-system definitions
2.3 /sys/devices/system
2.4 How to control the memory and the cpus
2.4.1 NUMA policy library
2.4.2 Affinity
2.5 Monitoring and the computer scientist's uncertainty principle
2.5.1 Numastat
2.5.2 /proc/#/numa_maps
2.5.3 The computer scientist's uncertainty principle
2.6 Test-cases
2.6.1 Intel-case
2.6.2 AMD-case
2.6.3 Reading from RAW-file on disk
2.6.4 Reading from shared memory


2.7 Results
2.7.1 INTEL: Worst case vs Optimal case: 8 virtual cpus, input from RAW-file on disk
2.7.2 INTEL: Interleaved: 8 virtual cpus, input from RAW-file on disk
2.7.3 INTEL: Worst case vs Optimal case, 6 virtual cpus without HyperThreading, input from buffer manager
2.7.4 INTEL: Worst case vs Optimal case, 6 virtual cpus without HyperThreading, input from raw-file
2.7.5 INTEL: Worst case vs Optimal case, 10 virtual cpus without HyperThreading, input from buffer manager
2.7.6 AMD: Worst case vs Optimal case, 8 virtual cpus, input from buffer
2.7.7 AMD: 5 nodes with memory, 8 virtual cpus, input from buffer
2.7.8 AMD: full run, 48 virtual cpus, input from file
2.7.9 Shared libraries
2.7.10 Intel: Shared libraries
2.7.11 AMD: Shared libraries
2.8 Conclusions Intel
2.8.1 INTEL: Worst case vs Optimal case on 8 virtual cpus
2.8.2 INTEL: Interleaved: 8 virtual cpus, input from RAW-file on disk
2.8.3 INTEL: Worst case vs Optimal case, 6 virtual cpus without HyperThreading, input from buffer manager
2.8.4 INTEL: Worst case vs Optimal case, 6 virtual cpus without HyperThreading, input from raw-file
2.8.5 INTEL: Worst case vs Optimal case, 10 virtual cpus without HyperThreading, input from buffer manager
2.8.6 Summary
2.9 Conclusions AMD
2.9.1 AMD: Worst case vs Optimal case, 8 virtual cpus, input from buffer
2.9.2 AMD: 5 nodes with memory, 8 virtual cpus, input from buffer
2.9.3 Summary AMD
2.10 Shared libraries

3 Profiling, correctness and evaluating the speed of the HLT
3.1 Introduction
3.2 Three compulsory steps
3.2.1 Step 1: Profiling and optimizing
3.2.2 Step 2: Correctness
3.2.3 Step 3: Evaluating the speed

4 Optimizing the code of HLT
4.1 Introduction
4.2 Finding bottlenecks
4.3 Optimizing in general
4.3.1 Choose data structure
4.3.2 Choose algorithm

4.3.3 Speed as a trade-off for size
4.4 Optimizing PatFwdTool
4.4.1 Profiling
4.4.2 New data structure: Pre-sorted lists
4.4.3 Math
4.4.4 Avoiding divisions
4.5 Conclusions

5 References

List of Figures

1 The UMA (Uniform Memory Access) architecture [12].
2 The NUMA (Non-Uniform Memory Access) architecture [12].
3 Illustration of the cache layout on a mono-processor (UMA) [2].
4 Illustration of the cache layout on a multi-processor (UMA) [2].
5 Profiling picture with GoogleProfiler, visualized with kcachegrind and compiled with "-O2 -g".
6 Fast deletion when the order of the elements doesn't matter.

List of Tables

1 Information from "blade-test-01" at location /sys/devices/system/node#/cpu#/.
2 Information from "blade-test-10" at location /sys/devices/system/node0/cpu#/.
3 Worst case vs Optimal case: 8 virtual cpus
4 NUMASTAT 8 virtual cpus, memory on node1
5 NUMASTAT 8 virtual cpus, memory on both nodes
6 Interleaved: 8 virtual cpus
7 NUMASTAT 8 virtual cpus, interleaved
8 Worst case vs Optimal case: 6 virtual cpus, buffer manager
9 NUMASTAT 6 virtual cpus, memory on node1
10 NUMASTAT 6 virtual cpus, memory on both nodes
11 Worst case vs Optimal case: 6 virtual cpus, raw-file
12 NUMASTAT 6 virtual cpus, memory on node1
13 NUMASTAT 6 virtual cpus, memory on both nodes
14 Worst case vs Optimal case: 10 virtual cpus, buffer manager
15 NUMASTAT 10 virtual cpus, memory on node1
16 NUMASTAT 10 virtual cpus, memory on both nodes
17 Worst case vs Optimal case: 8 virtual cpus, buffer
18 NUMASTAT 8 virtual cpus, memory on node6,7
19 NUMASTAT 8 virtual cpus, memory on all nodes
20 Five nodes available, 8 virtual cpus, buffer


21 NUMASTAT 8 virtual cpus, memory on node0,2,5,6,7
22 NUMASTAT 8 virtual cpus, memory on node0,4,5,6,7
23 NUMASTAT 8 virtual cpus, memory on node3,4,5,6,7
24 NUMASTAT 48 virtual cpus, memory on all nodes
25 Intel: Optimal case vs Worst case
26 Numa maps summation: No shared lib
27 Numa maps summation: One shared lib to all nodes
28 AMD: Optimal case vs Worst case
29 Numa maps summation: No shared lib
30 Numa maps summation: One shared lib to all nodes
31 Invoking of numa miss by filling a node
32 NUMASTAT 8 virtual cpus, both nodes available, but no memory left on node 0

1 Introduction

An upgrade of the High Level Trigger (HLT) farm at LHCb will be inevitable due to the increase in luminosity at the LHC. The upgrade will be done in two main ways. The first way is to make the software more efficient and faster. The second way is to increase the number of servers in the farm. This paper covers both the HLT software and an investigation of the new techniques that modern servers are built on, divided into three parts whose common goal is to optimize the HLT. The first part is about NUMA effects. Modern servers today are all built with NUMA (Non-Uniform Memory Access). The upgrade will only consist of servers with NUMA, in contrast to the present farm that consists of servers with UMA. The HLT software is built and tested on UMA servers, so the transition to NUMA might create a non-optimal situation. The second part is about three compulsory steps that a programmer must fulfil in order to optimize the code of the HLT software. A program has been created that tries to unite these three steps into one and the same program. This tool will make it faster for programmers to create optimized code. The third and last part is about optimizing the HLT software itself. The three steps were used to obtain 1-15% better performance (depending on the input data to the software). Tips and tricks that can be used to make code run faster are explained.

2 NUMA

2.1 Introduction

NUMA stands for "Non-Uniform Memory Access". It is a shared-memory architecture in which the cpus are attached to nodes: the memory access time (distance) to the local node is shorter than the access time to memory on a foreign node. NUMA can be compared with UMA, where the distance to memory is equal for all processors [12]. UMA is illustrated in figure 1 and NUMA in figure 2. In figure 2 it can be seen that 4 processors belong to one node, and this simple illustration shows that accessing memory on a foreign node is more distant than accessing the memory of the local node. The mechanism that transfers memory from a foreign node to a physical core is called the Interconnector. Today (September 2010) the HLT farm at LHCb consists of approximately 500 servers that are all built on the UMA architecture, but modern servers now use NUMA instead of UMA. So when the HLT farm is upgraded with new servers, it will certainly be with the NUMA architecture. This is the main motivation for investigating this new architecture together with the HLT software.

Figure 1 The UMA (Uniform Memory Access) architecture [12].

Figure 2 The NUMA (Non-Uniform Memory Access) architecture [12].

2.2 Linux file-system definitions

A very confusing part when discussing NUMA effects with my supervisor and the other people who helped me was that our definitions of the words "core", "cpu", "processor", "chip", "memory" etc. did not coincide. One reason for this is that the Linux file-system defines a cpu as a virtual cpu (/proc/cpuinfo, /sys/devices/system/cpu/cpu# and /sys/devices/system/node#/cpu#/). This paper should not give rise to the same confusion. Therefore a couple of definitions are written down here so that everyone can understand what is said. To see all the details of how the information is presented in the Linux file-system, jump directly to subsection 2.3.

Definition: A physical cpu (physical processor or multi-core processor) is a processing system that consists of a number of independent physical cores.

Definition: A physical core (individual processor) is a device that performs the instructions of a program.

Definition: A virtual cpu (virtual processor) is the same thing as a physical core, with the exception of HyperThreading, where one physical core can give rise to several virtual cpus.

Definition: A node consists of a set of virtual cpus together with a number of memories. What is characteristic about a node is that it has one big memory, and every virtual cpu in the node has the same distance to this memory.

In the Linux file-system, at the locations /sys/devices/system/cpu/cpu# and /sys/devices/system/node#/cpu#/, the term cpu# is a virtual cpu labelled with an id-number. To show how to separate these definitions once you have a Linux console in front of you: the number of physical cpus can be obtained by

grep ’physical id’ /proc/cpuinfo | sort | uniq | wc -l

The number of virtual cpus can be obtained by

grep ^processor /proc/cpuinfo | wc -l

The number of physical cores per physical cpu can be obtained by

grep ’cpu cores’ /proc/cpuinfo


Since this paper only deals with multi-core processors, there is no need to define what a mono-core processor is. But of course, if the number of physical cores per physical cpu is one, then it is a mono-core processor [7].

2.3 /sys/devices/system

There are two folders of interest inside the folder /sys/devices/system. The first folder is called cpu. Inside this folder we find a list of all the virtual cpus (see the definitions in subsection 2.2). There are a lot of files inside these folders; only the ones that are used in this paper will be listed here.

cpu#
    cache/index#
        level
        shared_cpu_map
        size
        type
        ways_of_associativity
    topology
        core_id
        core_siblings
        physical_package_id
        thread_siblings

cache/index# gives information about the cache layout. In general a modern computer has around four caches: two level 1 caches, one for instructions and one for data, whose sizes are in general small and hence fast; usually a level 2 cache that is larger than the level 1 caches; and finally, before the big RAM memory, there might be a level 3 cache as well. The situation is simplified but well illustrated in figures 3 and 4.

Figure 3 Illustration over cache layout on a mono-processor (UMA) [2].

The level in the folder cache/index# says how close the cache is to the physical core. A general rule is that the size is smaller for a smaller level, since a smaller size means faster memory; obviously the fastest memory should be as close to the physical core as possible. The size is simply the size of the cache. The type is either "data", "instruction" or "unified", depending on whether the cache is supposed to hold data, assembly instructions, or both. The ways_of_associativity contains a number. Associativity is a trade-off (like most things in life): the trade-off is between power, chip area and potentially time, against misses [13]. The last file that is brought up here is shared_cpu_map; this tells us how the memory

Figure 4 Illustration over cache layout on a multi-processor (UMA) [2].

is shared amongst the virtual cpus. As an example, in figure 4 we can see that there are two physical processors (light grey boxes). In each physical processor we have two physical cores (dark grey boxes); the two physical cores share the level 2 and level 3 caches. In each core we have two threads that share the level 1 caches. The core_id in the folder topology represents the hardware platform's identifier rather than the kernel's identifier. The core_siblings contains the kernel map of cpu#'s hardware threads within the same physical_package_id. physical_package_id contains the physical socket number, and thread_siblings is the kernel map of cpu#'s hardware threads within the same physical core [9].

The next folder of interest is called node; it has the following structure.

node#
    cpu#
    distance
    meminfo
    numastat

Every folder cpu# inside node# says that these virtual cpus share the big RAM. That is, all these virtual cpus have the same distance to the node's memory, in contrast to all other existing nodes, which are further away from these cpus with respect to distance and hence access time. The folder cpu# contains all the information explained at the beginning of this subsection. The distance contains a vector of numbers (the size of the vector is the same as the number of nodes). The first number gives the distance from the cpus inside the node to the node's big RAM. The other numbers correspond to the distances from the cpus inside the node to the other nodes' big RAM. This number should not be taken too seriously but only as a hand-waving number. For example, if we have two nodes and the output from distance is

10 20

this does not mean that it takes double the amount of time to access the foreign node's memory. It only means that it will take longer to access the foreign memory than the local memory [3]. meminfo contains all the information about the big RAM: how big it is, how much is free, etc. numastat is an interesting file since it returns statistics about hits and misses with respect to the NUMA policy of allocating memory on the nodes (we will see what this is in subsection 2.5.1).
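The same distance information can also be queried programmatically. As an illustration, here is a minimal C sketch using libnuma (an addition for illustration, not part of the note's test programs; it assumes libnuma is installed and the program is linked with -lnuma):

#include <numa.h>
#include <stdio.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support on this machine\n");
        return 1;
    }
    int max = numa_max_node(); /* highest node id, e.g. 1 on a two-node server */
    /* prints the same matrix as /sys/devices/system/node/node#/distance */
    for (int from = 0; from <= max; from++) {
        for (int to = 0; to <= max; to++)
            printf("%3d ", numa_distance(from, to));
        printf("\n");
    }
    return 0;
}

On the two-node example above this would print "10 20" on the first row and "20 10" on the second.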

2.4 How to control the memory and the cpus

There are a number of functions for controlling the cpus and the NUMA policy. All these functions can be used in C, so if one wants to use these controlling functions in, for example, Python, a wrapper must be created.


2.4.1 NUMA policy library

The NUMA policy library is a simple programming interface that allows the programmer to change the NUMA policies. There are 4 different policies to choose from [1]:

• Default policy - Allocate on the local node, when node is out of memory go to neighbouring node.

• Bind policy - Allocate on a specific set of nodes, when node is out of memory allocation fails.

• Preferred policy - Allocate on a specific set of nodes, when node is out of memory allocate on neighbouring node.

• Interleave policy - Interleave memory allocation on a set of nodes. That is, spread the memory evenly among the set of nodes.

By default the default policy is used (obviously). This means that all the virtual cpus will always allocate memory on the RAM closest to them; when this memory is full, they look for available memory on the closest RAM that is not full, and so on. Interleaved memory is an interesting option for software that shares a big memory segment. This option was tested for the HLT (see the results). The following definitions come from [5]. The first thing to do when using these functions is to have a nodemask (a set of nodes). The nodemask is initialized by

void nodemask_zero(nodemask_t *mask)

and nodes are added to the set with

void nodemask_set(nodemask_t *mask, int node)

The int node corresponds to the number in the Linux file-system described in subsection 2.3. Once a nodemask has been created there is a big number of functions that can be used; only a subset of useful functions is explained in the following.

void numa_set_membind(nodemask_t *nodemask)

sets the memory allocation mask. The thread will only allocate memory from the nodes set in nodemask, according to the bind policy. Passing an argument of numa_no_nodes or numa_all_nodes turns off memory binding to specific nodes.

numa_bind(nodemask_t *nodemask)

binds the current thread and its children to the nodes specified in nodemask. They will only run on the cpus of the specified nodes and only be able to allocate memory from them.

void numa_set_interleave_mask(nodemask_t *nodemask)

sets the memory interleave mask for the current thread to nodemask. All new memory allocations are page-interleaved over all nodes in the interleave mask.
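A minimal C sketch of how these functions fit together (assuming the nodemask_t interface described above; compile and link with -lnuma; the choice of node 1 is arbitrary):

#include <numa.h>
#include <stdio.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA policy library not supported here\n");
        return 1;
    }

    nodemask_t mask;
    nodemask_zero(&mask);   /* start from an empty node set */
    nodemask_set(&mask, 1); /* add node 1 (the id used in /sys/devices/system/node) */

    /* bind policy: run on node 1's cpus and allocate memory only there */
    numa_bind(&mask);

    /* alternatively, restrict only the allocations but not the cpus:
     *   numa_set_membind(&mask);
     * or spread allocations evenly over the nodes in the mask:
     *   numa_set_interleave_mask(&mask);
     */

    /* ... start the actual workload here ... */
    return 0;
}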

2.4.2 Affinity

In the last subsection there were functions to bind threads to nodes and to bind all the virtual cpus in one node to that node. But sometimes it is important to bind a specific thread or process to a set of virtual cpus. This can be done by sched_setaffinity. First we need a cpu set; this can be initialized by

void CPU_ZERO(cpu_set_t *set)

and cpus are added to the set with

void CPU_SET(int cpu, cpu_set_t *set)

The int cpu corresponds to the number in the Linux file-system described in subsection 2.3. When we have a cpu set, the binding is created by

int sched_setaffinity(pid_t pid, unsigned int cpusetsize, cpu_set_t *mask)

A process's cpu affinity mask determines the set of cpus on which it is eligible to run [11].
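For completeness, a minimal C sketch of pinning the calling process to two virtual cpus (the cpu ids 0 and 2 are arbitrary; _GNU_SOURCE is needed on Linux for the CPU_* macros):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);   /* empty cpu set */
    CPU_SET(0, &set); /* allow virtual cpu 0 */
    CPU_SET(2, &set); /* allow virtual cpu 2 */

    /* pid 0 means "the calling process" */
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return 1;
    }

    /* from here on, this process only runs on virtual cpus 0 and 2 */
    return 0;
}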

2.5 Monitoring and the computer scientist's uncertainty principle

To verify that the controlling functions above work properly, some monitoring facilities exist in the Linux kernel. The first obvious program to use is top. When a process is bound to a certain virtual cpu, this can be seen in the list of virtual cpus in top. In that list it is easy to see which cpu has the biggest load and thus where the process is running.

2.5.1 Numastat

Numastat was mentioned in subsection 2.3. Here is an explanation of what the fields in numastat mean [8].

• numa_hit - A process wanted to allocate memory from this node, and succeeded.
• numa_miss - A process wanted to allocate memory from another node, but ended up with memory from this node.
• numa_foreign - A process wanted to allocate on this node, but ended up with memory from another one.
• local_node - A process ran on this node and got memory from it.
• other_node - A process ran on this node and got memory from another node.
• interleave_hit - Interleaving wanted to allocate from this node and succeeded.

There is also a command named numastat. This command simply puts the numastat files from all nodes in a list. One way to use numastat to monitor the computer processes is to save the output of numastat once before the processes start. After all processes have finished, the output of numastat is saved once again. The difference between the two outputs then gives the total statistics of what has occurred during the run.
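This before/after bookkeeping can be automated. The following C sketch (an illustration, not the tool used for the measurements in this note) reads one counter, numa_hit, of one node; the numastat file simply contains "name value" pairs, one per line:

#include <stdio.h>
#include <string.h>

/* read the numa_hit counter of one node from sysfs, or -1 on error */
static long read_numa_hit(int node)
{
    char path[64], name[32];
    long value, hit = -1;
    snprintf(path, sizeof(path), "/sys/devices/system/node/node%d/numastat", node);
    FILE *f = fopen(path, "r");
    if (!f)
        return -1;
    while (fscanf(f, "%31s %ld", name, &value) == 2)
        if (strcmp(name, "numa_hit") == 0)
            hit = value;
    fclose(f);
    return hit;
}

int main(void)
{
    long before = read_numa_hit(0);
    /* ... run the test program here, e.g. via system() ... */
    long after = read_numa_hit(0);
    printf("numa_hit on node 0 grew by %ld during the run\n", after - before);
    return 0;
}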

2.5.2 /proc/#/numa_maps

numa_maps is a very interesting file. An example of the output from numa_maps when the HLT software is running is shown below (only an excerpt; the whole file is too much information):

bind=1 file=/sw/lib/lcg/external/Python/2.5.4p2/x86_64-slc5-gcc43-opt/bin/python mapped=231 mapmax=3 N0=120 N1=111
bind=1 file=/sw/lib/lcg/external/Python/2.5.4p2/x86_64-slc5-gcc43-opt/bin/python anon=42 dirty=42 mapped=44 mapmax=3 N1=44
bind=1 anon=8 dirty=8 N1=8
bind=1 heap anon=74405 dirty=74405 N1=74405
bind=1 anon=2 dirty=2 N1=2
bind=1 anon=10 dirty=10 N1=10
bind=1 file=/sw/lib/lhcb/HLT/HLT_v10r4/Hlt/HltTracking/x86_64-slc5-gcc43-opt/libHltTracking.so mapped=170 mapmax=2 N1=170
bind=1 file=/sw/lib/lhcb/HLT/HLT_v10r4/Hlt/HltTracking/x86_64-slc5-gcc43-opt/libHltTracking.so anon=50 dirty=50 N1=50
bind=1 anon=4 dirty=4 N1=4
bind=1 anon=10 dirty=10 N1=10
bind=1 file=/sw/lib/lhcb/HLT/HLT_v10r4/Hlt/HltL0Conf/x86_64-slc5-gcc43-opt/libHltL0Conf.so mapped=119 mapmax=2 N1=119
bind=1 file=/sw/lib/lhcb/HLT/HLT_v10r4/Hlt/HltL0Conf/x86_64-slc5-gcc43-opt/libHltL0Conf.so anon=31 dirty=31 N1=31
bind=1 anon=3 dirty=3 N1=3
bind=1 file=/sw/lib/lhcb/LHCB/LHCB_v31r1/Tf/TsaKernel/x86_64-slc5-gcc43-opt/libTsaKernel.so mapped=10 mapmax=2 N1=10


bind=1 file=/sw/lib/lhcb/LHCB/LHCB_v31r1/Tf/TsaKernel/x86_64-slc5-gcc43-opt/libTsaKernel.so anon=2 dirty=2 N1=2

The first column contains the word bind. This means that the policy for this program has been changed from the default to the bind policy by the programmer. In some rows we find the word file=; this means that a library has been mapped into the RAM. The word mapmax gives the number of processes mapping a single page that were encountered during the scan; in other words, wherever mapmax is greater than one, this memory is shared by several processes. The word N# has two meanings. The first is on what node the memory is allocated; for example, N0 means that it exists on node 0. The number after N#= is the second meaning: it is the number of pages (a page is usually 4K) that are loaded into the memory, in other words the size. So as an example, N1=2 means that 2*4K of memory is allocated on node 1 [10]. It is clear that the whole library file has not always been loaded into memory; otherwise it would be pointless to give the number of pages. If one wants to know which parts are taken from the library, there is a column that has been removed from the example above. This column gives an address that connects the file numa_maps with the file maps. This paper will not go into the details of the file maps.

2.5.3 The computer scientist's uncertainty principle

There is an uncertainty principle in computer science: if you use a monitoring program at the same time as you run your test program, you will not get an accurate elapsed time for the run. This can occur, for example, when the program top is running, or when the kernel is writing to the numa_maps file at the same time as the test program is running. One has to be aware of this uncertainty because the elapsed time of a program is a vital part of seeing whether a speed optimization worked or not. The way to work around this uncertainty relation is similar to how one works around the Heisenberg uncertainty in physics. First start the program with full monitoring activated. If all the settings were as expected, close the program. Then start the program again without monitoring and measure the time. There is also a natural variance when running the HLT software: running the exact same program twice does not take exactly the same time. The way to handle this is to run exactly the same program iteratively, discard the first run and then calculate the variance over all the other runs. This way the variance is known, and temporary conditions in the computer will not destroy the measurement.
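One way to make this bookkeeping explicit (a sketch of one possible estimator; the note does not specify a formula): with timings t_1, ..., t_N from N identical runs, discard the warm-up run t_1 and compute

mean = (t_2 + t_3 + ... + t_N) / (N - 1)
variance = ((t_2 - mean)^2 + ... + (t_N - mean)^2) / (N - 2)

where the denominator N - 2 is the usual sample-variance correction for the N - 1 runs that are kept.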

2.6 Test-cases

2.6.1 Intel-case

There were five different Intel servers available for testing. In the following, the specification of one of these computers, the so-called "blade-test-01", is shown from the point of view of the Linux file-system according to subsection 2.3.

Virtual cpu  Cache   Type         Level  Size   Shared cpu map
cpu0         index0  Data         1      32K    00000101
             index1  Instruction  1      32K    00000101
             index2  Unified      2      256K   00000101
             index3  Unified      3      8195K  00005555
cpu2         index0  Data         1      32K    00000404
             index1  Instruction  1      32K    00000404
             index2  Unified      2      256K   00000404
             index3  Unified      3      8195K  00005555
cpu4         index0  Data         1      32K    00001010
             index1  Instruction  1      32K    00001010
             index2  Unified      2      256K   00001010
             index3  Unified      3      8195K  00005555
cpu6         index0  Data         1      32K    00004040
             index1  Instruction  1      32K    00004040
             index2  Unified      2      256K   00004040
             index3  Unified      3      8195K  00005555
cpu8         index0  Data         1      32K    00000101
             index1  Instruction  1      32K    00000101
             index2  Unified      2      256K   00000101
             index3  Unified      3      8195K  00005555
cpu10        index0  Data         1      32K    00000404
             index1  Instruction  1      32K    00000404
             index2  Unified      2      256K   00000404
             index3  Unified      3      8195K  00005555
cpu12        index0  Data         1      32K    00001010
             index1  Instruction  1      32K    00001010
             index2  Unified      2      256K   00001010
             index3  Unified      3      8195K  00005555
cpu14        index0  Data         1      32K    00004040
             index1  Instruction  1      32K    00004040
             index2  Unified      2      256K   00004040
             index3  Unified      3      8195K  00005555
cpu1         index0  Data         1      32K    00000202
             index1  Instruction  1      32K    00000202
             index2  Unified      2      256K   00000202
             index3  Unified      3      8195K  0000aaaa
cpu3         index0  Data         1      32K    00000808
             index1  Instruction  1      32K    00000808
             index2  Unified      2      256K   00000808
             index3  Unified      3      8195K  0000aaaa
cpu5         index0  Data         1      32K    00002020
             index1  Instruction  1      32K    00002020
             index2  Unified      2      256K   00002020
             index3  Unified      3      8195K  0000aaaa
cpu7         index0  Data         1      32K    00008080
             index1  Instruction  1      32K    00008080
             index2  Unified      2      256K   00008080
             index3  Unified      3      8195K  0000aaaa
cpu9         index0  Data         1      32K    00000202
             index1  Instruction  1      32K    00000202
             index2  Unified      2      256K   00000202
             index3  Unified      3      8195K  0000aaaa
cpu11        index0  Data         1      32K    00000808
             index1  Instruction  1      32K    00000808
             index2  Unified      2      256K   00000808
             index3  Unified      3      8195K  0000aaaa
cpu13        index0  Data         1      32K    00002020
             index1  Instruction  1      32K    00002020
             index2  Unified      2      256K   00002020
             index3  Unified      3      8195K  0000aaaa
cpu15        index0  Data         1      32K    00008080
             index1  Instruction  1      32K    00008080
             index2  Unified      2      256K   00008080
             index3  Unified      3      8195K  0000aaaa

Table 1: Information from ”blade-test-01” at location /sys/devices/system/node#/cpu#/.

From table 1 it can be seen that there are 16 virtual cpus. Each virtual cpu has access to four different caches; this situation is illustrated in figure 3. From shared cpu map the sharing of memories can be seen. The first thing to notice is that cpu0,8 (cpu0 and cpu8), cpu1,9, cpu2,10, cpu3,11, cpu4,12, cpu5,13, cpu6,14 and cpu7,15 all share the same index0 and index1 caches of size 32K each. This is due to the fact that HyperThreading is enabled, so every physical core has two virtual cpus. They also share the level 2 cache of 256K (obviously: if they share the level 1 cache they share the level 2 cache as well). The second thing to note is that cpu0, cpu2, cpu4, cpu6, cpu8, cpu10, cpu12 and cpu14 all share the same level 3 cache of 8195K. The same goes for the odd-numbered virtual cpus. The computer has two nodes, so every set of virtual cpus that shares a level 3 cache also shares a big RAM of 12G. The physical processor type is "Intel(R) Xeon(R) CPU X5560 @ 2.80GHz". From subsection 2.2 (/proc/cpuinfo) it can be seen that there are 2 physical processors that are both quad-core, resulting in 16 virtual cpus with HyperThreading enabled.
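The shared cpu map masks in table 1 can be decoded mechanically. A minimal C sketch (assuming the mask fits into 64 bits once any comma separators have been stripped):

#include <stdio.h>
#include <stdlib.h>

/* print which virtual cpus are set in a shared_cpu_map mask like "00005555" */
static void decode_mask(const char *hex)
{
    unsigned long long mask = strtoull(hex, NULL, 16);
    printf("%s ->", hex);
    for (int cpu = 0; cpu < 64; cpu++)
        if (mask & (1ULL << cpu))
            printf(" cpu%d", cpu);
    printf("\n");
}

int main(void)
{
    decode_mask("00000101"); /* level 1/2 caches: shared by cpu0 and cpu8 */
    decode_mask("00005555"); /* level 3 cache: shared by all even-numbered cpus */
    return 0;
}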

2.6.2 AMD-case

The AMD server is a little more complicated than the Intel server: the Intel server has 2 nodes compared to the AMD server's 8 nodes. There are 4 physical processors with 48 virtual processors. These 48 virtual processors are distributed over 8 nodes of 8G each. The physical processor type is "AMD Opteron(tm) Processor 6172". In the "NUMA war" between Intel and AMD, Intel has HyperThreading. AMD counters this by splitting each chip into two parts (that is, splitting the RAM of 16 GB and the L3 cache of 12 MB into two parts) [6]. It is structured in the following way:

Node0: contains cpu0,4,8,12,16,20
Node1: contains cpu24,28,32,36,40,44
Node2: contains cpu1,5,9,13,17,21
Node3: contains cpu25,29,33,37,41,45
Node4: contains cpu2,6,10,14,18,22
Node5: contains cpu26,30,34,38,42,46
Node6: contains cpu3,7,11,15,19,23
Node7: contains cpu27,31,35,39,43,47

Virtual cpu  Cache   Type         Level  Size   Shared cpu map
cpu0         index0  Data         1      64K    00000000,00000001
             index1  Instruction  1      64K    00000000,00000001
             index2  Unified      2      512K   00000000,00000001
             index3  Unified      3      5118K  00000000,00111111
cpu4         index0  Data         1      64K    00000000,00000010
             index1  Instruction  1      64K    00000000,00000010
             index2  Unified      2      512K   00000000,00000010
             index3  Unified      3      5118K  00000000,00111111
cpu8         index0  Data         1      64K    00000000,00000100
             index1  Instruction  1      64K    00000000,00000100
             index2  Unified      2      512K   00000000,00000100
             index3  Unified      3      5118K  00000000,00111111
cpu12        index0  Data         1      64K    00000000,00001000
             index1  Instruction  1      64K    00000000,00001000
             index2  Unified      2      512K   00000000,00001000
             index3  Unified      3      5118K  00000000,00111111
cpu16        index0  Data         1      64K    00000000,00010000
             index1  Instruction  1      64K    00000000,00010000
             index2  Unified      2      512K   00000000,00010000
             index3  Unified      3      5118K  00000000,00111111
cpu20        index0  Data         1      64K    00000000,00100000
             index1  Instruction  1      64K    00000000,00100000
             index2  Unified      2      512K   00000000,00100000
             index3  Unified      3      5118K  00000000,00111111

Table 2: Information from "blade-test-10" at location /sys/devices/system/node0/cpu#/.

The structure is the same for all nodes: the cpus within a node share a level 3 cache of 5118K, and each has its own level 2 cache of 512K and level 1 caches of 64K.

2.6.3 Reading from RAW-file on disk

Two different ways of simulating data input to the HLT have been used. One way is simply to read the input from a raw-file on disk. This was done with the option

Moore().inputFiles = [ "/group/trg/probbe/data/072406_0000000004.raw" ]

This raw-file was used for all NUMA testing.

2.6.4 Reading from shared memory

The second way of simulating data input to the HLT is a simulation of a buffer manager. This is done by setting up a shared memory, after which a program (written by Jean-Christophe Garnier) feeds data into the HLT. In MOORE the following was used:

input = "TestWriter"
mepMgr = OnlineEnv.mepManager(OnlineEnv.PartitionID, OnlineEnv.PartitionName, [input], True)
app.Runable = OnlineEnv.evtRunable(mepMgr)
app.ExtSvc.append(mepMgr)
eventSelector = OnlineEnv.mbmSelector(input=input, decode=False)
app.ExtSvc.append(eventSelector)
OnlineEnv.evtDataSvc()
eventSelector.REQ1 = "EvType=2;TriggerMask=0xffffffff,0xffffffff,0xffffffff,0xffffffff;VetoMask=0,0,0,0;MaskType=ANY;UserType=ALL;Frequency=PERC;Perc=100.0"


2.7 Results

In this subsection all the important tests are shown. For each test there is first a short explanation of the test setup, and then the results are shown.

2.7.1 INTEL: Worst case vs Optimal case: 8 virtual cpus, input from RAW-file on disk

This test is to see whether there are any NUMA effects. 8 virtual cpus are used in total (HyperThreading disabled): 4 virtual cpus from node 0 and 4 virtual cpus from node 1. There are two different cases. The worst case is when memory can only be allocated on node 1; this forces the Interconnector to distribute memory to all virtual cpus in node 0. The second case is the optimal case, when both nodes are available so that the Interconnector will not be used.

8 virtual cpus, memory on node1:       Mean total: 956 +- 91 s.
8 virtual cpus, memory on both nodes:  Mean total: 883 +- 72 s.

Table 3: Worst case vs Optimal case: 8 virtual cpus

NUMASTAT 8 virtual cpus, memory on node1
Type            Node 0   Node 1
Numa hit        333723   28154291
Numa miss       0        0
Numa foreign    0        0
Interleave hit  1787     1449
Local node      329537   14124993
Other node      4186     14029298

Table 4: NUMASTAT 8 virtual cpus , memory on node1

NUMASTAT 8 virtual cpus, memory on both nodes
Type            Node 0    Node 1
Numa hit        14383270  14061513
Numa miss       0         0
Numa foreign    0         0
Interleave hit  2078      809
Local node      14382980  14047638
Other node      290       13875

Table 5: NUMASTAT 8 virtual cpus , memory on both nodes


2.7.2 INTEL: Interleaved: 8 virtual cpus, input from RAW-file on disk

In the following, the function numa_set_interleave_mask is turned on. In some cases where programs share memory this will increase performance; the idea of interleaved memory is to spread all memory evenly over the nodes. 8 virtual cpus are used in total (HyperThreading disabled): 4 virtual cpus from node 0 and 4 virtual cpus from node 1.

8 virtual cpus, interleaved:  Mean total: 958 +- 5 s.

Table 6: Interleaved: 8 virtual cpus

NUMASTAT 8 virtual cpus, interleaved
Type            Node 0   Node 1
Numa hit        2204872  2207277
Numa miss       0        0
Numa foreign    0        0
Interleave hit  2176163  2174464
Local node      1144930  1081884
Other node      1059942  1125393

Table 7: NUMASTAT 8 virtual cpus , interleaved

2.7.3 INTEL: Worst case vs Optimal case, 6 virtual cpus without HyperThreading, input from buffer manager

This test is made to see whether there is any difference between the buffer manager and reading a file from disk. 6 virtual cpus are used, and one additional virtual cpu is dedicated to running the buffer manager that feeds the other cpus with data. Node 0 is filled with useless memory, so the cpus that have node 0 as their local node have to go to node 1 to allocate memory. The buffer manager has a 150 s sleep before starting, so this is subtracted in the conclusions.

6 virtual cpus, memory on node1:       Mean total: 699 +- 4 s.
6 virtual cpus, memory on both nodes:  Mean total: 689 +- 11 s.

Table 8: Worst case vs Optimal case: 6 virtual cpus, buffer manager

NUMASTAT 6 virtual cpus, memory on node1
Type            Node 0   Node 1
Numa hit        28633    2970338
Numa miss       0        3502687
Numa foreign    3502687  0
Interleave hit  15       837
Local node      28589    2923252
Other node      44       3549773

Table 9: NUMASTAT 6 virtual cpus , memory on node1

NUMASTAT 6 virtual cpus, memory on both nodes
Type            Node 0   Node 1
Numa hit        3130156  2825703
Numa miss       0        0
Numa foreign    0        0
Interleave hit  752      819
Local node      3124994  2741356
Other node      5162     84347

Table 10: NUMASTAT 6 virtual cpus , memory on both nodes

2.7.4 INTEL: Worst case vs Optimal case, 6 virtual cpus without HyperThreading, input from raw-file

This test has the same setup as the previous one, except that now the program reads from a raw-file and the numa_bind function is used. This test has a 30 s sleep before starting, so this is subtracted in the conclusions.

6 virtual cpus, memory on node1:       Mean total: 960 +- 19 s.
6 virtual cpus, memory on both nodes:  Mean total: 938 +- 23 s.

Table 11: Worst case vs Optimal case: 6 virtual cpus, raw-file

NUMASTAT 6 virtual cpus, memory on node1
Type            Node 0   Node 1
Numa hit        5962680  131493
Numa miss       0        0
Numa foreign    0        0
Interleave hit  389      486
Local node      3225534  92053
Other node      2737049  39731

Table 12: NUMASTAT 6 virtual cpus , memory on node1


NUMASTAT 6 virtual cpus, memory on both nodes
Type            Node 0   Node 1
Numa hit        3217064  2787190
Numa miss       0        0
Numa foreign    0        0
Interleave hit  292      292
Local node      3158864  2758680
Other node      58103    28316

Table 13: NUMASTAT 6 virtual cpus , memory on both nodes

2.7.5 INTEL: Worst case vs Optimal case, 10 virtual cpus without HyperThreading, input from buffer manager

This test is to see the effect of increasing the number of virtual cpus that are used. In the previous tests 6 or 8 virtual cpus were used, but this computer (blade-test-05) has 12 virtual cpus in total; 10 of these cpus are used to run MOORE and 1 cpu is used for running the buffer manager. The buffer manager has a 150 s sleep before starting, so this is subtracted in the conclusions.

10 virtual cpus, memory on node1:       Mean total: 486 +- 2 s.
10 virtual cpus, memory on both nodes:  Mean total: 471 +- 7 s.

Table 14: Worst case vs Optimal case: 10 virtual cpus, buffer manager

NUMASTAT 10 virtual cpus, memory on node1
Type            Node 0  Node 1
Numa hit        551705  9720007
Numa miss       0       0
Numa foreign    0       0
Interleave hit  463     425
Local node      539858  4856161
Other node      11847   4863846

Table 15: NUMASTAT 10 virtual cpus , memory on node1

NUMASTAT 10 virtual cpus, memory on both nodes
Type            Node 0   Node 1
Numa hit        5040697  4578916
Numa miss       0        0
Numa foreign    0        0
Interleave hit  369      501
Local node      5022921  4524241
Other node      17776    54675

Table 16: NUMASTAT 10 virtual cpus , memory on both nodes

2.7.6 AMD: Worst case vs Optimal case, 8 virtual cpus, input from buffer

The worst case for the AMD server would be to put one virtual cpu on each node and make a single node available for allocation. This is not quite the worst case used in this test, since one node does not have enough memory to hold 8 different processes of the HLT software; a single node was avoided so that the swap space would not affect the results. Therefore two nodes are made available to distribute memory to the 8 virtual cpus. The optimal case is to have all nodes available for allocation. There is a 180 s sleep in every iteration, so to give the percentage difference between the runs this must be subtracted in the conclusions. Unfortunately the variance was not saved for this test.

8 virtual cpus, memory on node6 and node7:  Mean total: 1042 s.
8 virtual cpus, memory on all nodes:        Mean total: 780 s.

Table 17: Worst case vs Optimal case: 8 virtual cpus, buffer

NUMASTAT 8 virtual cpus, memory on node6,7
Type            Node 0  Node 1  Node 2  Node 3  Node 4  Node 5  Node 6   Node 7
Numa hit        10456   3583    1416    584     719     1616    228501   194820
Numa miss       0       7219    297     2625    363     6750    0        0
Numa foreign    238496  200285  230512  189717  194058  194440  0        0
Interleave hit  8       8       3       3       1       2       54       69
Local node      10389   3523    1405    569     704     1601    228513   192734
Other node      67      7279    308     2640    378     6765    1230369  2086

Table 18: NUMASTAT 8 virtual cpus , memory on node6,7

NUMASTAT 8 virtual cpus, memory on all nodes
Type            Node 0  Node 1  Node 2  Node 3  Node 4  Node 5  Node 6  Node 7
Numa hit        246316  203590  189200  209123  212944  195191  186551  225853
Numa miss       0       0       0       0       0       0       0       0
Numa foreign    0       0       0       0       0       0       0       0
Interleave hit  48      48      49      46      49      52      54      47
Local node      245568  201576  186917  207095  210667  193051  184326  223663
Other node      748     2014    2283    2028    2277    2140    2225    2190

Table 19: NUMASTAT 8 virtual cpus , memory on all nodes

2.7.7 AMD: 5 nodes with memory, 8 virtual cpus, input from buffer

This test is motivated by the result above, where effectively only one node gave memory to the other cpus although two nodes were available. In the following, five nodes have memory available. There is a 180 s sleep in every iteration, so to give the percentage difference between the runs this must be subtracted in the conclusions. Unfortunately the variance was not saved for this test.

8 virtual cpus, memory on node0,2,5,6,7:  Mean total: 771 s.
8 virtual cpus, memory on node0,4,5,6,7:  Mean total: 807 s.
8 virtual cpus, memory on node3,4,5,6,7:  Mean total: 809 s.

Table 20: Five nodes available, 8 virtual cpus, buffer

NUMASTAT 8 virtual cpus, memory on node0,2,5,6,7
Type            Node 0  Node 1  Node 2  Node 3  Node 4  Node 5  Node 6  Node 7
Numa hit        185834  39420   233477  17109   13529   194234  233541  253641
Numa miss       0       0       188626  0       547     340506  0       0
Numa foreign    0       188626  0       163140  177913  0       0       0
Interleave hit  21      1       24      7       4       16      27      23
Local node      184863  39392   231942  16897   13371   192898  232097  251963
Other node      971     28      190161  212     705     341842  1444    1678

Table 21: NUMASTAT 8 virtual cpus , memory on node0,2,5,6,7

NUMASTAT 8 virtual cpus, memory on node0,4,5,6,7
Type            Node 0  Node 1  Node 2  Node 3  Node 4  Node 5  Node 6  Node 7
Numa hit        188773  38776   8878    9837    189945  192612  243830  247221
Numa miss       0       0       185     37007   549678  0       0       0
Numa foreign    0       178310  223630  184930  0       0       0       0
Interleave hit  22      5       0       3       34      36      31      36
Local node      188014  38757   8866    9774    188670  190126  242338  244869
Other node      759     19      197     37197   550953  2486    1492    2352

Table 22: NUMASTAT 8 virtual cpus , memory on node0,4,5,6,7


NUMASTAT 8 virtual cpus, memory on node3,4,5,6,7
Type            Node 0  Node 1  Node 2  Node 3  Node 4  Node 5  Node 6  Node 7
Numa hit        315     38566   2556    230796  196890  202723  187968  287148
Numa miss       0       5335    4319    541434  0       0       0       0
Numa foreign    191345  180069  179674  0       0       0       0       0
Interleave hit  0       3       2       25      30      29      21      26
Local node      315     38445   2549    229565  195858  200340  186437  284359
Other node      0       5456    4326    542665  1032    2383    1531    2789

Table 23: NUMASTAT 8 virtual cpus , memory on node3,4,5,6,7

2.7.8 AMD: full run, 48 virtual cpus, input from file

It is interesting to see how the kernel behaves when certain nodes are full, but it is also interesting to see the behaviour when all cpus are used without any constraints. This test only shows numastat: since only one run was done, the time is not interesting.

NUMASTAT 48 virtual cpus, memory on all nodes
Type            Node 0   Node 1   Node 2   Node 3   Node 4   Node 5   Node 6   Node 7
Numa hit        1122709  1544451  1016913  923527   888404   1350215  969061   1101646
Numa miss       0        0        0        0        0        0        0        0
Numa foreign    0        0        0        0        0        0        0        0
Interleave hit  286      280      271      225      294      248      310      226
Local node      1121950  1543447  1015275  921973   886621   1348793  967240   1100052
Other node      759      1004     1683     1554     1783     1422     1821     1594

Table 24: NUMASTAT 48 virtual cpus , memory on all nodes

2.7.9 Shared libraries

In all the above tests the effect of the shared libraries was discarded. The reason for this is that the kernel does not replicate code on node level, since this would take too much memory. As an example of why this is not done, think about what would happen if every little process one starts in Linux had to have around 2 MB only for libc; this is clearly not scalable. Nevertheless it is interesting to see whether there is any speed effect of replicating the HLT library code to all existing nodes, on both the Intel and AMD test servers. In order to analyze the situation, a small program that sums up the information given by numa_maps during runtime will be used. To understand how to interpret the results of the numa_maps summation program, an example will be given:

Total pages
Node 0 pages=182206 that is 746315 kb = 746 mb
Node 1 pages=216 that is 884 kb = 0 mb
Total file pages
Node 0 pages=30753 that is 125964 kb = 125 mb
Node 1 pages=216 that is 884 kb = 0 mb
Total heap pages
Node 0 pages=126502 that is 518152 kb = 518 mb
Node 1 pages=0 that is 0 kb = 0 mb
Total anon pages
Node 0 pages=24951 that is 102199 kb = 102 mb
Node 1 pages=0 that is 0 kb = 0 mb

This example was taken while the HLT was running on the Intel server. The logic is given by the simple equation total pages = file pages + heap pages + anon pages. "File pages" are the shared libraries and "anon pages" are the anonymous pages. In the example above one can see that the process is, with very high probability, running on a virtual cpu that has node 0 as its local node. The "file pages" amount to around 120 MB, and this is the size of the shared libraries that the HLT needs for this raw-file. One test for AMD and one test for Intel are shown in the following subsections. Each test is divided into two subtests. The first subtest is the optimal case in the sense that every node on the server has its own copy of the shared libraries (libc excluded). The second subtest is the worst case: the default policy is on, which lets all nodes share the libraries with some overlap. Observe that the worst case here is the normal case for the kernel, and that it is the optimal case in the sense of size.
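A minimal C sketch of such a summation (an illustration, not the author's original tool; it only totals the N#= page counts of /proc/<pid>/numa_maps for a two-node machine, assuming 4 kB pages):

#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
    char path[64], line[4096];
    long pages[2] = {0, 0};

    if (argc != 2) {
        fprintf(stderr, "usage: %s <pid>\n", argv[0]);
        return 1;
    }
    snprintf(path, sizeof(path), "/proc/%s/numa_maps", argv[1]);
    FILE *f = fopen(path, "r");
    if (!f) {
        perror(path);
        return 1;
    }
    while (fgets(line, sizeof(line), f)) {
        /* every mapping line carries zero or more "N<node>=<pages>" tokens */
        for (char *tok = strtok(line, " \n"); tok; tok = strtok(NULL, " \n")) {
            int node;
            long n;
            if (sscanf(tok, "N%d=%ld", &node, &n) == 2 && node >= 0 && node < 2)
                pages[node] += n;
        }
    }
    fclose(f);
    for (int node = 0; node < 2; node++)
        printf("Node %d pages=%ld that is %ld kb\n", node, pages[node], pages[node] * 4);
    return 0;
}

Extending the per-node array and splitting the totals by the file=, heap and anon keywords gives exactly the kind of summary shown in tables 26-30.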

2.7.10 Intel: Shared libraries

No shared lib (each node has its own lib):  Mean total: 846 +- 4 s.
One shared lib to all nodes:                Mean total: 847 +- 5 s.

Table 25: Intel: Optimal case vs Worst case

Numa maps summation: No shared lib (each node has its own lib)

Process of MOORE on node 0:
Total pages
Node 0 pages=182206 that is 746315 kb = 746 mb
Node 1 pages=216 that is 884 kb = 0 mb
Total file pages
Node 0 pages=30753 that is 125964 kb = 125 mb
Node 1 pages=216 that is 884 kb = 0 mb
Total heap pages
Node 0 pages=126502 that is 518152 kb = 518 mb
Node 1 pages=0 that is 0 kb = 0 mb
Total anon pages
Node 0 pages=24951 that is 102199 kb = 102 mb
Node 1 pages=0 that is 0 kb = 0 mb

Process of MOORE on node 1:
Total pages
Node 0 pages=621 that is 2543 kb = 2 mb
Node 1 pages=191359 that is 783806 kb = 783 mb
Total file pages
Node 0 pages=621 that is 2543 kb = 2 mb
Node 1 pages=34349 that is 140693 kb = 140 mb
Total heap pages
Node 0 pages=0 that is 0 kb = 0 mb
Node 1 pages=132062 that is 540925 kb = 540 mb
Total anon pages
Node 0 pages=0 that is 0 kb = 0 mb
Node 1 pages=24948 that is 102187 kb = 102 mb

Table 26: Numa maps summation: No shared lib


Numa maps summation: One shared lib to all nodes (default)

Process of MOORE on node 0:
Total pages
Node 0 pages=178371 that is 730607 kb = 730 mb
Node 1 pages=14533 that is 59527 kb = 59 mb
Total file pages
Node 0 pages=20465 that is 83824 kb = 83 mb
Node 1 pages=14533 that is 59527 kb = 59 mb
Total heap pages
Node 0 pages=132624 that is 543227 kb = 543 mb
Node 1 pages=0 that is 0 kb = 0 mb
Total anon pages
Node 0 pages=25282 that is 103555 kb = 103 mb
Node 1 pages=0 that is 0 kb = 0 mb

Process of MOORE on node 1:
Total pages
Node 0 pages=15175 that is 62156 kb = 62 mb
Node 1 pages=177760 that is 728104 kb = 728 mb
Total file pages
Node 0 pages=15175 that is 62156 kb = 62 mb
Node 1 pages=19846 that is 81289 kb = 81 mb
Total heap pages
Node 0 pages=0 that is 0 kb = 0 mb
Node 1 pages=132752 that is 543752 kb = 543 mb
Total anon pages
Node 0 pages=0 that is 0 kb = 0 mb
Node 1 pages=25162 that is 103063 kb = 103 mb

Table 27: Numa maps summation: One shared lib to all nodes

2.7.11 AMD: Shared libraries

No shared lib (each node has its own lib):  Mean total: 1993 +- 15 s.
One shared lib to all nodes:                Mean total: 2007 +- 14 s.

Table 28: AMD: Optimal case vs Worst case

Numa maps summation: No shared lib (each node has its own lib)

HLT that ran on node 3:
Total pages
Node 0 pages=788 that is 3227 kb = 3 mb
Node 1 pages=3 that is 12 kb = 0 mb
Node 2 pages=45 that is 184 kb = 0 mb
Node 3 pages=181255 that is 742420 kb = 742 mb
Node 4 pages=0 that is 0 kb = 0 mb
Node 5 pages=0 that is 0 kb = 0 mb
Node 6 pages=7 that is 28 kb = 0 mb
Node 7 pages=3 that is 12 kb = 0 mb
Total file pages
Node 0 pages=788 that is 3227 kb = 3 mb
Node 1 pages=3 that is 12 kb = 0 mb
Node 2 pages=45 that is 184 kb = 0 mb
Node 3 pages=30089 that is 123244 kb = 123 mb
Node 4 pages=0 that is 0 kb = 0 mb
Node 5 pages=0 that is 0 kb = 0 mb
Node 6 pages=7 that is 28 kb = 0 mb
Node 7 pages=3 that is 12 kb = 0 mb
Total heap pages
Node 0 pages=0 that is 0 kb = 0 mb
Node 1 pages=0 that is 0 kb = 0 mb
Node 2 pages=0 that is 0 kb = 0 mb
Node 3 pages=126485 that is 518082 kb = 518 mb
Node 4 pages=0 that is 0 kb = 0 mb
Node 5 pages=0 that is 0 kb = 0 mb
Node 6 pages=0 that is 0 kb = 0 mb
Node 7 pages=0 that is 0 kb = 0 mb
Total anon pages
Node 0 pages=0 that is 0 kb = 0 mb
Node 1 pages=0 that is 0 kb = 0 mb
Node 2 pages=0 that is 0 kb = 0 mb
Node 3 pages=24681 that is 101093 kb = 101 mb
Node 4 pages=0 that is 0 kb = 0 mb
Node 5 pages=0 that is 0 kb = 0 mb
Node 6 pages=0 that is 0 kb = 0 mb
Node 7 pages=0 that is 0 kb = 0 mb

HLT that ran on node 5:
Total pages
Node 0 pages=708 that is 2899 kb = 2 mb
Node 1 pages=6 that is 24 kb = 0 mb
Node 2 pages=80 that is 327 kb = 0 mb
Node 3 pages=4 that is 16 kb = 0 mb
Node 4 pages=7 that is 28 kb = 0 mb
Node 5 pages=181233 that is 742330 kb = 742 mb
Node 6 pages=54 that is 221 kb = 0 mb
Node 7 pages=0 that is 0 kb = 0 mb
Total file pages
Node 0 pages=708 that is 2899 kb = 2 mb
Node 1 pages=6 that is 24 kb = 0 mb
Node 2 pages=80 that is 327 kb = 0 mb
Node 3 pages=4 that is 16 kb = 0 mb
Node 4 pages=7 that is 28 kb = 0 mb
Node 5 pages=30076 that is 123191 kb = 123 mb
Node 6 pages=54 that is 221 kb = 0 mb
Node 7 pages=0 that is 0 kb = 0 mb
Total heap pages
Node 0 pages=0 that is 0 kb = 0 mb
Node 1 pages=0 that is 0 kb = 0 mb
Node 2 pages=0 that is 0 kb = 0 mb
Node 3 pages=0 that is 0 kb = 0 mb
Node 4 pages=0 that is 0 kb = 0 mb
Node 5 pages=126475 that is 518041 kb = 518 mb
Node 6 pages=0 that is 0 kb = 0 mb
Node 7 pages=0 that is 0 kb = 0 mb
Total anon pages
Node 0 pages=0 that is 0 kb = 0 mb
Node 1 pages=0 that is 0 kb = 0 mb
Node 2 pages=0 that is 0 kb = 0 mb
Node 3 pages=0 that is 0 kb = 0 mb
Node 4 pages=0 that is 0 kb = 0 mb
Node 5 pages=24682 that is 101097 kb = 101 mb
Node 6 pages=0 that is 0 kb = 0 mb
Node 7 pages=0 that is 0 kb = 0 mb

Table 29: Numa maps summation: No shared lib

Numa maps summation: One shared lib to all nodes (default)

HLT that ran on node 1:
Total pages
Node 0 pages=626 that is 2564 kb = 2 mb
Node 1 pages=32419 that is 132788 kb = 132 mb
Node 2 pages=42 that is 172 kb = 0 mb
Node 3 pages=0 that is 0 kb = 0 mb
Node 4 pages=7561 that is 30969 kb = 30 mb
Node 5 pages=4 that is 16 kb = 0 mb
Node 6 pages=0 that is 0 kb = 0 mb
Node 7 pages=1636 that is 6701 kb = 6 mb
Total file pages
Node 0 pages=626 that is 2564 kb = 2 mb
Node 1 pages=1446 that is 5922 kb = 5 mb
Node 2 pages=42 that is 172 kb = 0 mb
Node 3 pages=0 that is 0 kb = 0 mb
Node 4 pages=7561 that is 30969 kb = 30 mb
Node 5 pages=4 that is 16 kb = 0 mb
Node 6 pages=0 that is 0 kb = 0 mb
Node 7 pages=1636 that is 6701 kb = 6 mb
Total heap pages
Node 0 pages=0 that is 0 kb = 0 mb
Node 1 pages=26094 that is 106881 kb = 106 mb
Node 2 pages=0 that is 0 kb = 0 mb
Node 3 pages=0 that is 0 kb = 0 mb
Node 4 pages=0 that is 0 kb = 0 mb
Node 5 pages=0 that is 0 kb = 0 mb
Node 6 pages=0 that is 0 kb = 0 mb
Node 7 pages=0 that is 0 kb = 0 mb
Total anon pages
Node 0 pages=0 that is 0 kb = 0 mb
Node 1 pages=4879 that is 19984 kb = 19 mb
Node 2 pages=0 that is 0 kb = 0 mb
Node 3 pages=0 that is 0 kb = 0 mb
Node 4 pages=0 that is 0 kb = 0 mb
Node 5 pages=0 that is 0 kb = 0 mb
Node 6 pages=0 that is 0 kb = 0 mb
Node 7 pages=0 that is 0 kb = 0 mb

HLT that ran on node 4:
Total pages
Node 0 pages=626 that is 2564 kb = 2 mb
Node 1 pages=4 that is 16 kb = 0 mb
Node 2 pages=42 that is 172 kb = 0 mb
Node 3 pages=0 that is 0 kb = 0 mb
Node 4 pages=40417 that is 165548 kb = 165 mb
Node 5 pages=4 that is 16 kb = 0 mb
Node 6 pages=0 that is 0 kb = 0 mb
Node 7 pages=1636 that is 6701 kb = 6 mb
Total file pages
Node 0 pages=626 that is 2564 kb = 2 mb
Node 1 pages=4 that is 16 kb = 0 mb
Node 2 pages=42 that is 172 kb = 0 mb
Node 3 pages=0 that is 0 kb = 0 mb
Node 4 pages=9003 that is 36876 kb = 36 mb
Node 5 pages=4 that is 16 kb = 0 mb
Node 6 pages=0 that is 0 kb = 0 mb
Node 7 pages=1636 that is 6701 kb = 6 mb
Total heap pages
Node 0 pages=0 that is 0 kb = 0 mb
Node 1 pages=0 that is 0 kb = 0 mb
Node 2 pages=0 that is 0 kb = 0 mb
Node 3 pages=0 that is 0 kb = 0 mb
Node 4 pages=26534 that is 108683 kb = 108 mb
Node 5 pages=0 that is 0 kb = 0 mb
Node 6 pages=0 that is 0 kb = 0 mb
Node 7 pages=0 that is 0 kb = 0 mb
Total anon pages
Node 0 pages=0 that is 0 kb = 0 mb
Node 1 pages=0 that is 0 kb = 0 mb
Node 2 pages=0 that is 0 kb = 0 mb
Node 3 pages=0 that is 0 kb = 0 mb
Node 4 pages=4880 that is 19988 kb = 19 mb
Node 5 pages=0 that is 0 kb = 0 mb
Node 6 pages=0 that is 0 kb = 0 mb
Node 7 pages=0 that is 0 kb = 0 mb

Table 30: Numa maps summation: One shared lib to all nodes

2.8 Conclusions Intel

2.8.1 INTEL: Worst case vs Optimal case on 8 virtual cpus

This test only includes the allocation of memory, not the shared libraries. Running on only one node (with memory still available on that node), compared to two nodes, has about an 8% impact ((956 − 883)/883 ≈ 8%, from table 3); note the quite big variance in this test. It can be seen in the numastat output that "Other node" on node 1 in table 4 has been invoked about as much as "Local node" on node 0 in table 5. This is a confirmation that the same amount of work has been done in both test cases. The reason that "Numa miss" has not been invoked is that no NUMA policy has been broken: the functions in subsection 2.4.1 were used, and therefore "Other node" is invoked, since it is defined as "A process ran on this node and got memory from another node". A third case was performed without the use of the numa-functions. Instead, node 0 was filled with unused memory, and to make sure that this specific node is not used by the HLT software the swap space was disabled. The results can be seen in tables 31 and 32.

8 virtual cpus, both nodes available, but no memory left on node 0:  Mean total: 959 +- 96 s.

Table 31: Invoking of numa miss by filling a node

NUMASTAT 8 virtual cpus, both nodes available, but no memory left on node 0
Type            Node 0    Node 1
Numa hit        127844    14166288
Numa miss       0         14155663
Numa foreign    14155663  0
Interleave hit  392       1459
Local node      127305    14158461
Other node      523       14163587


Table 32: NUMASTAT 8 virtual cpus, both nodes available, but no memory left on node 0

It can be seen that the amount of time it took was approximately the same as when the numa-functions were used. The difference in numastat between table 32 and table 4 is that numa foreign on node 0 and numa miss on node 1 have been invoked. The worst case here is really a case that will almost never happen in reality. In many cases it can of course happen that the local node is full and cpus have to allocate memory on foreign nodes, but this does not necessarily mean that the local node will be totally unused. Anyhow, it is interesting to see how big the "never-going-to-happen" case is compared to the optimal case (the optimal case is obtained by letting the kernel run on the default policy and making sure that the nodes' memory is not full).

2.8.2 INTEL: Interleaved: 8 virtual cpus, input from RAW-file on disk

Table 7 confirms that the function numa_set_interleave_mask worked correctly: the interleave-hit count increases compared to table 4. Looking at the time difference between tables 6 and 3, the interleaved policy is even worse than the worst case. The HLT software can therefore be considered NUMA-lucky, in the sense that every process runs independently of all the other processes with its own memory; only when the memory on the local node is full does a process have to put memory on other nodes. The only things that are shared are the libraries.
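For completeness, this is roughly what setting the interleave policy looks like in code, a sketch using the libnuma calls documented in [5]; the allocation is only a stand-in for the real workload's memory:

#include <cstddef>
#include <numa.h>   // numa_set_interleave_mask, numa_all_nodes_ptr (libnuma 2.x)

int main()
{
    if (numa_available() < 0) return 1;

    // From here on, new memory of this process is interleaved page by page
    // over all nodes; this is what shows up as "interleave hit" in numastat.
    numa_set_interleave_mask(numa_all_nodes_ptr);

    const size_t size = 64UL * 1024 * 1024;
    void* buf = numa_alloc(size);   // allocated under the current (interleave) policy
    // ... run the workload ...
    numa_free(buf, size);
}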

2.8.3 INTEL: Worst case vs Optimal case, 6 virtual cpus without HyperThreading, input from buffer manager

This test only includes the allocation of memory, not the shared libraries. The worst case, with all memory allocated on only one node, is (699 − 689)/(689 − 150) ≈ 1.8% slower than the optimal case. Numa miss is invoked in table 9 because the NUMA policy is broken. The difference in this test is only around 2%, compared to 8% in the last test. The big variance in the last test might explain some of the difference, but there are two other facts that could also explain it: first, that 8 virtual cpus were used instead of 6, and second, that reading from a buffer manager changes the results. This is the motivation for the next two tests.

2.8.4 INTEL: Worst case vs Optimal case, 6 virtual cpus without HyperThreading, input from raw-file

This test only includes the allocation of memory, not the shared libraries. It is motivated in the last subsection: mainly, to see how much the buffer manager affects the time compared to reading from a file. This test case is (960 − 938)/(938 − 30) ≈ 2.4% slower when all memory on one node is allocated. This is of the same order as the last test, so the conclusion is that it doesn't matter much whether these simulations get data from the buffer manager or read from a raw-file on disk. The absolute time measurements differ because, when the buffer manager is used, all processes read different events, whereas when input is read from disk every process uses exactly the same events as input.


2.8.5 INTEL: Worst case vs Optimal case, 10 virtual cpus without HyperThreading, input from buffer manager

The following checks how the NUMA effect changes with an increasing number of virtual cpus. This test had to be done on another computer (the so-called "blade-test-05") with 12 physical cores in total (a physical core is the same as a virtual cpu when HyperThreading is off) in order to use 10 physical cores. This test case is (486 − 471)/(471 − 150) ≈ 4.6% slower when all memory is bound to one node. This is higher than the case with 6 virtual cpus, so the conclusion is that the more cpus are used, the bigger the penalty from NUMA effects. Using 10 virtual cpus was slower than using 6, but bear in mind that this is still the "never-going-to-happen" worst case. This means that for these Intel servers with two nodes there is not much to gain even if the "never-going-to-happen" worst case actually happens.

2.8.6 Summary

All tests on the Intel servers had only two nodes. The "never-going-to-happen" worst case was at most 8% slower than the optimal case, and that test had a very big variance and thus a large uncertainty in the measurement. The second highest difference, with 10 virtual cpus, was around 4.6% between the worst case and the optimal case. This leads to the final conclusion that there is not much one can do to optimize the Intel servers from a NUMA point of view.

2.9 Conclusions AMD

2.9.1 AMD: Worst case vs Optimal case, 8 virtual cpus, input from buffer

This test case is (1024 − 780)/(1024 − 180) ≈ 28.9% slower when all memory is bound to two nodes compared to when all nodes are available. This is a very big difference from the two-node Intel servers. The conclusion of this test is that as the number of nodes increases, the interconnect has to work much harder shipping data to cpus in different locations. Table 18 shows one interesting detail: node 6 and node 7 have memory available, so these nodes have the obligation to send memory to the other cpus, but as can be seen in the row "Other node", only node 6 distributes memory to the other cpus while node 7 serves only its own cpus. This detail will be investigated further.

2.9.2 AMD: 5 nodes with memory, 8 virtual cpus, input from buffer

This test was motivated by the previous one, namely by the non-optimal sharing when more than one node is available: node 7 only serves itself while node 6 distributes memory to the other virtual cpus. There are three subtests; in every subtest five nodes have memory available and hence have to give memory to the other three virtual cpus. Looking first at numastat in table 21, it can be seen that node 2 serves the virtual cpus on node 1, and node 5 serves the virtual cpus on both node 3 and node 4. Compare these results with table 22, where node 4 in principle alone has to serve the virtual cpus on nodes 1, 2 and 3. Clearly the first case is closer to optimal than the second, and this can be seen from the time measurement. The third test, in table 23, has the same non-optimal structure as the second, except that node 3 is now the serving node. The optimization that can be done here is to make sure that the serving load is distributed over all nodes with memory available.

2.9.3 Summary AMD

The first thing to note about this server is that it is much more complicated than the Intel server: there are eight nodes and in total 48 virtual cpus. The worst case vs the optimal case gave a big difference of almost 30%, which is enough to make optimizing for NUMA effects worth considering. One non-optimal placement effect was found in subsection 2.9.2: a single node serves all virtual cpus that have no memory available on their local node, when in fact different nodes could serve different cpus. The gain of two serving nodes over one, seen in table 20, is approximately (807 − 771)/807 ≈ 4.4%. This number is of course biased compared to a realistic case, since the other nodes were completely unused. In any case this type of non-optimal behaviour should be reported to the kernel programmers and AMD. When memory is available, subsection 2.7.8 shows that the kernel behaves optimally in the sense that almost no allocation is done on other nodes, and the local node is fully exploited.

2.10 Shared libraries

To clarify how the results can be interpreted, look first at table 25. It says that the file pages amount to 125 mb on node 0 and 140 mb on node 1; that is, each node holds its own copy of all the libraries needed by its processes. Compare this with table 26, where the file pages amount to 83 mb on node 0 and 59 mb on node 1, meaning that 142 mb in total is shared between the two nodes. To understand why 2 mb sits on node 0 even when every node has its own copy of the shared libraries (tables 29 and 26), one has to know how this test was done: Moore and all its libraries were copied onto the local server as many times as there were nodes, and before the run began each copy was bound to one node. This hack was used because no command was found to perform the binding directly. The 2 mb are the libraries loaded when the computer starts up, for example libc, which were therefore not replicated in these tests. In general it is easy to see that the test was done correctly by looking at tables 26, 27, 29 and 30. One thing to note is that this test suffers from the computer scientist's uncertainty principle, so the numa maps data was not taken on the same occasion as the time measurement. The result for Intel was (847 − 846)/846 ≈ 0.12%, a very small effect. This coincides with the other Intel tests: the kernel works well on the two-node Intel servers. The result for AMD was (2007 − 1993)/1993 ≈ 0.7%, some multiples higher than the Intel result, which coincides with the other comparisons between the Intel and AMD servers.
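The per-node sums in the tables above come from adding up the N<node>=<pages> fields of /proc/<pid>/numa_maps (subsection 2.5.2). A minimal sketch of such a summation is shown below; it sums total pages only, and the 4 kb page size is an assumption.

#include <cctype>
#include <fstream>
#include <iostream>
#include <map>
#include <string>

// Sum the "N<node>=<pages>" fields of /proc/<pid>/numa_maps to get
// the total pages per node, as in the "Total pages" rows above.
int main(int argc, char** argv)
{
    if (argc < 2) return 1;
    std::ifstream maps("/proc/" + std::string(argv[1]) + "/numa_maps");
    std::map<int, long> pagesPerNode;

    std::string field;
    while (maps >> field) {
        if (field.size() > 2 && field[0] == 'N' &&
            std::isdigit(static_cast<unsigned char>(field[1]))) {
            const std::size_t eq = field.find('=');
            if (eq == std::string::npos) continue;
            pagesPerNode[std::stoi(field.substr(1, eq - 1))]
                += std::stol(field.substr(eq + 1));
        }
    }
    for (const auto& p : pagesPerNode)
        std::cout << "Node " << p.first << " pages=" << p.second
                  << " that is " << p.second * 4 << " kb\n";   // assumes 4 kb pages
}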

3 Profiling, correctness and evaluating the speed of the HLT

3.1 Introduction

Profiling is vital for making a program run faster: finding bottlenecks and optimizing them is the most effective way to gain speed. This section describes three important steps that are all compulsory when optimizing code.

3.2 Three compulsory steps

3.2.1 Step 1: Profiling and optimizing

The first step is to profile the software. Profiling the HLT, for example, can be done with the program I wrote. This program uses GoogleProfiler together with Kcachegrind; both are great software tools that make life very easy for a programmer. To see the help-file of the program, type:

python run_tests.py -h

A long list of options will appear. The most important example is listed here:

python run_tests.py -n v9r3 -f 100 -e 1000 -v -i /scratch/yourself/74465_0x001F_MB_bit11.raw -p 1 -g x86_64-slc5-gcc43-dbg

Now the HLT software will be profiled with the compile option "-O0 -g" enabled. After the run there will be a line in the console saying "Kcachegrind file at location: /tmp/profile-$USER-YYYY_MM_DD_HH_II/kcachegrind_data".


This file can be opened with the command:

kcachegrind /tmp/profile-$USER-YYYY_MM_DD_HH_II/kcachegrind_data

Once the program starts, it's time to find the bottleneck. All options are described in more detail in the help-file or at https://lhcbonline.cern.ch/bin/view/Online/Profiling, and for help with kcachegrind go to https://lhcbonline.cern.ch/bin/view/Online/Kcachegrind. It is recommended to recompile the whole HLT software with at least "-O2 -g" in order to see if the bottleneck is the same as with "-O0 -g". For tips on how to optimize HLT code, jump to subsection 4.1.

3.2.2 Step 2: Correctness

This step applies only to optimizations that do not change the output of the current software. Unfortunately the program run_tests.py cannot fulfil this step with 100% satisfaction, because there is no easy way to obtain the selected events of the HLT. The simplest approach is to run the program twice:

python run_tests.py -n v9r3 -f 100 -e 1000 -v -i /scratch/yourself/74465_0x001F_MB_bit11.raw -g x86_64-slc5-gcc43-opt -z

python run_tests.py -n v9r3 -f 100 -e 1000 -v -i /scratch/yourself/74465_0x001F_MB_bit11.raw -g x86_64-slc5-gcc43-opt -z -u /tmp

The first line uses your own cmtuser folder with the optimization you have created. The second line changes your cmtuser to the dummy folder /tmp, so the original code is used. After the two programs have finished, compare the two Moore.Core0.Iter0.out files against each other:

less /tmp/profile-$USER-YYYY_MM_DD_HH_II/output/Moore.Core0.Iter0.out
less /tmp/profile-$USER-YYYY'_MM'_DD'_HH'_II'/output/Moore.Core0.Iter0.out

Press the End key and compare the number of selected events, which is the second last column. As an example, one of the outputs might look like this (the first and the second last column have been extracted from the list):

Hlt2Global 3179
Hlt2GlobalPreScaler 3179
Hlt2GlobalHltFilter 3179
Hlt2GlobalPostScaler 2808
HltEndSequence 30000
HltRoutingBitsWriter 30000
HltGlobalMonitor 30000
HltL0GlobalMonitor 30000
HltDecReportsWriter 3179
HltSelReportsMaker 3179
HltSelReportsWriter 3179
HltVertexReportsMaker 3179
HltVertexReportsWriter 3179
HltLumiWriter 3179
LumiStripper 3179
LumiStripperFilter 3179
LumiStripperPrescaler 2075
bankKiller 2075

Obviously this is a quantitative rather than a qualitative comparison. The second way is to look into the out-raw-files that have been created with the option "-z". The file is located at /tmp/profile-$USER-YYYY_MM_DD_HH_II/outputfile.raw. This is a little bit tricky, since one has to know what the raw-file format of an event looks like. So for correctness the program run_tests.py is not completely satisfying.

3.2.3 Step 3: Evaluating the speed

The HLT software has a natural variance in how long a run takes. This variance of course increases if the computer you are testing your improvements on has other processes running at the same time, consuming a lot of cpu power.

The easiest way to check whether a speed improvement helped is to run the following commands:

python run_tests.py -n v9r3 -f 100 -e 5000 -v -i /scratch/yourself/74465_0x001F_MB_bit11.raw -g x86_64-slc5-gcc43-opt --iterations 5

python run_tests.py -n v9r3 -f 100 -e 5000 -v -i /scratch/yourself/74465_0x001F_MB_bit11.raw -g x86_64-slc5-gcc43-opt --iterations 5 -u /tmp

Here exactly the same program is run five times with 5000 events, and the variance is printed at the end. That way the variance is under control and won't confuse the programmer.
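The mean and spread printed at the end amount to the following computation over the per-iteration wall times. This is a sketch, not the actual code of run_tests.py, and the times are made-up values:

#include <cmath>
#include <cstdio>
#include <vector>

int main()
{
    // Wall times of the five iterations, in seconds (made-up values).
    const std::vector<double> t = { 961, 948, 972, 955, 959 };

    double mean = 0;
    for (double x : t) mean += x;
    mean /= t.size();

    double var = 0;                             // sample variance
    for (double x : t) var += (x - mean) * (x - mean);
    var /= t.size() - 1;

    std::printf("Mean total: %.0f +- %.0f s.\n", mean, std::sqrt(var));
}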

4 Optimizing the code of HLT

4.1 Introduction

This section is about actually optimizing code. The main discussion will be about my optimization of the HLT; the procedure followed the three steps of the section above. The bottleneck that was found and optimized is located inside the library PatFwdTool. Two tutorials on how to use the profiler have been created on the twiki at https://lhcbonline.cern.ch/bin/view/Online/Profiling and https://lhcbonline.cern.ch/bin/view/Online/Kcachegrind.

4.2 Finding bottlenecks

There is no point in optimizing a piece of code for speed if the code is almost never used. One should always optimize the piece of code that accounts for the largest share of the total running time. This is where the profiler comes into play: by profiling, for example, the HLT with both "-O0 -g" and "-O2 -g", it is very easy to find the bottlenecks.

4.3 Optimizing in general

There are some important things to think about when optimizing. They will be described briefly.

4.3.1 Choose data structure

The data structure is a very important choice that, if made correctly, can make the program both run faster and use less memory. There are a number of different data structures to choose from, implemented both by Boost and the standard library. As an example of why the choice of data structure is important, see my choice of data structure in subsection 4.4.2.

4.3.2 Choose algorithm

The choice of algorithm is extremely important; optimizing the code of a badly chosen algorithm is not a good option. As an example, look at the variety of sorting algorithms out there: even if the programmer writes an O(n^2) sorting algorithm entirely in assembly, an O(n log n) algorithm will win in the long run, even if it is written in, say, Python.
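As a concrete (non-HLT) illustration, both functions below sort a vector, but no amount of micro-optimization will save the O(n^2) version on large inputs:

#include <algorithm>
#include <cstddef>
#include <vector>

// O(n^2): loses on large inputs no matter how well it is micro-optimized.
void slowSort(std::vector<double>& v)
{
    for (std::size_t i = 0; i < v.size(); ++i)
        for (std::size_t j = i + 1; j < v.size(); ++j)
            if (v[j] < v[i]) std::swap(v[i], v[j]);
}

// O(n log n): the standard library's sort wins in the long run.
void fastSort(std::vector<double>& v)
{
    std::sort(v.begin(), v.end());
}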


4.3.3 Speed as a trade-off for size

The trade-off between speed and size is very common; my optimization of PatFwdTool is an example of exactly this. Temporary vectors have to be allocated somewhere, and this takes space in RAM as well as in the caches. The trade-off must be weighed against the gain in speed. In the case of subsection 4.4.2, five vectors of pointers, iterated in the inner loop, need to be allocated. Fortunately a vector of pointers doesn't take up much space, and the gain in speed is in some cases as much as 5-10% (the gain depends on the data the HLT software is running on, which is also connected to the size of the temporary vectors), so the conclusion is that it is worth it in this case.

4.4 Optimizing PatFwdTool

This section is about my optimizations of the HLT. I used the profiler described in subsection 3.2.1, and in the following I explain how I optimized the code.

4.4.1 Profiling

In order to find the biggest bottleneck of the HLT, good data samples were needed, good in the sense that they should characterize the input the real HLT gets from the experiment. I got two different types of data:

• Samples with micro-bias triggered events.
• Samples with micro-bias triggered events with an L0-Physics "yes", according to L0 TCK 1F.

After a few runs over these kinds of files the biggest bottleneck was found. Figure 5 shows the general picture with the compilation flags "-O2 -g". The biggest cost is clearly a function called fitXProjection, and inside it another function, distanceForFit, takes the largest share. Here it is clear that profiling with "-O2 -g" has its limits: at this point it is not possible to step into distanceForFit and see which line numbers take the most time, probably because the compiler inlines the function. This is an example of why, as described in 3.2.1, profiling with "-O0 -g" is a good complement. Note that the for-loop shown is the innermost loop, i.e. a loop inside a loop inside a loop; this is why a single if-statement at line 527 can take 1.07% of the total time of the program.

Figure 5 Profiling picture with GoogleProfiler, visualized with kcachegrind and compiled with ”-O2 -g”.


4.4.2 New data structure: Pre-sorted lists

There are two if-statements at lines 526 and 527, and inside the function distanceForFit there are a couple more if-statements that have to be checked over and over again. Some of these conditions never change for the whole running time and some do change (the conditions are unique for every hit), but the changes occur not in the inner loop but in an outer loop. The following pseudo-code explains the situation:

Outerloop 1
  Outerloop 2
    Innerloop over hits
      if ( !hit.isSelected() ) continue;
      if ( cond_never_changed1 ) continue;
      do_stuff1
      if ( cond_never_changed2 )
        if ( cond_never_changed3 )
          do_stuff2
        elseif ( cond_never_changed4 )
          do_stuff3
        else
          do_stuff4
      else
        do_stuff5
    End of Innerloop over hits
  End of Outerloop 2
  Pick one element with respect to the logic of do_stuff above
  and change the state hit.isSelected() of one hit
End of Outerloop 1

Note that cond_never_changed# is unique for every hit. To my knowledge, all these if-statements create two big problems for the computer, namely:

• Both cache memory and computing power are wasted on fetching and checking !hit.isSelected() and cond_never_changed1: if one of these statements is true, nothing is computed and the computer jumps to the next hit in the for-loop.

• There is no way to predict the next instruction. When the branches cannot be predicted, stalls are created frequently, which is very bad for performance.

So the optimization idea is the following: before all these loops start, create temporary lists, where every temporary list is the set of elements that goes through the loops hitting exactly the same if-statements. After every iteration of Outerloop 1, remove the element whose isSelected() state changes. The temporary lists must support fast insertion, fast deletion and fast iteration over all elements. The pseudo-code has now changed to:

Outerloop over hits
  if ( !hit.isSelected() ) continue;
  if ( cond_never_changed1 ) continue;
  if ( cond_never_changed2 )
    if ( cond_never_changed3 )
      save hit into list_1
    elseif ( cond_never_changed4 )
      save hit into list_2
    else
      save hit into list_3
  else
    save hit into list_4
End of Outerloop over hits

Outerloop 1
  Outerloop 2
    Innerloop over hits in list_1
      do_stuff1
      do_stuff2
    End of Innerloop over hits in list_1
    Innerloop over hits in list_2
      do_stuff1
      do_stuff3
    End of Innerloop over hits in list_2
    Innerloop over hits in list_3
      do_stuff1
      do_stuff4
    End of Innerloop over hits in list_3
    Innerloop over hits in list_4
      do_stuff1
      do_stuff5
    End of Innerloop over hits in list_4
  End of Outerloop 2
  Delete the element whose hit.isSelected() state changes
End of Outerloop 1
Delete lists
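In C++ the list-building pass could look like the following minimal sketch. The Hit interface and the condition members are hypothetical stand-ins for the real PatFwdTool types:

#include <vector>

struct Hit {
    bool selected;
    bool cond1, cond2, cond3, cond4;   // per-hit conditions, fixed during the loops
    bool isSelected() const { return selected; }
};

// Classify the hits once, so that the hot loops can afterwards run
// branch-free over each temporary list.
void buildLists(std::vector<Hit>& hits,
                std::vector<Hit*>& list1, std::vector<Hit*>& list2,
                std::vector<Hit*>& list3, std::vector<Hit*>& list4)
{
    for (Hit& hit : hits) {
        if (!hit.isSelected()) continue;
        if (hit.cond1) continue;       // this hit is never processed at all
        if (hit.cond2) {
            if      (hit.cond3) list1.push_back(&hit);
            else if (hit.cond4) list2.push_back(&hit);
            else                list3.push_back(&hit);
        } else {
            list4.push_back(&hit);
        }
    }
}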

What can be seen from the new pseudo-code is that the computer has no conditions for skipping elements or branching into different code, except when an inner loop ends. Therefore this code runs much faster and helps the computer make correct predictions, keeping the number of stalls low. The data structure for the temporary lists has to be fast at inserting elements, deleting elements and iterating over the elements. Note also that in this case the order of the elements doesn't matter, and the size of a list never increases. A very fast structure that can be used is an array: it needs only one malloc, so insertion is fast, and deletion is also fast thanks to a smart tactic that exploits the fact that the order doesn't matter. The deletion operation is explained in figure 6.

Figure 6 Fast deletion when the order of the elements doesn't matter.

When an element number "x" is going to be deleted, no free needs to be used: just copy the last element over element number "x" and decrease the size from "n" to "n-1". This way fast insertion, deletion and iteration are obtained.
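The deletion of figure 6, written out over a std::vector (a sketch; any array-like container works):

#include <cstddef>
#include <vector>

// Delete element i from an unordered array in O(1): overwrite it with the
// last element and shrink by one. No free() and no shifting of elements is
// needed, because the order of the hits does not matter here.
template <typename T>
void unorderedErase(std::vector<T>& v, std::size_t i)
{
    v[i] = v.back();
    v.pop_back();
}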

4.4.3 Math

In the innermost loop it is important to avoid unnecessary instructions, so the math should be done in the smartest way. One example, which I used to optimize distanceForFit, is shown here. The constraint is track.cosAfter() > 0.
Original version

dist = dist * track.cosAfter();
if ( fabs( dist - dx ) < fabs( dist + dx ) )
  dist = dist - dx;
else
  dist = dist + dx;
dist = dist / track.cosAfter();

New version

if ( dist > 0. )
  return dist - fabs(dx) / track.cosAfter();
else
  dist = dist + fabs(dx) / track.cosAfter();

In the example above it can be seen that there is one fabs call less and one multiplication less. The rewrite is valid because track.cosAfter() > 0: the original condition fabs(dist − dx) < fabs(dist + dx), with dist already scaled by track.cosAfter(), holds exactly when dist and dx have the same sign, so both versions move dist toward zero by fabs(dx)/track.cosAfter().

4.4.4 Avoiding divisions

Floating-point and integer division take much longer to perform than multiplication, addition and subtraction. Floating-point division takes around 20-45 clock cycles and integer division around 40-80 clock cycles for 32-bit integers. Compare this with, for example, integer multiplication (multiplication is slower than addition and subtraction), which takes around 3-10 clock cycles [4]. In other words, divisions should be avoided inside inner loops. There is one good trick that can be used extensively in the HLT. The trick requires two conditions:

• The value in the denominator changes fewer times than it is used in divisions.

• There is no need for extreme precision.

If these two conditions are true, then save the reciprocal as an attribute in the class and multiply by the reciprocal instead of dividing. Regarding the second point about extreme precision: there exist fast division algorithms that use the reciprocal itself, a method called Newton-Raphson division [14], so there should be no loss in precision from first calculating the reciprocal with this trick. Whether this method is used in modern computers is outside the scope of my knowledge. Returning to the example from subsection 4.4.3, one more division can actually be replaced by a multiplication in this case: the denominator track.cosAfter() changes far fewer times than the division is performed. So a new, faster solution would be:
Newer version

if ( dist > 0. )
  return dist - fabs(dx) * track.invcosAfter();
else
  dist = dist + fabs(dx) * track.invcosAfter();

Here track.invcosAfter() is the saved reciprocal of track.cosAfter(). This does make a big difference in terms of performance. This optimization was actually not done by me, due to lack of time, but a brave programmer who wants to make the HLT even faster is more than welcome to implement it.
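A sketch of how the reciprocal could be cached; Track and the member names are illustrative, not the actual HLT classes:

class Track {
public:
    void setCosAfter(double c) {
        m_cosAfter    = c;
        m_invCosAfter = 1.0 / c;   // the one division, paid only when the value changes
    }
    double cosAfter()    const { return m_cosAfter; }
    double invCosAfter() const { return m_invCosAfter; }
private:
    double m_cosAfter;
    double m_invCosAfter;
};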


4.5 Conclusions

In this section, tips and tricks for making code faster have been explained, illustrated by an optimization of the HLT software done by the author. Hopefully some of these tips and tricks can be used in other programs, leading to a more efficient and faster HLT at LHCb.

5 References

[1] Andi Kleen, An NUMA API for Linux, SUSE Labs, 2004. http://www.halobates.de/numaapi3.pdf
[2] Ulrich Drepper, What Every Programmer Should Know About Memory, Red Hat, Inc., 2007. http://www.unilim.fr/sci/wiki/_media/cali/cpumemory.pdf
[3] Ulrich Drepper, What Every Programmer Should Know About Memory, pages 43-46, Red Hat, Inc., 2007. http://www.unilim.fr/sci/wiki/_media/cali/cpumemory.pdf
[4] Agner Fog, Optimizing software in C++, pages 140-141, Copenhagen University College of Engineering, 2010. http://www.agner.org/optimize/optimizing_cpp.pdf
[5] Andi Kleen, numa(3) - Linux man page, SuSE Labs, 2004. http://linux.die.net/man/3/numa
[6] Tracy Carver, Magny-Cours and Direct Connect Architecture 2.0, 2010. http://developer.amd.com/documentation/articles/pages/Magny-Cours-Direct-Connect-Architecture-2.0.aspx
[7] Brandon Hutchinson, Understanding /proc/cpuinfo, 2007. http://www.brandonhutchinson.com/Understanding_proc_cpuinfo.html
[8] kernel.org, NUMA policy hit/miss statistics. http://www.kernel.org/doc/Documentation/numastat.txt
[9] kernel.org, sysfs devices/system/cpu, 2010. http://www.kernel.org/doc/Documentation/ABI/testing/sysfs-devices-system-cpu
[10] linux.die.net, numa_maps(5) - Linux man page. http://linux.die.net/man/5/numa_maps
[11] linux.die.net, sched_setaffinity(2) - Linux man page. http://linux.die.net/man/2/sched_setaffinity
[12] David Ott, Optimizing software applications for NUMA, 2009. http://software.intel.com/en-us/articles/optimizing-software-applications-for-numa/
[13] Wikipedia, CPU cache, 2010. http://en.wikipedia.org/wiki/CPU_cache
[14] Wikipedia, Newton-Raphson division, 2010. http://en.wikipedia.org/wiki/Division_(digital)
