LHCb-PUB-2010-017
Public Note

Optimizing HLT code for run-time efficiency

Reference: LHCb-PUB-2010-017
Created: September 6, 2010
Last modified: November 4, 2010
Issue: 1
Revision: 0
Prepared by: Axel Thuresson (CERN PH and Lund, Sweden), Niko Neufeld (CERN PH)


Abstract

An upgrade of the High Level Trigger (HLT) farm at LHCb will be inevitable due to the increase in luminosity at the LHC. The upgrade will be done in two main ways. The first way is to make the software more efficient and faster. The second way is to increase the number of servers in the farm. This paper concerns both of these ways, divided into three parts. The first part is about NUMA: modern servers are all built with NUMA, so an upgraded HLT farm will use this new architecture, whereas the present HLT farm servers use the UMA architecture. After several tests it turned out that the Intel servers used for testing (having 2 nodes) showed very little penalty when comparing the worst case with the optimal case. The conclusion for the Intel servers is that the NUMA architecture does not affect the existing software negatively. Several tests were done on an AMD server having 8 nodes, and hence a more complicated structure. Non-optimal effects could be observed for this server, and a big difference was found between the worst case and the optimal case. So for the AMD server the NUMA architecture can affect the existing software negatively under certain circumstances. In the second part of the paper a program was made to help programmers find bottlenecks and optimize code while still maintaining correctness. In the third part, the program from part two was used to optimize a bottleneck in the HLT software. This optimization gained around 1-10% in speed, depending on the input data.

Document Status Sheet

1. Document Title: Optimizing HLT code for run-time efficiency
2. Document Reference Number: LHCb-PUB-2010-017

3. Issue  4. Revision  5. Date             6. Reason for change
   Draft  1            September 6, 2010   First version. Pasting from NUMA Battle-reports.

Contents

1 Introduction

2 NUMA
2.1 Introduction
2.2 Linux file-system definitions
2.3 /sys/devices/system
2.4 How to control the memory and the cpus
2.4.1 NUMA policy library
2.4.2 Affinity
2.5 Monitoring and the computer scientist's uncertainty principle
2.5.1 Numastat
2.5.2 /proc/#/numa_maps
2.5.3 The computer scientist's uncertainty principle
2.6 Test-cases
2.6.1 Intel-case
2.6.2 AMD-case
2.6.3 Reading from RAW-file on disk
2.6.4 Reading from shared memory


2.7 Results
2.7.1 INTEL: Worst case vs Optimal case: 8 virtual cpus, input from RAW-file on disk
2.7.2 INTEL: Interleaved: 8 virtual cpus, input from RAW-file on disk
2.7.3 INTEL: Worst case vs Optimal case, 6 virtual cpus without HyperThreading, input from buffer manager
2.7.4 INTEL: Worst case vs Optimal case, 6 virtual cpus without HyperThreading, input from raw-file
2.7.5 INTEL: Worst case vs Optimal case, 10 virtual cpus without HyperThreading, input from buffer manager
2.7.6 AMD: Worst case vs Optimal case, 8 virtual cpus, input from buffer
2.7.7 AMD: 5 nodes with memory, 8 virtual cpus, input from buffer
2.7.8 AMD: full run, 48 virtual cpus, input from file
2.7.9 Shared libraries
2.7.10 Intel: Shared libraries
2.7.11 AMD: Shared libraries
2.8 Conclusions Intel
2.8.1 INTEL: Worst case vs Optimal case on 8 virtual cpus
2.8.2 INTEL: Interleaved: 8 virtual cpus, input from RAW-file on disk
2.8.3 INTEL: Worst case vs Optimal case, 6 virtual cpus without HyperThreading, input from buffer manager
2.8.4 INTEL: Worst case vs Optimal case, 6 virtual cpus without HyperThreading, input from raw-file
2.8.5 INTEL: Worst case vs Optimal case, 10 virtual cpus without HyperThreading, input from buffer manager
2.8.6 Summary
2.9 Conclusions AMD
2.9.1 AMD: Worst case vs Optimal case, 8 virtual cpus, input from buffer
2.9.2 AMD: 5 nodes with memory, 8 virtual cpus, input from buffer
2.9.3 Summary AMD
2.10 Shared libraries

3 Profiling, correctness and evaluating the speed of the HLT
3.1 Introduction
3.2 Three compulsory steps
3.2.1 Step 1: Profiling and optimizing
3.2.2 Step 2: Correctness
3.2.3 Step 3: Evaluating the speed

4 Optimizing the code of HLT
4.1 Introduction
4.2 Finding bottlenecks
4.3 Optimizing in general
4.3.1 Choose data structure
4.3.2 Choose algorithm

4.3.3 Speed as a trade-off for size
4.4 Optimizing PatFwdTool
4.4.1 Profiling
4.4.2 New data structure: Pre-sorted lists
4.4.3 Math
4.4.4 Avoiding divisions
4.5 Conclusions

5 References

List of Figures

1 The UMA (Uniform Memory Access) architecture [12].
2 The NUMA (Non-Uniform Memory Access) architecture [12].
3 Illustration of the cache layout on a mono-processor (UMA) [2].
4 Illustration of the cache layout on a multi-processor (UMA) [2].
5 Profiling picture with GoogleProfiler, visualized with kcachegrind and compiled with "-O2 -g".
6 Fast deletion when the order of the elements doesn't matter.

List of Tables

1 Information from "blade-test-01" at location /sys/devices/system/node#/cpu#/.
2 Information from "blade-test-10" at location /sys/devices/system/node0/cpu#/.
3 Worst case vs Optimal case: 8 virtual cpus
4 NUMASTAT 8 virtual cpus, memory on node1
5 NUMASTAT 8 virtual cpus, memory on both nodes
6 Interleaved: 8 virtual cpus
7 NUMASTAT 8 virtual cpus, interleaved
8 Worst case vs Optimal case: 6 virtual cpus, buffer manager
9 NUMASTAT 6 virtual cpus, memory on node1
10 NUMASTAT 6 virtual cpus, memory on both nodes
11 Worst case vs Optimal case: 6 virtual cpus, raw-file
12 NUMASTAT 6 virtual cpus, memory on node1
13 NUMASTAT 6 virtual cpus, memory on both nodes
14 Worst case vs Optimal case: 10 virtual cpus, buffer manager
15 NUMASTAT 10 virtual cpus, memory on node1
16 NUMASTAT 10 virtual cpus, memory on both nodes
17 Worst case vs Optimal case: 8 virtual cpus, buffer
18 NUMASTAT 8 virtual cpus, memory on node6,7
19 NUMASTAT 8 virtual cpus, memory on all nodes
20 Five nodes available, 8 virtual cpus, buffer


21 NUMASTAT 8 virtual cpus, memory on node0,2,5,6,7
22 NUMASTAT 8 virtual cpus, memory on node0,4,5,6,7
23 NUMASTAT 8 virtual cpus, memory on node3,4,5,6,7
24 NUMASTAT 48 virtual cpus, memory on all nodes
25 Intel: Optimal case vs Worst case
26 Numa maps summation: No shared lib
27 Numa maps summation: One shared lib to all nodes
28 AMD: Optimal case vs Worst case
29 Numa maps summation: No shared lib
30 Numa maps summation: One shared lib to all nodes
31 Invoking of numa miss by filling a node
32 NUMASTAT 8 virtual cpus, both nodes available, but no memory left on node 0

1 Introduction

An upgrade of the High Level Trigger (HLT) farm at LHCb will be inevitable due to the increase in luminosity at the LHC. The upgrade will be done in two main ways. The first way is to make the software more efficient and faster. The second way is to increase the number of servers in the farm. This paper covers both the HLT software and an investigation of the new techniques that modern servers are built on, divided into three parts whose common goal is to optimize the HLT. The first part is about NUMA effects. Modern servers today are all built with NUMA (Non-Uniform Memory Access). The upgrade will only consist of servers with NUMA, in contrast to the present farm that consists of servers with UMA. The HLT software is built and tested on UMA servers, so the transition to NUMA might create a non-optimal situation. The second part is about three compulsory steps that a programmer must fulfil in order to optimize the code of the HLT software. A program has been created that tries to unite these three steps into one and the same program. This tool will make it faster for programmers to create optimized code. The third and last part is about optimizing the HLT software itself. The three steps were used to obtain 1-15% better performance (depending on the input data to the software). Tips and tricks that can be used to make code run faster are explained.

2 NUMA

2.1 Introduction

NUMA stands for "Non-Uniform Memory Access". It is a shared-memory architecture in which the cpus are attached to nodes: the memory access time (distance) to the local node is shorter than the access time to memory on a foreign node. NUMA can be compared with UMA, where the distance to memory is equal for all processors [12]. UMA is illustrated in figure 1 and NUMA in figure 2. In figure 2 it can be seen that 4 processors belong to one node, and this simple illustration shows that accessing memory on a foreign node is more distant than accessing the memory of the local node. The mechanism that transfers memory from a foreign node to a physical core is called the Interconnector. Today (September 2010) the HLT farm at LHCb consists of approximately 500 servers that are all built on the UMA architecture, but modern servers now use NUMA instead of UMA. So when the HLT farm is upgraded with new servers, it will certainly be with the NUMA architecture. This is the main motivation for investigating this new architecture together with the HLT software.

Figure 1 The UMA (Uniform Memory Access) architecture [12].

Figure 2 The NUMA (Non-Uniform Memory Access) architecture [12].

2.2 Linux file-system definitions

A very confusing part when discussing NUMA effects with my supervisor and the other people who helped me was that our definitions of the words "core", "cpu", "processor", "chip", "memory" etc. did not coincide. One reason for this is that the Linux file-system defines a cpu as a virtual cpu (/proc/cpuinfo, /sys/devices/system/cpu/cpu# and /sys/devices/system/node#/cpu#/). This paper should not give rise to the same confusion. Therefore a couple of definitions are written down here so that everyone can understand what is said. To see all the details of how the information is presented in the Linux file-system, jump directly to subsection 2.3.

Definition: A physical cpu (physical processor or multi-core processor) is a processing system that consists of a number of independent physical cores.

Definition: A physical core (individual processor) is a device that performs the instructions of a program.

Definition: A virtual cpu (virtual processor) is the same thing as a physical core, with the exception of HyperThreading, where one physical core can give rise to several virtual cpus.

Definition: A node consists of a set of virtual cpus together with a number of memories. What is characteristic about a node is that it has one big memory, and every virtual cpu in the node has the same distance to this memory.

In the Linux file-system, at the locations /sys/devices/system/cpu/cpu# and /sys/devices/system/node#/cpu#/, the term cpu# is a virtual cpu labelled with an id-number. To show how to separate these definitions once you have a Linux console in front of you: the number of physical cpus can be obtained by

grep ’physical id’ /proc/cpuinfo | sort | uniq | wc -l

The number of virtual cpus can be obtained by

grep ^processor /proc/cpuinfo | wc -l

The number of physical cores per physical cpu can be obtained by

grep ’cpu cores’ /proc/cpuinfo


Since this paper only deals with multi-core processors, there is no need to define what a mono-core processor is. But of course, if the number of physical cores per physical cpu is one, then it is a mono-core processor [7].

2.3 /sys/devices/system

There are two folders of interest inside the folder /sys/devices/system. The first folder is called cpu. Inside this folder we find a list of all the virtual cpus (see the definitions in subsection 2.2). There are a lot of files inside these folders; only the ones that are used in this paper will be listed here.

cpu#
    cache/index#
        level
        shared_cpu_map
        size
        type
        ways_of_associativity
    topology
        core_id
        core_siblings
        physical_package_id
        thread_siblings

cache/index# gives information about the cache layout. In general a modern computer has around four caches: two level 1 caches, one for instructions and one for data, whose sizes are in general small and hence fast; usually a level 2 cache that is larger than the level 1 caches; and finally, before the big RAM memory, there might be a level 3 cache as well. The situation is simplified but well illustrated in figures 3 and 4.

Figure 3 Illustration over cache layout on a mono-processor (UMA) [2].

The level in the folder cache/index# says how close the cache is to the physical core. A general rule is that the size is smaller for a smaller level, since a smaller size means faster memory; obviously the fastest memory should be as close to the physical core as possible. The size is simply the size of the cache. The type is either "data", "instruction" or "unified", depending on whether the cache is supposed to hold data, assembly instructions, or both. The ways_of_associativity contains a number. Associativity is a trade-off (like most things in life): the trade-off is between power, chip area and potentially time, against misses [13]. The last file that is brought up here is shared_cpu_map; this tells us how the memory

Figure 4 Illustration over cache layout on a multi-processor (UMA) [2].

is shared amongst the virtual cpus. As an example, in figure 4 we can see that there are two physical processors (light grey boxes). In each physical processor we have two physical cores (dark grey boxes); the two physical cores share the level 2 and level 3 caches. In each core we have two threads that share the level 1 caches. The core_id in the folder topology represents the hardware platform's identifier rather than the kernel's identifier. The core_siblings contains the kernel map of cpu#'s hardware threads within the same physical_package_id. physical_package_id contains the physical socket number, and thread_siblings is the kernel map of cpu#'s hardware threads within the same physical core [9].

The next folder of interest is called node; it has the following structure.

node#
    cpu#
    distance
    meminfo
    numastat

Every folder cpu# inside node# says that these virtual cpus share the big RAM. That is, all these virtual cpus have the same distance to the node's memory, in contrast to all other existing nodes, which are further away from these cpus with respect to distance and hence access time. The folder cpu# contains all the information explained at the beginning of this subsection. The distance contains a vector of numbers (the size of the vector is the same as the number of nodes). The first number gives the distance from the cpus inside the node to the node's big RAM. The other numbers correspond to the distances from the cpus inside the node to the other nodes' big RAM. This number should not be taken too seriously but only as a hand-waving number. For example, if we have two nodes and the output from distance is

10 20

this does not mean that it takes double the amount of time to access the foreign node's memory. It only means that it will take longer to access the foreign memory than the local memory [3]. meminfo contains all the information about the big RAM: how big it is, how much is free, etc. numastat is an interesting file since it returns statistics about hits and misses with respect to the NUMA policy of allocating memory on the nodes (we will see what this is in subsection 2.5.1).
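The same distance information can also be queried programmatically. As an illustration, here is a minimal C sketch using libnuma (an addition for illustration, not part of the note's test programs; it assumes libnuma is installed and the program is linked with -lnuma):

#include <numa.h>
#include <stdio.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support on this machine\n");
        return 1;
    }
    int max = numa_max_node(); /* highest node id, e.g. 1 on a two-node server */
    /* prints the same matrix as /sys/devices/system/node/node#/distance */
    for (int from = 0; from <= max; from++) {
        for (int to = 0; to <= max; to++)
            printf("%3d ", numa_distance(from, to));
        printf("\n");
    }
    return 0;
}

On the two-node example above this would print "10 20" on the first row and "20 10" on the second.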

2.4 How to control the memory and the cpus

There are a number of functions for controlling the cpus and the NUMA policy. All these functions can be used in C, so if one wants to use these controlling functions in, for example, Python, a wrapper must be created.


2.4.1 NUMA policy library

The NUMA policy library is a simple programming interface that allows the programmer to change the NUMA policies. There are 4 different policies to choose from [1]:

• Default policy - Allocate on the local node, when node is out of memory go to neighbouring node.

• Bind policy - Allocate on a specific set of nodes, when node is out of memory allocation fails.

• Preferred policy - Allocate on a specific set of nodes, when node is out of memory allocate on neighbouring node.

• Interleave policy - Interleave memory allocation on a set of nodes. That is, spread the memory evenly among the set of nodes.

By default the default policy is used (obviously). This means that all the virtual cpus will always allocate memory on the RAM closest to them; when this memory is full, they look for available memory on the closest RAM that is not full, and so on. Interleaved memory is an interesting option for software that shares a big memory segment. This option was tested for the HLT (see the results). The following definitions come from [5]. The first thing to do when using these functions is to have a nodemask (a set of nodes). The nodemask is initialized by

void nodemask_zero(nodemask_t *mask)

and nodes are added to the set with

void nodemask_set(nodemask_t *mask, int node)

The int node corresponds to the number in the Linux file-system described in subsection 2.3. Once a nodemask has been created there is a big number of functions that can be used; only a subset of useful functions is explained in the following.

void numa_set_membind(nodemask_t *nodemask)

sets the memory allocation mask. The thread will only allocate memory from the nodes set in nodemask, according to the bind policy. Passing an argument of numa_no_nodes or numa_all_nodes turns off memory binding to specific nodes.

numa_bind(nodemask_t *nodemask)

binds the current thread and its children to the nodes specified in nodemask. They will only run on the cpus of the specified nodes and only be able to allocate memory from them.

void numa_set_interleave_mask(nodemask_t *nodemask)

sets the memory interleave mask for the current thread to nodemask. All new memory allocations are page-interleaved over all nodes in the interleave mask.
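A minimal C sketch of how these functions fit together (assuming the nodemask_t interface described above; compile and link with -lnuma; the choice of node 1 is arbitrary):

#include <numa.h>
#include <stdio.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA policy library not supported here\n");
        return 1;
    }

    nodemask_t mask;
    nodemask_zero(&mask);   /* start from an empty node set */
    nodemask_set(&mask, 1); /* add node 1 (the id used in /sys/devices/system/node) */

    /* bind policy: run on node 1's cpus and allocate memory only there */
    numa_bind(&mask);

    /* alternatively, restrict only the allocations but not the cpus:
     *   numa_set_membind(&mask);
     * or spread allocations evenly over the nodes in the mask:
     *   numa_set_interleave_mask(&mask);
     */

    /* ... start the actual workload here ... */
    return 0;
}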

2.4.2 Affinity

In the last subsection there were functions to bind threads to nodes and to bind all the virtual cpus in one node to that node. But sometimes it is important to bind a specific thread or process to a set of virtual cpus. This can be done by sched_setaffinity. First we need a cpu set; this can be initialized by

void CPU_ZERO(cpu_set_t *set)

and cpus are added to the set with

void CPU_SET(int cpu, cpu_set_t *set)

The int cpu corresponds to the number in the Linux file-system described in subsection 2.3. When we have a cpu set, the binding is created by

int sched_setaffinity(pid_t pid, unsigned int cpusetsize, cpu_set_t *mask)

A process's cpu affinity mask determines the set of cpus on which it is eligible to run [11].
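For completeness, a minimal C sketch of pinning the calling process to two virtual cpus (the cpu ids 0 and 2 are arbitrary; _GNU_SOURCE is needed on Linux for the CPU_* macros):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);   /* empty cpu set */
    CPU_SET(0, &set); /* allow virtual cpu 0 */
    CPU_SET(2, &set); /* allow virtual cpu 2 */

    /* pid 0 means "the calling process" */
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return 1;
    }

    /* from here on, this process only runs on virtual cpus 0 and 2 */
    return 0;
}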

2.5 Monitoring and the computer scientist's uncertainty principle

To verify that the controlling functions above work properly, some monitoring facilities exist in the Linux kernel. The first obvious program to use is top. When a process is bound to a certain virtual cpu, this can be seen in the list of virtual cpus in top. In that list it is easy to see which cpu has the biggest load and thus where the process is running.

2.5.1 Numastat

Numastat was mentioned in subsection 2.3. Here is an explanation of what the fields in numastat mean [8].

• numa_hit - A process wanted to allocate memory from this node, and succeeded.
• numa_miss - A process wanted to allocate memory from another node, but ended up with memory from this node.
• numa_foreign - A process wanted to allocate on this node, but ended up with memory from another one.
• local_node - A process ran on this node and got memory from it.
• other_node - A process ran on this node and got memory from another node.
• interleave_hit - Interleaving wanted to allocate from this node and succeeded.

There is also a command named numastat. This command simply puts the numastat files from all nodes in a list. One way to use numastat to monitor the computer processes is to save the output of numastat once before the processes start. After all processes have finished, the output of numastat is saved once again. The difference between the two outputs then gives the total statistics of what has occurred during the run.
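This before/after bookkeeping can be automated. The following C sketch (an illustration, not the tool used for the measurements in this note) reads one counter, numa_hit, of one node; the numastat file simply contains "name value" pairs, one per line:

#include <stdio.h>
#include <string.h>

/* read the numa_hit counter of one node from sysfs, or -1 on error */
static long read_numa_hit(int node)
{
    char path[64], name[32];
    long value, hit = -1;
    snprintf(path, sizeof(path), "/sys/devices/system/node/node%d/numastat", node);
    FILE *f = fopen(path, "r");
    if (!f)
        return -1;
    while (fscanf(f, "%31s %ld", name, &value) == 2)
        if (strcmp(name, "numa_hit") == 0)
            hit = value;
    fclose(f);
    return hit;
}

int main(void)
{
    long before = read_numa_hit(0);
    /* ... run the test program here, e.g. via system() ... */
    long after = read_numa_hit(0);
    printf("numa_hit on node 0 grew by %ld during the run\n", after - before);
    return 0;
}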

2.5.2 /proc/#/numa_maps

numa_maps is a very interesting file. An example of the output from numa_maps when the HLT software is running is shown below (only an excerpt; the whole file is too much information):

bind=1 file=/sw/lib/lcg/external/Python/2.5.4p2/x86_64-slc5-gcc43-opt/bin/python mapped=231 mapmax=3 N0=120 N1=111
bind=1 file=/sw/lib/lcg/external/Python/2.5.4p2/x86_64-slc5-gcc43-opt/bin/python anon=42 dirty=42 mapped=44 mapmax=3 N1=44
bind=1 anon=8 dirty=8 N1=8
bind=1 heap anon=74405 dirty=74405 N1=74405
bind=1 anon=2 dirty=2 N1=2
bind=1 anon=10 dirty=10 N1=10
bind=1 file=/sw/lib/lhcb/HLT/HLT_v10r4/Hlt/HltTracking/x86_64-slc5-gcc43-opt/libHltTracking.so mapped=170 mapmax=2 N1=170
bind=1 file=/sw/lib/lhcb/HLT/HLT_v10r4/Hlt/HltTracking/x86_64-slc5-gcc43-opt/libHltTracking.so anon=50 dirty=50 N1=50
bind=1 anon=4 dirty=4 N1=4
bind=1 anon=10 dirty=10 N1=10
bind=1 file=/sw/lib/lhcb/HLT/HLT_v10r4/Hlt/HltL0Conf/x86_64-slc5-gcc43-opt/libHltL0Conf.so mapped=119 mapmax=2 N1=119
bind=1 file=/sw/lib/lhcb/HLT/HLT_v10r4/Hlt/HltL0Conf/x86_64-slc5-gcc43-opt/libHltL0Conf.so anon=31 dirty=31 N1=31
bind=1 anon=3 dirty=3 N1=3
bind=1 file=/sw/lib/lhcb/LHCB/LHCB_v31r1/Tf/TsaKernel/x86_64-slc5-gcc43-opt/libTsaKernel.so mapped=10 mapmax=2 N1=10


bind=1 file=/sw/lib/lhcb/LHCB/LHCB_v31r1/Tf/TsaKernel/x86_64-slc5-gcc43-opt/libTsaKernel.so anon=2 dirty=2 N1=2

The first column contains the word bind. This means that the policy for this program has been changed from the default to the bind policy by the programmer. In some rows we find the word file=; this means that a library has been mapped into the RAM. The word mapmax gives the number of processes mapping a single page that were encountered during the scan; in other words, wherever mapmax is greater than one, this memory is shared by several processes. The word N# has two meanings. The first is on what node the memory is allocated; for example, N0 means that it exists on node 0. The number after N#= is the second meaning: it is the number of pages (a page is usually 4K) that are loaded into the memory, in other words the size. So as an example, N1=2 means that 2*4K of memory is allocated on node 1 [10]. It is clear that the whole library file has not always been loaded into memory; otherwise it would be pointless to give the number of pages. If one wants to know which parts are taken from the library, there is a column that has been removed from the example above. This column gives an address that connects the file numa_maps with the file maps. This paper will not go into the details of the file maps.

2.5.3 The computer scientist's uncertainty principle

There is an uncertainty principle in computer science: if you use a monitoring program at the same time as you run your test program, you will not get an accurate elapsed time for the run. This can occur, for example, when the program top is running, or when the kernel is writing to the numa_maps file at the same time as the test program is running. One has to be aware of this uncertainty because the elapsed time of a program is a vital part of seeing whether a speed optimization worked or not. The way to work around this uncertainty relation is similar to how one works around the Heisenberg uncertainty in physics. First start the program with full monitoring activated. If all the settings were as expected, close the program. Then start the program again without monitoring and measure the time. There is also a natural variance when running the HLT software: running the exact same program twice does not take exactly the same time. The way to handle this is to run exactly the same program iteratively, discard the first run and then calculate the variance over all the other runs. This way the variance is known, and temporary conditions in the computer will not destroy the measurement.
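One way to make this bookkeeping explicit (a sketch of one possible estimator; the note does not specify a formula): with timings t_1, ..., t_N from N identical runs, discard the warm-up run t_1 and compute

mean = (t_2 + t_3 + ... + t_N) / (N - 1)
variance = ((t_2 - mean)^2 + ... + (t_N - mean)^2) / (N - 2)

where the denominator N - 2 is the usual sample-variance correction for the N - 1 runs that are kept.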

2.6 Test-cases

2.6.1 Intel-case

There were five different Intel servers available for testing. In the following, the specification of one of these computers, the so-called "blade-test-01", is shown from the point of view of the Linux file-system according to subsection 2.3.

Virtual cpu  Cache   Type         Level  Size   Shared cpu map
cpu0         index0  Data         1      32K    00000101
             index1  Instruction  1      32K    00000101
             index2  Unified      2      256K   00000101
             index3  Unified      3      8195K  00005555
cpu2         index0  Data         1      32K    00000404
             index1  Instruction  1      32K    00000404
             index2  Unified      2      256K   00000404
             index3  Unified      3      8195K  00005555
cpu4         index0  Data         1      32K    00001010
             index1  Instruction  1      32K    00001010
             index2  Unified      2      256K   00001010
             index3  Unified      3      8195K  00005555
cpu6         index0  Data         1      32K    00004040
             index1  Instruction  1      32K    00004040
             index2  Unified      2      256K   00004040
             index3  Unified      3      8195K  00005555
cpu8         index0  Data         1      32K    00000101
             index1  Instruction  1      32K    00000101
             index2  Unified      2      256K   00000101
             index3  Unified      3      8195K  00005555
cpu10        index0  Data         1      32K    00000404
             index1  Instruction  1      32K    00000404
             index2  Unified      2      256K   00000404
             index3  Unified      3      8195K  00005555
cpu12        index0  Data         1      32K    00001010
             index1  Instruction  1      32K    00001010
             index2  Unified      2      256K   00001010
             index3  Unified      3      8195K  00005555
cpu14        index0  Data         1      32K    00004040
             index1  Instruction  1      32K    00004040
             index2  Unified      2      256K   00004040
             index3  Unified      3      8195K  00005555
cpu1         index0  Data         1      32K    00000202
             index1  Instruction  1      32K    00000202
             index2  Unified      2      256K   00000202
             index3  Unified      3      8195K  0000aaaa
cpu3         index0  Data         1      32K    00000808
             index1  Instruction  1      32K    00000808
             index2  Unified      2      256K   00000808
             index3  Unified      3      8195K  0000aaaa
cpu5         index0  Data         1      32K    00002020
             index1  Instruction  1      32K    00002020
             index2  Unified      2      256K   00002020
             index3  Unified      3      8195K  0000aaaa
cpu7         index0  Data         1      32K    00008080
             index1  Instruction  1      32K    00008080
             index2  Unified      2      256K   00008080
             index3  Unified      3      8195K  0000aaaa
cpu9         index0  Data         1      32K    00000202
             index1  Instruction  1      32K    00000202
             index2  Unified      2      256K   00000202
             index3  Unified      3      8195K  0000aaaa
cpu11        index0  Data         1      32K    00000808
             index1  Instruction  1      32K    00000808
             index2  Unified      2      256K   00000808
             index3  Unified      3      8195K  0000aaaa
cpu13        index0  Data         1      32K    00002020
             index1  Instruction  1      32K    00002020
             index2  Unified      2      256K   00002020
             index3  Unified      3      8195K  0000aaaa
cpu15        index0  Data         1      32K    00008080
             index1  Instruction  1      32K    00008080
             index2  Unified      2      256K   00008080
             index3  Unified      3      8195K  0000aaaa

Table 1: Information from ”blade-test-01” at location /sys/devices/system/node#/cpu#/.

From table 1 it can be seen that there are 16 virtual cpus. Each virtual cpu has access to four different caches; this situation is illustrated in figure 3. From shared cpu map the sharing of memories can be seen. The first thing to notice is that cpu0,8 (cpu0 and cpu8), cpu1,9, cpu2,10, cpu3,11, cpu4,12, cpu5,13, cpu6,14 and cpu7,15 all share the same index0 and index1 caches of size 32K each. This is due to the fact that HyperThreading is enabled, so every physical core has two virtual cpus. They also share the level 2 cache of 256K (obviously: if they share the level 1 cache they share the level 2 cache as well). The second thing to note is that cpu0, cpu2, cpu4, cpu6, cpu8, cpu10, cpu12 and cpu14 all share the same level 3 cache of 8195K. The same goes for the odd-numbered virtual cpus. The computer has two nodes, so every set of virtual cpus that shares a level 3 cache also shares a big RAM of 12G. The physical processor type is "Intel(R) Xeon(R) CPU X5560 @ 2.80GHz". From subsection 2.2 (/proc/cpuinfo) it can be seen that there are 2 physical processors that are both quad-core, resulting in 16 virtual cpus with HyperThreading enabled.
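The shared cpu map masks in table 1 can be decoded mechanically. A minimal C sketch (assuming the mask fits into 64 bits once any comma separators have been stripped):

#include <stdio.h>
#include <stdlib.h>

/* print which virtual cpus are set in a shared_cpu_map mask like "00005555" */
static void decode_mask(const char *hex)
{
    unsigned long long mask = strtoull(hex, NULL, 16);
    printf("%s ->", hex);
    for (int cpu = 0; cpu < 64; cpu++)
        if (mask & (1ULL << cpu))
            printf(" cpu%d", cpu);
    printf("\n");
}

int main(void)
{
    decode_mask("00000101"); /* level 1/2 caches: shared by cpu0 and cpu8 */
    decode_mask("00005555"); /* level 3 cache: shared by all even-numbered cpus */
    return 0;
}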

2.6.2 AMD-case

The AMD server is a little more complicated than the Intel server: the Intel server has 2 nodes compared to the AMD server's 8 nodes. There are 4 physical processors with 48 virtual processors. These 48 virtual processors are distributed over 8 nodes of 8G each. The physical processor type is "AMD Opteron(tm) Processor 6172". In the "NUMA war" between Intel and AMD, Intel has HyperThreading. AMD counters this by splitting each chip into two parts (that is, splitting the RAM of 16 GB and the L3 cache of 12 MB into two parts) [6]. It is structured in the following way:

Node0: contains cpu0,4,8,12,16,20
Node1: contains cpu24,28,32,36,40,44
Node2: contains cpu1,5,9,13,17,21
Node3: contains cpu25,29,33,37,41,45
Node4: contains cpu2,6,10,14,18,22
Node5: contains cpu26,30,34,38,42,46
Node6: contains cpu3,7,11,15,19,23
Node7: contains cpu27,31,35,39,43,47

Virtual cpu  Cache   Type         Level  Size   Shared cpu map
cpu0         index0  Data         1      64K    00000000,00000001
             index1  Instruction  1      64K    00000000,00000001
             index2  Unified      2      512K   00000000,00000001
             index3  Unified      3      5118K  00000000,00111111
cpu4         index0  Data         1      64K    00000000,00000010
             index1  Instruction  1      64K    00000000,00000010
             index2  Unified      2      512K   00000000,00000010
             index3  Unified      3      5118K  00000000,00111111
cpu8         index0  Data         1      64K    00000000,00000100
             index1  Instruction  1      64K    00000000,00000100
             index2  Unified      2      512K   00000000,00000100
             index3  Unified      3      5118K  00000000,00111111
cpu12        index0  Data         1      64K    00000000,00001000
             index1  Instruction  1      64K    00000000,00001000
             index2  Unified      2      512K   00000000,00001000
             index3  Unified      3      5118K  00000000,00111111
cpu16        index0  Data         1      64K    00000000,00010000
             index1  Instruction  1      64K    00000000,00010000
             index2  Unified      2      512K   00000000,00010000
             index3  Unified      3      5118K  00000000,00111111
cpu20        index0  Data         1      64K    00000000,00100000
             index1  Instruction  1      64K    00000000,00100000
             index2  Unified      2      512K   00000000,00100000
             index3  Unified      3      5118K  00000000,00111111

Table 2: Information from "blade-test-10" at location /sys/devices/system/node0/cpu#/.

The structure is the same for all nodes: the cpus within a node share a level 3 cache of 5118K, and each has its own level 2 cache of 512K and level 1 caches of 64K.

2.6.3 Reading from RAW-file on disk

Two different ways of simulating data input to the HLT have been used. One way is simply to read the input from a raw-file on disk. This was done with the option

Moore().inputFiles = [ "/group/trg/probbe/data/072406_0000000004.raw" ]

This raw-file was used for all NUMA testing.

2.6.4 Reading from shared memory

The second way of simulating data input to the HLT is a simulation of a buffer manager. This is done by setting up a shared memory, after which a program (written by Jean-Christophe Garnier) feeds data into the HLT. In MOORE the following was used:

input = "TestWriter"
mepMgr = OnlineEnv.mepManager(OnlineEnv.PartitionID, OnlineEnv.PartitionName, [input], True)
app.Runable = OnlineEnv.evtRunable(mepMgr)
app.ExtSvc.append(mepMgr)
eventSelector = OnlineEnv.mbmSelector(input=input, decode=False)
app.ExtSvc.append(eventSelector)
OnlineEnv.evtDataSvc()
eventSelector.REQ1 = "EvType=2;TriggerMask=0xffffffff,0xffffffff,0xffffffff,0xffffffff;VetoMask=0,0,0,0;MaskType=ANY;UserType=ALL;Frequency=PERC;Perc=100.0"


2.7 Results

In this subsection all the important tests are shown. For each test there is first a short explanation of the test setup, and then the results are shown.

2.7.1 INTEL: Worst case vs Optimal case: 8 virtual cpus, input from RAW-file on disk

This test is to see whether there are any NUMA effects. 8 virtual cpus are used in total (HyperThreading disabled): 4 virtual cpus from node 0 and 4 virtual cpus from node 1. There are two different cases. The worst case is when memory can only be allocated on node 1; this forces the Interconnector to distribute memory to all virtual cpus in node 0. The second case is the optimal case, when both nodes are available so that the Interconnector will not be used.

8 virtual cpus, memory on node1:       Mean total: 956 +- 91 s.
8 virtual cpus, memory on both nodes:  Mean total: 883 +- 72 s.

Table 3: Worst case vs Optimal case: 8 virtual cpus

NUMASTAT 8 virtual cpus, memory on node1
Type            Node 0   Node 1
Numa hit        333723   28154291
Numa miss       0        0
Numa foreign    0        0
Interleave hit  1787     1449
Local node      329537   14124993
Other node      4186     14029298

Table 4: NUMASTAT 8 virtual cpus , memory on node1

NUMASTAT 8 virtual cpus, memory on both nodes
Type            Node 0    Node 1
Numa hit        14383270  14061513
Numa miss       0         0
Numa foreign    0         0
Interleave hit  2078      809
Local node      14382980  14047638
Other node      290       13875

Table 5: NUMASTAT 8 virtual cpus , memory on both nodes


2.7.2 INTEL: Interleaved: 8 virtual cpus, input from RAW-file on disk

In the following, the function numa_set_interleave_mask is turned on. In some cases where programs share memory this will increase performance; the idea of interleaved memory is to spread all memory evenly over the nodes. 8 virtual cpus are used in total (HyperThreading disabled): 4 virtual cpus from node 0 and 4 virtual cpus from node 1.

8 virtual cpus, interleaved:  Mean total: 958 +- 5 s.

Table 6: Interleaved: 8 virtual cpus

NUMASTAT 8 virtual cpus, interleaved
Type            Node 0   Node 1
Numa hit        2204872  2207277
Numa miss       0        0
Numa foreign    0        0
Interleave hit  2176163  2174464
Local node      1144930  1081884
Other node      1059942  1125393

Table 7: NUMASTAT 8 virtual cpus , interleaved

2.7.3 INTEL: Worst case vs Optimal case, 6 virtual cpus without HyperThreading, input from buffer manager

This test is made to see whether there is any difference between the buffer manager and reading a file from disk. 6 virtual cpus are used, and one additional virtual cpu is dedicated to running the buffer manager that feeds the other cpus with data. Node 0 is filled with useless memory, so the cpus that have node 0 as their local node have to go to node 1 to allocate memory. The buffer manager has a 150 s sleep before starting, so this is subtracted in the conclusions.

6 virtual cpus, memory on node1:       Mean total: 699 +- 4 s.
6 virtual cpus, memory on both nodes:  Mean total: 689 +- 11 s.

Table 8: Worst case vs Optimal case: 6 virtual cpus, buffer manager

NUMASTAT 6 virtual cpus, memory on node1
Type            Node 0   Node 1
Numa hit        28633    2970338
Numa miss       0        3502687
Numa foreign    3502687  0
Interleave hit  15       837
Local node      28589    2923252
Other node      44       3549773

Table 9: NUMASTAT 6 virtual cpus , memory on node1

NUMASTAT 6 virtual cpus, memory on both nodes
Type            Node 0   Node 1
Numa hit        3130156  2825703
Numa miss       0        0
Numa foreign    0        0
Interleave hit  752      819
Local node      3124994  2741356
Other node      5162     84347

Table 10: NUMASTAT 6 virtual cpus , memory on both nodes

2.7.4 INTEL: Worst case vs Optimal case, 6 virtual cpus without HyperThreading, input from raw-file

This test has the same setup as the previous one, except that now the program reads from a raw-file and the numa_bind function is used. This test has a 30 s sleep before starting, so this is subtracted in the conclusions.

6 virtual cpus, memory on node1:       Mean total: 960 +- 19 s.
6 virtual cpus, memory on both nodes:  Mean total: 938 +- 23 s.

Table 11: Worst case vs Optimal case: 6 virtual cpus, raw-file

NUMASTAT 6 virtual cpus, memory on node1
Type            Node 0   Node 1
Numa hit        5962680  131493
Numa miss       0        0
Numa foreign    0        0
Interleave hit  389      486
Local node      3225534  92053
Other node      2737049  39731

Table 12: NUMASTAT 6 virtual cpus , memory on node1


NUMASTAT 6 virtual cpus, memory on both nodes
Type            Node 0   Node 1
Numa hit        3217064  2787190
Numa miss       0        0
Numa foreign    0        0
Interleave hit  292      292
Local node      3158864  2758680
Other node      58103    28316

Table 13: NUMASTAT 6 virtual cpus , memory on both nodes

2.7.5 INTEL: Worst case vs Optimal case, 10 virtual cpus without HyperThreading, input from buffer manager

This test is to see the effect of increasing the number of virtual cpus that are used. In the previous tests 6 or 8 virtual cpus were used, but this computer (blade-test-05) has 12 virtual cpus in total; 10 of these cpus are used to run MOORE and 1 cpu is used for running the buffer manager. The buffer manager has a 150 s sleep before starting, so this is subtracted in the conclusions.

10 virtual cpus, memory on node1:       Mean total: 486 +- 2 s.
10 virtual cpus, memory on both nodes:  Mean total: 471 +- 7 s.

Table 14: Worst case vs Optimal case: 10 virtual cpus, buffer manager

NUMASTAT 10 virtual cpus, memory on node1
Type            Node 0  Node 1
Numa hit        551705  9720007
Numa miss       0       0
Numa foreign    0       0
Interleave hit  463     425
Local node      539858  4856161
Other node      11847   4863846

Table 15: NUMASTAT 10 virtual cpus , memory on node1

NUMASTAT 10 virtual cpus, memory on both nodes
Type            Node 0   Node 1
Numa hit        5040697  4578916
Numa miss       0        0
Numa foreign    0        0
Interleave hit  369      501
Local node      5022921  4524241
Other node      17776    54675

Table 16: NUMASTAT 10 virtual cpus , memory on both nodes

2.7.6 AMD: Worst case vs Optimal case, 8 virtual cpus, input from buffer

The worst case for the AMD server would be to put one virtual cpu on each node and make a single node available for allocation. This is not quite the worst case used in this test, since one node does not have enough memory to hold 8 different processes of the HLT software; a single node was avoided so that the swap space would not affect the results. Therefore two nodes are made available to distribute memory to the 8 virtual cpus. The optimal case is to have all nodes available for allocation. There is a 180 s sleep in every iteration, so to give the percentage difference between the runs this must be subtracted in the conclusions. Unfortunately the variance was not saved for this test.

8 virtual cpus, memory on node6 and node7:  Mean total: 1042 s.
8 virtual cpus, memory on all nodes:        Mean total: 780 s.

Table 17: Worst case vs Optimal case: 8 virtual cpus, buffer

NUMASTAT 8 virtual cpus, memory on node6,7
Type            Node 0  Node 1  Node 2  Node 3  Node 4  Node 5  Node 6   Node 7
Numa hit        10456   3583    1416    584     719     1616    228501   194820
Numa miss       0       7219    297     2625    363     6750    0        0
Numa foreign    238496  200285  230512  189717  194058  194440  0        0
Interleave hit  8       8       3       3       1       2       54       69
Local node      10389   3523    1405    569     704     1601    228513   192734
Other node      67      7279    308     2640    378     6765    1230369  2086

Table 18: NUMASTAT 8 virtual cpus , memory on node6,7

NUMASTAT 8 virtual cpus, memory on all nodes
Type            Node 0  Node 1  Node 2  Node 3  Node 4  Node 5  Node 6  Node 7
Numa hit        246316  203590  189200  209123  212944  195191  186551  225853
Numa miss       0       0       0       0       0       0       0       0
Numa foreign    0       0       0       0       0       0       0       0
Interleave hit  48      48      49      46      49      52      54      47
Local node      245568  201576  186917  207095  210667  193051  184326  223663
Other node      748     2014    2283    2028    2277    2140    2225    2190

Table 19: NUMASTAT 8 virtual cpus , memory on all nodes

2.7.7 AMD: 5 nodes with memory, 8 virtual cpus, input from buffer

This test is motivated by the result above, where effectively only one node gave memory to the other cpus although two nodes were available. In the following, five nodes have memory available. There is a 180 s sleep in every iteration, so to give the percentage difference between the runs this must be subtracted in the conclusions. Unfortunately the variance was not saved for this test.

8 virtual cpus, memory on node0,2,5,6,7:  Mean total: 771 s.
8 virtual cpus, memory on node0,4,5,6,7:  Mean total: 807 s.
8 virtual cpus, memory on node3,4,5,6,7:  Mean total: 809 s.

Table 20: Five nodes available, 8 virtual cpus, buffer

NUMASTAT 8 virtual cpus, memory on node0,2,5,6,7
Type            Node 0  Node 1  Node 2  Node 3  Node 4  Node 5  Node 6  Node 7
Numa hit        185834  39420   233477  17109   13529   194234  233541  253641
Numa miss       0       0       188626  0       547     340506  0       0
Numa foreign    0       188626  0       163140  177913  0       0       0
Interleave hit  21      1       24      7       4       16      27      23
Local node      184863  39392   231942  16897   13371   192898  232097  251963
Other node      971     28      190161  212     705     341842  1444    1678

Table 21: NUMASTAT 8 virtual cpus , memory on node0,2,5,6,7

NUMASTAT 8 virtual cpus, memory on node0,4,5,6,7
Type            Node 0  Node 1  Node 2  Node 3  Node 4  Node 5  Node 6  Node 7
Numa hit        188773  38776   8878    9837    189945  192612  243830  247221
Numa miss       0       0       185     37007   549678  0       0       0
Numa foreign    0       178310  223630  184930  0       0       0       0
Interleave hit  22      5       0       3       34      36      31      36
Local node      188014  38757   8866    9774    188670  190126  242338  244869
Other node      759     19      197     37197   550953  2486    1492    2352

Table 22: NUMASTAT 8 virtual cpus , memory on node0,4,5,6,7


NUMASTAT 8 virtual cpus, memory on node3,4,5,6,7
Type            Node 0  Node 1  Node 2  Node 3  Node 4  Node 5  Node 6  Node 7
Numa hit        315     38566   2556    230796  196890  202723  187968  287148
Numa miss       0       5335    4319    541434  0       0       0       0
Numa foreign    191345  180069  179674  0       0       0       0       0
Interleave hit  0       3       2       25      30      29      21      26
Local node      315     38445   2549    229565  195858  200340  186437  284359
Other node      0       5456    4326    542665  1032    2383    1531    2789

Table 23: NUMASTAT 8 virtual cpus , memory on node3,4,5,6,7

2.7.8 AMD: full run, 48 virtual cpus, input from file

It is interesting to see how the kernel behaves when certain nodes are full, but it is also interesting to see the behaviour when all cpus are used without any constraints. This test only shows numastat: since only one run was done, the time is not interesting.

NUMASTAT 48 virtual cpus, memory on all nodes
Type            Node 0   Node 1   Node 2   Node 3   Node 4   Node 5   Node 6   Node 7
Numa hit        1122709  1544451  1016913  923527   888404   1350215  969061   1101646
Numa miss       0        0        0        0        0        0        0        0
Numa foreign    0        0        0        0        0        0        0        0
Interleave hit  286      280      271      225      294      248      310      226
Local node      1121950  1543447  1015275  921973   886621   1348793  967240   1100052
Other node      759      1004     1683     1554     1783     1422     1821     1594

Table 24: NUMASTAT 48 virtual cpus , memory on all nodes

2.7.9 Shared libraries

In all the above tests the effect of the shared libraries was discarded. The reason for this is that the kernel does not replicate code on node level, since this would take too much memory. As an example of why this is not done, think about what would happen if every little process one starts in Linux had to have around 2 MB only for libc; this is clearly not scalable. Nevertheless it is interesting to see whether there is any speed effect of replicating the HLT library code to all existing nodes, on both the Intel and AMD test servers. In order to analyze the situation, a small program that sums up the information given by numa_maps during runtime will be used. To understand how to interpret the results of the numa_maps summation program, an example will be given:

Total pages
Node 0 pages=182206 that is 746315 kb = 746 mb
Node 1 pages=216 that is 884 kb = 0 mb
Total file pages
Node 0 pages=30753 that is 125964 kb = 125 mb
Node 1 pages=216 that is 884 kb = 0 mb
Total heap pages
Node 0 pages=126502 that is 518152 kb = 518 mb
Node 1 pages=0 that is 0 kb = 0 mb
Total anon pages
Node 0 pages=24951 that is 102199 kb = 102 mb
Node 1 pages=0 that is 0 kb = 0 mb

This example was taken while the HLT was running on the Intel server. The logic is given by the simple equation total pages = file pages + heap pages + anon pages. "File pages" are the shared libraries and "anon pages" are the anonymous pages. In the example above one can see that the process is, with very high probability, running on a virtual cpu that has node 0 as its local node. The "file pages" amount to around 120 MB, and this is the size of the shared libraries that the HLT needs for this raw-file. One test for AMD and one test for Intel are shown in the following subsections. Each test is divided into two subtests. The first subtest is the optimal case in the sense that every node on the server has its own copy of the shared libraries (libc excluded). The second subtest is the worst case: the default policy is on, which lets all nodes share the libraries with some overlap. Observe that the worst case here is the normal case for the kernel, and that it is the optimal case in the sense of size.
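A minimal C sketch of such a summation (an illustration, not the author's original tool; it only totals the N#= page counts of /proc/<pid>/numa_maps for a two-node machine, assuming 4 kB pages):

#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
    char path[64], line[4096];
    long pages[2] = {0, 0};

    if (argc != 2) {
        fprintf(stderr, "usage: %s <pid>\n", argv[0]);
        return 1;
    }
    snprintf(path, sizeof(path), "/proc/%s/numa_maps", argv[1]);
    FILE *f = fopen(path, "r");
    if (!f) {
        perror(path);
        return 1;
    }
    while (fgets(line, sizeof(line), f)) {
        /* every mapping line carries zero or more "N<node>=<pages>" tokens */
        for (char *tok = strtok(line, " \n"); tok; tok = strtok(NULL, " \n")) {
            int node;
            long n;
            if (sscanf(tok, "N%d=%ld", &node, &n) == 2 && node >= 0 && node < 2)
                pages[node] += n;
        }
    }
    fclose(f);
    for (int node = 0; node < 2; node++)
        printf("Node %d pages=%ld that is %ld kb\n", node, pages[node], pages[node] * 4);
    return 0;
}

Extending the per-node array and splitting the totals by the file=, heap and anon keywords gives exactly the kind of summary shown in tables 26-30.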

2.7.10 Intel: Shared libraries

No shared lib (each node has its own lib):  Mean total: 846 +- 4 s.
One shared lib to all nodes:                Mean total: 847 +- 5 s.

Table 25: Intel: Optimal case vs Worst case

Numa maps summation: No shared lib (each node has its own lib)

Process of MOORE on node 0:
Total pages
Node 0 pages=182206 that is 746315 kb = 746 mb
Node 1 pages=216 that is 884 kb = 0 mb
Total file pages
Node 0 pages=30753 that is 125964 kb = 125 mb
Node 1 pages=216 that is 884 kb = 0 mb
Total heap pages
Node 0 pages=126502 that is 518152 kb = 518 mb
Node 1 pages=0 that is 0 kb = 0 mb
Total anon pages
Node 0 pages=24951 that is 102199 kb = 102 mb
Node 1 pages=0 that is 0 kb = 0 mb

Process of MOORE on node 1:
Total pages
Node 0 pages=621 that is 2543 kb = 2 mb
Node 1 pages=191359 that is 783806 kb = 783 mb
Total file pages
Node 0 pages=621 that is 2543 kb = 2 mb
Node 1 pages=34349 that is 140693 kb = 140 mb
Total heap pages
Node 0 pages=0 that is 0 kb = 0 mb
Node 1 pages=132062 that is 540925 kb = 540 mb
Total anon pages
Node 0 pages=0 that is 0 kb = 0 mb
Node 1 pages=24948 that is 102187 kb = 102 mb

Table 26: Numa maps summation: No shared lib


Numa maps summation: One shared lib to all nodes (default)

Process of MOORE on node 0:
Total pages
Node 0 pages=178371 that is 730607 kb = 730 mb
Node 1 pages=14533 that is 59527 kb = 59 mb
Total file pages
Node 0 pages=20465 that is 83824 kb = 83 mb
Node 1 pages=14533 that is 59527 kb = 59 mb
Total heap pages
Node 0 pages=132624 that is 543227 kb = 543 mb
Node 1 pages=0 that is 0 kb = 0 mb
Total anon pages
Node 0 pages=25282 that is 103555 kb = 103 mb
Node 1 pages=0 that is 0 kb = 0 mb

Process of MOORE on node 1:
Total pages
Node 0 pages=15175 that is 62156 kb = 62 mb
Node 1 pages=177760 that is 728104 kb = 728 mb
Total file pages
Node 0 pages=15175 that is 62156 kb = 62 mb
Node 1 pages=19846 that is 81289 kb = 81 mb
Total heap pages
Node 0 pages=0 that is 0 kb = 0 mb
Node 1 pages=132752 that is 543752 kb = 543 mb
Total anon pages
Node 0 pages=0 that is 0 kb = 0 mb
Node 1 pages=25162 that is 103063 kb = 103 mb

Table 27: Numa maps summation: One shared lib to all nodes

2.7.11 AMD: Shared libraries

No shared lib (each node has its own lib):  Mean total: 1993 +- 15 s.
One shared lib to all nodes:                Mean total: 2007 +- 14 s.

Table 28: AMD: Optimal case vs Worst case

Numa maps summation: No shared lib (each node has its own lib)

HLT that ran on node 3:
Total pages
Node 0 pages=788 that is 3227 kb = 3 mb
Node 1 pages=3 that is 12 kb = 0 mb
Node 2 pages=45 that is 184 kb = 0 mb
Node 3 pages=181255 that is 742420 kb = 742 mb
Node 4 pages=0 that is 0 kb = 0 mb
Node 5 pages=0 that is 0 kb = 0 mb
Node 6 pages=7 that is 28 kb = 0 mb
Node 7 pages=3 that is 12 kb = 0 mb
Total file pages
Node 0 pages=788 that is 3227 kb = 3 mb
Node 1 pages=3 that is 12 kb = 0 mb
Node 2 pages=45 that is 184 kb = 0 mb
Node 3 pages=30089 that is 123244 kb = 123 mb
Node 4 pages=0 that is 0 kb = 0 mb
Node 5 pages=0 that is 0 kb = 0 mb
Node 6 pages=7 that is 28 kb = 0 mb
Node 7 pages=3 that is 12 kb = 0 mb
Total heap pages
Node 0 pages=0 that is 0 kb = 0 mb
Node 1 pages=0 that is 0 kb = 0 mb
Node 2 pages=0 that is 0 kb = 0 mb
Node 3 pages=126485 that is 518082 kb = 518 mb
Node 4 pages=0 that is 0 kb = 0 mb
Node 5 pages=0 that is 0 kb = 0 mb
Node 6 pages=0 that is 0 kb = 0 mb
Node 7 pages=0 that is 0 kb = 0 mb
Total anon pages
Node 0 pages=0 that is 0 kb = 0 mb
Node 1 pages=0 that is 0 kb = 0 mb
Node 2 pages=0 that is 0 kb = 0 mb
Node 3 pages=24681 that is 101093 kb = 101 mb
Node 4 pages=0 that is 0 kb = 0 mb
Node 5 pages=0 that is 0 kb = 0 mb
Node 6 pages=0 that is 0 kb = 0 mb
Node 7 pages=0 that is 0 kb = 0 mb

HLT that ran on node 5:
Total pages
Node 0 pages=708 that is 2899 kb = 2 mb
Node 1 pages=6 that is 24 kb = 0 mb
Node 2 pages=80 that is 327 kb = 0 mb
Node 3 pages=4 that is 16 kb = 0 mb
Node 4 pages=7 that is 28 kb = 0 mb
Node 5 pages=181233 that is 742330 kb = 742 mb
Node 6 pages=54 that is 221 kb = 0 mb
Node 7 pages=0 that is 0 kb = 0 mb
Total file pages
Node 0 pages=708 that is 2899 kb = 2 mb
Node 1 pages=6 that is 24 kb = 0 mb
Node 2 pages=80 that is 327 kb = 0 mb
Node 3 pages=4 that is 16 kb = 0 mb
Node 4 pages=7 that is 28 kb = 0 mb
Node 5 pages=30076 that is 123191 kb = 123 mb
Node 6 pages=54 that is 221 kb = 0 mb
Node 7 pages=0 that is 0 kb = 0 mb
Total heap pages
Node 0 pages=0 that is 0 kb = 0 mb
Node 1 pages=0 that is 0 kb = 0 mb
Node 2 pages=0 that is 0 kb = 0 mb
Node 3 pages=0 that is 0 kb = 0 mb
Node 4 pages=0 that is 0 kb = 0 mb
Node 5 pages=126475 that is 518041 kb = 518 mb
Node 6 pages=0 that is 0 kb = 0 mb
Node 7 pages=0 that is 0 kb = 0 mb
Total anon pages
Node 0 pages=0 that is 0 kb = 0 mb
Node 1 pages=0 that is 0 kb = 0 mb
Node 2 pages=0 that is 0 kb = 0 mb
Node 3 pages=0 that is 0 kb = 0 mb
Node 4 pages=0 that is 0 kb = 0 mb
Node 5 pages=24682 that is 101097 kb = 101 mb
Node 6 pages=0 that is 0 kb = 0 mb
Node 7 pages=0 that is 0 kb = 0 mb

Table 29: Numa maps summation: No shared lib

Numa maps summation: One shared lib to all nodes (default)

HLT that ran on node 1:
Total pages
Node 0 pages=626 that is 2564 kb = 2 mb
Node 1 pages=32419 that is 132788 kb = 132 mb
Node 2 pages=42 that is 172 kb = 0 mb
Node 3 pages=0 that is 0 kb = 0 mb
Node 4 pages=7561 that is 30969 kb = 30 mb
Node 5 pages=4 that is 16 kb = 0 mb
Node 6 pages=0 that is 0 kb = 0 mb
Node 7 pages=1636 that is 6701 kb = 6 mb
Total file pages
Node 0 pages=626 that is 2564 kb = 2 mb
Node 1 pages=1446 that is 5922 kb = 5 mb
Node 2 pages=42 that is 172 kb = 0 mb
Node 3 pages=0 that is 0 kb = 0 mb
Node 4 pages=7561 that is 30969 kb = 30 mb
Node 5 pages=4 that is 16 kb = 0 mb
Node 6 pages=0 that is 0 kb = 0 mb
Node 7 pages=1636 that is 6701 kb = 6 mb
Total heap pages
Node 0 pages=0 that is 0 kb = 0 mb
Node 1 pages=26094 that is 106881 kb = 106 mb
Node 2 pages=0 that is 0 kb = 0 mb
Node 3 pages=0 that is 0 kb = 0 mb
Node 4 pages=0 that is 0 kb = 0 mb
Node 5 pages=0 that is 0 kb = 0 mb
Node 6 pages=0 that is 0 kb = 0 mb
Node 7 pages=0 that is 0 kb = 0 mb
Total anon pages
Node 0 pages=0 that is 0 kb = 0 mb
Node 1 pages=4879 that is 19984 kb = 19 mb
Node 2 pages=0 that is 0 kb = 0 mb
Node 3 pages=0 that is 0 kb = 0 mb
Node 4 pages=0 that is 0 kb = 0 mb
Node 5 pages=0 that is 0 kb = 0 mb
Node 6 pages=0 that is 0 kb = 0 mb
Node 7 pages=0 that is 0 kb = 0 mb

HLT that ran on node 4:
Total pages
Node 0 pages=626 that is 2564 kb = 2 mb
Node 1 pages=4 that is 16 kb = 0 mb
Node 2 pages=42 that is 172 kb = 0 mb
Node 3 pages=0 that is 0 kb = 0 mb
Node 4 pages=40417 that is 165548 kb = 165 mb
Node 5 pages=4 that is 16 kb = 0 mb
Node 6 pages=0 that is 0 kb = 0 mb
Node 7 pages=1636 that is 6701 kb = 6 mb
Total file pages
Node 0 pages=626 that is 2564 kb = 2 mb
Node 1 pages=4 that is 16 kb = 0 mb
Node 2 pages=42 that is 172 kb = 0 mb
Node 3 pages=0 that is 0 kb = 0 mb
Node 4 pages=9003 that is 36876 kb = 36 mb
Node 5 pages=4 that is 16 kb = 0 mb
Node 6 pages=0 that is 0 kb = 0 mb
Node 7 pages=1636 that is 6701 kb = 6 mb
Total heap pages
Node 0 pages=0 that is 0 kb = 0 mb
Node 1 pages=0 that is 0 kb = 0 mb
Node 2 pages=0 that is 0 kb = 0 mb
Node 3 pages=0 that is 0 kb = 0 mb
Node 4 pages=26534 that is 108683 kb = 108 mb
Node 5 pages=0 that is 0 kb = 0 mb
Node 6 pages=0 that is 0 kb = 0 mb
Node 7 pages=0 that is 0 kb = 0 mb
Total anon pages
Node 0 pages=0 that is 0 kb = 0 mb
Node 1 pages=0 that is 0 kb = 0 mb
Node 2 pages=0 that is 0 kb = 0 mb
Node 3 pages=0 that is 0 kb = 0 mb
Node 4 pages=4880 that is 19988 kb = 19 mb
Node 5 pages=0 that is 0 kb = 0 mb
Node 6 pages=0 that is 0 kb = 0 mb
Node 7 pages=0 that is 0 kb = 0 mb

Table 30: Numa maps summation: One shared lib to all nodes

2.8 Conclusions Intel

2.8.1 INTEL: Worst case vs Optimal case on 8 virtual cpus

This test only includes the allocation of memory, not the shared libraries. Running on only one node (with memory still available on that node), compared to two nodes, has about an 8% impact ((956 − 883)/883 ≈ 8%, from table 3); note the quite big variance in this test. It can be seen in the numastat output that "Other node" on node 1 in table 4 has been invoked about as much as "Local node" on node 0 in table 5. This is a confirmation that the same amount of work has been done in both test cases. The reason that "Numa miss" has not been invoked is that no NUMA policy has been broken: the functions in subsection 2.4.1 were used, and therefore "Other node" is invoked, since it is defined as "A process ran on this node and got memory from another node". A third case was performed without the use of the numa-functions. Instead, node 0 was filled with unused memory, and to make sure that this specific node is not used by the HLT software the swap space was disabled. The results can be seen in tables 31 and 32.

8 virtual cpus, both nodes available, but no memory left on node 0:  Mean total: 959 +- 96 s.

Table 31: Invoking of numa miss by filling a node

NUMASTAT 8 virtual cpus, both nodes available, but no memory left on node 0
Type            Node 0    Node 1
Numa hit        127844    14166288
Numa miss       0         14155663
Numa foreign    14155663  0
Interleave hit  392       1459
Local node      127305    14158461
Other node      523       14163587


Table 32: NUMASTAT 8 virtual cpus, both nodes available, but no memory left on node 0

It can be seen that the amount of time it took was approximately the same as when the numa-functions were used. The difference in numastat between table 32 and table 4 is that numa foreign on node 0 and numa miss on node 1 have been invoked. The worst case here is really a case that will almost never happen in reality. In many cases it can of course happen that the local node is full and cpus have to allocate memory on foreign nodes, but this does not necessarily mean that the local node will be totally unused. Anyhow, it is interesting to see how big the "never-going-to-happen" case is compared to the optimal case (the optimal case is obtained by letting the kernel run on the default policy and making sure that the nodes' memory is not full).

2.8.2 INTEL: Interleaved: 8 virtual cpus, input from RAW-file on disk

Table 7 confirms that the function numa_set_interleave_mask worked correctly: the interleave-hit count increases compared to table 4. Looking at the time difference between tables 6 and 3, the interleaved policy is even worse than the worst case. The HLT software can therefore be considered NUMA-lucky, in the sense that every process runs independently of all the other processes with its own memory; only when the memory on the local node is full does a process have to put memory on other nodes. The only things that are shared are the libraries.
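For completeness, this is roughly what setting the interleave policy looks like in code, a sketch using the libnuma calls documented in [5]; the allocation is only a stand-in for the real workload's memory:

#include <cstddef>
#include <numa.h>   // numa_set_interleave_mask, numa_all_nodes_ptr (libnuma 2.x)

int main()
{
    if (numa_available() < 0) return 1;

    // From here on, new memory of this process is interleaved page by page
    // over all nodes; this is what shows up as "interleave hit" in numastat.
    numa_set_interleave_mask(numa_all_nodes_ptr);

    const size_t size = 64UL * 1024 * 1024;
    void* buf = numa_alloc(size);   // allocated under the current (interleave) policy
    // ... run the workload ...
    numa_free(buf, size);
}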

2.8.3 INTEL: Worst case vs Optimal case, 6 virtual cpus without HyperThreading, input from buffer manager

This test only includes the allocation of memory, not the shared libraries. The worst case, with all memory allocated on only one node, is (699 − 689)/(689 − 150) ≈ 1.8% slower than the optimal case. Numa miss is invoked in table 9 because the NUMA policy is broken. The difference in this test is only around 2%, compared to 8% in the last test. The big variance in the last test might explain some of the difference, but there are two other facts that could also explain it: first, that 8 virtual cpus were used instead of 6, and second, that reading from a buffer manager changes the results. This is the motivation for the next two tests.

2.8.4 INTEL: Worst case vs Optimal case, 6 virtual cpus without HyperThreading, input from raw-file

This test only includes the allocation of memory, not the shared libraries. It is motivated in the last subsection: mainly, to see how much the buffer manager affects the time compared to reading from a file. This test case is (960 − 938)/(938 − 30) ≈ 2.4% slower when all memory on one node is allocated. This is of the same order as the last test, so the conclusion is that it doesn't matter much whether these simulations get data from the buffer manager or read from a raw-file on disk. The absolute time measurements differ because, when the buffer manager is used, all processes read different events, whereas when input is read from disk every process uses exactly the same events as input.


2.8.5 INTEL: Worst case vs Optimal case, 10 virtual cpus without HyperThreading, input from buffer manager

The following checks how the NUMA effect changes with an increasing number of virtual cpus. This test had to be done on another computer (the so-called "blade-test-05") with 12 physical cores in total (a physical core is the same as a virtual cpu when HyperThreading is off) in order to use 10 physical cores. This test case is (486 − 471)/(471 − 150) ≈ 4.6% slower when all memory is bound to one node. This is higher than the case with 6 virtual cpus, so the conclusion is that the more cpus are used, the bigger the penalty from NUMA effects. Using 10 virtual cpus was slower than using 6, but bear in mind that this is still the "never-going-to-happen" worst case. This means that for these Intel servers with two nodes there is not much to gain even if the "never-going-to-happen" worst case actually happens.

2.8.6 Summary

All tests on the Intel servers had only two nodes. The "never-going-to-happen" worst case was at most 8% slower than the optimal case, and that test had a very big variance and thus a large uncertainty in the measurement. The second highest difference, with 10 virtual cpus, was around 4.6% between the worst case and the optimal case. This leads to the final conclusion that there is not much one can do to optimize the Intel servers from a NUMA point of view.

2.9 Conclusions AMD

2.9.1 AMD: Worst case vs Optimal case, 8 virtual cpus, input from buffer

This test case is (1024 − 780)/(1024 − 180) ≈ 28.9% slower when all memory is bound to two nodes compared to when all nodes are available. This is a very big difference from the two-node Intel servers. The conclusion of this test is that as the number of nodes increases, the interconnect has to work much harder shipping data to cpus in different locations. Table 18 shows one interesting detail: node 6 and node 7 have memory available, so these nodes have the obligation to send memory to the other cpus, but as can be seen in the row "Other node", only node 6 distributes memory to the other cpus while node 7 serves only its own cpus. This detail will be investigated further.

2.9.2 AMD: 5 nodes with memory, 8 virtual cpus, input from buffer

This test was motivated by the previous one, namely by the non-optimal sharing when more than one node is available: node 7 only serves itself while node 6 distributes memory to the other virtual cpus. There are three subtests; in every subtest five nodes have memory available and hence have to give memory to the other three virtual cpus. Looking first at numastat in table 21, it can be seen that node 2 serves the virtual cpus on node 1, and node 5 serves the virtual cpus on both node 3 and node 4. Compare these results with table 22, where node 4 in principle alone has to serve the virtual cpus on nodes 1, 2 and 3. Clearly the first case is closer to optimal than the second, and this can be seen from the time measurement. The third test, in table 23, has the same non-optimal structure as the second, except that node 3 is now the serving node. The optimization that can be done here is to make sure that the serving load is distributed over all nodes with memory available.

2.9.3 Summary AMD

The first thing to note about this server is that it is much more complicated than the Intel server: there are eight nodes and in total 48 virtual cpus. The worst case vs the optimal case gave a big difference of almost 30%, which is enough to make optimizing for NUMA effects worth considering. One non-optimal placement effect was found in subsection 2.9.2: a single node serves all virtual cpus that have no memory available on their local node, when in fact different nodes could serve different cpus. The gain of two serving nodes over one, seen in table 20, is approximately (807 − 771)/807 ≈ 4.4%. This number is of course biased compared to a realistic case, since the other nodes were completely unused. In any case this type of non-optimal behaviour should be reported to the kernel programmers and AMD. When memory is available, subsection 2.7.8 shows that the kernel behaves optimally in the sense that almost no allocation is done on other nodes, and the local node is fully exploited.

2.10 Shared libraries

To clarify how the results can be interpreted, look first at table 25. It says that the file pages amount to 125 mb on node 0 and 140 mb on node 1; that is, each node holds its own copy of all the libraries needed by its processes. Compare this with table 26, where the file pages amount to 83 mb on node 0 and 59 mb on node 1, meaning that 142 mb in total is shared between the two nodes. To understand why 2 mb sits on node 0 even when every node has its own copy of the shared libraries (tables 29 and 26), one has to know how this test was done: Moore and all its libraries were copied onto the local server as many times as there were nodes, and before the run began each copy was bound to one node. This hack was used because no command was found to perform the binding directly. The 2 mb are the libraries loaded when the computer starts up, for example libc, which were therefore not replicated in these tests. In general it is easy to see that the test was done correctly by looking at tables 26, 27, 29 and 30. One thing to note is that this test suffers from the computer scientist's uncertainty principle, so the numa maps data was not taken on the same occasion as the time measurement. The result for Intel was (847 − 846)/846 ≈ 0.12%, a very small effect. This coincides with the other Intel tests: the kernel works well on the two-node Intel servers. The result for AMD was (2007 − 1993)/1993 ≈ 0.7%, some multiples higher than the Intel result, which coincides with the other comparisons between the Intel and AMD servers.
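The per-node sums in the tables above come from adding up the N<node>=<pages> fields of /proc/<pid>/numa_maps (subsection 2.5.2). A minimal sketch of such a summation is shown below; it sums total pages only, and the 4 kb page size is an assumption.

#include <cctype>
#include <fstream>
#include <iostream>
#include <map>
#include <string>

// Sum the "N<node>=<pages>" fields of /proc/<pid>/numa_maps to get
// the total pages per node, as in the "Total pages" rows above.
int main(int argc, char** argv)
{
    if (argc < 2) return 1;
    std::ifstream maps("/proc/" + std::string(argv[1]) + "/numa_maps");
    std::map<int, long> pagesPerNode;

    std::string field;
    while (maps >> field) {
        if (field.size() > 2 && field[0] == 'N' &&
            std::isdigit(static_cast<unsigned char>(field[1]))) {
            const std::size_t eq = field.find('=');
            if (eq == std::string::npos) continue;
            pagesPerNode[std::stoi(field.substr(1, eq - 1))]
                += std::stol(field.substr(eq + 1));
        }
    }
    for (const auto& p : pagesPerNode)
        std::cout << "Node " << p.first << " pages=" << p.second
                  << " that is " << p.second * 4 << " kb\n";   // assumes 4 kb pages
}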

3 Profiling, correctness and evaluating the speed of the HLT

3.1 Introduction

Profiling is vital for making a program run faster: finding bottlenecks and optimizing them is the most effective way to gain speed. This section describes three important steps that are all compulsory when optimizing code.

3.2 Three compulsory steps

3.2.1 Step 1: Profiling and optimizing

The first step is to profile the software. Profiling the HLT, for example, can be done with the program I wrote. This program uses GoogleProfiler together with Kcachegrind; both are great software tools that make life very easy for a programmer. To see the help-file of the program, type:

python run_tests.py -h

A long list of options will appear. The most important example is listed here:

python run_tests.py -n v9r3 -f 100 -e 1000 -v -i /scratch/yourself/74465_0x001F_MB_bit11.raw -p 1 -g x86_64-slc5-gcc43-dbg

Now the HLT software will be profiled with the compile option "-O0 -g" enabled. After the run there will be a line in the console saying "Kcachegrind file at location: /tmp/profile-$USER-YYYY_MM_DD_HH_II/kcachegrind_data".


This file can be opened with the command:

kcachegrind /tmp/profile-$USER-YYYY_MM_DD_HH_II/kcachegrind_data

Once the program starts, it's time to find the bottleneck. All options are described in more detail in the help-file or at https://lhcbonline.cern.ch/bin/view/Online/Profiling, and for help with kcachegrind go to https://lhcbonline.cern.ch/bin/view/Online/Kcachegrind. It is recommended to recompile the whole HLT software with at least "-O2 -g" in order to see if the bottleneck is the same as with "-O0 -g". For tips on how to optimize HLT code, jump to subsection 4.1.

3.2.2 Step 2: Correctness

This step applies only to optimizations that do not change the output of the current software. Unfortunately the program run_tests.py cannot fulfil this step with 100% satisfaction, because there is no easy way to obtain the selected events of the HLT. The simplest approach is to run the program twice:

python run_tests.py -n v9r3 -f 100 -e 1000 -v -i /scratch/yourself/74465_0x001F_MB_bit11.raw -g x86_64-slc5-gcc43-opt -z

python run_tests.py -n v9r3 -f 100 -e 1000 -v -i /scratch/yourself/74465_0x001F_MB_bit11.raw -g x86_64-slc5-gcc43-opt -z -u /tmp

The first line uses your own cmtuser folder with the optimization you have created. The second line changes your cmtuser to the dummy folder /tmp, so the original code is used. After the two programs have finished, compare the two Moore.Core0.Iter0.out files against each other:

less /tmp/profile-$USER-YYYY_MM_DD_HH_II/output/Moore.Core0.Iter0.out
less /tmp/profile-$USER-YYYY'_MM'_DD'_HH'_II'/output/Moore.Core0.Iter0.out

Press the End key and compare the number of selected events, which is the second last column. As an example, one of the outputs might look like this (the first and the second last column have been extracted from the list):

Hlt2Global 3179
Hlt2GlobalPreScaler 3179
Hlt2GlobalHltFilter 3179
Hlt2GlobalPostScaler 2808
HltEndSequence 30000
HltRoutingBitsWriter 30000
HltGlobalMonitor 30000
HltL0GlobalMonitor 30000
HltDecReportsWriter 3179
HltSelReportsMaker 3179
HltSelReportsWriter 3179
HltVertexReportsMaker 3179
HltVertexReportsWriter 3179
HltLumiWriter 3179
LumiStripper 3179
LumiStripperFilter 3179
LumiStripperPrescaler 2075
bankKiller 2075

Obviously this is a quantitative rather than a qualitative comparison. The second way is to look into the out-raw-files that have been created with the option "-z". The file is located at /tmp/profile-$USER-YYYY_MM_DD_HH_II/outputfile.raw. This is a little bit tricky, since one has to know what the raw-file format of an event looks like. So for correctness the program run_tests.py is not completely satisfying.

3.2.3 Step 3: Evaluating the speed

The HLT software has a natural variance in how long a run takes. This variance of course increases if the computer you are testing your improvements on has other processes running at the same time, consuming a lot of cpu power.

The easiest way to check whether a speed improvement helped is to run the following commands:

python run_tests.py -n v9r3 -f 100 -e 5000 -v -i /scratch/yourself/74465_0x001F_MB_bit11.raw -g x86_64-slc5-gcc43-opt --iterations 5

python run_tests.py -n v9r3 -f 100 -e 5000 -v -i /scratch/yourself/74465_0x001F_MB_bit11.raw -g x86_64-slc5-gcc43-opt --iterations 5 -u /tmp

Here exactly the same program is run five times with 5000 events, and the variance is printed at the end. That way the variance is under control and won't confuse the programmer.
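The mean and spread printed at the end amount to the following computation over the per-iteration wall times. This is a sketch, not the actual code of run_tests.py, and the times are made-up values:

#include <cmath>
#include <cstdio>
#include <vector>

int main()
{
    // Wall times of the five iterations, in seconds (made-up values).
    const std::vector<double> t = { 961, 948, 972, 955, 959 };

    double mean = 0;
    for (double x : t) mean += x;
    mean /= t.size();

    double var = 0;                             // sample variance
    for (double x : t) var += (x - mean) * (x - mean);
    var /= t.size() - 1;

    std::printf("Mean total: %.0f +- %.0f s.\n", mean, std::sqrt(var));
}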

4 Optimizing the code of HLT

4.1 Introduction

This section is about actually optimizing code. The main discussion will be about my optimization of the HLT; the procedure followed the three steps of the section above. The bottleneck that was found and optimized is located inside the library PatFwdTool. Two tutorials on how to use the profiler have been created on the twiki at https://lhcbonline.cern.ch/bin/view/Online/Profiling and https://lhcbonline.cern.ch/bin/view/Online/Kcachegrind.

4.2 Finding bottlenecks

There is no point in optimizing a piece of code for speed if the code is almost never used. One should always optimize the piece of code that accounts for the largest share of the total running time. This is where the profiler comes into play: by profiling, for example, the HLT with both "-O0 -g" and "-O2 -g", it is very easy to find the bottlenecks.

4.3 Optimizing in general

There are some important things to think about when optimizing. They will be described briefly.

4.3.1 Choose data structure

The data structure is a very important choice that, if made correctly, can make the program both run faster and use less memory. There are a number of different data structures to choose from, implemented both by Boost and the standard library. As an example of why the choice of data structure is important, see my choice of data structure in subsection 4.4.2.

4.3.2 Choose algorithm

The choice of algorithm is extremely important; optimizing the code of a badly chosen algorithm is not a good option. As an example, look at the variety of sorting algorithms out there: even if the programmer writes an O(n^2) sorting algorithm entirely in assembly, an O(n log n) algorithm will win in the long run, even if it is written in, say, Python.
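As a concrete (non-HLT) illustration, both functions below sort a vector, but no amount of micro-optimization will save the O(n^2) version on large inputs:

#include <algorithm>
#include <cstddef>
#include <vector>

// O(n^2): loses on large inputs no matter how well it is micro-optimized.
void slowSort(std::vector<double>& v)
{
    for (std::size_t i = 0; i < v.size(); ++i)
        for (std::size_t j = i + 1; j < v.size(); ++j)
            if (v[j] < v[i]) std::swap(v[i], v[j]);
}

// O(n log n): the standard library's sort wins in the long run.
void fastSort(std::vector<double>& v)
{
    std::sort(v.begin(), v.end());
}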


4.3.3 Speed as a trade-off for size

The trade-off between speed and size is very common; my optimization of PatFwdTool is an example of exactly this. Temporary vectors have to be allocated somewhere, and this takes space in RAM as well as in the caches. The trade-off must be weighed against the gain in speed. In the case of subsection 4.4.2, five vectors of pointers, iterated in the inner loop, need to be allocated. Fortunately a vector of pointers doesn't take up much space, and the gain in speed is in some cases as much as 5-10% (the gain depends on the data the HLT software is running on, which is also connected to the size of the temporary vectors), so the conclusion is that it is worth it in this case.

4.4 Optimizing PatFwdTool

This section is about my optimizations of the HLT. I used the profiler described in subsection 3.2.1, and in the following I explain how I optimized the code.

4.4.1 Profiling

In order to find the biggest bottleneck of the HLT, good data samples were needed, good in the sense that they should characterize the input the real HLT gets from the experiment. I got two different types of data:

• Samples with micro-bias triggered events.
• Samples with micro-bias triggered events with an L0-Physics "yes", according to L0 TCK 1F.

After a few runs over these kinds of files the biggest bottleneck was found. Figure 5 shows the general picture with the compilation flags "-O2 -g". The biggest cost is clearly a function called fitXProjection, and inside it another function, distanceForFit, takes the largest share. Here it is clear that profiling with "-O2 -g" has its limits: at this point it is not possible to step into distanceForFit and see which line numbers take the most time, probably because the compiler inlines the function. This is an example of why, as described in 3.2.1, profiling with "-O0 -g" is a good complement. Note that the for-loop shown is the innermost loop, i.e. a loop inside a loop inside a loop; this is why a single if-statement at line 527 can take 1.07% of the total time of the program.

Figure 5 Profiling picture with GoogleProfiler, visualized with kcachegrind and compiled with ”-O2 -g”.


4.4.2 New data structure: Pre-sorted lists

There are two if-statements at lines 526 and 527, and inside the function distanceForFit there are a couple more if-statements that have to be checked over and over again. Some of these conditions never change for the whole running time and some do change (the conditions are unique for every hit), but the changes occur not in the inner loop but in an outer loop. The following pseudo-code explains the situation:

Outerloop 1
  Outerloop 2
    Innerloop over hits
      if ( !hit.isSelected() ) continue;
      if ( cond_never_changed1 ) continue;
      do_stuff1
      if ( cond_never_changed2 )
        if ( cond_never_changed3 )
          do_stuff2
        elseif ( cond_never_changed4 )
          do_stuff3
        else
          do_stuff4
      else
        do_stuff5
    End of Innerloop over hits
  End of Outerloop 2
  Pick one element with respect to the logic of do_stuff above
  and change the state hit.isSelected() of one hit
End of Outerloop 1

Note that cond_never_changed# is unique for every hit. To my knowledge, all these if-statements create two big problems for the computer, namely:

• Both cache memory and computing power are wasted on fetching and checking !hit.isSelected() and cond_never_changed1: if one of these statements is true, nothing is computed and the computer jumps to the next hit in the for-loop.

• There is no way to predict the next instruction. When the branches cannot be predicted, stalls are created frequently, which is very bad for performance.

So the optimization idea is the following: before all these loops start, create temporary lists, where every temporary list is the set of elements that goes through the loops hitting exactly the same if-statements. After every iteration of Outerloop 1, remove the element whose isSelected() state changes. The temporary lists must support fast insertion, fast deletion and fast iteration over all elements. The pseudo-code has now changed to:

Outerloop over hits
  if ( !hit.isSelected() ) continue;
  if ( cond_never_changed1 ) continue;
  if ( cond_never_changed2 )
    if ( cond_never_changed3 )
      save hit into list_1
    elseif ( cond_never_changed4 )
      save hit into list_2
    else
      save hit into list_3
  else
    save hit into list_4
End of Outerloop over hits

Outerloop 1
  Outerloop 2
    Innerloop over hits in list_1
      do_stuff1
      do_stuff2
    End of Innerloop over hits in list_1
    Innerloop over hits in list_2
      do_stuff1
      do_stuff3
    End of Innerloop over hits in list_2
    Innerloop over hits in list_3
      do_stuff1
      do_stuff4
    End of Innerloop over hits in list_3
    Innerloop over hits in list_4
      do_stuff1
      do_stuff5
    End of Innerloop over hits in list_4
  End of Outerloop 2
  Delete the element whose hit.isSelected() state changes
End of Outerloop 1
Delete lists
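In C++ the list-building pass could look like the following minimal sketch. The Hit interface and the condition members are hypothetical stand-ins for the real PatFwdTool types:

#include <vector>

struct Hit {
    bool selected;
    bool cond1, cond2, cond3, cond4;   // per-hit conditions, fixed during the loops
    bool isSelected() const { return selected; }
};

// Classify the hits once, so that the hot loops can afterwards run
// branch-free over each temporary list.
void buildLists(std::vector<Hit>& hits,
                std::vector<Hit*>& list1, std::vector<Hit*>& list2,
                std::vector<Hit*>& list3, std::vector<Hit*>& list4)
{
    for (Hit& hit : hits) {
        if (!hit.isSelected()) continue;
        if (hit.cond1) continue;       // this hit is never processed at all
        if (hit.cond2) {
            if      (hit.cond3) list1.push_back(&hit);
            else if (hit.cond4) list2.push_back(&hit);
            else                list3.push_back(&hit);
        } else {
            list4.push_back(&hit);
        }
    }
}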

What can be seen from the new pseudo-code is that the computer has no conditions for skipping elements or branching into different code, except when an inner loop ends. Therefore this code runs much faster and helps the computer make correct predictions, keeping the number of stalls low. The data structure for the temporary lists has to be fast at inserting elements, deleting elements and iterating over the elements. Note also that in this case the order of the elements doesn't matter, and the size of a list never increases. A very fast structure that can be used is an array: it needs only one malloc, so insertion is fast, and deletion is also fast thanks to a smart tactic that exploits the fact that the order doesn't matter. The deletion operation is explained in figure 6.

Figure 6 Fast deletion when the order of the elements doesn't matter.

When an element number "x" is going to be deleted, no free needs to be used: just copy the last element over element number "x" and decrease the size from "n" to "n-1". This way fast insertion, deletion and iteration are obtained.
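The deletion of figure 6, written out over a std::vector (a sketch; any array-like container works):

#include <cstddef>
#include <vector>

// Delete element i from an unordered array in O(1): overwrite it with the
// last element and shrink by one. No free() and no shifting of elements is
// needed, because the order of the hits does not matter here.
template <typename T>
void unorderedErase(std::vector<T>& v, std::size_t i)
{
    v[i] = v.back();
    v.pop_back();
}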

4.4.3 Math

In the innermost loop it is important to avoid unnecessary instructions, so the math should be done in the smartest way. One example, which I used to optimize distanceForFit, is shown here. The constraint is track.cosAfter() > 0.
Original version

dist = dist * track.cosAfter();
if ( fabs( dist - dx ) < fabs( dist + dx ) )
  dist = dist - dx;
else
  dist = dist + dx;
dist = dist / track.cosAfter();

New version

if ( dist > 0. )
  return dist - fabs(dx) / track.cosAfter();
else
  dist = dist + fabs(dx) / track.cosAfter();

In the example above it can be seen that there is one fabs call less and one multiplication less. The rewrite is valid because track.cosAfter() > 0: the original condition fabs(dist − dx) < fabs(dist + dx), with dist already scaled by track.cosAfter(), holds exactly when dist and dx have the same sign, so both versions move dist toward zero by fabs(dx)/track.cosAfter().

4.4.4 Avoiding divisions

Floating-point and integer division take much longer to perform than multiplication, addition and subtraction. Floating-point division takes around 20-45 clock cycles and integer division around 40-80 clock cycles for 32-bit integers. Compare this with, for example, integer multiplication (multiplication is slower than addition and subtraction), which takes around 3-10 clock cycles [4]. In other words, divisions should be avoided inside inner loops. There is one good trick that can be used extensively in the HLT. The trick requires two conditions:

• The value in the denominator changes fewer times than it is used in divisions.

• There is no need for extreme precision.

If these two conditions are true, then save the reciprocal as an attribute in the class and multiply by the reciprocal instead of dividing. Regarding the second point about extreme precision: there exist fast division algorithms that use the reciprocal itself, a method called Newton-Raphson division [14], so there should be no loss in precision from first calculating the reciprocal with this trick. Whether this method is used in modern computers is outside the scope of my knowledge. Returning to the example from subsection 4.4.3, one more division can actually be replaced by a multiplication in this case: the denominator track.cosAfter() changes far fewer times than the division is performed. So a new, faster solution would be:
Newer version

if ( dist > 0. )
  return dist - fabs(dx) * track.invcosAfter();
else
  dist = dist + fabs(dx) * track.invcosAfter();

Here track.invcosAfter() is the saved reciprocal of track.cosAfter(). This does make a big difference in terms of performance. This optimization was actually not done by me, due to lack of time, but a brave programmer who wants to make the HLT even faster is more than welcome to implement it.
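A sketch of how the reciprocal could be cached; Track and the member names are illustrative, not the actual HLT classes:

class Track {
public:
    void setCosAfter(double c) {
        m_cosAfter    = c;
        m_invCosAfter = 1.0 / c;   // the one division, paid only when the value changes
    }
    double cosAfter()    const { return m_cosAfter; }
    double invCosAfter() const { return m_invCosAfter; }
private:
    double m_cosAfter;
    double m_invCosAfter;
};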


4.5 Conclusions

In this section, tips and tricks for making code faster have been explained, illustrated by an optimization of the HLT software done by the author. Hopefully some of these tips and tricks can be used in other programs, leading to a more efficient and faster HLT at LHCb.

5 References

[1] Andi Kleen, An NUMA API for Linux, SUSE Labs, 2004. http://www.halobates.de/numaapi3.pdf
[2] Ulrich Drepper, What Every Programmer Should Know About Memory, Red Hat, Inc., 2007. http://www.unilim.fr/sci/wiki/_media/cali/cpumemory.pdf
[3] Ulrich Drepper, What Every Programmer Should Know About Memory, pages 43-46, Red Hat, Inc., 2007. http://www.unilim.fr/sci/wiki/_media/cali/cpumemory.pdf
[4] Agner Fog, Optimizing software in C++, pages 140-141, Copenhagen University College of Engineering, 2010. http://www.agner.org/optimize/optimizing_cpp.pdf
[5] Andi Kleen, numa(3) - Linux man page, SuSE Labs, 2004. http://linux.die.net/man/3/numa
[6] Tracy Carver, Magny-Cours and Direct Connect Architecture 2.0, 2010. http://developer.amd.com/documentation/articles/pages/Magny-Cours-Direct-Connect-Architecture-2.0.aspx
[7] Brandon Hutchinson, Understanding /proc/cpuinfo, 2007. http://www.brandonhutchinson.com/Understanding_proc_cpuinfo.html
[8] kernel.org, NUMA policy hit/miss statistics. http://www.kernel.org/doc/Documentation/numastat.txt
[9] kernel.org, sysfs devices/system/cpu, 2010. http://www.kernel.org/doc/Documentation/ABI/testing/sysfs-devices-system-cpu
[10] linux.die.net, numa_maps(5) - Linux man page. http://linux.die.net/man/5/numa_maps
[11] linux.die.net, sched_setaffinity(2) - Linux man page. http://linux.die.net/man/2/sched_setaffinity
[12] David Ott, Optimizing software applications for NUMA, 2009. http://software.intel.com/en-us/articles/optimizing-software-applications-for-numa/
[13] Wikipedia, CPU cache, 2010. http://en.wikipedia.org/wiki/CPU_cache
[14] Wikipedia, Newton-Raphson division, 2010. http://en.wikipedia.org/wiki/Division_(digital)
