Ibm Power8 Processors Analýza Výkonnosti Procesorů Ibm Power8
Total Page:16
File Type:pdf, Size:1020Kb
BRNO UNIVERSITY OF TECHNOLOGY VYSOKÉ UČENÍ TECHNICKÉ V BRNĚ FACULTY OF INFORMATION TECHNOLOGY DEPARTMENT OF COMPUTER SYSTEMS FAKULTA INFORMAČNÍCH TECHNOLOGIÍ ÚSTAV POČÍTAČOVÝCH SYSTÉMŮ PERFORMANCE ANALYSIS OF IBM POWER8 PROCESSORS ANALÝZA VÝKONNOSTI PROCESORŮ IBM POWER8 MASTER’S THESIS DIPLOMOVÁ PRÁCE AUTHOR Bc. JAKUB JELEN AUTOR PRÁCE SUPERVISOR Ing. JIŘÍ JAROŠ, Ph.D. VEDOUCÍ PRÁCE BRNO 2016 Abstract This paper describes the IBM Power8 system in comparison to the Intel Xeon processors, widely used in today’s solutions. The performance is not evaluated only on the whole system level but also on the level of threads, cores and a memory. Different metrics are demonstrated on typical optimized algorithms. The benchmarked Power8 processor pro- vides extremely fast memory providing sustainable bandwidth up to 145 GB/s between main memory and processor, which Intel is unable to compete. Computation power is comparable (Matrix multiplication) or worse (N-body simulation, division, more complex algorithms) in comparison with current Intel Haswell-EP. The IBM Power8 is able to com- pete Intel processors these days and it will be interesting to observe the future generation of Power9 and its performance in comparison to current and future Intel processors. Abstrakt Práce se zabývá systémem IBM Power8 v porovnání s dnes běžně používanými řešeními s procesory Intel Xeon. Výkonnost je vyhodnocována nejen na úrovni celého systému, ale také na úrovni jednotlivých vláken a jader a paměti. Různé metriky jsou demonstrovány na typických optimalizovaných algoritmech. Testovaný stroj Power8 disponuje extrémně rychlou pamětí poskytující rychlost až 145 GB/s mezi pamětí a procesorem, které se dnešní procesory Intel nevyrovnají. Výpočetní síla je pouze srovnatelná (Násobení matic) nebo slabší (N-body simulace, dělení, složitější algoritmy) v porovnání s aktuálním Intel Haswell- EP. Procesor IBM Power8 je dnes schopný konkurovat procesorům Intel a bude zajímavé sledovat následující generaci Power9 a jeho výkonnost v porovnání s aktuálními a budoucími procesory Intel. Keywords Power 8, IBM, Intel, Haswell, Haswell-EP, Performance Analysis, STREAM, Steady state, N-body Klíčová slova Power 8, IBM, Intel, Haswell, Haswell-EP, Analýza výkonnosti, STREAM, Steady state, N-body Reference JELEN, Jakub. Performance analysis of IBM Power8 processors. Brno, 2016. Master’s thesis. Brno University of Technology, Faculty of Information Technology. Supervisor Jaroš Jiří. Performance analysis of IBM Power8 processors Declaration Hereby I declare that this master’s thesis was prepared as an original author’s work under the supervision of Mr. Jiří Jaroš. All the relevant information sources, which were used during preparation of this thesis, are properly cited and included in the list of references. ....................... Jakub Jelen May 23, 2016 Acknowledgements I would like to thank my thesis supervisor Jiří Jaroš for the useful comments, remarks and engagement through the learning process of this master thesis. Furthermore I would like to thank Red Hat and IBM Linux Technology Center for providing access to the Open Power Hub with latest Power8 servers. This work was supported by The Ministry of Education, Youth and Sports from the Large Infrastructures for Research, Experimental Development and Innovations project "IT4Innovations National Supercomputing Center – LM2015070". c Jakub Jelen, 2016. This thesis was created as a school work at the Brno University of Technology, Faculty of Information Technology. The thesis is protected by copyright law and its use without author’s explicit consent is illegal, except for cases defined by law. Contents 1 Motivation4 2 Architecture overview5 2.1 IBM Power8 architecture............................5 2.2 Intel Haswell-EP architecture.......................... 12 2.3 Architecture comparison............................. 14 3 Development tools overview 16 3.1 GNU Compiler Collection (GCC)........................ 16 3.2 Advance Toolchain for PowerLinux 8.0 and 9.0................ 16 3.3 IBM XL C/C++ for AIX............................ 17 3.4 Intel C++ Compiler............................... 17 4 Performed tests 18 4.1 Optimization techniques and performance monitoring............ 18 4.2 Micro-tests.................................... 22 4.3 Test-suites..................................... 27 4.4 Power Consumption............................... 33 5 Real world applications 35 5.1 N-body simulation................................ 35 5.2 Steady state heat distribution.......................... 38 5.3 Architectures comparison............................ 40 6 Conclusion 43 Bibliography 44 Appendices 45 List of Appendices................................... 46 A CD Content 47 1 List of Figures 2.1 IBM Power8 processor die with Centaur chips.................6 2.2 IBM Power8 2-hop SMP interconnect.....................9 2.3 The front view of the IBM Power S812L server................ 12 2.4 Intel Haswell-EP processor block diagram................... 13 4.1 Example optimization report from GCC.................... 19 4.2 Matrix and vector multiplication: Single thread performance comparison.. 26 4.3 Copy memory benchmark: Caches usage.................... 27 4.4 Copy memory benchmark: Bandwidth..................... 28 4.5 STREAM benchmark: IBM Power8...................... 30 4.6 STREAM benchmark: Intel Haswell-EP and different compilers....... 31 4.7 Comparison of power consumption....................... 34 5.1 N-body: IBM Power8 unroll factor effect.................... 37 5.2 N-body performance for small problem..................... 38 5.3 Steady state heat distribution: IBM Power8 scaling performance...... 40 5.4 Steady state heat distribution: Scaling performance on Intel processors... 41 2 List of Tables 2.1 List of available PAPI events on Power8 architecture............. 10 2.2 List of derived hardware counters........................ 11 2.3 Comparison of all evaluated processor architectures.............. 15 4.1 Matrix and vector multiplication: Various effects on performance...... 25 4.2 Native Linux benchmark: perf bench...................... 33 5.1 Architecture comparison............................. 42 3 Chapter 1 Motivation At the time of writing this document, there is no supercomputer in the list (from November 2015) of TOP 500 1 based on IBM Power8 architecture. There are only some based on previous generation IBM Power7, but in the latest model IBM did a good job with improving specifications in per-thread, per-core and per-socket performance, based onthe published numbers. Although Power8 does not have ambitions to became one of the top supercomputers, the overall performance is increasing and it will be an interesting base for the next generation of Power9 processors. Power8 instead aims for cloud based workloads, as well as big data and business analytics [9]. The improvements applicable in these areas are also important back in the scientific and technical computing and supercomputers. In this technical report, I would like to describe what is improved on this new archi- tecture and how it performs in comparison with a recent Intel Haswell-EP architecture, which provides similar improvements, but is better known and understood, since it is foun- dation for many supercomputers in the above mentioned list, but also for general purpose computers on our tables. I will not aim for the results of mass distributed parallel algorithms on thousands cores, but instead I will focus on the level of one processor core, one socket, instructions vectoriza- tion, simultaneous multithreading and memory management. These parameters measured on the low level also affect results of supercomputers in large scale. After gaining the theoretical knowledge about the architecture, I will describe advanced tools for development on both architectures. Later on, I will prepare a set of micro-tests to evaluate this architecture under extreme cases for memory access, raw computational power or combinations in comparison with an adequate Intel architecture. Based on these results, I will implement and optimize more complex simulation algo- rithms, such as N-body or Steady state heat distribution and compare their performance on Power8 and Intel processors with appropriate explanations and recommend workload for this architecture. 1Source http://www.top500.org/list/2015/11/ 4 Chapter 2 Architecture overview To measure the performance of specific architecture, there are different aspects to follow. First of them is certainly clock frequency, but with this one we have reached a limit of few Gigahertz, which exceeding does not provide significant gain and the energy cost is getting unbearable. A direct response to the frequency limit is placing more processor cores on one chip. This allows to run more or less separate programs on different cores, but it gets quite expensive because all the logic has to be duplicated and synchronization mechanisms on cores and memory introduced. Also, one might needs to speed up just a part of application, which brings a significant overhead of synchronization between cores. This was solved by the introduction of vector instructions working not on one value but on the vector of values. Another solution for processor is to execute more instructions from different threads at same time on the single core, so called Simultaneous multi-threading (SMT). All these features are available in nowadays processors and can make a huge difference both for everyday use or special applications. Some of them are implicit so they don’t