Message Passing and Bulk Transport on Heterogeneous Multiprocessors
Master's Thesis Nr. 118
Systems Group, Department of Computer Science, ETH Zurich

by Reto Achermann
Supervised by Prof. Timothy Roscoe and Dr. Kornilios Kourtis
October 17, 2014

Abstract

The decade of symmetric multiprocessors is about to end. Due to physical limitations, individual cores cannot be made any faster, and simply adding more cores no longer works (dark silicon). Operating systems face the next big paradigm shift: systems get faster by exploiting specialized hardware, leading to the era of asymmetric processors and heterogeneous systems. An ever-growing number and diversity of cores yields new challenges in scheduling and resource management, and multiple physical address spaces within a single machine make it appear as a cluster-like system.

Today's commodity operating systems are not well designed to deal with heterogeneity and asymmetric processors. Many attempts have been made to tackle the challenges of heterogeneous hardware, but most treat available co-processors as devices or independent execution environments rather than as equivalent processors. True support for multi-architecture systems is therefore rarely seen, even though such hardware already exists; examples include the OMAP44xx SoC and Intel's Xeon Phi co-processor.

In this thesis we elaborate on the challenges introduced by heterogeneous system architectures. We ported the Barrelfish research operating system to the Xeon Phi execution environment and explored the impact on performance and system design imposed by emerging hardware characteristics such as different instruction set architectures, multiple physical address spaces, and asymmetric cores. Based on our proposed software framework, we demonstrate parallel, heterogeneous execution of an OpenMP program using our implementation of a message-passing-based OpenMP library, and show why more threads do not necessarily imply better performance.

Acknowledgments

I want to thank Prof. Timothy Roscoe for giving me the opportunity to work with the new Xeon Phi hardware and for his inspirational thoughts and critical questions during our discussions; Dr. Kornilios Kourtis for his feedback on the benchmarks; and Stefan Kaestle for his ideas about possible applications. Further, I would like to thank Raphael Fuchs for the inspirational discussions about Barrelfish.

Contents

1 Introduction and Motivation
2 Related Work
  2.1 Intel Manycore Platform Software Stack
  2.2 Heterogeneous Systems
    2.2.1 Heterogeneous Networked Environments
    2.2.2 The CPU plus Accelerator Model
    2.2.3 The System-on-a-Chip Model
    2.2.4 The Heterogeneous Model
  2.3 Barrelfish
  2.4 Concluding Remarks
3 Barrelfish in a Nutshell
  3.1 Architectural Overview
    3.1.1 CPU Driver
  3.2 Capabilities
  3.3 Major Domains
    3.3.1 Kernel Monitor
    3.3.2 System Knowledge Base (SKB)
    3.3.3 Memory Service
    3.3.4 Device Drivers
  3.4 Message Passing
    3.4.1 Exporting a Service
    3.4.2 Binding to a Service
  3.5 Hardware Register Access
  3.6 Hake: Barrelfish's Build System
    3.6.1 Hakefiles
4 The Intel Xeon Phi Co-Processor
  4.1 System Overview
  4.2 Xeon Phi Co-Processor
    4.2.1 Hardware Specification
    4.2.2 In Comparison with the Xeon E5 v2
    4.2.3 Comparison to General Purpose GPUs
  4.3 Physical Address Space Layout
    4.3.1 Host Side View
    4.3.2 Xeon Phi Side View
    4.3.3 Multiple Physical Address Spaces
    4.3.4 Accessing Resources of another Processor
  4.4 The K1OM Architecture
  4.5 Booting the Xeon Phi
    4.5.1 Operating Systems
    4.5.2 Bootstrap
    4.5.3 Boot Modes
5 Memory Access over PCI Express
  5.1 Memory Hierarchy
    5.1.1 Memory Access with Co-Processors
  5.2 Evaluating Memory Access Performance
  5.3 Memory Access Latency
    5.3.1 Benchmark Description
    5.3.2 Benchmark Results
    5.3.3 Implications
  5.4 Memory Throughput
    5.4.1 Benchmark Description
    5.4.2 Benchmark Baseline: Local Transfers
    5.4.3 Host to Xeon Phi Transfers
    5.4.4 Xeon Phi to Host Transfers
    5.4.5 Between Xeon Phi
    5.4.6 Implications
  5.5 Message Passing over Shared Memory
    5.5.1 Benchmark Description
    5.5.2 Benchmark Baseline: Local Benchmark
    5.5.3 Message Passing over PCI Express
    5.5.4 Implications
  5.6 A Word on CPU Performance
6 Barrelfish on the Xeon Phi
  6.1 System Overview
    6.1.1 Xeon Phi Driver
    6.1.2 Xeon Phi Manager
    6.1.3 Service and Domain Identification
  6.2 Differences and Similarities to the MPSS
  6.3 Modifications to Barrelfish
    6.3.1 Boot-Loader
    6.3.2 CPU Driver and Monitor
    6.3.3 Kaluga and SKB
  6.4 Creating the Xeon Phi Bootimage
    6.4.1 Additional Hake Targets
    6.4.2 Boot-Image Anatomy
    6.4.3 Building Process
  6.5 The Boot Process Explained
    6.5.1 Boot Dependencies
    6.5.2 Host Side Driver Startup
    6.5.3 Chain of Boot Events
    6.5.4 Weever: The Xeon Phi Boot-Loader
  6.6 Capabilities Revisited
    6.6.1 The Problem of Multiple Address Spaces
    6.6.2 Capabilities: Accessing Remote Resources
    6.6.3 Capability Translation: An Implementation
  6.7 Flounder over PCI Express
    6.7.1 Flounder Bootstrap Steps
    6.7.2 Message Stub Interface Extensions
    6.7.3 Usage
    6.7.4 Limitations
  6.8 Xeon Phi Service Interfaces
    6.8.1 Xeon Phi Driver Service
    6.8.2 Name Services
  6.9 DMA Service
    6.9.1 DMA Library
    6.9.2 DMA Manager
    6.9.3 DMA Subsystem Programming Model
  6.10 Virtual Devices
    6.10.1 VirtIO
    6.10.2 Barrelfish Bulk Transport
    6.10.3 Virtual Devices: A Conclusion
7 Applications
  7.1 Work Offloading
    7.1.1 Requirements for Offloading
  7.2 OpenMP
    7.2.1 Computation Model
    7.2.2 BOMP: Barrelfish OpenMP
    7.2.3 XOMP: OpenMP for Exclusive Address Spaces
    7.2.4 Library Initialization
    7.2.5 Library Scaling Characteristics
  7.3 Use Case Example: Matrix Multiplication
    7.3.1 General Benchmark Description
    7.3.2 Baseline: Local Benchmarks
    7.3.3 Effect of Data Replication
    7.3.4 Matrix Multiply using Heterogeneous OpenMP
  7.4 Conclusion
8 Future Work
  8.1 Towards One System
  8.2 Xeon Phi as a Collection of Cores
  8.3 Multiple Address Spaces
  8.4 Barrelfish: Explore Scaling Behavior
  8.5 Networking to the Xeon Phi and Host VFS
  8.6 Extending Flounder
  8.7 DMA Library
  8.8 OpenMP
  8.9 Bulk Transport over PCI Express
  8.10 Xeon Phi Hardware Features
  8.11 VirtIO
Appendices
A Memory Access Benchmarks
  A.1 Xeon Phi Caches
  A.2 Memory Latency
    A.2.1 Experiment Description
    A.2.2 Experiment Parameters
  A.3 Memory Throughput ...