2013 First International Conference on Artificial Intelligence, Modelling & Simulation

Behavior of MDynaMix on Intel Xeon Phi Coprocessor

Manjunatha Valmiki, HPCS, C-DAC, Pune, India, email: [email protected]
Nisha Kurkure, HPCS, C-DAC, Pune, India, email: [email protected]
Shweta Das, HPCS, C-DAC, Pune, India, email: [email protected]

Prashant Dinde, HPCS, C-DAC, Pune, India, email: [email protected]
Deepu CV, HPCS, C-DAC, Pune, India, email: [email protected]
Goldi Misra, HPCS, C-DAC, Pune, India, email: [email protected]

Pradeep Sinha, HPC and Corporate R&D, C-DAC, Pune, India, email: [email protected]

Abstract—Over the years, computational science has witnessed exceptional growth, but it still lags in efficient programming to effectively undertake research activities. Today, developments in almost all areas of science and technology rely heavily on computational capabilities. The latest TOP500 supercomputing list shows the relevance of computational simulation and modelling using accelerator technologies. Porting, optimization, scaling and tuning of existing High Performance Computing (HPC) applications on such hybrid architectures is the norm for reaping the benefits of extreme-scale computing. This paper gauges the performance of the MDynaMix application from the molecular dynamics domain on the Intel Xeon processor along with the Intel Xeon Phi coprocessor. Different test cases were carried out to explore the performance of Intel Xeon Phi cards within a node as well as across nodes.

Keywords—Supercomputing, High Performance Computing, Intel Xeon Phi, Molecular Dynamics, Hybrid Architecture, Accelerators, Coprocessor, Micro OS, Open Multi Processing (OpenMP), Message Passing Interface (MPI).

I. INTRODUCTION

HPC systems are not only becoming more complex day by day but also more challenging in terms of speedup and scalability. Well-organized and flexible numerical algorithms are essential to achieve high performance computing in such complex environments. The size of compute-intensive problems also increases as computing technology advances. One such addition to the ever-demanding market of HPC systems, meant to cope with these challenges, is Intel's Xeon Phi coprocessor accelerator. This paper explores the computational power of Intel's Xeon Phi using the molecular dynamics software package MDynaMix. The paper also examines the performance of the MDynaMix code under different execution modes without affecting the accuracy of the results.

A. About MDynaMix Code

MDynaMix is a general purpose code for simulations of mixtures of either rigid or flexible molecules, interacting through an AMBER-like force field in a periodic rectangular cell. Algorithms for NVE (microcanonical ensemble), NVT (canonical ensemble), NPT (isothermal-isobaric ensemble) and anisotropic NPT simulations are employed, together with Ewald summation for the treatment of electrostatic interactions, a path-integral approach to account for quantum effects, and the possibility of free energy computations using the expanded ensemble method. The program can be executed both sequentially and in parallel.

B. Intel Xeon Phi Architecture

Intel Xeon Phi coprocessors offer high computational numerical performance. Getting that performance requires properly tuned software: it must be highly scalable, highly vectorized, and must utilize the available memory efficiently. Each core in the Intel Xeon Phi coprocessor is designed to be power efficient while providing high throughput for highly parallel workloads, and is capable of supporting 4 threads in hardware. The 4 hardware threads per physical core help mask the effect of latencies on the in-order instruction execution. Application scalability is important because applications typically have 200+ active threads on an Intel Xeon Phi system. The computational power comes from the 512-bit wide registers; codes on the Intel Xeon Phi coprocessor need to utilize these wide SIMD instructions to reach the desired performance levels. The best performance is only achieved when the number of cores, the threads and the SIMD (vector) operations are used effectively.

The main memory for the Intel Xeon Phi coprocessor resides on the same physical card as the coprocessor, and is completely separate from, and not synchronized with, the memory on the host system. Familiar programming models like Open Multi Processing (OpenMP) and Message Passing Interface (MPI) allow the developer to execute the compute-intensive parts of the code on the underlying architecture. A skilled application developer can take advantage of the increased processing power of the large number of cores available on the Xeon Phi by porting existing parallel applications and codes to the coprocessor.
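As a quick way of confirming these architectural parameters on a given system, the short sketch below (not part of the original experiment) queries the card from the host. It assumes the Intel MPSS tools are installed and the first card is reachable as mic0; the exact micinfo field labels vary slightly between MPSS releases.

#!/bin/bash
# Illustrative sketch: inspect core count, clock and memory of the first Xeon Phi
# card from the host. Assumes Intel MPSS is installed and the card is named mic0.
micinfo | grep -iE "active cores|frequency|gddr"

# The card runs an embedded Linux (Micro OS); counting logical CPUs on the card
# should give physical cores x 4 hardware threads (e.g. 61 x 4 = 244).
ssh mic0 "grep -c ^processor /proc/cpuinfo"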

C. Methods to compile the code for the coprocessor

Native compilation: In this mode, the application executes directly on the coprocessor without being offloaded from the host. Such applications have to be compiled and built on the host system using the -mmic flag, which instructs the compiler to generate an object file for the Xeon Phi coprocessor. Executing applications in native mode requires that all the dynamic libraries and the executable be present on the coprocessor.

Symmetric mode compilation: In this mode, MPI ranks run on both the Xeon processors and the Xeon Phi coprocessor. The most important thing to remember is that here the Xeon Phi coprocessor cards are treated as additional nodes in a heterogeneous cluster. To that effect, running an MPI job in either the native or the symmetric mode is very similar to running a job on the Xeon processor.

Offload mode compilation: In this mode, the user can simply add OpenMP-like pragmas to C/C++ or FORTRAN code to mark regions of code that should be offloaded to the Intel Xeon Phi coprocessor and executed there. When the Intel compiler encounters an offload pragma, it generates code for both the coprocessor and the host. The code for transferring the data to the coprocessor is created automatically by the compiler; however, the programmer can influence the data transfer by adding data clauses to the offload pragma. Example: icc -offload-build sample.c
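To make the three modes concrete, the sketch below shows hedged example commands. The source file names, the use of the Intel MPI wrapper mpiicc and the copy step are assumptions made for illustration, not commands taken from the paper.

#!/bin/bash
# Illustrative build/run commands for the three modes (hypothetical file names).

# Native mode: cross-compile for the coprocessor; the binary and any shared
# libraries it needs must be copied to (or NFS-visible from) the card.
mpiicc -mmic -o md_test.mic md_test.c
scp md_test.mic mic0:/tmp/
# A parallel native run is usually launched from the host with a machinefile
# that lists the card (see Case 2 below):
echo mic0 > hostfile
mpirun -machinefile hostfile -np 4 /tmp/md_test.mic

# Symmetric mode: build one binary for the host and one for the card; both are
# launched together later with a single mpirun (see Case 3).
mpiicc -o md_test.host md_test.c
mpiicc -mmic -o md_test.mic md_test.c

# Offload mode: a normal host build in which offload pragmas in the source make
# the compiler emit code for both host and coprocessor.
icc -o sample_offload sample.c    # sample.c is assumed to contain offload pragmas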

D. Hardware and software configuration

Below is the hardware and software configuration that was used to conduct the experiment:

TABLE I. HARDWARE CONFIGURATION

  Sr. No.  Cluster Parameter  Host Node Configuration       Intel Xeon Phi Coprocessor
  1        CPU                Intel Xeon 2.60 GHz           1.2 GHz
  2        RAM                24 GB                         8 GB
  3        Cores              16                            61
  4        OS                 Red Hat EL6, kernel 2.6.32    Micro OS 2.6.34.11-g4af9302
  5        Threads per core   1                             4
  6        Nodes              4 nodes                       8 cards (2 per node)

TABLE II. SOFTWARE CONFIGURATION

  Sr. No.  Name                Description
  1        Intel Compiler      13.0.1
  2        Intel MPI Library   4.1.0.018
  3        MDynaMix            Version 2.0

II. EXPERIMENT IN DETAIL

MDynaMix is a compute-intensive molecular dynamics application. To realize the scalability and performance of the code on a hybrid architecture, it is first necessary to understand the scalability and performance of the code on the host CPU. The experiment has been carried out using the following cases.

Case 1: CPU compilation: The code was compiled and executed on the Intel Xeon (Sandy Bridge) multi-core processor with a varying number of processes/threads.

Case 2: Native compilation: The code was compiled on the CPU with the -mmic flag, the executable was transferred to the card and executed there with a varying number of threads (4-60) and different KMP_AFFINITY settings.

Case 3: Symmetric mode compilation: The code was compiled for both the CPU and the coprocessor.

In this experiment, the file system used is the Network File System (NFS). Once both executables are ready, they have to be run on the CPU and the Intel Xeon Phi card simultaneously using a command of the following form:

mpirun -f mpi_hosts -perhost 1 -n 2 ~/test_hello

A. Case 1: CPU Compilation

The MDynaMix code was compiled and benchmarked on an Intel Xeon CPU (E5-2670) running at 2.60 GHz, i.e. the Sandy Bridge architecture, without Hyper-Threading enabled.

  No. of nodes    = 1
  Total cores     = 16
  Input file      = md.input
  Number of steps = 500

The benchmark results in Table III and Fig. 1 show the performance of the molecular dynamics code on the multi-core Intel Xeon (Sandy Bridge) processor.

TABLE III. RESULT ON INTEL XEON PROCESSOR

  Sr. No.  No. of Processes (NP)  Time (sec)
  1        4                      186
  2        8                      98
  3        12                     75
  4        16                     87

Figure 1. Result on Intel Xeon Processor
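The Case 1 sweep could be scripted as sketched below. The executable name ./md and the way md.input is passed to it are assumptions for illustration; the paper does not show the MDynaMix launch command for this case.

#!/bin/bash
# Illustrative Case 1 scaling sweep on the host CPU. The executable name ./md and
# the input-passing convention are assumptions; wall-clock times go to the console.
for np in 4 8 12 16; do
    echo "=== ${np} MPI processes ==="
    /usr/bin/time -f "elapsed: %e s" mpirun -np ${np} ./md md.input
done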

The code scales at around 80% efficiency from 4 to 8 processes, but the scalability drops to about 30% from 8 to 12 processes. From this observation, the code scales up to 12 processes on the Sandy Bridge architecture using one node with the given number of time steps in the input file.

B. Case 2: Native Compilation

1) Native compilation for a single card within a single node with different threads per core

In this case the code was executed with a varying number of threads per core on the Xeon Phi coprocessor in order to find the optimal threads-per-core configuration for the MDynaMix application. The following run-time environment variables were exported before running the executable:

export MIC_ENV_PREFIX=PHI
export PHI_KMP_AFFINITY=balanced
export PHI_OMP_NUM_THREADS=120

According to the number of threads per core, one of the following environment variables was set:

export PHI_KMP_PLACE_THREADS=60c,2t
export PHI_KMP_PLACE_THREADS=60c,3t
export PHI_KMP_PLACE_THREADS=60c,4t

TABLE IV. RESULT ON INTEL XEON PHI COPROCESSOR WITH DIFFERENT THREADS PER CORE

  Sr. No.  No. of Processes  2 Threads/core  3 Threads/core  4 Threads/core
                             Time (sec)      Time (sec)      Time (sec)
  1        8                 954             1051            987
  2        16                497             577             562
  3        30                306             334             326
  4        60                195             194             198

Figure 2. Result on Intel Xeon Phi Coprocessor with different threads per core

From Fig. 2 it is evident that, for the 2 threads per core run, the code scaled at 90% as the processes increased from 8 to 16, at 38% from 16 to 30 and at 36% from 30 to 60 processes. For 3 threads per core, 75% of the computing power available on the card is used: linear scalability is observed as the processes increase from 8 to 16, but the scalability reduces to around 35% as the processes increase from 16 to 60. For 4 threads per core the compute power available on the card is completely used; the scalability of the code is around 80% from 8 to 16 processes and around 30% from 16 to 60 processes. Table IV shows that the MDynaMix application scales well with 2 threads per core on a single Xeon Phi card.
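An illustrative way to script this threads-per-core sweep is sketched below. It reuses the environment variables listed above; the executable name ./mdp and the single-entry hostfile follow the two-card example given later, and scaling PHI_OMP_NUM_THREADS with the thread count is an assumption.

#!/bin/bash
# Illustrative threads-per-core sweep for the native single-card runs (Case 2.1).
export MIC_ENV_PREFIX=PHI
export PHI_KMP_AFFINITY=balanced

echo "node0-mic0" > hostfile            # all ranks on the first coprocessor

for tpc in 2 3 4; do
    export PHI_KMP_PLACE_THREADS=60c,${tpc}t
    export PHI_OMP_NUM_THREADS=$((60 * tpc))
    echo "=== ${tpc} threads per core ==="
    mpirun -machinefile hostfile -np 60 ./mdp
done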

2) Native compilation with the KMP_AFFINITY environment variable

The Intel runtime library has the ability to bind OpenMP threads to physical processing units. The interface is controlled using the KMP_AFFINITY environment variable. Thread affinity restricts the execution of certain threads (virtual execution units) to a subset of the physical processing units in a multiprocessor computer. Depending on the topology of the machine, thread affinity can have a dramatic effect on the execution speed of a program. There are three types of interfaces one can use to specify this binding, collectively referred to as the Intel OpenMP Thread Affinity Interface. The three KMP_AFFINITY types are balanced, scatter and compact.

The meaning of the different affinity types is best explained with an example. Imagine a system with 4 cores and 4 hardware threads per core. If 8 threads are placed, the assignments produced by the compact, scatter and balanced types are shown in Fig. 3 below. Notice that compact does not fully utilize all the cores in the system; for this reason it is recommended that applications are run using the scatter or balanced option in most cases on Xeon Phi cards.

Figure 3. KMP_AFFINITY
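Since only the caption of Fig. 3 survives here, the sketch below spells out the placements the three types would normally produce for that 4-core, 8-thread example. The mapping comments reflect the usual Intel OpenMP runtime behaviour rather than a figure reproduced from the paper, and the OpenMP binary name is hypothetical.

#!/bin/bash
# Illustration of the affinity example in the text: 4 cores, 4 hardware threads
# per core, 8 OpenMP threads. Expected placements (typical Intel runtime behaviour):
#   compact  : threads 0-3 -> core 0, threads 4-7 -> core 1 (cores 2 and 3 idle)
#   scatter  : one thread per core in round-robin order, then wrap around
#   balanced : threads 0,1 -> core 0; 2,3 -> core 1; 4,5 -> core 2; 6,7 -> core 3
# The 'verbose' modifier makes the runtime print the binding it actually chose.
export OMP_NUM_THREADS=8
for type in compact scatter balanced; do
    echo "=== KMP_AFFINITY=${type} ==="
    KMP_AFFINITY="verbose,${type}" ./some_openmp_binary    # hypothetical binary
done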

In this case the runs were taken by setting the environment variables

export MIC_ENV_PREFIX=PHI
export PHI_OMP_NUM_THREADS=120

and, according to the affinity used, one of the following:

export PHI_KMP_AFFINITY=granularity=fine,balanced
export PHI_KMP_AFFINITY=granularity=fine,scatter
export PHI_KMP_AFFINITY=granularity=fine,compact
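A sketch of how this affinity comparison could be driven from one script is given below; as before, the executable name ./mdp and the hostfile are assumptions carried over from the other native-mode examples.

#!/bin/bash
# Illustrative affinity comparison for the native runs (Case 2.2).
export MIC_ENV_PREFIX=PHI
export PHI_OMP_NUM_THREADS=120

for affinity in balanced scatter compact; do
    export PHI_KMP_AFFINITY="granularity=fine,${affinity}"
    echo "=== KMP_AFFINITY=${affinity} ==="
    mpirun -machinefile hostfile -np 60 ./mdp
done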

TABLE V. RESULT ON INTEL XEON PHI COPROCESSOR WITH DIFFERENT THREAD AFFINITY

  Sr. No.  No. of Processes  BALANCED    SCATTER     COMPACT
                             Time (sec)  Time (sec)  Time (sec)
  1        8                 934         933         932
  2        16                582         575         496
  3        30                320         323         322
  4        60                186         187         185

Figure 4. Result of MDynaMix on Intel Xeon Phi Coprocessor with different Thread Affinity

From Fig. 4 it is inferred that, as the number of processes increases from 8 to 60, the scalability of the code with the balanced affinity is almost the same as with the other affinities for this application. This is the only case where similar performance behaviour is observed with respect to the increase in the number of processes.

3) Native compilation for two cards within a single node

All coprocessor cards on the system need to have a unique IP address that is accessible from any other card within the cluster. Coprocessors are treated as additional nodes within the cluster, so when the user wants to use two cards simultaneously, the card host names have to be provided in the hostfile while running the code. In this case the two Phi cards on the same node, i.e. node0-mic0 and node0-mic1, were used and the code was compiled in native mode. The hostfile consists of the names of the two Phi cards to be used for the run:

cat hostfile
node0-mic0
node0-mic1

To run the code on the two Phi cards:

mpirun -machinefile hostfile -np 8 ./mdp

Another aim of this activity is to understand the scalability and performance of the code when 50% of the workload runs on Xeon Phi 0 and 50% on Xeon Phi 1.

TABLE VI. RESULT ON TWO PHI CARDS WITHIN A NODE

  Sr. No.  No. of Processes  2 Threads/core  3 Threads/core  4 Threads/core
                             Time (sec)      Time (sec)      Time (sec)
  1        8                 1080            938             979
  2        16                536             561             561
  3        30                337             318             337
  4        60                196             192             199

Figure 5. Result of MDynaMix on Intel Xeon Phi Coprocessor on two cards within a node

From Fig. 5 it is observed that, while running 2 threads per core, the scalability is around 90% as the number of processes increases from 8 to 16; from 16 to 30 processes the scalability is around 60%, and the same from 30 to 60 processes. While running 3 threads per core, the scalability is 90% from 8 to 16 processes and around 50% from 16 to 30 processes, and the same from 30 to 60 processes. Using 4 threads per core, the scalability differs for each number of processes: from 8 to 16 processes the code scales at around 80%, from 16 to 30 processes the run time is reduced by around 30%, and the same from 30 to 60 processes.

4) Native compilation for one and two cards on two nodes

In this case the code is compiled in native mode and the Phi cards in each node are used. The following environment variable is set before running the code:

export PHI_KMP_PLACE_THREADS=60c,2t

The hostfile used to distribute the load evenly over the cards for 8 processes with one card per node on two nodes is:

node0-mic0
node0-mic0
node0-mic0
node0-mic0
node1-mic0
node1-mic0
node1-mic0
node1-mic0

The hostfile for 8 processes with two cards per node on two nodes is:

cat hostfile
node0-mic0
node0-mic0
node0-mic1
node0-mic1
node1-mic0
node1-mic0
node1-mic1
node1-mic1
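Such hostfiles can also be generated programmatically, as sketched below; the card host names follow the paper's nodeN-micM naming, and the executable name ./mdp is again an assumption.

#!/bin/bash
# Illustrative generation of the hostfiles shown above, followed by the launch.
NP=8                      # total MPI ranks
NODES="node0 node1"
CARDS="mic0 mic1"         # set to "mic0" for the one-card-per-node experiment

nnodes=$(echo ${NODES} | wc -w)
ncards=$(echo ${CARDS} | wc -w)
per_card=$((NP / (nnodes * ncards)))   # ranks written per card (2 for NP=8, 2x2 cards)

> hostfile
for node in ${NODES}; do
    for card in ${CARDS}; do
        for i in $(seq ${per_card}); do echo "${node}-${card}" >> hostfile; done
    done
done

export MIC_ENV_PREFIX=PHI
export PHI_KMP_PLACE_THREADS=60c,2t    # 2 threads per core, as in the text above
mpirun -machinefile hostfile -np ${NP} ./mdp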

TABLE VII. RESULT ON TWO AND FOUR NODES WITH DIFFERENT PHI CARDS

  Sr. No.  No. of Processes  2 nodes,    2 nodes,    4 nodes,    4 nodes,
                             1 card      2 cards     1 card      2 cards
                             Time (sec)  Time (sec)  Time (sec)  Time (sec)
  1        8                 942         951         947         932
  2        16                544         525         550         498
  3        32                303         310         308         309
  4        64                290         221         228         197

From Table VII it is observed that the code scales linearly when using four nodes with two Xeon Phi cards on each node. Optimal performance is obtained with 64 processes on four nodes with two Xeon Phi cards each. From Fig. 6 it is observed that the application scales as the number of nodes with coprocessors is increased.

Figure 6. Result on two and four nodes with different Phi cards

C. Case 3: Symmetric mode Compilation

All coprocessor cards on the system need to have a unique IP address that is accessible from the local host, i.e. from the other Xeon hosts on the system and from the Xeon Phi cards attached to those hosts. A very simple test of this is the ability to ssh from one Xeon Phi coprocessor (say node0-mic0) to its own Xeon host (node0), as well as to ssh to any other Xeon host in the cluster (node1) and to its respective Xeon Phi card (node1-mic0).

In this mode, the computing resources of both the CPU and the coprocessor are used. To redistribute the load between CPU and coprocessor appropriately, one has to understand the load balance between the ranks of the program. In this case, the runs were taken using different combinations of CPU and coprocessor processes to find the optimal combination.

For the Xeon Phi card: mpicc -mmic -o test_hello.MIC test.c
For the Xeon host: mpicc -o test_hello test.c
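With the two binaries built as above, a symmetric-mode job can be launched roughly as sketched below. The colon-separated argument groups and the I_MPI_MIC variable are assumed features of the installed Intel MPI; the rank counts correspond to the 8-process row of Table VIII.

#!/bin/bash
# Illustrative symmetric-mode launch: 2 host ranks plus 6 coprocessor ranks in one
# MPI job. I_MPI_MIC and the colon syntax are assumptions about the Intel MPI
# version in use; the .MIC binary must be reachable from the card, e.g. over the
# NFS mount mentioned earlier.
export I_MPI_MIC=enable

mpirun -host node0      -n 2 ./test_hello \
     : -host node0-mic0 -n 6 ./test_hello.MIC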

Table VIII shows the benchmarking results of the MDynaMix application using different combinations of CPU and coprocessor processes.

TABLE VIII. RESULT OF SYMMETRIC MODE COMPILATION

  Sr. No.  Total Processes  CPU Processes  Phi Processes  Time (sec)
  1        8                2              6              958
  2        10               2              8              778
  3        17               1              16             471
  4        22               2              20             400
  5        42               2              40             253
  6        62               2              60             219
  7        81               1              80             266
  8        82               2              80             284

Figure 7. Result of symmetric mode compilation

From Fig. 7 it is observed that the code scales from 8 (2 CPU + 6 Phi) to 62 (2 CPU + 60 Phi) processes. In this case one or two CPU processes and a varying number of Phi processes are used. In this mode, the load can be distributed over both the processor and the coprocessor.

III. CONCLUSION

MDynaMix is a popular code in the molecular dynamics domain. From the above experiments, we conclude that the performance of a code can vary depending on the arrangement of the threads or processes on the individual cores. Since the MDynaMix code is compute intensive and mainly implemented using MPI, it performs best with the 2 threads per core combination on the Xeon Phi.

In this paper, we have explained the different modes of compilation for the Intel Xeon Phi card within a node as well as across nodes. The code speed can be further enhanced by vectorization and by utilizing the memory and cache in an optimal way.

IV. FUTURE WORK

We will continue our efforts to explore and understand the performance behavior of various open source scientific applications on the Xeon Phi cluster using the native, offload and symmetric modes across the cluster. For hybrid architectures, the user needs to optimize and vectorize the codes to get the maximum utilization of the coprocessor.

