2013 First International Conference on Artificial Intelligence, Modelling & Simulation

Behavior of MDynaMix on Intel Xeon Phi Coprocessor

Manjunatha Valmiki, HPCS, C-DAC, Pune, India, email: [email protected]
Nisha Kurkure, HPCS, C-DAC, Pune, India, email: [email protected]
Shweta Das, HPCS, C-DAC, Pune, India, email: [email protected]

Prashant Dinde, HPCS, C-DAC, Pune, India, email: [email protected]
Deepu CV, HPCS, C-DAC, Pune, India, email: [email protected]
Goldi Misra, HPCS, C-DAC, Pune, India, email: [email protected]

Pradeep Sinha, HPC and Corporate R&D, C-DAC, Pune, India, email: [email protected]

Abstract—Over the years, computational science has witnessed exceptional growth, but it still lags in efficient programming to effectively undertake research activities. Today, developments in almost all areas of science and technology rely heavily on computational capabilities. The latest TOP500 supercomputing list shows the relevance of computational simulation and modelling using accelerator technologies. Porting, optimization, scaling and tuning of existing High Performance Computing (HPC) applications on such hybrid architectures is the norm for reaping the benefits of extreme-scale computing. This paper gauges the performance of the MDynaMix application from the molecular dynamics domain on the Intel Xeon processor along with the Intel Xeon Phi coprocessor. Different test cases were carried out to explore the performance of Intel Xeon Phi cards within a node as well as across nodes.

Keywords—Supercomputing, High Performance Computing, Intel Xeon Phi, Molecular Dynamics, Hybrid Architecture, Accelerators, Coprocessor, Micro OS, Open Multi Processing (OpenMP), Message Passing Interface (MPI).

I. INTRODUCTION

HPC systems are not only becoming more complex day by day but also more challenging in terms of speedup and scalability. Well-organized and flexible numerical algorithms are essential to achieve high performance computing in such complex environments. The size of compute-intensive problems also increases as computing technology advances. One such addition to the ever-demanding market of HPC systems, meant to cope with these challenges, is Intel's Xeon Phi coprocessor accelerator. This paper explores the computational power of Intel's Xeon Phi using the molecular dynamics software package MDynaMix. The paper also examines the performance of the MDynaMix code under different execution modes without affecting the accuracy of the results.

A. About MDynaMix Code

MDynaMix is a general purpose code for simulations of mixtures of either rigid or flexible molecules, interacting through an AMBER-like force field in a periodic rectangular cell. Algorithms for NVE (microcanonical ensemble), NVT (canonical ensemble), NPT (isothermal-isobaric ensemble) and anisotropic NPT simulations are employed, together with Ewald summation for the treatment of electrostatic interactions, a path-integral approach to account for quantum effects, and the possibility of free energy computations using the expanded ensemble method. The program can be executed both sequentially and in parallel.

B. Intel Xeon Phi Architecture

Intel Xeon Phi coprocessors offer high computational numerical performance. Getting that performance requires properly tuned software: it must be highly scalable, highly vectorized, and must utilize the available memory efficiently. Each core in the Intel Xeon Phi coprocessor is designed to be power efficient while providing high throughput for highly parallel workloads, and is capable of supporting 4 threads in hardware. The 4 hardware threads per physical core help mask the effect of latencies on the in-order instruction execution. Application scalability is important because applications typically have 200+ active threads on an Intel Xeon Phi system. The computational power comes from the 512-bit wide registers; codes on the Intel Xeon Phi coprocessor need to utilize these wide SIMD instructions to reach the desired performance levels. The best performance is only achieved when the number of cores, the threads and the SIMD (vector) operations are used effectively.

The main memory for the Intel Xeon Phi coprocessor resides on the same physical card as the coprocessor, and is completely separate from, and not synchronized with, the memory on the host system. Familiar programming models like Open Multi Processing (OpenMP) and Message Passing Interface (MPI) allow the developer to execute the compute-intensive parts of the code on the underlying architecture. A skilled application developer can take advantage of the increased processing power of the large number of cores available on the Xeon Phi by porting existing parallel applications and codes to the coprocessor.
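As a quick way of confirming these architectural parameters on a given system, the short sketch below (not part of the original experiment) queries the card from the host. It assumes the Intel MPSS tools are installed and the first card is reachable as mic0; the exact micinfo field labels vary slightly between MPSS releases.

#!/bin/bash
# Illustrative sketch: inspect core count, clock and memory of the first Xeon Phi
# card from the host. Assumes Intel MPSS is installed and the card is named mic0.
micinfo | grep -iE "active cores|frequency|gddr"

# The card runs an embedded Linux (Micro OS); counting logical CPUs on the card
# should give physical cores x 4 hardware threads (e.g. 61 x 4 = 244).
ssh mic0 "grep -c ^processor /proc/cpuinfo"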

C. Methods to compile the code for the coprocessor

Native compilation: In this mode, the application executes directly on the coprocessor without being offloaded from the host. Such applications have to be compiled and built on the host system using the -mmic flag, which instructs the compiler to generate an object file for the Xeon Phi coprocessor. Executing applications in native mode requires that all the dynamic libraries and the executable be present on the coprocessor.

Symmetric mode compilation: In this mode, MPI ranks run on both the Xeon processors and the Xeon Phi coprocessor. The most important thing to remember is that here the Xeon Phi coprocessor cards are treated as additional nodes in a heterogeneous cluster. To that effect, running an MPI job in either the native or the symmetric mode is very similar to running a job on the Xeon processor.

Offload mode compilation: In this mode, the user can simply add OpenMP-like pragmas to C/C++ or FORTRAN code to mark regions of code that should be offloaded to the Intel Xeon Phi coprocessor and executed there. When the Intel compiler encounters an offload pragma, it generates code for both the coprocessor and the host. The code for transferring the data to the coprocessor is created automatically by the compiler; however, the programmer can influence the data transfer by adding data clauses to the offload pragma. Example: icc -offload-build sample.c
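To make the three modes concrete, the sketch below shows hedged example commands. The source file names, the use of the Intel MPI wrapper mpiicc and the copy step are assumptions made for illustration, not commands taken from the paper.

#!/bin/bash
# Illustrative build/run commands for the three modes (hypothetical file names).

# Native mode: cross-compile for the coprocessor; the binary and any shared
# libraries it needs must be copied to (or NFS-visible from) the card.
mpiicc -mmic -o md_test.mic md_test.c
scp md_test.mic mic0:/tmp/
# A parallel native run is usually launched from the host with a machinefile
# that lists the card (see Case 2 below):
echo mic0 > hostfile
mpirun -machinefile hostfile -np 4 /tmp/md_test.mic

# Symmetric mode: build one binary for the host and one for the card; both are
# launched together later with a single mpirun (see Case 3).
mpiicc -o md_test.host md_test.c
mpiicc -mmic -o md_test.mic md_test.c

# Offload mode: a normal host build in which offload pragmas in the source make
# the compiler emit code for both host and coprocessor.
icc -o sample_offload sample.c    # sample.c is assumed to contain offload pragmas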

D. Hardware and software configuration

Below is the hardware and software configuration that was used to conduct the experiment:

TABLE I. HARDWARE CONFIGURATION

  Sr. No.  Cluster Parameter  Host Node Configuration       Intel Xeon Phi Coprocessor
  1        CPU                Intel Xeon 2.60 GHz           1.2 GHz
  2        RAM                24 GB                         8 GB
  3        Cores              16                            61
  4        OS                 Red Hat EL6, kernel 2.6.32    Micro OS 2.6.34.11-g4af9302
  5        Threads per core   1                             4
  6        Nodes              4 nodes                       8 cards (2 per node)

TABLE II. SOFTWARE CONFIGURATION

  Sr. No.  Name                Description
  1        Intel Compiler      13.0.1
  2        Intel MPI Library   4.1.0.018
  3        MDynaMix            Version 2.0

II. EXPERIMENT IN DETAIL

MDynaMix is a compute-intensive molecular dynamics application. To realize the scalability and performance of the code on a hybrid architecture, it is first necessary to understand the scalability and performance of the code on the host CPU. The experiment has been carried out using the following cases.

Case 1: CPU compilation: The code was compiled and executed on the Intel Xeon (Sandy Bridge) multi-core processor with a varying number of processes/threads.

Case 2: Native compilation: The code was compiled on the CPU with the -mmic flag, the executable was transferred to the card and executed there with a varying number of threads (4-60) and different KMP_AFFINITY settings.

Case 3: Symmetric mode compilation: The code was compiled for both the CPU and the coprocessor.

In this experiment, the file system used is the Network File System (NFS). Once both executables are ready, they have to be run on the CPU and the Intel Xeon Phi card simultaneously using a command of the following form:

mpirun -f mpi_hosts -perhost 1 -n 2 ~/test_hello

A. Case 1: CPU Compilation

The MDynaMix code was compiled and benchmarked on an Intel Xeon CPU (E5-2670) running at 2.60 GHz, i.e. the Sandy Bridge architecture, without Hyper-Threading enabled.

  No. of nodes    = 1
  Total cores     = 16
  Input file      = md.input
  Number of steps = 500

The benchmark results in Table III and Fig. 1 show the performance of the molecular dynamics code on the multi-core Intel Xeon (Sandy Bridge) processor.

TABLE III. RESULT ON INTEL XEON PROCESSOR

  Sr. No.  No. of Processes (NP)  Time (sec)
  1        4                      186
  2        8                      98
  3        12                     75
  4        16                     87

Figure 1. Result on Intel Xeon Processor
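The Case 1 sweep could be scripted as sketched below. The executable name ./md and the way md.input is passed to it are assumptions for illustration; the paper does not show the MDynaMix launch command for this case.

#!/bin/bash
# Illustrative Case 1 scaling sweep on the host CPU. The executable name ./md and
# the input-passing convention are assumptions; wall-clock times go to the console.
for np in 4 8 12 16; do
    echo "=== ${np} MPI processes ==="
    /usr/bin/time -f "elapsed: %e s" mpirun -np ${np} ./md md.input
done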

The code scales at around 80% efficiency from 4 to 8 processes, but the scalability drops to about 30% from 8 to 12 processes. From this observation, the code scales up to 12 processes on the Sandy Bridge architecture using one node with the given number of time steps in the input file.

B. Case 2: Native Compilation

1) Native compilation for a single card within a single node with different threads per core

In this case the code was executed with a varying number of threads per core on the Xeon Phi coprocessor in order to find the optimal threads-per-core configuration for the MDynaMix application. The following run-time environment variables were exported before running the executable:

export MIC_ENV_PREFIX=PHI
export PHI_KMP_AFFINITY=balanced
export PHI_OMP_NUM_THREADS=120

According to the number of threads per core, one of the following environment variables was set:

export PHI_KMP_PLACE_THREADS=60c,2t
export PHI_KMP_PLACE_THREADS=60c,3t
export PHI_KMP_PLACE_THREADS=60c,4t

TABLE IV. RESULT ON INTEL XEON PHI COPROCESSOR WITH DIFFERENT THREADS PER CORE

  Sr. No.  No. of Processes  2 Threads/core  3 Threads/core  4 Threads/core
                             Time (sec)      Time (sec)      Time (sec)
  1        8                 954             1051            987
  2        16                497             577             562
  3        30                306             334             326
  4        60                195             194             198

Figure 2. Result on Intel Xeon Phi Coprocessor with different threads per core

From Fig. 2 it is evident that, for the 2 threads per core run, the code scaled at 90% as the processes increased from 8 to 16, at 38% from 16 to 30 and at 36% from 30 to 60 processes. For 3 threads per core, 75% of the computing power available on the card is used: linear scalability is observed as the processes increase from 8 to 16, but the scalability reduces to around 35% as the processes increase from 16 to 60. For 4 threads per core the compute power available on the card is completely used; the scalability of the code is around 80% from 8 to 16 processes and around 30% from 16 to 60 processes. Table IV shows that the MDynaMix application scales well with 2 threads per core on a single Xeon Phi card.
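An illustrative way to script this threads-per-core sweep is sketched below. It reuses the environment variables listed above; the executable name ./mdp and the single-entry hostfile follow the two-card example given later, and scaling PHI_OMP_NUM_THREADS with the thread count is an assumption.

#!/bin/bash
# Illustrative threads-per-core sweep for the native single-card runs (Case 2.1).
export MIC_ENV_PREFIX=PHI
export PHI_KMP_AFFINITY=balanced

echo "node0-mic0" > hostfile            # all ranks on the first coprocessor

for tpc in 2 3 4; do
    export PHI_KMP_PLACE_THREADS=60c,${tpc}t
    export PHI_OMP_NUM_THREADS=$((60 * tpc))
    echo "=== ${tpc} threads per core ==="
    mpirun -machinefile hostfile -np 60 ./mdp
done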

2) Native compilation with the KMP_AFFINITY environment variable

The Intel runtime library has the ability to bind OpenMP threads to physical processing units. The interface is controlled using the KMP_AFFINITY environment variable. Thread affinity restricts the execution of certain threads (virtual execution units) to a subset of the physical processing units in a multiprocessor computer. Depending on the topology of the machine, thread affinity can have a dramatic effect on the execution speed of a program. There are three types of interfaces one can use to specify this binding, collectively referred to as the Intel OpenMP Thread Affinity Interface. The three KMP_AFFINITY types are balanced, scatter and compact.

The meaning of the different affinity types is best explained with an example. Imagine a system with 4 cores and 4 hardware threads per core. If 8 threads are placed, the assignments produced by the compact, scatter and balanced types are shown in Fig. 3 below. Notice that compact does not fully utilize all the cores in the system; for this reason it is recommended that applications are run using the scatter or balanced option in most cases on Xeon Phi cards.

Figure 3. KMP_AFFINITY
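Since only the caption of Fig. 3 survives here, the sketch below spells out the placements the three types would normally produce for that 4-core, 8-thread example. The mapping comments reflect the usual Intel OpenMP runtime behaviour rather than a figure reproduced from the paper, and the OpenMP binary name is hypothetical.

#!/bin/bash
# Illustration of the affinity example in the text: 4 cores, 4 hardware threads
# per core, 8 OpenMP threads. Expected placements (typical Intel runtime behaviour):
#   compact  : threads 0-3 -> core 0, threads 4-7 -> core 1 (cores 2 and 3 idle)
#   scatter  : one thread per core in round-robin order, then wrap around
#   balanced : threads 0,1 -> core 0; 2,3 -> core 1; 4,5 -> core 2; 6,7 -> core 3
# The 'verbose' modifier makes the runtime print the binding it actually chose.
export OMP_NUM_THREADS=8
for type in compact scatter balanced; do
    echo "=== KMP_AFFINITY=${type} ==="
    KMP_AFFINITY="verbose,${type}" ./some_openmp_binary    # hypothetical binary
done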

In this case the runs were taken by setting the environment variables

export MIC_ENV_PREFIX=PHI
export PHI_OMP_NUM_THREADS=120

and, according to the affinity used, one of the following:

export PHI_KMP_AFFINITY=granularity=fine,balanced
export PHI_KMP_AFFINITY=granularity=fine,scatter
export PHI_KMP_AFFINITY=granularity=fine,compact
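A sketch of how this affinity comparison could be driven from one script is given below; as before, the executable name ./mdp and the hostfile are assumptions carried over from the other native-mode examples.

#!/bin/bash
# Illustrative affinity comparison for the native runs (Case 2.2).
export MIC_ENV_PREFIX=PHI
export PHI_OMP_NUM_THREADS=120

for affinity in balanced scatter compact; do
    export PHI_KMP_AFFINITY="granularity=fine,${affinity}"
    echo "=== KMP_AFFINITY=${affinity} ==="
    mpirun -machinefile hostfile -np 60 ./mdp
done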

TABLE V. RESULT ON INTEL XEON PHI COPROCESSOR WITH DIFFERENT THREAD AFFINITY

  Sr. No.  No. of Processes  BALANCED    SCATTER     COMPACT
                             Time (sec)  Time (sec)  Time (sec)
  1        8                 934         933         932
  2        16                582         575         496
  3        30                320         323         322
  4        60                186         187         185

Figure 4. Result of MDynaMix on Intel Xeon Phi Coprocessor with different Thread Affinity

From Fig. 4 it is inferred that, as the number of processes increases from 8 to 60, the scalability of the code with the balanced affinity is almost the same as with the other affinities for this application. This is the only case where similar performance behaviour is observed with respect to the increase in the number of processes.

3) Native compilation for two cards within a single node

All coprocessor cards on the system need to have a unique IP address that is accessible from any other card within the cluster. Coprocessors are treated as additional nodes within the cluster, so when the user wants to use two cards simultaneously, the card host names have to be provided in the hostfile while running the code. In this case the two Phi cards on the same node, i.e. node0-mic0 and node0-mic1, were used and the code was compiled in native mode. The hostfile consists of the names of the two Phi cards to be used for the run:

cat hostfile
node0-mic0
node0-mic1

To run the code on the two Phi cards:

mpirun -machinefile hostfile -np 8 ./mdp

Another aim of this activity is to understand the scalability and performance of the code when 50% of the workload runs on Xeon Phi 0 and 50% on Xeon Phi 1.

TABLE VI. RESULT ON TWO PHI CARDS WITHIN A NODE

  Sr. No.  No. of Processes  2 Threads/core  3 Threads/core  4 Threads/core
                             Time (sec)      Time (sec)      Time (sec)
  1        8                 1080            938             979
  2        16                536             561             561
  3        30                337             318             337
  4        60                196             192             199

Figure 5. Result of MDynaMix on Intel Xeon Phi Coprocessor on two cards within a node

From Fig. 5 it is observed that, while running 2 threads per core, the scalability is around 90% as the number of processes increases from 8 to 16; from 16 to 30 processes the scalability is around 60%, and the same from 30 to 60 processes. While running 3 threads per core, the scalability is 90% from 8 to 16 processes and around 50% from 16 to 30 processes, and the same from 30 to 60 processes. Using 4 threads per core, the scalability differs for each number of processes: from 8 to 16 processes the code scales at around 80%, from 16 to 30 processes the run time is reduced by around 30%, and the same from 30 to 60 processes.

4) Native compilation for one and two cards on two nodes

In this case the code is compiled in native mode and the Phi cards in each node are used. The following environment variable is set before running the code:

export PHI_KMP_PLACE_THREADS=60c,2t

The hostfile used to distribute the load evenly over the cards for 8 processes with one card per node on two nodes is:

node0-mic0
node0-mic0
node0-mic0
node0-mic0
node1-mic0
node1-mic0
node1-mic0
node1-mic0

The hostfile for 8 processes with two cards per node on two nodes is:

cat hostfile
node0-mic0
node0-mic0
node0-mic1
node0-mic1
node1-mic0
node1-mic0
node1-mic1
node1-mic1
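Such hostfiles can also be generated programmatically, as sketched below; the card host names follow the paper's nodeN-micM naming, and the executable name ./mdp is again an assumption.

#!/bin/bash
# Illustrative generation of the hostfiles shown above, followed by the launch.
NP=8                      # total MPI ranks
NODES="node0 node1"
CARDS="mic0 mic1"         # set to "mic0" for the one-card-per-node experiment

nnodes=$(echo ${NODES} | wc -w)
ncards=$(echo ${CARDS} | wc -w)
per_card=$((NP / (nnodes * ncards)))   # ranks written per card (2 for NP=8, 2x2 cards)

> hostfile
for node in ${NODES}; do
    for card in ${CARDS}; do
        for i in $(seq ${per_card}); do echo "${node}-${card}" >> hostfile; done
    done
done

export MIC_ENV_PREFIX=PHI
export PHI_KMP_PLACE_THREADS=60c,2t    # 2 threads per core, as in the text above
mpirun -machinefile hostfile -np ${NP} ./mdp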

TABLE VII. RESULT ON TWO AND FOUR NODES WITH DIFFERENT PHI CARDS

  Sr. No.  No. of Processes  2 nodes,    2 nodes,    4 nodes,    4 nodes,
                             1 card      2 cards     1 card      2 cards
                             Time (sec)  Time (sec)  Time (sec)  Time (sec)
  1        8                 942         951         947         932
  2        16                544         525         550         498
  3        32                303         310         308         309
  4        64                290         221         228         197

From Table VII it is observed that the code scales linearly when using four nodes with two Xeon Phi cards on each node. Optimal performance is obtained with 64 processes on four nodes with two Xeon Phi cards each. From Fig. 6 it is observed that the application scales as the number of nodes with coprocessors is increased.

Figure 6. Result on two and four nodes with different Phi cards

C. Case 3: Symmetric mode Compilation

All coprocessor cards on the system need to have a unique IP address that is accessible from the local host, i.e. from the other Xeon hosts on the system and from the Xeon Phi cards attached to those hosts. A very simple test of this is the ability to ssh from one Xeon Phi coprocessor (say node0-mic0) to its own Xeon host (node0), as well as to ssh to any other Xeon host in the cluster (node1) and to its respective Xeon Phi card (node1-mic0).

In this mode, the computing resources of both the CPU and the coprocessor are used. To redistribute the load between CPU and coprocessor appropriately, one has to understand the load balance between the ranks of the program. In this case, the runs were taken using different combinations of CPU and coprocessor processes to find the optimal combination.

For the Xeon Phi card: mpicc -mmic -o test_hello.MIC test.c
For the Xeon host: mpicc -o test_hello test.c
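With the two binaries built as above, a symmetric-mode job can be launched roughly as sketched below. The colon-separated argument groups and the I_MPI_MIC variable are assumed features of the installed Intel MPI; the rank counts correspond to the 8-process row of Table VIII.

#!/bin/bash
# Illustrative symmetric-mode launch: 2 host ranks plus 6 coprocessor ranks in one
# MPI job. I_MPI_MIC and the colon syntax are assumptions about the Intel MPI
# version in use; the .MIC binary must be reachable from the card, e.g. over the
# NFS mount mentioned earlier.
export I_MPI_MIC=enable

mpirun -host node0      -n 2 ./test_hello \
     : -host node0-mic0 -n 6 ./test_hello.MIC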

Table VIII shows the benchmarking results of the MDynaMix application using different combinations of CPU and coprocessor processes.

TABLE VIII. RESULT OF SYMMETRIC MODE COMPILATION

  Sr. No.  Total Processes  CPU Processes  Phi Processes  Time (sec)
  1        8                2              6              958
  2        10               2              8              778
  3        17               1              16             471
  4        22               2              20             400
  5        42               2              40             253
  6        62               2              60             219
  7        81               1              80             266
  8        82               2              80             284

Figure 7. Result of symmetric mode compilation

From Fig. 7 it is observed that the code scales from 8 (2 CPU + 6 Phi) to 62 (2 CPU + 60 Phi) processes. In this case one or two CPU processes and a varying number of Phi processes are used. In this mode, the load can be distributed over both the processor and the coprocessor.

III. CONCLUSION

MDynaMix is a popular code in the molecular dynamics domain. From the above experiments, we conclude that the performance of a code can vary depending on the arrangement of the threads or processes on the individual cores. Since the MDynaMix code is compute intensive and mainly implemented using MPI, it performs best with the 2 threads per core combination on the Xeon Phi.

In this paper, we have explained the different modes of compilation for the Intel Xeon Phi card within a node as well as across nodes. The code speed can be further enhanced by vectorization and by utilizing the memory and cache in an optimal way.

IV. FUTURE WORK

We will continue our efforts to explore and understand the performance behavior of various open source scientific applications on the Xeon Phi cluster using the native, offload and symmetric modes across the cluster. For hybrid architectures, the user needs to optimize and vectorize the codes to get the maximum utilization of the coprocessor.

