SU(3) Gluodynamics on Graphics Processing Units

Total Pages: 16

File Type: pdf, Size: 1020 KB

Proceedings of the International School-seminar "New Physics and Quantum Chromodynamics at External Conditions", pp. 29-33, 3-6 May 2011, Dnipropetrovsk, Ukraine

SU(3) GLUODYNAMICS ON GRAPHICS PROCESSING UNITS

V. I. Demchik

Dnipropetrovsk National University, Dnipropetrovsk, Ukraine

SU(3) gluodynamics is implemented on graphics processing units (GPU) with the open cross-platform computational language OpenCL. The architecture of the first open computer cluster for high performance computing on graphics processing units (HGPU), with a peak performance of over 12 TFlops, is described.

1 Introduction

Most of the modern achievements in science and technology are closely related to the use of high performance computing. The ever-increasing performance requirements of computing systems drive the high rate of their development. According to the popular TOP-500 list of the most powerful (non-distributed) computer systems in the world [1], the computing power introduced every year exceeds the overall performance of all existing high-end systems built before. The obvious desire to reduce the cost of acquiring and maintaining computer systems, together with the growing demands on their performance, shifts supercomputing architecture in the direction of heterogeneous systems. Such architectures contain special computational accelerators in addition to traditional central processing units (CPU). One such accelerator is the graphics processing unit (GPU), which is traditionally designed and used primarily for the narrow task of visualizing 2D and 3D scenes in games and applications. The peak performance of modern high-end GPUs (AMD Radeon HD 6970, AMD Radeon HD 5870, nVidia GeForce GTX 580) is about 20 times higher than the peak performance of comparable CPUs (see Fig. 1). Over the last five years GPU performance has doubled roughly every year. Thanks to a substantially lower cost per Flops (floating point operations per second, a measure of a computer's performance), as well as higher rates of performance growth compared to CPUs, graphics accelerators have become a very attractive platform for high performance computing.

The current overall performance of all computer systems ranked in the latest TOP-500 list is about 85 PFlops [1]. The number of computer systems by country is: USA - 255 computers (51.0%), China - 61 (12.2%), Germany - 30 (6.0%), UK - 27 (5.4%), Japan - 26 (5.2%), France - 25 (5.0%), Russia - 12 (2.4%). Some of these supercomputer systems, equipped with graphics accelerators, are shown in Table 1 (the peak performance of the corresponding system, measured in GFlops, is shown in the last column).

According to Flynn's taxonomy [2], which describes four classes of computation in both serial and parallel contexts, GPUs are based on the so-called SIMD architecture. SIMD (Single Instruction stream - Multiple Data stream) is an architecture in which each processing unit executes the same instruction stream as the others on its own set of data. In other words, a set of processors shares the same control unit, and their execution differs only in the data elements each processor operates on. Due to the large number of processing units in a GPU and its SIMD architecture, it is reasonable to implement lattice simulations on the GPU. The technique of using a GPU, which typically handles computations only for computer graphics, to perform computations in applications traditionally handled by the CPU is called general-purpose computing on graphics processing units (GPGPU).
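As a minimal illustration of this SIMD/GPGPU model (a sketch, not code from the paper), the following OpenCL C kernel runs the same instruction stream in every work-item, with each work-item operating on its own array element. The kernel name and arguments are hypothetical, and the host-side setup (platform, context, command queue, buffer allocation) is omitted.

```c
/* Hypothetical OpenCL C kernel: every work-item executes the same
 * instructions, but each one operates on its own array element --
 * the SIMD picture described above. */
__kernel void vector_add(__global const float *x,
                         __global const float *y,
                         __global float *z,
                         const unsigned int n)
{
    size_t i = get_global_id(0);   /* global index of this work-item */
    if (i < n)                     /* guard: the global size may be padded */
        z[i] = x[i] + y[i];
}
```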
There are different approaches to GPU utilization. One of the novel languages for programming heterogeneous systems is the open computational language OpenCL [3]. It is maintained by a group of industry vendors, the Khronos Group, who have an interest in promoting a general way to target different computational devices. In spite of some performance inefficiency, OpenCL possesses a feature that is very important for high performance computing - portability. So, we chose OpenCL as the programming environment for our task.

The technical details of lattice simulations on computers are well known [4, 5, 6, 7]. We implemented the standard lattice simulation scheme on the GPU with the OpenCL language. The scheme is described in the next Section. The details of the GPU implementation are shown in Section 3.

e-mail: [email protected]

© Demchik V.I., 2011.

Figure 1. GPU and CPU performance growth.

Figure 2. Checkerboard decomposition of the lattice.

2 Lattice formulation

We used a hypercubic lattice L_t \times L_s^3 (L_t < L_s) with hypertorus geometry; L_t and L_s are the temporal and spatial sizes of the lattice, respectively. Evidently, the total number of lattice sites is N_{sites} = L_1 \times L_2 \times L_3 \times L_t. In order to parallelize the simulation algorithm we implemented the so-called checkerboard scheme (see Fig. 2): the whole lattice is divided into two parts, odd (red) and even (green) sites. The number of odd sites is N_{vect} = N_{sites}/2, as is the number of even ones. The lattice may be initialized with a cold or a hot configuration.

The standard Wilson one-plaquette action of SU(3) lattice gauge theory is used,

    S_W = \beta \sum_x \sum_{\mu<\nu} \left[ 1 - \frac{1}{3} \mathrm{Tr}\, U_{\mu\nu}(x) \right],    (1)

    U_{\mu\nu}(x) = U_\mu(x) U_\nu(x + a\hat\mu) U_\mu^\dagger(x + a\hat\nu) U_\nu^\dagger(x),    (2)

where \beta = 6/g^2 is the lattice coupling constant, g is the bare coupling, U_\mu(x) is the link variable located on the link leaving the lattice site x in the direction \mu, and U_{\mu\nu}(x) is the ordered product of the link variables (here and below we omit the group indices). The defining representation of the gauge field variables is the complex 3 \times 3 matrices for SU(3). Due to the unitarity of the SU(3) group, the minimal number of parameters is 8. There are different possible representations of SU(3) matrices in computer memory [4]. In order to speed up practical calculations it is more convenient to use a redundant representation. The SU(3) group elements may be represented either by the complete complex matrix, or by the first two rows, corresponding to the complex 3-vectors u and v. Due to the group properties of SU(3) matrices, the third row of the matrix can be reconstructed from the vectors u and v:

    U_\mu(x) = \begin{pmatrix} u \\ v \\ u^* \times v^* \end{pmatrix}.    (3)

So, an SU(3) lattice variable can be represented as a set of 6 complex numbers (12 real numbers).

    No   Computer                                    Country   Year   Cores    Rmax (GFlops)
    1    K computer                                  Japan     2011   548352   8162000
    2    NUDT TH MPP, Tesla C2050 GPU                China     2010   186368   2566000
    3    Cray XT5-HE                                 USA       2009   224162   1759000
    4    Dawning TC3600 Blade, Tesla C2050 GPU       China     2010   120640   1271000
    5    HP ProLiant SL390s, Tesla C2050 GPU         Japan     2010   73278    1192000
    13   T-Platforms T-Blade2/1.1, Tesla C2070 GPU   Russia    2011   33072    674105
    22   Supermicro Cluster, ATI Radeon GPU          Germany   2011   16368    299300
    499  Cluster Platform 3000 BL2x220               Germany   2010   4800     40217
    500  BladeCenter HS22 Cluster                    China     2010   7104     40187

Table 1. Some computer systems in the TOP-500 list of World's Supercomputers [1].
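The paper stores each SU(3) link as the two complex 3-vectors u and v (12 real numbers) and reconstructs the third row via Eq. (3). A minimal C sketch of that reconstruction is given below; the struct and function names are hypothetical illustrations, not the paper's implementation. Storing only u and v keeps the per-link storage at 12 real numbers, while the third row can be recomputed on the fly whenever the full matrix is needed.

```c
#include <complex.h>
#include <stdio.h>

/* Illustrative storage of an SU(3) link by its first two rows u and v
 * (6 complex = 12 real numbers), as in Eq. (3); names are hypothetical. */
typedef struct {
    double complex u[3];   /* first row  */
    double complex v[3];   /* second row */
} su3_link;

/* Reconstruct the third row w = u* x v* = (u x v)* from the stored rows. */
static void su3_third_row(const su3_link *U, double complex w[3])
{
    w[0] = conj(U->u[1] * U->v[2] - U->u[2] * U->v[1]);
    w[1] = conj(U->u[2] * U->v[0] - U->u[0] * U->v[2]);
    w[2] = conj(U->u[0] * U->v[1] - U->u[1] * U->v[0]);
}

int main(void)
{
    /* Unit matrix as a trivial check: the third row must come out as (0, 0, 1). */
    su3_link U = { .u = {1, 0, 0}, .v = {0, 1, 0} };
    double complex w[3];
    su3_third_row(&U, w);
    printf("w = (%g%+gi, %g%+gi, %g%+gi)\n",
           creal(w[0]), cimag(w[0]), creal(w[1]), cimag(w[1]),
           creal(w[2]), cimag(w[2]));
    return 0;
}
```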
To avoid the accumulation of rounding errors in the multiplications of the group elements, all the matrices have to be projected back to unitarity regularly. The well-known Gram-Schmidt method for building orthonormal basis elements in vector spaces is used,

    u_{new} = u / |u|,
    v_{new} = v' / |v'|,   v' = v - u_{new} (v \cdot u_{new}^*).    (4)

To generate a candidate matrix for the updating algorithm we used the Cabibbo-Marinari scheme [14]. SU(3) random matrices are constructed from three 2 \times 2 SU(2) random matrices (R, S and T):

    \bar U_\mu^a(x) = \begin{pmatrix} r_{11} & r_{12} & 0 \\ r_{21} & r_{22} & 0 \\ 0 & 0 & 1 \end{pmatrix}
                      \begin{pmatrix} s_{11} & 0 & s_{12} \\ 0 & 1 & 0 \\ s_{21} & 0 & s_{22} \end{pmatrix}
                      \begin{pmatrix} 1 & 0 & 0 \\ 0 & t_{11} & t_{12} \\ 0 & t_{21} & t_{22} \end{pmatrix}.    (5)

To update the lattice, the heat-bath algorithm with overrelaxation was used [8, 15]. At every MC iteration we successively replaced each lattice link,

    U_\mu(x) \to \bar U_\mu(x) = w W_\mu(x),    (6)

    W_\mu(x) = \sum_{\nu(\mu<\nu)} \left[ U_\nu(x) U_\mu(x + a\hat\nu) U_\nu^\dagger(x + a\hat\mu)
             + U_\nu^\dagger(x - a\hat\nu) U_\mu(x - a\hat\nu) U_\nu(x + a(\hat\mu - \hat\nu)) \right],    (7)

where w is a normalization constant such that

    \bar U_\mu(x) \in SU(3).    (8)

The central part of the lattice simulation program is the updating subroutine. We have to systematically update all lattice links of a given configuration by suggesting new variables and accepting them. Visiting all links once is called a lattice sweep. Sufficiently many updates have to be done until the subsequently generated configurations represent the equilibrium distribution. This procedure is known as thermalization. After thermalization, one can start to compute the observables (e.g. the energy) on the Monte Carlo configurations. The configurations used for measuring are separated by several intermediate updates.

There are several software packages for Monte Carlo simulations on a lattice:

• FermiQCD: an open C++ library for the development of parallel Lattice Quantum Field Theory computations [9, 10].

• MILC: an open code of high performance research software written in C for doing SU(3) lattice gauge theory simulations on several different (MIMD) parallel computers [11].

• SZIN: an open-source software system that supports data-parallel programming constructs for lattice field theory and in particular lattice QCD [12].

• QUDA: a library for performing calculations in lattice QCD on graphics processing units [13].

The first three packages are central processing unit (CPU)-based, while the last one is intended for nVidia GPUs only. In the next Section we will show how the described simulation algorithm can be implemented on both AMD/ATI and nVidia GPUs.
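As a concrete illustration of the reunitarization step of Eq. (4), here is a small self-contained C sketch operating on plain complex 3-vectors; all names are hypothetical and the code is only an illustration, not the paper's implementation. Such a reprojection would typically be applied to every link after a number of sweeps, keeping the accumulated rounding errors under control.

```c
#include <complex.h>
#include <math.h>
#include <stdio.h>

/* Sketch of the Gram-Schmidt reprojection of Eq. (4): u is normalized,
 * the component of v along u_new is removed, and v is normalized. */

static double norm3(const double complex a[3])
{
    return sqrt(creal(a[0] * conj(a[0]) + a[1] * conj(a[1]) + a[2] * conj(a[2])));
}

static void reunitarize(double complex u[3], double complex v[3])
{
    double nu = norm3(u);
    for (int i = 0; i < 3; i++) u[i] /= nu;          /* u_new = u / |u|             */

    /* projection coefficient v . u_new* */
    double complex p = v[0] * conj(u[0]) + v[1] * conj(u[1]) + v[2] * conj(u[2]);
    for (int i = 0; i < 3; i++) v[i] -= p * u[i];    /* v' = v - u_new (v . u_new*) */

    double nv = norm3(v);
    for (int i = 0; i < 3; i++) v[i] /= nv;          /* v_new = v' / |v'|           */
}

int main(void)
{
    /* Slightly perturbed rows; after the call they form an orthonormal pair again. */
    double complex u[3] = { 1.0 + 0.01 * I, 0.02, 0.0 };
    double complex v[3] = { 0.03, 1.0, 0.01 * I };
    reunitarize(u, v);
    double complex dot = v[0] * conj(u[0]) + v[1] * conj(u[1]) + v[2] * conj(u[2]);
    printf("|u| = %g, |v| = %g, |v.u*| = %g\n", norm3(u), norm3(v), cabs(dot));
    return 0;
}
```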
Recommended publications
  • 2.5 Classification of Parallel Computers
2.5 Classification of Parallel Computers

2.5.1 Granularity

In parallel computing, granularity means the amount of computation in relation to communication or synchronisation. Periods of computation are typically separated from periods of communication by synchronization events.

• fine level (same operations with different data)
  ◦ vector processors
  ◦ instruction level parallelism
  ◦ fine-grain parallelism:
    – Relatively small amounts of computational work are done between communication events
    – Low computation to communication ratio
    – Facilitates load balancing
    – Implies high communication overhead and less opportunity for performance enhancement
    – If granularity is too fine, it is possible that the overhead required for communication and synchronization between tasks takes longer than the computation.
• operation level (different operations simultaneously)
• problem level (independent subtasks)
  ◦ coarse-grain parallelism:
    – Relatively large amounts of computational work are done between communication/synchronization events
    – High computation to communication ratio
    – Implies more opportunity for performance increase
    – Harder to load balance efficiently

2.5.2 Hardware: Pipelining (was used in supercomputers, e.g. Cray-1)

If there are N elements in the pipeline and each element needs L clock cycles, the calculation takes L + N cycles; without a pipeline it takes L * N cycles. Example of good code for pipelining:

    do i = 1, k
      z(i) = x(i) + y(i)
    end do

Vector processors provide fast vector operations (operations on arrays). The previous example is also good for a vector processor (vector addition), but, e.g., recursion is hard to optimise for vector processors. Example: Intel MMX - a simple vector processor.
  • Improving Tasks Throughput on Accelerators Using Opencl Command Concurrency∗
Improving tasks throughput on accelerators using OpenCL command concurrency*

A.J. Lázaro-Muñoz(1), J.M. González-Linares(1), J. Gómez-Luna(2), and N. Guil(1)
(1) Dep. of Computer Architecture, University of Málaga, Spain. [email protected]
(2) Dep. of Computer Architecture and Electronics, University of Córdoba, Spain

Abstract: A heterogeneous architecture composed by a host and an accelerator must frequently deal with situations where several independent tasks are available to be offloaded onto the accelerator. These tasks can be generated by concurrent applications executing in the host or, in case the host is a node of a computer cluster, by applications running on other cluster nodes that are willing to offload tasks in the accelerator connected to the host. In this work we show that a runtime scheduler that selects the best execution order of a group of tasks on the accelerator can significantly reduce the total execution time of the tasks and, consequently, increase the accelerator use. Our solution is based on a temporal execution model that is able to predict with high accuracy the execution time of a set of concurrent tasks launched on the accelerator. The execution model has been validated in AMD, NVIDIA, and Xeon Phi devices using synthetic benchmarks. Moreover, employing the temporal execution model, a heuristic is proposed which is able to establish a near-optimal tasks execution ordering that significantly reduces the total execution time, including data transfers. The heuristic has been evaluated with five different benchmarks composed of dominant kernel and dominant transfer real tasks. Experiments indicate the heuristic is always able to find an ordering with a better execution time than the average of every possible execution order and, most times, it achieves a near-optimal ordering (very close to the execution time of the best execution order) with a negligible overhead.
  • State-Of-The-Art in Parallel Computing with R
State-of-the-art in Parallel Computing with R

Markus Schmidberger, Martin Morgan, Dirk Eddelbuettel, Hao Yu, Luke Tierney, Ulrich Mansmann
Technical Report Number 47, 2009, Department of Statistics, University of Munich, http://www.stat.uni-muenchen.de

Markus Schmidberger (Ludwig-Maximilians-Universität München, Germany), Martin Morgan (Fred Hutchinson Cancer Research Center, WA, USA), Dirk Eddelbuettel (Debian Project, Chicago, IL, USA), Hao Yu (University of Western Ontario, ON, Canada), Luke Tierney (University of Iowa, IA, USA), Ulrich Mansmann (Ludwig-Maximilians-Universität München, Germany)

Abstract: R is a mature open-source programming language for statistical computing and graphics. Many areas of statistical research are experiencing rapid growth in the size of data sets. Methodological advances drive increased use of simulations. A common approach is to use parallel computing. This paper presents an overview of techniques for parallel computing with R on computer clusters, on multi-core systems, and in grid computing. It reviews sixteen different packages, comparing them on their state of development, the parallel technology used, as well as on usability, acceptance, and performance. Two packages (snow, Rmpi) stand out as particularly useful for general use on computer clusters. Packages for grid computing are still in development, with only one package currently available to the end user. For multi-core systems four different packages exist, but a number of issues pose challenges to early adopters. The paper concludes with ideas for further developments in high performance computing with R. Example code is available in the appendix.

Keywords: R, high performance computing, parallel computing, computer cluster, multi-core systems, grid computing, benchmark.
  • Towards a Scalable File System on Computer Clusters Using Declustering
    Journal of Computer Science 1 (3) : 363-368, 2005 ISSN 1549-3636 © Science Publications, 2005 Towards a Scalable File System on Computer Clusters Using Declustering Vu Anh Nguyen, Samuel Pierre and Dougoukolo Konaré Department of Computer Engineering, Ecole Polytechnique de Montreal C.P. 6079, succ. Centre-ville, Montreal, Quebec, H3C 3A7 Canada Abstract : This study addresses the scalability issues involving file systems as critical components of computer clusters, especially for commercial applications. Given that wide striping is an effective means of achieving scalability as it warrants good load balancing and allows node cooperation, we choose to implement a new data distribution scheme in order to achieve the scalability of computer clusters. We suggest combining both wide striping and replication techniques using a new data distribution technique based on “chained declustering”. Thus, we suggest a complete architecture, using a cluster of clusters, whose performance is not limited by the network and can be adjusted with one-node precision. In addition, update costs are limited as it is not necessary to redistribute data on the existing nodes every time the system is expanded. The simulations indicate that our data distribution technique and our read algorithm balance the load equally amongst all the nodes of the original cluster and the additional ones. Therefore, the scalability of the system is close to the ideal scenario: once the size of the original cluster is well defined, the total number of nodes in the system is no longer limited, and the performance increases linearly. Key words : Computer Cluster, Scalability, File System INTRODUCTION serve an increasing number of clients.
  • Programmable Interconnect Control Adaptive to Communication Pattern of Applications
Title: Programmable Interconnect Control Adaptive to Communication Pattern of Applications
Author(s): 髙橋, 慧智 (Keichi Takahashi)
Text Version: ETD
URL: https://doi.org/10.18910/72595
DOI: 10.18910/72595
Osaka University Knowledge Archive: OUKA, https://ir.library.osaka-u.ac.jp/, Osaka University

Programmable Interconnect Control Adaptive to Communication Pattern of Applications. Submitted to the Graduate School of Information Science and Technology, Osaka University, January 2019. Keichi TAKAHASHI.

This work is dedicated to my parents and my wife.

List of Publications by the Author

Journal
[1] Keichi Takahashi, S. Date, D. Khureltulga, Y. Kido, H. Yamanaka, E. Kawai, and S. Shimojo, "UnisonFlow: A Software-Defined Coordination Mechanism for Message-Passing Communication and Computation", IEEE Access, vol. 6, no. 1, pp. 23372–23382, 2018. DOI: 10.1109/ACCESS.2018.2829532.
[2] A. Misawa, S. Date, Keichi Takahashi, T. Yoshikawa, M. Takahashi, M. Kan, Y. Watashiba, Y. Kido, C. Lee, and S. Shimojo, "Dynamic Reconfiguration of Computer Platforms at the Hardware Device Level for High Performance Computing Infrastructure as a Service", Cloud Computing and Service Science. CLOSER 2017. Communications in Computer and Information Science, vol. 864, pp. 177–199, 2018. DOI: 10.1007/978-3-319-94959-8_10.
[3] S. Date, H. Abe, D. Khureltulga, Keichi Takahashi, Y. Kido, Y. Watashiba, P. Uchupala, K. Ichikawa, H. Yamanaka, E. Kawai, and S. Shimojo, "SDN-accelerated HPC Infrastructure for Scientific Research", International Journal of Information Technology, vol. 22, no. 1, pp. 789–796, 2016.

International Conference (with review)
[1] Y. Takigawa, Keichi Takahashi, S. Date, Y. Kido, and S. Shimojo, "A Traffic Simulator with Intra-node Parallelism for Designing High-performance Interconnects", in 2018 International Conference on High Performance Computing & Simulation (HPCS 2018), Jul.
  • Introduction to Parallel Computing
Introduction to Parallel Computing

Lab Objective: Many modern problems involve so many computations that running them on a single processor is impractical or even impossible. There has been a consistent push in the past few decades to solve such problems with parallel computing, meaning computations are distributed to multiple processors. In this lab, we explore the basic principles of parallel computing by introducing the cluster setup, standard parallel commands, and code designs that fully utilize available resources.

Parallel Architectures

A serial program is executed one line at a time in a single process. Since modern computers have multiple processor cores, serial programs only use a fraction of the computer's available resources. This can be beneficial for smooth multitasking on a personal computer because programs can run uninterrupted on their own core. However, to reduce the runtime of large computations, it is beneficial to devote all of a computer's resources (or the resources of many computers) to a single program. In theory, this parallelization strategy can allow programs to run N times faster where N is the number of processors or processor cores that are accessible. Communication and coordination overhead prevents the improvement from being quite that good, but the difference is still substantial. A supercomputer or computer cluster is essentially a group of regular computers that share their processors and memory. There are several common architectures that combine computing resources for parallel processing, and each architecture has a different protocol for sharing memory and processors between computing nodes, the different simultaneous processing areas. Each architecture offers unique advantages and disadvantages, but the general commands used with each are very similar.
  • A Beginner's Guide to High–Performance Computing
A Beginner's Guide to High-Performance Computing

1 Module Description

Developer: Rubin H Landau, Oregon State University, http://science.oregonstate.edu/ rubin/

Goals: To present some of the general ideas behind and basic principles of high-performance computing (HPC) as performed on a supercomputer. These concepts should remain valid even as the technical specifications of the latest machines continually change. Although this material is aimed at HPC supercomputers, if history be a guide, present HPC hardware and software become desktop machines in less than a decade.

Outcomes: Upon completion of this module and its tutorials, the student should be able to
• Understand in a general sense the architecture of high performance computers.
• Understand how the architecture of high performance computers affects the speed of programs run on HPCs.
• Understand how memory access affects the speed of HPC programs.
• Understand Amdahl's law for parallel and serial computing.
• Understand the importance of communication overhead in high performance computing.
• Understand some of the general types of parallel computers.
• Understand how different types of problems are best suited for different types of parallel computers.
• Understand some of the practical aspects of message passing on MIMD machines.

Prerequisites:
• Experience with compiled language programming.
• Some familiarity with, or at least access to, Fortran or C compiling.
• Some familiarity with, or at least access to, Python or Java compiling.
• Familiarity with matrix computing and linear algebra.

Intended Audience: Upper-level undergraduates or graduate students.

Teaching Duration: One-two weeks. A real course in parallel computing would of course be at least a full term.
  • Performance Comparison of MPICH and Mpi4py on Raspberry Pi-3B
Jour of Adv Research in Dynamical & Control Systems, Vol. 11, 03-Special Issue, 2019

Performance Comparison of MPICH and MPI4py on Raspberry Pi-3B Beowulf Cluster

Saad Wazir, EPIC Lab, FAST-National University of Computer & Emerging Sciences, Islamabad, Pakistan. E-mail: [email protected]
Ataul Aziz Ikram, EPIC Lab, FAST-National University of Computer & Emerging Sciences, Islamabad, Pakistan. E-mail: [email protected]
Hamza Ali Imran, EPIC Lab, FAST-National University of Computer & Emerging Sciences, Islamabad, Pakistan. E-mail: [email protected]
Hanif Ullah, Research Scholar, Riphah International University, Islamabad, Pakistan. E-mail: [email protected]
Ahmed Jamal Ikram, EPIC Lab, FAST-National University of Computer & Emerging Sciences, Islamabad, Pakistan. E-mail: [email protected]
Maryam Ehsan, Information Technology Department, University of Gujarat, Gujarat, Pakistan. E-mail: [email protected]

Abstract: Moore's Law is running out. Instead of making a powerful computer by increasing the number of transistors, we are now moving toward parallelism. A Beowulf cluster is a cluster of any commodity hardware. Our cluster works in the same way as current-day supercomputers. The motivation is to create a small-sized, cheap device on which students and researchers can get hands-on experience. There is a master node, which interacts with the user, and all other nodes are slave nodes. The load is divided equally among all nodes and they send their results to the master. The master combines those results and shows the final output to the user. For communication between nodes we have created a network over Ethernet. We are using MPI4py, which is a Python-based implementation of the Message Passing Interface (MPI), and MPICH, which is also an open source implementation of MPI and allows us to code in C, C++ and Fortran.
  • Designing a Low Cost and Scalable PC Cluster System for HPC Environment
International Conference on Advances in Communication and Computing Technologies (ICACACT) 2012. Proceedings published by International Journal of Computer Applications® (IJCA)

Designing a Low Cost and Scalable PC Cluster System for HPC Environment

Laxman S. Naik, Department of Computer Engineering, RMCET, University of Mumbai.

ABSTRACT
In recent years many organizations are trying to design an advanced computing environment to get high performance. Also, small academic institutions wish to develop an effective computing and digital communication environment. This creates problems in getting the required powerful hardware components and software, because high-end servers and workstations are very expensive and it is not affordable for small academic institutions to purchase them. In this paper we have proposed a low cost and scalable PC cluster system using commodity off-the-shelf personal computers and free open source software.

... cluster system for scalable network services is characterized by its high performance, high availability, highly scalable file system, low cost, and broad range of applications. The parallel virtual file system version 2 (PVFS2) [2,7,8,10] is deployed in the system to provide a high performance and scalable parallel file system for PC clusters. In addition to PVFS2, MPICH2 (an MPI-IO implementation) [1,9] is combined for the message passing. Clusters of servers, connected by a fast network, are emerging as a viable architecture for building highly scalable and available services. This type of loosely coupled architecture is more scalable, more cost effective, and more reliable than a tightly coupled multiprocessor system. However, a number of challenges must be addressed to make cluster technology an efficient architecture to support scalable services [4].

Keywords
  • Review Questions and Answers on Cloud Overview 1
    Review Questions and Answers on Cloud Overview 1. Cloud Computing was known to be a solution for the following problems: a. Scalability of Enterprise applications b. Disaster – failure due to unplanned demand c. Increasing Capital investment on IT Infrastructure Explain the nature of the problem and how cloud computing is able to address them. See answer for a) http://www.latisys.com/blog-post/enterprise-cloud-services-solve-it- scalability-problems See answer for b) http://www.iland.com/solutions/disaster-recovery/ http://searchdisasterrecovery.techtarget.com/feature/Disaster-recovery-in-the-cloud- explained http://www.onlinetech.com/404 See answer for c) http://www.opengroup.org/cloud/cloud/cloud_for_business/roi.htm 2. Briefly describe the ways that cloud computing are both similar and different to grid computing. In the mid 1990s, the term Grid was coined to describe technologies that would allow consumers to obtain computing power on demand. Ian Foster and others posited that by standardizing the protocols used to request computing power, we could spur the creation of a Computing Grid, analogous in form and utility to the electric power grid. Researchers subsequently developed these ideas in many exciting ways, producing for example large- scale federated systems (TeraGrid, Open Science Grid, caBIG, EGEE, Earth System Grid) that provide not just computing power, but also data and software, on demand. Standards organizations (e.g., OGF, OASIS) defined relevant standards. More prosaically, the term was also co-opted by industry as a marketing term for clusters. But no viable commercial Grid Computing providers emerged, at least not until recently. So is “Cloud Computing” just a new name for Grid? In information technology, where technology scales by an order of magnitude, and in the process reinvents itself, every five years, there is no straightforward answer to such questions.
  • A Simplified Introduction to the Architecture of High Performance Computing
International Journal of Science and Research (IJSR), ISSN: 2319-7064, ResearchGate Impact Factor (2018): 0.28 | SJIF (2018): 7.426

A Simplified Introduction to the Architecture of High Performance Computing

Utkarsh Rastogi, Department of Computer Engineering and Application, GLA University, India

Abstract: High-performance computing (HPC) is the ability to process data and perform complex calculations at high speeds. To put it into perspective, a laptop or desktop with a 3 GHz processor can perform around 3 billion calculations per second. While that is much faster than any human can achieve, it pales in comparison to HPC solutions that can perform quadrillions of calculations per second. One of the best-known types of HPC solutions is the supercomputer. A supercomputer contains thousands of compute nodes that work together to complete one or more tasks. This is called parallel processing. It's similar to having thousands of PCs networked together, combining compute power to complete tasks faster.

1. Introduction

The paper is focused on the IST Gravity Group Computer Cluster, Baltasar Sete Sóis, and studies the impact of a decoupled high performance computing system architecture for data-intensive sciences. Baltasar is part of the "DyBHo" ERC Starting Grant awarded to Prof. Vitor Cardoso, to study and understand dynamical black holes.

... components operate seamlessly to complete a diverse set of tasks. HPC systems often include several different types of nodes, which are specialised for different purposes. Head (or front-end or login) nodes are where you login to interact with the HPC system. Compute nodes are where the real computing is done. You generally do not have access to the compute nodes directly - access to these resources is ...
  • Camx MULTIPROCESSING CAPABILITY for COMPUTER CLUSTERS USING the MESSAGE PASSING INTERFACE (MPI) PROTOCOL
ENVIRON International Corporation

Final Report

CAMx MULTIPROCESSING CAPABILITY FOR COMPUTER CLUSTERS USING THE MESSAGE PASSING INTERFACE (MPI) PROTOCOL

Work Order No. 582-07-84005-FY08-06
Project Number: 2008-01

Prepared for: Texas Commission on Environmental Quality, 12100 Park 35 Circle, Austin, TX 78753

Prepared by: Chris Emery, Gary Wilson, Greg Yarwood, ENVIRON International Corporation, 773 San Marin Dr., Suite 2115, Novato, CA 94998

August 31, 2008

TABLE OF CONTENTS                                          Page
1. INTRODUCTION .......................................... 1-1
   1.1 Objective and Background ........................... 1-1
2. FINAL CAMx/MPI DEVELOPMENT ............................ 2-1
   2.1 PiG Implementation Approach ........................ 2-1
   2.2 CAMx/MPI Status .................................... 2-3
   2.3 CAMx/MPI Speed Performance ......................... 2-3
3. INSTALLING AND USING CAMx/MPI ......................... 3-1
   3.1 Installing CAMx/MPI ................................ 3-1
   3.2 Preparing CAMx/MPI