'Beowulf Cluster' for High-Performance Computing Tasks at the University: A Very Profitable Investment

Total Pages: 16

File Type: pdf, Size: 1020 KB

Informatica 25 (2001) 189-193

'Beowulf cluster' for high-performance computing tasks at the university: A very profitable investment

Manuel J. Galan, Fidel Garcia, Luis Alvarez, Antonio Ocon and Enrique Rubio
CICEI, Centro de Innovacion en Tecnologias de la Informacion, University of Las Palmas de Gran Canaria, Edificio de Arquitectura, Campus Universitario de Tafira Baja, 35017 Las Palmas de Gran Canaria, SPAIN. E-mail: [email protected]

Keywords: Cluster, OSS, Linux, high performance computing, computer commodities

Received: January 24, 2001

We describe a high performance/low price "computer cluster" named Beowulf Cluster. This kind of device was initially developed by T. Sterling and D. Becker (N.A.S.A.) and it is a kind of cluster built primarily out of commodity hardware components, running an OSS (Open Source Software) operating system like Linux or FreeBSD, interconnected by a private high-speed network, dedicated to running high-performance computing tasks. We will show several advantages of such a device, among them a very high performance-price ratio, easy scalability, recycling possibilities of the hardware components and a guarantee of usability/upgradeability in the medium and long-term future. All these advantages make it especially suitable for the university environment and, thus, we describe the implementation of a Beowulf Cluster using commodity equipment dedicated to running high-performance computing tasks at our university, focusing on the different areas of application of this device, which range from production of high quality multimedia content to applications in numerical simulation in engineering.

1 Introduction

1.1 Concurrency and Parallelism

Regarding program execution, there is one very important distinction that needs to be made: the difference between "concurrency" and "parallelism". We will define these two concepts as follows:
• "Concurrency": the parts of a program that can be computed independently.
• "Parallelism": the parallel parts of a program are those "concurrency" parts that are executed on separate processing elements at the same time.
The distinction is very important, because "concurrency" is a property of the program and efficient "parallelism" is a property of the machine. Ideally, "parallel" execution should result in faster performance. The limiting factor in parallel performance is the communication speed (bandwidth) and latency between compute nodes. Many of the common parallel benchmarks are highly parallel, and communication and latency are not the bottleneck. This type of problem can be called "obviously parallel". Other applications are not so simple, and executing "concurrent" parts of the program in "parallel" may actually cause the program to run slower, thus offsetting any performance gains in other "concurrent" parts of the program. In simple terms, the cost of communication time must pay for the savings in computation time, otherwise the "parallel" execution of the "concurrent" part is inefficient.

Now, the task of the programmer is to determine what "concurrent" parts of the program should be executed in "parallel" and what parts should not. The answer to this will determine the efficiency of the application. In a 'so called' perfect parallel computer, the ratio of communication/processing would be equal to one and anything that is "concurrent" could be implemented in "parallel". Unfortunately, real parallel computers, including shared memory machines, do not behave "this well".
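To make the distinction concrete, here is a minimal C sketch (an illustrative example, not code from the paper; the array size, thread count and the choice of the POSIX threads API are assumptions made for the example). The two halves of the sum are "concurrent" parts; running them in two threads on an SMP machine executes that concurrency in "parallel".

/* Minimal illustration of concurrency vs. parallelism (hypothetical example).
 * The two halves of the sum are concurrent (independent); running them in
 * two POSIX threads makes that concurrency parallel on an SMP machine.
 * Build with: cc -pthread example.c */
#include <pthread.h>
#include <stdio.h>

#define N 1000000
static double x[N];

struct part { int lo, hi; double sum; };

static void *partial_sum(void *arg)
{
    struct part *p = arg;
    p->sum = 0.0;
    for (int i = p->lo; i < p->hi; i++)
        p->sum += x[i];
    return NULL;
}

int main(void)
{
    for (int i = 0; i < N; i++)
        x[i] = 1.0;

    struct part a = { 0, N / 2, 0.0 }, b = { N / 2, N, 0.0 };
    pthread_t ta, tb;

    /* The two concurrent parts are executed at the same time. */
    pthread_create(&ta, NULL, partial_sum, &a);
    pthread_create(&tb, NULL, partial_sum, &b);
    pthread_join(ta, NULL);
    pthread_join(tb, NULL);

    printf("total = %f\n", a.sum + b.sum);
    return 0;
}

Whether this actually runs faster than the sequential version depends on the machine and on the communication/synchronization cost, exactly as argued above.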
1.2 Architectures for Parallel Computing

1.2.1 Hardware Architectures

There are three common hardware architectures for parallel computing:
• Shared memory machines, SMP, that communicate through memory, i.e. MPP (Massively Parallel Processors, like the nCube, CM5, Convex SPP, Cray T3D, Cray T3E, etc). This kind of configuration is sustained on dedicated hardware. The main characteristics are a very high bandwidth between the CPUs and memory.
• Local memory machines that communicate by messages, i.e. NOWs (Networks of Workstations) and clusters. In this category each workstation maintains its individuality, however there is a tight integration with the rest of the members of the cluster, so we can say they constitute a new entity known as "The Cluster". In our proposal we will focus on a particular kind of cluster called "Beowulf Cluster" (see [3]).
• Local memory machines that integrate in a loosely knit collaborative network. In this category we can include several collaborative Internet efforts which are able to share the load of a difficult and hard to solve problem among a large number of computers executing an "ad hoc" client program. We can mention SETI@home (see [4]), the Entropia Project (see [5]), etc.

The former classification is not strict, in the sense that it is possible to connect many shared memory machines to create a "hybrid" shared memory machine. These hybrid machines "look" like a single large SMP machine to the user and are often called NUMA (non-uniform memory access). It is also possible to connect SMP machines as local memory compute nodes. The user cannot (at this point) assign a specific task to a specific SMP processor. The user can, however, start two independent processes or a threaded process and expect to see a performance increase over a single CPU system. Lastly, we could add several of these hybrid systems into some collaborative Internet effort, building a super hybrid system.

1.2.2 Software Architectures

In this part we will consider both the "basement" software (API) and the application issues.

1.2.2.1 Software API (Application Programming Interface)

There are basically two ways to "express" concurrency in a program:
• Using messages sent between processors: A message is simple: some data and a destination processor. Common message passing APIs are PVM (see [6]) or MPI (see [7]). Messages require copying data while threads use data in place. The latency and speed at which messages can be copied are the limiting factor with message passing models. The advantage to using messages on an SMP machine, as opposed to threads, is that if you decide to use clusters in the future it is easier to add machines or scale up your application.
• Using operating system threads: They were developed because shared memory SMP designs allowed very fast shared memory communication and synchronization between concurrent parts of a program. In contrast to messages, a large amount of copying can be eliminated with threads. The most common API for threads is the POSIX API. It is difficult to extend threads beyond one SMP machine: it requires NUMA technology that is difficult to implement.
Other methods do exist, but the former two are the most widely used.
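To make the message-passing style concrete, here is a minimal MPI sketch in C (an illustrative example, not code from the paper). A message is exactly "some data and a destination processor": rank 0 sends a small array to rank 1.

/* Minimal MPI message-passing sketch (illustrative, not from the paper).
 * A message is "some data and a destination processor": rank 0 sends,
 * rank 1 receives. Run with e.g. "mpirun -np 2 ./a.out". */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    double data[4] = { 1.0, 2.0, 3.0, 4.0 };

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* destination = rank 1, message tag = 0 */
        MPI_Send(data, 4, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(data, 4, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received %g %g %g %g\n",
               data[0], data[1], data[2], data[3]);
    }

    MPI_Finalize();
    return 0;
}

The same program structure works unchanged whether the two processes run on one SMP box or on two nodes of a cluster, which is precisely the scaling advantage of messages noted above.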
1.2.2.2 Application Issues

In order to run an application in parallel on multiple CPUs, it must be explicitly broken into concurrent parts. There are some tools and compilers that can break up programs, but parallelizing codes is not a "plug and play" operation. Depending on the application, parallelizing code can be easy, extremely difficult, or in some cases impossible due to algorithm dependencies.

1.3 Definition of Cluster

A cluster is a collection of machines connected using a network in such a way that they behave like a single computer. A cluster is used for parallel processing, for load balancing and for fault tolerance. Clustering is a popular strategy for implementing parallel processing applications because it enables companies to leverage the investment already made in PCs and workstations. In addition, it is relatively easy to add new CPUs simply by adding a new PC or workstation to the network.

1.4 The Beowulf Cluster

Beowulf is not a special software package, a new network topology or the latest Linux kernel hack. It is a kind of cluster built primarily out of commodity hardware components, running an OSS (Open Source Software) operating system like Linux or FreeBSD, interconnected by a private high-speed network, dedicated to running high-performance computing tasks. One of the main differences between a Beowulf Cluster and a COW (Cluster of Workstations) is the fact that a Beowulf behaves more like a single machine than like many workstations. The nodes in the cluster don't sit on people's desks; they are dedicated to running cluster jobs. It is usually connected to the outside world through only a single node.

While most distributed computing systems provide general purpose, multi-user environments, the Beowulf distributed computing system is specifically designed for single user workloads typical of high end scientific workstation environments.

Beowulf systems have been constructed from a variety of parts. For the sake of performance some non-commodity components (i.e. produced by a single manufacturer) have been employed. In order to account for the different types of systems and to make discussions about machines a bit easier, the following classification scheme has been proposed:

CLASS I: This class of machines is built entirely from commodity vendor parts. We shall use the "Computer Shopper" certification test to define "commodity off-the-shelf" parts. (Computer Shopper is a one-inch thick monthly magazine/catalog of PC systems and components.) A CLASS I Beowulf is a machine that can be assembled from parts found in at least 3 nationally/globally circulated advertising catalogs.
The advantages of a CLASS I system are: hardware is available from multiple sources (low prices, easy maintenance), no reliance on a single hardware vendor, driver support from the O.S., usually based on standards (SCSI, Ethernet, etc).
On the other hand, the disadvantages of a CLASS I system are: best performance may require CLASS II hardware.

CLASS II: This class is simply any machine that does not pass the Computer Shopper certification test.
The advantages of a CLASS II system are: performance can be quite good.
The disadvantages of a CLASS II system are: driver …

• LAMA's Materials Laboratory at UPV/EHU, running Monte Carlo simulations of phase transitions in condensed matter physics (see [14]).
Recommended publications
  • Multicomputer Cluster
    Multicomputer
    • Multiple (full) computers connected by a network.
    • Distributed memory: each node has its own address space.
    • Access to data on another processor is explicit in the program, expressed by calling functions that send or receive messages.
    • No special operating system is needed; libraries with message-passing functions are enough.
    • Good scalability.
    In this section we discuss network computing, in which the nodes are stand-alone computers that could be connected via a switch, local area network, or the Internet. The main idea is to divide the application into semi-independent parts according to the kind of processing needed. Different nodes on the network can be assigned different parts of the application. This form of network computing takes advantage of the unique capabilities of diverse system architectures. It also maximally leverages potentially idle resources within a large organization. Therefore, unused CPU cycles may be utilized during short periods of time, resulting in bursts of activity followed by periods of inactivity. In what follows, we discuss the utilization of network technology in order to create a computing infrastructure using commodity computers.
    Cluster
    • In the 1990s the field shifted from expensive and specialized parallel machines to the more cost-effective clusters of PCs and workstations.
    • A cluster is a collection of stand-alone computers connected using some interconnection network.
    • Each node in a cluster could be a workstation.
    • It is important for the nodes to have fast processors and a fast network so that the cluster can be used as a distributed system.
    • Cluster workstation components: 1. Fast processor/memory and complete PC hardware. 2. Freely accessible software. 3. High execution speed, low latency.
    The 1990s have witnessed a significant shift from expensive and specialized parallel machines to the more cost-effective clusters of PCs and workstations.
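    As a rough C/MPI sketch of the idea described above, dividing an application into semi-independent parts assigned to different nodes (an illustrative example, not taken from the excerpted text; the problem and its size are arbitrary), each process works on its own share of the data and a single collective call combines the partial results:

    /* Sketch of dividing work among cluster nodes (illustrative example,
     * not from the excerpt). Each MPI process handles its own slice of the
     * problem; only the final combination step needs communication. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        const long total = 100000000;   /* arbitrary problem size */
        int rank, size;
        double local = 0.0, global = 0.0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Each node computes a semi-independent part of the sum. */
        for (long i = rank; i < total; i += size)
            local += 1.0 / (double)(i + 1);

        /* One message-passing step combines the partial results on rank 0. */
        MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("sum of %ld terms = %f\n", total, global);

        MPI_Finalize();
        return 0;
    }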
  • 2.5 Classification of Parallel Computers
    2.5 Classification of Parallel Computers
    2.5.1 Granularity
    In parallel computing, granularity means the amount of computation in relation to communication or synchronisation. Periods of computation are typically separated from periods of communication by synchronization events.
    • fine level (same operations with different data)
      ◦ vector processors
      ◦ instruction level parallelism
      ◦ fine-grain parallelism:
        – Relatively small amounts of computational work are done between communication events
        – Low computation to communication ratio
        – Facilitates load balancing
        – Implies high communication overhead and less opportunity for performance enhancement
        – If granularity is too fine it is possible that the overhead required for communication and synchronization between tasks takes longer than the computation.
    • operation level (different operations simultaneously)
    • problem level (independent subtasks)
      ◦ coarse-grain parallelism:
        – Relatively large amounts of computational work are done between communication/synchronization events
        – High computation to communication ratio
        – Implies more opportunity for performance increase
        – Harder to load balance efficiently
    2.5.2 Hardware: Pipelining (was used in supercomputers, e.g. Cray-1)
    With N elements in the pipeline and L clock cycles per element, the calculation takes about L + N cycles; without pipelining it takes L ∗ N cycles. Example of good code for pipelining:
        do i = 1, k
          z(i) = x(i) + y(i)
        end do
    Vector processors offer fast vector operations (operations on arrays). The previous example is also good for a vector processor (vector addition), but, e.g., recursion is hard to optimise for vector processors. Example: Intel MMX – a simple vector processor.
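    To complement the Fortran fragment above, here is a short illustrative C sketch (not part of the excerpt): the first loop has independent iterations and pipelines/vectorizes well, while the second is a recurrence, so each iteration must wait for the previous result and is hard to vectorize.

    /* Illustrative contrast (not from the excerpt): independent iterations
     * vectorize and pipeline well; a first-order recurrence does not. */
    #include <stdio.h>

    #define K 1024

    int main(void)
    {
        static double x[K], y[K], z[K], s[K];

        for (int i = 0; i < K; i++) { x[i] = i; y[i] = 2.0 * i; }

        /* Good for pipelining / vector processors: z(i) = x(i) + y(i). */
        for (int i = 0; i < K; i++)
            z[i] = x[i] + y[i];

        /* Hard to vectorize: each s[i] depends on s[i-1] (prefix sum). */
        s[0] = x[0];
        for (int i = 1; i < K; i++)
            s[i] = s[i - 1] + x[i];

        printf("z[K-1] = %f, s[K-1] = %f\n", z[K - 1], s[K - 1]);
        return 0;
    }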
  • Cluster Computing: Architectures, Operating Systems, Parallel Processing & Programming Languages
    Cluster Computing Architectures, Operating Systems, Parallel Processing & Programming Languages Author Name: Richard S. Morrison Revision Version 2.4, Monday, 28 April 2003 Copyright © Richard S. Morrison 1998 – 2003 This document is distributed under the GNU General Public Licence [39] Print date: Tuesday, 28 April 2003 Document owner: Richard S. Morrison, [email protected] ✈ +612-9928-6881 Document name: CLUSTER_COMPUTING_THEORY Stored: (\\RSM\FURTHER_RESEARCH\CLUSTER_COMPUTING) Synopsis & Acknowledgements My interest in Supercomputing through the use of clusters has been long standing and was initially sparked by an article in Electronic Design [33] in August 1998 on the Avalon Beowulf Cluster [24]. Between August 1998 and August 1999 I gathered information from websites and parallel research groups. This culminated in September 1999 when I organised the collected material and wove a common thread through the subject matter, producing two handbooks for my own use on cluster computing. Each handbook is of considerable length, which was governed by the wealth of information and research conducted in this area over the last 5 years. The covers of the handbooks are shown in Figure 1-1 below. Figure 1-1 – Author Compiled Beowulf Class 1 Handbooks Through my experimentation using the Linux operating system and the undertaking of the University of Technology, Sydney (UTS) undergraduate subject Operating Systems in Autumn Semester 1999 with Noel Carmody, a systems level focus was developed and is the core element of the material contained in this document. This led to my membership of the IEEE and the IEEE Technical Committee on Parallel Processing, where I am able to gather and contribute information and be kept up to date on the latest issues.
  • SU(3) Gluodynamics on Graphics Processing Units
    Proceedings of the International School-seminar "New Physics and Quantum Chromodynamics at External Conditions", pp. 29 – 33, 3-6 May 2011, Dnipropetrovsk, Ukraine SU(3) GLUODYNAMICS ON GRAPHICS PROCESSING UNITS V. I. Demchik Dnipropetrovsk National University, Dnipropetrovsk, Ukraine SU(3) gluodynamics is implemented on graphics processing units (GPU) with the open cross-platform computational language OpenCL. The architecture of the first open computer cluster for high performance computing on graphics processing units (HGPU) with peak performance over 12 TFlops is described. 1 Introduction Most of the modern achievements in science and technology are closely related to the use of high performance computing. The ever-increasing performance requirements of computing systems cause high rates of their development. Every year, computing power that completely covers the overall performance of all previously existing high-end systems is introduced, according to the popular TOP-500 list of the most powerful (non-distributed) computer systems in the world [1]. The obvious desire to reduce the cost of acquisition and maintenance of computer systems, along with the growing demands on their performance, shifts supercomputing architecture in the direction of heterogeneous systems. This kind of architecture also contains special computational accelerators in addition to traditional central processing units (CPU). One such accelerator is the graphics processing unit (GPU), which is traditionally designed and used primarily for the narrow task of visualization of 2D and 3D scenes in games and applications. The peak performance of modern high-end GPUs (AMD Radeon HD 6970, AMD Radeon HD 5870, nVidia GeForce GTX 580) is about 20 times faster than the peak performance of comparable CPUs (see Fig.
  • Building a Beowulf Cluster
    Building a Beowulf cluster Åsmund Ødegård April 4, 2001
    1 Introduction
    The main part of the introduction is only contained in the slides for this session. Some of the acronyms and names in this paper may be unknown. In Appendix B we include short descriptions for some of them. Most of this is taken from "whatis" [6]
    2 Outline of the installation
    • Install linux on a PC
    • Configure the PC to act as an install-server for the cluster
    • Wire up the network if that isn't done already
    • Install linux on the rest of the nodes
    • Configure one PC, e.g. the install-server, to be a server for your cluster.
    These are the main steps required to build a linux cluster, but each step can be done in many different ways. How you prefer to do it depends mainly on personal taste, though. Therefore, I will translate the given outline into this list:
    • Install Debian GNU/Linux on a PC
    • Install and configure "FAI" on the PC
    • Build the "FAI" boot-floppy
    • Assemble hardware information, and finalize the "FAI" configuration
    • Boot each node with the boot-floppy
    • Install and configure a queue system and software for running parallel jobs on your cluster
    3 Debian
    The choice of Linux distribution is most of all a matter of personal taste. I prefer the Debian distribution for various reasons. So, the first step in the cluster-building process is to pick one of the PCs as an install-server, and install Debian onto it, as follows:
    • Make sure that the computer can boot from cdrom.
  • Beowulf Clusters Make Supercomputing Accessible
    Nor-Tech Contributes to NASA Article: Beowulf Clusters Make Supercomputing Accessible Original article available at NASA Spinoff: https://spinoff.nasa.gov/Spinoff2020/it_1.html
    NASA Technology
    In the Old English epic Beowulf, the warrior Unferth, jealous of the eponymous hero's bravery, openly doubts Beowulf's odds of slaying the monster Grendel that has tormented the Danes for 12 years, promising a "grim grappling" if he dares confront the dreaded march-stepper. A thousand years later, many in the supercomputing world were similarly skeptical of a team of NASA engineers trying to achieve supercomputer-class processing on a cluster of standard desktop computers running a relatively untested open source operating system. "Not only did nobody care, but there were even a number of people hostile to this project," says Thomas Sterling, who led the small team at NASA's Goddard Space Flight Center in the early 1990s. "Because it was different. Because it was completely outside the scope of the supercomputing community at that time." The technology, now known as the Beowulf cluster, would ultimately succeed beyond its inventors' imaginations. In 1993, however, its odds may indeed have seemed long. The U.S. Government, nervous about Japan's high-performance computing effort, had already been pouring money into computer architecture research at NASA and other Federal agencies for more than a decade, and results were frustrating.
    [Photo caption: Thomas Sterling, who co-invented the Beowulf supercomputing cluster at Goddard Space Flight Center, poses with the Naegling cluster at the California Institute of Technology in 1997. Consisting of 120 Pentium Pro processors, Naegling was the first cluster to hit 10 gigaflops of sustained performance.]
  • Improving Tasks Throughput on Accelerators Using OpenCL Command Concurrency
    Improving tasks throughput on accelerators using OpenCL command concurrency A.J. Lázaro-Muñoz1, J.M. González-Linares1, J. Gómez-Luna2, and N. Guil1 1 Dep. of Computer Architecture, University of Málaga, Spain [email protected] 2 Dep. of Computer Architecture and Electronics, University of Córdoba, Spain Abstract A heterogeneous architecture composed of a host and an accelerator must frequently deal with situations where several independent tasks are available to be offloaded onto the accelerator. These tasks can be generated by concurrent applications executing in the host or, in case the host is a node of a computer cluster, by applications running on other cluster nodes that are willing to offload tasks to the accelerator connected to the host. In this work we show that a runtime scheduler that selects the best execution order of a group of tasks on the accelerator can significantly reduce the total execution time of the tasks and, consequently, increase the accelerator use. Our solution is based on a temporal execution model that is able to predict with high accuracy the execution time of a set of concurrent tasks launched on the accelerator. The execution model has been validated in AMD, NVIDIA, and Xeon Phi devices using synthetic benchmarks. Moreover, employing the temporal execution model, a heuristic is proposed which is able to establish a near-optimal tasks execution ordering that significantly reduces the total execution time, including data transfers. The heuristic has been evaluated with five different benchmarks composed of dominant kernel and dominant transfer real tasks. Experiments indicate the heuristic is always able to find an ordering with a better execution time than the average of every possible execution order and, most times, it achieves a near-optimal ordering (very close to the execution time of the best execution order) with a negligible overhead.
  • State-of-the-Art in Parallel Computing with R
    Markus Schmidberger, Martin Morgan, Dirk Eddelbuettel, Hao Yu, Luke Tierney, Ulrich Mansmann State-of-the-art in Parallel Computing with R Technical Report Number 47, 2009 Department of Statistics, University of Munich http://www.stat.uni-muenchen.de State-of-the-art in Parallel Computing with R Markus Schmidberger (Ludwig-Maximilians-Universität München, Germany), Martin Morgan (Fred Hutchinson Cancer Research Center, WA, USA), Dirk Eddelbuettel (Debian Project, Chicago, IL, USA), Hao Yu (University of Western Ontario, ON, Canada), Luke Tierney (University of Iowa, IA, USA), Ulrich Mansmann (Ludwig-Maximilians-Universität München, Germany) Abstract R is a mature open-source programming language for statistical computing and graphics. Many areas of statistical research are experiencing rapid growth in the size of data sets. Methodological advances drive increased use of simulations. A common approach is to use parallel computing. This paper presents an overview of techniques for parallel computing with R on computer clusters, on multi-core systems, and in grid computing. It reviews sixteen different packages, comparing them on their state of development, the parallel technology used, as well as on usability, acceptance, and performance. Two packages (snow, Rmpi) stand out as particularly useful for general use on computer clusters. Packages for grid computing are still in development, with only one package currently available to the end user. For multi-core systems four different packages exist, but a number of issues pose challenges to early adopters. The paper concludes with ideas for further developments in high performance computing with R. Example code is available in the appendix. Keywords: R, high performance computing, parallel computing, computer cluster, multi-core systems, grid computing, benchmark.
  • Spark on Hadoop vs MPI/OpenMP on Beowulf
    Procedia Computer Science Volume 53, 2015, Pages 121–130 2015 INNS Conference on Big Data Big Data Analytics in the Cloud: Spark on Hadoop vs MPI/OpenMP on Beowulf Jorge L. Reyes-Ortiz1, Luca Oneto2, and Davide Anguita1 1 DIBRIS, University of Genoa, Via Opera Pia 13, I-16145, Genoa, Italy ([email protected], [email protected]) 2 DITEN, University of Genoa, Via Opera Pia 11A, I-16145, Genoa, Italy ([email protected]) Abstract One of the biggest challenges of the current big data landscape is our inability to process vast amounts of information in a reasonable time. In this work, we explore and compare two distributed computing frameworks implemented on commodity cluster architectures: MPI/OpenMP on Beowulf that is high-performance oriented and exploits multi-machine/multi-core infrastructures, and Apache Spark on Hadoop which targets iterative algorithms through in-memory computing. We use the Google Cloud Platform service to create virtual machine clusters, run the frameworks, and evaluate two supervised machine learning algorithms: KNN and Pegasos SVM. Results obtained from experiments with a particle physics data set show MPI/OpenMP outperforms Spark by more than one order of magnitude in terms of processing speed and provides more consistent performance. However, Spark shows better data management infrastructure and the possibility of dealing with other aspects such as node failure and data replication. Keywords: Big Data, Supervised Learning, Spark, Hadoop, MPI, OpenMP, Beowulf, Cloud, Parallel Computing 1 Introduction The information age brings along an explosion of big data from multiple sources in every aspect of our lives: human activity signals from wearable sensors, experiments from particle discovery research and stock market data systems are only a few examples [48].
  • Towards a Scalable File System on Computer Clusters Using Declustering
    Journal of Computer Science 1 (3): 363-368, 2005 ISSN 1549-3636 © Science Publications, 2005 Towards a Scalable File System on Computer Clusters Using Declustering Vu Anh Nguyen, Samuel Pierre and Dougoukolo Konaré Department of Computer Engineering, Ecole Polytechnique de Montreal, C.P. 6079, succ. Centre-ville, Montreal, Quebec, H3C 3A7 Canada Abstract: This study addresses the scalability issues involving file systems as critical components of computer clusters, especially for commercial applications. Given that wide striping is an effective means of achieving scalability as it warrants good load balancing and allows node cooperation, we choose to implement a new data distribution scheme in order to achieve the scalability of computer clusters. We suggest combining both wide striping and replication techniques using a new data distribution technique based on "chained declustering". Thus, we suggest a complete architecture, using a cluster of clusters, whose performance is not limited by the network and can be adjusted with one-node precision. In addition, update costs are limited as it is not necessary to redistribute data on the existing nodes every time the system is expanded. The simulations indicate that our data distribution technique and our read algorithm balance the load equally amongst all the nodes of the original cluster and the additional ones. Therefore, the scalability of the system is close to the ideal scenario: once the size of the original cluster is well defined, the total number of nodes in the system is no longer limited, and the performance increases linearly. Key words: Computer Cluster, Scalability, File System
    INTRODUCTION
    … serve an increasing number of clients.
  • Choosing the Right Hardware for the Beowulf Clusters Considering Price/Performance Ratio
    CHOOSING THE RIGHT HARDWARE FOR THE BEOWULF CLUSTERS CONSIDERING PRICE/PERFORMANCE RATIO Muhammet ÜNAL Selma YÜNCÜ e-mail: [email protected] e-mail: [email protected] Gazi University, Faculty of Engineering and Architecture, Department of Electrical & Electronics Engineering, 06570, Maltepe, Ankara, Turkey Key words: Computer Hardware Architecture, Parallel Computers, Distributed Computing, Networks of Workstations (NOW), Beowulf, Cluster, Benchmark
    ABSTRACT
    In this study, reasons for choosing Beowulf Clusters from different types of supercomputers are given. Then Beowulf systems are briefly summarized, and some design information about the hardware system of the High Performance Computing (HPC) Laboratory that will be built in the Gazi University Electrical and Electronics Engineering department is given.
    I. INTRODUCTION
    The world of modern computing potentially offers many helpful methods and tools to scientists and engineers, but the fast pace of change in computer hardware, software, and algorithms often makes practical use of the newest computing technology difficult. Just two generations ago based on Moore's law [1], a plethora of vector supercomputers, non-scalable multiprocessors, and MPP …
    Scalability requires:
    • Functionality and Performance: The scaled-up system should provide more functionality or better performance.
    • Scaling in Cost: The cost paid for scaling up must be reasonable.
    • Compatibility: The same components, including hardware, system software, and application software, should still be usable with little change.
    II. SYSTEMS PERFORMANCE ATTRIBUTES
    Basic performance attributes are summarized in Table 1. Machine size n is the number of processors in a parallel computer. Workload W is the number of computation operations in a program. Ppeak is the peak speed of a processor.
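    As an illustration of how these attributes typically combine (standard textbook relations, assumed here rather than quoted from the paper's Table 1), the aggregate peak speed of the machine and an ideal lower bound on execution time can be written as:

    % Illustrative relations built from the attributes defined above
    % n      : machine size (number of processors)
    % W      : workload (number of computation operations)
    % P_peak : peak speed of a single processor
    \[
      P_{\mathrm{total}} = n \, P_{\mathrm{peak}}, \qquad
      T_{\mathrm{ideal}} = \frac{W}{n \, P_{\mathrm{peak}}}
    \]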
  • Programmable Interconnect Control Adaptive to Communication Pattern of Applications
    Title: Programmable Interconnect Control Adaptive to Communication Pattern of Applications
    Author(s): 髙橋, 慧智
    Citation:
    Issue Date:
    Text Version: ETD
    URL: https://doi.org/10.18910/72595
    DOI: 10.18910/72595
    rights:
    Note:
    Osaka University Knowledge Archive : OUKA https://ir.library.osaka-u.ac.jp/ Osaka University
    Programmable Interconnect Control Adaptive to Communication Pattern of Applications Submitted to Graduate School of Information Science and Technology Osaka University January 2019 Keichi TAKAHASHI
    This work is dedicated to my parents and my wife
    List of Publications by the Author
    Journal
    [1] Keichi Takahashi, S. Date, D. Khureltulga, Y. Kido, H. Yamanaka, E. Kawai, and S. Shimojo, "UnisonFlow: A Software-Defined Coordination Mechanism for Message-Passing Communication and Computation", IEEE Access, vol. 6, no. 1, pp. 23 372–23 382, 2018. DOI: 10.1109/ACCESS.2018.2829532.
    [2] A. Misawa, S. Date, Keichi Takahashi, T. Yoshikawa, M. Takahashi, M. Kan, Y. Watashiba, Y. Kido, C. Lee, and S. Shimojo, "Dynamic Reconfiguration of Computer Platforms at the Hardware Device Level for High Performance Computing Infrastructure as a Service", Cloud Computing and Service Science. CLOSER 2017. Communications in Computer and Information Science, vol. 864, pp. 177–199, 2018. DOI: 10.1007/978-3-319-94959-8_10.
    [3] S. Date, H. Abe, D. Khureltulga, Keichi Takahashi, Y. Kido, Y. Watashiba, P. Uchupala, K. Ichikawa, H. Yamanaka, E. Kawai, and S. Shimojo, "SDN-accelerated HPC Infrastructure for Scientific Research", International Journal of Information Technology, vol. 22, no. 1, pp. 789–796, 2016.
    International Conference (with review)
    [1] Y. Takigawa, Keichi Takahashi, S. Date, Y. Kido, and S. Shimojo, "A Traffic Simulator with Intra-node Parallelism for Designing High-performance Interconnects", in 2018 International Conference on High Performance Computing & Simulation (HPCS 2018), Jul.