Application of Parallel Virtual Machine Framework to the Strong Prime Problem


Intern. J. Computer Math., 2002, Vol. 79(7), pp. 797–806

DER-CHUYAN LOU,* CHIA-LONG WU and RONG-YI OU
Department of Electrical Engineering, Chung Cheng Institute of Technology, National Defense University, Tahsi, Taoyuan 33509, Taiwan

(Received 9 April 2001)

This paper uses the well-discussed PVM (Parallel Virtual Machine) software with several personal computers, adopting the widespread Microsoft Windows '98 operating system as the operation platform, to construct a heterogeneous PC cluster. Drawing on related research in PC cluster systems and cluster computing theory, we apply our heterogeneous PC cluster computing system to generate more secure parameters for public-key cryptosystems such as RSA. Because each parameter is constrained by its underlying mathematical theory, enormous computation power is needed to achieve good performance in generating these parameters. In this paper, we combine heterogeneous PCs with the PVM software to generate cryptosystem parameters that conform to today's safety specifications and requirements. We generate these data in practice to prove that a computer cluster can effectively accumulate enormous computation power, and we then demonstrate this cluster computing application by finding the strong primes needed in some public-key cryptosystems.

Keywords: Parallel virtual machine; Cluster computing; Cryptography; Primality test; Strong prime

C.R. Categories: E.3, F.1.2, I.1.1.2, K.6.5

1. INTRODUCTION

In this section, the Parallel Virtual Machine (PVM) system, which is based on the message-passing model, is introduced. Message-passing parallel programs can be designed for the different machines of an integrated system based on each machine's own information and data formats, allowing the different machines to communicate. Based on this property, PVM [1, 2] can connect different working platforms to each other and combine them into one virtual machine with strong computing power, even though each machine may have a different specification; this is where the name "PVM" comes from.

In 1989, a parallel computation project called PVM was started at Oak Ridge National Laboratory [3]. This project was expected to offer a parallel computing environment with heterogeneous and general properties, which not only supports multi-party protocols effectively but can also be adapted to distributed computation algorithms. Although PVM was regarded as the most popular distributed computing system in 1992, with the largest user population, this does not mean that PVM can finish all jobs automatically. PVM [4] only provides an environment that makes parallel programs executable. Program designers must rely on manual processes and clearly specify the program instructions wherever a parallel computation task is needed. PVM does not have the ability to distribute instructions and data automatically; that is, it does not offer an automatic parallelization mechanism. PVM provides a software environment for message passing between heterogeneous computers.

*Corresponding Author. Fax: 886-3-3801407; E-mail: [email protected]
ISSN 0020-7160 print; ISSN 1029-0265 online © 2002 Taylor & Francis Ltd
DOI: 10.1080/00207160290029228
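To make the message-passing workflow concrete, the following is a minimal sketch, written here for illustration rather than taken from the paper, of a PVM master task in C. It spawns copies of a hypothetical worker executable across the virtual machine and collects one integer result from each; the executable name "worker" and the message tag are assumptions.

    #include <stdio.h>
    #include <pvm3.h>                     /* PVM 3 message-passing API */

    #define NWORKERS   3
    #define TAG_RESULT 1

    int main(void)
    {
        int tids[NWORKERS];               /* task ids of the spawned workers */
        int i, result;

        pvm_mytid();                      /* enroll this process in the PVM */

        /* Spawn NWORKERS copies of the (hypothetical) "worker" binary
           anywhere in the virtual machine; PVM picks the hosts. */
        int started = pvm_spawn("worker", (char **)0, PvmTaskDefault,
                                "", NWORKERS, tids);

        /* Collect one integer from each worker, in arrival order.  A worker
           would send it with pvm_initsend/pvm_pkint/pvm_send(parent, tag). */
        for (i = 0; i < started; i++) {
            pvm_recv(-1, TAG_RESULT);     /* -1 matches any sender */
            pvm_upkint(&result, 1, 1);    /* unpack a single int   */
            printf("result %d: %d\n", i, result);
        }

        pvm_exit();                       /* leave the virtual machine */
        return 0;
    }

The point of the sketch is the division of labour described in this section: the master remains an ordinary sequential program, and parallelism appears only where the programmer explicitly spawns tasks and exchanges messages.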
In a PVM main program, users must define all the parallel procedures, and they must understand that even though PVM is a parallel computation interface, the controlling main program still runs in a sequential pattern. Its process control allows a PVM process to be interrupted and become a Unix or Win32 process (which has no parallel capability), or to become a PVM process in the general sense. Generally speaking, PVM is still a sequential control procedure.

In this paper, we utilize the well-discussed PVM software, which uses the message-passing model as its interface, together with personal computers running the Windows '98 operating system, to build an experimental personal computer cluster. The PVM software can construct a framework across different computer platforms. Different computers are combined in this paper into a powerful virtual computation machine to satisfy the heavy computation demand of computer cryptosystems. We use three PCs of different ranks to demonstrate the heterogeneous property and to show that ordinary personal computers can also accumulate adequate computation power for solving the strong prime problem. The specifications of these computers are shown in Table I.

TABLE I  System specifications

  Name        Specification
  D-Celeron   CPU: Celeron-450 × 2, Memory: 128 MB
  Celeron     CPU: Celeron-300,     Memory: 64 MB
  Pentium     CPU: Pentium-75,      Memory: 48 MB

The rest of the paper is organized as follows. Section 2 focuses on the strong prime problem and the bottleneck of the RSA public-key cryptosystem, as well as the popular topic of cluster computing. In Section 3, we introduce and discuss several theorems for primality testing. Sections 4 and 5 present our experimental design and performance results using primality test algorithms for RSA public keys. Finally, Section 6 concludes with our research contributions and future work.

2. THE STRONG PRIME PROBLEM

As we know, number theory plays an important role in public-key cryptographic systems [5]. The prime number is an essential object in number theory, and the construction of strong primes as the main security parameter of some public-key cryptosystems has been well discussed. Here we discuss the RSA public-key cryptosystem and its bottleneck as well as the strong prime number problem; we then concentrate on the concepts of cluster computing and the PVM system.

2.1. Bottleneck of the RSA Public-key Cryptosystem

In 1978, three MIT professors, Rivest, Shamir, and Adleman, introduced a public-key cryptosystem that bases its security on the modular exponential function and the difficulty of factoring large numbers; it is known as the RSA public-key cryptosystem [6]. The RSA algorithm is widely used in public-key cryptosystems [7]. Although the public-key cryptosystem has its advantages, its disadvantages do exist. In particular, the encryption/decryption operations are quite complex and demand enormous computation capability. Compare the RSA public-key cryptosystem with the DES (Data Encryption Standard) secret-key cryptosystem: a DES hardware chip can reach a speed of approximately 45 Megabits per second, while the RSA cryptosystem reaches only about 50 Kilobits per second. This difference of roughly a factor of 1000 is enough to demonstrate the bottleneck of the RSA public-key cryptosystem.
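The operation behind this cost gap is modular exponentiation, which RSA performs on every block during both encryption and decryption. As a minimal sketch, assuming a GCC/Clang-style unsigned __int128 for the intermediate products (real RSA needs a multiple-precision library for 512–1024-bit moduli), square-and-multiply computes b^e mod N in O(log e) modular multiplications:

    #include <stdint.h>

    /* Square-and-multiply modular exponentiation: returns (base^exp) mod n.
       Toy version limited to 64-bit moduli; the 128-bit intermediate type
       is a GCC/Clang extension. */
    static uint64_t modexp(uint64_t base, uint64_t exp, uint64_t n)
    {
        uint64_t result = 1;
        base %= n;
        while (exp > 0) {
            if (exp & 1)   /* multiply in the current bit of the exponent */
                result = (uint64_t)(((unsigned __int128)result * base) % n);
            base = (uint64_t)(((unsigned __int128)base * base) % n);
            exp >>= 1;
        }
        return result;
    }

Even with this logarithmic algorithm, a 1024-bit exponent still costs on the order of a thousand multiplications of 1024-bit numbers per block, which is the kind of workload behind the throughput gap quoted above.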
Nowadays, the DES cryptosystem is no longer secure; its major safety concern comes from Wiener's [8] assumption (based on a known-plaintext attack). Because such systems are vulnerable to a shortcut attack, they must use key sizes substantially greater than those required for comparable levels of security with traditional single-key methods. The AES [9] now has its secret-key length extended to 128–192 bits, and the RSA cryptosystem is likewise recommended to extend its public key from 512 bits to 1024 bits to maintain its safety; the computation capability we need is therefore enormously increased.

2.2. Strong Prime Number

The RSA cryptosystem is a block cipher that processes the input one block of elements at a time and produces an output block for each input block. Plaintext is encrypted in blocks, and the binary value of each block is no greater than some number N. Assume we have two given prime numbers p and q, such that N can be calculated as N = pq. By Euler's theorem we then have φ(pq) = (p − 1)(q − 1) and d ≡ e^{-1} (mod φ(N)). That is, ed is of the form ed = kφ(N) + 1, and therefore ed ≡ 1 (mod φ(N)).

According to the statement above, the RSA cryptosystem builds its security on the complexity of the factorization problem. It is obvious that, for the public key (e, N) of the RSA cryptosystem, if N can be successfully factorized into the factors p and q, then the trapdoor T = φ(N) = (p − 1)(q − 1) and the decryption key d, on which the decryption process depends, have no place to hide. The decryption key d can then no longer keep itself a "secret" key; that is, there is no security whatsoever. Although it has not been identified or proved that breaking the RSA public-key cryptosystem is exactly as difficult as factorizing the number N, it is generally believed that the two problems are equally hard. Therefore, the parameters of the RSA cryptosystem should be chosen most prudently and carefully.

Since the RSA cryptosystem builds its security on the complexity of factoring the number N, the prime factors of N should satisfy the strong prime property to ensure that factoring remains computationally infeasible. The strong prime property is introduced as follows. Let r1, s1, r2, s2 be four extremely large prime numbers; we call them "simple primes". Let x | y denote that y is divisible by x. If we have

  r1 | p1 − 1,  s1 | p1 + 1,  r2 | p2 − 1,  s2 | p2 + 1,

then such p1, p2 are called "complex primes". Assembling these steps one level further, if

  p1 | p − 1,  p2 | p + 1,

then p is a so-called "strong prime" [10]. The structure of a strong prime is shown in Figure 1. Obviously, any general prime number can also be called a simple prime.
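To illustrate the chain of divisibility conditions, here is a small hypothetical sketch in C that checks the strong prime structure for toy values; it uses trial division as its primality test, so it is only a demonstration of the definition, far below the extremely large primes the construction assumes.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Trial-division primality test: adequate for toy numbers only. */
    static bool is_prime(uint64_t n)
    {
        if (n < 2) return false;
        for (uint64_t d = 2; d * d <= n; d++)
            if (n % d == 0) return false;
        return true;
    }

    /* divides(x, y) is true when x | y, i.e. y is divisible by x. */
    static bool divides(uint64_t x, uint64_t y) { return y % x == 0; }

    /* p satisfies the structure when a | p-1 and b | p+1, with p, a, b
       all prime.  Applied once this certifies a "complex prime";
       applied again one level up, a "strong prime". */
    static bool has_structure(uint64_t p, uint64_t a, uint64_t b)
    {
        return is_prime(p) && is_prime(a) && is_prime(b) &&
               divides(a, p - 1) && divides(b, p + 1);
    }

    int main(void)
    {
        /* Toy chain: r1=5 | 11-1 and s1=3 | 11+1, so p1 = 11 is "complex";
           r2=2 | 3-1  and s2=2 | 3+1,  so p2 = 3  is "complex";
           p1=11 | 23-1 and p2=3 | 23+1, so p = 23 has the strong structure. */
        printf("p1 = 11: %s\n", has_structure(11, 5, 3) ? "ok" : "fail");
        printf("p2 =  3: %s\n", has_structure(3, 2, 2) ? "ok" : "fail");
        printf("p  = 23: %s\n", has_structure(23, 11, 3) ? "ok" : "fail");
        return 0;
    }

In practice, each level of the chain uses primes of hundreds of bits, so the trial division above must be replaced by a probabilistic primality test such as those discussed in Section 3.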
Recommended publications
  • A Case for High Performance Computing with Virtual Machines
A Case for High Performance Computing with Virtual Machines
Wei Huang†, Jiuxing Liu‡, Bulent Abali‡, Dhabaleswar K. Panda†
† Computer Science and Engineering, The Ohio State University, Columbus, OH 43210
‡ IBM T. J. Watson Research Center, 19 Skyline Drive, Hawthorne, NY 10532

ABSTRACT: Virtual machine (VM) technologies are experiencing a resurgence in both industry and research communities. VMs offer many desirable features such as security, ease of management, OS customization, performance isolation, checkpointing, and migration, which can be very beneficial to the performance and the manageability of high performance computing (HPC) applications. However, very few HPC applications are currently running in a virtualized environment due to the performance overhead of virtualization. Further, using VMs for HPC also introduces additional challenges such as management and distribution of OS images. In this paper we present a case for HPC with virtual machines.

VMs were first introduced in the 1960s [9], but are experiencing a resurgence in both industry and research communities. A VM environment provides virtualized hardware interfaces to VMs through a Virtual Machine Monitor (VMM) (also called hypervisor). VM technologies allow running different guest VMs in a physical box, with each guest VM possibly running a different guest operating system. They can also provide secure and portable environments to meet the demanding requirements of computing resources in modern computing systems. Recently, network interconnects such as InfiniBand [16], Myrinet [24] and Quadrics [31] are emerging, which provide very low latency (less than 5 µs) and very high bandwidth (multiple Gbps).
  • 2.5 Classification of Parallel Computers
2.5 Classification of Parallel Computers

2.5.1 Granularity

In parallel computing, granularity means the amount of computation in relation to communication or synchronisation. Periods of computation are typically separated from periods of communication by synchronization events.

• fine level (same operations with different data)
  ◦ vector processors
  ◦ instruction level parallelism
  ◦ fine-grain parallelism:
    – Relatively small amounts of computational work are done between communication events
    – Low computation to communication ratio
    – Facilitates load balancing
    – Implies high communication overhead and less opportunity for performance enhancement
    – If granularity is too fine, it is possible that the overhead required for communication and synchronization between tasks takes longer than the computation itself.
• operation level (different operations simultaneously)
• problem level (independent subtasks)
  ◦ coarse-grain parallelism:
    – Relatively large amounts of computational work are done between communication/synchronization events
    – High computation to communication ratio
    – Implies more opportunity for performance increase
    – Harder to load balance efficiently

2.5.2 Hardware: Pipelining

(was used in supercomputers, e.g. Cray-1) With N elements in the pipeline and L clock cycles per element, a pipelined calculation takes L + N cycles; without pipelining it takes L * N cycles. Example of code that pipelines well:

    do i = 1, k
      z(i) = x(i) + y(i)
    end do

Vector processors provide fast vector operations (operations on arrays). The previous example is also good for a vector processor (vector addition), but recursion, for example, is hard to optimise for vector processors. Example: Intel MMX – a simple vector processor.
  • SU(3) Gluodynamics on Graphics Processing Units
Proceedings of the International School-seminar "New Physics and Quantum Chromodynamics at External Conditions", pp. 29–33, 3–6 May 2011, Dnipropetrovsk, Ukraine

SU(3) GLUODYNAMICS ON GRAPHICS PROCESSING UNITS
V. I. Demchik
Dnipropetrovsk National University, Dnipropetrovsk, Ukraine

SU(3) gluodynamics is implemented on graphics processing units (GPU) with the open cross-platform computational language OpenCL. The architecture of the first open computer cluster for high performance computing on graphics processing units (HGPU), with peak performance over 12 TFlops, is described.

1 Introduction
Most of the modern achievements in science and technology are closely related to the use of high performance computing. The ever-increasing performance requirements of computing systems cause high rates of their development. Every year, computing power that completely covers the overall performance of all existing high-end systems reached before is introduced, according to the popular TOP-500 list of the most powerful (non-distributed) computer systems in the world [1]. The obvious desire to reduce the cost of acquiring and maintaining computer systems, along with the growing demands on their performance, shifts supercomputing architecture in the direction of heterogeneous systems. This kind of architecture contains special computational accelerators in addition to traditional central processing units (CPU). One such accelerator is the graphics processing unit (GPU), traditionally designed and used primarily for the narrow task of visualizing 2D and 3D scenes in games and applications. The peak performance of modern high-end GPUs (AMD Radeon HD 6970, AMD Radeon HD 5870, nVidia GeForce GTX 580) is about 20 times higher than the peak performance of comparable CPUs (see Fig.
  • Exploring Massive Parallel Computation with GPU
Exploring Massive Parallel Computation with GPU
Ian Bond, Massey University, Auckland, New Zealand
2011 Sagan Exoplanet Workshop, Pasadena, July 25–29 2011

Assumptions/Purpose: You are all involved in microlensing modelling and you have (or are working on) your own code. This lecture shows how to get started on getting code to run on a GPU; then it's over to you.

Outline: 1. Need for parallelism  2. Graphical Processor Units  3. Gravitational Microlensing Modelling

Parallel Computing: Parallel computing is the use of multiple computers, or computers with multiple internal processors, to solve a problem at a greater computational speed than using a single computer (Wilkinson 2002). How does one achieve parallelism?

Grand Challenge Problems: A grand challenge problem is one that cannot be solved in a reasonable amount of time with today's computers. Examples: modelling large DNA structures; global weather forecasting; the N-body problem (N very large); brain simulation. Has microlensing
  • dCUDA: Hardware Supported Overlap of Computation and Communication
dCUDA: Hardware Supported Overlap of Computation and Communication
Tobias Gysi, Jeremia Bär, Torsten Hoefler
Department of Computer Science, ETH Zurich

Abstract—Over the last decade, CUDA and the underlying GPU hardware architecture have continuously gained popularity in various high-performance computing application domains such as climate modeling, computational chemistry, or machine learning. Despite this popularity, we lack a single coherent programming model for GPU clusters. We therefore introduce the dCUDA programming model, which implements device-side remote memory access with target notification. To hide instruction pipeline latencies, CUDA programs over-decompose the problem and over-subscribe the device by running many more threads than there are hardware execution units. Whenever a thread stalls, the hardware scheduler immediately proceeds with the execution of another thread ready for execution. This latency hiding technique is key to make best use of the available hardware.

… utilization of the costly compute and network hardware. To mitigate this problem, application developers can implement manual overlap of computation and communication [23], [27]. In particular, there exist various approaches [13], [22] to overlap the communication with the computation on an inner domain that has no inter-node data dependencies. However, these code transformations significantly increase code complexity, which results in reduced real-world applicability. High-performance system design often involves trading off sequential performance against parallel throughput. The architectural difference between host and device processors perfectly showcases the two extremes of this design space.
  • Improving Tasks Throughput on Accelerators Using OpenCL Command Concurrency
Improving tasks throughput on accelerators using OpenCL command concurrency
A.J. Lázaro-Muñoz¹, J.M. González-Linares¹, J. Gómez-Luna², and N. Guil¹
¹ Dep. of Computer Architecture, University of Málaga, Spain
² Dep. of Computer Architecture and Electronics, University of Córdoba, Spain

Abstract: A heterogeneous architecture composed of a host and an accelerator must frequently deal with situations where several independent tasks are available to be offloaded onto the accelerator. These tasks can be generated by concurrent applications executing in the host or, in case the host is a node of a computer cluster, by applications running on other cluster nodes that are willing to offload tasks onto the accelerator connected to the host. In this work we show that a runtime scheduler that selects the best execution order of a group of tasks on the accelerator can significantly reduce the total execution time of the tasks and, consequently, increase the accelerator's utilization. Our solution is based on a temporal execution model that is able to predict with high accuracy the execution time of a set of concurrent tasks launched on the accelerator. The execution model has been validated on AMD, NVIDIA, and Xeon Phi devices using synthetic benchmarks. Moreover, employing the temporal execution model, a heuristic is proposed which is able to establish a near-optimal task execution ordering that significantly reduces the total execution time, including data transfers. The heuristic has been evaluated with five different benchmarks composed of dominant-kernel and dominant-transfer real tasks. Experiments indicate the heuristic is always able to find an ordering with a better execution time than the average of every possible execution order and, most times, it achieves a near-optimal ordering (very close to the execution time of the best execution order) with negligible overhead.
  • On the Virtualization of CUDA Based GPU Remoting on ARM and X86 Machines in the GVirtuS Framework
On the Virtualization of CUDA Based GPU Remoting on ARM and X86 Machines in the GVirtuS Framework
Raffaele Montella · Giulio Giunta · Giuliano Laccetti · Marco Lapegna · Carlo Palmieri · Carmine Ferraro · Valentina Pelliccia · Cheol-Ho Hong · Ivor Spence · Dimitrios S. Nikolopoulos
Published in: International Journal of Parallel Programming, 45(5), 1142–1163 (2017). https://doi.org/10.1007/s10766-016-0462-1 (peer reviewed version)
© 2016 Springer Verlag. The final publication is available at Springer via http://dx.doi.org/10.1007/s10766-016-0462-1
  • State-Of-The-Art in Parallel Computing with R
State-of-the-art in Parallel Computing with R
Markus Schmidberger (Ludwig-Maximilians-Universität München, Germany), Martin Morgan (Fred Hutchinson Cancer Research Center, WA, USA), Dirk Eddelbuettel (Debian Project, Chicago, IL, USA), Hao Yu (University of Western Ontario, ON, Canada), Luke Tierney (University of Iowa, IA, USA), Ulrich Mansmann (Ludwig-Maximilians-Universität München, Germany)
Technical Report Number 47, 2009, Department of Statistics, University of Munich, http://www.stat.uni-muenchen.de

Abstract: R is a mature open-source programming language for statistical computing and graphics. Many areas of statistical research are experiencing rapid growth in the size of data sets. Methodological advances drive increased use of simulations. A common approach is to use parallel computing. This paper presents an overview of techniques for parallel computing with R on computer clusters, on multi-core systems, and in grid computing. It reviews sixteen different packages, comparing them on their state of development, the parallel technology used, as well as on usability, acceptance, and performance. Two packages (snow, Rmpi) stand out as particularly useful for general use on computer clusters. Packages for grid computing are still in development, with only one package currently available to the end user. For multi-core systems four different packages exist, but a number of issues pose challenges to early adopters. The paper concludes with ideas for further developments in high performance computing with R. Example code is available in the appendix.

Keywords: R, high performance computing, parallel computing, computer cluster, multi-core systems, grid computing, benchmark.
  • HPVM: Heterogeneous Parallel Virtual Machine
HPVM: Heterogeneous Parallel Virtual Machine
Maria Kotsifakou*, Prakalp Srivastava*, Matthew D. Sinclair, Rakesh Komuravelli, Vikram Adve, Sarita Adve
Department of Computer Science, University of Illinois at Urbana-Champaign (Rakesh Komuravelli: Qualcomm Technologies Inc.)

Abstract: We propose a parallel program representation for heterogeneous systems, designed to enable performance portability across a wide range of popular parallel hardware, including GPUs, vector instruction sets, multicore CPUs and potentially FPGAs. Our representation, which we call HPVM, is a hierarchical dataflow graph with shared memory and vector instructions. HPVM supports three important capabilities for programming heterogeneous systems: a compiler intermediate representation (IR), a virtual instruction set (ISA), and a basis for runtime scheduling; previous systems focus on only one of these capabilities. As a compiler IR, HPVM aims to enable effective code generation and optimization for heterogeneous systems. … hardware, and that runtime scheduling policies can make use of both program and runtime information to exploit the flexible compilation capabilities. Overall, we conclude that the HPVM representation is a promising basis for achieving performance portability and for implementing parallelizing compilers for heterogeneous parallel systems.

CCS Concepts: • Computer systems organization → Heterogeneous (hybrid) systems
Keywords: Virtual ISA, Compiler, Parallel IR, Heterogeneous Systems, GPU, Vector SIMD

1 Introduction
  • Towards a Scalable File System on Computer Clusters Using Declustering
Journal of Computer Science 1 (3): 363–368, 2005, ISSN 1549-3636, © Science Publications, 2005

Towards a Scalable File System on Computer Clusters Using Declustering
Vu Anh Nguyen, Samuel Pierre and Dougoukolo Konaré
Department of Computer Engineering, Ecole Polytechnique de Montreal, C.P. 6079, succ. Centre-ville, Montreal, Quebec, H3C 3A7 Canada

Abstract: This study addresses the scalability issues involving file systems as critical components of computer clusters, especially for commercial applications. Given that wide striping is an effective means of achieving scalability as it warrants good load balancing and allows node cooperation, we choose to implement a new data distribution scheme in order to achieve the scalability of computer clusters. We suggest combining both wide striping and replication techniques using a new data distribution technique based on "chained declustering". Thus, we suggest a complete architecture, using a cluster of clusters, whose performance is not limited by the network and can be adjusted with one-node precision. In addition, update costs are limited as it is not necessary to redistribute data on the existing nodes every time the system is expanded. The simulations indicate that our data distribution technique and our read algorithm balance the load equally amongst all the nodes of the original cluster and the additional ones. Therefore, the scalability of the system is close to the ideal scenario: once the size of the original cluster is well defined, the total number of nodes in the system is no longer limited, and the performance increases linearly.

Key words: Computer Cluster, Scalability, File System

INTRODUCTION
… serve an increasing number of clients.
  • Productive High Performance Parallel Programming with Auto-Tuned Domain-Specific Embedded Languages
Productive High Performance Parallel Programming with Auto-tuned Domain-Specific Embedded Languages
By Shoaib Ashraf Kamil
A dissertation submitted in partial satisfaction of the requirements for the degree of Doctor of Philosophy in Computer Science in the Graduate Division of the University of California, Berkeley
Committee in charge: Professor Armando Fox, Co-Chair; Professor Katherine Yelick, Co-Chair; Professor James Demmel; Professor Berend Smit
Fall 2012
Copyright © 2012 Shoaib Kamil.

Abstract: As the complexity of machines and architectures has increased, performance tuning has become more challenging, leading to the failure of general compilers to generate the best possible optimized code. Expert performance programmers can often hand-write code that outperforms compiler-optimized low-level code by an order of magnitude. At the same time, the complexity of programs has also increased, with modern programs built on a variety of abstraction layers to manage complexity, yet these layers hinder efforts at optimization. In fact, it is common to lose one or two additional orders of magnitude in performance when going from a low-level language such as Fortran or C to a high-level language like Python, Ruby, or Matlab. General purpose compilers are limited by the inability of program analysis to determine programmer intent, as well as the lack of detailed performance models that always determine the best executable code for a given computation and architecture.
  • Programmable Interconnect Control Adaptive to Communication Pattern of Applications
Title: Programmable Interconnect Control Adaptive to Communication Pattern of Applications
Author: 髙橋, 慧智 (Keichi Takahashi)
Text Version: ETD
DOI: https://doi.org/10.18910/72595
Osaka University Knowledge Archive (OUKA): https://ir.library.osaka-u.ac.jp/

Programmable Interconnect Control Adaptive to Communication Pattern of Applications
Submitted to Graduate School of Information Science and Technology, Osaka University, January 2019
Keichi TAKAHASHI

This work is dedicated to my parents and my wife

List of Publications by the Author

Journal
[1] Keichi Takahashi, S. Date, D. Khureltulga, Y. Kido, H. Yamanaka, E. Kawai, and S. Shimojo, "UnisonFlow: A Software-Defined Coordination Mechanism for Message-Passing Communication and Computation", IEEE Access, vol. 6, no. 1, pp. 23372–23382, 2018. DOI: 10.1109/ACCESS.2018.2829532.
[2] A. Misawa, S. Date, Keichi Takahashi, T. Yoshikawa, M. Takahashi, M. Kan, Y. Watashiba, Y. Kido, C. Lee, and S. Shimojo, "Dynamic Reconfiguration of Computer Platforms at the Hardware Device Level for High Performance Computing Infrastructure as a Service", Cloud Computing and Service Science. CLOSER 2017. Communications in Computer and Information Science, vol. 864, pp. 177–199, 2018. DOI: 10.1007/978-3-319-94959-8_10.
[3] S. Date, H. Abe, D. Khureltulga, Keichi Takahashi, Y. Kido, Y. Watashiba, P. Uchupala, K. Ichikawa, H. Yamanaka, E. Kawai, and S. Shimojo, "SDN-accelerated HPC Infrastructure for Scientific Research", International Journal of Information Technology, vol. 22, no. 1, pp. 789–796, 2016.

International Conference (with review)
[1] Y. Takigawa, Keichi Takahashi, S. Date, Y. Kido, and S. Shimojo, "A Traffic Simulator with Intra-node Parallelism for Designing High-performance Interconnects", in 2018 International Conference on High Performance Computing & Simulation (HPCS 2018), Jul.