Scientific Programming and Computer Architecture
Scientific and Engineering Computation
William Gropp and Ewing Lusk, editors; Janusz Kowalik, founding editor

A complete list of books published in the Scientific and Engineering Computation series appears at the back of this book.

Scientific Programming and Computer Architecture
Divakar Viswanath

The MIT Press
Cambridge, Massachusetts
London, England

© 2017 Massachusetts Institute of Technology

All rights reserved. No part of this book may be reproduced in any form by any electronic or mechanical means (including photocopying, recording, or information storage and retrieval) without permission in writing from the publisher.

This book was set in LyX by the author. Printed and bound in the United States of America.

Library of Congress Cataloging-in-Publication Data

Names: Viswanath, Divakar, author.
Title: Scientific programming and computer architecture / Divakar Viswanath.
Description: Cambridge, MA : The MIT Press, [2017] | Series: Scientific and engineering computation | Includes bibliographical references and index.
Identifiers: LCCN 2016043792 | ISBN 9780262036290 (hardcover : alk. paper)
Subjects: LCSH: Computer programming. | Computer architecture. | Software engineering. | C (Computer program language)
Classification: LCC QA76.6 .V573 2017 | DDC 005.1--dc23
LC record available at https://lccn.loc.gov/2016043792

To all my teachers, with thanks.

Table of Contents

Preface
Chapter 1: C/C++: Review
  Section 1.1: An example: The Aitken transformation
    Subsection 1.1.1: Leibniz series and the logarithmic series
    Subsection 1.1.2: Modular organization of sources
  Section 1.2: C review
    Subsection 1.2.1: Header files
    Subsection 1.2.2: Arrays and pointers
    Subsection 1.2.3: The Aitken iteration using arrays and pointers
    Subsection 1.2.4: Declarations and definitions
    Subsection 1.2.5: Function calls and the compilation process
  Section 1.3: C++ review
    Subsection 1.3.1: The Vector class
    Subsection 1.3.2: Aitken transformation in C++
  Section 1.4: A little Fortran
  Section 1.5: References
Chapter 2: C/C++: Libraries and Makefiles
  Section 2.1: Mixed-language programming
    Subsection 2.1.1: Transmutation of names from source to object files
    Subsection 2.1.2: Linking Fortran programs with C and C++
  Section 2.2: Using BLAS and LAPACK libraries
    Subsection 2.2.1: Arrays, matrices, and leading dimensions
    Subsection 2.2.2: BLAS and LAPACK
    Subsection 2.2.3: C++ class interface to BLAS/LAPACK
  Section 2.3: Building programs using GNU Make
    Subsection 2.3.1: The utils/ folder
    Subsection 2.3.2: Targets, prerequisites, and dependency graphs
    Subsection 2.3.3: Make variables in makevars.mk
    Subsection 2.3.4: Pattern rules in makevars.mk
    Subsection 2.3.5: Phony targets in makevars.mk
    Subsection 2.3.6: Recursive make and .d files
    Subsection 2.3.7: Beyond recursive make
    Subsection 2.3.8: Building your own library
  Section 2.4: The Fast Fourier Transform
    Subsection 2.4.1: The FFT algorithm in outline
    Subsection 2.4.2: FFT using MKL
    Subsection 2.4.3: FFT using FFTW
    Subsection 2.4.4: Cycles and histograms
    Subsection 2.4.5: Optimality of FFT implementations
  Section 2.5: References
Chapter 3: The Processor
  Section 3.1: Overview of the x86 architecture
    Subsection 3.1.1: 64-bit x86 architecture
    Subsection 3.1.2: 64-bit x86 assembly programming
    Subsection 3.1.3: The Time Stamp Counter
    Subsection 3.1.4: Cache parameters and the CPUID instruction
  Section 3.2: Compiler optimizations
    Subsection 3.2.1: Preliminaries
    Subsection 3.2.2: Loop unrolling
    Subsection 3.2.3: Loop fusion
    Subsection 3.2.4: Unroll and jam
    Subsection 3.2.5: Loop interchange
    Subsection 3.2.6: C++ overhead
    Subsection 3.2.7: A little compiler theory
  Section 3.3: Optimizing for the instruction pipeline
    Subsection 3.3.1: Instruction pipelines
    Subsection 3.3.2: Chipsets
    Subsection 3.3.3: Peak floating point performance
    Subsection 3.3.4: Microkernel for matrix multiplication
  Section 3.4: References
Chapter 4: Memory
  Section 4.1: DRAM and cache memory
    Subsection 4.1.1: DRAM memory
    Subsection 4.1.2: Cache memory
    Subsection 4.1.3: Physical memory and virtual memory
    Subsection 4.1.4: Latency to DRAM memory: First attempts
    Subsection 4.1.5: Latency to DRAM
  Section 4.2: Optimizing memory access
    Subsection 4.2.1: Bandwidth to DRAM
    Subsection 4.2.2: Matrix transpose
    Subsection 4.2.3: Optimized matrix multiplication
  Section 4.3: Reading from and writing to disk
    Subsection 4.3.1: C versus C++
    Subsection 4.3.2: Latency to disk
    Subsection 4.3.3: Bandwidth to disk
  Section 4.4: Page tables and virtual memory
    Subsection 4.4.1: Partitioning the virtual address space
    Subsection 4.4.2: Physical address space and page tables
  Section 4.5: References
Chapter 5: Threads and Shared Memory
  Section 5.1: Introduction to OpenMP
    Subsection 5.1.1: OpenMP syntax
    Subsection 5.1.2: Shared variables and OpenMP's memory model
    Subsection 5.1.3: Overheads of OpenMP constructs
  Section 5.2: Optimizing OpenMP programs
    Subsection 5.2.1: Near memory and far memory
    Subsection 5.2.2: Bandwidth to DRAM memory
    Subsection 5.2.3: Matrix transpose
    Subsection 5.2.4: Fast Fourier transform
  Section 5.3: Introduction to Pthreads
    Subsection 5.3.1: Pthreads
    Subsection 5.3.2: Overhead of thread creation
    Subsection 5.3.3: Parallel regions using Pthreads
  Section 5.4: Program memory
    Subsection 5.4.1: An easy system call
    Subsection 5.4.2: Stacks
    Subsection 5.4.3: Segmentation faults and memory errors
  Section 5.5: References
Chapter 6: Special Topic: Networks and Message Passing
  Section 6.1: MPI: Getting started
    Subsection 6.1.1: Initializing MPI
    Subsection 6.1.2: Unsafe communication in MPI
  Section 6.2: High-performance network architecture
    Subsection 6.2.1: Fat-tree network
    Subsection 6.2.2: Infiniband network architecture
  Section 6.3: MPI examples
    Subsection 6.3.1: Variants of MPI send and receive
    Subsection 6.3.2: Jacobi iteration
    Subsection 6.3.3: Matrix transpose
    Subsection 6.3.4: Collective communication
    Subsection 6.3.5: Parallel I/O in MPI
  Section 6.4: The Internet
    Subsection 6.4.1: IP addresses
    Subsection 6.4.2: Send and receive
    Subsection 6.4.3: Server
    Subsection 6.4.4: Client
    Subsection 6.4.5: Internet latency
    Subsection 6.4.6: Internet bandwidth
  Section 6.5: References
Chapter 7: Special Topic: The Xeon Phi Coprocessor
  Section 7.1: Xeon Phi architecture
    Subsection 7.1.1: Peak floating point bandwidth
    Subsection 7.1.2: A simple Phi program
    Subsection 7.1.3: Xeon Phi memory system
  Section 7.2: Offload
    Subsection 7.2.1: Initializing to use the MIC device
    Subsection 7.2.2: The target(mic) declaration specification
    Subsection 7.2.3: Summing the Leibniz series
    Subsection 7.2.4: Offload bandwidth
  Section 7.3: Two examples: FFT and matrix multiplication
    Subsection 7.3.1: FFT
    Subsection 7.3.2: Matrix multiplication
Chapter 8: Special Topic: Graphics Coprocessor
  Section 8.1: Graphics coprocessor architecture
    Subsection 8.1.1: Graphics processor capability
    Subsection 8.1.2: Host and device memory
    Subsection 8.1.3: Timing CUDA kernels
    Subsection 8.1.4: Warps and thread blocks
  Section 8.2: Introduction to CUDA
    Subsection 8.2.1: Summing the Leibniz series
    Subsection 8.2.2: CUDA compilation
  Section 8.3: Two examples
    Subsection 8.3.1: Bandwidth to memory
    Subsection 8.3.2: Matrix multiplication
  Section 8.4: References
Chapter 9: Machines Used, Plotting, Python, GIT, Cscope, and gcc
  Section 9.1: Machines used
  Section 9.2: Plotting in C/C++ and other preliminaries
  Section 9.3: C/C++ versus Python versus MATLAB
  Section 9.4: GIT
  Section 9.5: Cscope
  Section 9.6: Compiling with gcc/g++

The website https://github.com/divakarvi/bk-spca has all the programs discussed in this book.

Preface

It is a common experience that minor changes to C/C++ programs can make a big difference to their speed, as the sketch below illustrates. Programmers opt for C/C++ at least partly, and often mainly, because programs in these languages can be fast, yet writing fast programs in these languages is not straightforward. A well-optimized C/C++ program can be 10 or more times faster than one that is not well optimized. At the heart of this book is the following question: what makes computer programs fast or slow?

Programming languages provide a level of abstraction that makes computers look simpler than they are. As soon as we ask this question about program speed, we have to get behind the abstractions and understand how a computer really works and how programming constructs map to different parts of the computer's architecture. Although much can be understood, the modern computer is such a complicated device that this basic question cannot be answered perfectly.

Writing fast programs is the major theme of this book, but it is not the only theme. The other theme is modularity of programs.
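To make the claim about minor changes concrete, here is a minimal sketch, written for this page and not taken from the book: two functions that sum the same matrix and perform identical arithmetic, differing only in loop order. The row-wise loop walks memory contiguously and uses each cache line fully; the column-wise loop jumps N doubles between accesses and typically runs several times slower on a matrix too large for cache. Loop order and its interaction with the memory system are treated in the book (see Subsection 3.2.5 on loop interchange and Section 4.2 on optimizing memory access); the function names and the size N here are illustrative assumptions.

#include <stdio.h>
#include <stdlib.h>

#define N 4096

/* Row-wise: the inner loop touches consecutive addresses (cache-friendly). */
double sum_rowwise(const double *a) {
    double s = 0.0;
    for (long i = 0; i < N; i++)
        for (long j = 0; j < N; j++)
            s += a[i * N + j];
    return s;
}

/* Column-wise: the inner loop strides by N doubles (cache-hostile). */
double sum_colwise(const double *a) {
    double s = 0.0;
    for (long j = 0; j < N; j++)
        for (long i = 0; i < N; i++)
            s += a[i * N + j];
    return s;
}

int main(void) {
    double *a = malloc((long)N * N * sizeof(double));
    if (a == NULL) return 1;
    for (long k = 0; k < (long)N * N; k++)
        a[k] = 1.0;
    /* Both calls return the same sum; timing them (e.g., with the Time
       Stamp Counter discussed in Subsection 3.1.3) exposes the gap. */
    printf("%f %f\n", sum_rowwise(a), sum_colwise(a));
    free(a);
    return 0;
}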