GPU Multiprocessing

GPU multiprocessing
Manuel Ujaldón Martínez
Computer Architecture Department, University of Malaga (Spain)

Outline
1. Multichip solutions [10 slides]
2. Multicard solutions [2 slides]
3. Multichip + multicard [3 slides]
4. Performance on matrix decompositions [2 slides]
5. CUDA programming [5 slides]
6. Scalability on 3DFD [4 slides]

A world of possibilities
From lower to higher cost, we have:
1. Multichip: Voodoo5 (3dfx), 3D1 (Gigabyte).
2. Multicard: SLI (Nvidia) / CrossFire (ATI).
3. Combination: two chips per card and/or two cards per connector.
(Pictured examples: Evans & Sutherland (2004), Gigabyte (2005), NVIDIA (2007), ATI (2007), NVIDIA (2008).)

I. Multichip solutions

First choice: Multichip. A retrospective
- Voodoo 5 5500 (3dfx, 1999) and Rage Fury Maxx (ATI, 2000).
- Volari V8 Duo (XGI, 2002) and Radeon 9800 Duo prototype (Sapphire, 2003).

First choice: Multichip. Example 1: 3D1 (Gigabyte, 2005)
A dual GeForce 6600 GT: two GPUs on the same card (December 2005), each endowed with 128 MB of memory and a 128-bit bus.

First choice: Multichip. Example 2: GeForce 7950 GX2 (Nvidia, 2006)

First choice: Multichip. Example 3: GeForce 9800 GX2 (Nvidia, 2008)
Dual GeForce 8800 GPU, dual printed circuit board and dual 512 MB video memory, all behind a single PCI-express connector.

First choice: Multichip. 3D1 (Gigabyte): cost and performance

                              3DMark 2003             3DMark 2005
  Card                        1024x768   1600x1200    1024x768   1600x1200
  1. GeForce 6600 GT            8234       2059         3534       2503
  2. 3D1 using a single GPU     8529       2063         3572       2262
  3. GeForce 6800 GT           11493       3846         4858       3956
  4. GeForce 6600 GT SLI       14049       3924         6122       3542
  5. 3D1 using two GPUs        14482       4353         6307       3609

Cost: row 3 > row 4 > row 5 > row 2 > row 1.

First choice: Multichip. 3D1 (Gigabyte): analysis
As compared to a single GeForce 6800 GT, the 3D1 has:
- Lower cost.
- Higher arithmetic performance; better at lower resolutions and with software innovations (shaders).
- Similar bandwidth.
- Lower memory space and usability: vertices and textures must be replicated, and a GPU cannot see the memory of its twin.
As compared to two GeForce 6600 GT cards connected through SLI, it has:
- Slightly lower cost.
- Greater performance, without demanding CPU bandwidth.
- Less versatility regarding future expansion and/or single-card use.

First choice: Multichip. GeForce 7950 GX2 (2006)
- Developed by Nvidia in June 2006. The GPU has a "twin soul" (duality affects the design).
- Clocks are slower than in the single-GPU model: GPU at 500 MHz (twin) versus 650 MHz (stand-alone); memory at 2x600 MHz (twin) versus 2x800 MHz (stand-alone).
- Drivers were released almost a year later, which initially penalized the popularity of this card.
- It offers 48 pixel processors (24 on each GPU) and 1 GB of video memory (512 MB attached to each GPU through a pair of 256-bit buses).

First choice: Multichip (2006). Transistors
A smaller chip built from smaller transistors allows growth through GPU replication.

First choice: Multichip (2006). Frequency
A dual GPU allows clocks to be relaxed, producing less heat and power consumption.

First choice: Multichip (2006). Bandwidth
Two GPUs placed on parallel planes make it easier to double the bus width to 512 bits.

II. Multicard solutions

Second choice: Multicard. A couple of GPUs
- SLI (Nvidia, on GeForces).
- CrossFire (ATI, on Radeons).

Second choice: Multicard. SLI (Nvidia): elements
- The motherboard must have several PCI-express 2.0 / PCI-express x16 slots.
- The power supply must deliver at least 700 Watts.
- Performance issues: a twin card may increase performance by 60-80%, while a new generation of GPUs may increase it even more, so the time frame becomes crucial!
III. Multichip + multicard

First + second choice: Multichip + multicard
- First solution available on the marketplace: Gigabyte (2005), based on GeForce 6 GPUs.
- It allows heterogeneous graphics cards, but workload balancing gets complicated.

First + second choice: Multichip + multicard. Newer designs
Combining GeForce 9800 GX2 cards with a motherboard offering several PCI-express slots allows configurations up to quad-SLI: 2 GPUs/card x up to 4 cards = 8 GPUs (that is, 2, 4 or 8 GPUs).

IV. Performance on matrix decompositions

Multicard performance versus a newer generation (LU decomposition)
A second (twin) GPU improves performance by 1.6x, but does not reach the performance of a single card of the next generation.

CPU+GPU performance versus a single quad-core CPU (more on this later)
The benchmark is composed of three popular matrix decompositions used in linear algebra.

V. CUDA programming for multi-GPU applications

Device management
The CPU can query and select GPU devices:
- cudaGetDeviceCount(int *count)
- cudaSetDevice(int device)
- cudaGetDevice(int *current_device)
- cudaGetDeviceProperties(cudaDeviceProp *prop, int device)
- cudaChooseDevice(int *device, cudaDeviceProp *prop)
Multi-GPU setup:
- Device 0 is used by default.
- One CPU thread can control only one GPU.
- Multiple CPU threads can control the same GPU; their calls are serialized by the driver.

Multiple CPU threads and CUDA
CUDA resources allocated by a CPU thread can be consumed only by CUDA calls issued from that same CPU thread. Violation example: CPU thread 2 allocates GPU memory and stores the address in p; thread 3 issues a CUDA call that accesses memory via p.

When using several GPUs, the implementation gets complicated
GPUs do not share video memory, so the programmer must move data across PCI-express, even when the GPUs belong to the same graphics card (as in the GeForce 9800 GX2). Steps to follow (see the sketch at the end of this section):
1. Copy data from GPU A to CPU thread A.
2. Copy data from CPU thread A to CPU thread B using MPI.
3. Copy data from CPU thread B to GPU B.
We can use asynchronous copies to overlap kernel execution on the GPU with the data copies, and pinned memory to share copies among CPU threads (use cudaHostAlloc()).

Host synchronization
- All kernel launches are asynchronous: control returns to the CPU immediately, and the kernel executes after all previous CUDA calls have completed.
- cudaMemcpy is synchronous: control returns to the CPU after the copy completes, and the copy starts after all previous CUDA calls have completed.
- cudaThreadSynchronize() blocks until all previous CUDA calls complete.

CPU ↔ GPU interactions: conclusions
- CPU ↔ GPU memory bandwidth is much lower than GPU memory bandwidth. Use page-locked host memory (cudaMallocHost()) for maximum CPU ↔ GPU bandwidth: 3.2 GB/s is common on PCI-e x16, and ~4 GB/s has been measured on nForce 680i chipsets (8 GB/s for PCI-e 2.0). Be cautious, however: allocating too much page-locked memory can reduce overall system performance.
- Minimize CPU ↔ GPU data transfers by moving more code from the CPU to the GPU, even if that means running kernels with low parallelism. Intermediate data structures can be allocated, operated on, and deallocated without ever copying them to CPU memory.
- Group data transfers: one large transfer is much better than many small ones.
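The three-step transfer pattern above can be summarized in a short sketch. The following is a minimal, illustrative example and not code from the talk: it assumes one MPI process per GPU, and the buffer size N_BYTES and the rank-pairing rule are arbitrary choices made for illustration. It combines device selection, pinned host memory (cudaHostAlloc()), asynchronous copies on a stream, and an MPI exchange between the two CPU processes.

```c
/* Illustrative sketch: one MPI rank per GPU, exchanging a buffer with a peer rank. */
#include <mpi.h>
#include <cuda_runtime.h>

#define N_BYTES (4 * 1024 * 1024)   /* illustrative buffer size */

int main(int argc, char **argv)
{
    int rank, nranks, ndevices;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    /* Device management: select one GPU per CPU process
       (device 0 would be used by default if cudaSetDevice were never called). */
    cudaGetDeviceCount(&ndevices);
    cudaSetDevice(rank % ndevices);

    /* Page-locked (pinned) host memory maximizes CPU<->GPU bandwidth
       and allows asynchronous copies. */
    char *h_send, *h_recv, *d_buf;
    cudaHostAlloc((void **)&h_send, N_BYTES, cudaHostAllocDefault);
    cudaHostAlloc((void **)&h_recv, N_BYTES, cudaHostAllocDefault);
    cudaMalloc((void **)&d_buf, N_BYTES);
    /* ... in a real application, kernels launched here would fill d_buf ... */

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    /* Step 1: copy data from GPU A to its controlling CPU process
       (asynchronous, so it can overlap with kernels in other streams). */
    cudaMemcpyAsync(h_send, d_buf, N_BYTES, cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);

    /* Step 2: copy data from CPU process A to CPU process B using MPI. */
    int peer = rank ^ 1;                       /* illustrative pairing: 0-1, 2-3, ... */
    if (peer >= nranks) peer = MPI_PROC_NULL;  /* last rank idles if nranks is odd   */
    MPI_Sendrecv(h_send, N_BYTES, MPI_CHAR, peer, 0,
                 h_recv, N_BYTES, MPI_CHAR, peer, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* Step 3: copy the received data from CPU process B to GPU B. */
    cudaMemcpyAsync(d_buf, h_recv, N_BYTES, cudaMemcpyHostToDevice, stream);
    cudaStreamSynchronize(stream);

    cudaStreamDestroy(stream);
    cudaFree(d_buf);
    cudaFreeHost(h_send);
    cudaFreeHost(h_recv);
    MPI_Finalize();
    return 0;
}
```

Because the host buffers are page-locked, the asynchronous copies could overlap with kernels running in other streams; the explicit stream synchronizations mark the points where the data must be complete before MPI, or the GPU, reads it.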
VI. Scalability for 3DFD (Nvidia code)

Example: multi-GPU implementation of 3DFD
- 3DFD is a finite-difference code for the discretization of the seismic wave equation: 8th order in space, 2nd order in time, on a regular mesh.
- The X and Y dimensions are fixed and Z varies. Data is partitioned among GPUs along the Z axis, so computation grows with Z while communication (per node) stays constant.
- Each GPU has to exchange 4 xy-planes (ghost nodes) with each of its neighbors (see the sketch at the end of this section).
- Executed on a cluster with 2 GPUs per node and an Infiniband SDR network.

Performance with a couple of GPUs
Linear scaling is achieved when computation time exceeds communication time.

Three or more cluster nodes
Times are per cluster node. At least one cluster node needs two MPI communications, one with each of its neighbors.

Performance with 8 GPUs
The 8x improvement factor is sustained for Z > 1300, exactly where computation exceeds communication.
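To make the Z-axis partitioning concrete, here is a hedged sketch of the ghost-plane exchange; it is not the original Nvidia 3DFD source. The grid sizes NX and NY, the RADIUS of 4 planes (matching the 8th-order stencil), the field layout with RADIUS ghost planes on each side, and the function and buffer names are all assumptions made for illustration; the pinned buffer h_halo is presumed to have been allocated with cudaHostAlloc() by the caller.

```c
/* Illustrative sketch (not the original 3DFD code): exchange of 4 ghost
   xy-planes with the lower and upper neighbors along the Z axis. */
#include <mpi.h>
#include <cuda_runtime.h>

#define NX 480          /* fixed X dimension (illustrative) */
#define NY 480          /* fixed Y dimension (illustrative) */
#define RADIUS 4        /* 8th order in space -> 4 ghost planes per side */
#define PLANE (NX * NY)

/* d_field holds RADIUS lower ghost planes, local_nz interior planes,
   and RADIUS upper ghost planes. h_halo is a pinned buffer of
   4 * RADIUS * PLANE floats. */
void exchange_ghost_planes(float *d_field, float *h_halo, int local_nz,
                           int rank, int nranks, cudaStream_t stream)
{
    size_t halo_bytes = (size_t)RADIUS * PLANE * sizeof(float);
    int lower = (rank > 0) ? rank - 1 : MPI_PROC_NULL;
    int upper = (rank < nranks - 1) ? rank + 1 : MPI_PROC_NULL;

    float *h_send_lo = h_halo;                       /* pinned sub-buffers */
    float *h_recv_lo = h_halo + 1 * RADIUS * PLANE;
    float *h_send_hi = h_halo + 2 * RADIUS * PLANE;
    float *h_recv_hi = h_halo + 3 * RADIUS * PLANE;

    /* Download the boundary planes that the neighbors need. */
    cudaMemcpyAsync(h_send_lo, d_field + RADIUS * PLANE, halo_bytes,
                    cudaMemcpyDeviceToHost, stream);
    cudaMemcpyAsync(h_send_hi, d_field + (size_t)local_nz * PLANE, halo_bytes,
                    cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);

    /* Swap halos with each neighbor; the volume is constant per step,
       independent of local_nz. Boundary ranks simply keep whatever is
       in their outer ghost planes (MPI_PROC_NULL makes those calls no-ops). */
    MPI_Sendrecv(h_send_lo, RADIUS * PLANE, MPI_FLOAT, lower, 0,
                 h_recv_hi, RADIUS * PLANE, MPI_FLOAT, upper, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Sendrecv(h_send_hi, RADIUS * PLANE, MPI_FLOAT, upper, 1,
                 h_recv_lo, RADIUS * PLANE, MPI_FLOAT, lower, 1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* Upload the received ghost planes back to the GPU. */
    cudaMemcpyAsync(d_field, h_recv_lo, halo_bytes,
                    cudaMemcpyHostToDevice, stream);
    cudaMemcpyAsync(d_field + (size_t)(local_nz + RADIUS) * PLANE, h_recv_hi,
                    halo_bytes, cudaMemcpyHostToDevice, stream);
    cudaStreamSynchronize(stream);
}
```

The volume exchanged per step is 2 x RADIUS x NX x NY values regardless of local_nz, which is why the slides observe linear scaling once the per-GPU computation (which grows with Z) exceeds this fixed communication cost.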