Software Engineering for Cloud Computing [Fall 2019] School of Engineering and Technology, UW-Tacoma


TCSS 562: SOFTWARE ENGINEERING FOR CLOUD COMPUTING
Introduction
Wes J. Lloyd
School of Engineering and Technology, University of Washington - Tacoma
September 25, 2019

OBJECTIVES
- Syllabus and course introduction
- Cloud Computing - how did we get here?
- Introduction to parallel and distributed systems (Marinescu Ch. 2 - 1st edition, Ch. 4 - 2nd edition)

TCSS 562 - SOFTWARE ENGINEERING FOR CLOUD COMPUTING
- Syllabus online at: http://faculty.washington.edu/wlloyd/courses/tcss562/
  - Grading
  - Schedule
  - Assignments

DEMOGRAPHICS SURVEY
- Please complete the ONLINE demographics survey: https://forms.gle/dE4Q7Lt13rAXtahJ9
- Linked from the course webpage in Canvas: http://faculty.washington.edu/wlloyd/courses/tcss562/announcements.html

REFERENCES
[1] Cloud Computing: Concepts, Technology and Architecture*, Thomas Erl, Prentice Hall, 2013
[2] Cloud Computing - Theory and Practice, Dan Marinescu, First Edition 2013*, Second Edition 2018
[3] Cloud Computing: A Hands-On Approach, Arshdeep Bahga, 2013

REFERENCES - 2
[4] Systems Performance: Enterprise and the Cloud*, Brendan Gregg, First Edition 2013
[5] AWS Administration - The Definitive Guide*, Yohan Wadia, First Edition 2016
- Research papers
(* available online via the UW library)

TCSS 562 - FALL 2019
- Mondays/Wednesdays: lecture, midterm, quiz, activities
- No class: Monday Nov 11
- Class online: Wednesday Nov 27
- Key topics: IaaS, virtualization, serverless computing, FaaS, containerization
- Lab days: tutorials, group project work
- Midterm: Wednesday October 30th (tentative)
- No final exam
- Term project / paper / presentation

TCSS 562 COURSE WORK
- TCSS 562 project proposal
- Project status reports / activities / quiz
  - ~2-4 total items (??)
  - Variety of formats: in class, online, reading, activity
- Tutorials
- Midterm: open book, notes, etc.
- Class presentation
- Term project: build and evaluate alternate implementations of a native cloud serverless application; or a group-proposed cloud research project

CLASS PRESENTATION
- Each student will make one presentation in a team of ~3
- Technology sharing presentation
  - PPT slides, demonstration
  - Provide a technology overview of one cloud service offering
  - Present an overview of features, performance, etc.
- Cloud research paper presentation
  - PPT slides; identify research contributions, strengths and weaknesses of the paper, and possible areas for future work

TCSS 562 TERM PROJECT
- Project description to be posted
- Teams of ~3, self-formed, one project leader
- Proposal due: Friday October 11, 11:59pm (tentative)
- Focus: build a native cloud serverless application
  - Compose multiple FaaS functions (services)
  - Compare alternate implementations of:
    - Service compositions
    - Application flow control (AWS Step Functions, laptop client, etc.)
    - External cloud components (e.g. database, key-value store)
  - How does application design impact cost and performance? (A minimal sketch of such a composition follows below.)
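To make the term-project focus concrete, here is a minimal, hypothetical sketch (not from the course materials) of a two-function serverless composition driven by a "laptop client." The function names, AWS region, and payload fields are assumptions for illustration; the handlers use the standard AWS Lambda handler signature, and the client uses the boto3 invoke call.

# Hypothetical FaaS composition sketch: two AWS Lambda handlers plus a
# client-side "laptop client" controller that chains them with boto3.
# Function names, region, and payload fields are illustrative only.
import json
import boto3

# --- Deployed as Lambda function "tcss562-transform" (hypothetical name) ---
def transform_handler(event, context):
    """Step 1: normalize the input records."""
    records = event.get("records", [])
    cleaned = [r.strip().lower() for r in records]
    return {"records": cleaned}

# --- Deployed as Lambda function "tcss562-aggregate" (hypothetical name) ---
def aggregate_handler(event, context):
    """Step 2: count occurrences of each record."""
    counts = {}
    for r in event.get("records", []):
        counts[r] = counts.get(r, 0) + 1
    return {"counts": counts}

# --- Laptop-client flow control: invoke the two functions in sequence ---
def run_pipeline(records):
    client = boto3.client("lambda", region_name="us-east-1")  # assumed region
    step1 = client.invoke(FunctionName="tcss562-transform",
                          Payload=json.dumps({"records": records}).encode())
    intermediate = json.loads(step1["Payload"].read())
    step2 = client.invoke(FunctionName="tcss562-aggregate",
                          Payload=json.dumps(intermediate).encode())
    return json.loads(step2["Payload"].read())

if __name__ == "__main__":
    print(run_pipeline(["Cloud", "cloud ", "FaaS"]))

An alternate flow-control implementation would replace run_pipeline with an AWS Step Functions state machine, which is exactly the kind of comparison the project calls for.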
TCSS 562 TERM PROJECT - 2
- Deliverables:
  - Demo in class at end of quarter (TBD)
  - Project report paper (4-6 pgs, IEEE format, template provided)
  - GitHub (project source)
  - How-To document (via GitHub markdown)

ALTERNATE TERM PROJECT IDEAS
- GOAL: propose a cloud development project that serves as a vehicle to compare and contrast the use of alternative cloud services (a small measurement sketch follows at the end of this section)
- Examples:
  - Object/blob storage services: Amazon S3, Google blobstore, Azure blobstore, vs. self-hosted
  - Cloud relational database services: Amazon Relational Database Service (RDS), Aurora, self-hosted DB
  - Platform-as-a-Service (PaaS) hosting alternatives: Amazon Elastic Beanstalk, Heroku, others
  - Function-as-a-Service platforms: Google Cloud Functions, Azure Functions, IBM Cloud Functions

TERM PROJECT IDEAS - 2
- File-based storage systems: Amazon EBS, Amazon EFS, others
- Container orchestration services: Amazon ECS, Azure Kubernetes Service (AKS)
- Queueing services comparison: Amazon SQS, Amazon MQ, Apache Kafka, RabbitMQ, ZeroMQ (0MQ), others
- NoSQL database services comparison: DynamoDB, Google BigTable, MongoDB, Cassandra

TERM PROJECT: RESEARCH
- Alternative: conduct a cloud-related research project on any topic to answer a set of research questions
- Can be used to help spur MS capstone/thesis work
- If you're interested in this option, please talk with the instructor
- First step is to identify 1-2 research questions
- Instructor will help guide projects throughout the quarter
- Approval based on team preparedness to execute the project
PROJECT SUPPORT
- Project cloud infrastructure support:
- Sign up for the GitHub Student Developer Pack: https://education.github.com/pack
  - Includes up to $150 in Amazon cloud credits
  - Includes up to $100 in Microsoft Azure credits
  - AWS credit extensions provided as needed
- Microsoft Azure for Students: https://azure.microsoft.com/en-us/free/students/
  - $100 free credit per account, valid for 1 year
  - Also: $200 free credit option for 1 month
- Google Cloud: https://cloud.google.com/free/
  - $300 free credit for 1 year
- Chameleon / CloudLab
  - Bare-metal NSF cloud - free

TCSS 562 TERM PROJECT OPPORTUNITIES
- Projects can lead to papers or posters presented at ACM/IEEE/USENIX conferences and workshops
  - Networking and travel opportunity
  - Conference participation (posters, papers) helps differentiate your resume from others
- Projects can support preliminary work for UWT MS capstone/thesis project proposals
- Research projects provide valuable practicum experience with cloud systems analysis and prototyping
- Publications are key for building your resume/CV, and also key if applying to PhD programs

TCSS 562 TERM PROJECT - 3
- Project status reports / term project check-ins
  - Written status report
  - 2-3 times in the quarter
  - Part of the "Project Status Reports / Activities / Quizzes" category
  - 10% of grade
- Project meetings with the instructor
  - After class, in the last half of class, or during office hours

OBJECTIVES
- Cloud Computing: How did we get here?
  - Parallel and distributed systems (Marinescu Ch. 2 - 1st edition, Ch. 4 - 2nd edition)
  - Data-level, thread-level, and task-level parallelism
  - Parallel architectures
  - SIMD architectures, vector processing, multimedia extensions
  - Graphics processing units
  - Speed-up, Amdahl's Law, scaled speedup (a short sketch follows below)
  - Properties of distributed systems
  - Modularity

CLOUD COMPUTING: HOW DID WE GET HERE?
- General interest in parallel computing
- Moore's Law: # of transistors doubles every 18 months
- Post …
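As a preview of the speed-up topics listed in the objectives, here is a small illustrative sketch (not from the slides) of Amdahl's Law and scaled (Gustafson) speedup; the parallel fraction and processor counts used in the example are arbitrary.

# Illustrative speed-up formulas; example numbers are arbitrary.
def amdahl_speedup(p, n):
    """Amdahl's Law: speedup = 1 / ((1 - p) + p / n),
    where p is the parallelizable fraction and n the number of processors."""
    return 1.0 / ((1.0 - p) + p / n)

def scaled_speedup(p, n):
    """Gustafson's scaled speedup: speedup = (1 - p) + p * n,
    where p is the parallel fraction of the scaled workload."""
    return (1.0 - p) + p * n

if __name__ == "__main__":
    for n in (2, 8, 64, 1024):
        print(f"n={n:5d}  Amdahl (p=0.95): {amdahl_speedup(0.95, n):7.2f}  "
              f"Scaled (p=0.95): {scaled_speedup(0.95, n):8.2f}")

Even with 95% of the work parallelizable, Amdahl's Law caps speedup at 20x regardless of processor count, while scaled speedup keeps growing because the problem size grows with the machine.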