Proceedings of CA 2004 the Fall 2004 ENGR 3410 Computer Architecture

Total Page:16

File Type:pdf, Size:1020Kb

Proceedings of CA 2004 the Fall 2004 ENGR 3410 Computer Architecture Franklin W. Olin College of Engineering Proceedings of CA 2004 the Fall 2004 ENGR 3410 Computer Architecture Class <[email protected]> Needham, Massachusetts, December 2004 Contents Preface Mark L. Chang . ii Digital Signal Processors Grant R. Hutchins, Leighton M. Ige . 2 FPGA Architecture, Algorithms, and Applications Kathleen King, Sarah Zwicker . 7 A History of Nintendo Architecture Jesus Fernandez, Steve Shannon . 13 Graphical Processing Units: An Examination of Use and Function Michael Crayton, Mike Foss, Dan Lindquist, Kevin Tostado . 17 Dual Core On-Chip Multiprocessors and the IBM Power5 Thomas C. Cecil . 24 Pong on a Xilinx Spartan3 Field Programmable Gate Array Kate Cummings, Chris Murphy, Joy Poisel . 28 The PowerPC Architecture: A Taxonomy, History, and Analysis Alex Dorsk, Katie Rivard, Nick Zola . 32 A Brief Introduction to Quantum Computation Michael W. Curtis . 38 Beauty and the Beast: AltiVec and MMX Vector Processing Units Drew Harry, Janet Tsai . 44 VLIW and Superscalar Processing Sean Munson, Clara Cho . 49 i Preface The members of the Fall 2004 ENGR 3410 Computer Architecture class are pleased to present the Proceedings of CA 2004. This is the first offering of the class, and this document represents only a portion of the work done by the students this semester. Presented here are the final papers describing their end of semester projects. The students were given a few weeks to research any Computer Architecture topic that interested them. The deliverables included a brief in-class presenta- tion as well as the documents compiled here. The goal of the project was to encourage group learning of advanced Computer Architecture topics not covered in class and give the students a chance to develop independent research skills. The projects were wonderfully varied, ranging from literature reviews to actual hardware implementations. It is important to note for the reader that several students chose to complete an optional lab as part of the final project and implemented a pipelined MIPS processor in Verilog. This work is not reflected in these proceedings, however the scope of these students’ research projects were adjusted accordingly. As the instructor for CA 2004, I am most proud of the work accomplished by the students this semester. I would like to thank them for their hard work and their creative zeal. For more information regarding Computer Architec- ture at F. W. Olin College of Engineering, please visit our class web page at http://ca.ece.olin.edu. Mark L. Chang Assistant Professor Franklin W. Olin College of Engineering ii PROCEEDINGS OF THE FALL 2004 ENGR 3410: COMPUTER ARCHITECTURE CLASS (CA2004), FRANKLIN W. OLIN COLLEGE OF ENGINEERING 2 Digital Signal Processors Grant R. Hutchins, Leighton M. Ige Abstract— Digital Signal Processors (DSPs) operate on digital components. Many audio receivers offer the capability of signals in realtime applications. They have several benefits real time equalization adjustments to audio playback. The over other processor architectures, due to specialized designs computationally intensive algorithms for transforming audio and instruction sets. This article offers an overview of general properties of DSPs and comments on their applications in the to sound as if it is being played in a large concert hall or a real world. dance hall are handled by DSPs. Additionally, many filtering and noise reduction algorithms can be implemented to clean I. INTRODUCTION up audio signals. Convolution and Fast Fourier Transforms We live in an analog world composed of continuous func- require many sequential multiplications and additions, a pro- tions: the light you see, the sound you hear, and the words you cedure for which DSPs are optimized. speak. However, for many reasons these signals are converted DSPs are also very flexible in their implementation as they into digital representations for processing and transmission. It can be reprogrammed to perform different calculations. For is often cheaper, more accurate, and computationally simpler example, a cell phone provider could decide to support a new to manipulate a digital representation of a continuous signal, data feature on its phones. The DSP could then be updated to since there are discrete time slices of data to process instead decode the new data packets being sent to the phone. of a signal with infinite resolution. The power of the digital While there are currently many DSPs implemented in revolution lies in the ability of digital systems to approximate everyday devices, they are becoming more prevalent due to the analog world. Once in a discrete representation, signals their high performance, flexibility, and continually decreasing can be digitally processed to have different effects, such as prices. The capability to perform complex manipulations on the reduction of background noise in a microphone feed or data in real time can enable many systems to incorporate more the encryption and compression of a cell phone call. inputs from the world. Digital signal processors (DSP) are specific microprocessors that specialize in the manipulation of signals in real time. II. COMMON DSP TASKS In the specific applications of devices that interface with the In digital signal processing, a few simple types of algorithms world, live inputs must be sampled, processed, decoded, and dominate. Thus, DSPs highlight their abilities to perform outputted on the fly. Thus, DSPs must be designed to be able functions to to complete a lot of computations very quickly to be effective. The rate at which these DSPs operate must be at least as fast A. Multiply-Accumulate as the rate of its input, otherwise backlogs will be created. Once the process no longer proceeds in real time you lose the Time domain filters are a common operation in digital signal synchronization between inputs and outputs, with the signal processing. Digital filters are implemented by taking streams becoming decoupled. This scenario can be likened to that of a of current and prior input and output data, scaling them with cell phone conversation in which you had to wait a long time coefficients that perform a certain filter function, and summing for your phone to decode the received audio signal from your them to generate the next output. correspondent. When there is too long of a delay many of the Some filters have hundreds or thousands of coefficients. benefits of a synchronous conversation are lost and one would Each of these multiplication operations must occur in order be no worse off sending a written letter. to generate each output sample. However, data rates in audio Today, by far, the largest implementation of DSPs is in DSP applications are often as high as 96,000 samples per cell phones. Every cell phone sold contains a DSP, which second (video can go much higher), meaning that millions of is necessary to perform all the encoding and decoding of multiplications and accumulations must happen every second. the audio signals which are transmitted through the wireless For this reason, DSPs are optimized for and benchmarked communications network. Additionally, most new phones now on their ability to perform multiply-accumulate operations. A have advanced multimedia capabilities which need to be common metric for DSP processors is MMACS, or millions processed in real time. These functions such as picture taking, of multiply-accumulates per second. music playing, and video recording, along with their non- integrated device counterparts, require DSPs for many fast B. Fast Fourier Transform (FFT) computations. DSPs can be found at the core of many of the In most situations, signal data is represented in the time products in the growing multimedia device segment. domain. That is, each separate word of signal data represents Within the audio domain, DSPs are rather prevalent and a discrete sample of the signal in time. However, many can be found in CD players as well as high end home audio functions, such as equalization and pitch-shifting, operate on PROCEEDINGS OF THE FALL 2004 ENGR 3410: COMPUTER ARCHITECTURE CLASS (CA2004), FRANKLIN W. OLIN COLLEGE OF ENGINEERING 3 Fig. 2. Super Harvard Architecture (taken from [1]) Fig. 1. Typical architecture for a DSP (taken from[1]) 2) Harvard Architecture: The Harvard architecture was developed by a group led by Howard Aiken at Harvard Univer- the frequency domain, where each word represents the amount sity during the 1940s. The Harvard architecture employs two of information present in the signal at a particular frequency. separate memories, one for instructions and one for data. The The Fast Fourier Transform is an algorithm which takes processor has independent buses, so that program instructions time domain data and converts it into corresponding frequency and data can be fetched simultaneously. domain data. In order to do so, the data must be reshuffled Most DSPs use some sort of Harvard architecture. However, and processed through multiple loops. Using a similar routine, most digital signal processing routines are data intensive the Inverse Fast Fourier Transform converts frequency domain and thus require more than one load from data memory for data back into time domain data. each load from instruction memory. The original Harvard architecture does not support III. ARCHITECTURE AND DATAPATH 3) Super Harvard Architecture (SHARC: Some products DSPs have several architectural properties that set them from Analog Devices use what they call a “Super Harvard ar- apart from other classes of processors. Each optimization has chitecture” (SHARC), such as their ADSP-2016x and ADSP- the common applications of DSPs in mind and aims to reduce 211xx families of DSPs. The SHARC processors add an the overhead associated with manipulating large digital signals instruction cache to the CPU and a dedicated I/O controller to in a repetitive fashion. the data memory. A. Memory The SHARC instruction cache takes advantage of the fact that digital signal processing routines make use of long repeti- DSPs access memory in differing ways.
Recommended publications
  • CPU ボードカタログ サポート CPU Intel :Core I7、Xeon-E5 Freescale :T4240、P4080、MPC8640D AMD :Radeon HD 6970M、HD 7970M GPGPU NVIDIA :Fermi、Kepler Architecture GPGPU
    組込みシステム向け CPU ボードカタログ サポート CPU Intel :Core i7、Xeon-E5 Freescale :T4240、P4080、MPC8640D AMD :Radeon HD 6970M、HD 7970M GPGPU NVIDIA :Fermi、Kepler Architecture GPGPU サポートバス規格 OpenVPX VME/VXS CompactPCI PMC/XMC ATCA/AMC PCI Express 403102 Ⓒ MISH International Co., Ltd. MISH International Co., Ltd. ミッシュインターナショナルでは CPU ボードをスピーディに 導入頂けますよう、次のような サービスを提供しております CPU ボードのお貸出しサービス CPU ボードの性能評価検証サービス ミッシュインターナショナルでは、ユーザが実際に製品を導入する前に性能評価を実施していただけ ミッシュインターナショナルでは、専門の CPU ボードサポート技術者がお客様のご要望に応じて CPU ますよう各種評価用 CPU ボードをお貸出ししています。お貸出し時には、リアルタイム OS を含めた ボードの性能を評価・検証させていただきます。たとえばFFT の処理速度やボード間のデータ転送スピー CPU ボードに関するトータルな技術サポートを行っております。 ドの測定などユーザがシステムインテグレーションする上で必要なデータを検証の上、レポートさせて いただきます。(お客様のご要望内容によっては別途有償の場合もあります) CPU ボードの技術サポート ミッシュインターナショナルでは、専門のCPU ボードサポート技術者が導入前はもちろん、導入後もハー ド・ソフトの両面からお客様の技術サポートをいたします。CPU ボードのドライバソフトウェアやアプ リケーションの開発方法等をトータルにバックアップいたします。また、リアルタイム OS を含んだシ CPU ボード用フレームワークソフトウェアの開発サービス ステムインテグレーッションに関するアドバイスも対応しています。 CPU ボードを含んだ組込み用システムを構 築する上では、CPU ボードのハード・ソフ トに関する技術的な知識経験はもちろんです が、CPU ボード以外の A/D、D/A、DIO ボー ド等の各種 I/O ボードとのシームレスな高速 データ通信やリアルタイム OS を使用したイ ンテグレーションが必要です。当社では複数 のボードを使ったマルチ CPU ボードシステ ムやレーダ、ソナー、移動体通信等の無線信 号のリアルタイム処理等をトータルにサポートしています。全体的なデータのパスをサポートした『フ レームワークソフトウェア』の開発もお手伝いしています。ユーザは『フレームワークソフトウェア』 の開発を当社へ外注することにより、アプリケーションソフトウェアの開発や FPGA の開発に専念する ことが出来ます。(お客様のご要望内容によっては別途有償の場合もあります) インテル製 プロセッサ搭載 CPU ボード ボード CPU スピード 拡張 USB 耐環境 型名 プロセッサ メモリ NVRAM Ethernet インテル製 プロセッサ Core i7(Ivy Bridge)、 タイプ (Max) メザニン 2.0 仕様 Xeon E5-2648L x 2 32GB DDR3- 8MB NOR 1000BASE-T x 1 Level HDS6601 6U VPX 1.8GHz - 3 Xeon(8 Core) 搭載 CPU ボード (Sandy Bridge)
    [Show full text]
  • Charactersing the Limits of the Openflow Slow-Path
    Charactersing the Limits of the OpenFlow Slow-Path Richard Sanger, [email protected] Brad Cowie, [email protected] Matthew Luckie, [email protected] Richard Nelson, [email protected] University of Waikato, New Zealand 28 November 2018 The Question How slow is the slow-path? © THE UNIVERSITY OF WAIKATO • TE WHARE WANANGA O WAIKATO 2 Contents • Introduction to the Slow-Path • Motivation • Test Suite • Test Methodology • Results • Conclusions © THE UNIVERSITY OF WAIKATO • TE WHARE WANANGA O WAIKATO 3 OpenFlow Packet-in and Packet-out To move packets between the controller and network, packets are encapsulated in OpenFlow packet-in and packet-out messages and sent via the slow-path. © THE UNIVERSITY OF WAIKATO • TE WHARE WANANGA O WAIKATO 4 The Fast-Path ASIC OpenFlow Agent Ingress Egress OpenFlow Switch Network © THE UNIVERSITY OF WAIKATO • TE WHARE WANANGA O WAIKATO 5 The Slow-Path (Packet In) ASIC OpenFlow Agent Packet in OpenFlow Switch Network Control-Plane Network OpenFlow Application NIC Controller © THE UNIVERSITY OF WAIKATO • TE WHARE WANANGA O WAIKATO 6 Motivation: Control Traffic Requirements Control traffic is sensitive to bandwidth and latency Latency • Keep-alives • Flow Establishment (Reactive control) Bandwidth • Initial route exchange (BGP etc.) • Capture (Network debugging) • DoS (Misconfiguration, ICMP, etc.) © THE UNIVERSITY OF WAIKATO • TE WHARE WANANGA O WAIKATO 7 Motivation: Control Traffic Requirements Control traffic requirements must be met simultaneously. Example: consider the requirement of link detection probing. • Typical Bidirectional Forwarding Detection (BFD) requirements • < 50ms • 2,880pps (48 port switch) © THE UNIVERSITY OF WAIKATO • TE WHARE WANANGA O WAIKATO 8 Motivation: Shared Resource The slow-path is shared with all other OpenFlow messages.
    [Show full text]
  • SO630 Manual
    SO630 ATX Industrial Motherboard User’s Manual A-604-M-2045 ver.MP_MP_4-1521_15-14 Copyright FCC and DOC Statement on Class B This publication contains information that is protected by copyright. No part of it may be repro- This equipment has been tested and found to comply with the limits for a Class B digital duced in any form or by any means or used to make any transformation/adaptation without the device, pursuant to Part 15 of the FCC rules. These limits are designed to provide reason- prior written permission from the copyright holders. able protection against harmful interference when the equipment is operated in a residential installation. This equipment generates, uses and can radiate radio frequency energy and, if not installed and used in accordance with the instruction manual, may cause harmful interference This publication is provided for informational purposes only. The manufacturer makes no to radio communications. However, there is no guarantee that interference will not occur in a representations or warranties with respect to the contents or use of this manual and specifi- particular installation. If this equipment does cause harmful interference to radio or television cally disclaims any express or implied warranties of merchantability or fitness for any particular reception, which can be determined by turning the equipment off and on, the user is encour- purpose. The user will assume the entire risk of the use or the results of the use of this docu- aged to try to correct the interference by one or more of the following measures: ment. Further, the manufacturer reserves the right to revise this publication and make changes to its contents at any time, without obligation to notify any person or entity of such revisions or • Reorient or relocate the receiving antenna.
    [Show full text]
  • Avionics Hardware Issues 2010/11/19 Chih-Hao Sun Avionics Software--Hardware Issue -History
    Avionics Hardware Issues 2010/11/19 Chih-hao Sun Avionics Software--Hardware Issue -History -HW Concepts History -FPGA vs ASIC The Gyroscope, the first auto-pilot device, was -Issues on • Avionics Computer introduced a decade after the Wright Brothers -Avionics (1910s) Computer -PowerPC • holds the plane level automatically -Examples -Energy Issue • is connected to computers for missions(B-17 and - Certification B-29 bombers) and Verification • German V-2 rocket(WWII) used the earliest automatic computer control system (automatic gyro control) • contains two free gyroscopes (a horizontal and a vertical) 2 Avionics Software--Hardware Issue -History -HW Concepts History -FPGA vs ASIC Avro Canada CF-105 Arrow fighter (1958) first used -Issues on • Avionics Computer analog computer to improve flyability -Avionics Computer is used to reduce tendency to yaw back and forth -PowerPC • -Examples F-16 (1970s) was the first operational jet fighter to use a -Energy Issue • fully-automatic analog flight control system (FLCS) - Certification and Verification • the rudder pedals and joysticks are connected to “Fly-by-wire” control system, and the system adjusts controls to maintain planes • contains three computers (for redundancy) 3 Avionics Software--Hardware Issue -History -HW Concepts History -FPGA vs ASIC NASA modified Navy F-8 with digital fly-by wire system in -Issues on • Avionics Computer 1972. -Avionics Computer • MD-11(1970s) was the first commercial aircraft to adopt -PowerPC computer-assisted flight control -Examples -Energy Issue The Airbus A320 series, late 1980s, used the first fully-digital - • Certification fly-by-wire controls in a commercial airliner and Verification • incorporates “flight envelope protection” • calculates that flight envelope (and adds a margin of safety) and uses this information to stop pilots from making aircraft outside that flight envelope.
    [Show full text]
  • Ebook - Informations About Operating Systems Version: August 15, 2006 | Download
    eBook - Informations about Operating Systems Version: August 15, 2006 | Download: www.operating-system.org AIX Internet: AIX AmigaOS Internet: AmigaOS AtheOS Internet: AtheOS BeIA Internet: BeIA BeOS Internet: BeOS BSDi Internet: BSDi CP/M Internet: CP/M Darwin Internet: Darwin EPOC Internet: EPOC FreeBSD Internet: FreeBSD HP-UX Internet: HP-UX Hurd Internet: Hurd Inferno Internet: Inferno IRIX Internet: IRIX JavaOS Internet: JavaOS LFS Internet: LFS Linspire Internet: Linspire Linux Internet: Linux MacOS Internet: MacOS Minix Internet: Minix MorphOS Internet: MorphOS MS-DOS Internet: MS-DOS MVS Internet: MVS NetBSD Internet: NetBSD NetWare Internet: NetWare Newdeal Internet: Newdeal NEXTSTEP Internet: NEXTSTEP OpenBSD Internet: OpenBSD OS/2 Internet: OS/2 Further operating systems Internet: Further operating systems PalmOS Internet: PalmOS Plan9 Internet: Plan9 QNX Internet: QNX RiscOS Internet: RiscOS Solaris Internet: Solaris SuSE Linux Internet: SuSE Linux Unicos Internet: Unicos Unix Internet: Unix Unixware Internet: Unixware Windows 2000 Internet: Windows 2000 Windows 3.11 Internet: Windows 3.11 Windows 95 Internet: Windows 95 Windows 98 Internet: Windows 98 Windows CE Internet: Windows CE Windows Family Internet: Windows Family Windows ME Internet: Windows ME Seite 1 von 138 eBook - Informations about Operating Systems Version: August 15, 2006 | Download: www.operating-system.org Windows NT 3.1 Internet: Windows NT 3.1 Windows NT 4.0 Internet: Windows NT 4.0 Windows Server 2003 Internet: Windows Server 2003 Windows Vista Internet: Windows Vista Windows XP Internet: Windows XP Apple - Company Internet: Apple - Company AT&T - Company Internet: AT&T - Company Be Inc. - Company Internet: Be Inc. - Company BSD Family Internet: BSD Family Cray Inc.
    [Show full text]
  • Parallel Rendering on Hybrid Multi-GPU Clusters
    Eurographics Symposium on Parallel Graphics and Visualization (2012) H. Childs and T. Kuhlen (Editors) Parallel Rendering on Hybrid Multi-GPU Clusters S. Eilemann†1, A. Bilgili1, M. Abdellah1, J. Hernando2, M. Makhinya3, R. Pajarola3, F. Schürmann1 1Blue Brain Project, EPFL; 2CeSViMa, UPM; 3Visualization and MultiMedia Lab, University of Zürich Abstract Achieving efficient scalable parallel rendering for interactive visualization applications on medium-sized graphics clusters remains a challenging problem. Framerates of up to 60hz require a carefully designed and fine-tuned parallel rendering implementation that fits all required operations into the 16ms time budget available for each rendered frame. Furthermore, modern commodity hardware embraces more and more a NUMA architecture, where multiple processor sockets each have their locally attached memory and where auxiliary devices such as GPUs and network interfaces are directly attached to one of the processors. Such so called fat NUMA processing and graphics nodes are increasingly used to build cost-effective hybrid shared/distributed memory visualization clusters. In this paper we present a thorough analysis of the asynchronous parallelization of the rendering stages and we derive and implement important optimizations to achieve highly interactive framerates on such hybrid multi-GPU clusters. We use both a benchmark program and a real-world scientific application used to visualize, navigate and interact with simulations of cortical neuron circuit models. Categories and Subject Descriptors (according to ACM CCS): I.3.2 [Computer Graphics]: Graphics Systems— Distributed Graphics; I.3.m [Computer Graphics]: Miscellaneous—Parallel Rendering 1. Introduction Hybrid GPU clusters are often build from so called fat render nodes using a NUMA architecture with multiple The decomposition of parallel rendering systems across mul- CPUs and GPUs per machine.
    [Show full text]
  • Average Memory Access Time: Reducing Misses
    Review: Cache performance 332 Miss-oriented Approach to Memory Access: Advanced Computer Architecture ⎛ MemAccess ⎞ CPUtime = IC × ⎜ CPI + × MissRate × MissPenalt y ⎟ × CycleTime Chapter 2 ⎝ Execution Inst ⎠ ⎛ MemMisses ⎞ CPUtime = IC × ⎜ CPI + × MissPenalt y ⎟ × CycleTime Caches and Memory Systems ⎝ Execution Inst ⎠ CPIExecution includes ALU and Memory instructions January 2007 Separating out Memory component entirely Paul H J Kelly AMAT = Average Memory Access Time CPIALUOps does not include memory instructions ⎛ AluOps MemAccess ⎞ These lecture notes are partly based on the course text, Hennessy CPUtime = IC × ⎜ × CPI + × AMAT ⎟ × CycleTime and Patterson’s Computer Architecture, a quantitative approach (3rd ⎝ Inst AluOps Inst ⎠ and 4th eds), and on the lecture slides of David Patterson and John AMAT = HitTime + MissRate × MissPenalt y Kubiatowicz’s Berkeley course = ()HitTime Inst + MissRate Inst × MissPenalt y Inst + ()HitTime Data + MissRate Data × MissPenalt y Data Advanced Computer Architecture Chapter 2.1 Advanced Computer Architecture Chapter 2.2 Average memory access time: Reducing Misses Classifying Misses: 3 Cs AMAT = HitTime + MissRate × MissPenalt y Compulsory—The first access to a block is not in the cache, so the block must be brought into the cache. Also called cold start misses or first reference misses. There are three ways to improve cache (Misses in even an Infinite Cache) Capacity—If the cache cannot contain all the blocks needed during performance: execution of a program, capacity misses will occur due to blocks being discarded and later retrieved. (Misses in Fully Associative Size X Cache) 1. Reduce the miss rate, Conflict—If block-placement strategy is set associative or direct mapped, conflict misses (in addition to compulsory & capacity misses) will 2.
    [Show full text]
  • Chapter 1. Origins of Mac OS X
    1 Chapter 1. Origins of Mac OS X "Most ideas come from previous ideas." Alan Curtis Kay The Mac OS X operating system represents a rather successful coming together of paradigms, ideologies, and technologies that have often resisted each other in the past. A good example is the cordial relationship that exists between the command-line and graphical interfaces in Mac OS X. The system is a result of the trials and tribulations of Apple and NeXT, as well as their user and developer communities. Mac OS X exemplifies how a capable system can result from the direct or indirect efforts of corporations, academic and research communities, the Open Source and Free Software movements, and, of course, individuals. Apple has been around since 1976, and many accounts of its history have been told. If the story of Apple as a company is fascinating, so is the technical history of Apple's operating systems. In this chapter,[1] we will trace the history of Mac OS X, discussing several technologies whose confluence eventually led to the modern-day Apple operating system. [1] This book's accompanying web site (www.osxbook.com) provides a more detailed technical history of all of Apple's operating systems. 1 2 2 1 1.1. Apple's Quest for the[2] Operating System [2] Whereas the word "the" is used here to designate prominence and desirability, it is an interesting coincidence that "THE" was the name of a multiprogramming system described by Edsger W. Dijkstra in a 1968 paper. It was March 1988. The Macintosh had been around for four years.
    [Show full text]
  • Parallel Texture-Based Vector Field Visualization on Curved Surfaces Using GPU Cluster Computers
    Eurographics Symposium on Parallel Graphics and Visualization (2006) Alan Heirich, Bruno Raffin, and Luis Paulo dos Santos (Editors) Parallel Texture-Based Vector Field Visualization on Curved Surfaces Using GPU Cluster Computers S. Bachthaler1, M. Strengert1, D. Weiskopf2, and T. Ertl1 1Institute of Visualization and Interactive Systems, University of Stuttgart, Germany 2School of Computing Science, Simon Fraser University, Canada Abstract We adopt a technique for texture-based visualization of flow fields on curved surfaces for parallel computation on a GPU cluster. The underlying LIC method relies on image-space calculations and allows the user to visualize a full 3D vector field on arbitrary and changing hypersurfaces. By using parallelization, both the visualization speed and the maximum data set size are scaled with the number of cluster nodes. A sort-first strategy with image-space decomposition is employed to distribute the workload for the LIC computation, while a sort-last approach with an object-space partitioning of the vector field is used to increase the total amount of available GPU memory. We specifically address issues for parallel GPU-based vector field visualization, such as reduced locality of memory accesses caused by particle tracing, dynamic load balancing for changing camera parameters, and the combination of image-space and object-space decomposition in a hybrid approach. Performance measurements document the behavior of our implementation on a GPU cluster with AMD Opteron CPUs, NVIDIA GeForce 6800 Ultra GPUs, and Infiniband network connection. Categories and Subject Descriptors (according to ACM CCS): I.3.3 [Computer Graphics]: Viewing algorithms I.3.3 [Three-Dimensional Graphics and Realism]: Color, shading, shadowing, and texture 1.
    [Show full text]
  • A Developer's Guide to the POWER Architecture
    http://www.ibm.com/developerworks/linux/library/l-powarch/ 7/26/2011 10:53 AM English Sign in (or register) Technical topics Evaluation software Community Events A developer's guide to the POWER architecture POWER programming by the book Brett Olsson , Processor architect, IBM Anthony Marsala , Software engineer, IBM Summary: POWER® processors are found in everything from supercomputers to game consoles and from servers to cell phones -- and they all share a common architecture. This introduction to the PowerPC application-level programming model will give you an overview of the instruction set, important registers, and other details necessary for developing reliable, high performing POWER applications and maintaining code compatibility among processors. Date: 30 Mar 2004 Level: Intermediate Also available in: Japanese Activity: 22383 views Comments: The POWER architecture and the application-level programming model are common across all branches of the POWER architecture family tree. For detailed information, see the product user's manuals available in the IBM® POWER Web site technical library (see Resources for a link). The POWER architecture is a Reduced Instruction Set Computer (RISC) architecture, with over two hundred defined instructions. POWER is RISC in that most instructions execute in a single cycle and typically perform a single operation (such as loading storage to a register, or storing a register to memory). The POWER architecture is broken up into three levels, or "books." By segmenting the architecture in this way, code compatibility can be maintained across implementations while leaving room for implementations to choose levels of complexity for price/performances trade-offs. The levels are: Book I.
    [Show full text]
  • Contents of DOWNLOAD.ZIP Contents of This
    F2 motherboard BIOS update „WN DOS 21/01“ Contents of DOWNLOAD.ZIP *UPDATE????IMG.EXE Win-Image executable for Windows: Creates a bootable DOS based update disk on 1,44MB/3,5" floppy. *UPDATE????DD.BIN Image file for Linux (dd): Creates a bootable DOS based update disk on 1,44MB/3,5" floppy. \DOS\*.* Contains all update files for USB stick or other media README.PDF This information ???? .............................. Actual BIOS release number * …………………………. Motherboard type Contents of this README file CONTENTS OF DOWNLOAD.ZIP............................................................................. 1 CONTENTS OF THIS README FILE ....................................................................... 1 GENERAL REMARKS............................................................................................... 1 Known problems or restrictions .......................................................................................................... 1 LIMITS AND FEATURES OF F2 DOS BIOS ............................................................. 2 Default setup values.............................................................................................................................. 3 CRISIS DISK FOR BIOS RECOVERY....................................................................... 4 DOS BASED UPDATE PROCEDURE RELEASE 4.6............................................... 4 Running DOS BIOS update from external USB boot media.............................................................. 4 Running DOS BIOS update from CD-ROM.........................................................................................
    [Show full text]
  • Evolution of the Graphical Processing Unit
    University of Nevada Reno Evolution of the Graphical Processing Unit A professional paper submitted in partial fulfillment of the requirements for the degree of Master of Science with a major in Computer Science by Thomas Scott Crow Dr. Frederick C. Harris, Jr., Advisor December 2004 Dedication To my wife Windee, thank you for all of your patience, intelligence and love. i Acknowledgements I would like to thank my advisor Dr. Harris for his patience and the help he has provided me. The field of Computer Science needs more individuals like Dr. Harris. I would like to thank Dr. Mensing for unknowingly giving me an excellent model of what a Man can be and for his confidence in my work. I am very grateful to Dr. Egbert and Dr. Mensing for agreeing to be committee members and for their valuable time. Thank you jeffs. ii Abstract In this paper we discuss some major contributions to the field of computer graphics that have led to the implementation of the modern graphical processing unit. We also compare the performance of matrix‐matrix multiplication on the GPU to the same computation on the CPU. Although the CPU performs better in this comparison, modern GPUs have a lot of potential since their rate of growth far exceeds that of the CPU. The history of the rate of growth of the GPU shows that the transistor count doubles every 6 months where that of the CPU is only every 18 months. There is currently much research going on regarding general purpose computing on GPUs and although there has been moderate success, there are several issues that keep the commodity GPU from expanding out from pure graphics computing with limited cache bandwidth being one.
    [Show full text]